5 Python Data Processing Tips & Code Snippets

This is a small collection of Python code snippets that a beginner might find useful for data processing.

By Matthew Mayo, KDnuggets Managing Editor on July 9, 2021 in Data Preprocessing, Data Processing, Pandas, Programming, Python

This article contains 5 useful Python code snippets that a beginner might find helpful for data processing.

Python is a flexible, general-purpose programming language that offers many ways to accomplish the same task. Each snippet below shows one such approach for a given situation; you might find them useful, or you may already know another approach that makes more sense to you.

1. Concatenate Multiple Text Files

Let's start with concatenating multiple text files. Should you have a number of text files in a single directory you need concatenated into a single file, this Python code will do so.

First we get a list of all the txt files in the path; then we read in each file and write out its contents to the new output file; finally, we read the new file back in and print its contents to screen to verify.

import glob

# List all txt files in path, sorted so the concatenation order is predictable
files = sorted(glob.glob('/path/to/files/*.txt'))

# Concatenate files to new file
with open('2020_output.txt', 'w') as out_file:
    for file_name in files:
        with open(file_name) as in_file:
            out_file.write(in_file.read())

# Read file and print
with open('2020_output.txt', 'r') as new_file:
    lines = [line.strip() for line in new_file]
for line in lines:
    print(line)

file 1 line 1
file 1 line 2
file 1 line 3
file 2 line 1
file 2 line 2
file 2 line 3
file 3 line 1
file 3 line 2
file 3 line 3
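
For larger files, reading each input fully into memory may not be ideal. The following is a minimal sketch of the same idea that copies each file in chunks with `shutil.copyfileobj`. The temporary directories, file names, and file contents are made up for the demonstration and are not part of the original snippet.

```python
import shutil
import tempfile
from pathlib import Path

# Hypothetical demo directory holding three small text files
work_dir = Path(tempfile.mkdtemp())
for i in range(1, 4):
    (work_dir / f'file_{i}.txt').write_text(f'file {i} line 1\nfile {i} line 2\n')

# Sort the matches so the concatenation order is predictable
files = sorted(work_dir.glob('*.txt'))

# Write the output outside the input directory so that a later run
# never picks the combined file up as one of its inputs
out_path = Path(tempfile.mkdtemp()) / 'combined.txt'
with open(out_path, 'w') as out_file:
    for file_name in files:
        with open(file_name) as in_file:
            # Copy in chunks rather than reading the whole file at once
            shutil.copyfileobj(in_file, out_file)

combined_lines = out_path.read_text().splitlines()
print(combined_lines)
```

Keeping the output file out of the globbed directory is a small design choice that avoids the combined file being concatenated into itself on a second run.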

2. Concatenate Multiple CSV Files Into a DataFrame

Staying with the theme of file concatenation, this time let's concatenate a number of comma-separated value (CSV) files into a single Pandas dataframe.

We first get a list of the CSV files in our path; then, for each file in the path, we read the contents into its own dataframe; afterwards, we combine all dataframes into a single frame; finally, we print out the results to inspect.

import pandas as pd
import glob

# Load all csv files in path
files = glob.glob('/path/to/files/*.csv')

# Create a list of dataframes, one per CSV file
fruit_list = []
for file_name in files:
    df = pd.read_csv(file_name, index_col=None, header=None)
    fruit_list.append(df)

# Create combined frame out of list of individual frames
fruit_frame = pd.concat(fruit_list, axis=0, ignore_index=True)

print(fruit_frame)

            0   1    2
0      grapes   3  5.5
1      banana   7  6.8
2       apple   2  2.3
3      orange   9  7.2
4  blackberry  12  4.3
5   starfruit  13  8.9
6  strawberry   9  8.3
7        kiwi   7  2.7
8   blueberry   2  7.6
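
The combined frame above has integer column labels and no record of which file each row came from. As a hedged variation, the sketch below names the columns and tags every row with its source file. The column names (`fruit`, `count`, `price`), the `source` column, and the demo files are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical demo files, headerless like the ones in the example above
work_dir = Path(tempfile.mkdtemp())
(work_dir / 'a.csv').write_text('grapes,3,5.5\nbanana,7,6.8\n')
(work_dir / 'b.csv').write_text('apple,2,2.3\norange,9,7.2\n')

# Sort the matches so the row order is predictable
files = sorted(work_dir.glob('*.csv'))

# names= supplies column labels for headerless files;
# assign() adds a column recording the file each row was read from
frames = [
    pd.read_csv(f, header=None, names=['fruit', 'count', 'price']).assign(source=f.name)
    for f in files
]

fruit_frame = pd.concat(frames, axis=0, ignore_index=True)
print(fruit_frame)
```

Recording the source file can make it easier to trace a suspicious row back to its origin after the frames have been merged.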

3. Zip & Unzip Files to Pandas

Let's say you are working with a Pandas dataframe, such as the resulting frame in the above snippet, and want to compress the frame directly to file for storage. This snippet will do so.

First we will create a dataframe to use with our example; then we will compress and save the dataframe directly to file; finally, we will read the compressed file back into a new frame and print it for verification.

import pandas as pd

# Create a dataframe to use
df = pd.DataFrame({'col_A': ['kiwi', 'banana', 'apple'],
                   'col_B': ['pineapple', 'grapes', 'grapefruit'],
                   'col_C': ['blueberry', 'grapefruit', 'orange']})

# Compress and save dataframe to file
df.to_csv('sample_dataframe.csv.zip', index=False, compression='zip')
print('Dataframe compressed and saved to file')

# Read compressed zip file into dataframe
df = pd.read_csv('sample_dataframe.csv.zip')
print(df)

Dataframe compressed and saved to file

    col_A       col_B       col_C
0    kiwi   pineapple   blueberry
1  banana      grapes  grapefruit
2   apple  grapefruit      orange
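
Pandas can also pick the compression format from the file extension, since the `compression` parameter of both `to_csv` and `read_csv` defaults to `'infer'`. The sketch below relies on that default to write zip and gzip versions of a small frame, then inspects the zip archive with the standard library. The temporary directory and the two-column frame are assumptions for the demonstration.

```python
import os
import tempfile
import zipfile

import pandas as pd

df = pd.DataFrame({'col_A': ['kiwi', 'banana', 'apple'],
                   'col_B': ['pineapple', 'grapes', 'grapefruit']})

# Hypothetical scratch directory for the demo files
work_dir = tempfile.mkdtemp()
zip_path = os.path.join(work_dir, 'sample_dataframe.csv.zip')
gz_path = os.path.join(work_dir, 'sample_dataframe.csv.gz')

# No compression argument: the .zip and .gz extensions select the format
df.to_csv(zip_path, index=False)
df.to_csv(gz_path, index=False)

# The zip archive written by pandas holds a single CSV member
with zipfile.ZipFile(zip_path) as archive:
    members = archive.namelist()
print(members)

# Reading back also infers the compression from the extension
from_zip = pd.read_csv(zip_path)
from_gz = pd.read_csv(gz_path)
print(from_zip)
print(from_gz)
```

Passing `compression='zip'` explicitly, as in the snippet above, is still the safer choice when the file name does not end in a recognized extension.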

4. Flatten Lists

Perhaps you have a situation where you are working with a list of lists, that is, a list in which all of its elements are also lists. This snippet will take this list of embedded lists and flatten it out to one linear list.

First we will create a list of lists to use in our example; then we will use list comprehensions to flatten the list in a Pythonic manner; finally, we print the resulting list to screen for verification.

# Create a list of lists (a list in which every element is itself a list)
list_of_lists = [['apple', 'pear', 'banana', 'grapes'],
                 ['zebra', 'donkey', 'elephant', 'cow'],
                 ['vanilla', 'chocolate'],
                 ['princess', 'prince']]

# Flatten the list of lists into a single list
flat_list = [element for sub_list in list_of_lists for element in sub_list]

# Print both to compare
print(f'List of lists:\n{list_of_lists}')
print(f'Flattened list:\n{flat_list}')

List of lists:
[['apple', 'pear', 'banana', 'grapes'], ['zebra', 'donkey', 'elephant', 'cow'], ['vanilla', 'chocolate'], ['princess', 'prince']]

Flattened list:
['apple', 'pear', 'banana', 'grapes', 'zebra', 'donkey', 'elephant', 'cow', 'vanilla', 'chocolate', 'princess', 'prince']
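
An alternative to the nested comprehension is `itertools.chain.from_iterable` from the standard library. The short sketch below, using made-up sample data, shows it alongside one caveat that applies to both approaches: only one level of nesting is removed.

```python
from itertools import chain

list_of_lists = [['apple', 'pear'], ['zebra'], [], ['vanilla', 'chocolate']]

# chain.from_iterable lazily yields the elements of each sub-list in turn
flat_list = list(chain.from_iterable(list_of_lists))
print(flat_list)  # ['apple', 'pear', 'zebra', 'vanilla', 'chocolate']

# Only one level is flattened: deeper lists are left intact
nested = [[1, [2, 3]], [4]]
one_level = list(chain.from_iterable(nested))
print(one_level)  # [1, [2, 3], 4]
```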

5. Sort List of Tuples

This snippet sorts a list of tuples by a specified element. Tuples are an often overlooked Python data structure, and are a great way to store related pieces of data without using a more complex structure type.

In this example, we will first create a list of tuples of size 2, and fill them with numeric data; next we will sort the pairs, separately by both first and second elements, printing the results of both sorting processes to inspect the results; finally, we will extend this sorting to mixed alphanumeric data elements.

# Some paired data
pairs = [(1, 10.5), (5, 7.), (2, 12.7), (3, 9.2), (7, 11.6)]

# Sort pairs by first entry
sorted_pairs = sorted(pairs, key=lambda x: x[0])
print(f'Sorted by element 0 (first element):\n{sorted_pairs}')

# Sort pairs by second entry
sorted_pairs = sorted(pairs, key=lambda x: x[1])
print(f'Sorted by element 1 (second element):\n{sorted_pairs}')

# The same approach works for non-numeric entries (and for tuples of any size)
pairs = [('banana', 3), ('apple', 11), ('pear', 1), ('watermelon', 4), ('strawberry', 2), ('kiwi', 12)]
sorted_pairs = sorted(pairs, key=lambda x: x[0])
print(f'Alphanumeric pairs sorted by element 0 (first element):\n{sorted_pairs}')

Sorted by element 0 (first element):
[(1, 10.5), (2, 12.7), (3, 9.2), (5, 7.0), (7, 11.6)]

Sorted by element 1 (second element):
[(5, 7.0), (3, 9.2), (1, 10.5), (7, 11.6), (2, 12.7)]

Alphanumeric pairs sorted by element 0 (first element):
[('apple', 11), ('banana', 3), ('kiwi', 12), ('pear', 1), ('strawberry', 2), ('watermelon', 4)]
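
The same `sorted` call extends to descending order and to sorting on more than one element. The sketch below uses `operator.itemgetter` in place of a lambda where possible; the sample data is made up and includes a repeated name so that the tie-breaking is visible.

```python
from operator import itemgetter

pairs = [('banana', 3), ('apple', 11), ('pear', 1), ('apple', 4)]

# Descending by the second element
by_count_desc = sorted(pairs, key=itemgetter(1), reverse=True)
print(by_count_desc)
# [('apple', 11), ('apple', 4), ('banana', 3), ('pear', 1)]

# By first element, with ties broken by the second element
by_name_then_count = sorted(pairs, key=itemgetter(0, 1))
print(by_name_then_count)
# [('apple', 4), ('apple', 11), ('banana', 3), ('pear', 1)]

# Mixed directions: name ascending, count descending (negate the number)
mixed = sorted(pairs, key=lambda x: (x[0], -x[1]))
print(mixed)
# [('apple', 11), ('apple', 4), ('banana', 3), ('pear', 1)]
```

Negating the value only works for numeric elements; for other types, two successive stable sorts (least significant key first) are a common alternative.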

And there you have 5 Python snippets which may be helpful to beginners for a few different data processing tasks.
