5 Python Data Processing Tips & Code Snippets
This is a small collection of Python code snippets that a beginner might find useful for data processing.
Photo by Hitesh Choudhary on Unsplash
Python is a flexible, general-purpose programming language that offers many ways to accomplish the same task. Each snippet below shows one approach to a given situation; you might find it useful, or you may already know another approach that makes more sense to you.
1. Concatenate Multiple Text Files
Let's start with concatenating multiple text files. Should you have a number of text files in a single directory you need concatenated into a single file, this Python code will do so.
First we get a list of all the txt files in the path; then we read in each file and write out its contents to the new output file; finally, we read the new file back in and print its contents to screen to verify.
import glob

# Load all txt files in path (sorted, since glob returns files in arbitrary order)
files = sorted(glob.glob('/path/to/files/*.txt'))

# Concatenate files to new file
with open('2020_output.txt', 'w') as out_file:
    for file_name in files:
        with open(file_name) as in_file:
            out_file.write(in_file.read())

# Read file and print
with open('2020_output.txt', 'r') as new_file:
    lines = [line.strip() for line in new_file]
for line in lines:
    print(line)
file 1 line 1
file 1 line 2
file 1 line 3
file 2 line 1
file 2 line 2
file 2 line 3
file 3 line 1
file 3 line 2
file 3 line 3
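One thing the snippet above does not guard against is an input file that lacks a trailing newline, in which case its last line runs into the first line of the next file. The following is a minimal sketch of one way to handle that, not a definitive implementation. The function name `concatenate_text_files` and the sample file names are hypothetical, and the demonstration writes throwaway files to a temporary directory.

```python
import tempfile
from pathlib import Path


def concatenate_text_files(source_dir, output_path):
    """Concatenate every .txt file in source_dir into output_path.

    Files are processed in sorted name order, and a newline is appended
    after any non-empty file that does not already end with one.
    """
    source_dir = Path(source_dir)
    output_path = Path(output_path)
    # Skip the output file itself in case it lives in the same directory
    files = sorted(p for p in source_dir.glob('*.txt')
                   if p.resolve() != output_path.resolve())
    with output_path.open('w') as out_file:
        for path in files:
            text = path.read_text()
            out_file.write(text)
            if text and not text.endswith('\n'):
                out_file.write('\n')
    return files


# Demonstration on throwaway files in a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    tmp_path = Path(tmp)
    # The first file deliberately has no trailing newline
    (tmp_path / 'file1.txt').write_text('file 1 line 1\nfile 1 line 2')
    (tmp_path / 'file2.txt').write_text('file 2 line 1\n')
    out = tmp_path / 'combined.out'
    concatenate_text_files(tmp_path, out)
    combined = out.read_text()

print(combined)
```

Sorting the file list makes the output order predictable, and excluding the output path avoids reading the file that is currently being written.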
2. Concatenate Multiple CSV Files Into a DataFrame
Staying with the theme of file concatenation, this time let's tackle concatenating a number of comma-separated value (CSV) files into a single Pandas dataframe.
We first get a list of the CSV files in our path; then, for each file in the path, we read the contents into its own dataframe; afterwards, we combine all dataframes into a single frame; finally, we print out the results to inspect.
import pandas as pd
import glob

# Load all csv files in path
files = glob.glob('/path/to/files/*.csv')

# Create a list of dataframes, one dataframe per CSV
fruit_list = []
for file_name in files:
    df = pd.read_csv(file_name, index_col=None, header=None)
    fruit_list.append(df)

# Create combined frame out of list of individual frames
fruit_frame = pd.concat(fruit_list, axis=0, ignore_index=True)

print(fruit_frame)
            0   1    2
0      grapes   3  5.5
1      banana   7  6.8
2       apple   2  2.3
3      orange   9  7.2
4  blackberry  12  4.3
5   starfruit  13  8.9
6  strawberry   9  8.3
7        kiwi   7  2.7
8   blueberry   2  7.6
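Once the frames are combined, it is no longer obvious which file each row came from. As a small sketch of one way to keep track, the example below adds a column recording the source of each row before concatenating. To stay self-contained it reads from in-memory strings rather than files on disk; the file names, the column names (`fruit`, `value_1`, `value_2`) and the `source_file` column are all assumptions made for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory stand-ins for headerless CSV files on disk
csv_sources = {
    'fruit_a.csv': 'grapes,3,5.5\nbanana,7,6.8\n',
    'fruit_b.csv': 'apple,2,2.3\norange,9,7.2\n',
}

# Assumed column names; the original files have no header row
column_names = ['fruit', 'value_1', 'value_2']

frames = []
for name, text in csv_sources.items():
    frame = pd.read_csv(io.StringIO(text), header=None, names=column_names)
    frame['source_file'] = name  # remember where each row came from
    frames.append(frame)

# ignore_index=True gives the combined frame a fresh 0..n-1 index
combined = pd.concat(frames, axis=0, ignore_index=True)

print(combined)
```

With real files, the same loop would pass each file name from `glob.glob` to `pd.read_csv` in place of the `io.StringIO` object.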
3. Zip & Unzip Files to Pandas
Let's say you are working with a Pandas dataframe, such as the resulting frame in the above snippet, and want to compress the frame directly to file for storage. This snippet will do so.
First we will create a dataframe to use with our example; then we will compress and save the dataframe directly to file; finally, we will read the data back into a new frame directly from the compressed file and print it out for verification.
import pandas as pd

# Create a dataframe to use
df = pd.DataFrame({'col_A': ['kiwi', 'banana', 'apple'],
                   'col_B': ['pineapple', 'grapes', 'grapefruit'],
                   'col_C': ['blueberry', 'grapefruit', 'orange']})

# Compress and save dataframe to file
df.to_csv('sample_dataframe.csv.zip', index=False, compression='zip')
print('Dataframe compressed and saved to file')

# Read compressed zip file into dataframe
df = pd.read_csv('sample_dataframe.csv.zip')

print(df)
Dataframe compressed and saved to file
    col_A       col_B       col_C
0    kiwi   pineapple   blueberry
1  banana      grapes  grapefruit
2   apple  grapefruit      orange
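The same round trip works with other compression formats. The sketch below uses gzip and relies on the default `compression='infer'` behaviour of `to_csv` and `read_csv`, which picks the format from the file extension. The file name is hypothetical, and the file is written to a temporary directory so nothing is left behind.

```python
import tempfile
from pathlib import Path

import pandas as pd

df = pd.DataFrame({'col_A': ['kiwi', 'banana', 'apple'],
                   'col_B': ['pineapple', 'grapes', 'grapefruit']})

with tempfile.TemporaryDirectory() as tmp:
    gz_path = Path(tmp) / 'sample_dataframe.csv.gz'  # hypothetical file name
    # compression defaults to 'infer', so the .gz suffix selects gzip
    df.to_csv(gz_path, index=False)
    restored = pd.read_csv(gz_path)
    # gzip files start with the magic bytes 0x1f 0x8b
    is_gzip = gz_path.read_bytes()[:2] == b'\x1f\x8b'

print(restored)
print(f'Written as gzip: {is_gzip}')
```

Passing `compression='gzip'` explicitly is equivalent here and can be clearer when the file name does not carry a recognizable extension.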
4. Flatten Lists
Perhaps you have a situation where you are working with a list of lists, that is, a list in which all of its elements are also lists. This snippet will take this list of embedded lists and flatten it out to one linear list.
First we will create a list of lists to use in our example; then we will use a list comprehension to flatten the list in a Pythonic manner; finally, we print the resulting list to screen for verification.
# Create a list of lists (a list where all of its elements are lists)
list_of_lists = [['apple', 'pear', 'banana', 'grapes'],
                 ['zebra', 'donkey', 'elephant', 'cow'],
                 ['vanilla', 'chocolate'],
                 ['princess', 'prince']]

# Flatten the list of lists into a single list
flat_list = [element for sub_list in list_of_lists for element in sub_list]

# Print both to compare
print(f'List of lists:\n{list_of_lists}')
print(f'Flattened list:\n{flat_list}')
List of lists:
[['apple', 'pear', 'banana', 'grapes'], ['zebra', 'donkey', 'elephant', 'cow'], ['vanilla', 'chocolate'], ['princess', 'prince']]
Flattened list:
['apple', 'pear', 'banana', 'grapes', 'zebra', 'donkey', 'elephant', 'cow', 'vanilla', 'chocolate', 'princess', 'prince']
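An alternative to the nested list comprehension, for readers who prefer the standard library, is `itertools.chain.from_iterable`. The short sketch below uses a smaller, made-up list of lists and shows that, like the comprehension, it flattens exactly one level of nesting.

```python
from itertools import chain

# A small list of lists, including one empty sub-list
list_of_lists = [['apple', 'pear'], ['zebra'], [], ['vanilla', 'chocolate']]

# chain.from_iterable lazily yields the elements of each sub-list in turn
flat_list = list(chain.from_iterable(list_of_lists))
print(flat_list)

# Only one level is flattened: deeper lists are kept as elements
nested = [[1, [2, 3]], [4]]
one_level = list(chain.from_iterable(nested))
print(one_level)
```

Because `chain.from_iterable` returns an iterator, it can also be looped over directly without building the full list in memory.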
5. Sort List of Tuples
This snippet shows how to sort a list of tuples by a specified element. Tuples are an often overlooked Python data structure, and are a great way to store related pieces of data without using a more complex structure type.
In this example, we will first create a list of tuples of size 2, and fill them with numeric data; next we will sort the pairs, separately by both first and second elements, printing the results of both sorting processes to inspect the results; finally, we will extend this sorting to mixed alphanumeric data elements.
# Some paired data
pairs = [(1, 10.5), (5, 7.), (2, 12.7), (3, 9.2), (7, 11.6)]

# Sort pairs by first entry
sorted_pairs = sorted(pairs, key=lambda x: x[0])
print(f'Sorted by element 0 (first element):\n{sorted_pairs}')

# Sort pairs by second entry
sorted_pairs = sorted(pairs, key=lambda x: x[1])
print(f'Sorted by element 1 (second element):\n{sorted_pairs}')

# Extend this to tuples of size n and non-numeric entries
pairs = [('banana', 3), ('apple', 11), ('pear', 1), ('watermelon', 4), ('strawberry', 2), ('kiwi', 12)]
sorted_pairs = sorted(pairs, key=lambda x: x[0])
print(f'Alphanumeric pairs sorted by element 0 (first element):\n{sorted_pairs}')
Sorted by element 0 (first element):
[(1, 10.5), (2, 12.7), (3, 9.2), (5, 7.0), (7, 11.6)]
Sorted by element 1 (second element):
[(5, 7.0), (3, 9.2), (1, 10.5), (7, 11.6), (2, 12.7)]
Alphanumeric pairs sorted by element 0 (first element):
[('apple', 11), ('banana', 3), ('kiwi', 12), ('pear', 1), ('strawberry', 2), ('watermelon', 4)]
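The lambda keys above can also be written with `operator.itemgetter`, which additionally makes it easy to sort on more than one element. The following sketch uses a made-up list with a tied value to show a descending sort and a two-key sort.

```python
from operator import itemgetter

# Sample pairs; 'banana' and 'kiwi' share the same count
pairs = [('banana', 3), ('apple', 11), ('pear', 1), ('kiwi', 3)]

# Sort by the second element, largest first
by_count_desc = sorted(pairs, key=itemgetter(1), reverse=True)
print(by_count_desc)

# Sort by the second element, breaking ties on the first element
by_count_then_name = sorted(pairs, key=itemgetter(1, 0))
print(by_count_then_name)
```

Python's sort is stable, so in the descending sort the tied pairs keep their original relative order, with `('banana', 3)` ahead of `('kiwi', 3)`.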
And there you have 5 Python snippets which may be helpful to beginners for a few different data processing tasks.