A Python Data Processing Script Template

Here's a skeleton general-purpose template for getting a Python command line script fleshed out as quickly as possible.




Like many of you, I tend to write a lot of Python scripts that are meant to perform similar tasks, or at least follow a similar pattern of functionality. In an effort to not repeat myself (or, approached from a different angle, to always repeat myself in exactly the same way), I like to set up boilerplate or template code for these kinds of scripts to keep my programming life as lazy as possible.

I recently wrote about Managing Your Reusable Python Code as a Data Scientist, and in that same vein I have put together a generic Python data processing script template I start off most projects with these days. It has changed and been tweaked over time, but this current version is the go-to for most non-specialized (i.e. not machine learning) scripts that I start out writing.

First, here's a little about what I generally am looking to broadly accomplish with my scripts these days:

  • Parse command line arguments
  • Set up required paths and filenames
  • Access to Google Sheets for data retrieval and storage
  • Access to SQL databases for data retrieval and storage
  • Filesystem management
  • HTML and other text manipulation
  • Use of some of my own custom Python modules
  • Retrieval of some resources on the internet, HTML or CSV files, for example

Let's go through the code step by step.

 

Libraries

 
I'm particular about what and how I import. The first line imports (mostly) standard library modules; the next few lines import third-party libraries, in the order 'import', 'import ... as ...', and 'from ... import ...'; a custom module named const is imported, which I include in my projects as a separate Python file holding immutable project-specific variable assignments (constants); finally, any of my own custom Python modules are imported, in this case a 'dates' module from my own preprocessing library.

import sys, re, datetime, os, glob, argparse, json, yaml, pprint, urllib.request, wget, logging
import gspread
import sqlite3
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from oauth2client.service_account import ServiceAccountCredentials

import const
from my_preprocessing import dates
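
For reference, a const.py for this template might look something like the sketch below. The INVALID_TAGS, PAIRS, and FOOTER_INFO names correspond to their use further down in the template; the values shown here are just placeholder assumptions.

# const.py -- immutable project-specific variable assignments (placeholder example values)

# HTML tags to strip while keeping their contents
INVALID_TAGS = ['span', 'font']

# (old, new) string replacement pairs
PAIRS = [('&nbsp;', ' '), ('<br/>', '<br>')]

# Footer HTML appended to generated output
FOOTER_INFO = '<footer>Generated by template.py</footer>'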


 

Exit Functions

 
Along with, or in place of, error logging, I add some exit functions to be called from later code for different reasons. Here are a few examples, the first of which I always include.

def quit():
	"""Program was not initialized correctly"""
	print("example: python template.py -o something_here -b")
	sys.exit(1)

def custom_error(message):
	"""Some other custom error"""
	print("There is a problem: %s" % message)
	sys.exit(2)


I like argparse (see below), but I prefer to call a custom exit function that prints a concrete example of how to invoke the command line script, rather than relying only on the argparse default messaging. The second function would obviously be repurposed for something specific, if needed.

 

Parsing Command Line Arguments

 
I use argparse to parse command line arguments because why not? These args are included in my template for quick editing, and so I don't have to reference documentation all the time.

# Parse and process command line args
parser = argparse.ArgumentParser(description='Python data processing command line template.')
parser.add_argument('required_arg', metavar='req', type=int,
	help='a required arg of type int')
parser.add_argument('-o', '--optional', type=str, default='default_value',
	help='optional arg of type str')
parser.add_argument('-b', '--boolean', action='store_true',
	help='optional boolean arg, False if not used')

# Call custom function for additional console output when parse fails
try:
	args = parser.parse_args()
except SystemExit:
	quit()

# Assign command line args to variables
var_1 = args.required_arg
var_2 = args.optional
var_3 = args.boolean


Note the call to the custom exit function outlined above in the try/except block, as well as the assignment of the parsed args to friendlier variable names. (The positional argument is accessed as args.required_arg; metavar only changes how it is displayed in the help text.)
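
As a quick sanity check, parse_args() can also be handed an explicit list of strings in place of sys.argv, which makes it easy to see what the variables above end up holding:

# Quick test: parse an explicit argument list instead of sys.argv
test_args = parser.parse_args(['7', '-o', 'custom_value', '-b'])
print(test_args.required_arg)	# 7 (type=int conversion applied)
print(test_args.optional)	# custom_value
print(test_args.boolean)	# True (store_true flag was passed)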

 

Data Filenames and Paths

 
Time to set up project-specific data filenames and paths, since I am almost always processing data of some sort. The benefit of storing these in YAML files is that they are collected in one spot and can be easily changed without searching through code.

# Process data filenames
files_yaml = '/path/to/files/config/files.yaml'
with open(files_yaml) as f:
	files_dict = yaml.safe_load(f)
input_file = files_dict['input']
temp_file = files_dict['temp']
output_file = files_dict['output']
database_file = files_dict['database']


Here I have grabbed the input, output, temp, and database file paths from a project YAML file, which resides in the same directory and looks like this:

input: '/path/to/project/specific/input/file'
output: '/path/to/project/specific/output/file'
temp: '/path/to/project/specific/temp/file'
database: '/path/to/project/specific/database/file.db'


Need to change these filenames or locations? Just change them here once.

 

Google Sheets Setup and Config

 
I have taken to using Google Sheets for a lot of my (small scale) data storage, and so I need to be able to access and manipulate that data. I use the gspread library to do so, the configuration of which is beyond the scope of this article.

# Google Sheets API setup
creds_yaml = '/path/to/credentials/config/google_api.yaml'
with open(creds_yaml) as f:
	creds_dict = yaml.safe_load(f)
scope = creds_dict['scope']
creds_file = creds_dict['creds_file']
creds = ServiceAccountCredentials.from_json_keyfile_name(creds_file, scope)
client = gspread.authorize(creds)

# Google Sheets workbook and sheet setup
data_workbook = 'sample-workbook'
data_sheet = client.open(data_workbook).sheet1


This code gets API credential data from a pair of YAML and JSON files, authenticates, and connects to a specific workbook and a specific worksheet, in order to do some stuff in them in subsequent code.
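
To give a sense of what that subsequent code might do, here is a minimal sketch of some reads and writes against the worksheet; it assumes the sheet has a header row, and the values written are placeholders.

# Pull the worksheet into a pandas DataFrame (assumes a header row)
records = data_sheet.get_all_records()
df = pd.DataFrame(records)

# Append a new row of values to the bottom of the sheet
data_sheet.append_row(['2021-12-31', 42, 'example'])

# Update a single cell (row 2, column 3)
data_sheet.update_cell(2, 3, 'updated_value')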

 

SQLite Setup and Config

 
I also regularly access and manipulate SQL databases using SQLite3. Here is the code to set up and configure a database connection.

# Create database connection
con = sqlite3.connect(database_file)

# Perform some SQL tasks
cur = con.cursor()
# ...

# Save (commit) the changes
con.commit()

# Close connection
con.close()
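
To flesh out the "perform some SQL tasks" placeholder, here is a sketch of what those tasks might look like; the results table and its columns are made up for illustration, and the query results are pulled into a pandas DataFrame.

# Sketch of possible SQL tasks (the 'results' table is a placeholder)
con = sqlite3.connect(database_file)
cur = con.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS results (id INTEGER PRIMARY KEY, value TEXT)")
cur.execute("INSERT INTO results (value) VALUES (?)", ("example",))
con.commit()

# Read query results straight into a pandas DataFrame
df = pd.read_sql_query("SELECT * FROM results", con)
con.close()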


 

BeautifulSoup, String Manipulation, and Using "Constants"

 
Scraping HTML is a common task I perform, and I do so using a combination of the BeautifulSoup library, simple string manipulation, and regular expressions. Here is the code for setting up BeautifulSoup, removing some HTML tags, using some variables stored in the const module for simple string replacements, as well as some use of regular expressions. These are all example stand-in excerpts to be changed without having to consult documentation.

""" Using BeautifulSoup for HTML scraping """
# Make the soup (input_file holds a path, so open it first)
with open(input_file) as f:
	html = f.read()
soup = BeautifulSoup(html, 'lxml')

""" Using variables from imported const file """
# Drop invalid tags, keeping their contents
for tag in const.INVALID_TAGS:
	for match in soup.find_all(tag):
		match.unwrap()

# Convert the BeautifulSoup object to a string
soup_str = str(soup)

# String replacements and footer append
for pair in const.PAIRS:
	soup_str = soup_str.replace(str(pair[0]), str(pair[1]))
soup_str += const.FOOTER_INFO

""" Using regular expressions substitution """
# Remove excess newlines
soup_str = re.sub(r'\n\s*\n\n', '\n', soup_str)

# Output resulting HTML to file
with open(output_file, 'w') as f:
	f.write(soup_str)


The final result in this case is saved to the output file; the with statements take care of opening and closing the files.

 

Custom Library

 
Here is an example of some custom library functionality. The code behind this simple date feature engineering can be found here. As I use my collection of preprocessing snippets often, they have been packaged as the my_preprocessing library, which I automatically import.

# Date feature engineering
my_date = dates.process_date('2021-12-31')
pprint.pprint(my_date)
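
For illustration only, a process_date() helper along these lines might return a dictionary of simple date features; this is a hypothetical sketch, not the actual my_preprocessing code, which is linked above.

# Hypothetical sketch of a process_date() helper (not the actual my_preprocessing code)
def process_date(date_str):
	d = datetime.datetime.strptime(date_str, '%Y-%m-%d')
	return {
		'date': date_str,
		'year': d.year,
		'month': d.month,
		'day': d.day,
		'day_of_week': d.strftime('%A'),
		'is_weekend': d.weekday() >= 5,
	}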


 

The Entire Template

 
And now that we've been over the code, here is the entire template.py file that I copy and start from whenever I go to create a new data processing script.

I hope someone has found this useful.

 