A Python Data Processing Script Template
Here's a skeleton general purpose template for getting a Python command line script fleshed out as quickly as possible.
Like many of you, I tend to write a lot of Python scripts which are meant to perform similar tasks, or at least follow a similar pattern of functionality. In an effort to not repeat myself (or, approached from a different angle, to always repeat myself in exactly the same way), I like to setup boilerplate or template code for these kinds of scripts in an effort to keep my programming life as lazy as possible.
I recently wrote about Managing Your Reusable Python Code as a Data Scientist, and in that same vein I have put together a generic Python data processing script template I start off most projects with these days. It has changed and been tweaked over time, but this current version is the go-to for most, non-specialized (i.e. not machine learning) scripts that I start out writing.
First, here's a little about what I generally am looking to broadly accomplish with my scripts these days:
- Parse command line arguments
- Setup required paths and filenames
- Access to Google Sheets for data retrieval and storage
- Access to SQL databases for data retrieval and storage
- Filesystem management
- HTML and other text manipulation
- Use of some of my own custom Python modules
- Retrieval of some resources on the internet, HTML or CSV files, for example
Let's go through the code step by step.
I'm particular with what and how I import. The first line imports (mostly) standard library modules; the next few lines import third party libraries, in the order 'import', 'import ... as ...', and 'from ... import ...'; a custom module named const is imported, which I include in my projects as a separate Python file holding immutable project-specific variable assignments (constants); finally, any of my own custom Python modules are imported, in this case a 'dates' modules from my own preprocessing library.
import sys, re, datetime, os, glob, argparse, json, yaml, pprint, urllib.request, wget, logging import gspread import sqlite3 import pandas as pd import numpy as np from bs4 import BeautifulSoup from oauth2client.service_account import ServiceAccountCredentials import const from my_preprocessing import dates
Along with, or in place of, error logging, I add some exit functions to be called from later code for different reasons. Here are a few examples, the first of which I always include.
def quit(): """Program was not initialized correctly""" print("example: python template.py -o something_here -b") sys.exit(1) def custom_error(message): """Some other custom error""" print("There is a problem: %s" % message) sys.exit(2)
I like argparse, see below, but I like to call a custom exit function with a concrete example of how to initialize the command line script, as opposed to only the argparse default messaging. The second function would obviously be repurposed for something specific, if needed.
Parsing Command Line Arguments
I use argparse to parse command line arguments because why not? These args are included in my template for quick editing, and so I don't have to reference documentation all the time.
# Parse and process command line args parser = argparse.ArgumentParser(description='Python data processing command line template.') parser.add_argument('required_arg', metavar='req', type=int, help='a required arg of type int') parser.add_argument('-o', '--optional', type=str, default='default_value', help='optional arg of type str') parser.add_argument('-b', '--boolean', action='store_true', help='optional boolean arg, False if not used') # Call custom function for additional console output when parse fails try: args = parser.parse_args() except SystemExit: quit() # Assign command line args to variables var_1 = args.req var_2 = args.optional var_3 = args.boolean
Note the call to the custom exit function outlined above in the try/except block, as well as the assignment of the parsed args to friendlier variable names.
Data Filenames and Paths
Time to setup project-specific data filenames and paths, since I am almost always processing data of some sort. The benefit of storing these in YAML files is that they are collected in one spot and can be easily changed without searching through code.
# Process data filenames files_yaml = '/path/to/files/config/files.yaml' with open(files_yaml) as f: files_dict = yaml.safe_load(f) input_file = files_dict['input'] temp_file = files_dict['temp'] output_file = files_dict['output'] database_file = files_dict['example.db']
Here I have grabbed input, output, temp, and database files from a project YAML file, which resides in the same directory and looks like this:
input: '/path/to/project/specific/input/file' output: '/path/to/project/specific/output/file' temp: '/path/to/project/specific/temp/file' database: '/path/to/project/specific/database/file.db'
Change in these filenames or locations? Just change them here once.
Google Sheets Setup and Config
I have taken to using Google Sheets for much (small scale) data storage, and so I need to be able to access and manipulate this data. I use the gspread library to do so, the config of which is beyond the scope of this article.
# Google Sheets API setup creds_yaml = '/path/to/credentials/config/google_api.yaml' with open(creds_yaml) as f: creds_dict = yaml.safe_load(f) scope = creds_dict['scope'] creds_file = creds_dict['creds_file'] creds = ServiceAccountCredentials.from_json_keyfile_name(creds_file, scope) client = gspread.authorize(creds) # Google Sheets workbook and sheet setup data_workbook = 'sample-workbook' data_sheet = client.open(data_workbook).sheet1
This code gets API credential data from a pair of YAML and JSON files, authenticates, and connects to a specific workbook and a specific worksheet, in order to do some stuff in them in subsequent code.
SQLite Setup and Config
I also regularly access and manipulate SQL databases using SQLite3. Here is the code to setup and configure a database connection.
# Create database connection con = sqlite3.connect(database_file) # Perform some SQL tasks cur = con.cursor() # ... # Save (commit) the changes con.commit() # Close connection con.close()
BeautfulSoup, String Manipulation, and Using "Constants"
Scraping HTML is a common task I perform, and I do so using a combination of the BeautifulSoup library, simple string manipulation, and regular expressions. Here is the code for setting up BeautifulSoup, removing some HTML tags, using some variables stored in the const module for simple string replacements, as well as some use of regular expressions. These are all example stand-in excerpts to be changed without having to consult documentation.
Using BeautifulSoup for HTML scraping """ # Make the soup html = input_file.read() soup = BeautifulSoup(html, 'lxml') """ Using variables from imported const file """ # Drop invalid tags for tag in const.INVALID_TAGS: for match in soup.find_all(tag): match.replaceWithChildren() # BeautifulSoup bytes to string soup_str = str(soup) # String replacements and footer append for pair in const.PAIRS: soup_str = soup_str.replace(str(pair), str(pair)) soup_str += const.FOOTER_INFO """ Using regular expressions substitution """ # Remove excess newlines soup_str = re.sub(r'\n\s*\n\n', '\n', soup_str) # Output resulting HTML to file, clean up output_file.write(soup_str) output_file.close() input_file.close()
The final result in this case is saved to file, and all open files are closed.
Here is an example of some custom library functionality. The functionality of this code creating some simple date features can be found here. As I use my collection of preprocessing snippets often, they have been packaged as the my_preprocessing library, which I automatically import.
# Date feature engineering my_date = dates.process_date('2021-12-31') pprint.pprint(my_date)
The Entire Template
And now that we've been over the code, here is the entire
template.py file that I copy and set out with whenever I go to create a new data processing script.
I hope someone has found this useful.
- Managing Your Reusable Python Code as a Data Scientist
- Python Data Structures Compared
- Automate Microsoft Excel and Word Using Python