Five Command Line Tools for Data Science
You can do more data science than you think from the terminal.
By Rebecca Vickery, Data Scientist
One of the most frustrating aspects of data science can be the constant switching between different tools whilst working. You might be editing some code in a Jupyter Notebook, installing a new tool on the command line and editing a function in an IDE, all whilst working on the same task. Sometimes it is nice to find ways of doing more things in the same piece of software.
In the following post, I am going to list some of the best tools I have found for doing data science on the command line. It turns out that many more tasks can be completed via simple terminal commands than I first thought, and I wanted to share some of those here.
cURL is a useful tool for obtaining data from any server via a variety of protocols including HTTP.
I’ll give a couple of example use cases for obtaining publicly available data sets. The UCI Machine Learning Repository is an excellent resource for obtaining datasets for machine learning projects. I am going to use a simple curl command to download a data set taken from the blood transfusion centre in Hsin-Chu City, Taiwan. The simplest form of the command is curl [url], which in our example is
curl https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data
Run like this, curl prints the data to the terminal.
Adding some additional arguments will download and save the data using a specified filename. The file will now be available in your current working directory.
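As a minimal sketch of that download step, the curl -o option writes the response to a named file instead of the terminal. The helper name fetch_to_file is hypothetical and only used here for illustration. The filename data_dl.csv is an assumption inferred from the data_dl_out.csv file used later in the post.

```shell
# fetch_to_file URL FILENAME
# Hypothetical helper: save the body of URL under FILENAME.
#   -o  write the response to the named file rather than the terminal
#   -f  exit with an error on an HTTP failure instead of saving the error page
#   -sS hide the progress meter but still print error messages
fetch_to_file() {
    curl -fsS -o "$2" "$1"
}

# Example call (needs network access), using the URL quoted above:
# fetch_to_file "https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data" data_dl.csv
```

The call is left commented out so that defining the helper has no side effects. Without the helper, the same download is simply curl -o followed by the filename and the URL.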
Another common method of obtaining data for data science projects is via an API. cURL supports both GET and POST requests for interacting with an API. A single curl command can obtain a single record from the OpenWeatherMap API and save it as a JSON file named weather.json. For a more comprehensive tutorial on cURL see this excellent article by Zaiste.
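A rough sketch of such an API request follows. It assumes the OpenWeatherMap current-weather endpoint and a lookup by city name; the original query parameters are not shown in this post, so the city argument, the helper name get_weather and the YOUR_API_KEY placeholder are all illustrative. You need your own API key for the request to succeed.

```shell
# get_weather CITY API_KEY OUTFILE
# Hypothetical helper: send a GET request and save the JSON response.
#   -G                 send the --data-urlencode values as a URL query string
#   --data-urlencode   URL-encode each parameter (handles spaces in city names)
#   -o                 write the response to OUTFILE
get_weather() {
    curl -fsS -G "https://api.openweathermap.org/data/2.5/weather" \
        --data-urlencode "q=$1" \
        --data-urlencode "appid=$2" \
        -o "$3"
}

# Example call (needs network access and a valid key):
# get_weather "London" "YOUR_API_KEY" weather.json
```

A POST request follows the same pattern without -G, in which case curl sends the parameters in the request body instead.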
csvkit is a set of command line tools for working with CSV files. The tasks that it can execute can be divided into three areas: input, processing and output. Let’s look at a quick real-world example of how you can use this.
Firstly, let’s install the tool by running pip install csvkit.
For the purposes of this example, I am going to be using the same CSV file I created from the UCI Machine Learning Repository via a curl command above.
First, let’s use csvclean to make sure that our CSV file is in the correct format. This function will automatically fix common CSV errors and remove any bad rows. A useful aspect of this function is that it automatically outputs a new cleaned version of the CSV file so that the raw data is preserved. The new file always has the naming convention [filename]_out.csv. If you would prefer for the original file to be overwritten you can add the optional
In the example file I have, there are no errors but this can be a really useful way to reduce errors further down the line when working with CSV files.
Now let’s say we want to quickly inspect the file. We can use csvcut and csvgrep to do this.
Firstly let’s print out the column names.
Let’s now determine how many classes there are in the target column, "whether he/she donated blood in March 2007".
The csvgrep function allows you to filter CSV files based on regular expression matching.
Let’s use this function to extract only the rows that match class 1.
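The three inspection steps above can be sketched as follows. To keep the example self-contained it builds a small stand-in file, sample_donations.csv, with invented rows; only the target column name is taken from the text. With the real data you would pass data_dl_out.csv instead.

```shell
# Build a tiny stand-in file; header and rows are invented for illustration.
cat > sample_donations.csv <<'EOF'
Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
2,50,12500,98,1
0,13,3250,28,1
1,16,4000,35,1
4,4,1000,4,0
2,7,1750,14,0
EOF

TARGET="whether he/she donated blood in March 2007"

# 1. Print the column names, one per line, with their positions.
csvcut -n sample_donations.csv

# 2. List the distinct classes in the target column:
#    keep only that column, drop the header row, then de-duplicate.
csvcut -c "$TARGET" sample_donations.csv | sed 1d | sort | uniq

# 3. Keep only the rows whose target value is exactly 1.
#    -r takes a regular expression; the anchors stop "1" matching "10" or "21".
csvgrep -c "$TARGET" -r '^1$' sample_donations.csv
```

csvgrep also accepts -m for a plain substring match, which is enough here because the column only holds 0 and 1.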
You can also use csvkit to perform simple data analysis using the csvstat function. Running csvstat data_dl_out.csv prints descriptive statistics for the entire file to the command line. You can also request a single statistic by adding an optional argument.
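As a small illustration of both forms, the sketch below runs csvstat on an invented stand-in file (sample_donations.csv, with hypothetical short column names) rather than the real data. The --mean option and the -c column selector are standard csvstat options.

```shell
# Stand-in data; column names and rows are invented for illustration.
cat > sample_donations.csv <<'EOF'
recency,frequency,monetary,time,donated
2,50,12500,98,1
0,13,3250,28,1
1,16,4000,35,1
4,4,1000,4,0
2,7,1750,14,0
EOF

# Full descriptive statistics for every column.
csvstat sample_donations.csv

# A single statistic for a single column: the mean of column 2 (frequency).
csvstat -c 2 --mean sample_donations.csv
```

Other single-statistic options follow the same pattern, for example --min, --max and --median.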
IPython gives access to enhanced interactive Python from the shell. In essence, it means you can do most of the things that you can do in a Jupyter Notebook from the command line.
You can follow these steps to install it if you do not already have it available in your terminal.
To start IPython, simply type ipython at the command line. You are now in the interactive shell, where you can import any Python libraries you have installed. I find this tool most useful for doing some quick data analysis on the command line.
Let’s perform some basic tasks on the data set we have already been using. First I will import pandas, read in the file and inspect the first few rows of data.
The file column names are quite long so next, I am going to use pandas to rename them, and then export the resulting dataframe to a new CSV file for later use.
As a final exercise let’s inspect the correlation between the features and the target variable using the pandas corr() method.
To exit IPython simply type exit.
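The pandas steps described above might look like the sketch below. It is written as a shell script that feeds the Python lines to python3 so that it can be run in one go; in IPython you would type the lines between the EOF markers at the prompt. The sample file, its rows and the short column names (recency, frequency, monetary, time, donated) are assumptions for illustration, not the names used in the original session.

```shell
# Stand-in data with long column names; header and rows are invented.
cat > sample_donations.csv <<'EOF'
Recency (months),Frequency (times),Monetary (c.c. blood),Time (months),whether he/she donated blood in March 2007
2,50,12500,98,1
0,13,3250,28,1
1,16,4000,35,1
4,4,1000,4,0
2,7,1750,14,0
EOF

python3 - <<'EOF'
import pandas as pd

# Read the file and inspect the first few rows.
df = pd.read_csv("sample_donations.csv")
print(df.head())

# Replace the long column names with shorter (hypothetical) ones
# and export the result for later use.
df.columns = ["recency", "frequency", "monetary", "time", "donated"]
df.to_csv("sample_donations_renamed.csv", index=False)

# Correlation of every column with the target variable.
print(df.corr()["donated"])
EOF
```

Selecting the donated column from the correlation matrix keeps the output to one value per feature rather than the full matrix.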
At times you may also want to obtain a data set via a SQL query on a database. The csvsql tool, which is also part of csvkit, supports querying, writing and creating tables directly on a database. It also supports SQL statements for querying a CSV file. Let’s run an example query on the cleaned dataset.
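A minimal sketch of querying a CSV file with csvsql is shown below. The query itself is an assumption, since the original one is not shown here, and the file sample_clean.csv with its short column names is invented for illustration. The --query and --tables options are standard csvsql options; --tables sets the name of the temporary table the file is loaded into.

```shell
# Stand-in data; column names and rows are invented for illustration.
cat > sample_clean.csv <<'EOF'
recency,frequency,monetary,time,donated
2,50,12500,98,1
0,13,3250,28,1
1,16,4000,35,1
4,4,1000,4,0
2,7,1750,14,0
EOF

# Load the CSV into an in-memory table called "donations" and count
# the rows where the donor has given blood more than ten times.
csvsql --tables donations \
    --query "SELECT COUNT(*) AS frequent_donors FROM donations WHERE frequency > 10" \
    sample_clean.csv
```

If --tables is omitted, csvsql names the table after the file, without its extension.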
Yes, you can perform machine learning at the command line! There are a few tools for this but SciKit-Learn Laboratory is probably one of the most accessible. Let’s build a model using our blood donations data set.
SciKit-Learn Laboratory relies on the correct files being placed in consistently named directories. So to begin with, we will make a directory named train and copy, move and rename the data file to
Next, we need to create a config file named predict-donations.cfg and place it in our
Then we simply run this command:
run_experiment -l predict-donations.cfg
This automatically runs the experiment and creates an output folder containing the results.
We can run a SQL query to summarise the results in the
There are many other command line tools that can be useful for data science, but here I wanted to highlight those that I have found useful in my own work. For a really comprehensive view of data science at the command line, I found the book Data Science at the Command Line, which is freely available online, to be extremely useful.
Bio: Rebecca Vickery is learning data science through self study. Data Scientist @ Holiday Extras. Co-Founder of alGo.
Original. Reposted with permission.