5 More Command Line Tools for Data Science

Use these tools to access APIs, manipulate CSV files, download datasets, and more from your terminal.



Image by Author

 

1. csvkit

 

csvkit is the king of tabular data. It is a collection of tools for converting CSV files, manipulating the data, and performing data analysis.

You can install csvkit using pip. 

$ pip install csvkit

 

Example 1

 

In this example, we will use csvcut to select only two columns and use csvlook to display the results in tabular format. 

csvcut -c sepal_length,species iris.csv | csvlook --max-rows 5

 


 

Note: you can limit the number of rows displayed with the --max-rows argument.

 

Example 2

 

We will convert a CSV file into a JSON file using csvjson. 

csvjson iris.csv > iris.json

 

Note: csvkit also provides in2csv, which converts Excel and JSON files to CSV.
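As a rough sketch of the JSON to CSV direction, the snippet below round-trips a tiny made-up file through csvjson and in2csv. The file names and sample rows are assumptions for illustration, and the commands only run if csvkit is installed.

```shell
# Sketch: CSV -> JSON -> CSV round trip with csvkit.
# sample.csv and its rows are made up for illustration.
cat > sample.csv <<'EOF'
sepal_length,species
5.1,Iris-setosa
7.0,Iris-versicolor
EOF

if command -v in2csv >/dev/null 2>&1; then
  csvjson sample.csv > sample.json            # CSV to JSON, as in Example 2
  in2csv -f json sample.json > roundtrip.csv  # JSON back to CSV
  csvlook roundtrip.csv                       # display the result as a table
else
  echo "csvkit is not installed; run: pip install csvkit"
fi
```

For Excel files, in2csv should pick the format from the .xlsx extension, so the -f option is usually only needed when the extension is missing or misleading.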

 

Example 3

 

We can also perform data analysis on a CSV file by using a SQL query. csvsql requires a SQL query and the path to a CSV file. You can display the results or save them as a CSV file.

csvsql --query "select * from iris where species like 'Iris-setosa'" iris.csv | csvlook --max-rows 5

 


 

2. IPython

 

IPython is an interactive Python shell that brings some of the functionality of a Jupyter notebook into your terminal. It allows you to test ideas faster without creating a Python file. 

Install IPython using pip.

$ pip install ipython

 

Note: IPython also comes with Anaconda and Jupyter Notebook, so in most cases you don’t have to install it separately. 

After installing, just type ipython in the terminal and start performing data analysis just as you would in a Jupyter notebook. It is easy and fast.

 


 

3. cURL

 

cURL stands for client URL and is a CLI tool for transferring data to and from a server using URLs. You can use it to limit the transfer rate, log errors, display progress, and test endpoints. 

In the example, we are downloading a machine learning dataset from the UCI Machine Learning Repository (University of California, Irvine) and saving it as a CSV file. 

curl -o blood.csv https://archive.ics.uci.edu/ml/machine-learning-databases/blood-transfusion/transfusion.data

 

Output: 

% Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 12843  100 12843    0     0   7772      0  0:00:01  0:00:01 --:--:--  7769

 

You can use cURL to access APIs with tokens, push files, and automate data pipelines.
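The rate limiting and error logging mentioned above could look roughly like the sketch below. The fetch function and the curl_errors.log file are hypothetical names chosen for this example, and a file:// URL stands in for a real endpoint so that the snippet runs offline.

```shell
# Sketch: rate-limited download that appends errors to a log file.
# fetch() and curl_errors.log are hypothetical names, not part of curl.
fetch() {
  # $1 = URL, $2 = output file
  curl --fail --silent --show-error --location \
       --limit-rate 200k \
       --output "$2" "$1" 2>>curl_errors.log
}

# Offline demonstration: a local file stands in for a remote dataset.
workdir=$(mktemp -d)
printf 'recency,frequency\n2,50\n' > "$workdir/sample.data"
fetch "file://$workdir/sample.data" "$workdir/copy.csv" && echo "saved copy.csv"
```

For an API that expects a token, one common pattern is to add -H "Authorization: Bearer $API_TOKEN" to the curl call, where API_TOKEN is an environment variable you set yourself. The exact header depends on the API.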

 

4. awk

 

Awk is a terminal scripting language that we can use to manipulate data and perform data analysis. It requires no compiling. We can use variables, numeric functions, string functions, and logical operators to write any type of script. 

In the example, we are displaying the first and last columns of the CSV file and showing the last 10 rows. The $1 in the script means the first column. You can change it to $3 to display the third column. The $NF represents the last column.

awk -F "," '{print $1 " | " $NF}' iris.csv | tail
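To illustrate variables and arithmetic in awk, here is a small sketch that averages the first column for each value of the last column. The iris_sample.csv file and its four rows are made up so the snippet is self-contained. With the real iris.csv, the same script should work as long as the species is the last column.

```shell
# Sketch: per-group mean with awk on a tiny made-up sample.
cat > iris_sample.csv <<'EOF'
sepal_length,sepal_width,species
5.0,3.4,Iris-setosa
5.2,3.5,Iris-setosa
6.0,2.9,Iris-versicolor
6.4,3.1,Iris-versicolor
EOF

# NR > 1 skips the header row; sum and count are arrays keyed by the last column.
awk -F "," '
  NR > 1 { sum[$NF] += $1; count[$NF]++ }
  END    { for (s in sum) printf "%s %.2f\n", s, sum[s] / count[s] }
' iris_sample.csv | sort
# Iris-setosa 5.10
# Iris-versicolor 6.20
```

The final sort is there because awk does not guarantee the order in which it walks array keys.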

 


 

5. Kaggle

 

The Kaggle API allows you to download all kinds of datasets from the Kaggle website. Furthermore, you can update your public datasets, submit files to competitions, and run and manage Jupyter notebooks. It is a super command line tool.

Install the Kaggle API using pip.

$ pip install kaggle

 

After that, go to the Kaggle website and get your credentials. You can follow this guide to set up your username and private key. 

export KAGGLE_USERNAME=kingabzpro
export KAGGLE_KEY=xxxxxxxxxxxxxx

 

Example 1

 

After setting up authentication, you can search for any dataset. In our case, we are using the Survey on Employment Trends dataset.

 

Image from Survey on Employment Trends

 

You can either run the download command with the -d argument followed by USERNAME/DATASET:

$ kaggle datasets download -d revathyta/survey-on-employment-trends

 

Or you can simply get the API command by clicking on the three dots on the dataset page and selecting the “Copy API command” option.

 

Image from Survey on Employment Trends

 

The command downloads the dataset as a zip file. You can also chain it with the unzip command, or add the --unzip flag, to extract the data. 

Downloading survey-on-employment-trends.zip to C:\Users\abida

0%|                                                                                                   | 0.00/6.22k [00:00<?, ?B/s]

100%|██████████████████████████████████████████████████████████████████████████████████████████████████| 6.22k/6.22k [00:00<?, ?B/s]

 

Example 2

 

To create and share your dataset on Kaggle, you first need to initialize a metadata file by providing the path to the dataset folder.

$ kaggle datasets init -p /work/Kaggle/World-Vaccine-Progress

 

After that, create the dataset and push the files to the Kaggle server. 

$ kaggle datasets create -p /work/Kaggle/World-Vaccine-Progress

 

You can also update your dataset by using the version command. It requires a folder path and a message, much like a git commit. 

$ kaggle datasets version -p /work/Kaggle/World-Vaccine-Progress -m "second version"

 

You can also check out my project Vaccine Update Dashboard, which uses the Kaggle API to update the dataset regularly. 

 

Conclusion

 

There are many amazing CLI tools that I use, and they have improved my productivity and helped me automate most of my work. You can even create your own CLI tool in Python using click or argparse. 

In this article, we have learned about CLI tools to download datasets, manipulate them, perform analysis, run scripts, and generate reports. 

I am a fan of the Kaggle API and csvkit. I use them regularly to automate my notebooks and analysis. If you want to learn how to use command line tools in your data science workflow, read the Data Science at the Command Line book online for free. 
 
 
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.