SQL With CSVs

Write SQL query to analyze CSV files using the simple command line tool.



SQL With CSVs
Image by Author

 

You can run SQL query on a CSV file:

  1. By importing a CSV file on SQL Online IDE.
  2. By importing a CSV file directly to Database.
  3. By using Python, R, or Julia Packages to run SQL directly on a CSV file.
  4. By using DuckDB, which is best for running fast analytical queries on large CSV files.
  5. By using CLI tools.

In this post, we will be reviewing the most popular CLI tool for processing the CSV files and learning how to run SQL queries on CSV files hassle free.  

 

csvkit

 

csvkit consists of multiple command line conversion and processing tools. It is the king of tabular file formats. 

  • You can convert Excel to CSV, JSON to CSV, and CSV to JSON files.
  • Analyze the data by printing column names, viewing a subset of columns, reordering the columns, finding the rows with matching cells, and generating the summary statistics.
  • Run SQL Query on CSV file.
  • Import the data to the PostgreSQL database and extract the data from PostgreSQL to a CSV file. 

We will be focusing on running SQL queries and displaying and saving the result.

 

Installation 

 

You can install the csvkit using PIP.

pip install csvkit

 

After installation, read the documentation in Shell by typing the command below or reading the documentation online. 

csvsql -h

 

Syntax

 

To run the SQL, you need to use the --query argument and then write a query in the quotation mark. For longer SQL queries, you can use line breaks, as it works fine with PowerShell/ Bash. 

csvsql --query "WRITE SQL QUERY HERE" data.csv

 

Note: the file name at the end must be the same as the name of the table in the query. 

 

SQL with CSVs using csvkit

 

In this section, we will use a modified version of the Top 50 Fast-Food Chains in USA dataset from Kaggle to run the SQL query on a CSV file and print and save the results. 

 

Simple Query

 

First, we will run a simple query to test if the command works. 

csvsql --query "SELECT * FROM usa_ff LIMIT 2" usa_ff.csv

 

Output:

The results show four columns and the first two records of fast food chains in the USA. 

Chains,Sales,Franchised Stores,Company Stores
Arby's,4462.0,2293.0,1116.0
Baskin-Robbins,686.0,2317.0,0.0

 

Complex Query

 

Let’s try to run a complex SQL query to filter out the top 5 fast food chains with sales greater than or equal to 4 billion dollars.  

csvsql --query "
SELECT Chains AS 'Food Chains', Sales
FROM usa_ff
WHERE Sales >= 4000
ORDER BY Sales DESC
LIMIT 5" usa_ff.csv

 

Output:

Food Chains,Sales
McDonald's,45960.0
Starbucks,24300.0
Chick-fil-A,16700.0
Taco Bell,12600.0
Wendy's,11111.0

 

Results with csvlook

 

We will pipe “|” our result to the csvlook command and convert the result into tabular form. 

csvsql --query "
SELECT Chains AS 'Food Chains', Sales
FROM usa_ff
WHERE Sales >= 4000
ORDER BY Sales DESC
LIMIT 5" usa_ff.csv | csvlook 

 

Output:

It has improved the SQL query output. 

| Food Chains |  Sales |
| ----------- | ------ |
| McDonald's  | 45,960 |
| Starbucks   | 24,300 |
| Chick-fil-A | 16,700 |
| Taco Bell   | 12,600 |
| Wendy's     | 11,111 |

 

Saving the results

 

We will use redirection “>” to save SQL query results into CSV files. You can provide a filename or filename with a full address. 

csvsql --query "
SELECT Chains AS 'Food Chains', Sales
FROM usa_ff
WHERE Sales >= 4000
ORDER BY Sales DESC
LIMIT 5" usa_ff.csv > filtered_usa_ff.csv

 

Result:

As we can see, the file is successfully saved in the current directory. 

 

SQL With CSVs
Image by Author | Filtered CSV file using SQL

 

Conclusion

 

Having a handy command line tool helps you in automating the tasks and data pipelines. You can even use free online SQL tools and start working on the data analysis project. 

I will recommend you use Deepnote for running SQL queries on CSV files in seconds. It uses DuckDB in the background. 




 

Do let me know if you have questions about using SQL in Jupyter notebook, Deepnote, DuckDB, and csvkit. I will try my best to assist you.

 
 
Abid Ali Awan (@1abidaliawan) is a certified data scientist professional who loves building machine learning models. Currently, he is focusing on content creation and writing technical blogs on machine learning and data science technologies. Abid holds a Master's degree in Technology Management and a bachelor's degree in Telecommunication Engineering. His vision is to build an AI product using a graph neural network for students struggling with mental illness.