10 Essential Bash Shell Commands for Data Science

In this tutorial, we’ll cover 10 essential Bash shell commands every data scientist should know — commands that save time, simplify tasks, and keep you focused on insights rather than busywork.




 

When I first started getting into data science, one of the best tools I stumbled upon was the Bash shell, even though I had a software engineering background. It felt intimidating at first—lines of commands blinking on the terminal, no GUI to click—but once I got the hang of it, I realized how much faster and more efficient my workflows became.

For anyone in data science, knowing a handful of essential Bash commands can save hours of time, whether you’re wrangling datasets, automating repetitive tasks, or organizing projects. In this tutorial, I’ll share 10 must-know Bash commands for data science. These commands are practical, beginner-friendly, and will make your life easier.

So, grab a cup of coffee, open your terminal, and let’s dive in.

 

Why Should Data Scientists Learn Bash Scripting?

 
Let’s get the obvious question out of the way: why bother with Bash when you have Python, R, or fancy notebooks? Here are the reasons:

  1. Speed: Bash is ridiculously fast for file manipulation and scripting
  2. Efficiency: Automating tasks like cleaning up temporary files or combining multiple datasets is a breeze
  3. Versatility: It works on virtually any system—Windows (via WSL), macOS, or Linux

In short, Bash is like that reliable old tool in your kit—nothing flashy, but it gets the job done.

 

1. ls – List Files

This might seem basic, but ls is more powerful than just showing what’s in a directory.
Examples for Data Scientists:

  1. Check the size of dataset files with ls -lh
  2. Quickly filter files by type: ls *.csv shows only CSV files
  3. Add a little color to your terminal with ls --color

Pro Tip: Use ls -lhS to sort files by size, which is handy when dealing with massive datasets.
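
For example, to surface your five largest CSV files at a glance (assuming your datasets use the .csv extension):

# list the five largest CSV files, human-readable sizes, biggest first
ls -lhS *.csv | head -n 5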

 

2. cat – Peek Inside Your Data

Want a quick glance at your dataset without opening a heavy editor? Use cat.

cat dataset.csv | head -n 10  

 

This displays the first 10 rows of your file (plain head -n 10 dataset.csv works too, no pipe needed). If you only need the column names, use head -n 1 instead.
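
To make that header easier to scan, you can split it into one column name per line. A small sketch, assuming a comma-delimited file:

# print the header row, one column name per line
head -n 1 dataset.csv | tr ',' '\n'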

Why it’s important: Before loading data into pandas or another library, you can spot issues like missing headers or unexpected encoding.

 

3. grep – Search For Information

Finding specific information in massive logs or datasets can be a pain. Enter grep.
Example use case:

grep "error" data_processing.log

 

This prints every line containing the word "error" in your log file. Add -i to make the search case-insensitive.

Pro tip: Searching for a value in a CSV? Try:

grep "California" sales_data.csv  

 

 

4. awk – Lightweight Data Manipulation

awk is great for extracting columns, filtering rows, and performing basic calculations.
Let’s say you have a CSV and need the second column:

awk -F, '{print $2}' dataset.csv  

 

This prints only the second column. If you’re dealing with space-delimited data, skip the -F, flag (awk splits on whitespace by default).

For numeric summaries:

awk '{sum += $1} END {print sum}' numbers.txt

 

Use this to quickly sum up values in a file.
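
awk also filters rows on a condition. A minimal sketch, assuming the third column of your comma-delimited file is numeric:

# print only the rows where the third column exceeds 100
awk -F, '$3 > 100' dataset.csv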

 

5. head and tail – Inspect the Ends

You have likely heard of these, but they are lifesavers for data inspection.

  • head -n 5 dataset.csv gives you the first 5 rows
  • tail -n 5 dataset.csv shows the last 5 rows

Bonus Tip: Add -f to tail to watch a log file update in real-time—great for monitoring long-running processes.
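
The two also combine nicely when a file has a header row you want to skip:

# skip the header, then show the next 5 data rows
tail -n +2 dataset.csv | head -n 5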

 

6. sort – Organize Your Data

Sorting data is not just for Excel. Use sort to rearrange files or columns in seconds.

Example: Sort a CSV by its first column (the ,1 restricts the sort key to that column alone; a bare -k1 would compare from the first field to the end of the line):

sort -t, -k1,1 dataset.csv

 

Pro Tip: Combine sort with uniq to remove duplicate entries:

sort dataset.csv | uniq
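
One caveat: sort compares text by default, so 10 comes before 2. Add n to sort numerically; here is a sketch assuming the second column holds numbers:

# sort by the second column, numerically
sort -t, -k2,2n dataset.csv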

 

 

7. wc – Count Rows, Words, or Characters

Ever wanted to know how many rows a dataset has without opening it? wc has your back.

wc -l dataset.csv

 

This counts the lines, which is usually the number of rows in a file (subtract one if there’s a header row). Combine it with grep for more precise stats, like counting specific words.
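
For example, pairing wc with the header trick from the cat section tells you how many columns a CSV has:

# count the columns in a comma-delimited header row
head -n 1 dataset.csv | tr ',' '\n' | wc -l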

 

8. find – Locate Anything, Anywhere

Organizing projects can leave you with scattered files. find locates anything, anywhere on your filesystem, like a detective for your data.

Example:

find . -name "*.csv"

 

This searches for all CSV files starting from your current directory.
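
find can filter on more than names. Two variations I find handy (the thresholds here are just examples; the M size suffix works on GNU and macOS find):

# CSV files larger than 100 MB
find . -name "*.csv" -size +100M

# CSV files modified in the last 7 days
find . -name "*.csv" -mtime -7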

 

9. sed – Edit Data on the Fly

Need to quickly clean up a dataset? sed is perfect for find-and-replace operations.

Replace all commas with tabs:

sed 's/,/\t/g' dataset.csv > cleaned_dataset.csv

 

Pro Tip: Use -i to edit files in place.
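
For example, a common cleanup is converting semicolon-delimited exports to proper commas (note that macOS/BSD sed wants an empty suffix: sed -i ''):

# convert semicolon delimiters to commas, editing the file in place
sed -i 's/;/,/g' dataset.csv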

 

10. xargs – Combine Multiple Commands

When you need to feed one command’s output to another command as arguments, xargs comes to the rescue.

Example: Deleting all .tmp files:

find . -name "*.tmp" | xargs rm
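
One caveat: filenames with spaces will break that pipeline. The null-delimited variant handles them safely:

# -print0 and -0 pair up so filenames with spaces survive the pipe
find . -name "*.tmp" -print0 | xargs -0 rm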

 

 

How to Practice These Commands

 
If you’re new to Bash, start small:

  1. Use ls and cat to explore your project directories
  2. Try filtering log files with grep
  3. Slowly build up to awk and sed for data manipulation

I recommend setting aside at least 30 minutes to 1 hour a day to practice. Create a sample dataset and try different commands on it.
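
If you need something to practice on, a tiny CSV is enough to exercise every command in this list (the file name and contents below are just an example):

# create a small sample dataset to experiment with
printf 'id,name,score\n1,Ada,95\n2,Grace,88\n3,Edsger,91\n' > sample.csv

# then try the commands on it
wc -l sample.csv
awk -F, '{sum += $3} END {print sum}' sample.csv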

 

Real-Life Application: Automating a Workflow

 
Here’s how I once used Bash to process a massive dataset:

  1. I used ls to identify the largest files
  2. head helped me inspect their structure
  3. A combination of grep and awk filtered and cleaned the data
  4. Finally, I used sed to format the data before loading it into Python

The whole process took 10 minutes in Bash instead of an hour in a GUI tool.
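
As a rough sketch, the chained version looked something like this (the file names and filter pattern are illustrative, not the real ones):

# filter to one region, keep two columns, convert commas to tabs
grep "California" raw_sales.csv | awk -F, '{print $2 "," $5}' | sed 's/,/\t/g' > clean_sales.tsv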

 

Conclusion

 
Bash might not seem as glamorous as Python or R, but it’s a critical tool for any data scientist. Master these 10 commands, and you’ll find yourself saving time, reducing headaches, and feeling like a pro when working with data.

Do you have a favorite Bash command or tip? Let me know in the comments below! Also, don’t forget to share this blog with fellow data enthusiasts who might find it helpful.
 
 

Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.


