CSV vs. Parquet vs. Arrow: Storage Formats Explained
Same data, different formats, very different performance.

# Introduction
Hugging Face Datasets provides one of the most straightforward ways to load a dataset with a single line of code. These datasets are frequently distributed in formats such as CSV, Parquet, and Arrow. While all three are designed to store tabular data, they behave very differently under the hood. The choice of format determines how the data is stored, how quickly it can be loaded, how much storage space is required, and how faithfully data types are preserved. These differences become increasingly significant as datasets grow larger and models become more complex. In this article, we will look at how Hugging Face Datasets works with CSV, Parquet, and Arrow, what actually makes them different on disk and in memory, and when each one makes sense to use. So, let’s get started.
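To show just how little code is involved, here is a minimal sketch with the `datasets` library. The file names are placeholders for files you already have on disk:

```python
from datasets import load_dataset

# One line per format; the loader picks the right parser for each file type.
csv_ds = load_dataset("csv", data_files="data.csv")          # plain-text CSV
parquet_ds = load_dataset("parquet", data_files="data.parquet")  # binary columnar
```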
# 1. CSV
CSV stands for Comma-Separated Values. It’s just text: one row per line, columns separated by commas (or another delimiter such as tabs). Almost every tool can open it, e.g. Excel, Google Sheets, pandas, and databases. It’s very simple and interoperable.
Example:
name,age,city
Kanwal,30,New York
Qasim,25,Edmonton
Hugging Face treats it as a row-based format, meaning data is read row by row. This is acceptable for small datasets, but performance deteriorates as the data grows (a short loading sketch follows the list below). Additionally, there are some other limitations, such as:
- No explicit schema: As all data is stored in text format, types need to be inferred every time the file is loaded. This may cause errors if the data is not consistent.
- Large size and slow I/O: Text storage increases the file size, and parsing numbers from text is CPU-intensive.
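To make the schema problem concrete, here is a minimal sketch (file name is a placeholder) that writes a tiny CSV and loads it with Hugging Face Datasets. The types you see are re-inferred from text on every load:

```python
from datasets import load_dataset

# Write a tiny CSV file to disk.
with open("people.csv", "w") as f:
    f.write("name,age,city\nKanwal,30,New York\nQasim,25,Edmonton\n")

# Types are inferred from text each time the file is loaded; a single stray
# value like "N/A" in the age column would silently turn it into strings.
ds = load_dataset("csv", data_files="people.csv", split="train")
print(ds.features)  # inferred schema, e.g. age becomes int64
```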
# 2. Parquet
Parquet is a binary columnar format. Instead of writing rows one after another like CSV, Parquet groups values by column. That makes reads and queries much faster when you only need a few columns, and compression keeps file sizes and I/O low. Parquet also stores a schema, so types are preserved. It works best for batch processing and large-scale analytics rather than for many small, frequent updates to the same file: it favors batch writes over constant edits. If we take the CSV example above, Parquet stores all names together, all ages together, and all cities together. This is the columnar layout, and conceptually it looks like this:
Names: Kanwal, Qasim
Ages: 30, 25
Cities: New York, Edmonton
It also stores metadata for each column: the type, min/max values, null counts, and compression info. This enables faster reads, efficient storage, and accurate type handling. Compression algorithms like Snappy or Gzip further reduce disk space. It has the following strengths (a short read/write sketch follows the list):
- Compression: Similar column values compress well. Files are smaller and cheaper to store.
- Column-wise reading: Load only the columns you need, speeding up queries.
- Rich typing: Schema is stored, so no guessing types on every load.
- Scale: Works well for millions or billions of rows.
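A minimal sketch of the column-wise advantage using pyarrow (file name and column choice are just for illustration): write the table once, then read back only the columns you need.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a small table and write it to Parquet with Snappy compression.
table = pa.table({
    "name": ["Kanwal", "Qasim"],
    "age": [30, 25],
    "city": ["New York", "Edmonton"],
})
pq.write_table(table, "people.parquet", compression="snappy")

# Column-wise read: only the requested column is decoded from disk.
ages = pq.read_table("people.parquet", columns=["age"])
print(ages.schema)  # the schema (types) is stored in the file, not re-inferred
```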
# 3. Arrow
Arrow plays a different role from CSV and Parquet. It is a columnar format designed to be kept in memory for fast operations. In Hugging Face, every Dataset is backed by an Arrow table, whether you started from CSV, Parquet, or an Arrow file. Continuing with the same example table, Arrow also stores data column by column, but in memory:
Names: contiguous memory block storing Kanwal, Qasim
Ages: contiguous memory block storing 30, 25
Cities: contiguous memory block storing New York, Edmonton
Because the data sits in contiguous blocks, operations on a column (like filtering, mapping, or summing) are extremely fast. Arrow also supports memory mapping, which allows datasets to be accessed from disk without fully loading them into RAM (a small memory-mapping sketch follows the list). Some of the key benefits of this format are:
- Zero-copy reads: Memory-map files without loading everything into RAM.
- Fast column access: Columnar layout enables vectorized operations.
- Rich types: Handles nested data, lists, tensors.
- Interoperable: Works with pandas, PyArrow, Spark, Polars, and more.
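A small sketch of memory mapping with pyarrow’s Arrow IPC file format (the file name is a placeholder): the table is written to disk once, then mapped so columns can be accessed without copying everything into RAM.

```python
import pyarrow as pa
import pyarrow.ipc as ipc

# Write an Arrow IPC file to disk.
table = pa.table({"name": ["Kanwal", "Qasim"], "age": [30, 25]})
with pa.OSFile("people.arrow", "wb") as sink:
    with ipc.new_file(sink, table.schema) as writer:
        writer.write_table(table)

# Memory-map the file: data is accessed lazily instead of loaded up front.
with pa.memory_map("people.arrow", "r") as source:
    mapped = ipc.open_file(source).read_all()

print(mapped.column("age"))       # fast, contiguous column access
print(mapped.to_pandas().head())  # interoperates with pandas
```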
# Wrapping Up
Hugging Face Datasets makes switching formats routine. Use CSV for quick experiments, Parquet to store large tables, and Arrow for fast in-memory training. Knowing when to use each keeps your pipeline fast and simple, so you can spend more time on the model.
Kanwal Mehreen is a machine learning engineer and a technical writer with a profound passion for data science and the intersection of AI with medicine. She co-authored the ebook "Maximizing Productivity with ChatGPT". As a Google Generation Scholar 2022 for APAC, she champions diversity and academic excellence. She's also recognized as a Teradata Diversity in Tech Scholar, Mitacs Globalink Research Scholar, and Harvard WeCode Scholar. Kanwal is an ardent advocate for change, having founded FEMCodes to empower women in STEM fields.