How to Choose a Data Format

In any data analytics project, once the business understanding phase is complete, understanding the data and selecting the right data format and ETL tools become critical tasks. This article presents a practical set of guidelines covering the data format selection and ETL phases of the project lifecycle.



Use case scenarios
Each of the data formats has its own strengths, weaknesses, and trade-offs, so the decision on which format to use should be based on your specific use cases and systems.

If your main focus is to be able to write data as fast as possible and you have no concerns about space, then it might be acceptable to just store your data in text format with the understanding that query times for large data sets will be longer.

If your main concern is being able to handle evolving data in your system, then you can rely on Avro, which stores its schema alongside the data. Keep in mind, though, that Avro requires a schema to be defined before files are written to the system, which might involve some additional processing at the beginning.
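
As a minimal sketch of what that up-front schema looks like, the snippet below uses the fastavro Python library; the record fields and output path are illustrative, not the exact schema used in the tests.

```python
# A minimal sketch: writing Avro requires declaring the schema up front.
# Field names and the output path are illustrative.
from fastavro import parse_schema, writer

schema = parse_schema({
    "name": "LogEntry",
    "type": "record",
    "fields": [
        {"name": "ip", "type": "string"},
        {"name": "timestamp", "type": "string"},
        {"name": "status", "type": "int"},
    ],
})

records = [
    {"ip": "10.0.0.1", "timestamp": "2016-01-01T00:00:00", "status": 200},
    {"ip": "10.0.0.2", "timestamp": "2016-01-01T00:00:05", "status": 404},
]

# The schema is embedded in the file, which is what lets readers cope
# with later schema evolution (fields added or removed with defaults).
with open("logs.avro", "wb") as out:
    writer(out, schema, records)
```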

Finally, if your main use case is analysis of the data and you would like to optimize query performance, then you might want to look at a columnar format such as Parquet or ORC. These offer the best query performance, particularly for partial reads where you only need specific columns. However, the speed advantage shrinks if you are reading all the columns.
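
To illustrate the column-pruning effect, here is a small PySpark sketch; the file path and column names are placeholders rather than the actual test tables.

```python
# A minimal PySpark sketch of column pruning on a columnar format.
# The path and column names are illustrative.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-demo").getOrCreate()

df = spark.read.parquet("/data/wide_dataset.parquet")

# Only the 'id' and 'amount' column chunks are read from disk;
# the other columns are skipped entirely.
df.select("id", "amount").where(df["amount"] > 100).show()

# Selecting every column forfeits most of that advantage.
df.select("*").count()
```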

There is a pattern across these use cases: if a format takes longer to write, it is usually because it has been optimized to speed up reads.

Tests
We have already discussed how choosing the right data format for your system depends on several factors; to provide a more comprehensive explanation, we set out to empirically compare the different data formats in terms of performance for writing and reading files.

We created some quantitative tests comparing the following five data formats available in the Hadoop ecosystem (a sketch of how the same data can be written in each format follows the list):

– Text
– Sequence
– Avro
– Parquet
– ORC

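The sketch below shows, in PySpark, one way the same DataFrame could be persisted in each of these formats before timing the queries; the input and output paths are placeholders, and the Avro write assumes the spark-avro package is available.

```python
# A rough sketch of persisting one DataFrame in each tested format.
# Paths are placeholders for the actual benchmark locations.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-formats").getOrCreate()

# Load the source data; path and options are illustrative.
df = spark.read.csv("/data/narrow_dataset.csv", header=True, inferSchema=True)

# Plain text (CSV): fastest to write, slowest to query at scale.
df.write.mode("overwrite").csv("/bench/text")

# Sequence file: stores key/value pairs, so rows are mapped to pairs first.
df.rdd.map(lambda row: (str(row[0]), ",".join(str(v) for v in row))) \
      .saveAsSequenceFile("/bench/sequence")

# Avro: assumes the spark-avro package is on the classpath.
df.write.mode("overwrite").format("avro").save("/bench/avro")

# Columnar formats.
df.write.mode("overwrite").parquet("/bench/parquet")
df.write.mode("overwrite").orc("/bench/orc")
```
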
Our tests measure execution times for a couple of different exploratory queries, using the following technologies (a sketch of how such a query can be timed appears after the list):

– Hive
– Impala

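As an illustration of how such a query can be timed from Python, the sketch below runs the same statement through PyHive (for Hive) and impyla (for Impala); the hosts, ports, table name, and query itself are placeholders for the actual test setup.

```python
# A sketch of timing one exploratory query through Hive (PyHive)
# and Impala (impyla). Hosts, ports, table names, and the query
# are placeholders.
import time
from pyhive import hive
from impala.dbapi import connect as impala_connect

QUERY = "SELECT status, COUNT(*) FROM logs_parquet GROUP BY status"

def timed(cursor, sql):
    start = time.time()
    cursor.execute(sql)
    cursor.fetchall()
    return time.time() - start

hive_cur = hive.Connection(host="hive-host", port=10000).cursor()
impala_cur = impala_connect(host="impala-host", port=21050).cursor()

print("Hive:   %.2fs" % timed(hive_cur, QUERY))
print("Impala: %.2fs" % timed(impala_cur, QUERY))
```
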
We tested against three different datasets:

  • Narrow dataset – 10,000,000 rows of 10 columns resembling an Apache log file
  • Wide dataset – 4,000,000 rows of 1,000 columns, composed of a few columns of personal identification data followed by columns of random numbers and booleans (a generation sketch follows this list).
  • Wide dataset, large – the wide dataset scaled up to 1 TB, totaling 302,924,000 rows.

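For reference, a dataset of that wide shape can be generated with a few lines of Python; the column names, row count, and value distribution below are illustrative rather than the exact generator used in the tests.

```python
# A rough sketch of generating a wide dataset: a few identification
# columns followed by random numbers and booleans. Names and counts
# are illustrative, not the exact test generator.
import csv
import random

N_ROWS = 1000          # the real wide dataset used 4,000,000 rows
N_RANDOM_COLS = 996    # plus 4 identification columns = 1,000 columns

header = ["id", "first_name", "last_name", "email"] + \
         ["col_%d" % i for i in range(N_RANDOM_COLS)]

with open("wide_dataset.csv", "w", newline="") as f:
    out = csv.writer(f)
    out.writerow(header)
    for i in range(N_ROWS):
        person = [i, "First%d" % i, "Last%d" % i, "user%d@example.com" % i]
        # Alternate random floats and booleans for the remaining columns.
        extras = [random.random() if j % 2 == 0 else random.choice([True, False])
                  for j in range(N_RANDOM_COLS)]
        out.writerow(person + extras)
```
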
Full results of the testing, along with the appropriate code that you can try on your own system, can be found here.


Bio: Silvia Oliveros has a background in computer engineering and visual analytics, and has worked on several projects helping clients explore and analyze their data. She is interested in building and optimizing the infrastructure and data pipelines used to gather insights from various datasets.
