
Data formats


Performance comparison of different file formats and storage engines in the Apache Hadoop ecosystem

https://blog.cloudera.com/blog/2017/02/performance-comparing-of-different-file-formats-and-storage-engines-in-hadoop-file-system/

Benchmarking Apache Parquet: The Allstate Experience

http://blog.cloudera.com/blog/2016/04/benchmarking-apache-parquet-the-allstate-experience/

How to Choose a Data Format

https://www.svds.com/how-to-choose-a-data-format/

File Format Benchmarks - Avro, JSON, ORC, & Parquet (Munich presentation)

https://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet

Parquet vs Avro Format

http://kitesdk.org/docs/1.1.0/Parquet-vs-Avro-Format.html

Avro is a row-based storage format for Hadoop.

Parquet is a column-based storage format for Hadoop.

If your use case typically scans or retrieves all of the fields in a row in each query, Avro is usually the best choice.

If your dataset has many columns, and your use case typically involves working with a subset of those columns rather than entire records, Parquet is optimized for that kind of work.
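To make the row/column distinction concrete, here is a minimal local-file sketch. It assumes the fastavro and pyarrow Python libraries and made-up file and field names; neither library is mentioned by the sources above.

```python
import pyarrow as pa
import pyarrow.parquet as pq
from fastavro import parse_schema, reader, writer

schema = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "user_id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "score", "type": "double"},
    ],
})
rows = [
    {"user_id": 1, "name": "alice", "score": 9.5},
    {"user_id": 2, "name": "bob", "score": 7.2},
]

# Avro is row-oriented: each read hands back a complete record,
# which is efficient when queries touch every field.
with open("users.avro", "wb") as out:
    writer(out, schema, rows)
with open("users.avro", "rb") as fo:
    for record in reader(fo):
        print(record)

# Parquet is column-oriented: requesting a subset of columns reads
# only those columns' bytes from disk -- the win for wide tables.
pq.write_table(pa.table({
    "user_id": [1, 2],
    "name": ["alice", "bob"],
    "score": [9.5, 7.2],
}), "users.parquet")
subset = pq.read_table("users.parquet", columns=["user_id", "score"])
print(subset.to_pydict())
```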

Schema evolution in Avro, Protocol Buffers and Thrift

https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
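The core mechanism the post describes for Avro is writer/reader schema resolution: a field added with a default value can be filled in when reading data written under the old schema. A minimal sketch of that, again assuming fastavro and hypothetical file names:

```python
from fastavro import parse_schema, reader, writer

# v1 schema: the writer only knows about "name".
v1 = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [{"name": "name", "type": "string"}],
})

# v2 schema: a later reader adds "email" with a default, which is
# what makes the change backward-compatible in Avro.
v2 = parse_schema({
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "email", "type": "string", "default": "unknown"},
    ],
})

with open("old.avro", "wb") as out:
    writer(out, v1, [{"name": "alice"}])

# Reading v1 data with the v2 reader schema: Avro resolves the two
# schemas and fills the missing field from its default.
with open("old.avro", "rb") as fo:
    for record in reader(fo, reader_schema=v2):
        print(record)  # {'name': 'alice', 'email': 'unknown'}
```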

Summary of these options and where they fit

So you have some data that you want to store in a file or send over the network. You may find yourself going through several phases of evolution:

  1. Using your programming language’s built-in serialization, such as Java serialization, Ruby’s marshal, or Python’s pickle. Or maybe you even invent your own format.

  2. Then you realise that being locked into one programming language sucks, so you move to using a widely supported, language-agnostic format like JSON (or XML if you like to party like it’s 1999).

  3. Then you decide that JSON is too verbose and too slow to parse, you’re annoyed that it doesn’t differentiate integers from floating point, and think that you’d quite like binary strings as well as Unicode strings. So you invent some sort of binary format that’s kinda like JSON, but binary.

  4. Then you find that people are stuffing all sorts of random fields into their objects, using inconsistent types, and you’d quite like a schema and some documentation, thank you very much. Perhaps you’re also using a statically typed programming language and want to generate model classes from a schema. Also you realise that your binary JSON-lookalike actually isn’t all that compact, because you’re still storing field names over and over again; hey, if you had a schema, you could avoid storing objects’ field names, and you could save some more bytes!

Once you get to the fourth stage, your options are typically Thrift, Protocol Buffers or Avro. All three provide efficient, cross-language serialization of data using a schema, and code generation for the Java folks.
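To put a number on step 4's point about repeated field names, here is a small sketch (again assuming fastavro; the record and field names are made up) that encodes one record both ways:

```python
import io
import json
from fastavro import parse_schema, schemaless_writer

schema = parse_schema({
    "type": "record",
    "name": "Point",
    "fields": [
        {"name": "latitude", "type": "double"},
        {"name": "longitude", "type": "double"},
        {"name": "label", "type": "string"},
    ],
})
record = {"latitude": 52.52, "longitude": 13.405, "label": "berlin"}

# JSON repeats every field name in every record.
as_json = json.dumps(record).encode()

# With a schema, the encoding carries only the values: two 8-byte
# doubles plus a length-prefixed string.
buf = io.BytesIO()
schemaless_writer(buf, schema, record)

print(len(as_json), "bytes as JSON vs", len(buf.getvalue()), "bytes as Avro binary")
```

On a record like this, JSON spends most of its bytes on field names and number-to-text conversion, while the schema-based encoding stores only the values.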
