-
Notifications
You must be signed in to change notification settings - Fork 0
Data formats
http://blog.cloudera.com/blog/2016/04/benchmarking-apache-parquet-the-allstate-experience/
https://www.svds.com/how-to-choose-a-data-format/
(Munich presentation)
https://www.slideshare.net/oom65/file-format-benchmarks-avro-json-orc-parquet
http://kitesdk.org/docs/1.1.0/Parquet-vs-Avro-Format.html
Avro is a row-based storage format for Hadoop.
Parquet is a column-based storage format for Hadoop.
If your use case typically scans or retrieves all of the fields in a row in each query, Avro is usually the best choice.
If your dataset has many columns, and your use case typically involves working with a subset of those columns rather than entire records, Parquet is optimized for that kind of work.
https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html
Summary of these options, where they fit
So you have some data that you want to store in a file or send over the network. You may find yourself going through several phases of evolution:
-
Using your programming language’s built-in serialization, such as Java serialization, Ruby’s marshal, or Python’s pickle. Or maybe you even invent your own format.
-
Then you realise that being locked into one programming language sucks, so you move to using a widely supported, language-agnostic format like JSON (or XML if you like to party like it’s 1999).
-
Then you decide that JSON is too verbose and too slow to parse, you’re annoyed that it doesn’t differentiate integers from floating point, and think that you’d quite like binary strings as well as Unicode strings. So you invent some sort of binary format that’s kinda like JSON, but binary (1, 2, 3, 4, 5, 6).
-
Then you find that people are stuffing all sorts of random fields into their objects, using inconsistent types, and you’d quite like a schema and some documentation, thank you very much. Perhaps you’re also using a statically typed programming language and want to generate model classes from a schema. Also you realize that your binary JSON-lookalike actually isn’t all that compact, because you’re still storing field names over and over again; hey, if you had a schema, you could avoid storing objects’ field names, and you could save some more bytes!
Once you get to the fourth stage, your options are typically Thrift, Protocol Buffers or Avro. All three provide efficient, cross-language serialization of data using a schema, and code generation for the Java folks.