-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Parsing Avro File Headers #4888
Conversation
|
||
//! Transfer data between the Arrow memory format and Avro | ||
|
||
#![allow(unused)] // Temporary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is necessary because currently nothing is public 😄
|
||
mod vlq; | ||
|
||
/// Read a [`Header`] from the provided [`BufRead`] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These methods I expect to bundle up in a public Reader, but one step at a time 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👨🍳 👌 -- very nice @tustvold . I think the APIs and code make lots of sense and I have only suggstions on documentation and testing
|
||
#[test] | ||
fn test_header() { | ||
let header = decode_file("../testing/data/avro/alltypes_plain.avro"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 for using the avro testing data
const MAGIC: &[u8; 4] = b"Obj\x01"; | ||
|
||
impl HeaderDecoder { | ||
/// Parse [`Header`] from `buf`, returning the number of bytes read |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would help to mention in the docs that
- the header can be pushed in chunks at a time (by calling
decode
multiple times) - when done, call
finish()
- What to do with remaining buffered data (it has to be preprended to the next call to decode)
- What happens on error (I think the entire decoder state basically is "unknown" and it can't recover)
arrow-avro/src/reader/header.rs
Outdated
} | ||
|
||
let mut decoder = HeaderDecoder::default(); | ||
decoder.decode(MAGIC).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this should also verify that the decoder reports consuming the correct number of bytes (4)?
fn test_header() { | ||
let header = decode_file("../testing/data/avro/alltypes_plain.avro"); | ||
let schema_json = header.get(SCHEMA_METADATA_KEY).unwrap(); | ||
let _schema: Schema<'_> = serde_json::from_slice(schema_json).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would be a better test (more clearly show it obviously worked) if it compared the entire schema json (not just a single field, which I have no idea if the value of the sync byte is correct)
let file = File::open("../testing/data/avro/alltypes_plain.avro").unwrap(); | ||
let mut reader = BufReader::new(file); | ||
let header = read_header(&mut reader).unwrap(); | ||
for result in read_blocks(reader) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps it could verify the number of blocks? Though maybe that will be covered as the reader becomes more sophisticated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah this is really just a placeholder test
arrow-avro/src/reader/vlq.rs
Outdated
/// Decoder for zig-zag encoded VLQ integers | ||
/// | ||
/// <https://protobuf.dev/programming-guides/encoding/#varints> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW the avro docs include a reference to lucene (not protobuf): https://lucene.apache.org/core/3_5_0/fileformats.html#VInt
Maybe not a big deal, but it was weird to see the protobuf reference here
/// Decoder for zig-zag encoded VLQ integers | |
/// | |
/// <https://protobuf.dev/programming-guides/encoding/#varints> | |
/// Decoder for zig-zag encoded variable length (VLW) integers | |
/// See also: | |
/// <https://avro.apache.org/docs/1.11.1/specification/#primitive-types-1> | |
/// <https://protobuf.dev/programming-guides/encoding/#varints> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe not a big deal, but it was weird to see the protobuf reference here
Yeah, amusingly the Avro spec links to protobuf as well. Avro is very heavily inspired by protobuf, and isn't ashamed to admit it 😄
})) | ||
); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Eventually it would be great to have a arrow-avro/tests/reader.rs
or something that reads the contents of th avro files in arrow-testing and verifies the conevrted schema.
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
The integration test failures are unrelated and appear to relate to apache/arrow#37990 |
Which issue does this PR close?
Part of #4886
Rationale for this change
This adds support for decoding the avro headers, and parsing the embedded schema.
This is a necessary first step towards supporting reading avro data
What changes are included in this PR?
Are there any user-facing changes?