Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

generate parquet schema from rust struct #539

Merged
merged 2 commits into from
Jul 14, 2021

Conversation

nevi-me
Copy link
Contributor

@nevi-me nevi-me commented Jul 11, 2021

Which issue does this PR close?

None

Rationale for this change

Users of parquet_derive currently have to write the schema of their Rust struct by hand.
This is inconvenient, and can be generated for them.

What changes are included in this PR?

Adds parquet::record::RecordWriter::schema() trait member, and an implementation in parquet_derive to generate a schema for the user.

Are there any user-facing changes?

Yes, new API. The breakage is probably irrelevant as I don't think that there are many users of parquet_derive.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 11, 2021
@nevi-me
Copy link
Contributor Author

nevi-me commented Jul 11, 2021

Might be of interest to @xrl and @bryantbiggs as you were the original authors

@codecov-commenter
Copy link

codecov-commenter commented Jul 11, 2021

Codecov Report

Merging #539 (e13b9e7) into master (f1fb2b1) will decrease coverage by 0.14%.
The diff coverage is 22.01%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #539      +/-   ##
==========================================
- Coverage   82.60%   82.45%   -0.15%     
==========================================
  Files         167      167              
  Lines       45984    46084     +100     
==========================================
+ Hits        37984    37999      +15     
- Misses       8000     8085      +85     
Impacted Files Coverage Δ
parquet_derive/src/lib.rs 0.00% <0.00%> (ø)
parquet_derive/src/parquet_field.rs 66.16% <12.50%> (-17.17%) ⬇️
parquet_derive_test/src/lib.rs 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f1fb2b1...e13b9e7. Read the comment docs.

@bryantbiggs
Copy link
Contributor

nice!!! 🎉

@nevi-me nevi-me force-pushed the parquet-derive-generate-schema branch from aa6b1b0 to e13b9e7 Compare July 12, 2021 00:06
@alamb alamb added the parquet_derive parquet_derive crate label Jul 12, 2021
@alamb alamb mentioned this pull request Jul 12, 2021
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice improvement to me.

@@ -69,7 +64,7 @@ mod parquet_field;
/// }
/// ];
///
/// let schema = Arc::new(parse_message_type(schema_str).unwrap());
/// let schema = samples.as_slice().schema();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be cool to avoid having to use as_slice() but this PR seems significantly better than previously, so 👍 for me. We can make it even better in the future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I didn't like it too, but RecordWriter is implemented on &[Struct], so this was the least disruptive change to the trait I could make

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we could implement it for anything that implemented IntoIterator<Item=P> where P: AsRef<Struct> or whatever

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try that out in future PRs. The code that gets generated iterates through the slice n times, where n is the number of fields. It's not a true record writer in the parquet sense, but it uses the column writer internally.

I think we'll be able to improve on it when I work on nested type support.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I appreciate the extra attention to this code. My team relies on the derive macro to translate simple structs to parquet, which we do very often in our data pipeline. I'd be happy to test out future, more comprehensive versions. I applaud you effort, I was too intimidated by the record writer to keep going!

use parquet::basic::*;

let mut fields: Vec<TypePtr> = Vec::new();
#(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is pretty cool 🧙 👍

@@ -174,6 +174,50 @@ impl Field {
}
}

pub fn parquet_type(&self) -> proc_macro2::TokenStream {
// TODO: Support group types
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe we should add notes about these TODO's in the docstrings (aka "group types are not yet supported") ?

@@ -57,27 +64,32 @@ mod tests {
#[test]
fn test_parquet_derive_hello() {
let file = get_temp_file("test_parquet_derive_hello", &[]);
let schema_str = "message schema {

// The schema is not required, but this tests that the generated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

let schema = Arc::new(parse_message_type(schema_str).unwrap());
let generated_schema = drs.as_slice().schema().unwrap();

assert_eq!(&schema, &generated_schema);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@alamb
Copy link
Contributor

alamb commented Jul 13, 2021

I plan to merge this later today (will be included in 5.0) unless anyone else wants more time to review

@alamb alamb merged commit 55a5863 into apache:master Jul 14, 2021
@alamb alamb added the enhancement Any new improvement worthy of a entry in the changelog label Jul 14, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Any new improvement worthy of a entry in the changelog parquet_derive parquet_derive crate parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants