-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-94: [Format] Expand list example to clarify null vs empty list #58
Conversation
|
||
While a struct does not have physical storage for each of its semantic slots | ||
(i.e. each scalar C-like struct), an entire struct slot can be set to null via | ||
the null bitmap. Any of the child field arrays can have null values according | ||
to their respective independent null bitmaps. | ||
In the example above, the child arrays have a valid entries for the null struct | ||
but are 'hidden' from the consumer by the parent array's null bitmap. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I recall there have been some questions around whether the child arrays' bitmaps must necessarily be set to null if the parent struct slot is null. I think the answer is "no" (and, in fact, you could combine a set of immutable constructed-elsewhere arrays with a bitmap that you layer on top to "null out" those other values), so having to twiddle other bitmaps would be onerous and ultimately not that useful.
If you agree perhaps we can spell this out here also specifically.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do agree and I thought I read this elsewhere in the spec, but sounds like it is worth confirming the mailing list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is this statement already: "Any of the child field arrays can have null values according to their respective independent null bitmaps." I don't think it was ever a controversial point but rather just unclear -- feel free to raise it on the mailing list if you like, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rereading the spec, I agree. I will update the main text accordingly.
This looks great to me -- I think using text diagrams going forward will be more sustainable and clear to external readers, though hopefully we are close to finalizing some of the lingering format details! |
@@ -198,7 +308,8 @@ types (which can all be distinct), called its fields. | |||
Typically the fields have names, but the names and their types are part of the | |||
type metadata, not the physical memory layout. | |||
|
|||
A struct does not have any additional allocated physical storage. | |||
A struct array does not have any additional allocated physical storage for its values. | |||
A struct array must still have an allocated null bitmap, if it has one or more null values. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought a struct array will always have the null bitmap, regardless being of any null entry or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't required to have the memory allocated if there are no nulls, like the other array types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thx. I'm not sure about how these types are used. When creating the object of such type, it must be known before if any null exists by scanning the data to put into it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider the following dataset:
data = [
{foo: 1, bar: null},
null,
null,
{foo: 0, bar: 5}
]
Do you want to also delete the existing graphical diagrams as part of this patch? |
Might as well just pushed another commit. |
+1. Merging this. If there is something causing lingering controversy from this patch I encourage everyone to continue discussion on the mailing list so we can make further decisions there where relevant. |
…related refactoring This also restores passing on user's `CMAKE_CXX_FLAGS`, which had unfortunately led some compiler warnings to creep into our build. Author: Wes McKinney <wesm@apache.org> Closes apache#58 from wesm/PARQUET-457 and squashes the following commits: 4bf12ed [Wes McKinney] * SerializeThriftMsg now writes into an OutputStream. * Refactor page serialization in advance of compression tests * Test compression roundtrip on random bytes for snappy and gzip * Trying LZO compression results in ParquetException * Don't lose user's CMAKE_CXX_FLAGS * Remove Travis CI directory caching for now * Fix gzip memory leak if you do not call inflateEnd, deflateEnd
…related refactoring This also restores passing on user's `CMAKE_CXX_FLAGS`, which had unfortunately led some compiler warnings to creep into our build. Author: Wes McKinney <wesm@apache.org> Closes apache#58 from wesm/PARQUET-457 and squashes the following commits: 4bf12ed [Wes McKinney] * SerializeThriftMsg now writes into an OutputStream. * Refactor page serialization in advance of compression tests * Test compression roundtrip on random bytes for snappy and gzip * Trying LZO compression results in ParquetException * Don't lose user's CMAKE_CXX_FLAGS * Remove Travis CI directory caching for now * Fix gzip memory leak if you do not call inflateEnd, deflateEnd Change-Id: I44a58ef2d22f8e5064d198d0abeecde7ba4de3cb
…related refactoring This also restores passing on user's `CMAKE_CXX_FLAGS`, which had unfortunately led some compiler warnings to creep into our build. Author: Wes McKinney <wesm@apache.org> Closes apache#58 from wesm/PARQUET-457 and squashes the following commits: 4bf12ed [Wes McKinney] * SerializeThriftMsg now writes into an OutputStream. * Refactor page serialization in advance of compression tests * Test compression roundtrip on random bytes for snappy and gzip * Trying LZO compression results in ParquetException * Don't lose user's CMAKE_CXX_FLAGS * Remove Travis CI directory caching for now * Fix gzip memory leak if you do not call inflateEnd, deflateEnd Change-Id: I44a58ef2d22f8e5064d198d0abeecde7ba4de3cb
…related refactoring This also restores passing on user's `CMAKE_CXX_FLAGS`, which had unfortunately led some compiler warnings to creep into our build. Author: Wes McKinney <wesm@apache.org> Closes apache#58 from wesm/PARQUET-457 and squashes the following commits: 4bf12ed [Wes McKinney] * SerializeThriftMsg now writes into an OutputStream. * Refactor page serialization in advance of compression tests * Test compression roundtrip on random bytes for snappy and gzip * Trying LZO compression results in ParquetException * Don't lose user's CMAKE_CXX_FLAGS * Remove Travis CI directory caching for now * Fix gzip memory leak if you do not call inflateEnd, deflateEnd Change-Id: I44a58ef2d22f8e5064d198d0abeecde7ba4de3cb
…related refactoring This also restores passing on user's `CMAKE_CXX_FLAGS`, which had unfortunately led some compiler warnings to creep into our build. Author: Wes McKinney <wesm@apache.org> Closes apache#58 from wesm/PARQUET-457 and squashes the following commits: 4bf12ed [Wes McKinney] * SerializeThriftMsg now writes into an OutputStream. * Refactor page serialization in advance of compression tests * Test compression roundtrip on random bytes for snappy and gzip * Trying LZO compression results in ParquetException * Don't lose user's CMAKE_CXX_FLAGS * Remove Travis CI directory caching for now * Fix gzip memory leak if you do not call inflateEnd, deflateEnd Change-Id: I44a58ef2d22f8e5064d198d0abeecde7ba4de3cb
Add List input and output types for Gandiva functions. Add new reference implementations for array_contains and array_remove, tested via integration with Dremio. int32, int64, double and float list types have been tested. Support List types in function specification and llvm code generation. Pass back function type information through the expression registry. See 1p here: https://docs.google.com/document/d/1exwXdUUnk5FqZLzVZyTdhqgwxTk0u9bL54aLVNM5Tas/edit
Add List input and output types for Gandiva functions. Add new reference implementations for array_contains and array_remove, tested via integration with Dremio. int32, int64, double and float list types have been tested. Support List types in function specification and llvm code generation. Pass back function type information through the expression registry. See 1p here: https://docs.google.com/document/d/1exwXdUUnk5FqZLzVZyTdhqgwxTk0u9bL54aLVNM5Tas/edit
WIP to make sure what I've done so far looks good. Per discussion on the JIRA item I started conversion of examples images to "text diagrams", but I wanted to get feedback to see if this is actually desirable (and if the way I'm approaching it is desirable). The remaining diagrams are for unions which I can convert if the existing changes look OK to others (although I think the Union diagrams are are pretty reasonable/compact).
This change also includes some other minor cleanup, as well as including a statement about endianness per the discussion on the mailing list.
Rendered markdown can be seen at: https://github.com/emkornfield/arrow/blob/emk_doc_fixes_PR3/format/Layout.md