-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-94: [Format] Expand list example to clarify null vs empty list #58
Changes from 2 commits
590e4a7
0f23052
7dda5d5
b7aa7ea
69e1a78
5550a78
cab6f87
00b99ef
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,7 +9,7 @@ concepts, here is a small glossary to help disambiguate. | |
* Slot or array slot: a single logical value in an array of some particular data type | ||
* Contiguous memory region: a sequential virtual address space with a given | ||
length. Any byte can be reached via a single pointer offset less than the | ||
region’s length. | ||
region's length. | ||
* Primitive type: a data type that occupies a fixed-size memory slot specified | ||
in bit width or byte width | ||
* Nested or parametric type: a data type whose full structure depends on one or | ||
|
@@ -42,7 +42,7 @@ Base requirements | |
* Capable of representing fully-materialized and decoded / decompressed Parquet | ||
data | ||
* All leaf nodes (primitive value arrays) use contiguous memory regions | ||
* Any relative type can be have null slots | ||
* Any relative type can have null slots | ||
* Arrays are immutable once created. Implementations can provide APIs to mutate | ||
an array, but applying mutations will require a new array data structure to | ||
be built. | ||
|
@@ -69,11 +69,15 @@ Base requirements | |
* To define a selection or masking vector construct | ||
* Implementation-specific details | ||
* Details of a user or developer C/C++/Java API. | ||
* Any “table” structure composed of named arrays each having their own type or | ||
* Any "table" structure composed of named arrays each having their own type or | ||
any other structure that composes arrays. | ||
* Any memory management or reference counting subsystem | ||
* To enumerate or specify types of encodings or compression support | ||
|
||
## Byte Order (Endianess) | ||
|
||
The Arrow format is little endian. | ||
|
||
## Array lengths | ||
|
||
Any array has a known and fixed length, stored as a 32-bit signed integer, so a | ||
|
@@ -142,10 +146,61 @@ the size is rounded up to the nearest byte. | |
The associated null bitmap is contiguously allocated (as described above) but | ||
does not need to be adjacent in memory to the values buffer. | ||
|
||
(diagram not to scale) | ||
|
||
<img src="diagrams/layout-primitive-array.png" width="400"/> | ||
### Example Layout: Int32 Array | ||
For example a primitive array of int32s: | ||
|
||
[1, 2, null, 4, 8] | ||
|
||
Would look like: | ||
|
||
``` | ||
* Length: 5, Null count: 1 | ||
* Null bitmap buffer: | ||
|
||
|Byte 0 (validity bitmap) | Bytes 1-7 | | ||
|-------------------------|-----------------------| | ||
|00011011 | 0 (padding) | | ||
|
||
* Value Buffer: | ||
|
||
|Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | | ||
|------------|-------------|-------------|-------------|-------------| | ||
| 1 | 2 | unspecified | 4 | 8 | | ||
|
||
``` | ||
|
||
### Example Layout: Non-null int32 Array | ||
|
||
[1, 2, 3, 4, 8] has two possible layouts: | ||
|
||
``` | ||
* Length: 5, Null count: 0 | ||
* Null bitmap buffer: | ||
|
||
| Byte 0 (validity bitmap) | Bytes 1-7 (padding) | | ||
|--------------------------|-----------------------| | ||
| 00011111 | 0 (padding) | | ||
|
||
* Value Buffer: | ||
|
||
|Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | bytes 12-15 | bytes 16-19 | | ||
|------------|-------------|-------------|-------------|-------------| | ||
| 1 | 2 | 3 | 4 | 8 | | ||
``` | ||
|
||
or with the bitmap elided: | ||
|
||
``` | ||
* Length 5, Null count: 0 | ||
* Null bitmap buffer: Not required | ||
* Value Buffer: | ||
|
||
|Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | bytes 12-15 | bytes 16-19 | | ||
|------------|-------------|-------------|-------------|-------------| | ||
| 1 | 2 | 3 | 4 | 8 | | ||
|
||
``` | ||
## List type | ||
|
||
List is a nested type in which each array slot contains a variable-size | ||
|
@@ -175,20 +230,84 @@ slot_length = offsets[j + 1] - offsets[j] // (for 0 <= j < length) | |
The first value in the offsets array is 0, and the last element is the length | ||
of the values array. | ||
|
||
Let’s consider an example, the type `List<Char>`, where Char is a 1-byte | ||
### Example Layout: `List<Char>` Array | ||
Let's consider an example, the type `List<Char>`, where Char is a 1-byte | ||
logical type. | ||
|
||
For an array of length 3 with respective values: | ||
For an array of length 4 with respective values: | ||
|
||
[[‘j’, ‘o’, ‘e’], null, [‘m’, ‘a’, ‘r’, ‘k’]] | ||
[['j', 'o', 'e'], null, ['m', 'a', 'r', 'k'], []] | ||
|
||
We have the following offsets and values arrays | ||
will have the following representation: | ||
|
||
<img src="diagrams/layout-list.png" width="400"/> | ||
``` | ||
* Length: 4, Null count: 1 | ||
* Null bitmap buffer: | ||
|
||
Let’s consider an array of a nested type, `List<List<byte>>` | ||
| Byte 0 (validity bitmap) | Bytes 0-7 | | ||
|--------------------------|-----------------------| | ||
| 00001101 | 0 (padding) | | ||
|
||
<img src="diagrams/layout-list-of-list.png" width="400"/> | ||
* Offsets array (int32 array) | ||
* Length: 5, Null count: 0 | ||
* Null bitmap buffer: Not required | ||
* Value Buffer (offsets into the Values array): | ||
|
||
| Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | Bytes 16-19 | | ||
|------------|-------------|-------------|-------------|-------------| | ||
| 0 | 3 | 3 | 7 | 7 | | ||
|
||
* Values array (char array): | ||
* Length: 7, Null count: 0 | ||
* Null bitmap buffer: Not required | ||
|
||
| Bytes 0-7 | | ||
|------------| | ||
| joemark | | ||
``` | ||
|
||
### Example Layout: `List<List<byte>>` | ||
[[[1, 2], [3, 4]], [[5, 6, 7], null, [8]], [[9, 10]]] | ||
|
||
will be be represented as follows: | ||
|
||
``` | ||
* Length 3 | ||
* Nulls count: 0 | ||
* Null bitmap buffer: Not required | ||
* Offsets array (int32 array) | ||
* Length: 4, Null count: 0 | ||
* Null bitmap buffer: Not required | ||
* Value Buffer (offsets into the Values array): | ||
|
||
| Bytes 0-3 | Bytes 3-6 | Bytes 7-10 | Bytes 10-13 | | ||
|------------|------------|------------|-------------| | ||
| 0 | 2 | 6 | 7 | | ||
|
||
* Values array (`List<byte>`) | ||
* Length: 6, Null count: 1 | ||
* Null bitmap buffer: | ||
|
||
| Byte 0 (validity bitmap) | Bytes 1-7 | | ||
|--------------------------|-------------| | ||
| 00110111 | 0 (padding) | | ||
|
||
* Offsets array (int32 array) | ||
* Length 7, Null count: 0 | ||
* Null bitmap buffer: Not required | ||
|
||
| Bytes 0-28 | | ||
|----------------------| | ||
| 0, 2, 4, 7, 7, 8, 10 | | ||
|
||
* Values array (bytes): | ||
* Length: 10, Null count: 0 | ||
* Null bitmap buffer: Not required | ||
|
||
| Bytes 0-9 | | ||
|-------------------------------| | ||
| 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 | | ||
``` | ||
|
||
## Struct type | ||
|
||
|
@@ -198,7 +317,8 @@ types (which can all be distinct), called its fields. | |
Typically the fields have names, but the names and their types are part of the | ||
type metadata, not the physical memory layout. | ||
|
||
A struct does not have any additional allocated physical storage. | ||
A struct array does not have any additional allocated physical storage for its values. | ||
A struct array must still have an allocated null bitmap, if it has one or more null values. | ||
|
||
Physically, a struct type has one child array for each field. | ||
|
||
|
@@ -213,15 +333,63 @@ Struct < | |
``` | ||
|
||
has two child arrays, one List<char> array (layout as above) and one 4-byte | ||
physical value array having Int32 logical type. Here is a diagram showing the | ||
full physical layout of this struct: | ||
primitive value array having Int32 logical type. | ||
|
||
### Example Layout: `Struct<List<char>, Int32>`: | ||
The layout for [{'joe', 1}, {null, 2}, null, {'mark', 4}] would be: | ||
|
||
``` | ||
* Length: 4, Null count: 1 | ||
* Null bitmap buffer: | ||
|
||
<img src="diagrams/layout-list-of-struct.png" width="400"/> | ||
| Byte 0 (validity bitmap) | Bytes 1-7 | | ||
|--------------------------|-------------| | ||
| 00001011 | 0 (padding) | | ||
|
||
* Children arrays: | ||
* field-0 array (`List<char>`): | ||
* Length: 4, Null count: 1 | ||
* Null bitmap buffer: | ||
|
||
| Byte 0 (validity bitmap) | Bytes 1-7 | | ||
|--------------------------|-----------------------| | ||
| 00011101 | 0 (padding) | | ||
|
||
* Offsets array: | ||
* Length: 5, Null count: 0 | ||
* Null bitmap buffer: Not required | ||
|
||
| byte 0-19 | | ||
|----------------| | ||
| 0, 3, 3, 6, 10 | | ||
|
||
* Values array: | ||
* Length: 10, Null count: 0 | ||
* Null bitmap buffer: Not required | ||
|
||
* Value buffer: | ||
|
||
| byte 0-9 | | ||
|----------------| | ||
| joebobmark | | ||
|
||
* field-1 array (int32 array): | ||
* Length: 4, Null count: 0 | ||
* Null bitmap buffer: Not required | ||
* Value Buffer: | ||
|
||
| byte 0-15 | | ||
|----------------| | ||
| 1, 2, 3, 4 | | ||
|
||
``` | ||
|
||
While a struct does not have physical storage for each of its semantic slots | ||
(i.e. each scalar C-like struct), an entire struct slot can be set to null via | ||
the null bitmap. Any of the child field arrays can have null values according | ||
to their respective independent null bitmaps. | ||
In the example above, the child arrays have a valid entries for the null struct | ||
but are 'hidden' from the consumer by the parent array's null bitmap. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As I recall there have been some questions around whether the child arrays' bitmaps must necessarily be set to null if the parent struct slot is null. I think the answer is "no" (and, in fact, you could combine a set of immutable constructed-elsewhere arrays with a bitmap that you layer on top to "null out" those other values), so having to twiddle other bitmaps would be onerous and ultimately not that useful. If you agree perhaps we can spell this out here also specifically. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I do agree and I thought I read this elsewhere in the spec, but sounds like it is worth confirming the mailing list? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is this statement already: "Any of the child field arrays can have null values according to their respective independent null bitmaps." I don't think it was ever a controversial point but rather just unclear -- feel free to raise it on the mailing list if you like, though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Rereading the spec, I agree. I will update the main text accordingly. |
||
|
||
## Dense union type | ||
|
||
|
@@ -248,8 +416,8 @@ Alternate proposal (TBD): the types and offset values may be packed into an | |
int48 with 2 bytes for the type and 4 bytes for the offset. | ||
|
||
Critically, the dense union allows for minimal overhead in the ubiquitous | ||
union-of-structs with non-overlapping-fields use case (Union<s1: Struct1, s2: | ||
Struct2, s3: Struct3, …>) | ||
union-of-structs with non-overlapping-fields use case (`Union<s1: Struct1, s2: | ||
Struct2, s3: Struct3, ...>`) | ||
|
||
Here is a diagram of an example dense union: | ||
|
||
|
@@ -266,15 +434,18 @@ union, it has some advantages that may be desirable in certain use cases: | |
|
||
<img src="diagrams/layout-sparse-union.png" width="400"/> | ||
|
||
More amenable to vectorized expression evaluation in some use cases. | ||
Equal-length arrays can be interpreted as a union by only defining the types array | ||
* A sparse union is more amenable to vectorized expression evaluation in some use cases. | ||
* Equal-length arrays can be interpreted as a union by only defining the types array. | ||
|
||
Note that nested types in a sparse union must be internally consistent | ||
(e.g. see the List in the diagram), i.e. random access at any index j yields | ||
the correct value. | ||
In other words, the array for the nested type must be valid if it is | ||
reinterpreted as a non-nested array. | ||
|
||
|
||
## References | ||
|
||
Drill docs https://drill.apache.org/docs/value-vectors/ | ||
|
||
[1]: https://en.wikipedia.org/wiki/Bit_numbering | ||
[1]: https://en.wikipedia.org/wiki/Bit_numbering |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought a struct array will always have the null bitmap, regardless being of any null entry or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It isn't required to have the memory allocated if there are no nulls, like the other array types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it, thx. I'm not sure about how these types are used. When creating the object of such type, it must be known before if any null exists by scanning the data to put into it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider the following dataset: