columnar support for Arrow tables #2030
Conversation
The issue with dates can be reduced to this (which is independent of Plot):

```js
import * as Arrow from "apache-arrow";

const data = [{date: new Date(1950, 1, 2)}];
console.log(data, [...Arrow.tableFromJSON(data)].map((d) => ({...d})));
// [ { date: 1950-02-01T23:00:00.000Z } ] [ { date: 1950-03-23T16:02:47.296Z } ]
```

You can see that the date is off by 50 days. Am I missing something, or should I open an issue on https://github.com/apache/arrow @domoritz?

Version information:
Hmm, we’ve had success reading Date values in DuckDB transferred in Arrow format. So I’d first check whether this is an encoding problem (Date to Arrow) or a decoding problem (Arrow to Date). Some encoders in Arrow JS (e.g. for Decimal) are known to be broken; I have sometimes had to use DuckDB or pyarrow to generate Arrow bytes for testing. On a related note, in Mosaic we special-case Timestamp types, since Arrow JS returns those as numbers; we then instantiate Date objects ourselves.
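The Mosaic-style workaround mentioned above can be sketched roughly like this (a hypothetical helper, not Mosaic's actual code): Arrow JS hands back `Timestamp<MILLISECOND>` values as plain numbers, so we wrap them in `Date` ourselves.

```js
// Convert a decoded Timestamp<MILLISECOND> column (plain numbers, possibly
// with nulls) into Date objects. Hypothetical helper for illustration only.
function timestampsToDates(values) {
  return Array.from(values, (v) => (v == null ? v : new Date(v)));
}

// e.g. timestampsToDates([0, 86400000, null])
// → [Date(1970-01-01), Date(1970-01-02), null]
```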
```js
// Arrayify arrow tables. We try to avoid materializing the values, but the
// Proxy might be used by the group reducer to construct groupData.
function arrowTableProxy(data) {
```
Doesn't this negate most of the benefit of using Arrow? I think proxies add a lot of overhead, and Arrow already has fast paths for e.g. iteration. Can you defer the conversion to the group reducer?
I think it works. We still read the named channels with `getChild(name)`, and the `Symbol.iterator` is also passed directly to the `_Table` object, so there shouldn't be any waste here. The only place where this is not great is in the group transform, which assembles a new array of data points for each group (with the `take` function, which I think Arrow doesn't allow). I made a Proxy because I didn't want to add methods on the source `_Table`, but we could just `slice()` it defensively and not use a Proxy?
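To make the idea concrete, here is a minimal sketch (not Plot's actual implementation) of how a Proxy can make a columnar table masquerade as an array of row objects without materializing every row up front. The `table` object below is a hypothetical stand-in for the small surface we assume: a row count plus named columns.

```js
// Hypothetical columnar table: columns keyed by name, all the same length.
const table = {
  numRows: 2,
  columns: {x: [1, 2], y: [10, 20]}
};

// The Proxy answers "length" and numeric indices so array-style consumers
// work; rows are assembled lazily, only when an index is actually read.
const rows = new Proxy(table, {
  get(target, key) {
    if (key === "length") return target.numRows;
    if (typeof key === "string" && /^\d+$/.test(key)) {
      const i = +key;
      const row = {};
      for (const name in target.columns) row[name] = target.columns[name][i];
      return row;
    }
    return Reflect.get(target, key);
  }
});

rows.length; // 2
rows[1]; // {x: 2, y: 20}
```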
Oh, I see. So in most cases you use columnar access anyway. Then maybe ignore my comment.
The internal data is like this:
The date is encoded in the first two 32-bit numbers, which I decode manually to (-146 * (2**32)) - 1498374784 = -628563600000, which is my initial date. So it's apparently the decoding that fails.
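Following the arithmetic in the comment above, the two 32-bit words recombine like so (a standalone check; `high = -146` and `low = 1498374784` are the values read from the table's internal data):

```js
// Recombine the two 32-bit words of the 64-bit millisecond timestamp,
// following the arithmetic quoted above.
const high = -146; // first 32-bit word (signed)
const low = 1498374784; // second 32-bit word
const millis = high * 2 ** 32 - low;
// millis === -628563600000, the originally encoded date in epoch milliseconds
```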
I think DuckDB produces either Date32 or Timestamp values, so I haven't tripped over this yet! Thanks for documenting it.
I've reported the Date issue at apache/arrow#40718. I think it is orthogonal to this PR, since I can get the same error with Plot's main branch and an Arrow table.
Tested with apache/arrow#40725: everything works smoothly (except for the test "mark data parallel to facet data triggers a warning", which is not relevant). Thanks for the super quick fix @trxcllnt and @domoritz!
I love the minimally invasive nature of this change. I also worry that it might be brittle: the proxy masquerading as an Array does "just enough" to pass tests, but is this likely to cause problems in the future if the code assumes that the data is an array?
As a thought exercise, what would it look like if we allowed the data to be an Arrow table throughout the code base? How many places need to read the materialized mark data as an array and would need to be changed to check `isArrowTable`? Given the weight behind Apache Arrow, I'm more inclined to make it a first-class thing. Maybe someday Plot uses Arrow internally as the native data representation.
In the case of the group transform, I would love to avoid materializing the array-of-objects, too. (I think at some point I explored having the group transform not materialize the grouped data by default.)
I’m pretty close to approving this PR as-is, but maybe you can do a little more research before we do so? I’d love to understand the alternatives a little more.
In the PR that fixes the Date bug (apache/arrow#40725), @domoritz also changes the … An example of a (custom) data transform that breaks with Arrow tables is here: it uses `data.flatMap`, which does not exist on the "fake array". We could add it easily. (I don't think we need to rush this; we should probably wait for 40725 to land.)
Yeah, compatibility with native arrays was my main goal with supporting … The next Arrow release is in April, so we could try to get the change in there (releases are roughly every three months). However, you probably don't want to rely on the latest Arrow library being used, so having a stopgap until the new library is common makes sense.
```js
// Extract columnar data
function columnar(data, name, type) {
  if (isArrowTable(data)) {
    const column = maybeTypedArrayify(data.getChild(name), type);
```
I think this should be calling `data.getChild(name).toArray()` here, or else `maybeTypedArrayify` will use the slowest iterable path rather than the more efficient `Vector.toArray` (which could be zero-copy if the Arrow table only has one chunk).
```js
if (Array.isArray(column) && String(data.schema?.fields?.find((d) => d.name === name)).endsWith("<MILLISECOND>"))
  column.find((d, i) => d != null && (column[i] = new Date(d)));
```
This assignment will have no effect if `type` is not `Array`, since assigning a value to a typed array will coerce it back to a number.
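A quick standalone illustration of that coercion (not code from the PR): assigning a Date into a typed array coerces it, via `valueOf`, back to its millisecond number, whereas a plain Array keeps the Date object.

```js
// Typed arrays store numbers only: the Date is coerced on assignment.
const typed = new Float64Array(1);
typed[0] = new Date(86400000); // stored as the number 86400000
typed[0] instanceof Date; // false

// A plain Array keeps the Date object intact.
const plain = new Array(1);
plain[0] = new Date(86400000);
plain[0] instanceof Date; // true
```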
I’m going to explore the alternative I described in #2030 (review).
Proposed alternative explored in #2115. Not quite functional yet, but the `arrowDates` test works at least. 🙂
continued in (and superseded by) #2115
Detect Arrow tables and use as much direct access to the columns as we can: first and foremost, by not materializing the data on mark.initialize, and by routing a string accessor to getChild.
We don't add apache-arrow as a dependency, which means detection is done with duck typing of the methods we use. (We could reinforce this a bit if needed, but I think that's fine.)
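A duck-typing check along those lines might look like this (a hypothetical sketch, not the PR's exact source; the assumed surface is `getChild`, `numRows`, and iterability):

```js
// Treat data as an Arrow table if it exposes the small surface we rely on.
// Hypothetical sketch; the real predicate may test different members.
function isArrowTable(data) {
  return (
    data != null &&
    typeof data.getChild === "function" &&
    typeof data.numRows === "number" &&
    typeof data[Symbol.iterator] === "function"
  );
}

isArrowTable([1, 2, 3]); // false: arrays have no getChild method
```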
The story is a bit complicated in the group transform (and maybe other places?) because we're actually making a new output dataset, which currently uses take and map to create a new dataset that "looks like" the original data array.
In the Arrow table case, we might want take to produce a "filtered" table, but I don't think that exists (API reference). Instead we return an array of Row objects (which are Proxy objects into the columns); that's probably best memory-wise, even though I don't like the looks of it. Anyway, they're easy to convert to regular objects by writing `({...d})`.

For a more thorough investigation of the places where we assume that values are arrays, I ran all the unit tests with the data replaced by an Arrow table (in mark and facet data). This resulted in 25 "changed" snapshots (see diff); all of them seem to be due only to dates changing during the conversion to Arrow. None of them were crashes.
I still need to investigate why the dates are modified. (I thought it might be because Arrow coerces them to `Date32<day>`; nope, they are `DateMillisecond`.)

closes #191
cc: @jheer