-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add arrow module in Py API #3216
Conversation
04a1adf
to
5255c86
Compare
The 'missing pyarrow' fixes should be rolled back once the base images are updated with 'pyarrow' preinstalled as with numpy/pandas etc. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can't merge as is - the --no-index
is a "safety" check that ensures everything flows through the base images.
If this is necessary, let's plumb https://github.com/deephaven/deephaven-base-images.
pyarrow is not a small dependency - it's 20-35MB depending on architecture, and it has a dependency on numpy.
Note, pyarrow is an architecture specific deliverable - but it looks like it has good coverage w/ amd64 and arm64 across Linux, Mac, and Windows with python 3.7-3.11.
server/src/main/java/io/deephaven/server/barrage/BarrageStreamGenerator.java
Outdated
Show resolved
Hide resolved
server/src/main/java/io/deephaven/server/barrage/BarrageStreamGenerator.java
Outdated
Show resolved
Hide resolved
@nbauernfeind @lbooker42 has looked that the changes in Java and thought they are reasonable. Could you also review these changes? |
record_batches = pa_table.to_batches() | ||
for rb in record_batches: | ||
pa_buffer = rb.serialize() | ||
j_barrage_table_builder.addRecordBatch(pa_buffer.to_pybytes()) | ||
j_barrage_table_builder.onCompleted() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but it ends up being excessively complex and inefficient
Unfortunately we do not use Arrow Buffers internal to DHC. The alternative implementation that avoids copying data twice requires building column source wrappers that are backed by Arrow's native buffers. If we want to support converting a DH native table (e.g. converting a ticking table to PyArrow's format) then we must do the copying. Note that our column sources support all sorts of fancy features like redirecting to yield a sorted table.
Our internal column source organization is specifically what makes pulling py arrow data into DH an expensive operation (as well as vice versa).
1cd87b1
to
76e9ae2
Compare
From the point of view of the question "is there overlap/work with the C++ wrapping / python client", I think the answer is not directly, and perhaps a way to understand that question would be to ask "should we worry about API consistency from the point of view of exposing arrow calls in python-server, python-client and C++-client? |
#3223 is created to track the lack of support of a few Arrow types that are supposedly supported in DHC but don't seem to work in the unit tests. @nbauernfeind could you take a look at the mapping and confirm that these arrow types are indeed supported by the Barrage service now? |
7d04398
to
bff6733
Compare
11cf4bc
to
6cf8068
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM if looks good to @chipkent
record_batches = pa_table.to_batches() | ||
for rb in record_batches: | ||
pa_buffer = rb.serialize() | ||
j_barrage_table_builder.addRecordBatch(pa_buffer.to_pybytes()) | ||
j_barrage_table_builder.onCompleted() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like Arrow->DH necessarily creates a static DH table. In that case, we should have column source support for Arrow buffers. That is certainly a follow-on project, but the inefficiency here isn't great. At a minimum, we should have a ticket for this effort. The copy will certainly make some users unhappy at some point.
In addition to one test case to show UTC based timestamp works
5179b24
All the concerned have been addressed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docker / setup.py additions look correct to me; haven't reviewed the arrow-specific parts.
Labels indicate documentation is required. Issues for documentation have been opened: How-to: https://github.com/deephaven/deephaven.io/issues/2030 |
Enable conversion between DH tables and Arrow tables on the server side