Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add arrow module in Py API #3216

Merged
merged 5 commits into from
Jan 10, 2023

Conversation

jmao-denver
Copy link
Contributor

@jmao-denver jmao-denver commented Dec 17, 2022

Enable conversion between DH tables and Arrow tables on the server side

@jmao-denver
Copy link
Contributor Author

jmao-denver commented Dec 18, 2022

The 'missing pyarrow' fixes should be rolled back once the base images are updated with 'pyarrow' preinstalled as with numpy/pandas etc.
@devinrsmith I am not sure if this is the right time to tackle the 'on-demand' dependency issue but it is my opinion that pyarrow should be treated as a hard dependency as numpy/pandas and be included in the base images.

@jmao-denver jmao-denver marked this pull request as ready for review December 18, 2022 05:53
@jmao-denver jmao-denver requested a review from chipkent December 18, 2022 17:07
Copy link
Member

@devinrsmith devinrsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can't merge as is - the --no-index is a "safety" check that ensures everything flows through the base images.

If this is necessary, let's plumb https://github.com/deephaven/deephaven-base-images.

pyarrow is not a small dependency - it's 20-35MB depending on architecture, and it has a dependency on numpy.

Note, pyarrow is an architecture specific deliverable - but it looks like it has good coverage w/ amd64 and arm64 across Linux, Mac, and Windows with python 3.7-3.11.

@jmao-denver
Copy link
Contributor Author

@nbauernfeind @lbooker42 has looked that the changes in Java and thought they are reasonable. Could you also review these changes?

Comment on lines +100 to +109
record_batches = pa_table.to_batches()
for rb in record_batches:
pa_buffer = rb.serialize()
j_barrage_table_builder.addRecordBatch(pa_buffer.to_pybytes())
j_barrage_table_builder.onCompleted()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but it ends up being excessively complex and inefficient

Unfortunately we do not use Arrow Buffers internal to DHC. The alternative implementation that avoids copying data twice requires building column source wrappers that are backed by Arrow's native buffers. If we want to support converting a DH native table (e.g. converting a ticking table to PyArrow's format) then we must do the copying. Note that our column sources support all sorts of fancy features like redirecting to yield a sorted table.

Our internal column source organization is specifically what makes pulling py arrow data into DH an expensive operation (as well as vice versa).

@jcferretti
Copy link
Member

From the point of view of the question "is there overlap/work with the C++ wrapping / python client", I think the answer is not directly, and perhaps a way to understand that question would be to ask "should we worry about API consistency from the point of view of exposing arrow calls in python-server, python-client and C++-client?
I think that is a fair concern and we can look into that but my guess is right now is not the best time to do that, we have an open question on the C++ wrapping side about exactly how arrow is going to be exposed for python when we integrate both C++ libraries that depend on arrow code and python code that depends on pyarrow. Without having clarity on the implementation level potential restrictions there, we can't (shouldn't?) say much now I think.
So I think is better to move this PR forward as is and perhaps open a separate ticket to check arrow API consistency between py-server, py-client and C++-client later.

@jmao-denver
Copy link
Contributor Author

jmao-denver commented Dec 21, 2022

#3223 is created to track the lack of support of a few Arrow types that are supposedly supported in DHC but don't seem to work in the unit tests. @nbauernfeind could you take a look at the mapping and confirm that these arrow types are indeed supported by the Barrage service now?

nbauernfeind
nbauernfeind previously approved these changes Jan 3, 2023
Copy link
Member

@nbauernfeind nbauernfeind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM if looks good to @chipkent

Comment on lines +100 to +109
record_batches = pa_table.to_batches()
for rb in record_batches:
pa_buffer = rb.serialize()
j_barrage_table_builder.addRecordBatch(pa_buffer.to_pybytes())
j_barrage_table_builder.onCompleted()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like Arrow->DH necessarily creates a static DH table. In that case, we should have column source support for Arrow buffers. That is certainly a follow-on project, but the inefficiency here isn't great. At a minimum, we should have a ticket for this effort. The copy will certainly make some users unhappy at some point.

chipkent
chipkent previously approved these changes Jan 10, 2023
In addition to one test case to show UTC based timestamp works
@jmao-denver jmao-denver dismissed stale reviews from chipkent and nbauernfeind via 5179b24 January 10, 2023 18:48
@jmao-denver jmao-denver dismissed devinrsmith’s stale review January 10, 2023 21:33

All the concerned have been addressed.

Copy link
Member

@devinrsmith devinrsmith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docker / setup.py additions look correct to me; haven't reviewed the arrow-specific parts.

@jmao-denver jmao-denver merged commit ad3532c into deephaven:main Jan 10, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Jan 10, 2023
@deephaven-internal
Copy link
Contributor

Labels indicate documentation is required. Issues for documentation have been opened:

How-to: https://github.com/deephaven/deephaven.io/issues/2030
Conceptual: https://github.com/deephaven/deephaven.io/issues/2032
Reference: https://github.com/deephaven/deephaven.io/issues/2031
Blog: https://github.com/deephaven/deephaven.io/issues/2029

@jmao-denver jmao-denver deleted the 3187-arrow-module-py branch March 31, 2023 03:27
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants