Rework npy copy to integrate with query processor pipeline #1734
Conversation
Note that there's still a bug in the PR: reading large (>2048 rows), multidimensional (e.g., a column of type INT32[10]) .npy files causes an error. I'm currently working on a fix, but if merging this PR is urgent to integrate with storage changes for the next release, I propose removing the failing test (which I've already done) and creating an issue to fix it soon. Happy to discuss this further.
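The failing case can be reproduced outside the database with NumPy. Below is a hypothetical reproduction input (the 3000×10 shape and file name are made up; any int32 array with more than 2048 rows and a second dimension matches the description above):

```python
import os
import tempfile

import numpy as np

# Hypothetical input matching the failing case: >2048 rows of an
# INT32[10]-shaped column, saved in .npy format.
arr = np.arange(3000 * 10, dtype=np.int32).reshape(3000, 10)

path = os.path.join(tempfile.mkdtemp(), "repro.npy")
np.save(path, arr)

# Sanity-check that the file round-trips correctly in NumPy itself,
# so any error on the database side is in its .npy reader.
loaded = np.load(path)
assert loaded.shape == (3000, 10)
assert loaded.dtype == np.int32
assert np.array_equal(loaded, arr)
```

A file generated this way can then be fed to the NPY copy path to confirm whether the error still occurs.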
@aziz-mu Let's wait until the bug is fixed. The main use case of NPY copy is currently to handle large, high-dimensional data files for the PyG workload. If there is a bug reading large multidimensional files, this feature will not be very useful.
These should fix the failing tests.
Codecov Report
Patch coverage:

@@            Coverage Diff             @@
##           master    #1734      +/-   ##
==========================================
+ Coverage   90.92%   91.05%   +0.13%
==========================================
  Files         774      773        -1
  Lines       28371    28311      -60
==========================================
- Hits        25795    25779      -16
+ Misses       2576     2532      -44
==========================================

☔ View full report in Codecov by Sentry.
Force-pushed from 5c6599c to 13a4f59.
This PR implements #1670 by removing NPY-reading-specific classes that are no longer needed, implementing a read_npy operator to match the read_csv and read_parquet operators, and changing CopyNode so that copying can still be done column by column.
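The column-by-column, batched scan that a read_npy operator performs can be illustrated with NumPy's memory mapping. This is only a conceptual sketch of the idea (the function name, chunk size, and demo file are all made up, not Kùzu's actual operator code):

```python
import os
import tempfile

import numpy as np

CHUNK_ROWS = 2048  # illustrative batch size, not the operator's actual value


def read_npy_in_chunks(path, chunk_rows=CHUNK_ROWS):
    """Yield row batches from a .npy file without loading it whole,
    mimicking how a scan operator might feed a query pipeline."""
    mm = np.load(path, mmap_mode="r")  # header is parsed, data stays on disk
    for start in range(0, mm.shape[0], chunk_rows):
        # Materialize the slice so downstream operators own their batch.
        yield np.array(mm[start:start + chunk_rows])


# Demo: a 5000x4 int32 array streams out in 3 batches (2048 + 2048 + 904).
path = os.path.join(tempfile.mkdtemp(), "demo.npy")
np.save(path, np.ones((5000, 4), dtype=np.int32))
batches = list(read_npy_in_chunks(path))
assert len(batches) == 3
assert sum(b.shape[0] for b in batches) == 5000
```

Memory-mapping keeps peak memory bounded by the batch size rather than the file size, which is the property that matters for the large PyG-style inputs mentioned above.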