Rework npy copy to integrate with query processor pipeline #1734
Conversation
Note that there's still a bug in the PR: reading large (>2048 rows), multidimensional (e.g., a column of type INT32[10]) .npy files causes an error. I'm currently working on a fix, but if merging this PR is urgent to integrate with storage changes for the next release, I propose removing the failing test (which I've already done) and creating an issue to fix it soon. Happy to discuss this further.
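The failing case can be reproduced outside the database with NumPy. Below is a hypothetical reproduction input (the 3000×10 shape and file name are made up; any int32 array with more than 2048 rows and a second dimension matches the description above):

```python
import os
import tempfile

import numpy as np

# Hypothetical input matching the failing case: >2048 rows of an
# INT32[10]-shaped column, saved in .npy format.
arr = np.arange(3000 * 10, dtype=np.int32).reshape(3000, 10)

path = os.path.join(tempfile.mkdtemp(), "repro.npy")
np.save(path, arr)

# Sanity-check that the file round-trips correctly in NumPy itself,
# so any error on the database side is in its .npy reader.
loaded = np.load(path)
assert loaded.shape == (3000, 10)
assert loaded.dtype == np.int32
assert np.array_equal(loaded, arr)
```

A file generated this way can then be fed to the NPY copy path to confirm whether the error still occurs.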
@aziz-mu Let's wait until the bug is fixed. The main use case of NPY copy is currently to handle large, high-dimensional data files for the PyG workload. If there is a bug reading large multidimensional files, this feature will not be very useful.
These should fix the failing tests.
Codecov Report
Patch coverage:

@@            Coverage Diff             @@
##           master    #1734      +/-   ##
==========================================
+ Coverage   90.92%   91.05%   +0.13%
==========================================
  Files         774      773        -1
  Lines       28371    28311      -60
==========================================
- Hits        25795    25779      -16
+ Misses       2576     2532      -44
==========================================

☔ View full report in Codecov by Sentry.
Force-pushed from 5c6599c to 13a4f59.
This PR implements #1670 by removing NPY-reading-specific classes that are no longer needed, implementing a read_npy operator to match the read_csv and read_parquet operators, and changing CopyNode so that copying can still be done column by column.
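The column-by-column, batched scan that a read_npy operator performs can be illustrated with NumPy's memory mapping. This is only a conceptual sketch of the idea (the function name, chunk size, and demo file are all made up, not Kùzu's actual operator code):

```python
import os
import tempfile

import numpy as np

CHUNK_ROWS = 2048  # illustrative batch size, not the operator's actual value


def read_npy_in_chunks(path, chunk_rows=CHUNK_ROWS):
    """Yield row batches from a .npy file without loading it whole,
    mimicking how a scan operator might feed a query pipeline."""
    mm = np.load(path, mmap_mode="r")  # header is parsed, data stays on disk
    for start in range(0, mm.shape[0], chunk_rows):
        # Materialize the slice so downstream operators own their batch.
        yield np.array(mm[start:start + chunk_rows])


# Demo: a 5000x4 int32 array streams out in 3 batches (2048 + 2048 + 904).
path = os.path.join(tempfile.mkdtemp(), "demo.npy")
np.save(path, np.ones((5000, 4), dtype=np.int32))
batches = list(read_npy_in_chunks(path))
assert len(batches) == 3
assert sum(b.shape[0] for b in batches) == 5000
```

Memory-mapping keeps peak memory bounded by the batch size rather than the file size, which is the property that matters for the large PyG-style inputs mentioned above.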