Release v24.06.00 · rapidsai/cudf

🚨 Breaking Changes

Deprecate Groupby.collect (#15808) @galipremsagar
Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
Support filtered I/O in chunked_parquet_reader and simplify the use of parquet_reader_options (#15764) @mhaseeb123
Raise errors for unsupported operations on certain types (#15712) @galipremsagar
Support DurationType in cudf parquet reader via arrow:schema (#15617) @mhaseeb123
Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
Remove legacy JSON reader from Python (#15538) @bdice
Removing all batching code from parquet writer (#15528) @mhaseeb123
Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
Remove deprecated strings offsets_begin (#15454) @davidwendt
Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
[FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
Align date_range defaults with pandas, support tz (#15139) @mroeschke

🐛 Bug Fixes

Revert "Fix docs for IO readers and strings_convert" (#15872) @vyasr
Remove problematic call of index setter to unblock dask-cuda CI (#15844) @charlesbluca
Use rapids_cpm_nvtx3 to get same nvtx3 target state as rmm (#15840) @robertmaynard
Return boolean from config_host_memory_resource instead of throwing (#15815) @abellina
Add temporary dask-cudf workaround for categorical sorting (#15801) @rjzamora
Fix row group alignment in ORC writer (#15789) @vuule
Raise error when sorting by categorical column in dask-cudf (#15788) @rjzamora
Upgrade arrow to 16.1 (#15787) @galipremsagar
Add support for PandasArray for pandas<2.1.0 (#15786) @galipremsagar
Limit runtime dependency to libarrow>=16.0.0,<16.1.0a0 (#15782) @pentschev
Fix cat.as_ordered not propagating correct size (#15780) @mroeschke
Handle mixed-like homogeneous types in isin (#15771) @galipremsagar
Fix id_vars and value_vars not accepting string scalars in melt (#15765) @mroeschke
Fix DatetimeIndex.loc for all types of ordering cases (#15761) @galipremsagar
Fix arrow versioning logic (#15755) @vyasr
Avoid running sanitizer on Java test designed to cause an error (#15753) @jlowe
Handle empty dataframe object with index present in setitem of loc (#15752) @galipremsagar
Eliminate circular reference in DataFrame/Series.iloc/loc (#15749) @mroeschke
Cap the absolute row index per pass in parquet chunked reader. (#15735) @nvdbaranec
Fix Index.repeat for datetime64 types (#15722) @galipremsagar
Fix multibyte check for case convert for large strings (#15721) @davidwendt
Fix get_loc to properly fetch results from an index that is in decreasing order (#15719) @galipremsagar
Return same type as the original index for .loc operations (#15717) @galipremsagar
Correct static builds + static arrow (#15715) @robertmaynard
Raise errors for unsupported operations on certain types (#15712) @galipremsagar
Fix ColumnAccessor caching of nrows if empty previously (#15710) @mroeschke
Allow None when nan_as_null=False in column constructor (#15709) @galipremsagar
Refine CudaTest.testCudaException in case throwing wrong type of CudaError under aarch64 (#15706) @sperlingxx
Fix maxima of categorical column (#15701) @rjzamora
Add proxy for inplace operations in cudf.pandas (#15695) @galipremsagar
Make nan_as_null behavior consistent across all APIs (#15692) @galipremsagar
Fix CI s3 api command to fetch latest results (#15687) @galipremsagar
Add NumpyExtensionArray proxy type in cudf.pandas (#15686) @galipremsagar
Properly implement binaryops for proxy types (#15684) @galipremsagar
Fix copy assignment and the comparison operator of rmm_host_allocator (#15677) @vuule
Fix multi-source reading in JSON byte range reader (#15671) @shrshi
Return int64 when pandas compatible mode is turned on for get_indexer (#15659) @galipremsagar
Fix Index contains for error validations and float vs int comparisons (#15657) @galipremsagar
Preserve sub-second data for time scalars in column construction (#15655) @galipremsagar
Check row limit size in cudf::strings::join_strings (#15643) @davidwendt
Enable sorting on column with nulls using query-planning (#15639) @rjzamora
Fix operator precedence problem in Parquet reader (#15638) @etseidl
Fix decoding of dictionary encoded FIXED_LEN_BYTE_ARRAY data in Parquet reader (#15601) @etseidl
Fix debug warnings/errors in from_arrow_device_test.cpp (#15596) @davidwendt
Add "collect" aggregation support to dask-cudf (#15593) @rjzamora
Fix categorical-accessor support and testing in dask-cudf (#15591) @rjzamora
Disable compute-sanitizer usage in CI tests with CUDA<11.6 (#15584) @davidwendt
Preserve RangeIndex.step in to_arrow/from_arrow (#15581) @mroeschke
Ignore new cupy warning (#15574) @vyasr
Add cuda-sanitizer-api dependency for test-cpp matrix 11.4 (#15573) @davidwendt
Allow apply udf to reference global modules in cudf.pandas (#15569) @mroeschke
Fix deprecation warnings for json legacy reader (#15563) @davidwendt
Fix millisecond resampling in cudf Python (#15560) @mroeschke
Rename JSON_READER_OPTION to JSON_READER_OPTION_NVBENCH. (#15553) @bdice
Fix a JNI bug in JSON parsing fixup (#15550) @revans2
Remove conda channel setup from wheel CI image script. (#15539) @bdice
cudf.pandas: Series dt accessor is CombinedDatetimelikeProperties (#15523) @wence-
Fix for some compiler warnings in parquet/page_decode.cuh (#15518) @etseidl
Fix exponent overflow in strings-to-double conversion (#15517) @davidwendt
nanoarrow uses package override for proper pinned versions generation (#15515) @robertmaynard
Remove index name overrides in dask-cudf pyarrow table dispatch (#15514) @charlesbluca
Fix async synchronization issues in json_column.cu (#15497) @karthikeyann
Add new patch to hide more CCCL APIs (#15493) @vyasr
Make improvements in pandas-test reporting (#15485) @galipremsagar
Fixed page data truncation in parquet writer under certain conditions. (#15474) @nvdbaranec
Only use data_type constructor with scale for decimal types (#15472) @wence-
Avoid "p2p" shuffle as a default when dask_cudf is imported (#15469) @rjzamora
Fix debug build errors from to_arrow_device_test.cpp (#15463) @davidwendt
Fix base_normalator::integer_sizeof_fn integer dispatch (#15457) @davidwendt
Allow consumers of static builds to find nanoarrow (#15456) @robertmaynard
Allow jit compilation when using a splayed CUDA toolkit (#15451) @robertmaynard
Handle case of scan aggregation in groupby-transform (#15450) @wence-
Test static builds in CI and fix nanoarrow configure (#15437) @vyasr
Fixes potential race in JSON parser when parsing JSON lines format and when recovering from invalid lines (#15419) @elstehle
Fix errors in chunked ORC writer when no tables were (successfully) written (#15393) @vuule
Support implicit array conversion with query-planning enabled (#15378) @rjzamora
Fix arrow-based round trip of empty dataframes (#15373) @wence-
Remove empty elements from exploded character-ngrams output (#15371) @davidwendt
Remove boundscheck=False setting in cython files (#15362) @wence-
Patch dask-expr var logic in dask-cudf (#15347) @rjzamora
Fix for logical and syntactical errors in libcudf c++ examples (#15346) @mhaseeb123
Disable dask-expr in docs builds. (#15343) @bdice
Apply the cuFile error work around to data_sink as well (#15335) @vuule
Fix parquet predicate filtering with column projection (#15113) @karthikeyann
Check column type equality, handling nested types correctly. (#14531) @bdice

📖 Documentation

Fix docs for IO readers and strings_convert (#15842) @bdice
Update cudf.pandas docs for GA (#15744) @beckernick
Add contributing warning about circular imports (#15691) @er-eis
Update libcudf developer guide for strings offsets column (#15661) @davidwendt
Update developer guide with device_async_resource_ref guidelines (#15562) @harrism
DOC: add pandas intersphinx mapping (#15531) @raybellwaves
rm-dup-doc in frame.py (#15530) @raybellwaves
Update CONTRIBUTING.md to use latest cuda env (#15467) @raybellwaves
Doc: interleave columns pandas compat (#15383) @raybellwaves
Simplified README Examples (#15338) @wkaisertexas
Add debug tips section to libcudf developer guide (#15329) @davidwendt
Fix and clarify notes on result ordering (#13255) @shwina

🚀 New Features

Add JNI bindings for zstd compression of NVCOMP. (#15729) @firestarman
Fix spaces around CSV quoted strings (#15727) @thabetx
Add default pinned pool that falls back to new pinned allocations (#15665) @vuule
Overhaul ops-codeowners coverage (#15660) @raydouglass
Concatenate dictionary of objects along axis=1 (#15623) @er-eis
Construct pylibcudf columns from objects supporting __cuda_array_interface__ (#15615) @brandon-b-miller
Expose some Parquet per-column configuration options via the python API (#15613) @etseidl
Migrate string find operations to pylibcudf (#15604) @brandon-b-miller
Round trip FIXED_LEN_BYTE_ARRAY data properly in Parquet writer (#15600) @etseidl
Reading multi-line JSON in string columns using runtime configurable delimiter (#15556) @shrshi
Remove public gtest dependency from libcudf conda package (#15534) @robertmaynard
Fea/move to latest nanoarrow (#15526) @robertmaynard
Migrate string case operations to pylibcudf (#15489) @brandon-b-miller
Add Parquet encoding statistics to column chunk metadata (#15452) @etseidl
Implement JNI for chunked ORC reader (#15446) @ttnghia
Add some missing optional fields to the Parquet RowGroup metadata (#15421) @etseidl
Adding parquet transcoding example (#15420) @mhaseeb123
Add fields to Parquet Statistics structure that were added in parquet-format 2.10 (#15412) @etseidl
Add option to Parquet writer to skip compressing individual columns (#15411) @etseidl
Add BYTE_STREAM_SPLIT support to Parquet (#15311) @etseidl
Introduce benchmark suite for JSON reader options (#15124) @shrshi
Implement ORC chunked reader (#15094) @ttnghia
Extend cudf devcontainers to specify jitify2 kernel cache (#15068) @robertmaynard
Add to_arrow_device function to cudf interop using nanoarrow (#15047) @zeroshade
Add JSON option to prune columns (#14996) @karthikeyann

🛠️ Improvements

Deprecate Groupby.collect (#15808) @galipremsagar
Raise FileNotFoundError when a literal JSON string that looks like a json filename is passed (#15806) @lithomas1
Deprecate divisions='quantile' support in set_index (#15804) @rjzamora
Improve performance of Series.to_numpy/to_cupy (#15792) @mroeschke
Access self.index instead of self._index where possible (#15781) @mroeschke
Support filtered I/O in chunked_parquet_reader and simplify the use of parquet_reader_options (#15764) @mhaseeb123
Avoid index-to-column conversion in some DataFrame ops (#15763) @mroeschke
Fix chunked_parquet_reader behavior when input has no more rows to read (#15757) @mhaseeb123
[JNI] Expose java API for cudf::io::config_host_memory_resource (#15745) @abellina
Migrate all cpp pxd files into pylibcudf (#15740) @vyasr
Validate and materialize iterators earlier in as_column (#15739) @mroeschke
Push some as_column arrow logic to ColumnBase.from_arrow (#15738) @mroeschke
Expose stream parameter in public reduction APIs (#15737) @srinivasyadav18
remove unnecessary 'setuptools' host dependency, simplify dependencies.yaml (#15736) @jameslamb
Defer to C++ equality and hashing for pylibcudf DataType and Aggregation objects (#15732) @wence-
Implement null-aware NOT_EQUALS binop (#15731) @wence-
Fix split-record result list column offset type (#15707) @davidwendt
Upgrade arrow to 16 (#15703) @galipremsagar
Remove experimental namespace from make_strings_children (#15702) @davidwendt
Rework get_json_object benchmark to use nvbench (#15698) @davidwendt
Rework some python tests of Parquet delta encodings (#15693) @etseidl
Skeleton cudf polars package (#15688) @wence-
Upgrade pre commit hooks (#15685) @wence-
Allow fillna to validate for CategoricalColumn.fillna (#15683) @galipremsagar
Misc Column cleanups (#15682) @mroeschke
Reducing runtime of JSON reader options benchmark (#15681) @shrshi
Add Timestamp and Timedelta proxy types (#15680) @galipremsagar
Remove host_parse_nested_json. (#15674) @bdice
Reduce runtime for ParquetChunkedReaderInputLimitTest gtests (#15672) @davidwendt
Add large-strings gtest for cudf::interleave_columns (#15669) @davidwendt
Use experimental make_strings_children for multi-replace_re (#15667) @davidwendt
Enabled Holiday types in cudf.pandas (#15664) @galipremsagar
Remove obsolete XFAIL markers for query-planning (#15662) @rjzamora
Clean up join benchmarks (#15644) @PointKernel
Enable warnings as errors in custreamz (#15642) @mroeschke
Improve distinct join with set retrieve (#15636) @PointKernel
Fix -Werror=type-limits. (#15635) @bdice
Enable FutureWarnings/DeprecationWarnings as errors for dask_cudf (#15634) @mroeschke
Remove NVBench SHA override. (#15633) @alliepiper
Add support for large string columns to Parquet reader and writer (#15632) @etseidl
Large strings support in MD5 and SHA hashers (#15631) @davidwendt
Fix make_offsets_child_column usage in cudf::strings::detail::shift (#15630) @davidwendt
Use experimental make_strings_children for strings convert (#15629) @davidwendt
Forward-merge branch-24.04 to branch-24.06 (#15627) @bdice
Avoid accessing attributes via _column if not needed (#15624) @mroeschke
Make ColumnBase.cuda_array_interface opt out instead of opt in (#15622) @mroeschke
Large strings support for cudf::gather (#15621) @davidwendt
Remove jni-docker-build workflow (#15619) @bdice
Support DurationType in cudf parquet reader via arrow:schema (#15617) @mhaseeb123
Drop CentOS 7 support (#15608) @NvTimLiu
Use experimental make_strings_children for json/csv writers (#15599) @davidwendt
Use experimental make_strings_children for strings join/url_encode/slice (#15598) @davidwendt
Use experimental make_strings_children in nvtext APIs (#15595) @davidwendt
Migrate to {{ stdlib("c") }} (#15594) @hcho3
Deprecate to/from_dask_dataframe APIs in dask-cudf (#15592) @rjzamora
Minor fixups for future NumPy 2 compatibility (#15590) @seberg
Delay materializing RangeIndex in .reset_index (#15588) @mroeschke
Use experimental make_strings_children for capitalize/case/pad functions (#15587) @davidwendt
Use experimental make_strings_children for strings replace/filter/translate (#15586) @davidwendt
Add multithreaded parquet reader benchmarks. (#15585) @nvdbaranec
Don't materialize column during RangeIndex methods (#15582) @mroeschke
Improve performance for cudf::strings::count_re (#15578) @davidwendt
Replace RangeIndex._start/_stop/_step with _range (#15576) @mroeschke
add --rm and --name to devcontainer run args (#15572) @trxcllnt
Change the default dictionary policy in Parquet writer from ALWAYS to ADAPTIVE (#15570) @mhaseeb123
Rename experimental JSON tests. (#15568) @bdice
Refactor JNI native dependency loading to allow returning of library path (#15566) @jlowe
Remove protobuf and use parsed ORC statistics from libcudf (#15564) @bdice
Deprecate legacy JSON reader options. (#15558) @bdice
Use same .clang-format in cuDF JNI (#15557) @bdice
Large strings support for cudf::fill (#15555) @davidwendt
Upgrade upper bound pinning to pandas-2.2.2 (#15554) @galipremsagar
Work around issues with cccl main (#15552) @miscco
Enable pandas plotting unit tests for cudf.pandas (#15547) @mroeschke
Move timezone conversion logic to DatetimeColumn (#15545) @mroeschke
Large strings support for cudf::interleave_columns (#15544) @davidwendt
[skip ci] Switch back to 24.06 branch for pandas tests (#15543) @galipremsagar
Remove checks dependency from static-configure test job. (#15542) @bdice
Remove legacy JSON reader from Python (#15538) @bdice
Enable more ignored pandas unit tests for cudf.pandas (#15535) @mroeschke
Large strings support for cudf::clamp (#15533) @davidwendt
Remove version hard-coding (#15529) @galipremsagar
Removing all batching code from parquet writer (#15528) @mhaseeb123
Make some private class properties not settable (#15527) @mroeschke
Large strings support in regex replace APIs (#15524) @davidwendt
Skip pandas unit tests that crash pytest workers in cudf.pandas (#15521) @mroeschke
Preserve column metadata during more DataFrame operations (#15519) @mroeschke
Move pandas-tests to a dedicated workflow file and trigger it from branch.yaml (#15516) @galipremsagar
Large strings gtest fixture and utilities (#15513) @davidwendt
Convert libcudf resource parameters to rmm::device_async_resource_ref (#15507) @harrism
Relax protobuf lower bound to 3.20. (#15506) @bdice
Clean up index methods (#15496) @mroeschke
Update strings contains benchmarks to nvbench (#15495) @davidwendt
Update NVBench fixture to use new hooks, fix pinned memory segfault. (#15492) @alliepiper
Enable tests/scalar and tests/series in cudf.pandas tests (#15486) @mroeschke
Clean up cuda_array_interface handling in as_column (#15477) @mroeschke
Avoid .ordered and .categories from being settable in CategoricalColumn and CategoricalDtype (#15475) @mroeschke
Ignore pandas tests for cudf.pandas that need motoserver (#15468) @mroeschke
Use cached_property for NumericColumn.nan_count instead of ._nan_count variable (#15466) @mroeschke
Add to_arrow_device() functions that accept views (#15465) @davidwendt
Add custom status check workflow (#15464) @galipremsagar
Disable pandas 2.x clipboard tests in cudf.pandas tests (#15462) @mroeschke
Enable tests/strings/test_api.py and tests/io/pytables in cudf.pandas tests (#15461) @mroeschke
Enable test_parsing in cudf.pandas tests (#15460) @mroeschke
Add from_arrow_device function to cudf interop using nanoarrow (#15458) @zeroshade
Remove deprecated strings offsets_begin (#15454) @davidwendt
Enable tests/windows/ in cudf.pandas tests (#15444) @mroeschke
Enable tests/interchange/test_impl.py in cudf.pandas tests (#15443) @mroeschke
Enable tests/io/test_user_agent.py in cudf.pandas tests (#15442) @mroeschke
Performance improvement in libcudf case conversion for long strings (#15441) @davidwendt
Remove prior test skipping in run-pandas-tests with testing 2.2.1 (#15440) @mroeschke
Support orc and text IO with dask-expr using legacy conversion (#15439) @rjzamora
Floating <--> fixed-point conversion must now be called explicitly (#15438) @pmattione-nvidia
Unify Copy-On-Write and Spilling (#15436) @madsbk
Enable dask_cudf json and s3 tests with query-planning on (#15408) @rjzamora
Bump ruff and codespell pre-commit checks (#15407) @mroeschke
Enable all tests for arm arch (#15402) @galipremsagar
Bind read_parquet_metadata API to libcudf instead of pyarrow and extract RowGroup information (#15398) @mhaseeb123
Optimizing multi-source byte range reading in JSON reader (#15396) @shrshi
add correct labels to pandas_function_request.md (#15381) @raybellwaves
Remove deprecated hash() and spark_murmurhash3_x86_32() (#15375) @davidwendt
Large strings support in cudf::merge (#15374) @davidwendt
Enable test-reporting for pandas pytests in CI (#15369) @galipremsagar
Use logical types in Parquet reader (#15365) @etseidl
Add experimental make_strings_children utility (#15363) @davidwendt
Forward-merge branch-24.04 to branch-24.06 (#15349) @bdice
Fix CMake files in libcudf C++ examples to use existing libcudf build if present (#15348) @mhaseeb123
Use ruff pydocstyle over pydocstyle pre-commit hook (#15345) @mroeschke
Refactor stream mode setup for gtests (#15337) @davidwendt
Benchmark decimal <--> floating conversions. (#15334) @pmattione-nvidia
Avoid duplicate dask-cudf testing (#15333) @rjzamora
Skip decode steps in Parquet reader when nullable columns have no nulls (#15332) @etseidl
Update udf_cpp to use rapids_cpm_cccl. (#15331) @bdice
Forward-merge branch-24.04 into branch-24.06 [skip ci] (#15330) @rapids-bot[bot]
Allow numeric_only=True for simple groupby reductions (#15326) @rjzamora
Drop CentOS 7 support. (#15323) @bdice
Rework cudf::find_and_replace_all to use gather-based make_strings_column (#15305) @davidwendt
First pass at adding testing for pylibcudf (#15300) @vyasr
[FEA] Performance improvement for mixed left semi/anti join (#15288) @tgujar
Rework cudf::replace_nulls to use strings::detail::copy_if_else (#15286) @davidwendt
Clean up special casing in as_column for non-typed input (#15276) @mroeschke
Large strings support in cudf::concatenate (#15195) @davidwendt
Use less _is_categorical_dtype (#15148) @mroeschke
Align date_range defaults with pandas, support tz (#15139) @mroeschke
ModuleAccelerator performance: cache the result of checking if a caller is in the denylist (#15056) @shwina
Use offsetalator in cudf::strings::replace functions (#14824) @davidwendt
Cleanup some timedelta/datetime column logic (#14715) @mroeschke
Refactor numpy array input in as_column (#14651) @mroeschke
Refactor joins for conditional semis and antis (#14646) @DanialJavady96
Eagerly populate the class dict for cudf.pandas proxy types (#14534) @shwina
Some additional kernel thread index refactoring. (#14107) @bdice
