string scalar support in AST - proof of concept #6

Closed
wants to merge 74 commits into from

Commits on Mar 30, 2023

  1. 982af8a

Commits on Apr 4, 2023

  1. 0a9eb86
  2. 50ee55d
  3. cleanup docs

    karthikeyann committed Apr 4, 2023
    9735d51
  4. 8653e61

Commits on Apr 5, 2023

  1. Fix OOB memory access in CSV reader when reading without NA values (r…

    …apidsai#13011)
    
    The CSV reader uses a trie to read fields with special values as nulls. Trie creation does not work correctly when there are no special values. This can happen when the NA filter is enabled but the default NA values are removed and the user does not specify custom values. In this case, using the trie leads to OOB memory access.
    This PR fixes trie creation so that an empty trie is created when there are no special values to look for.
    Included a C++ test that crashes without the fix.
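
A minimal Python sketch of the idea, not the libcudf implementation: a trie whose root always exists, so building it from an empty list of NA values still gives a structure that is safe to query. The names `Trie` and `parse_field` are hypothetical and only serve the illustration.

```python
class Trie:
    """Toy character trie used to recognise NA strings."""

    def __init__(self, words=()):
        # The root node is created unconditionally, so an empty word list
        # yields a valid, empty trie rather than an unusable structure.
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node[None] = True  # end-of-word marker

    def contains(self, text):
        node = self.root
        for ch in text:
            node = node.get(ch)
            if node is None:
                return False
        return None in node


def parse_field(field, na_trie):
    """Return None for a field that matches an NA value, else the field."""
    return None if na_trie.contains(field) else field
```

With an empty trie every lookup simply reports "no match", so `parse_field("NA", Trie())` returns the field unchanged.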
    
    Authors:
      - Vukasin Milovanovic (https://github.com/vuule)
    
    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Mike Wilson (https://github.com/hyperbolic2346)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: rapidsai#13011
    vuule committed Apr 5, 2023
    9a770f6
  2. Add except declaration in Cython interface for regex_program::create (r…

    …apidsai#13054)
    
    Add the `except +` declaration to the `cudf::strings::regex_program::create()` function in the Cython `regex_program.pxd` interface, since this call throws on invalid regex patterns. This allows the normal Cython exception handling to pass the exception to the Python logic without aborting the process.
    
    Closes rapidsai#13052
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
    
    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - Ashwin Srinath (https://github.com/shwina)
    
    URL: rapidsai#13054
    davidwendt committed Apr 5, 2023
    7a739ce
  3. 7268b5f
  4. Fix tests/identify_stream_usage.cpp (rapidsai#13066)

    The identify_stream_usage test uses `strcmp` but does not include `<cstring>`. This PR fixes that.
    
    The missing include was surfaced by rapidsai#13064, showing that the test relied on headers in `spdlog` to include `cstring`.
    
    Authors:
      - Allard Hendriksen (https://github.com/ahendriksen)
      - Bradley Dice (https://github.com/bdice)
    
    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: rapidsai#13066
    ahendriksen committed Apr 5, 2023
    da7fe2a
  5. Fix a dask-cudf error

    galipremsagar committed Apr 5, 2023
    6563440
  6. 54e7889
  7. 46a8016
  8. 1d95f75
  9. a3ed98a
  10. 5179b8e
  11. 241b560
  12. d1a0114
  13. Add algorithm include in data_sink.hpp (rapidsai#13068)

    `data_sink.hpp` uses `std::transform` but does not include `<algorithm>`. This PR fixes that.
    
    The missing include was surfaced by rapidsai#13064.
    
    Authors:
      - Allard Hendriksen (https://github.com/ahendriksen)
    
    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - David Wendt (https://github.com/davidwendt)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: rapidsai#13068
    ahendriksen committed Apr 5, 2023
    0cf8c91

Commits on Apr 6, 2023

  1. Merge pull request rapidsai#13070 from galipremsagar/pin_dask

    [REVIEW] Pin `dask` and `distributed` for release
    jolorunyomi committed Apr 6, 2023
    2c3b2ab
  2. Update join to use experimental row hasher and comparator (rapidsai…

    …#12787)
    
    Part of rapidsai#11844. I will create a separate PR for `mixed_join`.
    
    Compilation times:
    `main` rapidsai@94bbc82 : `16m47.513s`
    This PR rapidsai@5d75db8 : `16m47.520s`
    
    Benchmarks: rapidsai#12787 (comment)
    
    Authors:
      - Divye Gala (https://github.com/divyegala)
    
    Approvers:
      - Yunsong Wang (https://github.com/PointKernel)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: rapidsai#12787
    divyegala committed Apr 6, 2023
    d5aad2f
  3. 0f4ea41
  4. Merge pull request rapidsai#13080 from galipremsagar/branch-23.06-mer…

    …ge-23.04
    
    Resolved automerger from `branch-23.04` to `branch-23.06`
    raydouglass committed Apr 6, 2023
    d82f97c

Commits on Apr 7, 2023

  1. Adding hostdevice_span that is a span createable from `hostdevice_v…

    …ector` (rapidsai#12981)
    
    I ran into a need for a span-like view into a `hostdevice_vector`. I was chopping it up into pieces to pass into a function to process portions at a time, but it still wanted to do things like host to device on the spans. This class is a result of that need.
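
As a rough illustration of the concept only, here is a pure-Python sketch of a span-like view over a vector that keeps a host copy and a device copy. The "device" buffer is simulated with a list, and the class and method names (`HostDeviceVector`, `HostDeviceSpan`, `host_to_device`) are assumptions for this sketch, not the libcudf API.

```python
class HostDeviceVector:
    """Toy container with a host buffer and a simulated device buffer."""

    def __init__(self, values):
        self.host = list(values)
        self.device = [None] * len(self.host)  # stand-in for device memory

    def span(self, offset, size):
        if offset < 0 or size < 0 or offset + size > len(self.host):
            raise IndexError("span is out of range")
        return HostDeviceSpan(self, offset, size)


class HostDeviceSpan:
    """Non-owning view of a sub-range of a HostDeviceVector."""

    def __init__(self, owner, offset, size):
        self.owner = owner
        self.offset = offset
        self.size = size

    def host_to_device(self):
        # Copy only the viewed portion, leaving the rest untouched.
        end = self.offset + self.size
        self.owner.device[self.offset:end] = self.owner.host[self.offset:end]

    def device_to_host(self):
        end = self.offset + self.size
        self.owner.host[self.offset:end] = self.owner.device[self.offset:end]
```

A function that processes one portion at a time can then receive a span and copy just that portion, for example `vec.span(1, 2).host_to_device()`.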
    
    Authors:
      - Mike Wilson (https://github.com/hyperbolic2346)
      - Nghia Truong (https://github.com/ttnghia)
    
    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - Vukasin Milovanovic (https://github.com/vuule)
    
    URL: rapidsai#12981
    hyperbolic2346 committed Apr 7, 2023
    e28c9c5
  2. Fix column selection read_parquet benchmarks (rapidsai#13082)

    Helper function `get_col_names` in the Parquet reader benchmarks throws with nested columns. It should instead ignore the child columns and return the top-level column names.
    Also renamed the function to better reflect what it does.
    
    Authors:
      - Vukasin Milovanovic (https://github.com/vuule)
    
    Approvers:
      - https://github.com/nvdbaranec
      - Yunsong Wang (https://github.com/PointKernel)
    
    URL: rapidsai#13082
    vuule committed Apr 7, 2023
    46b5900
  3. Compute column sizes in Parquet preprocess with single kernel (rapids…

    …ai#12931)
    
    Addresses rapidsai#11922 
    
    Currently, in Parquet preprocessing, a `thrust::reduce()` and a `thrust::exclusive_scan_by_key()` are performed to compute the column size and offsets for each nested column. For complicated schemas this results in a large number of kernel invocations. This PR calculates the sizes and offsets of all columns in single calls to `thrust::reduce_by_key()` and `thrust::exclusive_scan_by_key()`.
    
    This change results in around 1.3x speedup when reading a complicated schema.
    Before:
    ![image](https://user-images.githubusercontent.com/26264495/224823213-ae998654-274c-450a-8ad7-ea854541335e.png)
    
    After:
    ![image](https://user-images.githubusercontent.com/26264495/224823108-cb91c380-5e35-4c77-a6f9-6703e321be05.png)
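
To make the two primitives concrete, the following pure-Python stand-ins mimic the semantics of a keyed reduction and a keyed exclusive scan on the CPU. They are a sketch under the assumption that each key identifies one column and each value is a per-page size; the function names mirror the Thrust names for readability but are plain Python, not Thrust bindings.

```python
from itertools import groupby


def reduce_by_key(keys, values):
    """Sum runs of consecutive values that share the same key."""
    out_keys, out_sums = [], []
    for key, group in groupby(zip(keys, values), key=lambda kv: kv[0]):
        out_keys.append(key)
        out_sums.append(sum(value for _, value in group))
    return out_keys, out_sums


def exclusive_scan_by_key(keys, values):
    """Exclusive prefix sum that restarts whenever the key changes."""
    out = []
    running, previous = 0, object()  # sentinel that equals no real key
    for key, value in zip(keys, values):
        if key != previous:
            running, previous = 0, key
        out.append(running)
        running += value
    return out


# Hypothetical data: column 0 has three pages, column 1 has two.
column_ids = [0, 0, 0, 1, 1]
page_sizes = [10, 20, 5, 7, 3]
sizes = reduce_by_key(column_ids, page_sizes)
offsets = exclusive_scan_by_key(column_ids, page_sizes)
```

One pass over all pages yields every column's total size and every page's offset within its column, which is the reason a single keyed call can replace one call per column.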
    
    Authors:
      - Srikar Vanavasam (https://github.com/SrikarVanavasam)
    
    Approvers:
      - Yunsong Wang (https://github.com/PointKernel)
      - Nghia Truong (https://github.com/ttnghia)
      - Vukasin Milovanovic (https://github.com/vuule)
    
    URL: rapidsai#12931
    SrikarVanavasam committed Apr 7, 2023
    f328b64
  4. Reduce shared memory usage in gpuComputePageSizes by 50% (rapidsai#13047)

    
    In a multithreaded, multi-stream environment (Spark) we were experiencing a performance regression on some benchmark queries. The culprit was GPU scheduling issues related to the `gpuComputePageSizes` kernel. Dependent kernels (`gpuDecodePages`) were getting serialized because `gpuComputePageSizes` wasn't running well alongside other streams.

    The fix was reducing shared memory usage in `gpuComputePageSizes`. The kernel shares a lot of code and data structures with `gpuDecodePages` but doesn't actually use several of the large buffers that are stored in shared memory. This PR refactors those buffers out so that they are only declared in the `gpuDecodePages` kernel, reducing the shared memory usage by 50% (3 KB).

    This clears up the performance issue on Spark. I am currently experiencing build issues with cudf benchmarks so I'm marking this as do-not-merge until I can verify with them.
    
    Authors:
      - https://github.com/nvdbaranec
      - Vukasin Milovanovic (https://github.com/vuule)
    
    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - Vukasin Milovanovic (https://github.com/vuule)
    
    URL: rapidsai#13047
    nvdbaranec committed Apr 7, 2023
    c4a34eb
  5. Add empty test files for test reorganization (rapidsai#12288)

    This PR adds empty test modules that match the "Test Organization" guidelines outlined in the [developer guide](https://github.com/rapidsai/cudf/blob/branch-23.02/docs/cudf/source/developer_guide/testing.md#test-organization). Follow-up PRs will move existing tests into these test modules. 
    
    While I have attempted to match the structure of our API reference as much as possible, there are small differences. For example, the API reference lumps together [Reshaping, Sorting, and Transposing](https://docs.rapids.ai/api/cudf/stable/api_docs/dataframe.html#reshaping-sorting-transposing), while I opted to include two different modules for reshaping and sorting.
    
    There are only a couple of instances where I needed to deviate from the structure though.
    
    Authors:
      - Ashwin Srinath (https://github.com/shwina)
    
    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    URL: rapidsai#12288
    shwina committed Apr 7, 2023
    5a703d0

Commits on Apr 8, 2023

  1. Raise NotImplementedError when attempting to construct cuDF objects…

    … from timezone-aware datetimes (rapidsai#13086)
    
    Closes rapidsai#13077
    
    Authors:
      - Ashwin Srinath (https://github.com/shwina)
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    URL: rapidsai#13086
    shwina committed Apr 8, 2023
    52e8b5e

Commits on Apr 10, 2023

  1. Remove deprecated regex functions from libcudf (rapidsai#13067)

    Removes the libcudf regex APIs that were deprecated in 23.04. All calls to these functions within the repo had already been removed.
    Marking this breaking since APIs are being removed.
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
    
    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - Bradley Dice (https://github.com/bdice)
    
    URL: rapidsai#13067
    davidwendt committed Apr 10, 2023
    ebe4757
  2. Fix unused variable error/warning in page_data.cu (rapidsai#13093)

    Fixes minor compile errors/warnings for unused variables in the `cpp/src/io/parquet/page_data.cu` source file.
    
    ```
    cudf/cpp/src/io/parquet/page_data.cu -o CMakeFiles/cudf.dir/src/io/parquet/page_data.cu.o
    /cudf/cpp/src/io/parquet/page_data.cu(636): error #177-D: parameter "s" was declared but never referenced
    
    /cudf/cpp/src/io/parquet/page_data.cu(343): error #177-D: parameter "sb" was declared but never referenced
              detected during instantiation of "cuda::std::__4::pair<int, int> cudf::io::parquet::gpu::<unnamed>::gpuDecodeDictionaryIndices<sizes_only>(volatile cudf::io::parquet::gpu::<unnamed>::page_state_s *, volatile cudf::io::parquet::gpu::<unnamed>::page_state_buffers_s *, int, int) [with sizes_only=true]" 
    (1720): here
    
    /cudf/cpp/src/io/parquet/page_data.cu(527): error #177-D: parameter "sb" was declared but never referenced
              detected during instantiation of "cudf::size_type cudf::io::parquet::gpu::<unnamed>::gpuInitStringDescriptors<sizes_only>(volatile cudf::io::parquet::gpu::<unnamed>::page_state_s *, volatile cudf::io::parquet::gpu::<unnamed>::page_state_buffers_s *, int, int) [with sizes_only=true]" 
    (1724): here
    
    3 errors detected in the compilation of "/cudf/cpp/src/io/parquet/page_data.cu".
    
    ```
    
    Found these with a Debug build using nvcc 11.5 and gcc 9.5.
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
    
    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - Divye Gala (https://github.com/divyegala)
      - Vukasin Milovanovic (https://github.com/vuule)
      - Karthikeyan (https://github.com/karthikeyann)
    
    URL: rapidsai#13093
    davidwendt committed Apr 10, 2023
    cf26353
  3. Fix missing confluent kafka version (rapidsai#13101)

    This PR fixes missing `python-confluent-kafka` version changes in `custreamz` and removes `python-confluent-kafka` from `cudf_kafka` because I don't see any usage of `confluent_kafka` in the python code.
    
    Authors:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
      - Bradley Dice (https://github.com/bdice)
    
    Approvers:
      - Ray Douglass (https://github.com/raydouglass)
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Bradley Dice (https://github.com/bdice)
    
    URL: rapidsai#13101
    galipremsagar committed Apr 10, 2023
    5e41c1f
  4. 14214db
  5. Remove uses-setup-env-vars (rapidsai#13105)

    This setting now matches the default behavior of the shared-action-workflows repo
    
    Authors:
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    Approvers:
      - AJ Schmidt (https://github.com/ajschmidt8)
    
    URL: rapidsai#13105
    vyasr committed Apr 10, 2023
    8c9a8c4
  6. Support structs of lists in row lexicographic comparator (rapidsai#13005)

    
    This fixes the lexicographic comparator, which could not handle input having structs of lists. The new implementation mainly changes the helper function `decompose_structs`. In particular:
     * If the first child of a structs column is a lists column, the first column of the result table will no longer be `Struct<Struct<...<List<SomeType>...>` (i.e., nested structs ultimately having one child).
     * Instead, the first output column will be nested empty structs: `Struct<...Struct<>>...>`. The innermost child column `List<SomeType>` is output as the second column in the result table.
    
    Depends on:
     * rapidsai#12995
    
    Closes rapidsai#11672.
    
    Authors:
      - Nghia Truong (https://github.com/ttnghia)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Divye Gala (https://github.com/divyegala)
      - Bradley Dice (https://github.com/bdice)
    
    URL: rapidsai#13005
    ttnghia committed Apr 10, 2023
    f357892
  7. Optimize set-like operations (rapidsai#12769)

    Set-like operations such as `intersect_distinct` and `difference_distinct` call `purge_nonempty_nulls` when the input is nullable. This PR optimizes these set APIs by checking for the existence of non-empty nulls (using `has_nonempty_nulls`) before calling `purge_nonempty_nulls`.
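
The check-before-purge pattern can be sketched in pure Python on a toy lists column made of `offsets`, a `validity` list, and a flat `child` list. The function names echo the libcudf ones for readability, but the signatures and the `sanitize` wrapper are assumptions made for this sketch.

```python
def has_nonempty_nulls(offsets, validity):
    """True if any null row still spans one or more child elements."""
    return any(
        not valid and offsets[i + 1] > offsets[i]
        for i, valid in enumerate(validity)
    )


def purge_nonempty_nulls(offsets, validity, child):
    """Rebuild offsets and child so that every null row is empty."""
    new_offsets, new_child = [0], []
    for i, valid in enumerate(validity):
        if valid:
            new_child.extend(child[offsets[i]:offsets[i + 1]])
        new_offsets.append(len(new_child))
    return new_offsets, new_child


def sanitize(offsets, validity, child):
    # The cheap scan comes first; the rebuild (which copies the child
    # data) only runs when a non-empty null actually exists.
    if not has_nonempty_nulls(offsets, validity):
        return offsets, child
    return purge_nonempty_nulls(offsets, validity, child)
```

When the input is already clean, `sanitize` returns the original buffers untouched, which is where the saving comes from.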
    
    Authors:
      - Nghia Truong (https://github.com/ttnghia)
    
    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Yunsong Wang (https://github.com/PointKernel)
    
    URL: rapidsai#12769
    ttnghia committed Apr 10, 2023
    30411b5
  8. Replace unnecessary uses of UNKNOWN_NULL_COUNT (rapidsai#13102)

    This PR replaces uses of `cudf::UNKNOWN_NULL_COUNT` where the null count is either already known or trivially computed.
    
    Contributes to rapidsai#11968
    
    Authors:
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    Approvers:
      - David Wendt (https://github.com/davidwendt)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: rapidsai#13102
    vyasr committed Apr 10, 2023
    cab6522

Commits on Apr 11, 2023

  1. Adding ifdefs around nvcc-specific pragmas (rapidsai#13110)

    This change wraps the NVCC-specific `#pragma` macros inside an `ifdef` to prevent compilation warnings as described in issue rapidsai#13106
    
    closes rapidsai#13106
    
    Authors:
      - Mike Wilson (https://github.com/hyperbolic2346)
    
    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: rapidsai#13110
    hyperbolic2346 committed Apr 11, 2023
    e9e86f4
  2. Fixes sliced list and struct column bug in JSON chunked writer (rapid…

    …sai#13108)
    
    Fixes the OOM access error while using the chunked JSON writer on list columns.
    The same issue affects struct columns and is also fixed in this change.
    Fixes rapidsai#13030
    
    Authors:
      - Karthikeyan (https://github.com/karthikeyann)
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - David Wendt (https://github.com/davidwendt)
    
    URL: rapidsai#13108
    karthikeyann committed Apr 11, 2023
    5638d44
  3. Remove using namespace cudf; from libcudf gtests source (rapidsai#13089)

    Removes `using namespace cudf;` from gtests source code to make it easier to read, in particular to find where utilities and function calls are implemented. Also removed a few `using namespace cudf::test;` usages, which by extension include namespace `cudf`.
    
    Found these while working on rapidsai#13081
    Reference rapidsai#11734
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
    
    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Bradley Dice (https://github.com/bdice)
    
    URL: rapidsai#13089
    davidwendt committed Apr 11, 2023
    73c8d16
  4. Fix GPU_ARCHS setting in Java CMake build and CMAKE_CUDA_ARCHITECTURE…

    …S in Python package build. (rapidsai#13117)
    
    Changes the `GPU_ARCHS` setting in `pom.xml` from `ALL` to `RAPIDS` per recent change in rapidsai/rapids-cmake/pull/397
    
    The Python package build requires CUDA as of the addition of string UDFs, which added compilation of PTX code to the Python build. Therefore `CMAKE_CUDA_ARCHITECTURES` must be set appropriately even when only the Python package is being built. rapidsai/rapids-cmake#397 requires a nonempty string value to be used if the variable is set at all. This PR updates `build.sh` to include the appropriate default when only the Python build is requested.
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
    
    Approvers:
      - Robert (Bobby) Evans (https://github.com/revans2)
      - Bradley Dice (https://github.com/bdice)
    
    URL: rapidsai#13117
    davidwendt committed Apr 11, 2023
    50718e6

Commits on Apr 12, 2023

  1. Cleanup ORC chunked writer (rapidsai#13091)

    This changes the internal variables of the ORC chunked writer:
     * Rename them to have a `_` prefix consistently.
     * Add a `const` qualifier to some variables that are writer parameters.
     * Regroup them.

    No new implementation is added. However, the unused parameter `mr` is removed from the writer's interface, so this is flagged as a `breaking` change.
    
    Closes: 
     * rapidsai#12973
    
    Authors:
      - Nghia Truong (https://github.com/ttnghia)
    
    Approvers:
      - Karthikeyan (https://github.com/karthikeyann)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    URL: rapidsai#13091
    ttnghia committed Apr 12, 2023
    ecadda5
  2. Use make_empty_lists_column instead of make_empty_column(type_id::LIS…

    …T) (rapidsai#13099)
    
    Fixes a bug where `cudf::make_empty_column(type_id::LIST)` is called and adds a gtest to check for this error.
    `make_empty_column` cannot accept a nested type because it requires a child type. The internal `make_empty_lists_column` is moved to the `lists_column_factories.hpp` header, which is itself moved to the `cpp/include/cudf/lists/detail` directory since it only contains detail functions.
    
    Closes rapidsai#13096
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
      - Nghia Truong (https://github.com/ttnghia)
    
    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - AJ Schmidt (https://github.com/ajschmidt8)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    URL: rapidsai#13099
    davidwendt committed Apr 12, 2023
    9a9f718
  3. Refactor cudf::detail::sorted_order (rapidsai#13062)

    This PR does some cleanup for the `src/sort/sort_impl.cuh` file and the related header/source files:
     * Moving some `#include <header>` directives from there to the source files that use them directly.
     * Adding `constexpr` to the `if/else` statements.
     * Adding a missing doxygen tag.
     * Removing duplicated code by extracting the common code into a lambda.

    No new implementation is added in this PR.
    
    Authors:
      - Nghia Truong (https://github.com/ttnghia)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
      - David Wendt (https://github.com/davidwendt)
    
    URL: rapidsai#13062
    ttnghia committed Apr 12, 2023
    1d77984
  4. Cleanup Parquet chunked writer (rapidsai#13094)

    Similar to rapidsai#13091, this changes the internal variables of the Parquet chunked writer:
     * Rename them to have a `_` prefix consistently.
     * Add a `const` qualifier to some variables that are writer parameters.
     * Regroup them.

    No new implementation is added. However, the unused parameter `mr` is removed from the writer's interface, so this is flagged as a breaking change.
    
    Closes:
     * rapidsai#13079
    
    Authors:
      - Nghia Truong (https://github.com/ttnghia)
    
    Approvers:
      - Mike Wilson (https://github.com/hyperbolic2346)
      - Karthikeyan (https://github.com/karthikeyann)
    
    URL: rapidsai#13094
    ttnghia committed Apr 12, 2023
    2bf0b44
  5. Pin curand version (rapidsai#13127)

    Merging the conda-forge curand recipe and building conda-forge packages has caused conda to choose a newer version of curand than what cudf currently supports (we cannot use the version from CUDA 12).
    
    Closes rapidsai#13126 
    
    Authors:
       - Vyas Ramasubramani (https://github.com/vyasr)
    
    Approvers:
       - Ray Douglass (https://github.com/raydouglass)
       - Robert Maynard (https://github.com/robertmaynard)
       - Bradley Dice (https://github.com/bdice)
    vyasr committed Apr 12, 2023
    ed9385b
  6. 2d311d7
  7. Merge pull request rapidsai#13131 from vyasr/branch-23.06-merge-23.04

    Branch 23.06 merge 23.04
    ajschmidt8 committed Apr 12, 2023
    cae6132

Commits on Apr 13, 2023

  1. Adds checks to make sure json reader won't overflow (rapidsai#13115)

    The JSON reader currently uses 32-bit offsets to index into the input's characters, both to lower the memory footprint and for performance reasons. Hence, if an input larger than `UINT_MAX` is read, the parser may return incorrect data.
    
    This PR adds a check that fails for inputs that could overflow. 
    
    The longer term plan is to make the finite-state transducer stage reentrant and split up inputs larger than `UINT_MAX` into smaller chunks.
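
A minimal sketch of such a guard, assuming the limit is the `UINT_MAX` value mentioned above. The function name `check_input_size` and the choice of `OverflowError` are hypothetical and are not taken from the reader's code.

```python
UINT_MAX = 2**32 - 1  # largest value a 32-bit unsigned offset can hold


def check_input_size(num_bytes, max_offset=UINT_MAX):
    """Fail early instead of letting 32-bit offsets wrap around."""
    if num_bytes > max_offset:
        raise OverflowError(
            f"input of {num_bytes} bytes exceeds the {max_offset}-byte limit "
            "of 32-bit offsets"
        )
    return num_bytes
```

Failing up front turns silently incorrect output into an explicit error, which is the behaviour the PR describes.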
    
    Authors:
      - Elias Stehle (https://github.com/elstehle)
      - Vukasin Milovanovic (https://github.com/vuule)
    
    Approvers:
      - Vukasin Milovanovic (https://github.com/vuule)
      - Karthikeyan (https://github.com/karthikeyann)
    
    URL: rapidsai#13115
    elstehle committed Apr 13, 2023
    3069f1e
  2. Fix hash join when the input tables have nulls on only one side (rapi…

    …dsai#13120)
    
    This is very similar to rapidsai#11284, which fixes a bug when only one input table has nulls while the other doesn't. This is due to the new experimental hasher producing different hash values depending on an input flag `has_nulls`. In order to properly use it, `has_nulls` must be computed by checking all the possible input tables, or set to a constant value (`true`).
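
The pitfall can be sketched with a toy hash join in pure Python. The hasher below takes a `has_nulls` flag and follows a different code path depending on it, loosely imitating the behaviour described above; all names are hypothetical and the hashing scheme is an assumption, not the experimental row hasher.

```python
def row_hash(row, has_nulls):
    """Toy hasher whose result depends on the has_nulls flag."""
    if has_nulls:
        # Nullable path: each value is hashed together with its validity.
        return hash(tuple((value is not None, value) for value in row))
    return hash(tuple(row))


def table_has_nulls(table):
    return any(value is None for row in table for value in row)


def hash_join(left, right):
    """Return (left_index, right_index) pairs of equal rows."""
    # Compute the flag once over BOTH tables so that the build and probe
    # sides always go through the same hashing path.
    has_nulls = table_has_nulls(left) or table_has_nulls(right)
    build = {}
    for r, row in enumerate(right):
        build.setdefault(row_hash(row, has_nulls), []).append(r)
    return [
        (l, r)
        for l, row in enumerate(left)
        for r in build.get(row_hash(row, has_nulls), [])
        if right[r] == row  # guard against hash collisions
    ]
```

If each side derived the flag from its own table only, a null-free left table would be hashed on one path and a nullable right table on the other, and equal rows could miss each other.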
    
    Closes:
     * rapidsai#13109
    
    Authors:
      - Nghia Truong (https://github.com/ttnghia)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    Approvers:
      - Divye Gala (https://github.com/divyegala)
      - Yunsong Wang (https://github.com/PointKernel)
    
    URL: rapidsai#13120
    ttnghia committed Apr 13, 2023
    d415ffe
  3. Prevent overflow with skip_rows in ORC and Parquet readers (rapidsa…

    …i#13063)
    
    Use `int64_t` for `skip_rows`, since a source or combined sources can have more than two billion rows, and we should be able to read a range of rows even in that case.
    Store `num_rows` as `std::optional` instead of using a special value (`-1`).
    Share the error-prone logic between the ORC and Parquet readers.
    Added unit tests for the tricky code above.
    Converted inout `select_stripes` parameters to input params + return values.
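
The row-range logic can be sketched in Python, using `Optional[int]` in place of `std::optional` and `None` in place of the old `-1` sentinel. The helper name `select_row_range` is hypothetical. Python integers do not overflow, so the sketch only shows the arithmetic; in C++ the same values need a 64-bit type once they pass roughly 2.1 billion.

```python
from typing import Optional, Tuple


def select_row_range(
    total_rows: int, skip_rows: int = 0, num_rows: Optional[int] = None
) -> Tuple[int, int]:
    """Return (first_row, row_count) clamped to the rows that exist."""
    if skip_rows < 0 or (num_rows is not None and num_rows < 0):
        raise ValueError("skip_rows and num_rows must be non-negative")
    start = min(skip_rows, total_rows)
    available = total_rows - start
    # None means "read everything after the skipped rows".
    count = available if num_rows is None else min(num_rows, available)
    return start, count
```

For example, `select_row_range(3_000_000_000, 2_500_000_000)` skips past the 32-bit signed range and still returns the remaining 500 million rows.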
    
    Authors:
      - Vukasin Milovanovic (https://github.com/vuule)
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
      - Robert Maynard (https://github.com/robertmaynard)
      - https://github.com/brandon-b-miller
      - Yunsong Wang (https://github.com/PointKernel)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    URL: rapidsai#13063
    vuule committed Apr 13, 2023
    f77403e
  4. Purge nonempty nulls from byte_cast list outputs. (rapidsai#11971)

    Resolves rapidsai#11754. The `byte_cast` function is creating unsanitized lists from null inputs, which is a bug. [This logic](https://github.com/rapidsai/cudf/blob/9c06330363db4da99803a3728b8bf44f9829f0b9/cpp/src/reshape/byte_cast.cu#L66-L81) copies nonzero bytes even if the input element is null. The input's null mask is copied onto the output parent list column, but the null children are nonempty. This PR fixes the bug by calling `cudf::purge_nonempty_nulls` on the result before returning, if there are any nulls to be purged.
    
    Depends on:
     * rapidsai#13099
    
    Authors:
      - Bradley Dice (https://github.com/bdice)
      - Nghia Truong (https://github.com/ttnghia)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    Approvers:
      - David Wendt (https://github.com/davidwendt)
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Mike Wilson (https://github.com/hyperbolic2346)
    
    URL: rapidsai#11971
    bdice committed Apr 13, 2023
    4b34831
  5. Fix null_count of columns returned by chunked_parquet_reader (rapidsai#13111)
    
    Chunked Parquet reader returns columns with incorrect null counts - the counts are cumulative sums that include all previous chunks.
    Root cause is that `nesting_decode_cache` is not copied back to `nesting_decode` when `gpuDecodePageData` returns early, so previously computed null counts are only reset in the cache.
    With this PR, we use RAII to make sure cached decode info is always copied back in `gpuDecodePageData`.
    Also fixed `column_buffer::empty_like` to return zero null count and empty null mask.
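
The copy-back guarantee can be illustrated with a Python analogue of RAII, a context manager whose `finally` clause runs on every exit path. This is a sketch under assumptions: `cached_copy` and `decode_page` are hypothetical names, and a dictionary stands in for the decode state; it does not reproduce the CUDA kernel.

```python
from contextlib import contextmanager


@contextmanager
def cached_copy(state: dict):
    """Yield a working copy of `state` and always write it back on exit."""
    cache = dict(state)
    try:
        yield cache
    finally:
        # Runs on normal exit, early return, and exceptions alike.
        state.update(cache)


def decode_page(state: dict, rows: list, stop_early: bool) -> None:
    """Hypothetical decode step that resets and recomputes a null count."""
    with cached_copy(state) as cache:
        cache["null_count"] = 0  # drop the count left over from the previous chunk
        if stop_early:
            return  # early exit: the reset still reaches `state`
        cache["null_count"] = sum(1 for r in rows if r is None)
```

Without the write-back in `finally`, the early return would leave the stale count in `state`, which mirrors the cumulative null counts described above.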
    
    Authors:
      - Vukasin Milovanovic (https://github.com/vuule)
    
    Approvers:
      - https://github.com/nvdbaranec
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    URL: rapidsai#13111
    vuule committed Apr 13, 2023
    5764ba5
  6. Remove more instances of UNKNOWN_NULL_COUNT (rapidsai#13134)

    Contributes to rapidsai#11968.
    
    Authors:
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    Approvers:
      - Jason Lowe (https://github.com/jlowe)
      - Nghia Truong (https://github.com/ttnghia)
      - David Wendt (https://github.com/davidwendt)
      - Bradley Dice (https://github.com/bdice)
    
    URL: rapidsai#13134
    vyasr committed Apr 13, 2023
    6ae591f
  7. Explicitly compute null count in concatenate APIs (rapidsai#13104)

    The total number of nulls in the output can be computed by summing the nulls in the input columns.
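
As a minimal sketch of that idea, assume each input column is modeled as a `(values, null_count)` pair whose count is already known. The `concatenate` function below is a hypothetical stand-in for the illustration, not the libcudf API.

```python
from typing import List, Optional, Sequence, Tuple

Column = Tuple[Sequence[Optional[int]], int]  # (values, known null count)


def concatenate(columns: Sequence[Column]) -> Tuple[List[Optional[int]], int]:
    """Concatenate columns and derive the output null count from the inputs."""
    values: List[Optional[int]] = []
    total_nulls = 0
    for col_values, col_nulls in columns:
        values.extend(col_values)
        total_nulls += col_nulls  # no rescan of the output is needed
    return values, total_nulls
```

The result carries an exact null count as soon as it is built, so there is no need for a placeholder that is resolved later.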
    
    Contributes to rapidsai#11968
    
    Authors:
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - David Wendt (https://github.com/davidwendt)
    
    URL: rapidsai#13104
    vyasr committed Apr 13, 2023
    4f0c46e

Commits on Apr 14, 2023

  1. Fix Series and DataFrame constructors to validate index lengths (rapidsai#13122)
    
    Fixes: rapidsai#12999, rapidsai#13056
    
    This PR fixes the `Series` and `DataFrame` constructors to validate the `data` & `index` lengths. This also contains fixes where `index` was being ignored in certain cases.
    
    Authors:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    URL: rapidsai#13122
    galipremsagar committed Apr 14, 2023
    2d70331
  2. Use CTAD instead of functions in ProtobufReader (rapidsai#13135)

    Replaced `std::make_tuple` with `std::tuple` constructor
    Removed `std::make_field_reader`, calling `field_reader` constructor directly now.
    
    Authors:
      - Vukasin Milovanovic (https://github.com/vuule)
    
    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: rapidsai#13135
    vuule committed Apr 14, 2023
    a562a7e
  3. Compute null-count in cudf::detail::slice (rapidsai#13124)

    Calculates the null-count in the `cudf::detail::slice()` function. This requires adding a stream parameter to the function and updating the callers to pass the stream. Also moved the function definition to the `slice.cu` file since there are only two possible values for the template parameter.
    
    Labeling this with non-breaking since it is a detail function.
    
    Contributes to: rapidsai#11968
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - Divye Gala (https://github.com/divyegala)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    URL: rapidsai#13124
    davidwendt committed Apr 14, 2023
    891698d
  4. Set null-count in linked_column_view conversion operator (rapidsai#13121)
    
    Removes the `UNKNOWN_NULL_COUNT` usage in the `linked_column_view::column_view()` conversion operator. The null-count is copied from the parent instance. The `linked_column_view` class was reworked to move the C++ function definitions from the header file to a new .cpp file.
    
    Contributes to: rapidsai#11968
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
    
    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: rapidsai#13121
    davidwendt committed Apr 14, 2023
    e6cb2d0
  5. Use .element() instead of .data() for window range calculations (rapidsai#13095)
    
    In the staging step for executing window range queries, the boundaries of each row's window are calculated. This involves subtracting/adding the `preceding`/`following` values from each order-by column row, and then searching backwards/forwards for the boundary values.
    
    The staging step has been using `column_device_view.data()` for accessing the order-by rows, an acceptable approach for when the order-by columns are numeric (e.g. `INT32`).
    
    This approach fails when the order-by column is a `STRING`, because `.data()` is not defined for such columns. A better approach would be to use `.element()` to directly access the rows, because it has special handling for `STRING`, among other types, while continuing to work for numeric primitives.
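
The boundary calculation described above can be sketched in Python for a numeric order-by column. The sketch makes several assumptions: the column is sorted ascending with no nulls, `range_window_bounds` is a hypothetical name, and a binary search from the standard `bisect` module replaces the backwards/forwards search, so it illustrates the result rather than the libcudf algorithm.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence, Tuple


def range_window_bounds(order_by: Sequence[int], preceding: int,
                        following: int) -> List[Tuple[int, int]]:
    """Return half-open [start, end) row bounds for each row's range window.

    Assumes `order_by` is sorted ascending and contains no nulls.
    """
    bounds = []
    for value in order_by:
        # First row whose value is >= value - preceding.
        start = bisect_left(order_by, value - preceding)
        # One past the last row whose value is <= value + following.
        end = bisect_right(order_by, value + following)
        bounds.append((start, end))
    return bounds
```

Each row's value is read through ordinary element access, so the same logic would apply to any element type that supports comparison and the offset arithmetic.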
    
    ## Future
    In a followup to this change, support for `STRING` order-by columns will be added.
    
    Authors:
      - MithunR (https://github.com/mythrocks)
    
    Approvers:
      - Yunsong Wang (https://github.com/PointKernel)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: rapidsai#13095
    mythrocks committed Apr 14, 2023
    5c93b44
  6. Change cudf::test::make_null_mask to also return null-count (rapidsai#13081)
    
    Change the `cudf::test::make_null_mask` to return both the null-mask and the null-count. Callers can then use this null-count instead of `UNKNOWN_NULL_COUNT`. These changes include removing `UNKNOWN_NULL_COUNT` usage from the libcudf C++ test source code.
    
    One side effect was found: a strings column with all nulls can technically have no children, but using `UNKNOWN_NULL_COUNT` allowed the check for this to be bypassed. As a result, many utilities started to fail when `UNKNOWN_NULL_COUNT` was removed. The factory was modified to remove the check, which results in an offsets column and an empty chars column as children.
    
    More code will likely need to be changed when `UNKNOWN_NULL_COUNT` is no longer used as a default parameter for factories and other column functions.
    
    No behavior is changed. Since the `cudf::test::make_null_mask` is technically a public API, this PR could be marked as a breaking change as well.
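
Returning the mask and the count together can be sketched in Python as follows. This is an illustration under assumptions: `make_null_mask_sketch` is a hypothetical name, and the bit layout (a set bit means valid, least significant bit first) is assumed for the example.

```python
from typing import Iterable, Tuple


def make_null_mask_sketch(validity: Iterable[bool]) -> Tuple[bytes, int]:
    """Pack validity flags into a bitmask and count the nulls in one pass."""
    mask = bytearray()
    null_count = 0
    for i, is_valid in enumerate(validity):
        if i % 8 == 0:
            mask.append(0)  # start a new byte every eight rows
        if is_valid:
            mask[-1] |= 1 << (i % 8)  # set bit = valid row
        else:
            null_count += 1
    return bytes(mask), null_count
```

Because the count comes out of the same pass that builds the mask, a caller can hand both to a column factory instead of a placeholder count.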
    
    Contributes to: rapidsai#11968
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
    
    Approvers:
      - MithunR (https://github.com/mythrocks)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    URL: rapidsai#13081
    davidwendt committed Apr 14, 2023
    4481142
  7. Deprecate pad and backfill methods (rapidsai#13140)

    This PR deprecates `pad` and `backfill` methods in favor of `ffill` and `bfill` methods.
    
    Pandas recently deprecated these:
    pandas-dev/pandas#51221
    pandas-dev/pandas#45076
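
A deprecation of this kind is commonly implemented as a thin wrapper that warns and then delegates. The sketch below shows that pattern under assumptions: `TinySeries` is a hypothetical stand-in class, and both the warning category (`FutureWarning`) and the message text are assumed for the example rather than taken from the PR.

```python
import warnings


class TinySeries:
    """Hypothetical stand-in for a Series, used only to show the pattern."""

    def __init__(self, values):
        self.values = list(values)

    def ffill(self) -> "TinySeries":
        """Forward-fill: replace each None with the last non-null value seen."""
        filled, last = [], None
        for v in self.values:
            if v is not None:
                last = v
            filled.append(last)
        return TinySeries(filled)

    def pad(self) -> "TinySeries":
        """Deprecated alias that warns and then delegates to ffill."""
        warnings.warn("pad is deprecated; use ffill instead.",
                      FutureWarning, stacklevel=2)
        return self.ffill()
```

Existing callers of `pad` keep working and see a warning, while new code can call `ffill` directly.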
    
    Authors:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    Approvers:
      - Matthew Roeschke (https://github.com/mroeschke)
    
    URL: rapidsai#13140
    galipremsagar committed Apr 14, 2023
    daf3ac0

Commits on Apr 15, 2023

  1. Enable binary operations between scalars and columns of differing decimal types (rapidsai#13034)
    
    Closes rapidsai#12958 
    
    This PR enables some previously xfailing tests.
    
    Authors:
      - Ashwin Srinath (https://github.com/shwina)
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    Approvers:
      - GALI PREM SAGAR (https://github.com/galipremsagar)
      - Bradley Dice (https://github.com/bdice)
      - Vyas Ramasubramani (https://github.com/vyasr)
    
    URL: rapidsai#13034
    shwina committed Apr 15, 2023
    0b59fda
  2. Update clang-format to 16.0.1. (rapidsai#13133)

    This PR updates the clang-format version used by pre-commit.
    
    Authors:
      - Bradley Dice (https://github.com/bdice)
    
    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - Elias Stehle (https://github.com/elstehle)
    
    URL: rapidsai#13133
    bdice committed Apr 15, 2023
    580ee40

Commits on Apr 17, 2023

  1. Fix a few clang-format style check errors (rapidsai#13146)

    Fixes some build errors occurring after rapidsai#13133 was merged. It looks like a couple of files may have been mismerged. This should unblock several current PRs.
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
    
    Approvers:
      - Yunsong Wang (https://github.com/PointKernel)
      - Divye Gala (https://github.com/divyegala)
    
    URL: rapidsai#13146
    davidwendt committed Apr 17, 2023
    a6fb6a2
  2. Add null-count parameter to json experimental parse_data utility (rapidsai#13107)
    
    Add a `null_count` parameter to the `cudf::io::json::experimental::detail::parse_data` function, which already accepts a `null_mask`. Normally, the callers already know the count. This function can use the parameter to help build the output column.
    
    Found while working on rapidsai#13081
    Contributes to: rapidsai#11968
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
      - GALI PREM SAGAR (https://github.com/galipremsagar)
    
    Approvers:
      - Vyas Ramasubramani (https://github.com/vyasr)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: rapidsai#13107
    davidwendt committed Apr 17, 2023
    7c3a34e
  3. Add Python bindings for time zone data (TZiF) reader (rapidsai#12826)

    This PR adds bindings to the TZiF reader that was added in the libcudf API in rapidsai#12805.
    
    No tests are being added as these bindings are just for internal use. In follow-up PRs, I will add a timezone-aware datetime type and timezone-aware operations to the public API, along with tests for those operations.
    
    The bindings can be used as follows:
    
    ```python
    >>> transition_times, offsets = make_timezone_transition_table("/usr/share/zoneinfo", "America/New_York")
                                                
    >>> transition_times
    <cudf.core.column.datetime.DatetimeColumn object at 0x7f95cd6ac840>
    [
      1883-11-18 17:00:00,
      1883-11-18 17:00:00,
      1918-03-31 07:00:00,
      1918-10-27 06:00:00,
      1919-03-30 07:00:00,
      1919-10-26 06:00:00,
      1920-03-28 07:00:00,
      1920-10-31 06:00:00,
      1921-04-24 07:00:00,
      1921-09-25 06:00:00,
      ...
      2365-03-14 07:00:00,
      2365-11-07 06:00:00,
      2366-03-13 07:00:00,
      2366-11-06 06:00:00,
      2367-03-12 07:00:00,
      2367-11-05 06:00:00,
      2368-03-10 07:00:00,
      2368-11-03 06:00:00,
      2369-03-09 07:00:00,
      2369-11-02 06:00:00
    ]
    dtype: datetime64[s]
    
    >>> offsets
    <cudf.core.column.timedelta.TimeDeltaColumn object at 0x7f94e69bad40>
    [
      -18000,
      -18000,
      -14400,
      -18000,
      -14400,
      -18000,
      -14400,
      -18000,
      -14400,
      -18000,
      ...
      -14400,
      -18000,
      -14400,
      -18000,
      -14400,
      -18000,
      -14400,
      -18000,
      -14400,
      -18000
    ]
    dtype: timedelta64[s]
    ```
    
    Authors:
      - Ashwin Srinath (https://github.com/shwina)
      - Vukasin Milovanovic (https://github.com/vuule)
    
    Approvers:
      - Bradley Dice (https://github.com/bdice)
    
    URL: rapidsai#12826
    shwina committed Apr 17, 2023
    b05d5e7
  4. Use ARC V2 self-hosted runners for GPU jobs (rapidsai#13123)

    This PR updates the runner labels to use ARC V2 self-hosted runners for GPU jobs. This is needed to resolve the auto-scaling issues.
    
    Authors:
      - Jordan Jacobelli (https://github.com/jjacobelli)
    
    Approvers:
      - AJ Schmidt (https://github.com/ajschmidt8)
    
    URL: rapidsai#13123
    jjacobelli committed Apr 17, 2023
    b8ab63d
  5. Fix read_avro() skip_rows and num_rows. (rapidsai#12912)

    This PR fixes the avro reader (`cudf.read_avro()`) such that it honors the values passed to the `skip_rows` and `num_rows` parameters.  In implementing this new logic, we also revamp the reader's ability to handle multi-block avro files, which we also test extensively with a new `test_avro_reader_multiblock()` test that features some 1300 permutations of various block size combinations.
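
Honoring `skip_rows` and `num_rows` across several blocks amounts to mapping one global row range onto per-block ranges. The Python sketch below shows that mapping under assumptions: `rows_per_block` is a hypothetical helper, blocks are described only by their row counts, and `None` means "read everything after the skipped rows". It is not the reader's actual code.

```python
from typing import List, Optional, Sequence, Tuple


def rows_per_block(block_sizes: Sequence[int], skip_rows: int = 0,
                   num_rows: Optional[int] = None) -> List[Tuple[int, int, int]]:
    """Return (block_index, first_row_in_block, row_count) per block touched."""
    selected = []
    to_skip = skip_rows
    to_read = num_rows  # None means no upper limit
    for index, size in enumerate(block_sizes):
        if to_read is not None and to_read == 0:
            break  # the requested number of rows has been collected
        if to_skip >= size:
            to_skip -= size  # the whole block falls inside the skipped prefix
            continue
        available = size - to_skip
        take = available if to_read is None else min(available, to_read)
        selected.append((index, to_skip, take))
        to_skip = 0
        if to_read is not None:
            to_read -= take
    return selected
```

With blocks of 3, 4, and 5 rows, `skip_rows=4` and `num_rows=5` skip the first block entirely, take three rows from the second block starting at its row 1, and take two rows from the third.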
    
    Closes rapidsai#6529.
    
    Authors:
      - Trent Nelson (https://github.com/tpn)
    
    Approvers:
      - Lawrence Mitchell (https://github.com/wence-)
      - Vukasin Milovanovic (https://github.com/vuule)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: rapidsai#12912
    tpn committed Apr 17, 2023
    62e02c6

Commits on Apr 18, 2023

  1. e5ea7df
  2. Improve performance of slice_strings for long strings (rapidsai#13057)

    Improves performance for longer strings with the `cudf::strings::slice_strings()` API. The `cudf::string_view::substr` function was reworked to minimize counting characters, and the gather version of `make_strings_children` is used to build the resulting strings column. This version is already optimized for small and large strings.
    
    Additionally, the code was refactored so the common case of `step==1 and start < stop` can also make use of the gather approach. Common code was also grouped closer together to make the source file easier to navigate.
    
    The `slice.cpp` benchmark was updated to better measure large strings with comparable slice boundaries. The benchmark showed performance improvement was up to 9x for larger strings with no significant degradation for smaller strings.
    
    Reference rapidsai#13048 and rapidsai#12445
    
    Authors:
      - David Wendt (https://github.com/davidwendt)
    
    Approvers:
      - Nghia Truong (https://github.com/ttnghia)
      - Elias Stehle (https://github.com/elstehle)
    
    URL: rapidsai#13057
    davidwendt committed Apr 18, 2023
    feea040
  3. Allow compilation with any GTest version 1.11+ (rapidsai#13153)

    GTest max support for `Types` was removed in 1.11, so we remove the workarounds in cudf_gtest.
    
    Since we need to support both our custom `Types` and the GTest 1.11+ version, the type_list_utilities were reworked to be generic and not depend on specific traits.
    
    Also corrected the `<<` overloads for GTest printing so that they work with GTest 1.11.
    
    Authors:
      - Robert Maynard (https://github.com/robertmaynard)
      - Vukasin Milovanovic (https://github.com/vuule)
    
    Approvers:
      - Bradley Dice (https://github.com/bdice)
      - Nghia Truong (https://github.com/ttnghia)
    
    URL: rapidsai#13153
    robertmaynard committed Apr 18, 2023
    1750bff
  4. add null literal case

    karthikeyann committed Apr 18, 2023
    a4febb6
  5. Merge branch 'fea-string_scalar_ast_compare' of github.com:karthikeyann/cudf into fea-string_scalar_ast_compare

    karthikeyann committed Apr 18, 2023
    972b9fa
  6. 8354982