Skip to content

Commit

Permalink
Fix OOB memory access in CSV reader when reading without NA values (#…
Browse files Browse the repository at this point in the history
…13011)

CSV reader uses a trie to read field with special values as nulls. The creation of the trie does not work correctly when there are not special values. This can happen when the NA filter is enabled, but the default NA values are removed, and user does not specify custom values. In this case, use of this trie leads to OOB memory access.
This PR fixes the trie creation to create an empty trie when there are not special values to look for.
Included a C++ test that crashes without the fix.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13011
  • Loading branch information
vuule committed Apr 5, 2023
1 parent bd1ef29 commit 9a770f6
Show file tree
Hide file tree
Showing 3 changed files with 21 additions and 3 deletions.
2 changes: 2 additions & 0 deletions cpp/src/io/utilities/trie.cu
Original file line number Diff line number Diff line change
Expand Up @@ -36,6 +36,8 @@ namespace detail {
rmm::device_uvector<serial_trie_node> create_serialized_trie(const std::vector<std::string>& keys,
rmm::cuda_stream_view stream)
{
if (keys.empty()) { return rmm::device_uvector<serial_trie_node>{0, stream}; }

static constexpr int alphabet_size = std::numeric_limits<char>::max() + 1;
struct TreeTrieNode {
using TrieNodePtr = std::unique_ptr<TreeTrieNode>;
Expand Down
6 changes: 3 additions & 3 deletions cpp/src/io/utilities/trie.cuh
Original file line number Diff line number Diff line change
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2018-2021, NVIDIA CORPORATION.
* Copyright (c) 2018-2023, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
Expand Down Expand Up @@ -83,8 +83,8 @@ trie create_serialized_trie(const std::vector<std::string>& keys, rmm::cuda_stre
__host__ __device__ inline bool serialized_trie_contains(device_span<serial_trie_node const> trie,
device_span<char const> key)
{
if (trie.data() == nullptr || trie.empty()) return false;
if (key.empty()) return trie.front().is_leaf;
if (trie.empty()) { return false; }
if (key.empty()) { return trie.front().is_leaf; }
auto curr_node = trie.begin() + 1;
for (auto curr_key = key.begin(); curr_key < key.end(); ++curr_key) {
// Don't jump away from root node
Expand Down
16 changes: 16 additions & 0 deletions cpp/tests/io/csv_test.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1358,6 +1358,22 @@ TEST_F(CsvReaderTest, nullHandling)

CUDF_TEST_EXPECT_COLUMNS_EQUAL(expect, view.column(0));
}

// Filter enabled, but no NA values
{
cudf::io::csv_reader_options in_opts =
cudf::io::csv_reader_options::builder(cudf::io::source_info{filepath})
.keep_default_na(false)
.dtypes({dtype<cudf::string_view>()})
.header(-1)
.skip_blank_lines(false);
const auto result = cudf::io::read_csv(in_opts);
const auto view = result.tbl->view();
auto expect =
cudf::test::strings_column_wrapper({"NULL", "", "null", "n/a", "Null", "NA", "nan"});

CUDF_TEST_EXPECT_COLUMNS_EQUAL(expect, view.column(0));
}
}

TEST_F(CsvReaderTest, FailCases)
Expand Down

0 comments on commit 9a770f6

Please sign in to comment.