-
Notifications
You must be signed in to change notification settings - Fork 901
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Fix N/A detection for empty fields in CSV reader #6922
[REVIEW] Fix N/A detection for empty fields in CSV reader #6922
Conversation
…ug-csv-empty-na-values
…ug-csv-empty-na-values
…ug-csv-empty-na-values
Please update the changelog in order to start CI tests. View the gpuCI docs here. |
… into bug-csv-empty-na-values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Codecov Report
@@ Coverage Diff @@
## branch-0.18 #6922 +/- ##
===============================================
+ Coverage 81.53% 83.50% +1.96%
===============================================
Files 96 96
Lines 15876 18290 +2414
===============================================
+ Hits 12945 15273 +2328
- Misses 2931 3017 +86
Continue to review full report at Codecov.
|
rerun tests |
@@ -224,7 +224,7 @@ __global__ void __launch_bounds__(csvparse_block_dim) | |||
long tempPos = pos - 1; | |||
long field_len = pos - start; | |||
|
|||
if (field_len <= 0 || serialized_trie_contains(opts.trie_na, raw_csv + start, field_len)) { | |||
if (field_len < 0 || serialized_trie_contains(opts.trie_na, raw_csv + start, field_len)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess field_len < 0
is because we're using inclusive offsets (i.e, [0, 0] includes index 0)... Do we have a bug open for changing inclusive offsets -> exclusive offsets throughout cuio?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might want to add a comment to this so it's not thought to be a mistake. Or use <= -1
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Took another look and it seems like field_len
can never be negative. Since long field_len = pos - start;
, and pos
is initialized to start
and can only advance.
I left the condition as a sanity check, but it actually makes no sense in the context. Will remove the condition, thanks!
@@ -89,6 +89,8 @@ inline thrust::host_vector<SerialTrieNode> createSerializedTrie( | |||
// Serialize the tree trie | |||
std::deque<IndexedTrieNode> to_visit; | |||
thrust::host_vector<SerialTrieNode> nodes; | |||
// suport for matching empty input | |||
nodes.push_back(SerialTrieNode(trie_terminating_character, tree_trie.is_end_of_word)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean all tries match empty input? Difficult to follow the logic how this coordinates with the code below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not always. Here we create a node that will have is_end_of_word
= true if the root node of the tree_trie
is also end of a word (empty word, since it's the root). Hence, passing tree_trie.is_end_of_word as the second parameter.
I'll expand the comment above to make this clearer. 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pytests lgtm
Fixes #6682, #6680
Currently, empty fields are treated as N/A regardless on parsing options. However, the desired behavior is to handle empty fields the same way as fields with special values (apply default_na_values, na_filter logic).
This PR irons out the behavior so it matches Pandas in this regard.