[BUG] CSV reader cannot handle unquoted quote character appearing in a field #11948
Comments
I was able to make the test case a lot simpler and still see the same error.
shows the same problems: only the first three lines come out, and the third entry is cut short.
Wow, even if I escape the quotes I still get the problem with the dropped lines. It does "fix" the issue of the quote at the end being removed, but the escapes are not stripped from the quotes.
But the output is
when it should be
Oddly, if I remove the escapes and just keep the entire thing quoted, it fixes the problem with the dropped lines, but it does not fix the single entry.
vs. from Spark:
This one I am less sure we have to match exactly what Spark is doing, because pandas matches CUDF in this case. Pandas also handles escaped quotes differently, so just take these as info for now.
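As a point of reference on how the two escaping conventions differ (a stdlib illustration only, unrelated to cudf's internals): Python's `csv` module supports both the doubled-quote convention that pandas defaults to and a backslash-escape convention similar to Spark's default `escape` option, and the same logical value is written two different ways:

```python
import csv
import io

# Doubled-quote convention (the csv module's default, like pandas):
# "" inside a quoted field is a literal quote character.
doubled = '1,"abc""def"\n'
print(next(csv.reader(io.StringIO(doubled))))   # ['1', 'abc"def']

# Backslash-escape convention (similar to Spark's default escape of \):
escaped = '1,"abc\\"def"\n'
print(next(csv.reader(io.StringIO(escaped), doublequote=False,
                      escapechar='\\')))        # ['1', 'abc"def']
```

Both inputs decode to the same field value, which is part of why escaped-quote results can differ between readers that assume one convention or the other.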
Thanks @revans2 for investigating this. I'm posting a Python reproducer:
I can't repro the issue in the comment. Tried using Python:
Did some scoping, and it seems like this requires large changes to the way the reader finds row offsets. The current state machine has four states (represented by two bits), and handling this would require an additional state, and thus more bits. My main concern is the work involved in changing how the state machine packs and handles the states.
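As a rough illustration of the kind of extra state involved (a toy scanner, not cudf's actual implementation, and its states are hypothetical): to treat quotes inside an unquoted field as literal, the offset finder has to distinguish "at the start of a field" from "inside an unquoted field", which is presumably the distinction the current machine lacks:

```python
# Toy row-offset scanner, NOT cudf's implementation: it illustrates
# why literal quotes in unquoted fields need their own state.
FIELD_START, UNQUOTED, QUOTED, QUOTE_IN_QUOTED = range(4)

def find_row_offsets(data: str) -> list:
    """Return the character offsets at which each row starts."""
    state = FIELD_START
    offsets = [0]
    for i, c in enumerate(data):
        if state == FIELD_START:
            if c == '"':
                state = QUOTED          # quote opens a quoted field
            elif c == '\n':
                offsets.append(i + 1)
            elif c != ',':
                state = UNQUOTED
        elif state == UNQUOTED:
            if c == ',':
                state = FIELD_START
            elif c == '\n':
                offsets.append(i + 1)
                state = FIELD_START
            # quotes here are literal: no state change, no newline swallowed
        elif state == QUOTED:
            if c == '"':
                state = QUOTE_IN_QUOTED
        else:  # QUOTE_IN_QUOTED
            if c == '"':
                state = QUOTED          # "" is an escaped quote
            elif c == ',':
                state = FIELD_START
            elif c == '\n':
                offsets.append(i + 1)
                state = FIELD_START
            else:
                state = UNQUOTED        # lenient: text after closing quote
    return offsets
```

With input like `1,abc""\n2,x\n3,y\n` this finds all three rows, because the quotes in `abc""` are seen in the UNQUOTED state and never toggle quoting, while a newline inside a quoted field is still correctly swallowed.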
Describe the bug
This is directly from NVIDIA/spark-rapids#6435. If you have a field like
abc""
in a CSV file, the cudf CSV parser stops processing further data.
Steps/Code to reproduce bug
Create a file
test.csv
with the following data in it. Now try to read it using CUDF. The last two rows are skipped, and the
abc""
is read back missing the last "
(from Spark, using the RAPIDS plugin for Apache Spark)
Without the plugin I get back
Which is also what I get back from pandas.
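The exact file contents were elided above, but based on the description a field like `abc""` with unquoted quote characters should come back verbatim. A small stdlib sketch (the sample data is a hypothetical stand-in, not the original `test.csv`) shows the behavior pandas shares with Python's `csv` module, where a quote that does not open a field is treated literally:

```python
import csv
import io

# Hypothetical stand-in for test.csv; the original file contents
# were not shown in this issue.
data = '1,abc""\n2,x\n3,y\n'

rows = list(csv.reader(io.StringIO(data)))
for row in rows:
    print(row)
# All three rows are parsed, and the second field of the first row
# comes back as abc"" with both quote characters intact.
```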
Expected behavior
CUDF returns the same result as Pandas and Spark.