ArrayIndexOutOfBoundsException when reading CSV file #658
Comments
Thanks for the bug report and the file to use when re-creating it. I will
work on this tomorrow.
Given that most of Tablesaw relies on referencing columns by name, I think
trimming is almost always a good idea. If you print the table or table
structure you won't see the whitespace, and subsequent attempts to
reference the column by name are likely to fail for no obvious reason. We
shouldn't have names that look like "Foo" and are actually "Foo ".
(There's a trailing whitespace on the second one.)
This should never throw an ArrayIndexOutOfBoundsException, though. I'm
inclined to "fix" it by ensuring the column names are trimmed on both sides,
if that's possible. I'm not sure that's exactly the right solution, but it's
better than the current state, and is probably better for most people, most
of the time.
Do you have a use case for maintaining names with trailing whitespace? In
general, names like that are an artifact of an upstream quality issue. I
opened the file in Excel and checked a couple of columns. They didn't seem
to have trailing whitespace, and I couldn't tell which one or ones did by
glancing at the header.
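Since padded names are hard to spot by eye, a small check can list them. The following is a minimal sketch that does not use Tablesaw; the class name `PaddedHeaderFinder` and the sample header names are hypothetical and only serve as an illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PaddedHeaderFinder {

    // Returns every header that differs from its trimmed form,
    // wrapped in brackets so the padding becomes visible.
    static List<String> findPadded(List<String> headers) {
        List<String> padded = new ArrayList<>();
        for (String header : headers) {
            if (!header.equals(header.trim())) {
                padded.add("[" + header + "]");
            }
        }
        return padded;
    }

    public static void main(String[] args) {
        // Hypothetical header row; the second and third names carry padding.
        List<String> headers = Arrays.asList("Country", "Project Name ", " Amount");
        System.out.println(findPadded(headers));
    }
}
```

Note that `String.trim()` only strips characters up to U+0020, so padding made of non-breaking spaces would not be reported by this check.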
…On Wed, Sep 4, 2019 at 7:16 PM Aécio Santos ***@***.***> wrote:
This bug is similar to #297, but it is not the same. It happens when the
CSV has a column header with a trailing space,
e.g. c1,"c2 ","c3" (note the space in the "c2 " header).
While reading the file, the method selectColumnNames()
<https://github.com/jtablesaw/tablesaw/blob/e0370f73904d3aee92d85a643a748031ad421969/core/src/main/java/tech/tablesaw/io/FileReader.java#L189>
(in line String[] columnNames = selectColumnNames(headerRow, types))
returns trimmed strings which are then used to search over the original
names of the columns in line columnIndexes[i] =
headerRow.indexOf(columnNames[i])
<https://github.com/jtablesaw/tablesaw/blob/e0370f73904d3aee92d85a643a748031ad421969/core/src/main/java/tech/tablesaw/io/FileReader.java#L107>.
Thus, it does not find the correct column index and returns -1, which
ultimately causes an index out of bounds exception.
I don't think header names should be trimmed in this case where column
headers are delimited by quotes.
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
at java.util.ArrayList.elementData(ArrayList.java:422)
at java.util.ArrayList.get(ArrayList.java:435)
at tech.tablesaw.io.AddCellToColumnException.<init>(AddCellToColumnException.java:63)
at tech.tablesaw.io.FileReader.addRows(FileReader.java:143)
at tech.tablesaw.io.FileReader.parseRows(FileReader.java:104)
at tech.tablesaw.io.csv.CsvReader.read(CsvReader.java:89)
at tech.tablesaw.io.csv.CsvReader.read(CsvReader.java:78)
at tech.tablesaw.io.DataFrameReader.csv(DataFrameReader.java:156)
at tech.tablesaw.io.DataFrameReader.csv(DataFrameReader.java:152)
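The mismatch described above can be reproduced in isolation. This is a minimal sketch, not the actual FileReader code: the class and method names are hypothetical, and only the `indexOf` lookup of a trimmed name in an untrimmed header row is taken from the report. The second method shows one possible remedy, trimming the header row itself so that both sides of the lookup agree.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderIndexDemo {

    // Mimics the failing lookup: each name is trimmed, then searched for
    // in the original (untrimmed) header row.
    static int[] lookupTrimmedInOriginal(List<String> headerRow) {
        int[] indexes = new int[headerRow.size()];
        for (int i = 0; i < headerRow.size(); i++) {
            String trimmed = headerRow.get(i).trim();
            indexes[i] = headerRow.indexOf(trimmed); // -1 if the original was padded
        }
        return indexes;
    }

    // One possible fix: trim the header row first, then look names up in it.
    static int[] lookupTrimmedInTrimmed(List<String> headerRow) {
        List<String> trimmedRow = new ArrayList<>();
        for (String name : headerRow) {
            trimmedRow.add(name.trim());
        }
        int[] indexes = new int[trimmedRow.size()];
        for (int i = 0; i < trimmedRow.size(); i++) {
            indexes[i] = trimmedRow.indexOf(trimmedRow.get(i));
        }
        return indexes;
    }

    public static void main(String[] args) {
        List<String> headerRow = Arrays.asList("c1", "c2 ", "c3");
        System.out.println(Arrays.toString(lookupTrimmedInOriginal(headerRow))); // [0, -1, 2]
        System.out.println(Arrays.toString(lookupTrimmedInTrimmed(headerRow)));  // [0, 1, 2]
    }
}
```

Passing the -1 from the first lookup to `ArrayList.get` is what produces the exception shown in the stack trace.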
Here is a dataset that can be used to reproduce the error:
https://finances.worldbank.org/Projects/2017-Climate-Investment-Funds-Scaling-Up-Renewable/vq9a-4dmu
Direct CSV link:
https://finances.worldbank.org/api/views/vq9a-4dmu/rows.csv?accessType=DOWNLOAD
|
I understand and agree with you that trimming is good for interactive
data analysis.
I'm more concerned with the case where one is programmatically working with
a large number of tables, and the original header name (potentially not
read through Tablesaw) is used to reference the column. In such a case,
although rare, unexpected errors could happen.
That being said, I would be OK with trimming if it is the easiest way to fix
the bug.
Note also that according to the CSV standard (RFC 4180), trimming is not
allowed: spaces are considered part of a field and should not be ignored.
So going forward, having an option to disable trimming would be good. The
Wikipedia article has some interesting points about the standard and common
practice:
https://en.m.wikipedia.org/wiki/Comma-separated_values
|
Trimming is not the easy way to fix the issue; it's no easier to trim both sides than it is to trim neither side. This is the right way to fix the issue for Tablesaw. Thanks again for the detailed report |
I still think that adding an option to Anyway, thanks for the quick fix! Are there any plans to release the next bug-fix version? |
As soon as the check for your support contract clears I'll be happy to do that. |
I'd be happy to contribute a pull request if this is something you'd want. |
@aecio Thank you for the offer. I appreciate any offers to help. As it stands, I'm reluctant to add functionality based on a hypothetical need at this point. There's a lot of code already and any number of non-hypothetical improvements we could use. AFAIK, no Tablesaw user has had the specific concern you mention, although I can imagine it happening. What do you think, @benmccann and/or @ryancerf? |
@aecio, to follow up. I happened to notice yesterday that there are options to ignore trailing whitespace and ignore leading whitespace on readOptions. It sounds like this is what you want. I think LMK if you're interested. Either way I will reopen this issue for getting this fixed. |
I'm curious where you are seeing this. I don't see anything related to trimming whitespace in |
@lwhite1 I could not find any method related to trimming whitespace in What I would like is something similar to the option To maintain backward compatibility, the default is to trim trailing and leading spaces. |
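An opt-out of this kind, with trimming kept as the default, could look roughly like the sketch below. Everything in it is hypothetical: `HeaderOptions`, `trimHeaders`, and `normalize` are illustrative names and not part of Tablesaw's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class HeaderOptionsSketch {

    // Hypothetical option holder; not an existing Tablesaw class.
    static final class HeaderOptions {
        final boolean trimHeaders;

        HeaderOptions(boolean trimHeaders) {
            this.trimHeaders = trimHeaders;
        }

        // The default trims, so existing behavior would stay the same.
        static HeaderOptions defaults() {
            return new HeaderOptions(true);
        }
    }

    // Applies the option once, so every later lookup sees the same names.
    static List<String> normalize(List<String> rawHeaders, HeaderOptions options) {
        List<String> names = new ArrayList<>();
        for (String raw : rawHeaders) {
            names.add(options.trimHeaders ? raw.trim() : raw);
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("c1", "c2 ", "c3");
        System.out.println(normalize(raw, HeaderOptions.defaults()).indexOf("c2"));   // 1
        System.out.println(normalize(raw, new HeaderOptions(false)).indexOf("c2 ")); // 1
    }
}
```

Normalizing the names in a single place means the name list and the index lookup cannot disagree, whichever setting is chosen.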
Closing this issue, since the bug is fixed and PR #686 has been closed. |