Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ArrayIndexOutOfBoundsException when reading CSV file #658

Closed
aecio opened this issue Sep 4, 2019 · 11 comments
Closed

ArrayIndexOutOfBoundsException when reading CSV file #658

aecio opened this issue Sep 4, 2019 · 11 comments

Comments

@aecio
Copy link
Contributor

aecio commented Sep 4, 2019

This bug is similar to #297 but it is not the same. This happens, when the CSV has a column header with an empty space, eg: c1,"c2 ","c3" (note the space in "c2 " header).

While reading the file, the method selectColumnNames() (in line String[] columnNames = selectColumnNames(headerRow, types)) returns trimmed strings which are then used to search over the original names of the columns in line columnIndexes[i] = headerRow.indexOf(columnNames[i]). Thus, it does not find the correct column index and returns -1, which ultimately causes an index out of bounds exception.

I don't think header names should be trimmed in this case where column headers are delimited by quotes.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
	at java.util.ArrayList.elementData(ArrayList.java:422)
	at java.util.ArrayList.get(ArrayList.java:435)
	at tech.tablesaw.io.AddCellToColumnException.<init>(AddCellToColumnException.java:63)
	at tech.tablesaw.io.FileReader.addRows(FileReader.java:143)
	at tech.tablesaw.io.FileReader.parseRows(FileReader.java:104)
	at tech.tablesaw.io.csv.CsvReader.read(CsvReader.java:89)
	at tech.tablesaw.io.csv.CsvReader.read(CsvReader.java:78)
	at tech.tablesaw.io.DataFrameReader.csv(DataFrameReader.java:156)
	at tech.tablesaw.io.DataFrameReader.csv(DataFrameReader.java:152)

Here is a dataset that can be used to reproduce the error: https://finances.worldbank.org/Projects/2017-Climate-Investment-Funds-Scaling-Up-Renewable/vq9a-4dmu
Direct CSV link: https://finances.worldbank.org/api/views/vq9a-4dmu/rows.csv?accessType=DOWNLOAD

@lwhite1
Copy link
Collaborator

lwhite1 commented Sep 5, 2019 via email

@aecio
Copy link
Contributor Author

aecio commented Sep 5, 2019 via email

lwhite1 added a commit that referenced this issue Sep 5, 2019
@lwhite1
Copy link
Collaborator

lwhite1 commented Sep 5, 2019

Trimming is not the easy way to fix the issue; it's no easier to trim both sides than it is to trim neither side. This is the right way to fix the issue for Tablesaw.

Thanks again for the detailed report

@aecio
Copy link
Contributor Author

aecio commented Sep 5, 2019

I still think that adding an option to CsvReadOptions to disable trimming would be useful to some people. :)

Anyway, thanks for the quick fix!

Are there any plans to release the next bug-fix version?

lwhite1 added a commit that referenced this issue Sep 6, 2019
* fixes ArrayIndexOutOfBoundsException when reading CSV file #658

trims headers consistently.

* fixing the potential NPE
@lwhite1
Copy link
Collaborator

lwhite1 commented Sep 6, 2019

I still think that adding an option to CsvReadOptions to disable trimming would be useful to some people. :)

As soon as the check for your support contract clears I'll be happy to do that.

@lwhite1 lwhite1 closed this as completed Sep 6, 2019
@aecio
Copy link
Contributor Author

aecio commented Sep 6, 2019

I'd be happy to contribute a pull request if this is something you'd want.

@lwhite1
Copy link
Collaborator

lwhite1 commented Sep 7, 2019

@aecio Thank you for the offer. I appreciate any offers to help.

As it stands, I'm reluctant to add functionality based on a hypothetical need at this point. There's a lot of code already and any number of non-hypothetical improvements we could use. AFAIK, no Tablesaw user has had the specific concern you mention, although i can imagine it happening.

What do you think, @benmccann and/or @ryancerf

@lwhite1
Copy link
Collaborator

lwhite1 commented Sep 11, 2019

@aecio, to follow up. I happened to notice yesterday that there are options to ignore trailing whitespace and ignore leading whitespace on readOptions. It sounds like this is what you want.
Based on my quick glance, it looks like the ignore leading whitespace method is not fully implemented (it never gets read from the options object).

I think
(a) it sound like this is what you wanted, more-or-less. Is that true?
(b) since the methods are there, we should make sure they work, so if you're still interested in doing this, I would love a PR that fixes the leading whitespace method and ensures the trailing method works as well.

LMK if you're interested. Either way I will reopen this issue for getting this fixed.

@lwhite1 lwhite1 reopened this Sep 11, 2019
@emilianbold
Copy link
Contributor

@aecio, to follow up. I happened to notice yesterday that there are options to ignore trailing whitespace and ignore leading whitespace on readOptions.

I'm curious where you are seeing this. I don't see anything related to trimming whitespace in CsvReadOptions or the parent class ReadOptions.

@aecio
Copy link
Contributor Author

aecio commented Oct 19, 2019

@lwhite1 I could not find any method related to trimming whitespace in CsvReadOptions as well, so I'm not sure exactly what you are proposing.

What I would like is something similar to the option ignoreLeadingWhitespacesInQuotes and ignoreTrailingWhitespacesInQuotes available in the underlying Univocity CSV parser. I opened the PR #686 related to this.

To maintain backward compatibility, the default is to trim trailing and leading spaces.

@aecio
Copy link
Contributor Author

aecio commented Jan 22, 2020

Closing this issue given that this issue is fixed and PR #686 has been closed.

@aecio aecio closed this as completed Jan 22, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants