ArrayIndexOutOfBoundsException when reading CSV file #658

aecio · 2019-09-04T23:16:29Z

This bug is similar to #297 but it is not the same. It happens when the CSV has a column header with a trailing space, e.g.: c1,"c2 ","c3" (note the space in the "c2 " header).

While reading the file, the method selectColumnNames() (in line String[] columnNames = selectColumnNames(headerRow, types)) returns trimmed strings which are then used to search over the original names of the columns in line columnIndexes[i] = headerRow.indexOf(columnNames[i]). Thus, it does not find the correct column index and returns -1, which ultimately causes an index out of bounds exception.
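
The mismatch can be reproduced without Tablesaw. The following is a minimal sketch, not the library's code: `trimAll` is a hypothetical stand-in for the trimming that selectColumnNames() is described as doing, and the header values mirror the example above.

```java
import java.util.Arrays;
import java.util.List;

class HeaderLookupSketch {

    // Hypothetical stand-in for the trimming described for selectColumnNames().
    static String[] trimAll(List<String> headerRow) {
        String[] names = new String[headerRow.size()];
        for (int i = 0; i < names.length; i++) {
            names[i] = headerRow.get(i).trim();
        }
        return names;
    }

    // Looks up each trimmed name in the untrimmed header row, as in the report.
    static int[] lookupIndexes(List<String> headerRow, String[] columnNames) {
        int[] columnIndexes = new int[columnNames.length];
        for (int i = 0; i < columnNames.length; i++) {
            columnIndexes[i] = headerRow.indexOf(columnNames[i]);
        }
        return columnIndexes;
    }

    public static void main(String[] args) {
        List<String> headerRow = Arrays.asList("c1", "c2 ", "c3");
        int[] indexes = lookupIndexes(headerRow, trimAll(headerRow));
        System.out.println(Arrays.toString(indexes)); // prints [0, -1, 2]
    }
}
```

The -1 for the second column is the value that later gets used as a list index.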

I don't think header names should be trimmed in this case where column headers are delimited by quotes.

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -1
	at java.util.ArrayList.elementData(ArrayList.java:422)
	at java.util.ArrayList.get(ArrayList.java:435)
	at tech.tablesaw.io.AddCellToColumnException.<init>(AddCellToColumnException.java:63)
	at tech.tablesaw.io.FileReader.addRows(FileReader.java:143)
	at tech.tablesaw.io.FileReader.parseRows(FileReader.java:104)
	at tech.tablesaw.io.csv.CsvReader.read(CsvReader.java:89)
	at tech.tablesaw.io.csv.CsvReader.read(CsvReader.java:78)
	at tech.tablesaw.io.DataFrameReader.csv(DataFrameReader.java:156)
	at tech.tablesaw.io.DataFrameReader.csv(DataFrameReader.java:152)

Here is a dataset that can be used to reproduce the error: https://finances.worldbank.org/Projects/2017-Climate-Investment-Funds-Scaling-Up-Renewable/vq9a-4dmu
Direct CSV link: https://finances.worldbank.org/api/views/vq9a-4dmu/rows.csv?accessType=DOWNLOAD


lwhite1 · 2019-09-05T01:05:01Z

Thanks for the bug report and the file to use when re-creating it. I will work on this tomorrow.

Given that most of Tablesaw relies on referencing columns by name, I think trimming is almost always a good idea. If you print the table or table structure you won't see the whitespace, and subsequent attempts to reference the column by name are likely to fail for no obvious reason. We shouldn't have names that look like "Foo" and are actually "Foo " (there's a trailing space on the second one).

This should never throw an ArrayIndexOutOfBoundsException, though. I'm inclined to "fix" it by ensuring the column names are trimmed on both sides, if that's possible. I'm not sure that's exactly the right solution, but it's better than the current state, and is probably better for most people, most of the time.

Do you have a use case for maintaining names with trailing whitespace? In general, names like that are an artifact of an upstream quality issue. I opened the file in Excel and checked a couple of columns. They didn't seem to have trailing whitespace, and I couldn't tell which one or ones did by glancing at the header.
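
One way to read "trimmed on both sides" is to trim the header row once and use the same trimmed list for both the column names and the index lookup. This is only a sketch under that assumption; `trimHeaders` and `columnIndexes` are hypothetical helpers, not Tablesaw methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ConsistentTrimSketch {

    // Trims every header once so that names and lookups see the same strings.
    static List<String> trimHeaders(List<String> rawHeaderRow) {
        List<String> trimmed = new ArrayList<>(rawHeaderRow.size());
        for (String name : rawHeaderRow) {
            trimmed.add(name.trim());
        }
        return trimmed;
    }

    // Resolves each trimmed name against the trimmed header row.
    static int[] columnIndexes(List<String> rawHeaderRow) {
        List<String> headerRow = trimHeaders(rawHeaderRow);
        int[] indexes = new int[headerRow.size()];
        for (int i = 0; i < indexes.length; i++) {
            indexes[i] = headerRow.indexOf(headerRow.get(i));
        }
        return indexes;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("c1", "c2 ", "c3");
        System.out.println(Arrays.toString(columnIndexes(raw))); // prints [0, 1, 2]
    }
}
```

Because `indexOf` returns the first match, two headers that differ only by surrounding whitespace would resolve to the same index in this sketch.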


aecio · 2019-09-05T03:01:42Z

I understand and agree with you that trimming is good for interactive data analysis. I'm more concerned with the case where one is programmatically working with a large number of tables, and the original header name (potentially not read through Tablesaw) is used to reference the column. In such a case, although rare, unexpected errors could happen. That being said, I would be OK with trimming if it is the easiest way to fix the bug. Note also that according to the CSV standard (RFC 4180), trimming is not allowed. So going forward, having an option to disable trimming would be good. The Wikipedia article has some interesting points about the standard and common practice: https://en.m.wikipedia.org/wiki/Comma-separated_values
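
To illustrate the programmatic case: if the names were trimmed on read, a header string obtained from elsewhere only matches after it is normalized the same way. The sketch below uses plain collections and a hypothetical `indexOfColumn` helper; it makes no claim about how Tablesaw resolves names.

```java
import java.util.Arrays;
import java.util.List;

class ExternalNameLookupSketch {

    // Finds a column by a name obtained elsewhere (e.g. from the raw CSV header),
    // trimming the key first so it matches names that were trimmed on read.
    static int indexOfColumn(List<String> trimmedNames, String externalName) {
        return trimmedNames.indexOf(externalName.trim());
    }

    public static void main(String[] args) {
        List<String> trimmedNames = Arrays.asList("c1", "c2", "c3");
        String originalHeader = "c2 "; // as it appears in the file

        System.out.println(trimmedNames.indexOf(originalHeader));        // prints -1
        System.out.println(indexOfColumn(trimmedNames, originalHeader)); // prints 1
    }
}
```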


lwhite1 · 2019-09-05T12:40:40Z

Trimming is not the easy way to fix the issue; it's no easier to trim both sides than it is to trim neither side. This is the right way to fix the issue for Tablesaw.

Thanks again for the detailed report.

aecio · 2019-09-05T23:51:16Z

I still think that adding an option to CsvReadOptions to disable trimming would be useful to some people. :)

Anyway, thanks for the quick fix!

Are there any plans to release the next bug-fix version?


lwhite1 · 2019-09-06T01:06:49Z

> I still think that adding an option to CsvReadOptions to disable trimming would be useful to some people. :)

As soon as the check for your support contract clears I'll be happy to do that.

aecio · 2019-09-06T23:55:22Z

I'd be happy to contribute a pull request if this is something you'd want.

lwhite1 · 2019-09-07T19:36:15Z

@aecio Thank you for the offer. I appreciate any offers to help.

As it stands, I'm reluctant to add functionality based on a hypothetical need at this point. There's a lot of code already and any number of non-hypothetical improvements we could use. AFAIK, no Tablesaw user has had the specific concern you mention, although I can imagine it happening.

What do you think, @benmccann and/or @ryancerf?

lwhite1 · 2019-09-11T14:51:35Z

@aecio, to follow up: I happened to notice yesterday that there are options to ignore trailing whitespace and ignore leading whitespace on readOptions. It sounds like this is what you want.
Based on my quick glance, it looks like the ignore-leading-whitespace method is not fully implemented (it never gets read from the options object).

I think:
(a) It sounds like this is what you wanted, more or less. Is that true?
(b) Since the methods are there, we should make sure they work, so if you're still interested in doing this, I would love a PR that fixes the leading-whitespace method and ensures the trailing method works as well.

LMK if you're interested. Either way I will reopen this issue for getting this fixed.

emilianbold · 2019-10-17T19:22:30Z

> @aecio, to follow up. I happened to notice yesterday that there are options to ignore trailing whitespace and ignore leading whitespace on readOptions.

I'm curious where you are seeing this. I don't see anything related to trimming whitespace in CsvReadOptions or the parent class ReadOptions.

aecio · 2019-10-19T19:09:08Z

@lwhite1 I could not find any method related to trimming whitespace in CsvReadOptions either, so I'm not sure exactly what you are proposing.

What I would like is something similar to the options ignoreLeadingWhitespacesInQuotes and ignoreTrailingWhitespacesInQuotes available in the underlying Univocity CSV parser. I opened PR #686 for this.

To maintain backward compatibility, the default is to trim trailing and leading spaces.
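
As a rough illustration of an opt-out with a backward-compatible default, here is a minimal sketch. The `Options` class, its `trimHeaders` flag, and `readHeaders` are hypothetical names invented for this example; they are not the API of CsvReadOptions, Univocity, or PR #686.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HeaderOptionsSketch {

    // Hypothetical option holder; not Tablesaw's CsvReadOptions.
    static final class Options {
        final boolean trimHeaders;

        private Options(boolean trimHeaders) {
            this.trimHeaders = trimHeaders;
        }

        // Default keeps the existing behavior: headers are trimmed.
        static Options defaults() {
            return new Options(true);
        }

        // Returns a copy with trimming switched on or off.
        Options withTrimHeaders(boolean trim) {
            return new Options(trim);
        }
    }

    // Applies the option to a raw header row.
    static List<String> readHeaders(List<String> rawHeaderRow, Options options) {
        List<String> names = new ArrayList<>(rawHeaderRow.size());
        for (String raw : rawHeaderRow) {
            names.add(options.trimHeaders ? raw.trim() : raw);
        }
        return names;
    }

    public static void main(String[] args) {
        List<String> raw = Arrays.asList("c1", "c2 ", "c3");
        System.out.println(readHeaders(raw, Options.defaults()));
        System.out.println(readHeaders(raw, Options.defaults().withTrimHeaders(false)));
    }
}
```

With the default options the second header comes back as "c2"; with trimming disabled it keeps its trailing space.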

aecio · 2020-01-22T19:57:01Z

Closing this issue, since the bug is fixed and PR #686 has been closed.

lwhite1 added a commit that referenced this issue Sep 5, 2019

fixes ArrayIndexOutOfBoundsException when reading CSV file #658

9cd8d09

trims headers consistently.

lwhite1 added a commit that referenced this issue Sep 6, 2019

fixes ArrayIndexOutOfBoundsException when reading CSV file #658 (#659)

e7a612a

* fixes ArrayIndexOutOfBoundsException when reading CSV file #658 trims headers consistently. * fixing the potential NPE

lwhite1 closed this as completed Sep 6, 2019

lwhite1 reopened this Sep 11, 2019

aecio closed this as completed Jan 22, 2020
