Delimiter detection incorrectly determines delimiter and leads to inconsistent record sizes since 2.9.1 #494

Open
pmaria opened this issue Feb 22, 2022 · 0 comments

Since release 2.9.1, automatic delimiter detection returns delimiters that lead to inconsistent record sizes.

Consider this single-column CSV example:

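  import com.univocity.parsers.csv.CsvParser;
  import com.univocity.parsers.csv.CsvParserSettings;
  import org.apache.commons.io.IOUtils;
  import org.junit.jupiter.api.Test;

  import java.nio.charset.StandardCharsets;

  class DelimiterDetectionTest {

    // Single-column input: '/' occurs in some rows only, never in the header.
    private static final String INPUT =
            "Name\n" +
            "http://example.com/company/Alice\n" +
            "Bob\n" +
            "Bob/Charles\n" +
            "path/../Danny\n" +
            "Emily Smith";

    @Test
    void test() {
      var settings = new CsvParserSettings();
      settings.setHeaderExtractionEnabled(true);
      settings.setLineSeparatorDetectionEnabled(true);
      settings.setDelimiterDetectionEnabled(true);
      var parser = new CsvParser(settings);

      // Print every record together with the number of values it was split into.
      parser.iterateRecords(IOUtils.toInputStream(INPUT, StandardCharsets.UTF_8))
          .forEach(record -> System.out.printf("record:[ %s ] --- size: %s%n", record, record.getValues().length));
    }
  }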

Running this returns:

record:[ http:, null, example.com, company, Alice ] --- size: 5
record:[ Bob ] --- size: 1
record:[ Bob, Charles ] --- size: 2
record:[ path, .., Danny ] --- size: 3
record:[ Emily Smith ] --- size: 1

Delimiter detection does not seem to take record-size consistency across rows into account (or no longer does), nor the related fact that the chosen delimiter does not occur in every row. Earlier releases did not have this problem.

IMO the delimiter candidate that leads to a consistent record size should be preferred over the most prevalent one.
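To illustrate that preference, here is a minimal sketch of a consistency-first check. It is not univocity's actual detection logic: the class `ConsistentDelimiterSketch` and the method `pickDelimiter` are hypothetical names, quoting and escaping are ignored, and the candidate list is an assumption.

```java
import java.util.List;

public class ConsistentDelimiterSketch {

    /** Counts how often a candidate delimiter occurs in one line (quotes are ignored for brevity). */
    public static int count(String line, char candidate) {
        int n = 0;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == candidate) {
                n++;
            }
        }
        return n;
    }

    /**
     * Returns the first candidate that occurs the same non-zero number of times
     * in every line, or null when no candidate splits all rows consistently
     * (in which case the input can be treated as single-column).
     */
    public static Character pickDelimiter(List<String> lines, char... candidates) {
        for (char candidate : candidates) {
            int expected = lines.isEmpty() ? 0 : count(lines.get(0), candidate);
            if (expected == 0) {
                continue; // candidate is absent from the first row, so it cannot be consistent
            }
            boolean consistent = true;
            for (String line : lines) {
                if (count(line, candidate) != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
                "Name",
                "http://example.com/company/Alice",
                "Bob",
                "Bob/Charles",
                "path/../Danny",
                "Emily Smith");
        // '/' and ' ' occur in some rows only, so no candidate qualifies here.
        System.out.println(pickDelimiter(rows, ',', ';', '/', ' '));
        // ';' occurs exactly once in every row.
        System.out.println(pickDelimiter(List.of("a;b", "c;d"), ',', ';'));
    }
}
```

With the input from this issue, no candidate occurs in every row, so the sketch falls back to a single column instead of picking `/`. A real implementation would still need a tie-breaker between several consistent candidates and some tolerance for malformed rows.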

(In this case, header detection could already indicate that there is only one column, so delimiter detection could be skipped.)
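The header shortcut could look roughly like the following. Again, this is only a sketch under assumptions: `HeaderShortcutSketch` and `headerLooksSingleColumn` are hypothetical names, not part of the library, and the check is only meaningful when header extraction is enabled and the header is unquoted.

```java
public class HeaderShortcutSketch {

    /** True when the header line contains none of the candidate delimiters. */
    public static boolean headerLooksSingleColumn(String headerLine, char... candidates) {
        for (char candidate : candidates) {
            if (headerLine.indexOf(candidate) >= 0) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // "Name" contains no candidate, so delimiter detection could be skipped.
        System.out.println(headerLooksSingleColumn("Name", ',', ';', '/', '\t'));
        // "id;name" contains ';', so detection should still run.
        System.out.println(headerLooksSingleColumn("id;name", ',', ';', '/', '\t'));
    }
}
```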
