Delimiter detection incorrectly determines delimiter and leads to inconsistent record sizes since 2.9.1 #494

pmaria · 2022-02-22T06:21:47Z

Since release 2.9.1, automatic delimiter detection returns delimiters that lead to inconsistent record sizes.

Consider this single-column CSV example.

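  import com.univocity.parsers.csv.CsvParser;
  import com.univocity.parsers.csv.CsvParserSettings;
  import java.nio.charset.StandardCharsets;
  import org.apache.commons.io.IOUtils;
  import org.junit.jupiter.api.Test;

  class DelimiterDetectionTest {

    // Single-column input: '/' and ' ' occur in some values, but not in every row.
    private static final String INPUT =
        "Name\n" +
        "http://example.com/company/Alice\n" +
        "Bob\n" +
        "Bob/Charles\n" +
        "path/../Danny\n" +
        "Emily Smith";

    @Test
    void test() {
      var settings = new CsvParserSettings();
      settings.setHeaderExtractionEnabled(true);
      settings.setLineSeparatorDetectionEnabled(true);
      settings.setDelimiterDetectionEnabled(true);
      var parser = new CsvParser(settings);

      // Print each parsed record together with its number of values.
      parser.iterateRecords(IOUtils.toInputStream(INPUT, StandardCharsets.UTF_8))
          .forEach(record -> System.out.printf("record:[ %s ] --- size: %s%n", record, record.getValues().length));
    }
  }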

Running this returns:

record:[ http:, null, example.com, company, Alice ] --- size: 5
record:[ Bob ] --- size: 1
record:[ Bob, Charles ] --- size: 2
record:[ path, .., Danny ] --- size: 3
record:[ Emily Smith ] --- size: 1

The delimiter detection does not seem to take into account whether record sizes are equal across rows, nor the related fact that the detected delimiter does not occur in every row. This was not a problem in earlier releases.

IMO the delimiter candidate that leads to a consistent record size should be preferred over the most prevalent one.

(In this case, header detection could already indicate that there is only one column, so delimiter detection could be skipped.)
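
To illustrate the preference suggested above, here is a minimal sketch of a consistency-based check. It is not how univocity-parsers detects delimiters; the class `ConsistentDelimiterSketch` and its methods `fieldCount` and `detect` are hypothetical names made up for this example, and quoting and escaping are ignored entirely. A candidate is accepted only if it occurs in the header row and splits every row into the same number of fields; otherwise no delimiter is reported and the input can be treated as a single column.

```java
import java.util.List;
import java.util.Optional;

// Hypothetical illustration only; not part of univocity-parsers.
class ConsistentDelimiterSketch {

    // Number of fields a line splits into for the given delimiter (no quote handling).
    static int fieldCount(String line, char delimiter) {
        int count = 1;
        for (int i = 0; i < line.length(); i++) {
            if (line.charAt(i) == delimiter) {
                count++;
            }
        }
        return count;
    }

    // Returns the first candidate that occurs in the header row and yields the
    // same field count for every row; empty if no candidate qualifies.
    static Optional<Character> detect(List<String> rows, char... candidates) {
        if (rows.isEmpty()) {
            return Optional.empty();
        }
        for (char candidate : candidates) {
            int expected = fieldCount(rows.get(0), candidate);
            if (expected < 2) {
                continue; // candidate does not occur in the header row
            }
            boolean consistent = true;
            for (String row : rows) {
                if (fieldCount(row, candidate) != expected) {
                    consistent = false;
                    break;
                }
            }
            if (consistent) {
                return Optional.of(candidate);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        List<String> rows = List.of(
            "Name",
            "http://example.com/company/Alice",
            "Bob",
            "Bob/Charles",
            "path/../Danny",
            "Emily Smith");
        // No candidate occurs in the header, so nothing is detected.
        System.out.println(detect(rows, ',', ';', '/', ' ')); // prints "Optional.empty"
    }
}
```

With the input from this issue, none of the candidates appears in the header row, so the sketch reports no delimiter and each line would be kept as a single value. A real implementation would also have to handle quoted values and decide how to rank several consistent candidates.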

nchammas mentioned this issue Feb 19, 2024

[SPARK-46862][SQL] Disable CSV column pruning in the multi-line mode apache/spark#44872

Closed
