New approach to CSV reading #1629

Merged · 1 commit into deephaven:main · Jan 11, 2022
Conversation

@kosak (Contributor) commented Dec 3, 2021

Part of epic #1750

@kosak kosak requested a review from jcferretti December 3, 2021 00:22
@kosak kosak self-assigned this Dec 3, 2021
@kosak kosak force-pushed the kosak_integrate-csv branch 2 times, most recently from e869711 to 5ea570b Compare December 10, 2021 08:30
@devinrsmith (Member) left a comment:

There is a lot here. I've skipped a lot of the implementation details (on that front, I think much of the implementation exposure should change from public to non-public where applicable).

There are specific things around licensing we need to account for, given the fast double parsing. It would be nice if we had a JMH project that demonstrated the measurable gain of the fast double parsing over the JDK double parsing. If this is something we need, I'm happy to help set it up. It may be a worthwhile project anyway, as I'd like to incorporate other microbenchmarks in other places as well.

I want to go over it at least one more time... I checked it out (had to fix a compile issue) to run some CSV files - it seemed to work for me, but I didn't go too deep.

    if (ih.bs().size() > 1) {
        ctx.isNullOrWidthOneSoFar = false;
    }
    chunk[chunkIndex++] = value * scale;
Member commented:
I think by implicitly multiplying by scale, we are artificially tying ourselves to the implementation details of how the engine's DateTime expects to work. I wonder if it's better to pass values as-is to the downstream sink?
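
For illustration, a minimal sketch of the two options under discussion (the names, the LongSink interface, and the main method here are hypothetical, not the actual code):

    public class ScaleSketch {
        // Option 1: scale inside the parser, hard-wiring the engine's
        // DateTime units into the CSV layer.
        static void writeScaled(long[] chunk, int i, long value, long scale) {
            chunk[i] = value * scale; // e.g. seconds -> nanoseconds
        }

        // Option 2: forward the raw parsed value; the downstream sink
        // owns the interpretation, including any scaling.
        interface LongSink { void accept(long rawValue); }
        static void writeRaw(LongSink sink, long value) {
            sink.accept(value);
        }

        public static void main(String[] args) {
            long[] chunk = new long[1];
            writeScaled(chunk, 0, 5, 1_000_000_000L);
            System.out.println(chunk[0]); // 5000000000
            writeRaw(v -> System.out.println(v * 1_000_000_000L), 5); // sink scales
        }
    }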

@kosak (Contributor, Author) replied:
I think I probably don't "get" the motivation here, so maybe this needs to be discussed further.
I don't see the CSV library as the right place for people to plug in arbitrary type converters.
If they have a column of longs and they want it transformed into something else, they should do that with Deephaven table "update" operations, not by plugging a new parser into their CSV reader (in my opinion).

Member replied:
Okay - I think that is a valid approach to take - in which case, we don't support arbitrary user-supplied parsers. That's different from the logic as it exists today, but I'm happy to discuss and support removal of that assumption.

@kosak (Contributor, Author) replied:
Support for custom parsing added.

Comment on lines 68 to 69
    // Put the user-specified parsers in precedence order.
    parsersToTry = Parsers.PRECEDENCE.stream().filter(parsers::contains).collect(Collectors.toList());
Member commented:
This looks like it will only use the parsers that are contained in PRECEDENCE, instead of sorting parsers. This seems wrong?

@kosak (Contributor, Author) replied Dec 14, 2021:
Yes, PRECEDENCE contains the universal list of all known parsers. In particular, users cannot define their own parsers. If users defining their own parsers is a "thing" we want, I'd like to understand more.
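
For illustration, the distinction being raised (PRECEDENCE here is a stand-in list, not the actual Parsers class):

    import java.util.*;
    import java.util.stream.*;

    public class PrecedenceSketch {
        public static void main(String[] args) {
            // Stand-in for the universal, ordered list of all known parsers.
            List<String> PRECEDENCE = List.of("bool", "int", "long", "double", "string");
            Set<String> userChoices = Set.of("double", "int");

            // The line under review: intersect PRECEDENCE with the user's set.
            // A user parser absent from PRECEDENCE would be silently dropped --
            // fine if PRECEDENCE is truly universal, wrong if users can supply
            // their own parsers.
            List<String> parsersToTry = PRECEDENCE.stream()
                    .filter(userChoices::contains)
                    .collect(Collectors.toList());
            System.out.println(parsersToTry); // [int, double]
        }
    }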

@kosak (Contributor, Author) replied:
Support for custom parsing added.

Comment on lines 3 to 6
    /*
     * @(#)FastDoubleParser.java
     * Copyright © 2021. Werner Randelshofer, Switzerland. MIT License.
     */
Member commented:
We likely need to move this to its own module and license as appropriate.

@kosak (Contributor, Author) replied:
Agree

     * This is a C++ to Java port of Daniel Lemire's fast_double_parser.
     * <p>
     * The code has been changed, so that it parses the same syntax as
     * {@link Double#parseDouble(String)}.
Member commented:

At the very least, I'd want to make sure that DoubleParser passes https://github.com/srisatish/openjdk/blob/master/jdk/test/java/lang/Double/ParseDouble.java
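
For example, a minimal differential test along these lines (JUnit 5; DoubleParser is the class under discussion, its exact API assumed):

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    public class DoubleParserDifferentialTest {
        @Test
        public void matchesJdkOnSampleInputs() {
            String[] inputs = {
                    "0", "-0.0", "3.14159", "1e308", "4.9e-324",
                    "2.2250738585072011e-308", "Infinity", "NaN"
            };
            for (String s : inputs) {
                // Expect bit-for-bit agreement with the JDK, including
                // NaN and the sign of zero.
                assertEquals(Double.parseDouble(s), DoubleParser.parseDouble(s), s);
            }
        }
    }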

@kosak (Contributor, Author) replied Dec 14, 2021:

I wouldn't think it's in scope for us to unit test someone else's library, but we can do it... and maybe we should. Dunno.

Member replied:

I'm happy to create a fork of https://github.com/wrandelshofer/FastDoubleParser and have unit tests live there (and potentially contribute back) if they don't exist already, as a way to bulk up the unit tests and have them live with the source library. (It would also be nice to publish it instead of bringing it into the codebase.)

Most of the bugs for https://github.com/wrandelshofer/FastDoubleParser/issues?q=is%3Aissue+is%3Aclosed seem to be around differences from the original fast double parser or differences from JDK parsing... from our side, I'd really want an easy switch back to JDK parsing.
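
A minimal sketch of what such a switch could look like (the system property name and this wrapper are hypothetical; the call assumes the ported library's static parseDouble(CharSequence) entry point):

    public final class DoubleParsing {
        // Escape hatch: -Dcsv.useJdkDoubleParser=true falls back to the JDK.
        private static final boolean USE_JDK =
                Boolean.getBoolean("csv.useJdkDoubleParser");

        public static double parse(CharSequence cs) {
            if (USE_JDK) {
                // JDK fallback; note it requires materializing a String.
                return Double.parseDouble(cs.toString());
            }
            return FastDoubleParser.parseDouble(cs);
        }
    }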

(Two review threads on extensions/csv/src/test/java/io/deephaven/csv/CsvTest.java were resolved as outdated.)
@kosak (Contributor, Author) commented Dec 14, 2021:

JMH project that demonstrated the measurable gain on using the fast double parsing

I don't know how to reply to a comment, so I will put it here. The major motivation for the third-party library was not faster double parsing per se, but that Java's built-in Double.parseDouble takes a String, not a CharSequence; forcing me to make what could be literally tens or even hundreds of millions of short-lived temporary java.lang.Strings is a total deal-killer. I can easily benchmark how bad that is for performance.
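
A sketch of the allocation problem being described (a contrived example, not the actual reader code):

    import java.nio.CharBuffer;

    public class AllocationSketch {
        public static void main(String[] args) {
            // A reused buffer standing in for the CSV reader's input window.
            CharBuffer buffer = CharBuffer.wrap("3.14159,2.71828");

            // Slicing out a field is cheap and copy-free...
            CharSequence field = buffer.subSequence(0, 7);

            // ...but Double.parseDouble(String) forces a temporary String per
            // field -- tens of millions of them on a large file.
            double d = Double.parseDouble(field.toString());
            System.out.println(d); // 3.14159

            // A parser accepting CharSequence (the point of the third-party
            // library) could read the slice directly, with no String at all.
        }
    }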

@kosak (Contributor, Author) commented Dec 14, 2021:

Note: I have addressed many of the issues brought up here (thank you, reviewers), but I didn't get to everything. I thought you might want an updated view of what I have so far.

@jcferretti (Member) left a comment:

Can you please do a test run from IntelliJ including code coverage, so that we get an idea of the coverage for the new files?

@kosak (Contributor, Author) commented Dec 18, 2021:

Can you please do a test run from IntelliJ including code coverage, so that we get an idea of the coverage for the new files?

OK, great idea. I've got some unreachable code so that (at least) means I need some more tests. Thank you.

@kosak (Contributor, Author) commented Dec 23, 2021:

OK, thanks to this I've added a ton more tests. Coverage is "pretty high" now.

@kosak (Contributor, Author) commented Dec 26, 2021:

I think I have addressed everything (or almost everything) that was asked for. If items are still open, they are waiting for you to either comment on them or mark them closed.

@pete-petey pete-petey added this to the Jan 2022 milestone Dec 29, 2021
@jcferretti (Member) commented Jan 4, 2022:

We discussed over Slack this example from Python:

    if inference is None:
        inference = INFERENCE_STANDARD

    csv_specs_builder = _JCsvSpecs.builder()

    # build the head spec
    table_header = _build_header(header)
    if table_header:
        csv_specs_builder.header(table_header)

    csv_specs = (csv_specs_builder.inference(inference)
                 .hasHeaderRow(not headless)
                 .delimiter(ord(delimiter))
                 .quote(ord(quote))
                 .ignoreSurroundingSpaces(ignore_surrounding_spaces)
                 .trim(trim)
                 .charset(_JCharset.forName(charset))
                 .build())

You mentioned there is no longer a mapping for the .charset method (since we are UTF-8 everywhere, all the time), and we agreed we should remove it. I am creating this comment to ensure we don't forget.

@kosak kosak changed the title WIP: New approach for CSV reading New approach for CSV reading Jan 5, 2022
@kosak kosak changed the title New approach for CSV reading New approach to CSV reading Jan 5, 2022
@kosak kosak force-pushed the kosak_integrate-csv branch 2 times, most recently from 31e95fb to 52ee70c Compare January 5, 2022 20:42
@kosak (Contributor, Author) commented Jan 6, 2022:

You mentioned there is no mapping anymore for the .charset method (since we are UTF-8 everywhere all the time), and we agreed we should remove it.

Removed.

@devinrsmith (Member) left a comment:

I don't have any bones to pick. Once Cristian is happy, let's merge. And then soon we can start work on externalizing it.

@jcferretti (Member) commented:
Fixes #1571
Fixes #1561

jcferretti previously approved these changes Jan 11, 2022
@kosak kosak merged commit a583699 into deephaven:main Jan 11, 2022
@github-actions github-actions bot locked and limited conversation to collaborators Jan 11, 2022
@kosak kosak deleted the kosak_integrate-csv branch June 23, 2022 03:12