New approach to CSV reading #1629

Merged · 1 commit into deephaven:main · Jan 11, 2022
Conversation

@kosak (Contributor) commented Dec 3, 2021

Part of epic #1750

@kosak kosak requested a review from jcferretti December 3, 2021 00:22
@kosak kosak self-assigned this Dec 3, 2021
@kosak kosak force-pushed the kosak_integrate-csv branch 2 times, most recently from e869711 to 5ea570b Compare December 10, 2021 08:30
@devinrsmith (Member) left a comment:

There is a lot here. I've skipped a lot of the implementation details (on that front, I think much of the implementation exposure should change from public to non-public where applicable).

There are specific things around licensing we need to account for, given the fast double parsing. It would be nice if we had a JMH project that demonstrated the measurable gain of the fast double parsing over the JDK double parsing. If this is something we need, I'm happy to help set it up. It may be a worthwhile project anyway, as I'd like to incorporate other microbenchmarks in other places as well.

I want to go over it at least one more time... I checked it out (had to fix a compile issue) to run some CSV files - it seemed to work for me, but I didn't go too deep.

    if (ih.bs().size() > 1) {
        ctx.isNullOrWidthOneSoFar = false;
    }
    chunk[chunkIndex++] = value * scale;
Member commented:
I think by implicitly multiplying by scale, we are artificially tying ourselves to the implementation details of how the engine's DateTime expects to work. I wonder if it's better to pass values as-is to the downstream sink?
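
For illustration, a minimal sketch of the two options under discussion (the names, the LongSink interface, and the main method here are hypothetical, not the actual code):

    public class ScaleSketch {
        // Option 1: scale inside the parser, hard-wiring the engine's
        // DateTime units into the CSV layer.
        static void writeScaled(long[] chunk, int i, long value, long scale) {
            chunk[i] = value * scale; // e.g. seconds -> nanoseconds
        }

        // Option 2: forward the raw parsed value; the downstream sink
        // owns the interpretation, including any scaling.
        interface LongSink { void accept(long rawValue); }
        static void writeRaw(LongSink sink, long value) {
            sink.accept(value);
        }

        public static void main(String[] args) {
            long[] chunk = new long[1];
            writeScaled(chunk, 0, 5, 1_000_000_000L);
            System.out.println(chunk[0]); // 5000000000
            writeRaw(v -> System.out.println(v * 1_000_000_000L), 5); // sink scales
        }
    }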

@kosak (Contributor, Author) replied:
I think I probably don't "get" the motivation here, so maybe this needs to be discussed further.
I don't see the CSV library as the right place for people to plug in arbitrary type converters.
If they have a column of longs and they want it transformed into something else, they should do that with Deephaven table "update" operations, not by plugging a new parser into their CSV reader (in my opinion).

Member replied:
Okay - I think that is a valid approach to take - in which case, we don't support arbitrary user-supplied parsers. That's different from the logic as it exists today, but I'm happy to discuss and support removal of that assumption.

@kosak (Contributor, Author) replied:
Support for custom parsing added.

Comment on lines 68 to 69
    // Put the user-specified parsers in precedence order.
    parsersToTry = Parsers.PRECEDENCE.stream().filter(parsers::contains).collect(Collectors.toList());
Member commented:
This looks like it will only use the parsers that are contained in PRECEDENCE, instead of sorting parsers. This seems wrong?

@kosak (Contributor, Author) replied Dec 14, 2021:
Yes, PRECEDENCE contains the universal list of all known parsers. In particular, users cannot define their own parsers. If users defining their own parsers is a "thing" we want, I'd like to understand more.
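
For illustration, the distinction being raised (PRECEDENCE here is a stand-in list, not the actual Parsers class):

    import java.util.*;
    import java.util.stream.*;

    public class PrecedenceSketch {
        public static void main(String[] args) {
            // Stand-in for the universal, ordered list of all known parsers.
            List<String> PRECEDENCE = List.of("bool", "int", "long", "double", "string");
            Set<String> userChoices = Set.of("double", "int");

            // The line under review: intersect PRECEDENCE with the user's set.
            // A user parser absent from PRECEDENCE would be silently dropped --
            // fine if PRECEDENCE is truly universal, wrong if users can supply
            // their own parsers.
            List<String> parsersToTry = PRECEDENCE.stream()
                    .filter(userChoices::contains)
                    .collect(Collectors.toList());
            System.out.println(parsersToTry); // [int, double]
        }
    }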

@kosak (Contributor, Author) replied:
Support for custom parsing added.

Comment on lines 3 to 6
    /*
     * @(#)FastDoubleParser.java
     * Copyright © 2021. Werner Randelshofer, Switzerland. MIT License.
     */
Member commented:
We likely need to move this to its own module and license as appropriate.

@kosak (Contributor, Author) replied:
Agree

     * This is a C++ to Java port of Daniel Lemire's fast_double_parser.
     * <p>
     * The code has been changed, so that it parses the same syntax as
     * {@link Double#parseDouble(String)}.
Member commented:

At the very least, I'd want to make sure that DoubleParser passes https://github.com/srisatish/openjdk/blob/master/jdk/test/java/lang/Double/ParseDouble.java
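
For example, a minimal differential test along these lines (JUnit 5; DoubleParser is the class under discussion, its exact API assumed):

    import static org.junit.jupiter.api.Assertions.assertEquals;
    import org.junit.jupiter.api.Test;

    public class DoubleParserDifferentialTest {
        @Test
        public void matchesJdkOnSampleInputs() {
            String[] inputs = {
                    "0", "-0.0", "3.14159", "1e308", "4.9e-324",
                    "2.2250738585072011e-308", "Infinity", "NaN"
            };
            for (String s : inputs) {
                // Expect bit-for-bit agreement with the JDK, including
                // NaN and the sign of zero.
                assertEquals(Double.parseDouble(s), DoubleParser.parseDouble(s), s);
            }
        }
    }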

@kosak (Contributor, Author) replied Dec 14, 2021:

I wouldn't think it's in scope for us to unit test someone else's library, but we can do it... and maybe we should. Dunno.

Member replied:

I'm happy to create a fork of https://github.com/wrandelshofer/FastDoubleParser and have unit tests live there (and potentially contribute back) if they don't exist already, as a way to bulk up the unit tests and have them live with the source library. (It would also be nice to publish it instead of bringing it into the codebase.)

Most of the bugs for https://github.com/wrandelshofer/FastDoubleParser/issues?q=is%3Aissue+is%3Aclosed seem to be around differences from the original fast double parser or differences from JDK parsing... from our side, I'd really want an easy switch back to JDK parsing.
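
A minimal sketch of what such a switch could look like (the system property name and this wrapper are hypothetical; the call assumes the ported library's static parseDouble(CharSequence) entry point):

    public final class DoubleParsing {
        // Escape hatch: -Dcsv.useJdkDoubleParser=true falls back to the JDK.
        private static final boolean USE_JDK =
                Boolean.getBoolean("csv.useJdkDoubleParser");

        public static double parse(CharSequence cs) {
            if (USE_JDK) {
                // JDK fallback; note it requires materializing a String.
                return Double.parseDouble(cs.toString());
            }
            return FastDoubleParser.parseDouble(cs);
        }
    }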

(Two review threads on extensions/csv/src/test/java/io/deephaven/csv/CsvTest.java were resolved as outdated.)
@kosak (Contributor, Author) commented Dec 14, 2021:

JMH project that demonstrated the measurable gain on using the fast double parsing

I don't know how to reply to a comment, so I will put it here. The major motivation for the third-party library was not faster double parsing per se, but that Java's built-in Double.parseDouble takes a String, not a CharSequence; forcing me to make what could be literally tens or even hundreds of millions of short-lived temporary java.lang.Strings is a total deal-killer. I can easily benchmark how bad that is for performance.
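
A sketch of the allocation problem being described (a contrived example, not the actual reader code):

    import java.nio.CharBuffer;

    public class AllocationSketch {
        public static void main(String[] args) {
            // A reused buffer standing in for the CSV reader's input window.
            CharBuffer buffer = CharBuffer.wrap("3.14159,2.71828");

            // Slicing out a field is cheap and copy-free...
            CharSequence field = buffer.subSequence(0, 7);

            // ...but Double.parseDouble(String) forces a temporary String per
            // field -- tens of millions of them on a large file.
            double d = Double.parseDouble(field.toString());
            System.out.println(d); // 3.14159

            // A parser accepting CharSequence (the point of the third-party
            // library) could read the slice directly, with no String at all.
        }
    }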

@kosak (Contributor, Author) commented Dec 14, 2021:

Note: I have addressed many of the issues brought up here (thank you, reviewers), but I didn't get to everything. I thought you might want an updated view of what I have so far.

@jcferretti (Member) left a comment:

Can you please do a test run from IntelliJ including code coverage, so that we get an idea of the coverage for the new files?

@kosak (Contributor, Author) commented Dec 18, 2021:

Can you please do a test run from IntelliJ including code coverage, so that we get an idea of the coverage for the new files?

OK, great idea. I've got some unreachable code so that (at least) means I need some more tests. Thank you.

@kosak (Contributor, Author) commented Dec 23, 2021:

OK, thanks to this I've added a ton more tests. Coverage is "pretty high" now.

@kosak (Contributor, Author) commented Dec 26, 2021:

I think I have addressed everything (or almost everything) that was asked for. If items are still open, they are waiting for you to either comment on them or mark them closed.

@pete-petey pete-petey added this to the Jan 2022 milestone Dec 29, 2021
@jcferretti (Member) commented Jan 4, 2022:

We discussed over Slack this example from Python:

    if inference is None:
        inference = INFERENCE_STANDARD

    csv_specs_builder = _JCsvSpecs.builder()

    # build the head spec
    table_header = _build_header(header)
    if table_header:
        csv_specs_builder.header(table_header)

    csv_specs = (csv_specs_builder.inference(inference)
                 .hasHeaderRow(not headless)
                 .delimiter(ord(delimiter))
                 .quote(ord(quote))
                 .ignoreSurroundingSpaces(ignore_surrounding_spaces)
                 .trim(trim)
                 .charset(_JCharset.forName(charset))
                 .build())

You mentioned there is no longer a mapping for the .charset method (since we are UTF-8 everywhere, all the time), and we agreed we should remove it. I am creating this comment to ensure we don't forget.

@kosak kosak changed the title WIP: New approach for CSV reading New approach for CSV reading Jan 5, 2022
@kosak kosak changed the title New approach for CSV reading New approach to CSV reading Jan 5, 2022
@kosak kosak force-pushed the kosak_integrate-csv branch 2 times, most recently from 31e95fb to 52ee70c Compare January 5, 2022 20:42
@kosak (Contributor, Author) commented Jan 6, 2022:

You mentioned there is no mapping anymore for the .charset method (since we are UTF-8 everywhere all the time), and we agreed we should remove it.

Removed.

@devinrsmith (Member) left a comment:

I don't have any bones to pick. Once Cristian is happy, let's merge. And then soon we can start work on externalizing it.

@jcferretti (Member) commented:
Fixes #1571
Fixes #1561

jcferretti previously approved these changes Jan 11, 2022
@kosak kosak merged commit a583699 into deephaven:main Jan 11, 2022
@github-actions github-actions bot locked and limited conversation to collaborators Jan 11, 2022
@kosak kosak deleted the kosak_integrate-csv branch June 23, 2022 03:12