Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ignoreZeroDecimal to ReadOptions #748

Merged
merged 24 commits into from
Apr 13, 2020
Merged

Add ignoreZeroDecimal to ReadOptions #748

merged 24 commits into from
Apr 13, 2020

Conversation

larshelge
Copy link
Contributor

@larshelge larshelge commented Jan 30, 2020

Description

Introduces new option ignoreZeroDecimal to ReadOptions. This option controls whether a numeric value ending with ".0" may be considered an integer (or short, long). Default value is true and retains the current behavior. If set to false, values ending with ".0" will be not be considered an integer and instead be considered a floating point.

Since different software/systems have different opinions on this matter, an option to control this behavior could be justified. Will be useful when using TableSaw together with other systems.

The logic could be centralized in AbstractColumnParser if that is preferred. Open for other variable naming suggestions.

Relates to issue #747.

Testing

Tests added to CsvReaderTest and StringUtilsTest.

@larshelge larshelge changed the title Add trimZeroDecimals option to ReadOptions Add ignoreZeroDecimals option to ReadOptions Jan 30, 2020
@larshelge larshelge changed the title Add ignoreZeroDecimals option to ReadOptions Add ignoreZeroDecimals to ReadOptions Jan 30, 2020
@benmccann benmccann requested a review from lwhite1 January 30, 2020 22:27
@larshelge larshelge changed the title Add ignoreZeroDecimals to ReadOptions Add zeroDecimalAsFloat to ReadOptions Feb 1, 2020
@larshelge
Copy link
Contributor Author

larshelge commented Feb 7, 2020

Fully appreciating that people are busy, I was just wondering whether you may consider including this patch, or if I should start working on an external work-around?

@@ -25,7 +28,7 @@ public boolean canParse(String str) {
}
String s = str;
try {
if (s.endsWith(".0")) {
if (!zeroDecimalAsFloat && s.endsWith(".0")) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a little weird that it only works with a single 0. What if it was .00? I know you didn't write that part, but it might be a good opportunity to clean it up

Copy link
Contributor Author

@larshelge larshelge Feb 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. I have updated the PR now.

I added a function to StringUtils so that we can centralize the logic. I opted for a regex with a compiled pattern for performance reasons. I combined the check for zero decimal with the remove operation as it is simpler and performance will be the same. I can put this method in another util class if that is better.

This made the FixedWidthReaderTest#testDataTypeDetection test fail which is legitimate as the test data contains data with two zero decimals, so I updated the test.

I plan to add another test once the approach is considered acceptable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems fine. Would you mind adding the test?

Copy link
Contributor Author

@larshelge larshelge Mar 11, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Many thanks @lwhite1. I have added tests now. These cover both the column type detection feature and getting columns and their types after reading the data into a table. There are tests asserting behavior when ignoreZeroDecimal is disabled as well as enabled. Let me know if the tests look appropriate.

@larshelge larshelge changed the title Add zeroDecimalAsFloat to ReadOptions Add ignoreZeroDecimal to ReadOptions Feb 9, 2020
@larshelge
Copy link
Contributor Author

Would it be possible to kindly get a timeline for when a review of this PR could take place? I am happy to change the approach to what the reviewer find most appropriate.

@lwhite1
Copy link
Collaborator

lwhite1 commented Feb 21, 2020

I guess I don't see why you can't read the file as a double or float column and then convert it to an int column. The asIntColumn() method uses a cast

  public IntColumn asIntColumn() {
    IntColumn result = IntColumn.create(name());
    for (double d : data) {
      if (DoubleColumnType.valueIsMissing(d)) {
        result.appendMissing();
      } else {
        result.append((int) d);
      }
    }
    return result;
  }

DoubleColumn also implements asShortColumn and asLongColumn.

One advantage to doing this work here as opposed to in the reader is that most numeric operators return a double column, which you may want to convert to an integral value.

@benmccann
Copy link
Collaborator

Yeah, I'd debated that as well. You could generally convert after the fact, but it's probably enough extra work that I didn't think this approach was too bad.

Also, there might be circumstances where the current approach makes you lose information. Right now we always strip ".0" and treat as an int. But what if you had a data source where some columns had trailing zeros and others didn't and you want to retain that information? Then I think you'd need to turn off the automatic stripping of ".0". I think that's sort of what was being suggested in #747

@larshelge
Copy link
Contributor Author

Many thanks for the comments, really appreciate it.

Just wanted to mention that the original use-case for #747 is data type detection, where for various reasons one would like to consider numbers ending in zero-decimals only as type double.

Converting the column to int after the fact is not really an option as we do not know the data type of the data files/columns up front, as they are uploaded by users. We are using TableSaw to
inform us what data type each column has.

Note that this PR will also coincidentally solve the problem of #732.

@larshelge
Copy link
Contributor Author

I have added tests. @lwhite1 would it be possible to kindly ask for a review?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants