Add ignoreZeroDecimal to ReadOptions #748

larshelge · 2020-01-30T22:22:01Z

Tick to sign-off your agreement to the Developer Certificate of Origin (DCO) 1.1

Description

Introduces a new option, ignoreZeroDecimal, to ReadOptions. This option controls whether a numeric value ending with ".0" may be considered an integer (or short, long). The default value is true, which retains the current behavior. If set to false, values ending with ".0" will not be considered integers and will instead be treated as floating point.
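
As a rough illustration of the semantics described above (not the actual IntParser code), the per-value check could look like the sketch below. The class `IntDetectionSketch` and the method `canParseAsInt` are hypothetical names invented for this example; only the option name `ignoreZeroDecimal` comes from this PR.

```java
final class IntDetectionSketch {

  // Hypothetical helper, not part of Tablesaw: reports whether a raw string
  // would be accepted as an integer for a given value of the option.
  static boolean canParseAsInt(String raw, boolean ignoreZeroDecimal) {
    if (raw == null) {
      return false;
    }
    String s = raw.trim();
    // With the option enabled (the default), a trailing ".0" is dropped
    // before the integer check, which corresponds to the existing behavior.
    if (ignoreZeroDecimal && s.endsWith(".0")) {
      s = s.substring(0, s.length() - 2);
    }
    try {
      Integer.parseInt(s);
      return true;
    } catch (NumberFormatException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println(canParseAsInt("42.0", true));  // true
    System.out.println(canParseAsInt("42.0", false)); // false
  }
}
```

With the option disabled, "42.0" is rejected by the integer check, so a type detector would fall through to a floating-point type for that value.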

Since different software/systems have different opinions on this matter, an option to control this behavior could be justified. Will be useful when using TableSaw together with other systems.

The logic could be centralized in AbstractColumnParser if that is preferred. Open for other variable naming suggestions.

Relates to issue #747.

Testing

Tests added to CsvReaderTest and StringUtilsTest.

larshelge · 2020-02-07T07:33:51Z

Fully appreciating that people are busy, I was just wondering whether you may consider including this patch, or if I should start working on an external work-around?

benmccann · 2020-02-07T17:03:29Z

core/src/main/java/tech/tablesaw/columns/numbers/IntParser.java

@@ -25,7 +28,7 @@ public boolean canParse(String str) {
    }
    String s = str;
    try {
-      if (s.endsWith(".0")) {
+      if (!zeroDecimalAsFloat && s.endsWith(".0")) {


It's a little weird that it only works with a single 0. What if it was .00? I know you didn't write that part, but it might be a good opportunity to clean it up

larshelge · 2020-02-09

Agreed. I have updated the PR now.

I added a function to StringUtils so that we can centralize the logic. I opted for a regex with a compiled pattern for performance reasons. I combined the check for zero decimal with the remove operation as it is simpler and performance will be the same. I can put this method in another util class if that is better.

This made the FixedWidthReaderTest#testDataTypeDetection test fail which is legitimate as the test data contains data with two zero decimals, so I updated the test.

I plan to add another test once the approach is considered acceptable.
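
A minimal sketch of what such a helper might look like, assuming a pattern that matches a dot followed by one or more zeros at the end of the string. The method name `removeZeroDecimal` is taken from the commit list; the class name `ZeroDecimalSketch` and the exact regex are assumptions, not necessarily the merged code.

```java
import java.util.regex.Pattern;

final class ZeroDecimalSketch {

  // Compiled once and reused across calls. Matches a dot followed by one or
  // more zeros at the end of the input, so ".0" and ".00" are both covered.
  private static final Pattern ZERO_DECIMAL = Pattern.compile("\\.0+$");

  // Combines the check and the removal: if there is no zero decimal,
  // the input is returned unchanged.
  static String removeZeroDecimal(String str) {
    if (str == null || str.isEmpty()) {
      return str;
    }
    return ZERO_DECIMAL.matcher(str).replaceFirst("");
  }
}
```

Under this sketch, "12.0" and "12.00" both become "12", while "12.50" and "12.05" are left as they are.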

lwhite1 · 2020-02-23

This seems fine. Would you mind adding the test?

larshelge · 2020-03-11

Many thanks @lwhite1. I have added tests now. These cover both the column type detection feature and getting columns and their types after reading the data into a table. There are tests asserting behavior when ignoreZeroDecimal is disabled as well as enabled. Let me know if the tests look appropriate.

larshelge · 2020-02-20T06:49:19Z

Would it be possible to kindly get a timeline for when a review of this PR could take place? I am happy to change the approach to whatever the reviewer finds most appropriate.

lwhite1 · 2020-02-21T15:48:52Z

I guess I don't see why you can't read the file as a double or float column and then convert it to an int column. The asIntColumn() method uses a cast:

  public IntColumn asIntColumn() {
    IntColumn result = IntColumn.create(name());
    for (double d : data) {
      if (DoubleColumnType.valueIsMissing(d)) {
        result.appendMissing();
      } else {
        result.append((int) d);
      }
    }
    return result;
  }

DoubleColumn also implements asShortColumn and asLongColumn.

One advantage to doing this work here as opposed to in the reader is that most numeric operators return a double column, which you may want to convert to an integral value.
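
One property of this route worth keeping in mind: a Java `(int)` cast truncates toward zero rather than rounding, so the conversion is only lossless when every value is already integral. The standalone sketch below uses plain arrays instead of Tablesaw columns; the class name, method name, the use of NaN for missing values, and the `MISSING_INT` marker are illustrative assumptions.

```java
final class DoubleToIntSketch {

  // Assumed marker for a missing int in this sketch only.
  static final int MISSING_INT = Integer.MIN_VALUE;

  // Same shape as the loop above: missing values are carried over,
  // everything else goes through a narrowing cast.
  static int[] toInts(double[] data) {
    int[] result = new int[data.length];
    for (int i = 0; i < data.length; i++) {
      // NaN stands in for "missing" here; the cast drops the fraction.
      result[i] = Double.isNaN(data[i]) ? MISSING_INT : (int) data[i];
    }
    return result;
  }
}
```

For example, 2.0 and 2.9 both map to 2, and -2.9 maps to -2, so any non-zero fractional part is silently discarded.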

benmccann · 2020-02-21T17:39:27Z

Yeah, I'd debated that as well. You could generally convert after the fact, but it's probably enough extra work that I didn't think this approach was too bad.

Also, there might be circumstances where the current approach makes you lose information. Right now we always strip ".0" and treat as an int. But what if you had a data source where some columns had trailing zeros and others didn't and you want to retain that information? Then I think you'd need to turn off the automatic stripping of ".0". I think that's sort of what was being suggested in #747

larshelge · 2020-02-22T13:48:27Z

Many thanks for the comments, really appreciate it.

Just wanted to mention that the original use-case for #747 is data type detection, where for various reasons one would like to consider numbers ending in zero-decimals only as type double.

Converting the column to int after the fact is not really an option as we do not know the data type of the data files/columns up front, as they are uploaded by users. We are using TableSaw to inform us what data type each column has.
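
To make the detection use-case concrete, here is a simplified column-level sketch. It is not Tablesaw's detection algorithm; `ColumnTypeSketch`, `Detected` and `detect` are hypothetical names, and only three candidate types are considered.

```java
import java.util.List;

final class ColumnTypeSketch {

  enum Detected { INTEGER, DOUBLE, STRING }

  // Picks the narrowest of the three types that fits every value.
  // When ignoreZeroDecimal is false, "1.0" no longer counts as an integer.
  static Detected detect(List<String> values, boolean ignoreZeroDecimal) {
    boolean allInts = true;
    for (String value : values) {
      String s = value.trim();
      if (!isDouble(s)) {
        return Detected.STRING;
      }
      String candidate = ignoreZeroDecimal ? s.replaceFirst("\\.0+$", "") : s;
      if (!isInt(candidate)) {
        allInts = false;
      }
    }
    return allInts ? Detected.INTEGER : Detected.DOUBLE;
  }

  private static boolean isInt(String s) {
    try {
      Integer.parseInt(s);
      return true;
    } catch (NumberFormatException e) {
      return false;
    }
  }

  private static boolean isDouble(String s) {
    try {
      Double.parseDouble(s);
      return true;
    } catch (NumberFormatException e) {
      return false;
    }
  }
}
```

In this sketch a column containing "1.0" and "2.0" is reported as INTEGER with the option enabled and as DOUBLE with it disabled, which is the distinction the use-case relies on.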

Note that this PR will also coincidentally solve the problem of #732.

larshelge · 2020-04-06T06:13:53Z

I have added tests. @lwhite1 would it be possible to kindly ask for a review?

larshelge added 2 commits January 30, 2020 23:13

Add trimZeroDecimals option to ReadOptions

57fedd9

Rename option

9a988eb

larshelge changed the title ~~Add trimZeroDecimals option to ReadOptions~~ Add ignoreZeroDecimals option to ReadOptions Jan 30, 2020

larshelge changed the title ~~Add ignoreZeroDecimals option to ReadOptions~~ Add ignoreZeroDecimals to ReadOptions Jan 30, 2020

larshelge mentioned this pull request Jan 30, 2020

Column type detection strategy for numbers ending with .0 #747

Closed

benmccann requested a review from lwhite1 January 30, 2020 22:27

larshelge added 3 commits January 31, 2020 12:48

Add javadoc

c298633

Add builder override

4ddff8d

Rename property

9d9655a

larshelge changed the title ~~Add ignoreZeroDecimals to ReadOptions~~ Add zeroDecimalAsFloat to ReadOptions Feb 1, 2020

Add builder override

37fe24f

benmccann reviewed Feb 7, 2020

larshelge added 5 commits February 9, 2020 12:08

Rename property to ignoreZeroDecimal

cff77dc

Fix generics issue

466b822

Add StringUtils.removeZeroDecimal

9642342

Updated javadoc

0a6f8ea

Updated javadoc

a8e8549

larshelge changed the title ~~Add zeroDecimalAsFloat to ReadOptions~~ Add ignoreZeroDecimal to ReadOptions Feb 9, 2020

larshelge added 6 commits March 11, 2020 11:33

Add tests for number detection and ignoreZeroDecimal option

ca9fec2

Reformat code

979fb10

Add tests for table column types

a447f41

Update test

2eeb06f

Update test

371c8ca

Update test

dc4d606

larshelge commented Mar 11, 2020

larshelge added 4 commits March 11, 2020 12:48

Update test name

3c8c637

Update test names

758bed0

Update test names

8226467

Merge branch 'master' of https://github.com/jtablesaw/tablesaw into i…

111b293

…nteger-parsing-option

benmccann reviewed Mar 23, 2020

benmccann reviewed Mar 23, 2020

larshelge added 2 commits March 24, 2020 09:17

Clean up tests

ca254d6

Merge branch 'master' into integer-parsing-option

dce36d9

benmccann reviewed Apr 9, 2020

Remove two tests

b1bbc89

benmccann merged commit 4658e63 into jtablesaw:master Apr 13, 2020

benmccann mentioned this pull request Apr 13, 2020

CSV import: IntParser removes decimal when value ends with ".0" during parsing #732

Closed
