-
Notifications
You must be signed in to change notification settings - Fork 646
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ignoreZeroDecimal to ReadOptions #748
Add ignoreZeroDecimal to ReadOptions #748
Conversation
Fully appreciating that people are busy, I was just wondering whether you may consider including this patch, or if I should start working on an external work-around? |
core/src/main/java/tech/tablesaw/columns/numbers/IntParser.java
Outdated
Show resolved
Hide resolved
@@ -25,7 +28,7 @@ public boolean canParse(String str) { | |||
} | |||
String s = str; | |||
try { | |||
if (s.endsWith(".0")) { | |||
if (!zeroDecimalAsFloat && s.endsWith(".0")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little weird that it only works with a single 0. What if it was .00
? I know you didn't write that part, but it might be a good opportunity to clean it up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I have updated the PR now.
I added a function to StringUtils
so that we can centralize the logic. I opted for a regex with a compiled pattern for performance reasons. I combined the check for zero decimal with the remove operation as it is simpler and performance will be the same. I can put this method in another util class if that is better.
This made the FixedWidthReaderTest#testDataTypeDetection
test fail which is legitimate as the test data contains data with two zero decimals, so I updated the test.
I plan to add another test once the approach is considered acceptable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems fine. Would you mind adding the test?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Many thanks @lwhite1. I have added tests now. These cover both the column type detection feature and getting columns and their types after reading the data into a table. There are tests asserting behavior when ignoreZeroDecimal
is disabled as well as enabled. Let me know if the tests look appropriate.
Would it be possible to kindly get a timeline for when a review of this PR could take place? I am happy to change the approach to what the reviewer find most appropriate. |
I guess I don't see why you can't read the file as a double or float column and then convert it to an int column. The asIntColumn() method uses a cast
DoubleColumn also implements asShortColumn and asLongColumn. One advantage to doing this work here as opposed to in the reader is that most numeric operators return a double column, which you may want to convert to an integral value. |
Yeah, I'd debated that as well. You could generally convert after the fact, but it's probably enough extra work that I didn't think this approach was too bad. Also, there might be circumstances where the current approach makes you lose information. Right now we always strip ".0" and treat as an int. But what if you had a data source where some columns had trailing zeros and others didn't and you want to retain that information? Then I think you'd need to turn off the automatic stripping of ".0". I think that's sort of what was being suggested in #747 |
Many thanks for the comments, really appreciate it. Just wanted to mention that the original use-case for #747 is data type detection, where for various reasons one would like to consider numbers ending in zero-decimals only as type double. Converting the column to int after the fact is not really an option as we do not know the data type of the data files/columns up front, as they are uploaded by users. We are using TableSaw to Note that this PR will also coincidentally solve the problem of #732. |
I have added tests. @lwhite1 would it be possible to kindly ask for a review? |
Description
Introduces new option
ignoreZeroDecimal
toReadOptions
. This option controls whether a numeric value ending with ".0" may be considered an integer (or short, long). Default value istrue
and retains the current behavior. If set to false, values ending with ".0" will be not be considered an integer and instead be considered a floating point.Since different software/systems have different opinions on this matter, an option to control this behavior could be justified. Will be useful when using TableSaw together with other systems.
The logic could be centralized in
AbstractColumnParser
if that is preferred. Open for other variable naming suggestions.Relates to issue #747.
Testing
Tests added to
CsvReaderTest
andStringUtilsTest
.