Unique value count #150
(cherry picked from commit 164cc94)
(cherry picked from commit cf108a7)
Hmmm, one caveat here: there's a maximum on the number of unique values that WhiteRabbit will keep in memory: https://github.com/OHDSI/WhiteRabbit/blob/master/src/org/ohdsi/whiteRabbit/scan/SourceDataScan.java#L49
The default number of rows to scan (100,000) is the same as this maximum, but if people set the number of rows higher these new statistics might suggest fewer unique values than actually encountered.
I suggest that if the number of unique values in memory is equal to the maximum (or maybe we set a flag if the max is hit), we modify the statistics here to say 'greater than or equal'. So for example:
N unique >= 100,000, Fraction unique >= 10%.
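The suggestion above could be sketched roughly as follows. This is only an illustration, not WhiteRabbit's actual code: the class `CappedUniqueCounter`, its methods, and the exact flag semantics (set only when a new value is actually dropped) are assumptions made for the example.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical sketch: keeps at most maxValues distinct values and flags when the cap is hit. */
final class CappedUniqueCounter {
    private final int maxValues;
    private final Set<String> seen = new HashSet<>();
    private boolean capReached = false;
    private long rowCount = 0;

    CappedUniqueCounter(int maxValues) {
        this.maxValues = maxValues;
    }

    void add(String value) {
        rowCount++;
        if (seen.contains(value)) {
            return; // already counted
        }
        if (seen.size() < maxValues) {
            seen.add(value);
        } else {
            capReached = true; // a new value had to be dropped, so the count is a lower bound
        }
    }

    /** Formats the statistics, prefixing ">= " when the cap was hit. */
    String summary() {
        String prefix = capReached ? ">= " : "";
        long percent = rowCount == 0 ? 0 : Math.round(100.0 * seen.size() / rowCount);
        return String.format(Locale.US, "N unique %s%,d, Fraction unique %s%d%%",
                prefix, seen.size(), prefix, percent);
    }

    public static void main(String[] args) {
        CappedUniqueCounter counter = new CappedUniqueCounter(3);
        for (String v : new String[] {"a", "b", "a", "c", "d", "e"}) {
            counter.add(v);
        }
        System.out.println(counter.summary());
    }
}
```

Because values beyond the cap are simply ignored here, the reported count can only undershoot the true number, which is why '>=' is the appropriate marker for this particular scheme.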
In that case, the number of counted unique values will actually be greater than the actual number. The discarded values are potentially counted multiple times. See these lines: …

Let's indeed go with your proposal. At most 100,000 unique values should be counted, and if that limit is hit, report the stats as you mentioned (>= 100,000, >= xx%). We should be able to reuse the …

Ah, I now see you introduced the …

My initial implementation indeed used … (WhiteRabbit/src/org/ohdsi/whiteRabbit/scan/SourceDataScan.java, lines 343 to 346 in d8ff935).
…re trimmed due to too many values in memory
…abbit into unique-stat2
Amended as suggested by @schuemie. The unique count and unique fraction will now be marked with '<=' when the unique count is more than the …

By the way, what is the reason that the trimming is done only once? This does not guarantee that the size of the … (WhiteRabbit/src/org/ohdsi/whiteRabbit/scan/SourceDataScan.java, lines 436 to 439 in d8ff935).
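To make the overcounting and the trimming question concrete, here is a minimal sketch under stated assumptions. It is not the code in `SourceDataScan.java`: the class `TrimmedValueCounter`, the least-frequent-first trimming rule, and the choice to trim every time the cap is exceeded are all hypothetical. It shows that a value discarded by trimming is tallied again when it reappears, so the tally is an upper bound and '<=' is the fitting marker.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: bounded value-count map with a running tally of "new" values. */
final class TrimmedValueCounter {
    private final int maxValues;
    private final Map<String, Integer> counts = new HashMap<>();
    private long uniqueTally = 0;
    private boolean trimmed = false;

    TrimmedValueCounter(int maxValues) {
        this.maxValues = maxValues;
    }

    void add(String value) {
        if (!counts.containsKey(value)) {
            uniqueTally++; // a previously trimmed value is tallied again here
        }
        counts.merge(value, 1, Integer::sum);
        if (counts.size() > maxValues) {
            trimLeastFrequent(); // trim on every overflow so the map never stays above the cap
        }
    }

    /** Removes one entry with the lowest count (assumed trimming rule). */
    private void trimLeastFrequent() {
        String victim = null;
        int lowest = Integer.MAX_VALUE;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() < lowest) {
                lowest = entry.getValue();
                victim = entry.getKey();
            }
        }
        if (victim != null) {
            counts.remove(victim);
            trimmed = true;
        }
    }

    long uniqueTally() {
        return uniqueTally;
    }

    int valuesInMemory() {
        return counts.size();
    }

    /** Marks the tally as an upper bound once any value has been trimmed. */
    String report() {
        return (trimmed ? "<= " : "") + uniqueTally;
    }

    public static void main(String[] args) {
        TrimmedValueCounter counter = new TrimmedValueCounter(2);
        for (String v : new String[] {"a", "a", "b", "b", "c", "c"}) {
            counter.add(v);
        }
        // Three distinct values were fed in; "c" is trimmed and then tallied a second time.
        System.out.println(counter.report());
    }
}
```

Trimming on every overflow keeps the in-memory map at or below the cap at all times, at the cost of a scan per overflow; a real implementation would likely trim in larger batches.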
Looks good, some minor comments
This change adds the number of unique values in the overview tab, together with the fraction unique (100% meaning a column with only unique values, 0% with a constant value).
Also, the fraction columns are now styled as percentages (rounded to the nearest integer percent).
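As a rough illustration of the two statistics described above, the fraction and its integer-percent rendering could be computed like this. The class and method names are hypothetical and the sketch does not reflect how WhiteRabbit writes its spreadsheet cells.

```java
/** Hypothetical helper for the fraction-unique statistic and its percent display. */
final class UniqueFraction {
    private UniqueFraction() {
    }

    /** Fraction of rows holding a distinct value; defined as 0 for an empty column. */
    static double fractionUnique(long uniqueCount, long rowCount) {
        return rowCount == 0 ? 0.0 : (double) uniqueCount / rowCount;
    }

    /** Renders a fraction as a percentage rounded to the nearest integer. */
    static String asPercent(double fraction) {
        return Math.round(fraction * 100) + "%";
    }

    public static void main(String[] args) {
        // A column where every value is distinct, and a 1,000-row column holding one constant value.
        System.out.println(asPercent(fractionUnique(4, 4)));
        System.out.println(asPercent(fractionUnique(1, 1000)));
    }
}
```

Note that a constant column still has one unique value, so its fraction is 1/rows rather than exactly zero; it only displays as 0% after rounding, once the column has enough rows.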