Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unique value count #150

Merged
merged 10 commits into from
Sep 23, 2019
Merged

Unique value count #150

merged 10 commits into from
Sep 23, 2019

Conversation

MaximMoinat
Copy link
Collaborator

This change adds the number of unique values in the overview tab, together with the fraction unique (100% meaning a column with only unique values, 0% with a constant value).

Also, the fraction columns are styled as percentage (rounded to integer percent).

image

Maxim Moinat added 3 commits October 29, 2018 16:23
Copy link
Member

@schuemie schuemie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, one caveat here: there's a maximum on the number of unique values that WhiteRabbit will keep in memory: https://github.com/OHDSI/WhiteRabbit/blob/master/src/org/ohdsi/whiteRabbit/scan/SourceDataScan.java#L49

The default number of rows to scan (100,000) is the same as this maximum, but if people set the number of rows higher these new statistics might suggest fewer unique values than actually encountered.

I suggest that if the number of unique values in memory is equal to the maximum (or maybe we set a flag if the max is hit), we modify the statistics here to say 'greater than or equal'. So for example:
N unique >= 100,000, Fraction unique >= 10%.

@MaximMoinat
Copy link
Collaborator Author

MaximMoinat commented Jan 9, 2019

In that case, the number of counted unique values will actually be greater than the actual number. The discarded values are potentially counted multiple times. See these lines:
https://github.com/thehyve/OHDSI-WhiteRabbit/blob/unique-stat/src/org/ohdsi/whiteRabbit/scan/SourceDataScan.java#L479-L480
(these lines were added because of issues caused by trimming of the fieldInfo object directly after processing all values)

Lets indeed go with your proposal. At maximum 100,000 unique values should be counted and if that limit is hit, report the stats as you mentioned (>= 100,000, >= xx%). We should be able to reuse the tooManyValues flag for this.

@schuemie
Copy link
Member

schuemie commented Jan 9, 2019

Ah, I now see you introduced the uniqueCount variable, which I think is redundant with valueCounts.size().

@MaximMoinat
Copy link
Collaborator Author

MaximMoinat commented Jan 9, 2019

My initial implementation indeed used valueCounts.size(). However, when I tried this the unique count was at maximum the 'max distinct values' (by default 1000). This is caused by the truncation of the valueCounts object before generating the scan report. So not all unique values are kept after the file/table has been analysed.

for (FieldInfo fieldInfo : fieldInfos)
fieldInfo.trim();
return fieldInfos;

@MaximMoinat
Copy link
Collaborator Author

Amended as suggested by @schuemie. The unique count and unique fraction will now be marked with '<=' when the unique count is more than the MAX_VALUES_IN_MEMORY. In these cases the unique count will be an overestimate, as an intermittent trimming of the value counts is done (keeping only topN).

By the way, what is the reason that the trimming is done only once? This does not guarantee that the size of the valueCounts is always smaller than the MAX_VALUES_IN_MEMORY.

if (!tooManyValues && valueCounts.size() > MAX_VALUES_IN_MEMORY) {
tooManyValues = true;
valueCounts.keepTopN(maxValues);
}

@MaximMoinat MaximMoinat changed the base branch from master to develop September 11, 2019 12:04
Copy link
Collaborator

@blootsvoets blootsvoets left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, some minor comments

@MaximMoinat MaximMoinat merged commit b27ec07 into OHDSI:develop Sep 23, 2019
@MaximMoinat MaximMoinat deleted the unique-stat branch September 23, 2019 10:20
@MaximMoinat MaximMoinat mentioned this pull request Nov 11, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants