Unique value count #150

MaximMoinat · 2018-12-31T15:25:14Z

This change adds the number of unique values in the overview tab, together with the fraction unique (100% meaning a column with only unique values, 0% with a constant value).

Also, the fraction columns are styled as percentage (rounded to integer percent).

(cherry picked from commit 164cc94)

(cherry picked from commit cf108a7)
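For readers who want the arithmetic spelled out, here is a minimal sketch of the two statistics described above. It is plain Java and not the code from this PR, which styles Excel cells rather than building strings; the class and method names (`UniqueFractionSketch`, `fractionUnique`, `asPercent`) are made up for illustration.

```java
import java.util.Locale;

class UniqueFractionSketch {
    /** Fraction of scanned rows that hold a distinct value; 0 when nothing was scanned. */
    static double fractionUnique(long uniqueCount, long rowsScanned) {
        if (rowsScanned <= 0) {
            return 0.0;
        }
        return (double) uniqueCount / rowsScanned;
    }

    /** Renders a fraction as a whole-number percentage. */
    static String asPercent(double fraction) {
        return String.format(Locale.ROOT, "%d%%", Math.round(fraction * 100));
    }

    public static void main(String[] args) {
        // Column where every scanned row has its own value.
        System.out.println(asPercent(fractionUnique(1000, 1000)));
        // Constant column: one distinct value over many rows rounds down to 0%.
        System.out.println(asPercent(fractionUnique(1, 1000)));
    }
}
```

Strictly speaking a constant column has a fraction of 1/rows, which only shows as 0% because of the rounding to integer percent.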

schuemie

Hmmm, one caveat here: there's a maximum on the number of unique values that WhiteRabbit will keep in memory: https://github.com/OHDSI/WhiteRabbit/blob/master/src/org/ohdsi/whiteRabbit/scan/SourceDataScan.java#L49

The default number of rows to scan (100,000) is the same as this maximum, but if people set the number of rows higher these new statistics might suggest fewer unique values than actually encountered.

I suggest that if the number of unique values in memory is equal to the maximum (or maybe we set a flag if the max is hit), we modify the statistics here to say 'greater than or equal'. So for example:
N unique >= 100,000, Fraction unique >= 10%.

MaximMoinat · 2019-01-09T11:40:21Z

In that case, the number of counted unique values will actually be greater than the actual number. The discarded values are potentially counted multiple times. See these lines:
https://github.com/thehyve/OHDSI-WhiteRabbit/blob/unique-stat/src/org/ohdsi/whiteRabbit/scan/SourceDataScan.java#L479-L480
(these lines were added because of issues caused by trimming of the fieldInfo object directly after processing all values)

Let's indeed go with your proposal. At most 100,000 unique values should be counted, and if that limit is hit, the stats are reported as you mentioned (>= 100,000, >= xx%). We should be able to reuse the tooManyValues flag for this.
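The overcounting described above can be reproduced with a small, self-contained sketch. It is not the WhiteRabbit code: `OvercountSketch`, `countWithTrimming` and this `keepTopN` are hypothetical stand-ins, and for simplicity the map is trimmed every time it exceeds the capacity.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class OvercountSketch {
    /** Keeps the n most frequent entries; ties are broken by key to stay deterministic. */
    static void keepTopN(Map<String, Integer> counts, int n) {
        List<String> keys = new ArrayList<>(counts.keySet());
        keys.sort((x, y) -> {
            int byCount = Integer.compare(counts.get(y), counts.get(x));
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        for (int i = n; i < keys.size(); i++) {
            counts.remove(keys.get(i));
        }
    }

    /** Increments a counter for every value that is not currently in the map. */
    static int countWithTrimming(String[] values, int capacity, int keep) {
        Map<String, Integer> valueCounts = new HashMap<>();
        int uniqueCount = 0;
        for (String value : values) {
            if (!valueCounts.containsKey(value)) {
                uniqueCount++; // looks new, but may have been discarded by an earlier trim
            }
            valueCounts.merge(value, 1, Integer::sum);
            if (valueCounts.size() > capacity) {
                keepTopN(valueCounts, keep);
            }
        }
        return uniqueCount;
    }

    public static void main(String[] args) {
        String[] values = {"a", "a", "a", "b", "c", "b"};
        // Three distinct values, but "b" is dropped when "c" arrives and is then counted again.
        System.out.println(countWithTrimming(values, 2, 1)); // prints 4
    }
}
```

Once a trim has happened, the counter can only be read as an upper bound on the true number of distinct values.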

schuemie · 2019-01-09T12:40:36Z

Ah, I now see you introduced the uniqueCount variable, which I think is redundant with valueCounts.size().

MaximMoinat · 2019-01-09T16:39:52Z

My initial implementation indeed used valueCounts.size(). However, when I tried this, the unique count was capped at the 'max distinct values' setting (1000 by default). This is because the valueCounts object is truncated before the scan report is generated, so not all unique values are kept after the file/table has been analysed.

WhiteRabbit/src/org/ohdsi/whiteRabbit/scan/SourceDataScan.java

Lines 343 to 346 in d8ff935

    
    for (FieldInfo fieldInfo : fieldInfos)
        fieldInfo.trim();
    return fieldInfos;
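A small sketch of why `valueCounts.size()` stops being usable once the trim above has run. `FieldStats` is a hypothetical stand-in for the real `FieldInfo` class, and its `trim(int)` only mimics the idea of keeping the most frequent values; the actual implementation may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class TrimmedSizeSketch {
    /** Hypothetical per-field record: value counts plus a separately tracked unique counter. */
    static final class FieldStats {
        final Map<String, Integer> valueCounts = new LinkedHashMap<>();
        int uniqueCount = 0;

        void add(String value) {
            // merge returns the new count, so 1 means the value was not in the map before.
            if (valueCounts.merge(value, 1, Integer::sum) == 1) {
                uniqueCount++;
            }
        }

        /** Keeps only the maxValues most frequent values, as a report-time trim would. */
        void trim(int maxValues) {
            List<String> keys = new ArrayList<>(valueCounts.keySet());
            keys.sort((x, y) -> Integer.compare(valueCounts.get(y), valueCounts.get(x)));
            for (int i = maxValues; i < keys.size(); i++) {
                valueCounts.remove(keys.get(i));
            }
        }
    }

    public static void main(String[] args) {
        FieldStats stats = new FieldStats();
        for (String value : new String[]{"a", "a", "b", "c", "d", "e"}) {
            stats.add(value);
        }
        stats.trim(2);
        System.out.println(stats.valueCounts.size()); // prints 2, the trim limit
        System.out.println(stats.uniqueCount);        // prints 5, the distinct values seen
    }
}
```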


MaximMoinat · 2019-08-13T08:43:11Z

Amended as suggested by @schuemie. The unique count and unique fraction are now marked with '<=' when the unique count exceeds MAX_VALUES_IN_MEMORY. In these cases the unique count is an overestimate, because the value counts are trimmed mid-scan (keeping only the top N) and a discarded value that reappears is counted again.
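As an illustration of the '<=' marking, here is a minimal formatting sketch. The method names and the boolean `trimmed` parameter are assumptions made for this example (the discussion above suggests the tooManyValues flag could play that role); the merged code may format these values differently.

```java
import java.util.Locale;

class UniqueStatLabelSketch {
    /** Prefixes the count with "<= " when the value counts were trimmed during the scan. */
    static String formatUniqueCount(long uniqueCount, boolean trimmed) {
        String number = String.format(Locale.US, "%,d", uniqueCount);
        return trimmed ? "<= " + number : number;
    }

    /** Same marking for the fraction, shown as a whole-number percentage. */
    static String formatUniqueFraction(long uniqueCount, long rowsScanned, boolean trimmed) {
        long percent = rowsScanned > 0 ? Math.round(100.0 * uniqueCount / rowsScanned) : 0;
        String text = percent + "%";
        return trimmed ? "<= " + text : text;
    }

    public static void main(String[] args) {
        System.out.println(formatUniqueCount(120_000, true));               // <= 120,000
        System.out.println(formatUniqueFraction(120_000, 1_000_000, true)); // <= 12%
        System.out.println(formatUniqueCount(42, false));                   // 42
    }
}
```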

By the way, what is the reason that the trimming is done only once? This does not guarantee that the size of the valueCounts is always smaller than the MAX_VALUES_IN_MEMORY.

WhiteRabbit/src/org/ohdsi/whiteRabbit/scan/SourceDataScan.java

Lines 436 to 439 in d8ff935

    
    if (!tooManyValues && valueCounts.size() > MAX_VALUES_IN_MEMORY) {
        tooManyValues = true;
        valueCounts.keepTopN(maxValues);
    }
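To make the question concrete, the sketch below replays the condition quoted above with tiny limits. Everything apart from that condition is hypothetical: `TrimOnceSketch`, `sizeAfterScan` and the `trimRepeatedly` switch exist only for this illustration and are not proposed by this PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TrimOnceSketch {
    /** Keeps the n most frequent entries; ties are broken by key to stay deterministic. */
    static void keepTopN(Map<String, Integer> counts, int n) {
        List<String> keys = new ArrayList<>(counts.keySet());
        keys.sort((x, y) -> {
            int byCount = Integer.compare(counts.get(y), counts.get(x));
            return byCount != 0 ? byCount : x.compareTo(y);
        });
        for (int i = n; i < keys.size(); i++) {
            counts.remove(keys.get(i));
        }
    }

    /** Feeds distinctValues different values through the check and returns the final map size. */
    static int sizeAfterScan(int distinctValues, int maxInMemory, int maxValues, boolean trimRepeatedly) {
        Map<String, Integer> valueCounts = new HashMap<>();
        boolean tooManyValues = false;
        for (int i = 0; i < distinctValues; i++) {
            valueCounts.merge("v" + i, 1, Integer::sum);
            // With trimRepeatedly == false this is the quoted condition: it fires only once.
            if ((trimRepeatedly || !tooManyValues) && valueCounts.size() > maxInMemory) {
                tooManyValues = true;
                keepTopN(valueCounts, maxValues);
            }
        }
        return valueCounts.size();
    }

    public static void main(String[] args) {
        // Limit of 3 in memory, keep top 2 on trim, 10 distinct values in total.
        System.out.println(sizeAfterScan(10, 3, 2, false)); // prints 8: the map grows again after the single trim
        System.out.println(sizeAfterScan(10, 3, 2, true));  // prints 2: the map stays within the limit
    }
}
```

Whether trimming on every overflow is desirable is a separate trade-off (each trim sorts the map), so this only shows the size behaviour and is not a recommendation.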


blootsvoets

Looks good, some minor comments


Maxim Moinat added 3 commits October 29, 2018 16:23

include number and fraction unique values in report

c4345b7

(cherry picked from commit 164cc94)

fractions styled as percentage in excel

decab39

(cherry picked from commit cf108a7)

count unique values separately

3e7f045

schuemie reviewed Jan 9, 2019


Maxim Moinat added 5 commits August 12, 2019 15:16

include number and fraction unique values in report

77b1d29

(cherry picked from commit 164cc94)

fractions styled as percentage in excel

384ff8f

(cherry picked from commit cf108a7)

count unique values separately

3e4de89

add <= to the unique count and fraction output, if the value counts are trimmed due to too many values in memory

c047911

Merge branch 'unique-stat' of https://github.com/thehyve/OHDSI-WhiteRabbit into unique-stat2

0b751d0

MaximMoinat requested review from schuemie and blootsvoets August 13, 2019 08:43

Merge branch 'develop' into unique-stat

c61e346

MaximMoinat changed the base branch from master to develop September 11, 2019 12:04

blootsvoets reviewed Sep 16, 2019 (three review comments on whiterabbit/src/main/java/org/ohdsi/whiteRabbit/scan/SourceDataScan.java, now outdated and resolved)

minor refactorings after code review. formatting logic removed from fieldinfo class

b82665e

MaximMoinat merged commit b27ec07 into OHDSI:develop Sep 23, 2019

MaximMoinat deleted the unique-stat branch September 23, 2019 10:20

MaximMoinat mentioned this pull request Nov 11, 2019

Release 0.9.0 #194

Merged

MaximMoinat mentioned this pull request Apr 27, 2021

Add count distinct to scan report #300

Closed
