adding validation tool (KS test) #31

lguzzi · 2020-10-26T17:54:46Z

PR to add the validation tool. The script loads a list of files into an RDataFrame, splits the dataframe into chunks, and then runs a Kolmogorov-Smirnov test between the first chunk and the rest to check that selected distributions are compatible.
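The chunk-and-compare idea can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in validation_tool.py: it works on unbinned values instead of the ROOT histograms the script stores, it compares each later chunk to the first one individually (one possible reading of "the first chunk and the rest"), and the helper names `split_chunks`, `ks_statistic`, `ks_pvalue` and `compare_first_to_rest` are hypothetical.

```python
import math
from bisect import bisect_right


def split_chunks(values, n_chunks):
    """Split a sequence into n_chunks contiguous chunks of near-equal size."""
    size, extra = divmod(len(values), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        stop = start + size + (1 if i < extra else 0)
        chunks.append(values[start:stop])
        start = stop
    return chunks


def ks_statistic(a, b):
    """Two-sample KS statistic: largest distance between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic p-value from the Kolmogorov distribution (approximate)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.3:
        # The alternating series converges poorly here; the p-value is ~1.
        return 1.0
    total = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam)
                for j in range(1, 101))
    return min(1.0, max(0.0, 2.0 * total))


def compare_first_to_rest(values, n_chunks):
    """Return one p-value per later chunk, each compared to the first chunk."""
    chunks = split_chunks(values, n_chunks)
    reference = chunks[0]
    return [ks_pvalue(ks_statistic(reference, c), len(reference), len(c))
            for c in chunks[1:]]
```

A low p-value for some chunk would flag it as incompatible with the reference chunk for that variable; the threshold to use is a choice left to the user.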

The distributions considered for the KS test are:

- lepton_gen_match distribution, and:
  - tau_pt and tau_eta for each gen match value
- sample_type distribution, and:
  - tau_pt and tau_eta for each sample type value
- dataset_group_id distribution, and:
  - tau_pt and tau_eta for each group id value
  - dataset_id for each group id value

The script creates a ROOT file storing the chunk histograms, a JSON file with the p-values, and PDF files of the KS comparisons.
You can find an example at /afs/cern.ch/user/l/lguzzi/public/TauPOG/validation_tool_test (run on /eos/cms/store/group/phys_tau/TauML/prod_2018_v1/full_tuples/WJetsToLNu_HT-2500ToInf/eventTuple_1-1.root)

Time tests

Here are some timing tests I ran on an lxplus machine:

root-version: 6.20/02
dataset: WJetsToLNu_HT-1200To2500
- size: 42 GB
- events: 15890103 (about 15.9 million)

source /cvmfs/sft.cern.ch/lcg/views/LCG_97apython3/x86_64-centos7-clang10-opt/setup.sh
python -m cProfile validation_tool.py --input "/eos/cms/store/group/phys_tau/TauML/prod_2018_v1/full_tuples/WJetsToLNu_HT-1200To2500/*.root" --output test-v5 --legend

Run time after a fresh login (subsequent calls are about twice as fast):
1 thread: 106 s
4 threads: 47 s
10 threads: 19 s
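For reference, the speedup and per-thread efficiency implied by these run times can be worked out as below. The numbers are the ones quoted above; the `scaling` helper is hypothetical, and since these are fresh-login timings, file caching may affect them as much as threading does.

```python
# Wall-clock run times in seconds, keyed by number of threads (from above).
timings = {1: 106.0, 4: 47.0, 10: 19.0}


def scaling(timings):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    base = timings[1]
    return {n: (base / t, base / t / n) for n, t in timings.items()}


for n, (speedup, efficiency) in sorted(scaling(timings).items()):
    print(f"{n:2d} threads: speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

This gives roughly a 2.3x speedup on four threads and 5.6x on ten, so each added thread contributes a bit more than half of an ideal thread.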

The most time-consuming function is load_histograms, which takes about 7 s per call on one thread, 2 s per call on four threads, and 1 s per call on ten threads (13 calls in total).
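Per-function numbers like these can be read from cProfile output with the standard pstats module. The sketch below profiles a stand-in function called 13 times and prints only its row; `load_histograms_stub` and `profile_calls` are hypothetical names used for illustration and do not reproduce the real load_histograms.

```python
import cProfile
import io
import pstats


def load_histograms_stub(n):
    """Stand-in workload; the real function fills histograms from an RDataFrame."""
    return sum(i * i for i in range(n))


def profile_calls(func, n_calls, *args):
    """Profile n_calls invocations of func and return the filtered report text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n_calls):
        func(*args)
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Restrict the printed table to rows whose name matches the function.
    stats.sort_stats("cumulative").print_stats(func.__name__)
    return buffer.getvalue()


report = profile_calls(load_histograms_stub, 13, 10000)
print(report)
```

The ncalls, cumtime and percall columns of the printed row correspond to the call count and per-call time quoted above.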

Production/scripts/validation_tool.py

kandrosov

Thank you @lguzzi! The current version of the code looks good. To close this PR, could you please:

- Post here the time needed to run the tool on a relatively large dataset (at least O(10 GB)), for future reference.
- Add instructions on how to run the validation to README.md.

lguzzi added 3 commits October 26, 2020 18:27

adding validation tool (KS test)

e9eeca4

adding a comment

0ef725b

small fix

5f7c1a9

lguzzi requested a review from kandrosov October 26, 2020 17:58

lguzzi added 3 commits October 27, 2020 10:35

integer types from groupby method

eec9866

numbered histograms

cb5ae3e

adding dataset_id to main variables

e3a93ac

kandrosov requested changes Oct 27, 2020


Production/scripts/validation_tool.py (outdated, resolved)

Production/scripts/validation_tool.py (outdated, resolved)

Production/scripts/validation_tool.py (resolved)

Production/scripts/validation_tool.py (outdated, resolved)

lguzzi added 9 commits October 27, 2020 14:54

comment

bd57611

cast fix

b12d21a

n splits default 100

7f60424

python3-like print

6869fd1

changin output files structure

200cb39

changin / to //

cf217ef

correct conversion from RDF.Count() to int

9175af5

new version of validation tool

3c0beb5

validation tool update

75a6735

kandrosov reviewed Nov 3, 2020


update readme for validation tool

1e7bf8a

lguzzi requested a review from kandrosov November 4, 2020 11:14

kandrosov approved these changes Nov 4, 2020


kandrosov merged commit 4fc9573 into cms-tau-pog:master Nov 4, 2020
