adding validation tool (KS test) #31

Merged (16 commits) on Nov 4, 2020

Contributor

@lguzzi lguzzi commented Oct 26, 2020

PR to add the validation tool. The script loads a list of files into an RDataFrame, splits the dataframe into chunks, and then runs a Kolmogorov-Smirnov (KS) test between the first chunk and the rest to check that a set of distributions is compatible across chunks.
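To illustrate the chunk comparison described above, here is a minimal pure-Python sketch. It is not the code from this PR: the helper names (`split_chunks`, `ks_statistic`, `ks_pvalue`, `validate`) are hypothetical, it works on unbinned lists of values rather than on histograms filled from an RDataFrame, it assumes each remaining chunk is compared to the first one individually, and it uses the plain asymptotic Kolmogorov formula for the p-value.

```python
import math
import random


def split_chunks(values, n_chunks):
    """Split a list into at most n_chunks consecutive slices of similar size."""
    size = math.ceil(len(values) / n_chunks)
    return [values[k:k + size] for k in range(0, len(values), size)]


def ks_statistic(a, b):
    """Two-sample KS statistic: largest distance between the empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Step both CDFs past every entry equal to x, so ties are handled.
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic p-value, 2 * sum_k (-1)^(k-1) * exp(-2 k^2 lambda^2)."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.1:
        # The series converges slowly here; the true value is 1 to high precision.
        return 1.0
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
            for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * s))


def validate(values, n_chunks=5):
    """Compare the first chunk with each of the others; return the p-values."""
    chunks = split_chunks(values, n_chunks)
    reference = chunks[0]
    return [ks_pvalue(ks_statistic(reference, other), len(reference), len(other))
            for other in chunks[1:]]


if __name__ == "__main__":
    rng = random.Random(1)
    toy_tau_pt = [rng.expovariate(1 / 40.0) for _ in range(5000)]  # toy data
    print(validate(toy_tau_pt))
```

A low p-value for one chunk would suggest that its distribution differs from the reference chunk, which is the kind of incompatibility the tool is meant to flag.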

The distributions considered for the KS test are:

  • lepton_gen_match, and:
    • tau_pt and tau_eta for each gen match value
  • sample_type distribution, and:
    • tau_pt and tau_eta for each sample type value
  • dataset_group_id distribution, and:
    • tau_pt, tau_eta for each group id value
    • dataset_id for each group id value
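The list above can be read as a small plan: one inclusive histogram per grouping variable, plus a set of per-value histograms. The sketch below expands such a plan into (variable, selection) pairs. The column names come from the list; the `PLAN` layout, the `expand_plan` helper, the selection-string format and the example values are assumptions made for illustration, not the structure used in the script.

```python
# Grouping variable -> variables to histogram for each value of that variable.
PLAN = {
    "lepton_gen_match": ["tau_pt", "tau_eta"],
    "sample_type": ["tau_pt", "tau_eta"],
    "dataset_group_id": ["tau_pt", "tau_eta", "dataset_id"],
}


def expand_plan(plan, observed_values):
    """Return (variable, selection) pairs; a selection of None means inclusive."""
    jobs = []
    for group, variables in plan.items():
        # The distribution of the grouping variable itself, without any cut.
        jobs.append((group, None))
        for value in observed_values[group]:
            selection = f"{group} == {value}"
            jobs.extend((variable, selection) for variable in variables)
    return jobs


if __name__ == "__main__":
    # Hypothetical values; in practice they would be read from the input files.
    observed = {
        "lepton_gen_match": [1, 2],
        "sample_type": [0],
        "dataset_group_id": [0],
    }
    for variable, selection in expand_plan(PLAN, observed):
        print(variable, "|", selection or "inclusive")
```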

The script creates a ROOT file storing the chunk histograms, a JSON file with the p-values, and PDF files of the KS comparisons.
You can find an example at /afs/cern.ch/user/l/lguzzi/public/TauPOG/validation_tool_test (run on /eos/cms/store/group/phys_tau/TauML/prod_2018_v1/full_tuples/WJetsToLNu_HT-2500ToInf/eventTuple_1-1.root)
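As a rough idea of how the JSON output could be consumed afterwards, the sketch below flags the histograms whose p-value falls under a threshold. The flat name-to-p-value layout, the histogram names, the `summarize` helper and the 0.05 threshold are all assumptions for illustration; the actual file layout produced by the script may differ.

```python
import json


def summarize(pvalues, threshold=0.05):
    """List the histogram names whose p-value is below the threshold."""
    failed = sorted(name for name, p in pvalues.items() if p < threshold)
    return {"threshold": threshold, "n_tested": len(pvalues), "failed": failed}


if __name__ == "__main__":
    # Hypothetical content standing in for the JSON file written by the tool.
    text = '{"tau_pt_chunk_1": 0.62, "tau_eta_chunk_1": 0.003}'
    print(json.dumps(summarize(json.loads(text)), indent=2))
```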

Time tests

Here are some timing tests I ran on an lxplus machine:

  • root-version: 6.20/02
  • dataset: WJetsToLNu_HT-1200To2500
    • size: 42 GB
    • events: 15890103 (about 15.9 million)
source /cvmfs/sft.cern.ch/lcg/views/LCG_97apython3/x86_64-centos7-clang10-opt/setup.sh
python -m cProfile validation_tool.py --input "/eos/cms/store/group/phys_tau/TauML/prod_2018_v1/full_tuples/WJetsToLNu_HT-1200To2500/*.root" --output test-v5 --legend

Run time after a fresh login (subsequent calls are about twice as fast):
1 thread: 106 s
4 threads: 47 s
10 threads: 19 s
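For future reference, the run times above can be turned into speedup and parallel-efficiency figures. The short calculation below only reuses the three numbers quoted here; they are single measurements, so the result should be taken as indicative. The `scaling` helper is a name chosen for this example.

```python
# Wall-clock times in seconds, keyed by number of threads (from the list above).
timings = {1: 106.0, 4: 47.0, 10: 19.0}


def scaling(timings):
    """Return {threads: (speedup, efficiency)} relative to the 1-thread run."""
    base = timings[1]
    rows = {}
    for n_threads, seconds in sorted(timings.items()):
        speedup = base / seconds
        rows[n_threads] = (speedup, speedup / n_threads)
    return rows


if __name__ == "__main__":
    for n_threads, (speedup, efficiency) in scaling(timings).items():
        print(f"{n_threads:>2} threads: speedup {speedup:.2f}x, "
              f"efficiency {efficiency:.0%}")
```

This gives a speedup of roughly 2.3x on four threads and 5.6x on ten threads, i.e. a parallel efficiency of about 56% in both cases.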

The most time-consuming function is load_histograms, which is called 13 times in total and takes about 7 s per call on one thread, 2 s per call on four threads and 1 s per call on ten threads.
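Per-function figures like these can be read from cProfile output sorted by cumulative time, either on the command line with `python -m cProfile -s cumulative ...` or from within Python through `pstats`. The sketch below shows the second option on a toy workload; the `load_histograms` defined here is only a hypothetical stand-in that sleeps, not the function from validation_tool.py.

```python
import cProfile
import io
import pstats
import time


def load_histograms():
    """Hypothetical stand-in for the real function; it only sleeps briefly."""
    time.sleep(0.01)


def main():
    # Mimic the 13 calls mentioned above.
    for _ in range(13):
        load_histograms()


def profile(func, top=5):
    """Run func under cProfile and return the report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile(main))
```

The `ncalls` and `percall` columns of the report are where the number of calls and the time per call can be read off.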

@lguzzi lguzzi requested a review from kandrosov October 26, 2020 17:58
Production/scripts/validation_tool.py: 4 review threads (resolved)
Collaborator

@kandrosov kandrosov left a comment

Thank you @lguzzi! The current version of the code looks good. To close this PR, could you please:

  • Post here the timing for running the tool on a relatively large dataset (at least O(10 GB)), for future reference.
  • Add instructions on how to run the validation to README.md

@lguzzi lguzzi requested a review from kandrosov November 4, 2020 11:14
@kandrosov kandrosov merged commit 4fc9573 into cms-tau-pog:master Nov 4, 2020