Home
Welcome to the adversarial wiki! For specific topics, see the table of contents on the right. Below, you will find details about what the point of the package is, what it contains, and how to use it.
Substructure observables can be used to classify jets based on their radiation patterns, which is used e.g. for distinguishing hadronically decaying W bosons from non-resonant multijets. Additionally, machine learning (ML) methods can be used to combine several such observables into a more powerful classifier.
However, these observables may be correlated with other properties of the jets, such as their invariant mass, which leads to a sculpting of jet distributions when performing a selection on an observable with which they are correlated. This is shown in the figure below, from the ATL-PHYS-PUB-2017-004 note.
Here, it is seen that the distribution of the invariant mass of multijets before any selection (green, hashed) is sculpted after performing a selection on the D2 substructure observable (pink, line). Since the signal process is resonant, the invariant mass of the jet is a powerful discriminator against the non-resonant background. ML taggers approximately learn the jet mass as a feature and use it for discrimination, resulting in increased sculpting: the background distribution after a selection on an ML tagger (red and blue, line) comes to resemble the signal distribution (yellow, hashed).
The aim of this package is to provide a range of tools to mitigate such (non-linear) correlation with the jet mass, for both single-variable and ML-based taggers. The package provides five different techniques for constructing mass-decorrelated jet taggers:
- Designed decorrelated taggers (DDT): Performs a linear transform of tau21 vs. rhoDDT = log[m^2 / (pT x 1 GeV)] for the dominant background to construct tau21DDT, which is independent of the jet mass and pT in a large region of phase-space. Only works for tau21.
- k-nearest neighbour regression (k-NN): Performs a non-parametric regression of the value of any substructure variable X which corresponds to a fixed background efficiency, as a function of (log pT, rho = log[m^2 / pT^2]) for the dominant background, to construct Xk-NN, which is independent of the jet mass and pT throughout phase-space for a cut at the fixed target background selection efficiency.
- Convolved substructure: In bins of the jet mass, this method convolves the distribution of any jet substructure observable to morph it coherently towards the distribution at some reference mass. This is intended to remove the mass dependence not only of the average of the distribution but also of its higher-order moments. It provides no independence of pT, and works best if the intrinsic shapes of the substructure observable distributions at different masses are not too different.
- Boosting for uniform selection efficiency (uBoost): Uses adaptive boosting of decision trees (BDTs) to penalise not only classification error but also non-uniformity of the background selection efficiency with respect to some target variable, here the jet mass. pT dependence can be mitigated by re-weighting the training data to uniformity in this variable; the method is able to reduce the correlation with the jet mass, but some features still remain.
- Adversarially trained neural networks (ANN): A standard neural network (NN) classifier is pitted against an adversary network tasked with inferring the jet mass from the classifier output and (optionally) additional auxiliary variables (e.g. log pT); using gradient reversal, the classifier network is penalised whenever this inference succeeds, and is thereby trained to make the tagger variable independent of the jet mass. This method is almost Pareto-efficient in terms of background rejection vs. mass-decorrelation (1/Jensen-Shannon divergence of the pass and fail jet mass distributions), but some pT dependence remains despite re-weighting.
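The k-NN method described above can be illustrated with a toy example. This is a sketch on synthetic data with a hand-rolled neighbour search, not the package's implementation; all variable names and numbers are made up.

```python
# Toy sketch of k-NN decorrelation: for each point in (log pT, rho), take the
# value of X at a fixed background-efficiency percentile among the k nearest
# background neighbours, and use that as a local cut value.
import numpy as np

rng = np.random.default_rng(1)
eff_target = 0.20   # fixed target background efficiency
k = 200

# Hypothetical background sample in which X is correlated with rho.
n = 20_000
logpt = rng.uniform(5.5, 7.5, size=n)
rho = rng.uniform(-6.0, -1.0, size=n)
X = 0.5 + 0.1 * rho + rng.normal(0.0, 0.1, size=n)
points = np.column_stack([logpt, rho])

def knn_cut(query, points, X, k, eff):
    """Percentile of X among the k nearest neighbours of `query`."""
    d2 = np.sum((points - query) ** 2, axis=1)
    nearest = np.argpartition(d2, k)[:k]
    return np.percentile(X[nearest], 100.0 * (1.0 - eff))

# Evaluate the local cut at (a subset of) the background points; the selection
# X > cut then has roughly eff_target efficiency everywhere in phase-space.
cuts = np.array([knn_cut(p, points, X, k, eff_target) for p in points[:2000]])
efficiency = np.mean(X[:2000] > cuts)
```

Because the cut value tracks the local distribution of X, the selection efficiency stays near the target even though X itself is correlated with rho.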
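The convolved-substructure method can likewise be sketched on synthetic one-dimensional histograms. All widths and numbers below are illustrative, not taken from the package.

```python
# Toy sketch of convolved substructure: smear the observable's histogram in one
# mass bin with a Gaussian kernel so that it matches a broader reference shape.
import numpy as np

bins = np.linspace(0.0, 1.0, 201)
centres = 0.5 * (bins[:-1] + bins[1:])

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2)

# Hypothetical shapes: the distribution in this mass bin is narrower than the
# one at the reference mass.
sigma_bin, sigma_ref = 0.05, 0.08
hist_bin = gaussian(centres, 0.5, sigma_bin)
hist_bin /= hist_bin.sum()

# Convolving with a kernel of width sqrt(sigma_ref^2 - sigma_bin^2) morphs the
# bin's distribution towards the reference one.
sigma_kernel = np.sqrt(sigma_ref ** 2 - sigma_bin ** 2)
dx = centres[1] - centres[0]
kernel = gaussian(np.arange(-50, 51) * dx, 0.0, sigma_kernel)
kernel /= kernel.sum()
hist_morphed = np.convolve(hist_bin, kernel, mode="same")

# The width of the morphed histogram is now close to the reference width.
mean = np.sum(centres * hist_morphed) / hist_morphed.sum()
width = np.sqrt(np.sum((centres - mean) ** 2 * hist_morphed) / hist_morphed.sum())
```

This only matches the first two moments exactly for Gaussian shapes, which is why the method works best when the intrinsic shapes at different masses are similar.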
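uBoost itself is a full boosting algorithm, so no attempt is made to reproduce it here, but the non-uniformity it penalises is easy to illustrate with a toy (all numbers below are made up).

```python
# Toy illustration of the quantity uBoost penalises: the spread of the
# background selection efficiency across jet-mass bins after a flat cut on a
# mass-correlated score.
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
mass = rng.uniform(50.0, 300.0, size=n)

# A score correlated with mass sculpts the background: a flat 20%-efficiency
# cut then has very different efficiencies in different mass bins.
score = 0.002 * mass + rng.normal(0.0, 0.2, size=n)
passed = score > np.percentile(score, 80.0)

edges = np.linspace(50.0, 300.0, 11)
index = np.digitize(mass, edges) - 1
effs = np.array([passed[index == i].mean() for i in range(10)])

# uBoost adds a term to the boosting loss that drives this spread towards zero.
non_uniformity = effs.std()
```
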
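The gradient-reversal trick used by the ANN method can be sketched framework-free. This is a toy illustration of the layer's contract, not the package's implementation.

```python
# Minimal sketch of a gradient-reversal layer: identity in the forward pass,
# gradient multiplied by -lambda in the backward pass.
import numpy as np

class GradReverse:
    """Forward: y = x. Backward: dL/dx = -lam * dL/dy."""

    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x

    def backward(self, grad_output):
        return -self.lam * grad_output

# When the adversary's loss gradient flows back through this layer, its sign is
# flipped, so the classifier upstream is pushed to *increase* the adversary's
# loss, i.e. to make its output uninformative about the jet mass.
layer = GradReverse(lam=10.0)
y = layer.forward(np.array([0.3, 0.7]))
grad_in = layer.backward(np.array([0.1, -0.2]))
```
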
A study of the performance of each of these methods was conducted in ATL-PHYS-PUB-2018-014. Each method has pros and cons, as detailed in the PUB note, so there is no one-size-fits-all solution for all analysis needs.
We'll assume that you are working on lxplus, and that you have a set of signal and background ROOT files containing features that you want to use for classification.
-
Log on to lxplus.
$ ssh <username>@lxplus.cern.ch
-
Clone the package.
$ git clone git@github.com:asogaard/adversarial.git
$ cd adversarial
-
Install miniconda and create the appropriate software environments. Here you have two options. Either install it yourself using the utility script, following the prompts:
$ source install.sh
Make sure your install location is somewhere with sufficient space (> ca. 9 GB), so either your AFS workarea or EOS. Alternatively, use the common, maintained installation by doing:
$ export PATH="/afs/cern.ch/work/a/asogaard/public/miniconda2/bin:$PATH"
every time you start a clean shell.
-
Activate the conda environment. This can be done using the utility script:
$ source activate.sh
-
Convert your ROOT files to a single HDF5 file. The package assumes the input is in the form of a single, flat HDF5 file (for speed and portability). To perform this conversion, you can use the prepro/converterROOT.py script. To see how it is used, run the script with the help flag from the base of the package:
$ python -m prepro.converterROOT --help
This script takes a variable number of signal (--sig) and background (--bkg) ROOT files, each containing lists of events with some number of jets in them.1 For the --sig and --bkg flags, you can use wildcards and specify a whitespace-separated list of multiple paths, e.g. --sig PATH1 PATH2/*.root MYDIR/*FILE3.root. As an example, the conversion can be run as shown:
$ MYBASEPATH=/afs/cern.ch/work/e/ehansen/public/DDT-studies-FourJets/
$ python -m prepro.converterROOT --sig $MYBASEPATH/zprime/*zprime*.root \
    --bkg $MYBASEPATH/jetjet/*.root \
    --output mytest-susy-fourjet.h5
If the script complains about unknown variables or about dividing by zero in np.log, or if you want to impose some selection before doing the conversion, you can edit prepro/converterROOT.py:L145-L153.
Alternatively, if you don't have any data yourself and just want to run the package, you can use the scripts/get_data.sh script to download a pre-prepared HDF5 file:
$ source scripts/get_data.sh
-
Place your data somewhere convenient. By default, all of the scripts in the package expect the input HDF5 file to be located at input/data.h5. If you used the get_data.sh script, it should have put the file in the correct place. Otherwise, you may want to move your own HDF5 file there. If you want to use a separate input directory, e.g. putting your HDF5 file at my-test-input/data.h5 (the data.h5 bit is hard-coded -- sorry), you can either use the --input my-test-input/ flag, which should be supported by all scripts, or you can change the default in adversarial/utils/setup.py:L47.
-
Train a first method. Now you should be able to start using each of the five methods mentioned above. These are all implemented in the package's run/ directory. The simplest one is the DDT method. The run/ddt/ directory contains a script for performing the linear fit (run/ddt/train.py) and a script for plotting the result (run/ddt/test.py). To perform the fit yourself, you run:
$ python -m run.ddt.train [--input my-test-input/]
If you get errors talking about unknown variable names, you can edit the default tau21, rhoDDT, and weight variable names in run/ddt/common.py:L22-L24.
Once the DDT training script concludes, the trained model is saved in the models/ddt/ directory. It is pickled and gzipped for portability and size. This trained model can then be used in studies to test its performance.
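Loading such a model back is a plain gzip + pickle round-trip. Here is a sketch; the file path and the stand-in model contents below are hypothetical, not the package's actual format.

```python
# Sketch of saving/loading a model the way the package describes: pickled and
# gzip-compressed. Path and model contents are stand-ins.
import gzip
import os
import pickle
import tempfile

def save_model(model, path):
    """Write an object as a gzip-compressed pickle."""
    with gzip.open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    """Read back a gzip-compressed pickle."""
    with gzip.open(path, "rb") as f:
        return pickle.load(f)

# Round-trip with a stand-in "model" (a real one would live under models/ddt/).
path = os.path.join(tempfile.gettempdir(), "ddt_toy.pkl.gz")
save_model({"slope": -0.06, "intercept": 0.8}, path)
model = load_model(path)
```
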
-
Perform first test. To test the DDT trained in the above item, you can run the following script:
$ python -m run.ddt.test [--input my-test-input/]
This loads in the trained model from the models/ddt/ directory and plots the profile of tau21 vs. rhoDDT before (red, open markers) and after (blue, filled markers) the transform, as well as the linear fit itself. Hopefully you'll see the transform make the profile flatter (in the fit range).
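The flattening that this test visualises can be illustrated with a toy example. This is synthetic data and a direct fit, not the package's actual fit code, and the slope value is made up.

```python
# Toy sketch of the DDT transform: fit tau21 linearly vs. rhoDDT for a
# synthetic "background", then subtract the fitted slope term.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical background sample: tau21 falls linearly with rhoDDT, plus noise.
rhoDDT = rng.uniform(1.5, 7.0, size=10_000)
tau21 = 0.8 - 0.06 * rhoDDT + rng.normal(0.0, 0.05, size=rhoDDT.size)

# Linear fit to the (rhoDDT, tau21) profile.
slope, intercept = np.polyfit(rhoDDT, tau21, 1)

# DDT-transformed variable: the linear dependence is subtracted, measured
# relative to a reference point rho0 (here the low edge of the fit range).
rho0 = 1.5
tau21DDT = tau21 - slope * (rhoDDT - rho0)

# The residual slope of tau21DDT vs. rhoDDT is compatible with zero,
# i.e. the profile is now flat.
residual_slope = np.polyfit(rhoDDT, tau21DDT, 1)[0]
```

A fixed cut on tau21DDT then has, to first order, the same background efficiency across the fitted rhoDDT range.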
-
Continue exploring. Now you can tinker away with the DDT transform in run/ddt/train.py, perform additional studies in run/ddt/test.py, or start looking into the four other methods in the run/ directory -- they all have the same structure. Once you have trained a few different methods, you can start performing comparison tests. The tests/studies directory already contains a lot of such studies in separate python scripts. These are run collectively using the tests/comparison.py script as:
$ python -m tests.comparison [--input my-test-input/]
Here, you can modify studies or add new ones particular to your own analysis needs.
The adversarial package contains the following directories, in approximate order of when you'll encounter them:
- envs: YAML files specifying the software environments, set up using conda, in which the package is intended to run. These ensure that all of the necessary software libraries (ROOT, numpy, tensorflow, etc.) are installed.
- prepro: Scripts for preprocessing ROOT files and converting them to the supported HDF5 file format.
- run: Contains one sub-directory for each mass-decorrelation method; each of these contains scripts for training and testing the method.
- models: Where all trained models are stored.
- tests: Holds all performance and validation tests used to study the different mass-decorrelation methods.
- configs: Configuration files for the neural network classifier, specifying the architecture, training procedure, etc.
- optimisation: Scripts for performing Bayesian optimisation of neural network hyper-parameters.
- adversarial: Common definitions.
- scripts: Common bash scripts.
- fixes: Miscellaneous minor patch files.
1 Notice that currently, all variable-length branches in the ROOT files are treated the same. This means that if you are requesting only the two leading jets (--nleading 2), then only the first two entries in each vector-like container are kept, which may also include e.g. tracks.
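The caveat in this footnote can be made concrete with a toy event; the branch names below are hypothetical.

```python
# Toy illustration: --nleading truncates *every* variable-length branch to the
# same number of entries, whether the branch is per-jet or per-track.
nleading = 2

event = {
    "jet_pt": [450.0, 320.0, 90.0],     # per-jet quantity: truncation intended
    "track_pt": [12.0, 8.0, 5.0, 3.0],  # per-track quantity: truncated too!
}

truncated = {branch: values[:nleading] for branch, values in event.items()}
```
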