Instructions to run the validation code with the Spark cluster

Apache Spark is a framework for large-scale data processing, used here on data stored in Hadoop. The Spark cluster works like a cloud service: you can connect to it interactively via Swan (Jupyter notebooks) or from a shell. The interactive route, Swan Projects, is the easiest one, and notebooks can be run there to develop and test code. The shell route, on the other hand, requires a connection through an lxplus machine.

In both cases, you must request access to the Spark cluster and the Hadoop storage space before starting. This can be done using the link.

Start a session with Swan

This is straightforward: select "Analytix" in the Spark cluster option when the session is initialized.

Start a session with lxplus

To connect to Spark and Hadoop via an lxplus machine:

source /cvmfs/
source /cvmfs/ <cluster name> spark3

Here, the cluster name can be one of hadoop-qa, analytix, lxhadoop, or nxcals-prod. The one to use in this case is analytix:

source /cvmfs/ analytix spark3

Then, use:


It will ask for your password. Finally, if you have permission to use Hadoop, you can test the connection:

hdfs dfs -ls /


hdfs dfs -mkdir /hdfs/user/UserName/testFolder

Connect to Hadoop from lxplus

These commands connect directly to the /hdfs space, but problems can occur.

ssh it-hadoop-client


Connection with the Muon validation code (our main method)

The recommended way to set up the connection and install the code:

git clone

cd spark_tnp



How to generate distributions

The code is developed within the Muon-POG Spark code. As in the case of efficiencies, it reads the configuration file and produces plots, with and without the ratio panel, of the one-dimensional variables defined in the "binVariables" section of the configuration.json file.
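
To make the role of the "binVariables" section more concrete, here is a minimal sketch of how one-dimensional variables could be picked out of a configuration file. Only the section name "binVariables" comes from the text above. The "binning" key, the list-of-lists layout, the bin edges and the helper name one_dimensional_variables are assumptions made for illustration, and the real configuration.json may be organized differently.

```python
import json

# Hypothetical configuration text. Only the "binVariables" section name comes
# from the instructions; the "binning" key and the layout are assumptions.
EXAMPLE_CONFIG = """
{
  "binning": {
    "pt": [15, 20, 25, 30, 40, 50, 60, 120],
    "abseta": [0.0, 0.9, 1.2, 2.1, 2.4]
  },
  "binVariables": [["abseta", "pt"], ["pt"], ["abseta"]]
}
"""


def one_dimensional_variables(config):
    """Return {variable name: bin edges} for every single-variable entry."""
    selected = {}
    for entry in config.get("binVariables", []):
        # Entries naming more than one variable are multi-dimensional: skip them.
        if len(entry) == 1:
            name = entry[0]
            selected[name] = config["binning"][name]
    return selected


config = json.loads(EXAMPLE_CONFIG)
variables = one_dimensional_variables(config)
for name, edges in variables.items():
    print(f"{name}: {len(edges) - 1} bins from {edges[0]} to {edges[-1]}")
```

In this sketch the two-variable entry is skipped, so only "pt" and "abseta" would be plotted as one-dimensional distributions.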

There are two ways to run the code:

Produce Data/MC distributions for a full era:

./ compare particle probe resonance era configs/muon_example.json --baseDir ./example

For example:

./ compare muon generalTracks Z Run2018_UL configs/muon_example.json --baseDir ./example
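
The ratio panel of a Data/MC comparison is, in essence, a bin-by-bin division of two histograms. The sketch below illustrates that idea only. It assumes both histograms are normalized to unit area before dividing, which is a common choice but is not confirmed by these instructions. The function names and bin contents are made up.

```python
def normalize(counts):
    """Scale bin contents to unit area; an all-empty histogram stays at zero."""
    total = sum(counts)
    if total <= 0:
        return [0.0] * len(counts)
    return [count / total for count in counts]


def ratio(data_counts, mc_counts):
    """Bin-by-bin Data/MC ratio of the normalized histograms.

    Bins where the MC histogram is empty have no defined ratio and are
    returned as None.
    """
    if len(data_counts) != len(mc_counts):
        raise ValueError("histograms must share the same binning")
    data = normalize(data_counts)
    mc = normalize(mc_counts)
    return [d / m if m > 0 else None for d, m in zip(data, mc)]


# Made-up bin contents, for illustration only.
data_hist = [10, 20, 10]
mc_hist = [20, 40, 20]
print(ratio(data_hist, mc_hist))
```

Returning None for empty MC bins keeps undefined points out of the ratio instead of plotting them as zero or infinity.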

Produce distributions comparing two specific suberas:

There are two options: compare Data or MC datasets from the same era, or from different eras. In the first case:

./ compare particle probe resonance era configs/muon_example.json --baseDir ./example --subera1 SubEra1 --subera2 SubEra2

For example:

./ compare muon generalTracks Z Run2018_UL configs/muon_example.json --baseDir ./example --subera1 Run2018A --subera2 DY_madgraph

In the second case, from two different eras:

./ compare particle probe resonance era1 configs/muon_example.json --baseDir ./example --subera1 SubEra1 --subera2 SubEra2 --era2 Era2

For example:

./ compare muon generalTracks Z Run2018_UL configs/muon_example.json --baseDir ./example --subera1 Run2018A --subera2 DY_madgraph --era2 Run2016_UL
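
The three command variants above differ only in their optional flags. As a summary, the following sketch assembles the argument list that follows the script name, which is not reproduced in these instructions. The helper build_compare_args is hypothetical and not part of the validation code. The rules that --subera1 and --subera2 go together and that --era2 is only used alongside them are inferred from the examples, not from the code.

```python
def build_compare_args(particle, probe, resonance, era, config, base_dir,
                       subera1=None, subera2=None, era2=None):
    """Return the arguments that follow the script name, as a list of strings."""
    args = ["compare", particle, probe, resonance, era, config,
            "--baseDir", base_dir]

    # Assumption: the two subera flags are always given as a pair.
    if (subera1 is None) != (subera2 is None):
        raise ValueError("subera1 and subera2 must be given together")
    if subera1 is not None:
        args += ["--subera1", subera1, "--subera2", subera2]

    # Assumption: a second era only makes sense when comparing two suberas.
    if era2 is not None:
        if subera1 is None:
            raise ValueError("era2 requires subera1 and subera2")
        args += ["--era2", era2]
    return args


full_era = build_compare_args(
    "muon", "generalTracks", "Z", "Run2018_UL",
    "configs/muon_example.json", "./example")
two_eras = build_compare_args(
    "muon", "generalTracks", "Z", "Run2018_UL",
    "configs/muon_example.json", "./example",
    subera1="Run2018A", subera2="DY_madgraph", era2="Run2016_UL")
print(" ".join(full_era))
print(" ".join(two_eras))
```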

Submit work to condor

When plots are needed for many muon IDs or RECOs, the code can spend a long time processing the parquet files one by one for each efficiency. The same can happen if you want to draw histograms for many variables. In that case, an option can be added to compute the histograms for each efficiency separately, i.e. the work is parallelized over muon types.

Add the --condor_submit option to the command line.

For example:

./ compare particle probe resonance era configs/muon_example.json --baseDir ./example --condor_submit

Final steps

Once the plots have been generated, copy them to your web directory:

cp -r ./BaseDir/plots/particle/probe/resonance/era/* /eos/user/u/username/www/some_directory/

For example:

cp -r ./example/plots/muon/generalTracks/Z/Run2018_UL/* /eos/user/u/username/www/some_directory/

Then copy index.php into every subdirectory (this assumes index.php is already present in some_directory):

cd /eos/user/u/username/www/some_directory/

find . -type d -exec cp index.php {} \;
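
For reference, the find command above can also be written as a short Python sketch. The helper name copy_index_to_subdirs and the demo directory names are illustrative only. Unlike find, the sketch skips the top-level directory, where index.php already lives, and it stops early if index.php is missing.

```python
import os
import shutil
import tempfile


def copy_index_to_subdirs(root, index_name="index.php"):
    """Copy root/index_name into every directory below root."""
    source = os.path.join(root, index_name)
    if not os.path.isfile(source):
        raise FileNotFoundError(source)

    copied = []
    for dirpath, _dirnames, _filenames in os.walk(root):
        # The top-level directory already holds the source file.
        if os.path.abspath(dirpath) == os.path.abspath(root):
            continue
        target = os.path.join(dirpath, index_name)
        shutil.copyfile(source, target)
        copied.append(target)
    return sorted(copied)


# Demo on a throwaway directory tree, so nothing real is touched.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "pt", "barrel"))
    with open(os.path.join(root, "index.php"), "w") as handle:
        handle.write("<?php /* placeholder */ ?>\n")
    copied = copy_index_to_subdirs(root)
    relative = [os.path.relpath(path, root) for path in copied]

print(relative)
```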


