Merge pull request #30 from LDMX-Software/analysis-doc
Some example analysis workflows
tomeichlersmith authored May 1, 2024
2 parents e1448d1 + 4d49e40 commit 98e6ddb
Showing 10 changed files with 883 additions and 0 deletions.
3 changes: 3 additions & 0 deletions src/SUMMARY.md
Original file line number Diff line number Diff line change
@@ -1,6 +1,9 @@
[Welcome](index.md)

# Getting Started
- [Analyzing ldmx-sw Event Files](analysis/intro.md)
- [Using Python (Efficiently)](analysis/python.md)
- [Using ldmx-sw Directly](analysis/ldmx-sw.md)
- [Building and Installing ldmx-sw](building/intro.md)
- [Shared Computing Clusters](building/clusters.md)
- [Updating ldmx-sw](building/updating.md)
14 changes: 14 additions & 0 deletions src/analysis/ana-cfg.py
@@ -0,0 +1,14 @@
from LDMX.Framework import ldmxcfg
p = ldmxcfg.Process('ana')
import os
# needs to match path to where compiled library is
# deduced automatically if built and installed alongside ldmx-sw
ldmxcfg.Process.addLibrary(f'{os.getcwd()}/libMyAnalysis.so')
class MyAnalysis:
    def __init__(self):
        self.instanceName = 'my-ana'
        self.className = 'MyAnalyzer' # match class name in source file
        self.histograms = []
p.sequence = [ MyAnalysis() ]
p.inputFiles = [ 'events.root' ]
p.histogramFile = 'hist.root'
22 changes: 22 additions & 0 deletions src/analysis/analyzer-boilerplate.cxx
@@ -0,0 +1,22 @@
#include "Framework/EventProcessor.h"

#include "Ecal/Event/EcalHit.h"

class MyAnalyzer : public framework::Analyzer {
 public:
  MyAnalyzer(const std::string& name, framework::Process& p)
      : framework::Analyzer(name, p) {}
  ~MyAnalyzer() = default;
  void onProcessStart() final;
  void analyze(const framework::Event& event) final;
};

void MyAnalyzer::onProcessStart() {
  // this is where we will define the histograms we want to fill
}

void MyAnalyzer::analyze(const framework::Event& event) {
  // this is where we will fill the histograms
}

DECLARE_ANALYZER(MyAnalyzer);
31 changes: 31 additions & 0 deletions src/analysis/analyzer-total-rec-energy.cxx
@@ -0,0 +1,31 @@
#include "Framework/EventProcessor.h"

#include "Ecal/Event/EcalHit.h"

class MyAnalyzer : public framework::Analyzer {
 public:
  MyAnalyzer(const std::string& name, framework::Process& p)
      : framework::Analyzer(name, p) {}
  ~MyAnalyzer() = default;
  void onProcessStart() final;
  void analyze(const framework::Event& event) final;
};

void MyAnalyzer::onProcessStart() {
  getHistoDirectory(); // forget this -> silently no output histogram
  histograms_.create(
      "total_ecal_rec_energy",
      "Total ECal Rec Energy [GeV]", 160, 0.0, 16.0
  );
}

void MyAnalyzer::analyze(const framework::Event& event) {
  const auto& ecal_rec_hits{event.getCollection<ldmx::EcalHit>("EcalRecHits")};
  double total = 0.0;
  for (const auto& hit : ecal_rec_hits) {
    total += hit.getEnergy();
  }
  histograms_.fill("total_ecal_rec_energy", total/1000.);
}

DECLARE_ANALYZER(MyAnalyzer);
50 changes: 50 additions & 0 deletions src/analysis/intro.md
@@ -0,0 +1,50 @@
# Analyzing ldmx-sw Event Files

Often when first starting on LDMX, people are given a `.root` file produced by ldmx-sw
(or some method for producing their own file). This leads to the natural next
question: how do I analyze the data in this file?

Many answers to this question have been given, and many of them are functional.
In the subsequent sections of this chapter, I focus
on two possible analysis workflows that I personally
like (for different reasons).

I am not able to fully cover all of the possible different types of analysis,
so I am writing this guide in the context of one of the most common analyses:
looking at a histogram. This type of analysis can be broken into four steps.
1. Load data: from a data file, load the information in that file into memory.
2. Data Manipulation: from the data already present, calculate the variable
that you would like to put into the histogram.
3. Histogram Filling: define how the histogram should be binned and fill
the histogram with the variable that you calculated.
4. Plotting: from the definition of the histogram and its content,
draw the histogram in a visual manner for inspection.
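
As a rough illustration of these four steps, here is a minimal sketch in plain Python using only the standard library. The event data is synthetic and hypothetical (random "hit energies" in MeV stand in for the contents of a real file), and the coarse 16-bin histogram and text "plot" are only there to make each step visible. This is not how either workflow in this chapter does it.

```python
import random

# 1. Load data: stand-in for reading a file. Each event is a list of
#    hit energies in MeV (synthetic, hypothetical values).
random.seed(0)
events = [
    [random.uniform(0.0, 100.0) for _ in range(random.randint(50, 150))]
    for _ in range(1000)
]

# 2. Data manipulation: reduce each event to the one number we want to
#    histogram (total energy, converted from MeV to GeV).
totals_gev = [sum(hits) / 1000.0 for hits in events]

# 3. Histogram filling: fixed-width bins over [low, high).
n_bins, low, high = 16, 0.0, 16.0
width = (high - low) / n_bins
counts = [0] * n_bins
for value in totals_gev:
    if low <= value < high:
        counts[int((value - low) / width)] += 1

# 4. Plotting: a crude text rendering, one row per non-empty bin.
for i, count in enumerate(counts):
    if count:
        left, right = low + i * width, low + (i + 1) * width
        print(f"[{left:5.1f}, {right:5.1f}) {'#' * (count // 10)} {count}")
```

Each of the tools in the table below replaces one of these hand-written steps with something faster and more featureful.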

The software that is used to do each of these steps is what mainly separates
the different analysis workflows, so I am first going to mention various
software tools that are helpful for one or more of these steps. I have separated
these tools into two "ecosystems", both of which are widely used within
HEP.

Purpose | ROOT | scikit-hep
---|---|---
Load Data | `TFile`,`TTree` | `uproot`
Manipulate Data | `C++` Code | Vectorized Python with `awkward`
Fill Histograms | `TH1*` | `hist`,`boost_histogram`
Plot Histograms | `TBrowser`, `TCanvas` | `matplotlib`,`mplhep`

How one mixes and matches these tools is a personal choice,
especially since the LDMX collaboration has not landed on a
widely agreed-upon method for analysis. With this in mind,
I find it important to emphasize that the following subsections
are **examples** of analysis workflows -- a specific analyzer
can choose to mix and match the tools however they like.

### Caveat
While I (Tom Eichlersmith) have focused on two potential analysis
workflows, I do not mean to claim that I have tried all of them
and these two are "the best". I just mean to say that I have
drifted to these two analysis workflows over time as I've looked
for easier and better methods of analyzing data. If you have
an analysis workflow that you would like to share, add another
subsection to this chapter of the website!
Binary file added src/analysis/jsroot-screenshot.png
120 changes: 120 additions & 0 deletions src/analysis/ldmx-sw.md
@@ -0,0 +1,120 @@
# Using ldmx-sw Directly

As you may be able to pick up based on my tone, I prefer the (efficient) Python analysis
method presented in the previous chapter, which uses `uproot`, `awkward`, and `hist`.
Nevertheless, I have had two main reasons for putting an analysis into the ldmx-sw C++.
1. **Longevity**: While the python ecosystem evolving quickly is helpful for obtaining
new features, it can also cause long-running projects to "break" -- forcing a refactor
that is purely due to upstream code changes. Writing an analysis into C++, with its
more rigid view on backwards compatibility, essentially solidifies it so that it can
be run in the same way for a long time into the future.
2. **Non-Vectorization**: In the previous chapter, I made a big deal about how most
analyses can be vectorized and written in terms of these pre-compiled and fast
functions. That being said, sometimes it's difficult to figure out _how_ to write
an analysis in a vectorizable form. Dropping down into the C++ allows analyzers
to write the `for` loop themselves which may be necessary for an analysis to
be understandable (or even functional).
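
To give a feel for what "hard to vectorize" can mean, here is a small sketch of a greedy one-dimensional clustering where each decision depends on a running state (the current cluster's energy-weighted centroid). It is written in plain Python only to keep the example short; the function name `cluster_hits`, the `(position, energy)` tuple layout, and the clustering rule itself are hypothetical and not taken from ldmx-sw. The point is that the loop body reads a value that the previous iteration just updated, which is awkward to express as whole-array operations.

```python
def cluster_hits(hits, max_distance):
    """Greedy 1D clustering (hypothetical rule, for illustration only).

    hits: list of (position, energy) tuples with positive energies.
    Returns a list of (centroid, total_energy) tuples.
    """
    clusters = []
    centroid, energy = None, 0.0
    for position, hit_energy in sorted(hits):
        if centroid is not None and abs(position - centroid) <= max_distance:
            # Join the current cluster: the centroid moves, which changes
            # the decision for the *next* hit (sequential dependence).
            centroid = (centroid * energy + position * hit_energy) / (energy + hit_energy)
            energy += hit_energy
        else:
            # Close the current cluster (if any) and seed a new one.
            if centroid is not None:
                clusters.append((centroid, energy))
            centroid, energy = position, hit_energy
    if centroid is not None:
        clusters.append((centroid, energy))
    return clusters


example = cluster_hits([(0.0, 1.0), (1.0, 1.0), (10.0, 2.0)], max_distance=2.0)
```

In C++ the same logic is an ordinary `for` loop inside `analyze`, with no performance penalty for writing it that way.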

The following example is **stand-alone** in the same sense as the prior chapter's
Jupyter notebook. It can be run from outside of the ldmx-sw repository; however,
doing so undermines the first reason (longevity) for writing an analysis like this
in the first place. I would recommend that you store your analyzer source code
in ldmx-sw or the private ldmx-analysis. These repos also contain CMake infrastructure
so you can avoid having to write the long `g++` command necessary to compile
and link the source code.

## Set Up
I am going to use the same version of ldmx-sw that was used to generate
the input `events.root` file. This isn't strictly necessary - more often
than not, newer ldmx-sw versions are able to read files from prior ldmx-sw
versions.
```
cd work-area
denv init ldmx/pro:v3.3.6
```
The following code block shows the necessary boilerplate for starting
a C++ analyzer running with ldmx-sw.
```cpp
{{#include analyzer-boilerplate.cxx}}
```
And below is an example Python config called `ana-cfg.py` that I will
use to run this analyzer with `fire`. It loads the library from the
current working directory, so `fire` needs to be run from the directory
containing the compiled `libMyAnalysis.so`.
```python
{{#include ana-cfg.py}}
```
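Because the config builds the library path from `os.getcwd()`, running `fire` from a different directory points it at a file that does not exist. A small guard like the sketch below could turn that into an explicit error before the path is handed to `addLibrary`. The helper name `find_library` is hypothetical and is not part of ldmx-sw. It only wraps standard-library calls.

```python
import os


def find_library(name, directory=None):
    """Return the full path to a compiled library, or raise if it is missing.

    Hypothetical helper: defaults to the current working directory, which
    matches the assumption made by the config above.
    """
    directory = os.getcwd() if directory is None else directory
    path = os.path.join(directory, name)
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{path} not found; compile the analyzer or run from its directory"
        )
    return path
```

In the config, the returned path would take the place of the f-string argument, for example `ldmxcfg.Process.addLibrary(find_library('libMyAnalysis.so'))`.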
A quick test can show that the code is compiling and running
(although it will not print out anything or create any files).
```
$ denv 'g++ -fPIC -shared -o libMyAnalysis.so -lFramework -I$(root-config --incdir) MyAnalysis.cxx'
$ denv time fire ana-cfg.py
---- LDMXSW: Loading configuration --------
---- LDMXSW: Configuration load complete --------
---- LDMXSW: Starting event processing --------
---- LDMXSW: Event processing complete --------
1.99user 0.11system 0:02.10elapsed 100%CPU (0avgtext+0avgdata 320716maxresident)k
0inputs+0outputs (0major+58469minor)pagefaults 0swaps
```
Notice that we still took about 2s to run. This is because, even though `MyAnalyzer` isn't
doing anything with the data, `fire` is still looping through all of the events.

## Load Data, Manipulate Data, and Fill Histograms
While these steps were separate in the previous workflow, they all share the same process
in this workflow. The data loading is handled by `fire` and we are expected to write the
code that manipulates the data and fills the histograms.

I'm going to look at the total reconstructed energy in the ECal again. Below is the updated
code file. Notice that the class declaration is unchanged; all of the changes were
made within the function definitions.
```cpp
{{#include analyzer-total-rec-energy.cxx}}
```
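To make the numbers in `histograms_.create` concrete: 160 bins over 0 to 16 means each bin is 0.1 wide on the GeV axis. The sketch below works through the arithmetic for a single event in plain Python. It assumes, as the division by 1000 and the GeV axis label suggest, that hit energies are in MeV. The hit values and the helper `bin_index` are hypothetical. Note also that this helper is 0-based and returns `None` outside the range, whereas ROOT histograms number their bins from 1 and keep separate underflow and overflow bins.

```python
def bin_index(value, n_bins, low, high):
    """Return the 0-based bin index for value, or None if outside [low, high)."""
    if not (low <= value < high):
        return None
    return int((value - low) / (high - low) * n_bins)


# Hypothetical hit energies for one event, assumed to be in MeV.
hit_energies_mev = [1200.0, 850.5, 2300.0]

# Same manipulation as in analyze(): sum the hits, then convert to GeV.
total_gev = sum(hit_energies_mev) / 1000.0

# Same binning as in onProcessStart(): 160 bins from 0 to 16.
index = bin_index(total_gev, 160, 0.0, 16.0)
print(total_gev, index)
```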
In order to run this code on the data, we need to compile and run the program.
Again, putting your analyzer within ldmx-sw or ldmx-analysis gives you infrastructure that
shortens how much you type during this compilation step.
```
$ denv 'g++ -fPIC -shared -o libMyAnalysis.so -lFramework -I$(root-config --incdir) MyAnalysis.cxx'
$ denv time fire ana-cfg.py
---- LDMXSW: Loading configuration --------
---- LDMXSW: Configuration load complete --------
---- LDMXSW: Starting event processing --------
---- LDMXSW: Event processing complete --------
2.03user 0.12system 0:02.20elapsed 97%CPU (0avgtext+0avgdata 321300maxresident)k
0inputs+40outputs (0major+57977minor)pagefaults 0swaps
```
Now there is a new file `hist.root` in this directory which has the histogram we filled stored within it.
```
$ denv rootls -l hist.root
TDirectoryFile Apr 30 22:30 2024 my-ana "my-ana"
$ denv rootls -l hist.root:*
TH1F Apr 30 22:30 2024 my-ana_total_ecal_rec_energy ""
```
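The listing above suggests a naming pattern: the histogram sits in a directory named after the analyzer's `instanceName`, and the histogram's own name is prefixed with that same instance name. A tiny helper, hypothetical and inferred only from this one listing, can build the full path used to look the histogram up later:

```python
def histogram_key(instance_name, histogram_name):
    """Build '<instance>/<instance>_<name>', the pattern seen in the listing above."""
    return f"{instance_name}/{instance_name}_{histogram_name}"


key = histogram_key("my-ana", "total_ecal_rec_energy")
print(key)
```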
## Plotting Histograms
Viewing the histogram with the filled data is another fork in the road.
I often switch back to the Jupyter Lab approach using `uproot` to load histograms from `hist.root`
into `hist` objects that I can then manipulate and plot with `hist` and `matplotlib`.
~~~admonish tip title='Accessing Histograms in Jupyter Lab' collapsible=true
I usually hold the histogram file handle created by `uproot` and then use the `to_hist()` function
once I find the histogram I want to plot in order to pull the histogram into an object I am
familiar with. For example
```python
import uproot
f = uproot.open('hist.root')
# key is '<instance name>/<instance name>_<histogram name>'
h = f['my-ana/my-ana_total_ecal_rec_energy'].to_hist()
h.plot()
```
~~~
But it is also very common to view these histograms with one of ROOT's browsers.
There is a ROOT browser within the dev images (via `denv rootbrowse` or `ldmx rootbrowse`),
but I also like to view the histogram [with the online ROOT browser](https://root.cern.ch/js/latest/).
Below is a screenshot of my browser window after opening the `hist.root` file and then
selecting the histogram we filled.
![screenshot of JSROOT](jsroot-screenshot.png)
Binary file added src/analysis/output_11_0.png
Binary file added src/analysis/output_9_0.png