Merge pull request #30 from LDMX-Software/analysis-doc
Some example analysis workflows
Showing 10 changed files with 883 additions and 0 deletions.
@@ -0,0 +1,14 @@
from LDMX.Framework import ldmxcfg
p = ldmxcfg.Process('ana')
import os
# needs to match path to where compiled library is
# deduced automatically if built and installed alongside ldmx-sw
ldmxcfg.Process.addLibrary(f'{os.getcwd()}/libMyAnalysis.so')
class MyAnalysis:
    def __init__(self):
        self.instanceName = 'my-ana'
        self.className = 'MyAnalyzer' # match class name in source file
        self.histograms = []
p.sequence = [ MyAnalysis() ]
p.inputFiles = [ 'events.root' ]
p.histogramFile = 'hist.root'
@@ -0,0 +1,22 @@
#include "Framework/EventProcessor.h"

#include "Ecal/Event/EcalHit.h"

class MyAnalyzer : public framework::Analyzer {
 public:
  MyAnalyzer(const std::string& name, framework::Process& p)
    : framework::Analyzer(name, p) {}
  ~MyAnalyzer() = default;
  void onProcessStart() final;
  void analyze(const framework::Event& event) final;
};

void MyAnalyzer::onProcessStart() {
  // this is where we will define the histograms we want to fill
}

void MyAnalyzer::analyze(const framework::Event& event) {
  // this is where we will fill the histograms
}

DECLARE_ANALYZER(MyAnalyzer);
@@ -0,0 +1,31 @@
#include "Framework/EventProcessor.h"

#include "Ecal/Event/EcalHit.h"

class MyAnalyzer : public framework::Analyzer {
 public:
  MyAnalyzer(const std::string& name, framework::Process& p)
    : framework::Analyzer(name, p) {}
  ~MyAnalyzer() = default;
  void onProcessStart() final;
  void analyze(const framework::Event& event) final;
};

void MyAnalyzer::onProcessStart() {
  getHistoDirectory(); // forget this -> silently no output histogram
  histograms_.create(
      "total_ecal_rec_energy",
      "Total ECal Rec Energy [GeV]", 160, 0.0, 16.0
  );
}

void MyAnalyzer::analyze(const framework::Event& event) {
  const auto& ecal_rec_hits{event.getCollection<ldmx::EcalHit>("EcalRecHits")};
  double total = 0.0;
  for (const auto& hit : ecal_rec_hits) {
    total += hit.getEnergy();
  }
  histograms_.fill("total_ecal_rec_energy", total/1000.);
}

DECLARE_ANALYZER(MyAnalyzer);
@@ -0,0 +1,50 @@
# Analyzing ldmx-sw Event Files

Often when first starting on LDMX, people are given a `.root` file produced by ldmx-sw
(or some method for producing their own file). This leads to the reasonable
next question: how do I analyze the data in this file?

Many answers to this question have been given, and many of them are functional.
In subsequent sections of this chapter of the website, I choose to focus
on highlighting two possible analysis workflows that I personally
like (for different reasons).

I am not able to fully cover all of the possible different types of analysis,
so I am writing this guide in the context of one of the most common analyses:
looking at a histogram. This type of analysis can be broken into four steps.
1. Load Data: from a data file, load the information in that file into memory.
2. Data Manipulation: from the data already present, calculate the variable
   that you would like to put into the histogram.
3. Histogram Filling: define how the histogram should be binned and fill
   the histogram with the variable that you calculated.
4. Plotting: from the definition of the histogram and its content,
   draw the histogram in a visual manner for inspection.
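
As a rough illustration of these four steps (and not either of the workflows described in this chapter), here is a minimal sketch that uses only the Python standard library. Everything in it is made up for illustration: the in-memory `events` list stands in for a real `.root` file, the energies are assumed to be in MeV, and `fill_histogram` is a hypothetical helper rather than part of any library.

```python
# Step 1 (load data): stand-in data already in memory.
# A real workflow would read these values from a .root file instead.
events = [
    [1500.0, 2500.0],        # hit energies (assumed MeV) for one event
    [800.0, 200.0, 1000.0],
    [3900.0],
]

# Step 2 (data manipulation): one number per event, converted MeV -> GeV.
totals_gev = [sum(hits) / 1000.0 for hits in events]

# Step 3 (histogram filling): fixed-width bins over [low, high).
NBINS, LOW, HIGH = 4, 0.0, 8.0


def fill_histogram(values, nbins, low, high):
    """Count how many values land in each of nbins equal-width bins."""
    counts = [0] * nbins
    for value in values:
        if low <= value < high:  # values outside the range are dropped
            counts[int((value - low) * nbins / (high - low))] += 1
    return counts


counts = fill_histogram(totals_gev, NBINS, LOW, HIGH)

# Step 4 (plotting): a crude text rendering of the bin contents.
width = (HIGH - LOW) / NBINS
for i, count in enumerate(counts):
    left = LOW + i * width
    print(f"[{left:4.1f}, {left + width:4.1f}) GeV | {'#' * count}")
```

The point of the sketch is only that the four steps are separable: each one could be swapped for a different tool without touching the others.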

The software that is used to do each of these steps is what mainly separates
the different analysis workflows, so I am first going to mention various
software tools that are helpful for one or more of these steps. I have separated
these tools into two "ecosystems", both of which are popularly used within
HEP.

Purpose | ROOT | scikit-hep
---|---|---
Load Data | `TFile`, `TTree` | `uproot`
Manipulate Data | `C++` Code | Vectorized Python with `awkward`
Fill Histograms | `TH1*` | `hist`, `boost_histogram`
Plot Histograms | `TBrowser`, `TCanvas` | `matplotlib`, `mplhep`

How one mixes and matches these tools is a personal choice,
especially since the LDMX collaboration has not landed on a
widely agreed-upon method for analysis. With this in mind,
I find it important to emphasize that the following subsections
are **examples** of analysis workflows -- a specific analyzer
can choose to mix and match the tools however they like.

### Caveat
While I (Tom Eichlersmith) have focused on two potential analysis
workflows, I do not mean to claim that I have tried all of them
and these two are "the best". I just mean to say that I have
drifted to these two analysis workflows over time as I've looked
for easier and better methods of analyzing data. If you have
an analysis workflow that you would like to share, add another
subsection to this chapter of the website!
@@ -0,0 +1,120 @@
# Using ldmx-sw Directly

As you may have picked up from my tone, I prefer the (efficient) Python analysis
method shown previously, which uses `uproot`, `awkward`, and `hist`. Nevertheless, I have
had two main reasons for putting an analysis into the ldmx-sw C++.
1. **Longevity**: While the rapid evolution of the Python ecosystem is helpful for obtaining
   new features, it can also cause long-running projects to "break" -- forcing a refactor
   that is purely due to upstream code changes. Writing an analysis in C++, with its
   more rigid view on backwards compatibility, essentially solidifies it so that it can
   be run in the same way for a long time into the future.
2. **Non-Vectorization**: In the previous chapter, I made a big deal about how most
   analyses can be vectorized and written in terms of pre-compiled and fast
   functions. That being said, sometimes it's difficult to figure out _how_ to write
   an analysis in a vectorized form. Dropping down into C++ allows analyzers
   to write the `for` loop themselves, which may be necessary for an analysis to
   be understandable (or even functional).
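
To give a feel for the second reason, here is a small sketch of a calculation that is awkward to vectorize because each step depends on the result of the previous ones. It is written in Python only to keep it short; the same loop would live inside `analyze` in a C++ analyzer. The `greedy_cluster` function and its inputs are hypothetical and are not an LDMX algorithm.

```python
def greedy_cluster(positions, max_dist):
    """Assign each position, in order, to the nearest existing cluster whose
    running centroid is within max_dist, or start a new cluster otherwise.

    The centroids move as hits are added, so the result for one hit
    depends on every assignment made before it.
    """
    clusters = []  # each cluster is a list of the positions assigned to it
    for x in positions:
        best_cluster = None
        best_dist = None
        for cluster in clusters:
            centroid = sum(cluster) / len(cluster)
            dist = abs(x - centroid)
            if dist <= max_dist and (best_dist is None or dist < best_dist):
                best_cluster, best_dist = cluster, dist
        if best_cluster is None:
            clusters.append([x])
        else:
            best_cluster.append(x)
    return clusters


clusters = greedy_cluster([0.0, 1.0, 5.0, 0.75, 5.5], max_dist=1.0)
print(clusters)
```

Written as an explicit loop, the logic reads top to bottom; expressing the moving centroids as whole-array operations would take considerably more thought.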

The following example is **stand-alone** in the same sense as the prior chapter's
Jupyter notebook. It can be run from outside of the ldmx-sw repository; however,
doing so works against the first reason (longevity) for writing an analysis like this
in the first place. I would recommend that you store your analyzer source code
in ldmx-sw or the private ldmx-analysis. These repos also contain CMake infrastructure
so you can avoid having to write the long `g++` command necessary to compile
and link the source code.

## Set Up
I am going to use the same version of ldmx-sw that was used to generate
the input `events.root` file. This isn't strictly necessary - more often
than not, newer ldmx-sw versions are able to read files from prior ldmx-sw
versions.
```
cd work-area
denv init ldmx/pro:v3.3.6
```
The following code block shows the necessary boilerplate for starting
a C++ analyzer running with ldmx-sw.
```cpp
{{#include analyzer-boilerplate.cxx}}
```
And below is an example Python config called `ana-cfg.py` that I will
use to run this analyzer with `fire`. It assumes that `fire` is run from
the directory holding the compiled library, since it looks for the library
in the current working directory.
```python
{{#include ana-cfg.py}}
```
A quick test can show that the code is compiling and running
(although it will not print out anything or create any files).
```
$ denv 'g++ -fPIC -shared -o libMyAnalysis.so -lFramework -I$(root-config --incdir) MyAnalysis.cxx'
$ denv time fire ana-cfg.py
---- LDMXSW: Loading configuration --------
---- LDMXSW: Configuration load complete --------
---- LDMXSW: Starting event processing --------
---- LDMXSW: Event processing complete --------
1.99user 0.11system 0:02.10elapsed 100%CPU (0avgtext+0avgdata 320716maxresident)k
0inputs+0outputs (0major+58469minor)pagefaults 0swaps
```
Notice that we still took about 2s to run. This is because, even though `MyAnalyzer` isn't
doing anything with the data, `fire` is still looping through all of the events.

## Load Data, Manipulate Data, and Fill Histograms
While these steps were separate in the previous workflow, they all share the same process
in this workflow. The data loading is handled by `fire` and we are expected to write the
code that manipulates the data and fills the histograms.

I'm going to look at the total reconstructed energy in the ECal again. Below is the updated
code file. Notice that all of the changes were made within the function definitions;
the `#include` lines at the top are unchanged from the boilerplate.
```cpp
{{#include analyzer-total-rec-energy.cxx}}
```
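To make the division of labor concrete, here is a toy Python stand-in for the pattern: a loop (the part `fire` owns) calls one function once at the start and another once per event. This is only a sketch of the control flow, not how ldmx-sw is implemented. `ToyHistogramPool`, `ToyAnalyzer`, `run`, and the dictionary-based events are all hypothetical names, and the hit energies are assumed to be in MeV, as the division by 1000 in the C++ suggests.

```python
class ToyHistogramPool:
    """Hypothetical stand-in for the histogram helper used by the analyzer."""

    def __init__(self):
        self._hists = {}

    def create(self, name, nbins, low, high):
        self._hists[name] = {
            "nbins": nbins, "low": low, "high": high, "counts": [0] * nbins,
        }

    def fill(self, name, value):
        h = self._hists[name]
        if h["low"] <= value < h["high"]:  # out-of-range values are dropped
            index = int((value - h["low"]) * h["nbins"] / (h["high"] - h["low"]))
            h["counts"][index] += 1

    def counts(self, name):
        return self._hists[name]["counts"]


class ToyAnalyzer:
    def __init__(self):
        self.histograms = ToyHistogramPool()

    def on_process_start(self):
        # same binning as the C++ example: 160 bins from 0 to 16 GeV
        self.histograms.create("total_ecal_rec_energy", 160, 0.0, 16.0)

    def analyze(self, event):
        total = 0.0
        for energy in event["EcalRecHits"]:
            total += energy
        self.histograms.fill("total_ecal_rec_energy", total / 1000.0)


def run(analyzer, events):
    """Toy version of the event loop that the framework owns."""
    analyzer.on_process_start()
    for event in events:
        analyzer.analyze(event)


events = [
    {"EcalRecHits": [1500.0, 2500.0]},  # 4.0 GeV in total
    {"EcalRecHits": [4000.0]},          # 4.0 GeV in total
    {"EcalRecHits": []},                # 0.0 GeV in total
]
ana = ToyAnalyzer()
run(ana, events)
counts = ana.histograms.counts("total_ecal_rec_energy")
print(counts[40], counts[0])
```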
In order to run this code on the data, we need to compile and run the program.
Again, putting your analyzer within ldmx-sw or ldmx-analysis gives you infrastructure that
shortens how much you type during this compilation step.
```
$ denv 'g++ -fPIC -shared -o libMyAnalysis.so -lFramework -I$(root-config --incdir) MyAnalysis.cxx'
$ denv time fire ana-cfg.py
---- LDMXSW: Loading configuration --------
---- LDMXSW: Configuration load complete --------
---- LDMXSW: Starting event processing --------
---- LDMXSW: Event processing complete --------
2.03user 0.12system 0:02.20elapsed 97%CPU (0avgtext+0avgdata 321300maxresident)k
0inputs+40outputs (0major+57977minor)pagefaults 0swaps
```
Now there is a new file `hist.root` in this directory which has the histogram we filled stored within it.
```
$ denv rootls -l hist.root
TDirectoryFile Apr 30 22:30 2024 my-ana "my-ana"
$ denv rootls -l hist.root:*
TH1F Apr 30 22:30 2024 my-ana_total_ecal_rec_energy ""
```
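Judging only from the `rootls` listing above, the histogram appears to sit in a directory named after the analyzer's `instanceName`, with that same name prefixed to the histogram name. The helper below is a hypothetical convenience that encodes this observed pattern; it is inferred from this one listing and is not a documented ldmx-sw rule.

```python
def histogram_key(instance_name, histogram_name):
    """Build the in-file path '<instance>/<instance>_<histogram>'.

    Assumption: this layout is inferred from the rootls output above.
    """
    return f"{instance_name}/{instance_name}_{histogram_name}"


key = histogram_key("my-ana", "total_ecal_rec_energy")
print(key)  # my-ana/my-ana_total_ecal_rec_energy
```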
## Plotting Histograms
Viewing the histogram with the filled data is another fork in the road.
I often switch back to the Jupyter Lab approach using `uproot` to load histograms from `hist.root`
into `hist` objects that I can then manipulate and plot with `hist` and `matplotlib`.
~~~admonish tip title='Accessing Histograms in Jupyter Lab' collapsible=true
I usually hold the histogram file handle created by `uproot` and then use the `to_hist()` function
once I find the histogram I want to plot in order to pull the histogram into an object I am
familiar with. For example,
```python
f = uproot.open('hist.root')
h = f['my-ana/my-ana_total_ecal_rec_energy'].to_hist()
h.plot()
```
~~~
But it is also very common to view these histograms with one of ROOT's browsers.
There is a ROOT browser within the dev images (via `denv rootbrowse` or `ldmx rootbrowse`),
but I also like to view the histogram [with the online ROOT browser](https://root.cern.ch/js/latest/).
Below is a screenshot of my browser window after opening the `hist.root` file and then
selecting the histogram we filled.
![screenshot of JSROOT](jsroot-screenshot.png)