Merge pull request #30 from LDMX-Software/analysis-doc

Some example analysis workflows
LDMX-Software · May 1, 2024 · 98e6ddb · 98e6ddb
2 parents e1448d1 + 4d49e40
commit 98e6ddb
Showing 10 changed files with 883 additions and 0 deletions.
diff --git a/src/SUMMARY.md b/src/SUMMARY.md
@@ -1,6 +1,9 @@
 [Welcome](index.md)
 
 # Getting Started
+- [Analyzing ldmx-sw Event Files](analysis/intro.md)
+  - [Using Python (Efficiently)](analysis/python.md)
+  - [Using ldmx-sw Directly](analysis/ldmx-sw.md)
 - [Building and Installing ldmx-sw](building/intro.md)
   - [Shared Computing Clusters](building/clusters.md)
   - [Updating ldmx-sw](building/updating.md)

diff --git a/src/analysis/ana-cfg.py b/src/analysis/ana-cfg.py
@@ -0,0 +1,14 @@
+from LDMX.Framework import ldmxcfg
+p = ldmxcfg.Process('ana')
+import os
+# needs to match path to where compiled library is
+# deduced automatically if built and installed alongside ldmx-sw
+ldmxcfg.Process.addLibrary(f'{os.getcwd()}/libMyAnalysis.so')
+class MyAnalysis:
+    def __init__(self):
+        self.instanceName = 'my-ana'
+        self.className = 'MyAnalyzer' # match class name in source file
+        self.histograms = []
+p.sequence = [ MyAnalysis() ]
+p.inputFiles = [ 'events.root' ]
+p.histogramFile = 'hist.root'
diff --git a/src/analysis/analyzer-boilerplate.cxx b/src/analysis/analyzer-boilerplate.cxx
@@ -0,0 +1,22 @@
+#include "Framework/EventProcessor.h"
+
+#include "Ecal/Event/EcalHit.h"
+
+class MyAnalyzer : public framework::Analyzer {
+ public:
+  MyAnalyzer(const std::string& name, framework::Process& p)
+    : framework::Analyzer(name, p) {}
+  ~MyAnalyzer() = default;
+  void onProcessStart() final;
+  void analyze(const framework::Event& event) final;
+};
+
+void MyAnalyzer::onProcessStart() {
+  // this is where we will define the histograms we want to fill
+}
+
+void MyAnalyzer::analyze(const framework::Event& event) {
+  // this is where we will fill the histograms
+}
+
+DECLARE_ANALYZER(MyAnalyzer);
diff --git a/src/analysis/analyzer-total-rec-energy.cxx b/src/analysis/analyzer-total-rec-energy.cxx
@@ -0,0 +1,31 @@
+#include "Framework/EventProcessor.h"
+
+#include "Ecal/Event/EcalHit.h"
+
+class MyAnalyzer : public framework::Analyzer {
+ public:
+  MyAnalyzer(const std::string& name, framework::Process& p)
+    : framework::Analyzer(name, p) {}
+  ~MyAnalyzer() = default;
+  void onProcessStart() final;
+  void analyze(const framework::Event& event) final;
+};
+
+void MyAnalyzer::onProcessStart() {
+  getHistoDirectory(); // required: if forgotten, no output histogram is written and no error is raised
+  histograms_.create(
+      "total_ecal_rec_energy",
+      "Total ECal Rec Energy [GeV]", 160, 0.0, 16.0
+  );
+}
+
+void MyAnalyzer::analyze(const framework::Event& event) {
+  const auto& ecal_rec_hits{event.getCollection<ldmx::EcalHit>("EcalRecHits")};
+  double total = 0.0;
+  for (const auto& hit : ecal_rec_hits) {
+    total += hit.getEnergy();
+  }
+  histograms_.fill("total_ecal_rec_energy", total/1000.); // divide by 1000 so the value matches the GeV axis
+}
+
+DECLARE_ANALYZER(MyAnalyzer);
diff --git a/src/analysis/intro.md b/src/analysis/intro.md
@@ -0,0 +1,50 @@
+# Analyzing ldmx-sw Event Files
+
+Often when first starting on LDMX, people are given a `.root` file produced by ldmx-sw
+(or some method for producing their own file). This leads to the natural
+next question -- how do I analyze the data in this file?
+
+Many answers to this question have been given and many of them work.
+In the following sections of this chapter of the website, I focus
+on two possible analysis workflows that I personally
+like (for different reasons).
+
+I am not able to cover every possible type of analysis,
+so I am writing this guide in the context of one of the most common analyses:
+making and looking at a histogram. This type of analysis can be broken into four steps.
+1. Load data: from a data file, load the information in that file into memory.
+2. Data Manipulation: from the data already present, calculate the variable
+    that you would like to put into the histogram.
+3. Histogram Filling: define how the histogram should be binned and fill
+    the histogram with the variable that you calculated.
+4. Plotting: from the definition of the histogram and its content,
+    draw the histogram in a visual manner for inspection.
+
+The software that is used to do each of these steps is what mainly separates
+the different analysis workflows, so I am first going to mention various
+software tools that are helpful for one or more of these steps. I have separated
+these tools into two "ecosystems", both of which are popularly used within
+HEP.
+
+Purpose | ROOT | scikit-hep
+---|---|---
+Load Data | `TFile`,`TTree` | `uproot`
+Manipulate Data | `C++` Code | Vectorized Python with `awkward`
+Fill Histograms | `TH1*` | `hist`,`boost_histogram`
+Plot Histograms | `TBrowser`, `TCanvas` | `matplotlib`,`mplhep`
+
+How one mixes and matches these tools is a personal choice,
+especially since the LDMX collaboration has not landed on a
+widely agreed-upon method for analysis. With this in mind,
+I find it important to emphasize that the following subsections
+are **examples** of analysis workflows -- a specific analyzer
+can choose to mix and match the tools however they like.
+
+### Caveat
+While I (Tom Eichlersmith) have focused on two potential analysis
+workflows, I do not mean to claim that I have tried all of them
+and these two are "the best". I just mean to say that I have
+drifted to these two analysis workflows over time as I've looked
+for easier and better methods of analyzing data. If you have
+an analysis workflow that you would like to share, add another
+subsection to this chapter of the website!
diff --git a/src/analysis/jsroot-screenshot.png b/src/analysis/jsroot-screenshot.png
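As a rough illustration of the four steps listed in `analysis/intro.md` (load, manipulate, fill, plot), here is a minimal sketch that uses only the Python standard library. It is not either of the workflows from the chapter: the function names are hypothetical, the hit energies are invented stand-in numbers rather than LDMX data, and the text "plot" only stands in for a real plotting tool such as `matplotlib`.

```python
# Sketch of the four analysis steps using only the Python standard library.
# All names are hypothetical and the hit energies are invented, not LDMX data.

def load_events():
    """Step 1 (load): return per-event lists of hit energies in MeV."""
    return [
        [1200.0, 800.0, 2500.0],
        [4000.0, 3500.0],
        [],
        [7900.0, 100.0],
    ]

def total_energy_gev(hit_energies_mev):
    """Step 2 (manipulate): sum the hit energies and convert MeV to GeV."""
    return sum(hit_energies_mev) / 1000.0

def fill_histogram(values, nbins, low, high):
    """Step 3 (fill): count values in nbins equal-width bins on [low, high)."""
    counts = [0] * nbins
    width = (high - low) / nbins
    for value in values:
        if low <= value < high:  # values outside the range are dropped
            counts[int((value - low) / width)] += 1
    return counts

def plot_text(counts, low, high):
    """Step 4 (plot): draw one row of '#' characters per bin."""
    width = (high - low) / len(counts)
    rows = []
    for i, count in enumerate(counts):
        left = low + i * width
        rows.append(f"[{left:5.1f}, {left + width:5.1f}) {'#' * count}")
    return "\n".join(rows)

events = load_events()
totals = [total_energy_gev(hits) for hits in events]
counts = fill_histogram(totals, nbins=4, low=0.0, high=16.0)
print(plot_text(counts, 0.0, 16.0))
```

In the ROOT or scikit-hep ecosystems, each of these hand-written functions would be replaced by the corresponding tool from the table in `intro.md`.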
diff --git a/src/analysis/ldmx-sw.md b/src/analysis/ldmx-sw.md
@@ -0,0 +1,120 @@
+# Using ldmx-sw Directly
+
+As you may be able to pick up from my tone, I prefer the (efficient) python analysis
+method from the previous chapter, which uses `uproot`, `awkward`, and `hist`. Nevertheless, there are two
+main reasons I've had for putting an analysis into the ldmx-sw C++.
+1. **Longevity**: While the python ecosystem evolving quickly is helpful for obtaining
+  new features, it can also cause long-running projects to "break" -- forcing a refactor
+  that is purely due to upstream code changes. Writing an analysis into C++, with its
+  more rigid view on backwards compatibility, essentially solidifies it so that it can
+  be run in the same way for a long time into the future.
+2. **Non-Vectorization**: In the previous chapter, I made a big deal about how most
+  analyses can be vectorized and written in terms of these pre-compiled and fast
+  functions. That being said, sometimes it's difficult to figure out _how_ to write
+  an analysis in a vectorizable form. Dropping down into the C++ allows analyzers
+  to write the `for` loop themselves which may be necessary for an analysis to
+  be understandable (or even functional).
+
+The following example is **stand-alone** in the same sense as the prior chapter's
+jupyter notebook. It can be run from outside of the ldmx-sw repository; however,
+this directly contradicts the first reason for writing an analysis like this
+in the first place. I would recommend that you store your analyzer source code
+in ldmx-sw or the private ldmx-analysis. These repos also contain CMake infrastructure
+so you can avoid having to write the long `g++` command necessary to compile
+and link the source code.
+
+## Set Up
+I am going to use the same version of ldmx-sw that was used to generate
+the input `events.root` file. This isn't strictly necessary - more often
+than not, newer ldmx-sw versions are able to read files from prior ldmx-sw
+versions.
+```
+cd work-area
+denv init ldmx/pro:v3.3.6
+```
+The following code block shows the necessary boilerplate for starting
+a C++ analyzer running with ldmx-sw (compiled below from a file named `MyAnalysis.cxx`).
+```cpp
+{{#include analyzer-boilerplate.cxx}}
+```
+And below is an example python config called `ana-cfg.py` that I will
+use to run this analyzer with `fire`. It looks for the compiled library
+in the current working directory, so run `fire` from the directory where
+you compiled the library (here, the same place as the source file).
+```python
+{{#include ana-cfg.py}}
+```
+A quick test can show that the code is compiling and running
+(although it will not print out anything or create any files).
+```
+$ denv 'g++ -fPIC -shared -o libMyAnalysis.so -lFramework -I$(root-config --incdir) MyAnalysis.cxx'
+$ denv time fire ana-cfg.py
+---- LDMXSW: Loading configuration --------
+---- LDMXSW: Configuration load complete  --------
+---- LDMXSW: Starting event processing --------
+---- LDMXSW: Event processing complete  --------
+1.99user 0.11system 0:02.10elapsed 100%CPU (0avgtext+0avgdata 320716maxresident)k
+0inputs+0outputs (0major+58469minor)pagefaults 0swaps
+```
+Notice that we still took about 2s to run. This is because, even though `MyAnalyzer` isn't
+doing anything with the data, `fire` is still looping through all of the events.
+
+## Load Data, Manipulate Data, and Fill Histograms
+While these steps were separate in the previous workflow, they all share the same process
+in this workflow. The data loading is handled by `fire` and we are expected to write the
+code that manipulates the data and fills the histograms.
+
+I'm going to look at the total reconstructed energy in the ECal again. Below is the updated
+code file. Notice that the boilerplate already has the `#include` for `EcalHit` at the top,
+so all of the changes were made within the function definitions.
+```cpp
+{{#include analyzer-total-rec-energy.cxx}}
+```
+
+In order to run this code on the data, we need to compile and run the program.
+Again, putting your analyzer within ldmx-sw or ldmx-analysis gives you infrastructure that
+shortens how much you type during this compilation step.
+
+```
+$ denv 'g++ -fPIC -shared -o libMyAnalysis.so -lFramework -I$(root-config --incdir) MyAnalysis.cxx'
+$ denv time fire ana-cfg.py
+---- LDMXSW: Loading configuration --------
+---- LDMXSW: Configuration load complete  --------
+---- LDMXSW: Starting event processing --------
+---- LDMXSW: Event processing complete  --------
+2.03user 0.12system 0:02.20elapsed 97%CPU (0avgtext+0avgdata 321300maxresident)k
+0inputs+40outputs (0major+57977minor)pagefaults 0swaps
+```
+
+Now there is a new file `hist.root` in this directory which has the histogram we filled stored within it.
+
+```
+$ denv rootls -l hist.root
+TDirectoryFile  Apr 30 22:30 2024 my-ana  "my-ana"
+$ denv rootls -l hist.root:*
+TH1F  Apr 30 22:30 2024 my-ana_total_ecal_rec_energy  ""
+```
+
+## Plotting Histograms
+Viewing the histogram with the filled data is another fork in the road.
+I often switch back to the Jupyter Lab approach using `uproot` to load histograms from `hist.root`
+into `hist` objects that I can then manipulate and plot with `hist` and `matplotlib`.
+
+~~~admonish tip title='Accessing Histograms in Jupyter Lab' collapsible=true
+I usually hold the histogram file handle created by `uproot` and then, once I find the histogram I want to plot,
+use the `to_hist()` function to pull it into an object I am familiar with. For example
+```python
+import uproot
+f = uproot.open('hist.root')
+h = f['my-ana/my-ana_total_ecal_rec_energy'].to_hist()
+h.plot()
+```
+~~~
+
+But it is also very common to view these histograms with one of ROOT's browsers.
+There is a root browser within the dev images (via `denv rootbrowse` or `ldmx rootbrowse`),
+but I also like to view the histogram [with the online ROOT browser](https://root.cern.ch/js/latest/).
+Below is a screenshot of my browser window after opening the `hist.root` file and then
+selecting the histogram we filled.
+
+![screenshot of JSROOT](jsroot-screenshot.png)
diff --git a/src/analysis/output_11_0.png b/src/analysis/output_11_0.png
diff --git a/src/analysis/output_9_0.png b/src/analysis/output_9_0.png
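To illustrate why the empty `MyAnalyzer` run in `analysis/ldmx-sw.md` still takes about 2s, and where per-event filling fits in, here is a toy Python model of an event-processing loop. It is not the ldmx-sw API: `ToyAnalyzer`, `run`, and the dictionary-based events are hypothetical stand-ins, the hit energies are invented, and the histogram key only copies the `my-ana_total_ecal_rec_energy` naming seen in the `rootls` output.

```python
# Toy model of an event-processing loop. This is NOT the ldmx-sw API:
# ToyAnalyzer, run, and the event dictionaries are hypothetical stand-ins.

class ToyAnalyzer:
    def __init__(self, instance_name):
        self.instance_name = instance_name
        # The key copies the "<instance>_<name>" pattern from the rootls output.
        self.key = f"{instance_name}_total_ecal_rec_energy"
        self.histograms = {}  # histogram name -> list of filled values
        self.events_seen = 0

    def on_process_start(self):
        # Runs once before the loop, like booking histograms in onProcessStart.
        self.histograms[self.key] = []

    def analyze(self, event):
        # Runs once per event: sum the hit energies (MeV) and fill in GeV.
        self.events_seen += 1
        total_mev = sum(event["EcalRecHits"])
        self.histograms[self.key].append(total_mev / 1000.0)


def run(analyzer, events):
    """Call the start hook once, then hand every event to the analyzer."""
    analyzer.on_process_start()
    for event in events:  # every event is visited, even if analyze does nothing
        analyzer.analyze(event)


# Three events with invented hit energies in MeV.
events = [
    {"EcalRecHits": [1500.0, 2500.0]},
    {"EcalRecHits": []},
    {"EcalRecHits": [8000.0]},
]
ana = ToyAnalyzer("my-ana")
run(ana, events)
print(ana.events_seen, ana.histograms)
```

The point of the sketch is the division of labor: the loop belongs to the framework, so the cost of visiting every event is paid whether or not `analyze` does any work.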