ServiceX Exploratory Notebook (#4)
* Basic readme updates

* First config to get sx going (does not work yet!)

* Ignore vscode settings for now

* Ignore dumb error

* The sx yaml file

* Basics of how we can do extraction from a datasample

* Added text

* Updated readme

* Apply Alex's comments
gordonwatts authored Mar 31, 2024
1 parent fa97782 commit 49c6a55
Showing 5 changed files with 313 additions and 1 deletion.
5 changes: 5 additions & 0 deletions .gitignore
@@ -161,3 +161,8 @@ cython_debug/

# custom
*.root.*

# vscode
.vscode/

servicex.yaml
15 changes: 14 additions & 1 deletion README.md
@@ -2,12 +2,25 @@

Targeting analysis at 200 Gbps with ATLAS PHYSLITE. This repository is very much a work in progress.

ATLAS does not have released OpenData, so there isn't an AGC we can copy and try to run. As a result, this repository's main purpose is as a facilities test:

* Run from PHYSLITE
* Load data at 200 Gbps from the PHYSLITE samples
* Push all that data downstream to Dask (or similar) workers.
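
To give a feel for what the 200 Gbps target means in practice, here is a minimal arithmetic sketch. The function name `seconds_to_read` and the 100 TB sample size are illustrative assumptions, not values from this repository.

```python
def seconds_to_read(dataset_terabytes: float, rate_gbps: float = 200.0) -> float:
    """Time in seconds to stream a dataset at a sustained network rate."""
    # Convert gigabits per second to bytes per second (8 bits per byte).
    bytes_per_second = rate_gbps * 1e9 / 8
    # Convert terabytes (10^12 bytes) to bytes and divide by the rate.
    return dataset_terabytes * 1e12 / bytes_per_second


# 200 Gbps is 25 GB/s; the 100 TB figure below is a hypothetical sample size.
print(seconds_to_read(100.0))
```

At the target rate, every 25 GB of PHYSLITE data corresponds to roughly one second of wall time, which is the scale the downstream workers would have to keep up with.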

## Description of files

- `size_per_branch.ipynb`: produce breakdown of branch sizes for given file
- `branch_sizes.json`: branch size breakdown, produced by `size_per_branch.ipynb`
- `materialize_branches.ipynb`: read list of branches, distributable with Dask (use for benchmarking)

## Usage

When run on the UChicago AF Jupyter Notebook, no package installs are required.

There is a `requirements.txt` which should allow this to be run on a bare-bones machine (modulo location of files, etc.).

If you are going to use the `servicex` version, you have to pin `dask_awkward==2024.2.0`. Later versions have a [bug](https://github.com/dask-contrib/dask-awkward/issues/456) which hasn't been fixed yet.
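
A notebook could guard against an accidental upgrade with a small check like the one below. This is a sketch only: the helper names are hypothetical, and it assumes the package is installed under the PyPI distribution name `dask-awkward`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional

# The pin taken from requirements.txt.
REQUIRED_DASK_AWKWARD = "2024.2.0"


def installed_dask_awkward() -> Optional[str]:
    """Return the installed dask-awkward version, or None if it is missing."""
    try:
        return version("dask-awkward")
    except PackageNotFoundError:
        return None


def dask_awkward_pin_ok(installed: Optional[str],
                        required: str = REQUIRED_DASK_AWKWARD) -> bool:
    """True only when the installed version matches the pin exactly."""
    return installed == required


found = installed_dask_awkward()
if not dask_awkward_pin_ok(found):
    print(f"Expected dask_awkward=={REQUIRED_DASK_AWKWARD}, found {found}")
```

An exact match is used because the pin in `requirements.txt` is an exact `==` pin rather than an upper bound.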

## Acknowledgements

11 changes: 11 additions & 0 deletions requirements.txt
@@ -0,0 +1,11 @@
jupyterlab
servicex
awkward
hist[dask]
# Necessary due to bug in uproot/dask-awkward that prevents
# ak.concat working.
dask_awkward==2024.2.0
uproot
# Get the version with PHYSLITE support sort-of built in
func_adl_servicex_xaodr21>=2.0a1
ipywidgets
270 changes: 270 additions & 0 deletions servicex/00-exploring-the-data.ipynb
@@ -0,0 +1,270 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring The Data\n",
"\n",
"Looking at the data to see how to access enough columns to make this relevant."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Using release 21.2.231\n"
]
}
],
"source": [
"from func_adl_servicex_xaodr21 import atlas_release\n",
"# TODO: Update to use R22/23 or whatever.\n",
"from func_adl_servicex_xaodr21 import SXDSAtlasxAODR21\n",
"\n",
"from hist.dask import Hist\n",
"import dask_awkward as dak\n",
"\n",
"print(f'Using release {atlas_release}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set up the dataset we will use for testing."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"ttbar_all_rucio_dataset_name = \"mc23_13p6TeV.601229.PhPy8EG_A14_ttbar_hdamp258p75_SingleLep.deriv.DAOD_PHYSLITE.e8514_s4162_r14622_p6026\"\n",
"ttbar_all = f\"rucio://{ttbar_all_rucio_dataset_name}?files=1\"\n",
"ds = SXDSAtlasxAODR21(ttbar_all, backend='atlasr22')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ServiceX Query\n",
"\n",
"Do an event-level query, so lists of jets, MET, etc., are all at the top level."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:root:Fetched the default calibration configuration for a query. It should have been intentionally configured - using configuration for data format PHYS\n"
]
}
],
"source": [
"# TODO: The EventInfo argument should default correctly (that may just be a matter of using func_adl xaod r22)\n",
"# TODO: dataclass should be supported so as not to lose type-following!\n",
"query = (ds\n",
" .Select(lambda e: {\n",
" 'evt': e.EventInfo(\"EventInfo\"),\n",
" 'jet': e.Jets(\"AnalysisJets\", calibrate=False)\n",
" })\n",
" .Select(lambda ei: {\n",
" 'event_number': ei.evt.eventNumber(),\n",
" 'run_number': ei.evt.runNumber(),\n",
" 'jet_pt': ei.jet.Select(lambda j: j.pt()/1000)\n",
" })\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We do not have tight integration into `dask_awkward` until there is extra code working, so let's grab all the data."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Start by grabbing the data as an awkward array\n",
"# TODO: Files should remain in the S3 cache and be read directly from there\n",
"data = query.AsAwkwardArray().value()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plots\n",
"\n",
"Next, let's make plots of everything."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Quick construction, no other imports needed:\n",
"h = (\n",
" Hist.new.Reg(20, 0, 100000000, name=\"x\", label=\"x-axis\")\n",
" .Int64()\n",
")\n",
"r1 = h.fill(data.event_number)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Quick construction, no other imports needed:\n",
"h = (\n",
" Hist.new.Reg(20, 0, 200, name=\"x\", label=\"Jet $p_T$\")\n",
" .Int64()\n",
")\n",
"r2 = h.fill(dak.flatten(data.jet_pt))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<html>\n",
"<div style=\"display:flex; align-items:center;\">\n",
"<div style=\"width:290px;\">\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"-10 -105 270 120\">\n",
"<line x1=\"-5\" y1=\"0\" x2=\"255\" y2=\"0\" style=\"fill:none;stroke-width:2;stroke:currentColor\"/>\n",
"<text text-anchor=\"middle\" x=\"0\" y=\"15\" style=\"fill:currentColor;\">\n",
"0\n",
"</text>\n",
"<text text-anchor=\"middle\" x=\"250\" y=\"15\" style=\"fill:currentColor;\">\n",
"1e+08\n",
"</text>\n",
"<text text-anchor=\"middle\" x=\"125.0\" y=\"15\" style=\"fill:currentColor;\">\n",
"x-axis\n",
"</text>\n",
"<polyline points=\" 0,0 0,-0 12.5,-0 12.5,-0 25,-0 25,-0 37.5,-0 37.5,-0 50,-0 50,-0 62.5,-0 62.5,-0 75,-0 75,-0 87.5,-0 87.5,-0 100,-0 100,-0 112.5,-0 112.5,-0 125,-0 125,-0 137.5,-0 137.5,-0 150,-0 150,-0 162.5,-0 162.5,-0 175,-0 175,-0 187.5,-0 187.5,-100 200,-100 200,-0 212.5,-0 212.5,-0 225,-0 225,-0 237.5,-0 237.5,-0 250,-0 250,0\" style=\"fill:none; stroke:currentColor;\"/>\n",
"</svg>\n",
"</div>\n",
"<div style=\"flex=grow:1;\">\n",
"Regular(20, 0, 1e+08, name='x', label='x-axis')<br/>\n",
"<hr style=\"margin-top:.2em; margin-bottom:.2em;\"/>\n",
"Int64() Σ=150000.0\n",
"\n",
"</div>\n",
"</div>\n",
"</html>"
],
"text/plain": [
"Hist(Regular(20, 0, 1e+08, name='x', label='x-axis'), storage=Int64()) # Sum: 150000.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r1.compute()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<html>\n",
"<div style=\"display:flex; align-items:center;\">\n",
"<div style=\"width:290px;\">\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"-10 -105 270 120\">\n",
"<line x1=\"-5\" y1=\"0\" x2=\"255\" y2=\"0\" style=\"fill:none;stroke-width:2;stroke:currentColor\"/>\n",
"<text text-anchor=\"middle\" x=\"0\" y=\"15\" style=\"fill:currentColor;\">\n",
"0\n",
"</text>\n",
"<text text-anchor=\"middle\" x=\"250\" y=\"15\" style=\"fill:currentColor;\">\n",
"200\n",
"</text>\n",
"<text text-anchor=\"middle\" x=\"125.0\" y=\"15\" style=\"fill:currentColor;\">\n",
"Jet $p_T$\n",
"</text>\n",
"<polyline points=\" 0,0 0,-2.02 12.5,-2.02 12.5,-100 25,-100 25,-59.4 37.5,-59.4 37.5,-31.2 50,-31.2 50,-22.1 62.5,-22.1 62.5,-17.1 75,-17.1 75,-13.7 87.5,-13.7 87.5,-10.6 100,-10.6 100,-8.31 112.5,-8.31 112.5,-6.49 125,-6.49 125,-5.12 137.5,-5.12 137.5,-4 150,-4 150,-3.04 162.5,-3.04 162.5,-2.39 175,-2.39 175,-1.88 187.5,-1.88 187.5,-1.45 200,-1.45 200,-1.16 212.5,-1.16 212.5,-0.881 225,-0.881 225,-0.726 237.5,-0.726 237.5,-0.596 250,-0.596 250,0\" style=\"fill:none; stroke:currentColor;\"/>\n",
"</svg>\n",
"</div>\n",
"<div style=\"flex=grow:1;\">\n",
"Regular(20, 0, 200, name='x', label='Jet $p_T$')<br/>\n",
"<hr style=\"margin-top:.2em; margin-bottom:.2em;\"/>\n",
"Int64() Σ=1435200.0 <em>(1450989.0 with flow)</em>\n",
"\n",
"</div>\n",
"</div>\n",
"</html>"
],
"text/plain": [
"Hist(Regular(20, 0, 200, name='x', label='Jet $p_T$'), storage=Int64()) # Sum: 1435200.0 (1450989.0 with flow)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r2.compute()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
13 changes: 13 additions & 0 deletions servicex/README.md
@@ -0,0 +1,13 @@
# Introduction

This directory contains scripts and notebooks to implement fetching the data locally using ServiceX.

The default `servicex.yaml` file from the UChicago AF was used.

Note the version pin in `requirements.txt`: a bug in `dask_awkward` means this can't run on the most recent version.

## Files

| File | Description |
|------|-------------|
| 00-exploring-the-data | Outlines the raw ServiceX code that we can use. We'll need to develop libraries which will obscure this code quite a bit given how many branches we'll need to load. This notebook can't run on the most recent version of `dask_awkward` - until [this bug](https://github.com/dask-contrib/dask-awkward/issues/456) is fixed. |
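
As one small illustration of the kind of helper such a library might provide, the dataset URI that the notebook builds by hand (`rucio://<name>?files=1`) could be wrapped in a function. The name `rucio_uri` and its signature are hypothetical; only the URI shape is taken from the notebook.

```python
from typing import Optional


def rucio_uri(dataset_name: str, n_files: Optional[int] = None) -> str:
    """Build a rucio:// dataset URI, optionally limited to n_files files."""
    if n_files is not None and n_files < 1:
        raise ValueError("n_files must be at least 1")
    uri = f"rucio://{dataset_name}"
    # Append the files query parameter only when a limit was requested.
    return uri if n_files is None else f"{uri}?files={n_files}"


# Dataset name as used in 00-exploring-the-data.
ttbar = (
    "mc23_13p6TeV.601229.PhPy8EG_A14_ttbar_hdamp258p75_SingleLep"
    ".deriv.DAOD_PHYSLITE.e8514_s4162_r14622_p6026"
)
print(rucio_uri(ttbar, n_files=1))
```

Keeping the file limit as an argument makes it easy to move from a single-file test to the full sample without editing the dataset string.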
