ServiceX Exploratory Notebook #4

Merged
merged 9 commits into from
Mar 31, 2024
5 changes: 5 additions & 0 deletions .gitignore
@@ -161,3 +161,8 @@ cython_debug/

# custom
*.root.*

# vscode
.vscode/

servicex.yaml
15 changes: 14 additions & 1 deletion README.md
@@ -2,12 +2,25 @@

Targeting analysis at 200 Gbps with ATLAS PHYSLITE. This repository is very much a work in progress.

Description of files:
ATLAS has not released OpenData, so there is no existing AGC that we can copy and try to run. As a result, this repository's main purpose is to serve as a facilities test:

* Run from PHYSLITE
* Read 200 Gbps from the PHYSLITE samples
* Push all of that data downstream to Dask (or similar) workers
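
To put the 200 Gbps target in perspective, the following sketch converts the rate into gigabytes per second and estimates how long a sample of a given size would take to read. The 100 TB sample size is a made-up figure for illustration, not a number from this repository, and the helper names are hypothetical.

```python
# Back-of-the-envelope arithmetic for the 200 Gbps target.
# The dataset size used below is a hypothetical value for illustration only.


def gbps_to_gigabytes_per_second(rate_gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return rate_gbps / 8.0


def seconds_to_read(size_tb: float, rate_gbps: float) -> float:
    """Seconds needed to read `size_tb` terabytes (decimal: 1 TB = 1000 GB)."""
    return size_tb * 1000.0 / gbps_to_gigabytes_per_second(rate_gbps)


if __name__ == "__main__":
    rate = 200.0  # target rate from this README, in Gbps
    size = 100.0  # hypothetical dataset size, in TB
    print(f"{rate} Gbps = {gbps_to_gigabytes_per_second(rate)} GB/s")
    print(f"{size} TB takes about {seconds_to_read(size, rate) / 60:.1f} minutes")
```

This assumes the full rate is sustained end to end; in practice storage, network, and worker throughput would each need to keep up.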

## Description of files

- `size_per_branch.ipynb`: produces a breakdown of branch sizes for a given file
- `branch_sizes.json`: output produced by `size_per_branch.ipynb`
- `materialize_branches.ipynb`: reads a list of branches; distributable with Dask (used for benchmarking)

## Usage

When run in a Jupyter notebook on the UChicago AF, no package installs are required.

There is a `requirements.txt` that should allow this to run on a bare-bones machine (modulo location of files, etc.).

If you are going to use the `servicex` version, you have to pin `dask_awkward==2024.2.0`. Later versions have a [bug](https://github.com/dask-contrib/dask-awkward/issues/456) that has not been fixed yet.
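
As a quick sanity check before running the `servicex` notebooks, something like the sketch below could confirm that the installed `dask_awkward` matches the pin. It uses only the standard-library `importlib.metadata`; the helper names are hypothetical and not part of this repository, and the distribution name `dask-awkward` is assumed to be the one registered in your environment.

```python
# Check whether the installed dask_awkward matches the pinned version.
# Helper names are illustrative; only importlib.metadata is a real API here.
from importlib.metadata import PackageNotFoundError, version

PINNED = "2024.2.0"  # pin stated in this README and in requirements.txt


def matches_pin(installed: str, pinned: str = PINNED) -> bool:
    """Return True if the installed version string equals the pinned one."""
    return installed.strip() == pinned


def installed_version(package: str = "dask-awkward"):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("dask_awkward is not installed")
    elif matches_pin(found):
        print(f"dask_awkward {found} matches the pin")
    else:
        print(f"dask_awkward {found} found; expected {PINNED}")
```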

## Acknowledgements

11 changes: 11 additions & 0 deletions requirements.txt
@@ -0,0 +1,11 @@
jupyterlab
servicex
awkward
hist[dask]
# Necessary due to a bug in uproot/dask-awkward that prevents
# ak.concatenate from working.
dask_awkward==2024.2.0
uproot
# Get the version with PHYSLITE support sort-of built in
func_adl_servicex_xaodr21>=2.0a1
ipywidgets
270 changes: 270 additions & 0 deletions servicex/00-exploring-the-data.ipynb
@@ -0,0 +1,270 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring The Data\n",
"\n",
"Looking at the data to see how to access enough columns to make this relevant."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Using release 21.2.231\n"
]
}
],
"source": [
"from func_adl_servicex_xaodr21 import atlas_release\n",
"# TODO: Update to use R22/23 or whatever.\n",
"from func_adl_servicex_xaodr21 import SXDSAtlasxAODR21\n",
"\n",
"from hist.dask import Hist\n",
"import dask_awkward as dak\n",
"\n",
"print(f'Using release {atlas_release}')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Set up the dataset we will use for testing."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"ttbar_all_rucio_dataset_name = \"mc23_13p6TeV.601229.PhPy8EG_A14_ttbar_hdamp258p75_SingleLep.deriv.DAOD_PHYSLITE.e8514_s4162_r14622_p6026\"\n",
"ttbar_all = f\"rucio://{ttbar_all_rucio_dataset_name}?files=1\"\n",
"ds = SXDSAtlasxAODR21(ttbar_all, backend='atlasr22')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## ServiceX Query\n",
"\n",
"Do an event-level query, so that lists of jets, MET, etc. all sit at the top level."
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"WARNING:root:Fetched the default calibration configuration for a query. It should have been intentionally configured - using configuration for data format PHYS\n"
]
}
],
"source": [
"# TODO: The EventInfo argument should default correctly (that may just be a matter of using func_adl xaod r22)\n",
"# TODO: dataclass should be supported so as not to lose type-following!\n",
"query = (ds\n",
" .Select(lambda e: {\n",
" 'evt': e.EventInfo(\"EventInfo\"),\n",
" 'jet': e.Jets(\"AnalysisJets\", calibrate=False)\n",
" })\n",
" .Select(lambda ei: {\n",
" 'event_number': ei.evt.eventNumber(),\n",
" 'run_number': ei.evt.runNumber(),\n",
" 'jet_pt': ei.jet.Select(lambda j: j.pt()/1000)\n",
" })\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will not have tight integration with `dask_awkward` until some extra code is working, so let's grab all the data."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Start by grabbing the data as an awkward array\n",
"# TODO: Files should remain in the S3 cache and be read directly from there\n",
"data = query.AsAwkwardArray().value()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Plots\n",
"\n",
"Next, let's make plots of everything."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"# Quick construction, no other imports needed:\n",
"h = (\n",
" Hist.new.Reg(20, 0, 100000000, name=\"x\", label=\"x-axis\")\n",
" .Int64()\n",
")\n",
"r1 = h.fill(data.event_number)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Quick construction, no other imports needed:\n",
"h = (\n",
" Hist.new.Reg(20, 0, 200, name=\"x\", label=\"Jet $p_T$\")\n",
" .Int64()\n",
")\n",
"r2 = h.fill(dak.flatten(data.jet_pt))"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<html>\n",
"<div style=\"display:flex; align-items:center;\">\n",
"<div style=\"width:290px;\">\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"-10 -105 270 120\">\n",
"<line x1=\"-5\" y1=\"0\" x2=\"255\" y2=\"0\" style=\"fill:none;stroke-width:2;stroke:currentColor\"/>\n",
"<text text-anchor=\"middle\" x=\"0\" y=\"15\" style=\"fill:currentColor;\">\n",
"0\n",
"</text>\n",
"<text text-anchor=\"middle\" x=\"250\" y=\"15\" style=\"fill:currentColor;\">\n",
"1e+08\n",
"</text>\n",
"<text text-anchor=\"middle\" x=\"125.0\" y=\"15\" style=\"fill:currentColor;\">\n",
"x-axis\n",
"</text>\n",
"<polyline points=\" 0,0 0,-0 12.5,-0 12.5,-0 25,-0 25,-0 37.5,-0 37.5,-0 50,-0 50,-0 62.5,-0 62.5,-0 75,-0 75,-0 87.5,-0 87.5,-0 100,-0 100,-0 112.5,-0 112.5,-0 125,-0 125,-0 137.5,-0 137.5,-0 150,-0 150,-0 162.5,-0 162.5,-0 175,-0 175,-0 187.5,-0 187.5,-100 200,-100 200,-0 212.5,-0 212.5,-0 225,-0 225,-0 237.5,-0 237.5,-0 250,-0 250,0\" style=\"fill:none; stroke:currentColor;\"/>\n",
"</svg>\n",
"</div>\n",
"<div style=\"flex=grow:1;\">\n",
"Regular(20, 0, 1e+08, name='x', label='x-axis')<br/>\n",
"<hr style=\"margin-top:.2em; margin-bottom:.2em;\"/>\n",
"Int64() Σ=150000.0\n",
"\n",
"</div>\n",
"</div>\n",
"</html>"
],
"text/plain": [
"Hist(Regular(20, 0, 1e+08, name='x', label='x-axis'), storage=Int64()) # Sum: 150000.0"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r1.compute()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<html>\n",
"<div style=\"display:flex; align-items:center;\">\n",
"<div style=\"width:290px;\">\n",
"<svg xmlns=\"http://www.w3.org/2000/svg\" viewBox=\"-10 -105 270 120\">\n",
"<line x1=\"-5\" y1=\"0\" x2=\"255\" y2=\"0\" style=\"fill:none;stroke-width:2;stroke:currentColor\"/>\n",
"<text text-anchor=\"middle\" x=\"0\" y=\"15\" style=\"fill:currentColor;\">\n",
"0\n",
"</text>\n",
"<text text-anchor=\"middle\" x=\"250\" y=\"15\" style=\"fill:currentColor;\">\n",
"200\n",
"</text>\n",
"<text text-anchor=\"middle\" x=\"125.0\" y=\"15\" style=\"fill:currentColor;\">\n",
"Jet $p_T$\n",
"</text>\n",
"<polyline points=\" 0,0 0,-2.02 12.5,-2.02 12.5,-100 25,-100 25,-59.4 37.5,-59.4 37.5,-31.2 50,-31.2 50,-22.1 62.5,-22.1 62.5,-17.1 75,-17.1 75,-13.7 87.5,-13.7 87.5,-10.6 100,-10.6 100,-8.31 112.5,-8.31 112.5,-6.49 125,-6.49 125,-5.12 137.5,-5.12 137.5,-4 150,-4 150,-3.04 162.5,-3.04 162.5,-2.39 175,-2.39 175,-1.88 187.5,-1.88 187.5,-1.45 200,-1.45 200,-1.16 212.5,-1.16 212.5,-0.881 225,-0.881 225,-0.726 237.5,-0.726 237.5,-0.596 250,-0.596 250,0\" style=\"fill:none; stroke:currentColor;\"/>\n",
"</svg>\n",
"</div>\n",
"<div style=\"flex=grow:1;\">\n",
"Regular(20, 0, 200, name='x', label='Jet $p_T$')<br/>\n",
"<hr style=\"margin-top:.2em; margin-bottom:.2em;\"/>\n",
"Int64() Σ=1435200.0 <em>(1450989.0 with flow)</em>\n",
"\n",
"</div>\n",
"</div>\n",
"</html>"
],
"text/plain": [
"Hist(Regular(20, 0, 200, name='x', label='Jet $p_T$'), storage=Int64()) # Sum: 1435200.0 (1450989.0 with flow)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r2.compute()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": ".venv",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
13 changes: 13 additions & 0 deletions servicex/README.md
@@ -0,0 +1,13 @@
# Introduction

This directory contains scripts and notebooks that fetch the data locally using ServiceX.

The default `servicex.yaml` file from the UChicago AF was used.

Note the version pin in `requirements.txt`: a bug in `dask_awkward` means this cannot run on the most recent version.

## Files

| File | Description |
|------|-------------|
| 00-exploring-the-data | Outlines the raw ServiceX code that we can use. We'll need to develop libraries that hide much of this code, given how many branches we'll need to load. This notebook can't run on the most recent version of `dask_awkward` until [this bug](https://github.com/dask-contrib/dask-awkward/issues/456) is fixed. |