What you're doing already sounds like the best practice. The possible bottlenecks are (1) disk access rate, (2) decompression rate, (3) data-type interpretation rate, and (4) header-reading overhead. The files are large enough to address (4), as long as the TBaskets within the file are also large enough, which is a gotcha to look out for. If the branches are single-jagged (i.e. one level of variable-length structure, such as std::vector&lt;float&gt;), then interpretation (3) should be fast as well.
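To see whether (3) or (4) could bite, you can inspect each branch's interpretation and TBasket statistics directly. A minimal sketch, assuming uproot 4 or 5; the file path and the tree name "Events" are placeholders:

```python
import uproot

# Placeholder file path and tree name; substitute your own.
with uproot.open("data/file0.root") as f:
    tree = f["Events"]
    for name, branch in tree.items():
        comp = branch.compressed_bytes      # total compressed size of all TBaskets
        uncomp = branch.uncompressed_bytes  # total size after decompression
        print(
            f"{name}: {branch.num_baskets} baskets, "
            f"{uncomp / max(branch.num_baskets, 1) / 1e6:.2f} MB/basket uncompressed, "
            f"compression factor {uncomp / comp if comp else 1.0:.1f}, "
            f"interpreted as {branch.interpretation}"
        )
```

A branch split into thousands of tiny baskets would point at (4) even though the files themselves are large.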
There are also some things you can check that you can't do anything to improve, but at least you'd know that what you have is top speed, so you don't needlessly spend time on optimization.
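For instance, timing a full decode of one file bounds what any chunked workflow over those files can achieve. A minimal sketch, again assuming uproot 4 or 5, with the path and tree name as placeholders:

```python
import time
import uproot

path = "data/file0.root"  # placeholder path to one ~1 GB file

# Time (2) + (3): decompression and interpretation of every branch.
start = time.perf_counter()
with uproot.open(path) as f:
    arrays = f["Events"].arrays()  # "Events" is a placeholder tree name
elapsed = time.perf_counter() - start
print(f"full decode of one file: {elapsed:.1f} s")
```

Running it twice is informative: the second pass reads from the OS page cache, so the difference between the two runs approximates the raw disk cost.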
I wouldn't expect AFS to be a fast disk, since it's known as one of the slower (and older) ways of maintaining filesystem consistency on a distributed system. On lxplus, AFS is (was?) used for home directories, which are "small, but important" data. You want big datasets on a disk intended for them; I don't know what lxplus options exist. If you find that your limiting factor is raw disk access (i.e. just loading a junk file into RAM accounts for most of the time), then you'll want to ask the lxplus experts what your options are.
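To measure that raw disk rate in isolation, read the file as plain bytes, with no ROOT decoding at all (the path is again a placeholder):

```python
import time

path = "data/file0.root"  # placeholder: any ~1 GB file on the filesystem under test

start = time.perf_counter()
with open(path, "rb") as f:
    raw = f.read()  # pull the whole file into RAM, nothing else
elapsed = time.perf_counter() - start
print(f"{len(raw) / elapsed / 1e6:.0f} MB/s raw read")
```

If that alone accounts for most of the 300-400 s per chunk, the bottleneck is the filesystem rather than uproot.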
Hi
I was wondering what uproot experts/users recommend as the best way (for performance) to process large data when reading with uproot. I am trying to process data on lxplus from EOS, where I have 65 GB of data split over many ~1 GB ROOT files. From each of these ROOT files I am reading quite a lot of the jagged branches, and some of them are quite large (e.g. jets). At this stage I don't want to limit how much I read from each branch, so I would like to keep all these objects.

The way I process the files at the moment is by grouping them into chunks of 2 GB (or 4 GB, not much difference), running uproot.concatenate on each chunk, and dumping the contents out before reading the next chunk, roughly as in the sketch below. This takes O(300-400 s) per chunk on lxplus, which means I have to wait O(hours) to process these datasets. I tried moving the files onto AFS, where the code runs, but it is still slow. I also tested making a clean virtual environment to run the code in, but it is also still slow. Running on my local computer with 10 cores and 32 GB of RAM, each chunk takes ~30 s. Am I just limited by lxplus resources, or is there a way I can make the I/O faster?
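The loop looks roughly like this (a simplified sketch: the EOS URLs, the tree name "Events", the branch patterns, and process() stand in for my actual code):

```python
import uproot

# Placeholders: the actual EOS paths, tree name, and branch selection differ.
files = [f"root://eosuser.cern.ch//eos/user/me/data/file_{i}.root:Events"
         for i in range(65)]

files_per_chunk = 2  # ~1 GB per file, so ~2 GB per chunk
for i in range(0, len(files), files_per_chunk):
    chunk = uproot.concatenate(
        files[i : i + files_per_chunk],
        filter_name=["Jet_*", "Electron_*"],  # placeholder branch patterns
        library="ak",                         # awkward arrays for jagged branches
    )
    process(chunk)  # placeholder: dump the contents out
    del chunk       # release the chunk before reading the next one
```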