What you're doing already sounds like the best practice. The possible bottlenecks are (1) disk access rate, (2) decompression rate, (3) data-type interpretation rate, and (4) header-reading overhead. The files are large enough to address (4), as long as the TBaskets within the file are also large enough, which is a gotcha to look out for. If the branches are single-jagged (i.e. one level of variable-length structure, such as std::vector&lt;float&gt;), then interpretation (3) should be fast as well.
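To see whether (3) or (4) could bite, you can inspect each branch's interpretation and TBasket statistics directly. A minimal sketch, assuming uproot 4 or 5; the file path and the tree name "Events" are placeholders:

```python
import uproot

# Placeholder file path and tree name; substitute your own.
with uproot.open("data/file0.root") as f:
    tree = f["Events"]
    for name, branch in tree.items():
        comp = branch.compressed_bytes      # total compressed size of all TBaskets
        uncomp = branch.uncompressed_bytes  # total size after decompression
        print(
            f"{name}: {branch.num_baskets} baskets, "
            f"{uncomp / max(branch.num_baskets, 1) / 1e6:.2f} MB/basket uncompressed, "
            f"compression factor {uncomp / comp if comp else 1.0:.1f}, "
            f"interpreted as {branch.interpretation}"
        )
```

A branch split into thousands of tiny baskets would point at (4) even though the files themselves are large.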
There are also some things you can check that you can't do anything to improve, but at least you'd know that what you have is top speed, so you don't needlessly spend time on optimization.
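For instance, timing a full decode of one file bounds what any chunked workflow over those files can achieve. A minimal sketch, again assuming uproot 4 or 5, with the path and tree name as placeholders:

```python
import time
import uproot

path = "data/file0.root"  # placeholder path to one ~1 GB file

# Time (2) + (3): decompression and interpretation of every branch.
start = time.perf_counter()
with uproot.open(path) as f:
    arrays = f["Events"].arrays()  # "Events" is a placeholder tree name
elapsed = time.perf_counter() - start
print(f"full decode of one file: {elapsed:.1f} s")
```

Running it twice is informative: the second pass reads from the OS page cache, so the difference between the two runs approximates the raw disk cost.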
I wouldn't expect AFS to be a fast disk, since it's known as one of the slower (and older) ways of maintaining filesystem consistency on a distributed system. On lxplus, AFS is (was?) used for home directories, which are "small, but important" data. You want big datasets on a disk intended for them; I don't know what lxplus options exist. If you find that your limiting factor is raw disk access (i.e. just loading a junk file into RAM accounts for most of the time), then you'll want to ask the lxplus experts what your options are.
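To measure that raw disk rate in isolation, read the file as plain bytes, with no ROOT decoding at all (the path is again a placeholder):

```python
import time

path = "data/file0.root"  # placeholder: any ~1 GB file on the filesystem under test

start = time.perf_counter()
with open(path, "rb") as f:
    raw = f.read()  # pull the whole file into RAM, nothing else
elapsed = time.perf_counter() - start
print(f"{len(raw) / elapsed / 1e6:.0f} MB/s raw read")
```

If that alone accounts for most of the 300-400 s per chunk, the bottleneck is the filesystem rather than uproot.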
Hi
I was wondering what uproot experts/users recommend as the best way (for performance) to process large data when reading with uproot. I am trying to process data on lxplus from EOS, where I have 65 GB of data split over many ~1 GB ROOT files. From each of these ROOT files I am reading quite a lot of the jagged branches, and some of them are quite large (e.g. jets). At this stage I don't want to limit how much I read from each branch, so I would like to keep all these objects.

The way I process the files at the moment is by grouping them into chunks of 2 GB (or 4 GB, not much difference), running uproot.concatenate on each chunk, and dumping the contents out before reading the next chunk, roughly as in the sketch below. This takes O(300-400 s) per chunk on lxplus, which means I have to wait O(hours) to process these datasets. I tried moving the files onto AFS, where the code runs, but it is still slow. I also tested making a clean virtual environment to run the code in, but it is also still slow. Running on my local computer with 10 cores and 32 GB of RAM, each chunk takes ~30 s. Am I just limited by lxplus resources, or is there a way I can make the I/O faster?
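The loop looks roughly like this (a simplified sketch: the EOS URLs, the tree name "Events", the branch patterns, and process() stand in for my actual code):

```python
import uproot

# Placeholders: the actual EOS paths, tree name, and branch selection differ.
files = [f"root://eosuser.cern.ch//eos/user/me/data/file_{i}.root:Events"
         for i in range(65)]

files_per_chunk = 2  # ~1 GB per file, so ~2 GB per chunk
for i in range(0, len(files), files_per_chunk):
    chunk = uproot.concatenate(
        files[i : i + files_per_chunk],
        filter_name=["Jet_*", "Electron_*"],  # placeholder branch patterns
        library="ak",                         # awkward arrays for jagged branches
    )
    process(chunk)  # placeholder: dump the contents out
    del chunk       # release the chunk before reading the next one
```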