Replies: 6 comments 1 reply
-
It might be a disappointing answer, but at least I can save you some time by saying it: no, you can't extend an existing TTree. It's not a feature of the codebase (right now, and probably forever). When we create the TTree object ourselves, we can ensure that it's in a particular state, and therefore confidently extend it. But if we adopt TTrees made by ROOT, we have to fully understand all of their possible states, so that we don't take something from a state we don't understand into an erroneous state. Since a changing TTree is a moving target, it's not even easy to produce a repeatable error report. (That was the situation with objects in TDirectories: in order to support uproot.update, which Uproot 3 couldn't do, we had to fully understand ROOT's disk allocation algorithm and all the states it could get itself into. It's difficult because the testing examples we make ourselves likely don't represent all the possible ways files can be made.)
-
Hi,
Thanks for your feedback. I can try to better explain my problem.
I am referring to data coming from high-energy physics at CERN. I have a different file for each spill of particles, but I want to produce just one file for each data taking (which is composed of many spills).
At the moment I am able to produce a tree at the end of the run by first joining all the ASCII files and then storing the data into the ROOT file. But if I want to dynamically append data as they're produced, in real time, do you have any suggestion?
I also took a look at uproot.concatenate, which can join different TTree objects, even if I cannot completely understand how to exploit the returned object.
Thanks in advance
Stefano Carsi
-
Dear Jim,
thanks again for your response.
Actually, ASCII files are not a good idea: we have something like 1000 good events per spill, and for each event we acquire (at least) one 32-channel digitizer, and for each channel we export 1024 points per waveform, so... lots of data.
The idea of using a TTree object comes from two reasons:
* the need to compress the data
* a TTree can be used both from Python and from ROOT
If I understand right, what you suggested is to open the file with uproot.recreate at the beginning of the run and, for each spill, to call uproot.TTree.extend. I have a couple of questions:
* First of all, when are data actually stored? Are data pushed into the file each time I call the extend method?
* Since I plan to save the data in a shared folder that all my colleagues can access, is it a problem if someone tries to load a TTree file that is still open for writing in my script? If it's not a problem, will they be able to see the data up to the last call of extend? It would be nice if my colleagues could load the TTree from their own scripts and, as each new spill arrives, access the newer data by re-running the same script. As I explain later, if they had to load all the ASCII files, they could not "simply" skip the waveforms, whereas with a TTree they can select which branches to access.
Otherwise, could you provide a piece of code to write what uproot.concatenate returns into a single TTree object? Is it so different from iterating over each TTree file and then using the extend method? Finally, someone told me that ROOT has a special tool, "hadd", for "appending data" to a TTree object, as I was wondering... do you have any further suggestions?
In conclusion, after my explanations, may I ask what your final suggestion is?
Take into account that the waveform data account for most of the weight, but usually no one wants to deal with them (the ASCII files also store some other data, e.g. the pulse height and timing information of each waveform, which are enough for most analyses).
I will also run some performance tests in order to decide what the best approach is.
Thanks again in advance
Kind regards
Stefano Carsi
From: Jim Pivarski
Sent: Monday, 11 July 2022 13:47
Subject: Re: [scikit-hep/uproot5] Appending data in a TTree object (Discussion #649)
If the intermediate files are small, ASCII might not be a bad choice. Formats intended for large datasets are more efficient in the long run, but if you use them to make small datasets, they can be worse. ROOT files have a lot of sources of overhead, from the file header to TDirectories, to the TTree header.
If you want to go directly to a ROOT file, you can keep the process open. A single process can continue to accumulate data to an output file. But even then, it's possible to make one file inefficient by calling uproot.TTree.extend <https://uproot.readthedocs.io/en/latest/uproot.writing.writable.WritableTree.html#extend> many times: each call to extend makes a new TBasket (an object within the file, with its own overhead), and it's only efficient if TBaskets are large. So you'd have to fill arrays in memory, which might extend over multiple spills, before writing a large batch as a TBasket with extend. Compared to that, writing small ASCII files that you later collect may be simpler and more robust.
What uproot.concatenate <https://uproot.readthedocs.io/en/latest/uproot.behaviors.TBranch.concatenate.html> does is read multiple files and give you a concatenated array of all their data. It does not write concatenated files. Of course, you could use it as a way to accumulate: if you've written many small ROOT files, concatenate would show them all to you as a single array and you could write that to an output file as a single TBasket with extend...
-
Data are actually stored when the extend method is called. The ROOT file is in an invalid state for the shortest possible time during that method call, and we insert data before inserting pointers to it, so the TTree object is valid but not up-to-date during most of the writing. If another reader reads old data, it should be okay.
I was going to suggest HDF5 (designed for sharing big arrays, usable from C++ and Python) until I noticed you said "at least" one channel, which sounds like you have variable-length data.
Since the data associated with each spill is big, what about one TTree per spill, which is readable while collecting spills and can be concatenated later? A TTree has a lot of header metadata associated with it, so it would be a bad idea to fill a new TTree for a small amount of data, but your spill datasets are big.
Take these as suggestions and try them out. If it were my own problem, I wouldn't settle on a method until I had tried some things and found out just how big the data are compared with the headers.
-
Dear Jim,
thanks again for your help.
Of course I will run some tests to decide what the best solution is for me, based on your suggestions.
For your information, I do not have variable-length data: my ASCII files are always a "matrix" with an unknown number of lines, which can vary spill by spill, but always the same number of columns. I was referring to the fact that for different setups we can have one or two digitizers (32 or 64 channels), but this does not vary during a data taking.
Just to conclude, and I promise I won't bother you again: when you suggest putting the data in different TTrees, do you mean different directories of the same file (which I can do with the update method), or different files, one per spill?
My concern is that if I create a different file for each spill and then merge them at a later moment, my colleagues have to use different files for the same data.
It is a real pity that I cannot open an existing file and simply call the extend method, but I understand the technical complications.
Best regards
Stefano Carsi
-
I meant that different TTrees in the same file (different names in the TDirectory) and different TTrees in different files are both options. Considering that you want people to be able to read this while it's being written, it will be more robust if it is more granular, anyway. Reading an object that is in the process of being updated is inherently unstable. It can be solved by "locking" (preventing readers from reading while the data are being written) or by distinguishing between old objects, which can be freely read, and new objects, which are in the process of being written and are not ready yet.
Since you say that the data have fixed size, HDF5 becomes a much more attractive option. The format and the ecosystem of tools around it are very mature, and it addresses your exact problems. See, for instance, h5cpp for C++ and h5py for Python. The compression options are generalized into a "filter pipeline" of algorithms applied to the data between when it's in memory and when it's on disk, there are workflows for shared reading while writing in chunks (the "old data/new data" split I described above, but presented as a view of a single large array), and it's widely used in supercomputing applications. (All the references to "MPI" are for massive scale-out of non-embarrassingly-parallel tasks, which is overkill for most HEP problems.)
The one major drawback of HDF5 is that it only considers rectangular arrays (even sparse and chunked arrays, which are variable-length on disk, are at least logically rectangular). But that happens to fit your use case, so when you're exploring options, it would be a good idea to keep HDF5 in mind.
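As a sketch of the HDF5 route (the dataset name and spill size of 10 events are made up; the real data would be closer to 1000 events per spill, 32 channels, 1024 samples), h5py's chunked, resizable datasets plus SWMR ("single writer, multiple reader") mode cover the write-while-others-read workflow described above:

```python
import h5py
import numpy as np

with h5py.File("run.h5", "w", libver="latest") as f:
    dset = f.create_dataset(
        "waveforms",
        shape=(0, 32, 1024),           # events x channels x samples
        maxshape=(None, 32, 1024),     # unlimited along the event axis
        chunks=(100, 32, 1024),        # one chunk is one compression unit
        compression="gzip",            # a "filter pipeline" stage
    )
    f.swmr_mode = True                 # from here on, readers may attach

    for spill in range(3):             # pretend each spill has 10 events
        new_events = np.zeros((10, 32, 1024), dtype=dset.dtype)
        n = dset.shape[0]
        dset.resize(n + len(new_events), axis=0)
        dset[n:] = new_events
        dset.flush()                   # make the new rows visible to readers
```

Readers would open the same file with `h5py.File("run.h5", "r", libver="latest", swmr=True)` and see a consistent, growing array.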
-
Hi everyone,
I would like to ask you how I can append data to a ROOT file.
I noticed that there exists an extend method, but it can only be used on a writing object (created with create/recreate). What if I open a tree with open/update and I would like to append some data to that file?
Does anyone have a solution for me?
Thanks in advance