-
Notifications
You must be signed in to change notification settings - Fork 75
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: working TList serialization #763
Conversation
Since I saw your message in Slack, I started looking into this, but it seems that you're ahead of me: yes, the TObject will need its parameters. (Doesn't the TObject constructor make them, though?) TObjString is heavily tested in the testing suite, but not TList. I'm a little surprised that TList is even implemented, but I suppose it might be because it was needed to implement something else, such as histograms. (Writing TObjString, descendants of TH1, and simple and jagged TTrees have been implemented; nothing else has been attempted.) It's very likely that TList's Even though you've found something that works, you should remain suspicious of it because it might be working by accident. You're filling the What I was looking at before seeing this PR is the fact that each item in a TList has a uproot5/src/uproot/models/TList.py Lines 103 to 105 in baa82c9
Python In your testing, you can test read-back first with Uproot, then (when that works) with ROOT. Uproot is generally more forgiving of wrong formats, since it reads as little as it needs to and doesn't do all of the consistency checks that ROOT does. Also, it's easier to use Uproot in a debugger or insert print-outs (my favorite method of debugging, since Python loads quickly and doesn't have a long compilation phase), so you can more easily examine which bytes the reader is seeing when it's expecting to be building a given structure. Once it works in Uproot, you still have to test it in ROOT, of course, because it might still be wrong. I hope this is useful! I'll keep tabs on your progress, and go ahead and ask questions on #awkward-uproot in Slack as you go. Good luck! |
@jpivarski this might be a good place to discuss; should a bare import uproot
print(f"{uproot.__version__=}")
filename = "list.root"
key = "listOfStrings"
with uproot.recreate(filename) as f:
f[key] = uproot.writing.identify.to_TString("hello world!")
with uproot.open(filename) as f:
print(f"keys: {f.keys()}")
print(f"key: {f[key]}")
``` |
I also noticed this interesting behaviour that may explain why this has slipped under the radar:
|
I don't think that bare TString objects can be put in TDirectories (even in ROOT, that is). I think that's why TObjString exists: it inherits from TObject so that it has the
The |
Oh, that sounds familiar, actually.
Good to see a silly mistake motivating a new feature ;) |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
After some work I became pretty stuck.
A simple root script to generate this basic list: auto f = TFile::Open("list.root", "RECREATE");
auto l = new TList();
l->Add(new TObject()); // I also tried setting a non-zero unique ID for the TObject, deserialization still works
l->Write("myList", TObject::kSingleKey); // if TObject::kSingleKey is not specified, each item on the list is written as a key to the root file
f->Close();
import uproot
tobject = uproot.models.TObject.Model_TObject.empty()
tobject._members["@fUniqueID"] = 0
tobject._members["@fBits"] = 0
mylist = uproot.writing.identify.to_TList([tobject])
filename = "list.root"
key = "mylist"
with uproot.recreate(filename) as f:
f[key] = mylist which will produce an error like:
I added a check for this but it's not the problem in this case. I am not sure how to proceed, unless I am missing something the problem seems a bit too deep for me yet. Any tips are welcome @jpivarski @agoose77 ty. |
It might be helpful to write a |
@lobis it looks like the problem is related to the reading and writing of the list item options in Locally, I added a simple def string(data):
"""
Converts a Python string into bytes, ready to be written to a file.
If the string's byte representation (UTF-8) has fewer than 255 bytes, it
is preceded by a 1-byte length; otherwise, it is preceded by ``b'\xff'`` and a
4-byte length.
"""
return bytestring(
data.encode(errors="surrogateescape")
)
def bytestring(data):
"""
Converts Python bytes into a length-prefixed bytestring, ready to be written to a file.
If the string's byte representation (UTF-8) has fewer than 255 bytes, it
is preceded by a 1-byte length; otherwise, it is preceded by ``b'\xff'`` and a
4-byte length.
"""
length = len(data)
if length < 255:
return struct.pack(">B%ds" % length, length, data)
else:
return struct.pack(">BI%ds" % length, 255, length, data) Then, changing for datum, option in zip(self._data, self._options):
uproot.serialization._serialize_object_any(out, datum, None)
out.append(uproot.serialization.bytestring(option)) # Fixed line fixes the reading. I think this seems fairly reasonable; if we're able to read a ROOT file correctly, then our reading logic should be correct. This implies that we want to read a length-prefixed bytestring here. And, that seems intuitive; it's the simplest means of encoding a string. |
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
…ation to use the new `bytestring` helper" This reverts commit 8e6ad2e.
The CI here is failing because we don't upper-bound our Awkward version, and it's picking up the V2 RC candidate. This surprised me, until I realised that this happens when the dependency specifier already includes a pre-release, e.g. |
This reverts commit 897972f. # Conflicts: # src/uproot/serialization.py
for more information, see https://pre-commit.ci
This reverts commit 9a500c1.
…serialization to use the new `bytestring` helper"" This reverts commit 884ec4d.
Thanks, I was wondering what was wrong. |
Co-authored-by: Angus Hollands <goosey15@gmail.com>
for more information, see https://pre-commit.ci
@jpivarski in my view, this is good to go. I would appreciate your second glance to approve. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is great! Thanks, @lobis, for this nice work!
@all-contributors please add @lobis for code |
I've put up a pull request to add @lobis! 🎉 |
* fix: initialize empty `TObject` members on `to_TObjString` * add test for serialization of `TObjString` * remove unused dependency on test * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add `tojson` method to `TObjString` * add additional check to `TObjString` write test * fix bad field in `TList` tojson conversion * add inexpensive `assert` to `TList` serialization * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add `to_THashList` method * add aux method to create `THashList` from categorical axis * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update src/uproot/writing/identify.py Co-authored-by: Angus Hollands <goosey15@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix bad serialization of non-empty TList due to options (#763 (comment)) * add tests for TList serialization * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed bad `__repr__` for `TObject` * add serialization of `fUniqueID` to `TObject` * add `empty` method to `TObject` * remove redundant `TObject` member initialization * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add serialization for `THashList` * made `THashList` writable * add test to check serialization of `TList` vs `THashList` (which should be mostly equal) * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * initialize boost histogram axis as `IntCategory` if labels can be converted to integers * Update src/uproot/writing/identify.py Co-authored-by: Jim Pivarski <jpivarski@users.noreply.github.com> * moved `TList` serialization list to `serialize` method * add helper serialization method `bytestring` as suggested in #763 (comment) by @agoose77 * keep `TList` `_options` as python `bytes` and update serialization to use the new `bytestring` helper * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add bin index as `TObject.fUniqueID` when setting hist labels as done by root * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * reset `serialization.py` to `main` branch status * Revert "keep `TList` `_options` as python `bytes` and update serialization to use the new `bytestring` helper" This reverts commit 8e6ad2e. * Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks" This reverts commit 897972f. # Conflicts: # src/uproot/serialization.py * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Revert "reset `serialization.py` to `main` branch status" This reverts commit 9a500c1. * Revert "Revert "keep `TList` `_options` as python `bytes` and update serialization to use the new `bytestring` helper"" This reverts commit 884ec4d. * Apply suggestions from code review Co-authored-by: Jim Pivarski <jpivarski@users.noreply.github.com> * initialize `opt` with different list initialization for consistent style * add back two spaces before imports (this is not handled by the pre-commit formatter!) * Update src/uproot/models/TObject.py Co-authored-by: Angus Hollands <goosey15@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix bad label placement * axis `fName` will always be one of the default values (`xaxis`, `yaxis`, `zaxis`) * add test for categorical histogram * hist categorical test: check for correctly overwritten axis name * add missing assert to test * updated tests for new axis naming behaviour * remove unused dependency from test * do not used mutable default arguments * fix `to_hist` not using options from function arguments (was using default arguments all the time instead) * fix not checking for all categories * add test for TH1 -> hist/boost conversion * TH1 to boost/hist do not use weights (temporary solution) * build action: move checkout right before install * add debug step to pipeline * fix test not being updated to new functionality * reverted build CI to original state * updated test to check for working `to_hist` conversion * simplified test * add check for boost histogram conversion * check also for intcategory to resize values * add test for TH3 histogram * remove redundant test * add link to issue on test * add temporary fix to bad `to_boost` conversion for categorical histograms * add tmp fix * reduced repeated code for slicing values when categorical axis via helper function * implemented `to_boost` in `Histogram` parent class to reduce code duplication * remove now unecessary helper function `_slice_values_if_categorical_axis` * remove "temporary" `and False` * remove unused dependency * add missing asserts * add test for TH1 weights from root * add test for issue * alternative way to detect weighted hist * made `fSumw2` None if storage is not weights * remove comments from test * add test for hist with weights * fix bad check for storage type * remove comment * add test for hist with(out) weights and labels * updated TH1 `to_boost` to handle weights/labels better * placed histogram `to_boost` in parent `Histogram` class to reduce code duplication * updated `weighted` property * implemented histogram `weighted` property in parent `Histogram` class * using weighted property instead of copying check * do not use mutable default arguments * fix calling property * add missing asserts to test * add test for issue #722 * add weight test for 2D and 3D histograms * add temporary skip to test until file is uploaded * update issue test file * add back temporary skip until file is available * Apply suggestions from code review Co-authored-by: Jim Pivarski <jpivarski@users.noreply.github.com> Co-authored-by: Angus Hollands <goosey15@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add suggestion from #774 (comment) * add check for length of `fSumw2` greater than 0 so empty histograms are not weighted * remove unnecessary subclass method implementation * addressed #764 (comment) Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Angus Hollands <goosey15@gmail.com> Co-authored-by: Jim Pivarski <jpivarski@users.noreply.github.com>
Closes #762.
This PR was motivated because of problems serializing
TList
.TList
serialization was fixed in 4da653f.A few other minor improvements were done along the way:
TObject
should now correctly serializefUniqueID
member (previously it was using only 0).TObject
__repr__
is fixed.Model_TObject.empty
has been added as override to avoid using its parent method in order to simplify code.TObjString
now correctly initializes itsbases
with aTObject
with all its members.tojson
calls of above mentined classes.TObjString
andTList
.