Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: working TList serialization #763

Merged
merged 34 commits into from
Oct 28, 2022
Merged

Conversation

lobis
Copy link
Collaborator

@lobis lobis commented Oct 24, 2022

Closes #762.

This PR was motivated because of problems serializing TList.

  • TList serialization was fixed in 4da653f.

A few other minor improvements were done along the way:

  • TObject should now correctly serialize fUniqueID member (previously it was using only 0).
  • TObject __repr__ is fixed.
  • A method Model_TObject.empty has been added as override to avoid using its parent method in order to simplify code.
  • TObjString now correctly initializes its bases with a TObject with all its members.
  • Updated some tojson calls of above mentined classes.
  • Added / Updated tests for writing and reading TObjString and TList.

@jpivarski
Copy link
Member

Since I saw your message in Slack, I started looking into this, but it seems that you're ahead of me: yes, the TObject will need its parameters. (Doesn't the TObject constructor make them, though?)

TObjString is heavily tested in the testing suite, but not TList. I'm a little surprised that TList is even implemented, but I suppose it might be because it was needed to implement something else, such as histograms. (Writing TObjString, descendants of TH1, and simple and jagged TTrees have been implemented; nothing else has been attempted.)

It's very likely that TList's _serialize method was only written for empty TLists. It's very likely that some histogram class can hold a list of objects that we don't attempt to write (such as fit functions) and the TList serialization was only written to allow empty TLists to be included in histograms. So when you tried to write a non-empty TList, you were in uncharted territory.

Even though you've found something that works, you should remain suspicious of it because it might be working by accident. You're filling the fUniqueID and fBits of the TObject with zeros; the reader trying to deserialize a different part of the structure when it gets to those zeros, and interprets the zeros as that other thing, like size of the list, length of the TList fName, or something. Be sure to use much more than one test; lists of various lengths, names, etc.

What I was looking at before seeing this PR is the fact that each item in a TList has a fOption. The writer does a Python zip through items and self._options:

for datum, option in zip(self._data, self._options):
uproot.serialization._serialize_object_any(out, datum, None)
out.append(option)

Python zip stops when it reaches the end of either of the lists it is given, so filling self._data and not self._options will not get it to write the correct number of items. If that number of items disagrees with the declared self._members["fSize"], the reader will likely crash.

In your testing, you can test read-back first with Uproot, then (when that works) with ROOT. Uproot is generally more forgiving of wrong formats, since it reads as little as it needs to and doesn't do all of the consistency checks that ROOT does. Also, it's easier to use Uproot in a debugger or insert print-outs (my favorite method of debugging, since Python loads quickly and doesn't have a long compilation phase), so you can more easily examine which bytes the reader is seeing when it's expecting to be building a given structure. Once it works in Uproot, you still have to test it in ROOT, of course, because it might still be wrong.

I hope this is useful! I'll keep tabs on your progress, and go ahead and ask questions on #awkward-uproot in Slack as you go. Good luck!

@agoose77
Copy link
Collaborator

agoose77 commented Oct 24, 2022

@jpivarski this might be a good place to discuss; should a bare TString be round-trippable in Awkward? I noticed that it didn't deserialise when taking a quick look at this issue with TObjString:

import uproot
print(f"{uproot.__version__=}")

filename = "list.root"
key = "listOfStrings"

with uproot.recreate(filename) as f:
    f[key] = uproot.writing.identify.to_TString("hello world!")

with uproot.open(filename) as f:
    print(f"keys: {f.keys()}")
    print(f"key: {f[key]}")
    ```

@lobis
Copy link
Collaborator Author

lobis commented Oct 24, 2022

I also noticed this interesting behaviour that may explain why this has slipped under the radar:

import uproot

filename = "list.root"
key = "listOfStrings"

s = uproot.writing.identify.to_TObjString("test")
with uproot.recreate(filename) as f:
    try:
        print(s.tojson())  # this does not work
    except AttributeError as e:
        print(e)
    f[key] = s
    print(f[key].tojson())  # this works

@jpivarski
Copy link
Member

should a bare TString be round-trippable in Awkward?

I don't think that bare TString objects can be put in TDirectories (even in ROOT, that is). I think that's why TObjString exists: it inherits from TObject so that it has the fProcess, fBits infrastructure that objects-in-TDirectories need. (They might also need fName and fTitle, since this information is also in the TKeys of a TDirectory.) But perhaps Uproot could use some code to prohibit non-TDirectory-ready objects from being written to TDirectories.

I also noticed this interesting behaviour that may explain why this has slipped under the radar:

The tojson method is not actually used in serialization; JSON serialization is an alternative to binary serialization, defined by ROOT, and so I used it in the early part of the project to verify that I was at least identifying the right data to serialize, before attempting the binary format. Since those early tests (test_0010-start-streamers.py and test_0011-generate-classes-from-streamers.py), the tojson method hasn't been used by anything and it hasn't been kept up-to-date with recent developments, such as serialization of additional data types. I've been on the fence about removing it completely: as a public interface, it should either work or not exist, and it would be easier to back off on it than to go forward with it. The reason I haven't done so is because it's a good diagnostic when someone raises a serialization issue.

@agoose77
Copy link
Collaborator

I don't think that bare TString objects can be put in TDirectories (even in ROOT, that is).

Oh, that sounds familiar, actually.

But perhaps Uproot could use some code to prohibit non-TDirectory-ready objects from being written to TDirectories.

Good to see a silly mistake motivating a new feature ;)

@lobis
Copy link
Collaborator Author

lobis commented Oct 24, 2022

After some work I became pretty stuck.

  • 5c90b48 fixed (allegedly) serialization of TObjString inside the TList.

  • The serialization of TList is still wrong and now I am pretty sure the problem is either in uproot.writing.identify.to_TList or in TList::_serialize.

  • The deserealization (reading) of TList appears to work fine. An empty list is serialized / deserialized correctly too.

  • I tried to use the simplest TList I could think of (containing a single TObject) to test, and see that the problem is not related to TObjString.

A simple root script to generate this basic list:

auto f = TFile::Open("list.root", "RECREATE");

auto l = new TList();

l->Add(new TObject()); // I also tried setting a non-zero unique ID for the TObject, deserialization still works

l->Write("myList", TObject::kSingleKey); // if TObject::kSingleKey is not specified, each item on the list is written as a key to the root file

f->Close();
  • This is read correctly by uproot.

  • Attempt to recreate this file with uproot:

import uproot

tobject = uproot.models.TObject.Model_TObject.empty()
tobject._members["@fUniqueID"] = 0
tobject._members["@fBits"] = 0

mylist = uproot.writing.identify.to_TList([tobject])

filename = "list.root"
key = "mylist"

with uproot.recreate(filename) as f:
    f[key] = mylist

which will produce an error like:

    raise DeserializationError(
uproot.deserialization.DeserializationError: while reading
    TList version 5 as uproot.models.TList.Model_TList (31 bytes)
        (base): <TObject None None at 0x0122b05f10c0>
        fName: ''
        fSize: 1
Base classes for TList: TObject?
Members for TList: fName?, fSize?
expected 31 bytes but cursor moved by 26 bytes (through TList)
in file list.root
in object /mylist;1

Python zip stops when it reaches the end of either of the lists it is given, so filling self._data and not self._options will not get it to write the correct number of items. If that number of items disagrees with the declared self._members["fSize"], the reader will likely crash.

I added a check for this but it's not the problem in this case.

I am not sure how to proceed, unless I am missing something the problem seems a bit too deep for me yet. Any tips are welcome @jpivarski @agoose77 ty.

@agoose77
Copy link
Collaborator

It might be helpful to write a TList[TObject()] from ROOT, and try to round-trip it through uproot. This also fails, but might help to narrow down where the serialisation is differing. (I assume the error lies in how we serialise the TList / TObject).

@agoose77
Copy link
Collaborator

agoose77 commented Oct 25, 2022

@lobis it looks like the problem is related to the reading and writing of the list item options in TList._serialize. When reading in the list options in TList.read_members, they are parsed as pascal (byte)strings - a series of bytes with the string length prepended. Meanwhile, the TList._serialize writing routine directly appends the raw option bytestring to the output list. This leads to an off-by-one error.

Locally, I added a simple bytestring helper to uproot.serialization that converts "bytes" into a length-prefixed bytestring:

def string(data):
    """
    Converts a Python string into bytes, ready to be written to a file.

    If the string's byte representation (UTF-8) has fewer than 255 bytes, it
    is preceded by a 1-byte length; otherwise, it is preceded by ``b'\xff'`` and a
    4-byte length.
    """
    return bytestring(
        data.encode(errors="surrogateescape")
    )


def bytestring(data):
    """
    Converts Python bytes into a length-prefixed bytestring, ready to be written to a file.

    If the string's byte representation (UTF-8) has fewer than 255 bytes, it
    is preceded by a 1-byte length; otherwise, it is preceded by ``b'\xff'`` and a
    4-byte length.
    """
    length = len(data)
    if length < 255:
        return struct.pack(">B%ds" % length, length, data)
    else:
        return struct.pack(">BI%ds" % length, 255, length, data)

Then, changing TList._serialize to use this helper

        for datum, option in zip(self._data, self._options):
            uproot.serialization._serialize_object_any(out, datum, None)
            out.append(uproot.serialization.bytestring(option)) # Fixed line

fixes the reading.

I think this seems fairly reasonable; if we're able to read a ROOT file correctly, then our reading logic should be correct. This implies that we want to read a length-prefixed bytestring here. And, that seems intuitive; it's the simplest means of encoding a string.

@lobis
Copy link
Collaborator Author

lobis commented Oct 25, 2022

fixes the reading.

Thanks @agoose77! this fixed the issue. I implemented your suggestion on the uproot.writing.identity.to_TList method instead of in the TList model itself because it seemed more appropiate (4da653f).

@lobis lobis marked this pull request as ready for review October 25, 2022 14:15
@agoose77
Copy link
Collaborator

agoose77 commented Oct 26, 2022

The CI here is failing because we don't upper-bound our Awkward version, and it's picking up the V2 RC candidate. This surprised me, until I realised that this happens when the dependency specifier already includes a pre-release, e.g. 1.9.0rc1. I recommend we wait until #765 merges, and then rebase this.

lobis and others added 4 commits October 26, 2022 10:39
This reverts commit 897972f.

# Conflicts:
#	src/uproot/serialization.py
…serialization to use the new `bytestring` helper""

This reverts commit 884ec4d.
@lobis
Copy link
Collaborator Author

lobis commented Oct 26, 2022

The CI here is failing because we don't upper-bound our Awkward version, and it's picking up the V2 RC candidate. This surprised me, until I realised that this happens when the dependency specifier already includes a pre-release, e.g. 1.9.0rc1. I recommend we wait until #765 merges, and then rebase this.

Thanks, I was wondering what was wrong.

lobis and others added 2 commits October 27, 2022 09:38
Co-authored-by: Angus Hollands <goosey15@gmail.com>
@lobis lobis requested a review from agoose77 October 27, 2022 09:40
@agoose77
Copy link
Collaborator

@jpivarski in my view, this is good to go. I would appreciate your second glance to approve.

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! Thanks, @lobis, for this nice work!

@jpivarski jpivarski merged commit cb8c776 into scikit-hep:main Oct 28, 2022
@jpivarski
Copy link
Member

@all-contributors please add @lobis for code

@allcontributors
Copy link
Contributor

@jpivarski

I've put up a pull request to add @lobis! 🎉

@lobis lobis deleted the to_TObjString-fix branch October 29, 2022 00:00
jpivarski added a commit that referenced this pull request Nov 2, 2022
* fix: initialize empty `TObject` members on `to_TObjString`

* add test for serialization of `TObjString`

* remove unused dependency on test

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add `tojson` method to `TObjString`

* add additional check to `TObjString` write test

* fix bad field in `TList` tojson conversion

* add inexpensive `assert` to `TList` serialization

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add `to_THashList` method

* add aux method to create `THashList` from categorical axis

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update src/uproot/writing/identify.py

Co-authored-by: Angus Hollands <goosey15@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bad serialization of non-empty TList due to options (#763 (comment))

* add tests for TList serialization

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fixed bad `__repr__` for `TObject`

* add serialization of `fUniqueID` to `TObject`

* add `empty` method to `TObject`

* remove redundant `TObject` member initialization

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add serialization for `THashList`

* made `THashList` writable

* add test to check serialization of `TList` vs `THashList` (which should be mostly equal)

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* initialize boost histogram axis as `IntCategory` if labels can be converted to integers

* Update src/uproot/writing/identify.py

Co-authored-by: Jim Pivarski <jpivarski@users.noreply.github.com>

* moved `TList` serialization list to `serialize` method

* add helper serialization method `bytestring` as suggested in #763 (comment) by @agoose77

* keep `TList` `_options` as python `bytes` and update serialization to use the new `bytestring` helper

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add bin index as `TObject.fUniqueID` when setting hist labels as done by root

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* reset `serialization.py` to `main` branch status

* Revert "keep `TList` `_options` as python `bytes` and update serialization to use the new `bytestring` helper"

This reverts commit 8e6ad2e.

* Revert "[pre-commit.ci] auto fixes from pre-commit.com hooks"

This reverts commit 897972f.

# Conflicts:
#	src/uproot/serialization.py

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Revert "reset `serialization.py` to `main` branch status"

This reverts commit 9a500c1.

* Revert "Revert "keep `TList` `_options` as python `bytes` and update serialization to use the new `bytestring` helper""

This reverts commit 884ec4d.

* Apply suggestions from code review

Co-authored-by: Jim Pivarski <jpivarski@users.noreply.github.com>

* initialize `opt` with different list initialization for consistent style

* add back two spaces before imports (this is not handled by the pre-commit formatter!)

* Update src/uproot/models/TObject.py

Co-authored-by: Angus Hollands <goosey15@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix bad label placement

* axis `fName` will always be one of the default values (`xaxis`, `yaxis`, `zaxis`)

* add test for categorical histogram

* hist categorical test: check for correctly overwritten axis name

* add missing assert to test

* updated tests for new axis naming behaviour

* remove unused dependency from test

* do not used mutable default arguments

* fix `to_hist` not using options from function arguments (was using default arguments all the time instead)

* fix not checking for all categories

* add test for TH1 -> hist/boost conversion

* TH1 to boost/hist do not use weights (temporary solution)

* build action: move checkout right before install

* add debug step to pipeline

* fix test not being updated to new functionality

* reverted build CI to original state

* updated test to check for working `to_hist` conversion

* simplified test

* add check for boost histogram conversion

* check also for intcategory to resize values

* add test for TH3 histogram

* remove redundant test

* add link to issue on test

* add temporary fix to bad `to_boost` conversion for categorical histograms

* add tmp fix

* reduced repeated code for slicing values when categorical axis via helper function

* implemented `to_boost` in `Histogram` parent class to reduce code duplication

* remove now unecessary helper function `_slice_values_if_categorical_axis`

* remove "temporary" `and False`

* remove unused dependency

* add missing asserts

* add test for TH1 weights from root

* add test for issue

* alternative way to detect weighted hist

* made `fSumw2` None if storage is not weights

* remove comments from test

* add test for hist with weights

* fix bad check for storage type

* remove comment

* add test for hist with(out) weights and labels

* updated TH1 `to_boost` to handle weights/labels better

* placed histogram `to_boost` in parent `Histogram` class to reduce code duplication

* updated `weighted` property

* implemented histogram `weighted` property in parent `Histogram` class

* using weighted property instead of copying check

* do not use mutable default arguments

* fix calling property

* add missing asserts to test

* add test for issue #722

* add weight test for 2D and 3D histograms

* add temporary skip to test until file is uploaded

* update issue test file

* add back temporary skip until file is available

* Apply suggestions from code review

Co-authored-by: Jim Pivarski <jpivarski@users.noreply.github.com>
Co-authored-by: Angus Hollands <goosey15@gmail.com>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add suggestion from #774 (comment)

* add check for length of `fSumw2` greater than 0 so empty histograms are not weighted

* remove unnecessary subclass method implementation

* addressed #764 (comment)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Angus Hollands <goosey15@gmail.com>
Co-authored-by: Jim Pivarski <jpivarski@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problem writing TList of TObjString
3 participants