-
-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Miscellaneous Minor SpanGroups/DocBin Improvements #10250
Conversation
Added a related improvement on Some more non-rigorous timings (updated script). Expand for timing script
master branch
prior commit 830de78
this commit 8563c61
I should also say the reason I think this is worthwhile is that I think most docs will not have |
Also added a conditional to Expand for timing script
master branch
from_bytes update commit 8ae6d11
|
Can you figure what the timing differences are for the msgpack deserialization vs. the span group deserialization for the empty cases? It seems odd that I think we shouldn't need If we do add a global, we add them toward the top under the imports. For now, I don't think I'd put this in |
You're right, this is a better way to do this. I was just unfamiliar with
I've updated to do this, but I also use this in
I'm not sure my speed tests capture what you're asking, but I ran the following: Expand for serialization speed tests
I suppose I'm not thinking of serialization as slow, I think msgpack is one of the faster serialization formats, but comparing it against the creation of an empty list it doesn't seem like a fair contest. |
I suggested This global remains a bit hacky for my taste. How about a private |
I like this idea organizationally, but I'm not clear how it would work. When we need to check this in |
Then maybe as a |
Didn't even think of that! Thanks for the suggestion, I think that works. |
As a class attribute rather than a property, I'd go back to a global-looking name, maybe just |
|
I had the same thoughts, but there would be no need to have |
On the bright side, the short name means this gets formatted on a single line. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good! I think having SpanGroups._EMPTY_BYTES
makes sense, and it's nice to see the speedup :-)
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
Right now this PR does two things:
doc.copy
and modifying a span group on the original document does not modify the span group on the copied documentDocBin.get_docs
to check each indexedspan_group
for an empty serialized list, rather than a truthy value by index, an empty span group will be an empty serialized list as a result ofdoc.spans.to_bytes()
Here's a dummy (not very rigorous) speed test with the improvement compared to the original (where every empty list bytes is deserialized):
Before/After
Open Questions
I am not sure of the best way to create a test for the DocBin improvement. I attempted to mock a few different things, but could not quite figure out the right combination of methods to mock, so I'm submitting this as a draft until we have a nice way to check whether this is working as intended.
More broadly, I'm not sure if we want to change the implementation of
span_groups
to do something more efficient in general. If there are nospan_groups
at all on any docs in aDocBin
, we still waste some time serializing an empty list for each docI'd also appreciate feedback on
test_span_group_copy
- I'm not sure this is the best way to test this.Types of change
Test, improvement
Checklist