Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Miscellaneous Minor SpanGroups/DocBin Improvements #10250

Merged
merged 13 commits into from
Feb 21, 2022
Merged

Miscellaneous Minor SpanGroups/DocBin Improvements #10250

merged 13 commits into from
Feb 21, 2022

Conversation

pmbaumgartner
Copy link
Contributor

Right now this PR does two things:

  • Adds a test to verify that calling doc.copy and modifying a span group on the original document does not modify the span group on the copied document
  • Updates DocBin.get_docs to check each indexed span_group for an empty serialized list, rather than a truthy value by index, an empty span group will be an empty serialized list as a result of doc.spans.to_bytes()

Here's a dummy (not very rigorous) speed test with the improvement compared to the original (where every empty list bytes is deserialized):

Before/After
# dummytest.py
import spacy
from spacy.tokens import DocBin
from timeit import default_timer as timer


doc_bin = DocBin()
texts = ["A single sentence, wow!"] * 10000
nlp = spacy.blank("en")
for doc in nlp.pipe(texts):
    doc_bin.add(doc)

start = timer()
docs = list(doc_bin.get_docs(nlp.vocab))
end = timer()
print(f"{end - start:.3f}s")
# before
(spacy) peter@Peters-MacBook-Pro spaCy % python dummytest.py
7.341s

# after
(spacy) peter@Peters-MacBook-Pro spaCy % python dummytest.py
1.211s

Open Questions

I am not sure of the best way to create a test for the DocBin improvement. I attempted to mock a few different things, but could not quite figure out the right combination of methods to mock, so I'm submitting this as a draft until we have a nice way to check whether this is working as intended.

More broadly, I'm not sure if we want to change the implementation of span_groups to do something more efficient in general. If there are no span_groups at all on any docs in a DocBin, we still waste some time serializing an empty list for each doc

I'd also appreciate feedback on test_span_group_copy - I'm not sure this is the best way to test this.

Types of change

Test, improvement

Checklist

  • I confirm that I have the right to submit this contribution under the project's MIT license.
  • I ran the tests, and all new and existing tests passed.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@pmbaumgartner
Copy link
Contributor Author

pmbaumgartner commented Feb 10, 2022

Added a related improvement on SpanGroups.to_bytes, by adding an _is_empty method on SpanGroups, where to_bytes returns the serialized empty list from a constant value, rather than making a call to serialize an empty list every time. I wasn't sure where to put this constant so I threw it in util.py.

Some more non-rigorous timings (updated script).

Expand for timing script
import spacy
from spacy.tokens import DocBin
from timeit import default_timer as timer


doc_bin = DocBin()
texts = ["A single sentence, wow!"] * 100000
nlp = spacy.blank("en")
start = timer()
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
end = timer()
print(f"{end - start:.3f}s doc_bin.add")

start = timer()
docs = list(doc_bin.get_docs(nlp.vocab))
end = timer()
print(f"{end - start:.3f}s get_docs")

master branch

7.894s doc_bin.add
7.424s get_docs

prior commit 830de78

7.732s doc_bin.add
1.188s get_docs

this commit 8563c61

1.556s doc_bin.add
1.240s get_docs

I should also say the reason I think this is worthwhile is that I think most docs will not have SpanGroups, so this cost of serializing every time impacts a lot of users. My very quick search of the projects repo indicates that we have two example projects do use them (spancat and weak supervision), and I'm assuming the rest don't.

@pmbaumgartner
Copy link
Contributor Author

pmbaumgartner commented Feb 10, 2022

Also added a conditional to SpanGroups.from_bytes for empty span groups, which greatly speeds up the round-trip that SpanGroups.copy normally makes even when there are no spans saved on a doc. This time it impacts copying docs, which is called by things like _copy_examples in a training loop.

Expand for timing script
import spacy
from spacy.tokens import DocBin
from timeit import default_timer as timer


doc_bin = DocBin()
texts = ["A single sentence, wow!"] * 100000
nlp = spacy.blank("en")
start = timer()
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
end = timer()
print(f"{end - start:.3f}s doc_bin.add")

start = timer()
docs = list(doc_bin.get_docs(nlp.vocab))
end = timer()
print(f"{end - start:.3f}s get_docs")

# new
start = timer()
docs_copy = [doc.copy() for doc in docs]
end = timer()
print(f"{end - start:.3f}s doc.copy")

master branch

7.779s doc_bin.add
7.346s get_docs
13.353s doc.copy

from_bytes update commit 8ae6d11

1.578s doc_bin.add
1.212s get_docs
1.042s doc.copy

@pmbaumgartner pmbaumgartner marked this pull request as ready for review February 10, 2022 02:36
@svlandeg svlandeg added feat / doc Feature: Doc, Span and Token objects feat / serialize Feature: Serialization, saving and loading enhancement Feature requests and improvements labels Feb 10, 2022
@adrianeboyd
Copy link
Contributor

adrianeboyd commented Feb 10, 2022

Can you figure what the timing differences are for the msgpack deserialization vs. the span group deserialization for the empty cases? It seems odd that srsly.msgpack.dumps([]) would be so slow on its own, but maybe I've just never looked into the speed for large numbers of things.

I think we shouldn't need SpanGroups._is_empty because len(SpanGroups) already works fine as it inherits from UserDict. Or was there another issue here?

If we do add a global, we add them toward the top under the imports. For now, I don't think I'd put this in spacy.util, maybe just in spacy.tokens._dict_proxies.

@pmbaumgartner
Copy link
Contributor Author

I think we shouldn't need SpanGroups._is_empty because len(SpanGroups) already works fine as it inherits from UserDict. Or was there another issue here?

You're right, this is a better way to do this. I was just unfamiliar with UserDict and didn't realize it had implemented __len__.

If we do add a global, we add them toward the top under the imports. For now, I don't think I'd put this in spacy.util, maybe just in spacy.tokens._dict_proxies.

I've updated to do this, but I also use this in tokens/_serialize.py (here) and it seems a little odd to me to have that shared constant imported from _dict_proxies. But I can't say I have a good grasp of "where things should go" in the codebase.

Can you figure what the timing differences are for the msgpack deserialization vs. the span group deserialization for the empty cases? It seems odd that srsly.msgpack.dumps([]) would be so slow on its own, but maybe I've just never looked into the speed for large numbers of things.

I'm not sure my speed tests capture what you're asking, but I ran the following:

Expand for serialization speed tests
import srsly
from timeit import Timer

MSGPACK_SERIALIZED_EMPTY_LIST = srsly.msgpack_dumps([])


def return_empty_list():
    return []


def serialize_empty_list():
    return srsly.msgpack_dumps([])


def serialize_empty_list_known():
    return MSGPACK_SERIALIZED_EMPTY_LIST


def deserialize_empty_list():
    return srsly.msgpack_loads(MSGPACK_SERIALIZED_EMPTY_LIST)


def serialization_round_trip():
    return srsly.msgpack_loads(srsly.msgpack_dumps([]))


assert deserialize_empty_list() == return_empty_list()
assert serialize_empty_list() == serialize_empty_list_known()
assert serialization_round_trip() == return_empty_list()


print("return_empty_list()", min(Timer("return_empty_list()", globals=globals()).repeat()))
print("serialize_empty_list()", min(Timer("serialize_empty_list()", globals=globals()).repeat()))
print("serialize_empty_list_known()", min(Timer("serialize_empty_list_known()", globals=globals()).repeat()))
print("deserialize_empty_list()", min(Timer("deserialize_empty_list()", globals=globals()).repeat()))
print("serialization_round_trip()", min(Timer("serialization_round_trip()", globals=globals()).repeat()))
# in µs
return_empty_list() 0.04251429100000001
serialize_empty_list_known() 0.03924079199999997
serialize_empty_list() 2.1079921249999996
deserialize_empty_list() 1.743938
serialization_round_trip() 3.956638875000003

I suppose I'm not thinking of serialization as slow, I think msgpack is one of the faster serialization formats, but comparing it against the creation of an empty list it doesn't seem like a fair contest.

@adrianeboyd
Copy link
Contributor

I suggested _dict_proxies (which should be renamed at some point, v4 at the latest) because this detail is primarily related to SpanGroups and not DocBin, and next to SpanGroups seemed like a better place than SpanGroup.

This global remains a bit hacky for my taste. How about a private SpanGroups property for _serialized_empty_bytes or similar that you can check in DocBin?

@pmbaumgartner
Copy link
Contributor Author

How about a private SpanGroups property for _serialized_empty_bytes or similar that you can check in DocBin?

I like this idea organizationally, but I'm not clear how it would work. When we need to check this in DocBin (ref), at that point the span_groups property is just List[bytes] -- they're not yet SpanGroups until they're deserialized and attached to a doc, so how would we check the property prior to it being deserialized an instantiated? It's like we have a recursive serialization problem then.

@adrianeboyd
Copy link
Contributor

Then maybe as a SpanGroups class attribute?

@pmbaumgartner
Copy link
Contributor Author

Didn't even think of that! Thanks for the suggestion, I think that works.

@adrianeboyd
Copy link
Contributor

As a class attribute rather than a property, I'd go back to a global-looking name, maybe just _EMPTY_BYTES?

@pmbaumgartner
Copy link
Contributor Author

_EMPTY_BYTES makes me think b"", but I suppose a full description is probably too lengthy and the fact that it's a class attribute provides some additional context. My only other ideas are _EMPTY_SERIALIZATION or _SERIALIZED_EMPTY, if neither of those seem right to you then I'm happy with _EMPTY_BYTES.

@adrianeboyd
Copy link
Contributor

I had the same thoughts, but there would be no need to have b"" like this and it's private, so I think it's okay. I don't think SERIALIZED adds much info here.

@pmbaumgartner
Copy link
Contributor Author

On the bright side, the short name means this gets formatted on a single line.

Copy link
Member

@svlandeg svlandeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! I think having SpanGroups._EMPTY_BYTES makes sense, and it's nice to see the speedup :-)

spacy/tests/doc/test_span.py Outdated Show resolved Hide resolved
Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>
@svlandeg svlandeg merged commit 3358fb9 into explosion:master Feb 21, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Feature requests and improvements feat / doc Feature: Doc, Span and Token objects feat / serialize Feature: Serialization, saving and loading
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants