Miscellaneous Minor SpanGroups/DocBin Improvements #10250

pmbaumgartner · 2022-02-09T21:27:05Z

Right now this PR does two things:

Adds a test to verify that calling doc.copy and modifying a span group on the original document does not modify the span group on the copied document
Updates DocBin.get_docs to check each indexed span_group for an empty serialized list, rather than a truthy value by index, an empty span group will be an empty serialized list as a result of doc.spans.to_bytes()

Here's a dummy (not very rigorous) speed test with the improvement compared to the original (where every empty list bytes is deserialized):

Before/After

# dummytest.py
import spacy
from spacy.tokens import DocBin
from timeit import default_timer as timer


doc_bin = DocBin()
texts = ["A single sentence, wow!"] * 10000
nlp = spacy.blank("en")
for doc in nlp.pipe(texts):
    doc_bin.add(doc)

start = timer()
docs = list(doc_bin.get_docs(nlp.vocab))
end = timer()
print(f"{end - start:.3f}s")

# before
(spacy) peter@Peters-MacBook-Pro spaCy % python dummytest.py
7.341s

# after
(spacy) peter@Peters-MacBook-Pro spaCy % python dummytest.py
1.211s

Open Questions

I am not sure of the best way to create a test for the DocBin improvement. I attempted to mock a few different things, but could not quite figure out the right combination of methods to mock, so I'm submitting this as a draft until we have a nice way to check whether this is working as intended.

More broadly, I'm not sure if we want to change the implementation of span_groups to do something more efficient in general. If there are no span_groups at all on any docs in a DocBin, we still waste some time serializing an empty list for each doc

I'd also appreciate feedback on test_span_group_copy - I'm not sure this is the best way to test this.

Types of change

Test, improvement

Checklist

I confirm that I have the right to submit this contribution under the project's MIT license.
I ran the tests, and all new and existing tests passed.
My changes don't require a change to the documentation, or if they do, I've added all required information.

pmbaumgartner · 2022-02-10T01:31:05Z

Added a related improvement on SpanGroups.to_bytes, by adding an _is_empty method on SpanGroups, where to_bytes returns the serialized empty list from a constant value, rather than making a call to serialize an empty list every time. I wasn't sure where to put this constant so I threw it in util.py.

Some more non-rigorous timings (updated script).

Expand for timing script

import spacy
from spacy.tokens import DocBin
from timeit import default_timer as timer


doc_bin = DocBin()
texts = ["A single sentence, wow!"] * 100000
nlp = spacy.blank("en")
start = timer()
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
end = timer()
print(f"{end - start:.3f}s doc_bin.add")

start = timer()
docs = list(doc_bin.get_docs(nlp.vocab))
end = timer()
print(f"{end - start:.3f}s get_docs")

master branch

7.894s doc_bin.add
7.424s get_docs

prior commit 830de78

7.732s doc_bin.add
1.188s get_docs

this commit 8563c61

1.556s doc_bin.add
1.240s get_docs

I should also say the reason I think this is worthwhile is that I think most docs will not have SpanGroups, so this cost of serializing every time impacts a lot of users. My very quick search of the projects repo indicates that we have two example projects do use them (spancat and weak supervision), and I'm assuming the rest don't.

pmbaumgartner · 2022-02-10T02:25:16Z

Also added a conditional to SpanGroups.from_bytes for empty span groups, which greatly speeds up the round-trip that SpanGroups.copy normally makes even when there are no spans saved on a doc. This time it impacts copying docs, which is called by things like _copy_examples in a training loop.

Expand for timing script

import spacy
from spacy.tokens import DocBin
from timeit import default_timer as timer


doc_bin = DocBin()
texts = ["A single sentence, wow!"] * 100000
nlp = spacy.blank("en")
start = timer()
for doc in nlp.pipe(texts):
    doc_bin.add(doc)
end = timer()
print(f"{end - start:.3f}s doc_bin.add")

start = timer()
docs = list(doc_bin.get_docs(nlp.vocab))
end = timer()
print(f"{end - start:.3f}s get_docs")

# new
start = timer()
docs_copy = [doc.copy() for doc in docs]
end = timer()
print(f"{end - start:.3f}s doc.copy")

master branch

7.779s doc_bin.add
7.346s get_docs
13.353s doc.copy

from_bytes update commit 8ae6d11

1.578s doc_bin.add
1.212s get_docs
1.042s doc.copy

adrianeboyd · 2022-02-10T10:00:44Z

Can you figure what the timing differences are for the msgpack deserialization vs. the span group deserialization for the empty cases? It seems odd that srsly.msgpack.dumps([]) would be so slow on its own, but maybe I've just never looked into the speed for large numbers of things.

I think we shouldn't need SpanGroups._is_empty because len(SpanGroups) already works fine as it inherits from UserDict. Or was there another issue here?

If we do add a global, we add them toward the top under the imports. For now, I don't think I'd put this in spacy.util, maybe just in spacy.tokens._dict_proxies.

pmbaumgartner · 2022-02-10T20:49:29Z

I think we shouldn't need SpanGroups._is_empty because len(SpanGroups) already works fine as it inherits from UserDict. Or was there another issue here?

You're right, this is a better way to do this. I was just unfamiliar with UserDict and didn't realize it had implemented __len__.

If we do add a global, we add them toward the top under the imports. For now, I don't think I'd put this in spacy.util, maybe just in spacy.tokens._dict_proxies.

I've updated to do this, but I also use this in tokens/_serialize.py (here) and it seems a little odd to me to have that shared constant imported from _dict_proxies. But I can't say I have a good grasp of "where things should go" in the codebase.

Can you figure what the timing differences are for the msgpack deserialization vs. the span group deserialization for the empty cases? It seems odd that srsly.msgpack.dumps([]) would be so slow on its own, but maybe I've just never looked into the speed for large numbers of things.

I'm not sure my speed tests capture what you're asking, but I ran the following:

Expand for serialization speed tests

import srsly
from timeit import Timer

MSGPACK_SERIALIZED_EMPTY_LIST = srsly.msgpack_dumps([])


def return_empty_list():
    return []


def serialize_empty_list():
    return srsly.msgpack_dumps([])


def serialize_empty_list_known():
    return MSGPACK_SERIALIZED_EMPTY_LIST


def deserialize_empty_list():
    return srsly.msgpack_loads(MSGPACK_SERIALIZED_EMPTY_LIST)


def serialization_round_trip():
    return srsly.msgpack_loads(srsly.msgpack_dumps([]))


assert deserialize_empty_list() == return_empty_list()
assert serialize_empty_list() == serialize_empty_list_known()
assert serialization_round_trip() == return_empty_list()


print("return_empty_list()", min(Timer("return_empty_list()", globals=globals()).repeat()))
print("serialize_empty_list()", min(Timer("serialize_empty_list()", globals=globals()).repeat()))
print("serialize_empty_list_known()", min(Timer("serialize_empty_list_known()", globals=globals()).repeat()))
print("deserialize_empty_list()", min(Timer("deserialize_empty_list()", globals=globals()).repeat()))
print("serialization_round_trip()", min(Timer("serialization_round_trip()", globals=globals()).repeat()))

# in µs
return_empty_list() 0.04251429100000001
serialize_empty_list_known() 0.03924079199999997
serialize_empty_list() 2.1079921249999996
deserialize_empty_list() 1.743938
serialization_round_trip() 3.956638875000003

I suppose I'm not thinking of serialization as slow, I think msgpack is one of the faster serialization formats, but comparing it against the creation of an empty list it doesn't seem like a fair contest.

adrianeboyd · 2022-02-11T07:55:30Z

I suggested _dict_proxies (which should be renamed at some point, v4 at the latest) because this detail is primarily related to SpanGroups and not DocBin, and next to SpanGroups seemed like a better place than SpanGroup.

This global remains a bit hacky for my taste. How about a private SpanGroups property for _serialized_empty_bytes or similar that you can check in DocBin?

pmbaumgartner · 2022-02-11T11:42:37Z

How about a private SpanGroups property for _serialized_empty_bytes or similar that you can check in DocBin?

I like this idea organizationally, but I'm not clear how it would work. When we need to check this in DocBin (ref), at that point the span_groups property is just List[bytes] -- they're not yet SpanGroups until they're deserialized and attached to a doc, so how would we check the property prior to it being deserialized an instantiated? It's like we have a recursive serialization problem then.

adrianeboyd · 2022-02-11T11:57:09Z

Then maybe as a SpanGroups class attribute?

pmbaumgartner · 2022-02-11T12:05:16Z

Didn't even think of that! Thanks for the suggestion, I think that works.

adrianeboyd · 2022-02-14T07:22:08Z

As a class attribute rather than a property, I'd go back to a global-looking name, maybe just _EMPTY_BYTES?

pmbaumgartner · 2022-02-14T13:39:17Z

_EMPTY_BYTES makes me think b"", but I suppose a full description is probably too lengthy and the fact that it's a class attribute provides some additional context. My only other ideas are _EMPTY_SERIALIZATION or _SERIALIZED_EMPTY, if neither of those seem right to you then I'm happy with _EMPTY_BYTES.

adrianeboyd · 2022-02-14T13:46:48Z

I had the same thoughts, but there would be no need to have b"" like this and it's private, so I think it's okay. I don't think SERIALIZED adds much info here.

pmbaumgartner · 2022-02-14T14:26:21Z

On the bright side, the short name means this gets formatted on a single line.

svlandeg

Looks good! I think having SpanGroups._EMPTY_BYTES makes sense, and it's nice to see the speedup :-)

spacy/tests/doc/test_span.py

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

pmbaumgartner added 7 commits December 21, 2021 08:13

MultiHashEmbed vector docs correction

2cba312

Merge branch 'explosion:master' into master

5bdca26

doc copy span test

0b90713

ignore empty lists in DocBin.span_groups

6e26a9e

Merge branch 'explosion:master' into master

70bd990

Merge branch 'master' into spangroup-copy

830de78

serialized empty list const + SpanGroups.is_empty

8563c61

add conditional deserial on from_bytes

8ae6d11

pmbaumgartner marked this pull request as ready for review February 10, 2022 02:36

svlandeg added feat / doc Feature: Doc, Span and Token objects feat / serialize Feature: Serialization, saving and loading enhancement Feature requests and improvements labels Feb 10, 2022

pmbaumgartner added 2 commits February 10, 2022 15:49

clean up + reorganize

4bbec95

rm test

f78da8a

add constant as class attribute

3fa9c5e

rename to _EMPTY_BYTES

7e297c7

svlandeg approved these changes Feb 17, 2022

View reviewed changes

spacy/tests/doc/test_span.py Outdated Show resolved Hide resolved

Update spacy/tests/doc/test_span.py

ed8cadc

Co-authored-by: Sofie Van Landeghem <svlandeg@users.noreply.github.com>

svlandeg merged commit 3358fb9 into explosion:master Feb 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Miscellaneous Minor SpanGroups/DocBin Improvements #10250

Miscellaneous Minor SpanGroups/DocBin Improvements #10250

pmbaumgartner commented Feb 9, 2022

pmbaumgartner commented Feb 10, 2022 •

edited

Loading

pmbaumgartner commented Feb 10, 2022 •

edited

Loading

adrianeboyd commented Feb 10, 2022 •

edited

Loading

pmbaumgartner commented Feb 10, 2022

adrianeboyd commented Feb 11, 2022

pmbaumgartner commented Feb 11, 2022

adrianeboyd commented Feb 11, 2022

pmbaumgartner commented Feb 11, 2022

adrianeboyd commented Feb 14, 2022

pmbaumgartner commented Feb 14, 2022

adrianeboyd commented Feb 14, 2022

pmbaumgartner commented Feb 14, 2022

svlandeg left a comment

Miscellaneous Minor SpanGroups/DocBin Improvements #10250

Miscellaneous Minor SpanGroups/DocBin Improvements #10250

Conversation

pmbaumgartner commented Feb 9, 2022

Types of change

Checklist

pmbaumgartner commented Feb 10, 2022 • edited Loading

pmbaumgartner commented Feb 10, 2022 • edited Loading

adrianeboyd commented Feb 10, 2022 • edited Loading

pmbaumgartner commented Feb 10, 2022

adrianeboyd commented Feb 11, 2022

pmbaumgartner commented Feb 11, 2022

adrianeboyd commented Feb 11, 2022

pmbaumgartner commented Feb 11, 2022

adrianeboyd commented Feb 14, 2022

pmbaumgartner commented Feb 14, 2022

adrianeboyd commented Feb 14, 2022

pmbaumgartner commented Feb 14, 2022

svlandeg left a comment

Choose a reason for hiding this comment

pmbaumgartner commented Feb 10, 2022 •

edited

Loading

pmbaumgartner commented Feb 10, 2022 •

edited

Loading

adrianeboyd commented Feb 10, 2022 •

edited

Loading