SpanFinder into spaCy from experimental #12507

Merged on Jun 7, 2023 (58 commits; the diff below shows changes from 56 commits)
638ac9f
span finder integrated into spacy from experimental
kadarakos Apr 6, 2023
21d5360
Merge branch 'master' of https://github.com/explosion/spaCy into add-…
kadarakos Apr 6, 2023
9cfcbdd
black
kadarakos Apr 6, 2023
8fc84fb
isort
kadarakos Apr 6, 2023
4d88616
black
kadarakos Apr 6, 2023
85dd4d4
default spankey constant
kadarakos Apr 13, 2023
d544c90
black
kadarakos Apr 13, 2023
58f1aa2
Update spacy/pipeline/spancat.py
kadarakos Apr 17, 2023
1796cf3
rename
kadarakos Apr 17, 2023
19bae09
rename
kadarakos Apr 17, 2023
3b41a98
Merge branch 'master' of https://github.com/explosion/spaCy into add-…
kadarakos Apr 26, 2023
4ef70c0
max_length and min_length as Optional[int] and strict checking
kadarakos May 3, 2023
9252e62
black
kadarakos May 3, 2023
02c1bc0
mypy fix for integer type infinity
kadarakos May 3, 2023
02d8d62
revert line order
kadarakos May 3, 2023
fe4c094
implement all comparison operators for inf int
kadarakos May 3, 2023
6b2e836
avoid two for loops over all docs by not precomputing
kadarakos May 3, 2023
db361db
interleave thresholding with span creation
kadarakos May 3, 2023
a5b9e63
black
kadarakos May 3, 2023
82f6a81
revert to not interleaving (realized it's faster)
kadarakos May 3, 2023
6fe7c66
black
kadarakos May 3, 2023
2c8408f
Update spacy/errors.py
kadarakos May 8, 2023
9d67936
update docstring
kadarakos May 8, 2023
d5da2df
enforce that the gold and predicted documents have the same text
kadarakos May 8, 2023
b7ce3ab
new error for ensuring reference and predicted texts are the same
kadarakos May 8, 2023
530d812
remove todo
kadarakos May 8, 2023
1036542
adjust test
kadarakos May 8, 2023
e5b2a8b
Merge branch 'add-span-finder' of https://github.com/kadarakos/spaCy …
kadarakos May 8, 2023
11a1797
black
kadarakos May 9, 2023
6e46ecf
handle misaligned tokenization
kadarakos May 31, 2023
f599bd5
return correct variable
kadarakos May 31, 2023
90af16a
failing overfit test
kadarakos May 31, 2023
6f750d0
only use a single spans_key like in spancat
kadarakos Jun 1, 2023
af80225
black
kadarakos Jun 1, 2023
2a1cb13
remove debug lines
kadarakos Jun 1, 2023
4c2f80c
typo
kadarakos Jun 1, 2023
09b5f61
remove comment
kadarakos Jun 1, 2023
fe964e7
remove near-duplicate redundant method
kadarakos Jun 1, 2023
8c7c34d
use the 'spans_key' variable name everywhere
kadarakos Jun 1, 2023
56de107
Update spacy/pipeline/span_finder.py
kadarakos Jun 1, 2023
658c4ae
flaky test fix suggestion, hand set bias terms
kadarakos Jun 1, 2023
37c4ad5
only test suggester and test result exhaustively
kadarakos Jun 2, 2023
752b306
make it clear that the span_finder_suggester is more general (not spe…
kadarakos Jun 2, 2023
a33c7e0
Update spacy/tests/pipeline/test_span_finder.py
kadarakos Jun 2, 2023
3abdca2
Apply suggestions from code review
adrianeboyd Jun 2, 2023
bd71b87
remove question comment
kadarakos Jun 2, 2023
3ec1cb5
Merge branch 'add-span-finder' of https://github.com/kadarakos/spaCy …
kadarakos Jun 2, 2023
9372b22
move preset_spans_suggester test to spancat tests
kadarakos Jun 2, 2023
f84b59d
Add docs and unify default configs for spancat and span finder
adrianeboyd Jun 2, 2023
bb62ee9
Fix offset bug in set_annotations
adrianeboyd Jun 2, 2023
024679c
Format
adrianeboyd Jun 2, 2023
9c403f1
Merge remote-tracking branch 'upstream/master' into add-span-finder
adrianeboyd Jun 2, 2023
ce4d33e
Add span_finder to quickstart template
adrianeboyd Jun 5, 2023
dac12fb
Move settings to self.cfg, store min/max unset as None
adrianeboyd Jun 5, 2023
d52d7d9
Remove debugging
adrianeboyd Jun 5, 2023
b2c56a0
Update docstrings and docs
adrianeboyd Jun 5, 2023
92a8ad8
Update spacy/pipeline/span_finder.py
adrianeboyd Jun 7, 2023
77e6962
Fix imports
adrianeboyd Jun 7, 2023
49 changes: 47 additions & 2 deletions spacy/cli/templates/quickstart_training.jinja
@@ -3,7 +3,7 @@ the docs and the init config command. It encodes various best practices and
can help generate the best possible configuration, given a user's requirements. #}
{%- set use_transformer = hardware != "cpu" and transformer_data -%}
{%- set transformer = transformer_data[optimize] if use_transformer else {} -%}
{%- set listener_components = ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker", "spancat", "spancat_singlelabel", "trainable_lemmatizer"] -%}
{%- set listener_components = ["tagger", "morphologizer", "parser", "ner", "textcat", "textcat_multilabel", "entity_linker", "span_finder", "spancat", "spancat_singlelabel", "trainable_lemmatizer"] -%}
[paths]
train = null
dev = null
@@ -28,7 +28,7 @@ lang = "{{ lang }}"
tok2vec/transformer. #}
{%- set with_accuracy_or_transformer = (use_transformer or with_accuracy) -%}
{%- set textcat_needs_features = has_textcat and with_accuracy_or_transformer -%}
{%- if ("tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "spancat" in components or "spancat_singlelabel" in components or "trainable_lemmatizer" in components or "entity_linker" in components or textcat_needs_features) -%}
{%- if ("tagger" in components or "morphologizer" in components or "parser" in components or "ner" in components or "span_finder" in components or "spancat" in components or "spancat_singlelabel" in components or "trainable_lemmatizer" in components or "entity_linker" in components or textcat_needs_features) -%}
{%- set full_pipeline = ["transformer" if use_transformer else "tok2vec"] + components -%}
{%- else -%}
{%- set full_pipeline = components -%}
@@ -127,6 +127,30 @@ grad_factor = 1.0
@layers = "reduce_mean.v1"
{% endif -%}

{% if "span_finder" in components -%}
[components.span_finder]
factory = "span_finder"
max_length = null
min_length = null
scorer = {"@scorers":"spacy.span_finder_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.span_finder.model]
@architectures = "spacy.SpanFinder.v1"

[components.span_finder.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = 2

[components.span_finder.model.tok2vec]
@architectures = "spacy-transformers.TransformerListener.v1"
grad_factor = 1.0

[components.span_finder.model.tok2vec.pooling]
@layers = "reduce_mean.v1"
{% endif -%}

{% if "spancat" in components -%}
[components.spancat]
factory = "spancat"
@@ -392,6 +416,27 @@ nO = null
width = ${components.tok2vec.model.encode.width}
{% endif %}

{% if "span_finder" in components %}
[components.span_finder]
factory = "span_finder"
max_length = null
min_length = null
scorer = {"@scorers":"spacy.span_finder_scorer.v1"}
spans_key = "sc"
threshold = 0.5

[components.span_finder.model]
@architectures = "spacy.SpanFinder.v1"

[components.span_finder.model.scorer]
@layers = "spacy.LinearLogistic.v1"
nO = 2

[components.span_finder.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
{% endif %}

{% if "spancat" in components %}
[components.spancat]
factory = "spancat"
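
The `threshold`, `min_length` and `max_length` settings added to the template above are easier to reason about with a small example. The following is a minimal pure-Python sketch, not the component's actual implementation: the function name `suggest_spans`, the `>=` comparison and the treatment of unset bounds are assumptions made for illustration. It assumes the model yields one start score and one end score per token and that a span is kept when both boundary scores pass the threshold and its length falls within the bounds.

```python
from typing import List, Optional, Tuple


def suggest_spans(
    start_scores: List[float],
    end_scores: List[float],
    threshold: float = 0.5,
    min_length: Optional[int] = None,
    max_length: Optional[int] = None,
) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets, end exclusive, for candidate spans.

    Hypothetical helper for illustration only.
    """
    # Assumption: an unset minimum means "at least one token" and an unset
    # maximum means "no upper limit" (here: the document length).
    lo = 1 if min_length is None else min_length
    hi = len(start_scores) if max_length is None else max_length
    # Tokens whose scores pass the threshold are boundary candidates.
    starts = [i for i, score in enumerate(start_scores) if score >= threshold]
    ends = [i for i, score in enumerate(end_scores) if score >= threshold]
    spans = []
    for start in starts:
        for end in ends:
            length = end + 1 - start
            # Skips ends that precede the start (length < 1) and spans
            # outside the configured length bounds.
            if lo <= length <= hi:
                spans.append((start, end + 1))
    return spans


start_scores = [0.9, 0.1, 0.7, 0.2]
end_scores = [0.2, 0.8, 0.1, 0.6]
print(suggest_spans(start_scores, end_scores))
# [(0, 2), (0, 4), (2, 4)]
print(suggest_spans(start_scores, end_scores, max_length=2))
# [(0, 2), (2, 4)]
```

Under these assumptions, lowering `threshold` or leaving `max_length` unset grows the candidate set quickly, since every passing start is paired with every passing end.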
4 changes: 4 additions & 0 deletions spacy/errors.py
@@ -973,6 +973,10 @@ class Errors(metaclass=ErrorsWithCodes):
    E1052 = ("Unable to copy spans: the character offsets for the span at "
             "index {i} in the span group do not align with the tokenization "
             "in the target doc.")
    E1053 = ("Both 'min_length' and 'max_length' should be larger than 0, but found"
             " 'min_length': {min_length}, 'max_length': {max_length}")
    E1054 = ("The text, including whitespace, must match between reference and "
             "predicted docs when training {component}.")


# Deprecated model shortcuts, only used in errors and warnings
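
For context on when the two new messages would fire, here is a rough sketch of the kind of checks they describe. The helper names `check_lengths` and `check_same_text` are hypothetical and the conditions are inferred from the message text alone; the checks in the PR may differ (for example, they may also compare `min_length` against `max_length`).

```python
from typing import Optional


def check_lengths(min_length: Optional[int], max_length: Optional[int]) -> None:
    """Hypothetical validator: reject non-positive length bounds (cf. E1053)."""
    for value in (min_length, max_length):
        # None means "unset" and is always accepted in this sketch.
        if value is not None and value < 1:
            raise ValueError(
                "Both 'min_length' and 'max_length' should be larger than 0, but found"
                f" 'min_length': {min_length}, 'max_length': {max_length}"
            )


def check_same_text(reference_text: str, predicted_text: str, component: str) -> None:
    """Hypothetical validator: texts must match exactly (cf. E1054)."""
    if reference_text != predicted_text:
        raise ValueError(
            "The text, including whitespace, must match between reference and "
            f"predicted docs when training {component}."
        )


check_lengths(None, 5)  # passes: unset minimum, positive maximum
check_same_text("a b", "a b", "span_finder")  # passes: identical text

try:
    check_lengths(0, 5)
except ValueError as err:
    print(err)

try:
    check_same_text("a b", "a  b", "span_finder")  # differs only in whitespace
except ValueError as err:
    print(err)
```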
1 change: 1 addition & 0 deletions spacy/ml/models/__init__.py
@@ -1,6 +1,7 @@
from .entity_linker import * # noqa
from .multi_task import * # noqa
from .parser import * # noqa
from .span_finder import * # noqa
from .spancat import * # noqa
from .tagger import * # noqa
from .textcat import * # noqa
42 changes: 42 additions & 0 deletions spacy/ml/models/span_finder.py
@@ -0,0 +1,42 @@
from typing import Callable, List, Tuple

from thinc.api import Model, chain, with_array
from thinc.types import Floats1d, Floats2d

from spacy.tokens import Doc

from ...util import registry

InT = List[Doc]
OutT = Floats2d


@registry.architectures("spacy.SpanFinder.v1")
def build_finder_model(
    tok2vec: Model[InT, List[Floats2d]], scorer: Model[OutT, OutT]
) -> Model[InT, OutT]:

    logistic_layer: Model[List[Floats2d], List[Floats2d]] = with_array(scorer)
    model: Model[InT, OutT] = chain(tok2vec, logistic_layer, flattener())
    model.set_ref("tok2vec", tok2vec)
    model.set_ref("scorer", scorer)
    model.set_ref("logistic_layer", logistic_layer)

    return model


def flattener() -> Model[List[Floats2d], Floats2d]:
    """Flattens the input to a 1-dimensional list of scores"""

    def forward(
        model: Model[Floats1d, Floats1d], X: List[Floats2d], is_train: bool
    ) -> Tuple[Floats2d, Callable[[Floats2d], List[Floats2d]]]:
        lens = model.ops.asarray1i([len(doc) for doc in X])
        Y = model.ops.flatten(X)

        def backprop(dY: Floats2d) -> List[Floats2d]:
            return model.ops.unflatten(dY, lens)

        return Y, backprop

    return Model("Flattener", forward=forward)
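
The `flattener` layer above concatenates per-document score arrays into one array and, in the backward pass, splits the gradient back by document length. A dependency-free sketch of that round trip may help when reviewing it. Plain lists stand in for arrays here, and `flatten_with_backprop` is a hypothetical name, not part of thinc or spaCy.

```python
from typing import Callable, List, Tuple

# One [start_score, end_score] row per token; one list of rows per document.
Rows = List[List[float]]


def flatten_with_backprop(
    X: List[Rows],
) -> Tuple[Rows, Callable[[Rows], List[Rows]]]:
    """Concatenate per-document rows and return a callback that undoes it."""
    # Remember how many tokens each document contributed.
    lens = [len(doc) for doc in X]
    Y = [row for doc in X for row in doc]

    def backprop(dY: Rows) -> List[Rows]:
        # Split the flat gradient back into per-document chunks.
        out, offset = [], 0
        for n in lens:
            out.append(dY[offset : offset + n])
            offset += n
        return out

    return Y, backprop


batch = [[[0.9, 0.1], [0.2, 0.8]], [[0.6, 0.4]]]
Y, backprop = flatten_with_backprop(batch)
print(Y)
# [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
print(backprop(Y) == batch)
# True
```

Note that the flattened result in this sketch keeps two columns per token, so it is a 2-dimensional table of scores rather than a flat list of numbers.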
14 changes: 8 additions & 6 deletions spacy/pipeline/__init__.py
@@ -2,21 +2,22 @@
from .dep_parser import DependencyParser
from .edit_tree_lemmatizer import EditTreeLemmatizer
from .entity_linker import EntityLinker
from .ner import EntityRecognizer
from .entityruler import EntityRuler
from .functions import merge_entities, merge_noun_chunks, merge_subtokens
from .lemmatizer import Lemmatizer
from .morphologizer import Morphologizer
from .ner import EntityRecognizer
from .pipe import Pipe
from .trainable_pipe import TrainablePipe
from .senter import SentenceRecognizer
from .sentencizer import Sentencizer
from .senter import SentenceRecognizer
from .span_finder import SpanFinder
from .span_ruler import SpanRuler
from .spancat import SpanCategorizer
from .tagger import Tagger
from .textcat import TextCategorizer
from .spancat import SpanCategorizer
from .span_ruler import SpanRuler
from .textcat_multilabel import MultiLabel_TextCategorizer
from .tok2vec import Tok2Vec
from .functions import merge_entities, merge_noun_chunks, merge_subtokens
from .trainable_pipe import TrainablePipe

__all__ = [
"AttributeRuler",
@@ -32,6 +33,7 @@
"Sentencizer",
"SpanCategorizer",
"SpanRuler",
"SpanFinder",
"Tagger",
"TextCategorizer",
"Tok2Vec",