Releases: explosion/spaCy
v2.0.16: Fix msgpack-numpy pin
🔴 Bug fixes
- Fix msgpack-numpy pin, which could affect serialization on Python 2.7.
v2.0.15: More wheels and GPU improvements
✨ New features and improvements
- Improve version compatibility to support wheels for all spaCy dependencies maintained by us: thinc, cymem, preshed and murmurhash.
- Support GPU installation by specifying spacy[cuda], spacy[cuda90], spacy[cuda91], spacy[cuda92] or spacy[cuda10], which will install cupy and thinc_gpu_ops.
- Add spacy.prefer_gpu() and spacy.require_gpu() functions.
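As a rough illustration of how the extras listed above line up with CUDA versions, the sketch below builds the matching pip requirement string. The helper name spacy_gpu_requirement and the CUDA_EXTRAS table are hypothetical and not part of spaCy; only the extras named in this release are covered.

```python
# Hypothetical helper (not part of spaCy): maps a CUDA version string to the
# pip extra named in these release notes.
CUDA_EXTRAS = {
    None: "cuda",      # unversioned extra
    "9.0": "cuda90",
    "9.1": "cuda91",
    "9.2": "cuda92",
    "10.0": "cuda10",
}


def spacy_gpu_requirement(cuda_version=None):
    """Return the pip requirement string for a given CUDA version."""
    try:
        extra = CUDA_EXTRAS[cuda_version]
    except KeyError:
        raise ValueError("No spaCy extra listed for CUDA %r" % (cuda_version,))
    return "spacy[%s]" % extra


print("pip install -U " + spacy_gpu_requirement("9.2"))
# After installing, spacy.prefer_gpu() or spacy.require_gpu() would typically
# be called before loading a model.
```

An unknown version raises a ValueError rather than silently falling back to the unversioned extra, so a mismatch is noticed before installation.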
📖 Documentation and examples
- Update GPU installation and usage docs.
v2.0.13: Wheels, alpha support for Telugu and Sinhala, rule-based lemmatization for French and Greek, plus various small fixes
✨ New features and improvements
- NEW: Pre-built wheels and up to 10 times faster installation! This release starts the journey towards pre-built wheels for all of spaCy's dependencies. Once that's completed, you won't even need a local compiler anymore to install the library. For more details on our wheels process, see explosion/wheelwright.
- NEW: Alpha support for Telugu and Sinhala.
- NEW: Rule-based lemmatization for Greek and French.
- Port over Chinese support (#1210) from v1.x.
- Improve language data for Persian, Greek, Swedish, Bengali, Polish, Portuguese, Indonesian, French, German and Russian.
- Add Span.ents property for consistency with Doc.ents.
- Add --verbose option to spacy train to output more details for debugging.
🔴 Bug fixes
- Fix issue #653: Introduce bulk merge function.
- Fix issue #1445, #1917, #2209, #2362, #2371, #2383, #2501, #2743, #2758: Fix Keras examples.
- Fix issue #2261, #2800: Fix bug that could cause a crash with too many entity types.
- Fix issue #2540: Improve French stop words.
- Fix issue #2582, #2640, #2645, #2657, #2705, #2784, #2815, #2841, #2845: Fix typos and inconsistencies in documentation.
- Fix issue #2593: Prevent numpy warning.
- Fix issue #2706: Add missing label FAC to spacy.explain glossary.
- Fix issue #2709: Pass default option when calling getoption() in conftest.py.
📖 Documentation and examples
- Improve Keras examples.
- Update training examples to use minibatching.
- Fix various typos and inconsistencies.
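A minimal sketch of what the minibatching mentioned above means: the training data is split into small batches and the model is updated once per batch. The minibatch helper below is a plain-Python stand-in written for illustration, and the training data is made up; the actual training examples remain the reference for real code.

```python
from itertools import islice


def minibatch(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            break
        yield batch


# Made-up training data: (text, annotations) pairs.
train_data = [("text %d" % i, {"entities": []}) for i in range(10)]

batch_sizes = [len(batch) for batch in minibatch(train_data, size=4)]
print(batch_sizes)  # [4, 4, 2]
```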
👥 Contributors
Thanks to @DimaBryuhanov, @kororo, @AndriyMulyar, @katarkor, @giannisdaras, @bphi, @vikaskyadav, @sammous, @EmilStenstrom, @howl-anderson, @ohenrik, @aashishg, @aryaprabhudesai, @steve-prod, @njsmith, @aniruddha-adhikary, @pzelasko, @mbkupfer, @sainathadapa, @tyburam, @grivaz, @filipecaixeta, @aongko, @free-variation, @mauryaland, @pmj642, @keshan, @darindf, @charlax, @phojnacki, @skrcode, @jacopofar, @Cinnamy and @JKhakpour for the pull requests and contributions!
v2.1.0a1: New models, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.
pip install -U spacy-nightly
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
✨ New features and improvements
Tagger, Parser & NER
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Fix bugs in beam-search training objective.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v6.11, which defaults to single-thread with fast OpenBLAS kernel. Parallelisation should be performed at the task level, e.g. by running more containers.
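To illustrate the kind of alignment mentioned above, here is a textbook dynamic-programming Levenshtein distance over two token sequences. It is a sketch of the general technique only, not spaCy's faster implementation, and the function name and example tokens are made up.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


gold = ["I", "like", "New", "York", "."]
predicted = ["I", "like", "New", "Yo", "rk", "."]
distance = levenshtein(predicted, gold)
print(distance)  # 2
```

Only two rows of the table are kept at a time, so memory grows with the length of one sequence rather than with the product of both.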
Models & Language Data
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
- NEW: Statistical models for Greek.
CLI
- NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via spacy download.
- Pass additional arguments of download command to pip to customise installation.
- Improve train command by letting GoldCorpus stream data, instead of loading into memory.
- Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
- Add support for multi-task objectives to train command.
- Add support for data-augmentation to train command.
Other
- NEW: Doc.retokenize context manager for merging tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- NEW: Built-in rule-based NER component to add entities based on match patterns (see #2513).
- Add warnings if .similarity method is called with empty vectors or without word vectors.
- Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
- Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
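The stop word change in the last item can be pictured with a small plain-Python sketch: the lookup lowercases the text before checking the stop list. The stop list and function below are made up for illustration and are not spaCy's data or implementation.

```python
# Made-up stop list for illustration only.
STOP_WORDS = {"the", "a", "and", "of"}


def is_stop(text, stop_words=STOP_WORDS):
    """Case-insensitive stop word check."""
    return text.lower() in stop_words


flags = [is_stop(word) for word in ["The", "THE", "the", "Theory"]]
print(flags)  # [True, True, True, False]
```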
🚧 Under construction
This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.
🔴 Bug fixes
- Fix issue #1487: Add Doc.retokenize() context manager.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
- Fix issue #1865: Correct licensing of it_core_news_sm model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add relcl dependency label to symbols.
- Fix issue #2014: Make Token.pos_ writeable.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix issue #2671, #2675: Fix incorrect match ID on some patterns.
- Fix serialization of custom tokenizer if not all functions are defined.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
- Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
📈 Benchmarks
Model | Language | Version | UAS | LAS | POS | NER F | Vec | Size |
---|---|---|---|---|---|---|---|---|
en_core_web_sm | English | 2.1.0a0 | 91.8 | 90.0 | 96.8 | 85.6 | 𐄂 | 28 MB |
en_core_web_md | English | 2.1.0a0 | 92.0 | 90.2 | 97.0 | 86.2 | ✓ | 107 MB |
en_core_web_lg | English | 2.1.0a0 | 92.1 | 90.3 | 97.0 | 86.2 | ✓ | 805 MB |
de_core_news_sm | German | 2.1.0a0 | 92.0 | 90.1 | 97.2 | 83.8 | 𐄂 | 26 MB |
de_core_news_md | German | 2.1.0a0 | 92.4 | 90.7 | 97.4 | 84.2 | ✓ | 228 MB |
es_core_news_sm | Spanish | 2.1.0a0 | 90.1 | 87.2 | 96.9 | 89.4 | 𐄂 | 28 MB |
es_core_news_md | Spanish | 2.1.0a0 | 90.7 | 88.0 | 97.2 | 89.5 | ✓ | 88 MB |
pt_core_news_sm | Portuguese | 2.1.0a0 | 89.4 | 86.3 | 80.1 | 82.7 | 𐄂 | 29 MB |
fr_core_news_sm | French | 2.1.0a0 | 88.8 | 85.7 | 94.4 | 67.3 [1] | 𐄂 | 32 MB |
fr_core_news_md | French | 2.1.0a0 | 88.7 | 86.0 | 95.0 | 70.4 [1] | ✓ | 100 MB |
it_core_news_sm | Italian | 2.1.0a0 | 90.7 | 87.1 | 96.1 | 81.3 | 𐄂 | 27 MB |
nl_core_news_sm | Dutch | 2.1.0a0 | 83.5 | 77.6 | 91.5 | 87.3 | 𐄂 | 27 MB |
el_core_news_sm | Greek | 2.1.0a0 | 84.5 | 81.0 | 95.0 | 73.5 | 𐄂 | 27 MB |
el_core_news_md | Greek | 2.1.0a0 | 87.7 | 84.7 | 96.3 | 80.2 | ✓ | 143 MB |
xx_ent_wiki_sm | Multi | 2.1.0a0 | - | - | - | 83.8 | 𐄂 | 9 MB |

[1] We're currently investigating this, as the results are anomalously low.
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
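As an illustration of the difference between the UAS and LAS columns, the sketch below scores made-up predictions against made-up gold annotations. A token counts towards UAS if its predicted head is correct, and towards LAS only if the dependency label is correct as well. The function name and data are hypothetical, and details of the official evaluation (such as punctuation handling) are left out.

```python
def attachment_scores(gold, predicted):
    """Return (UAS, LAS) as percentages.

    Both arguments are lists of (head_index, dependency_label) pairs,
    one pair per token.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")
    correct_heads = 0
    correct_heads_and_labels = 0
    for (gold_head, gold_label), (pred_head, pred_label) in zip(gold, predicted):
        if gold_head == pred_head:
            correct_heads += 1
            if gold_label == pred_label:
                correct_heads_and_labels += 1
    total = len(gold)
    return 100.0 * correct_heads / total, 100.0 * correct_heads_and_labels / total


# "She ate fresh pizza": head index and label per token (made-up example).
gold = [(1, "nsubj"), (1, "ROOT"), (3, "amod"), (1, "dobj")]
predicted = [(1, "nsubj"), (1, "ROOT"), (3, "compound"), (2, "dobj")]

uas, las = attachment_scores(gold, predicted)
print(uas, las)  # 75.0 50.0
```

Because a correct label only counts when the head is also correct, LAS can never exceed UAS, which matches the pattern in the table.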
📖 Documentation and examples
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @DuyguA, @giannisdaras, @mgogoulos and @louridas for the pull requests and contributions.
v2.0.12: Greek, Arabic, Urdu, Tatar, improved language data, better model downloads & various compatibility and bug fixes
We had to release another update to the v2.0.x branch of spaCy to resolve a dependency issue, so we decided to also include and/or backport a bunch of features and fixes that were originally intended for v2.1.0 (see here for the nightly version).
✨ New features and improvements
- NEW: Alpha tokenization and language data for Arabic, Urdu, Tatar and Greek.
- NEW: Mecab-based Japanese tokenization and lemmatization.
- NEW: Add Norwegian rule-based and lookup lemmatization.
- NEW: Add Danish lookup lemmatization based on the Den store danske SprogTeknologiske Ordbase, STO dataset, courtesy of The University of Copenhagen.
- NEW: Romanian lookup lemmatization.
- Improve language data for Polish, Turkish, French, Romanian, Swedish and Japanese.
- Improve case-sensitive lookup lemmatization in German.
- Add Token.sent property that returns the sentence Span the token is part of.
- Add remove_extension method on Doc, Token and Span.
- Add Doc.is_sentenced property that returns True if sentence boundaries have been applied.
- Allow ignoring warnings by code via the SPACY_WARNING_IGNORE environment variable.
- Add --silent option to info command.
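A sketch of how the SPACY_WARNING_IGNORE variable from the list above might be set from Python. The warning codes W006 and W007 are placeholders. Two things are assumed rather than taken from these notes: that the variable holds a comma-separated list of codes, and that it has to be set before spaCy is imported.

```python
import os

# Placeholder warning codes; substitute the codes you actually want to silence.
codes_to_ignore = ["W006", "W007"]

# Assumption: the variable holds a comma-separated list of warning codes and
# is read when spaCy is imported, so it is set before any `import spacy`.
os.environ["SPACY_WARNING_IGNORE"] = ",".join(codes_to_ignore)

print(os.environ["SPACY_WARNING_IGNORE"])  # W006,W007
```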
🔴 Bug fixes
- Fix issue #1456: Pass additional arguments of download command to pip and check if model is already installed before downloading it.
- Fix issue #2191: Update README section on tests and dependencies.
- Fix issue #2194: Ensure that Doc.noun_chunks_iterator isn't None before calling it.
- Fix issue #2196: Return data in cli.info and add silent option.
- Fix issue #2200: Correct typo in spacy package command message.
- Fix issue #2210: Fix bug in Spanish noun chunks.
- Fix issue #2211, #2320: Resolve problem in download command and use requests library again.
- Fix issue #2219: Fix token similarity of single-letter tokens.
- Fix issue #2222, #2223: Fix typos in documentation and docstrings.
- Fix issue #2226: Use correct, non-deprecated merge syntax in merge_ents.
- Fix issue #2228: Fix deserialization when using tensor=False or sentiment=False.
- Fix issue #2238: Correct Swedish lookup lemmatization.
- Fix issue #2242: Add remove_extension method on Doc, Token and Span.
- Fix issue #2266: Add collapse_phrases option to displaCy visualizer.
- Fix issue #2269: Fix KeyError by renaming SP to _SP.
- Fix issue #2304: Don't require attrs argument in Doc.retokenize and allow ints/unicode.
- Fix issue #2361: Escape HTML tags in displacy.render.
- Fix issue #2376: Improve Matcher examples and add section on using pipeline components.
- Fix issue #2385: Handle multi-word entities correctly in IOB to BILUO conversion.
- Fix issue #2452: Fix bug that would cause displacy arrows to only point in one direction.
- Fix issue #2477: Also allow Span objects in displacy.render.
- Fix issue #2490: Update Thinc's dependencies for Python 3.7 compatibility.
- Fix issue #2495: Fix loading tokenizer with custom prefix search.
- Fix issue #2514: Switch from msgpack-python to msgpack to hopefully prevent conda from downloading a two-year-old spaCy version when installing with the latest Anaconda distribution.
- Ensure that Doc.is_tagged is set correctly when using Language.pipe.
- Fix bug in merge_noun_chunks factory that would return None if Doc wasn't parsed.
- Explicitly require pathlib backport on Python 2 only.
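For readers unfamiliar with the tagging schemes mentioned in the fix for #2385, here is a small plain-Python sketch of an IOB to BILUO conversion that handles multi-word entities. It illustrates the schemes only and is not spaCy's converter; the function name and example tags are hypothetical.

```python
def iob_to_biluo(tags):
    """Convert IOB tags (e.g. B-ORG, I-ORG, O) to BILUO tags."""
    biluo = []
    for i, tag in enumerate(tags):
        if tag == "O":
            biluo.append("O")
            continue
        prefix, label = tag.split("-", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        prev_tag = tags[i - 1] if i > 0 else "O"
        # The entity continues only if the next tag is I- with the same label.
        continues = next_tag == "I-" + label
        # A B- tag always starts an entity; so does an I- tag that does not
        # follow a tag with the same label.
        starts = prefix == "B" or prev_tag not in ("B-" + label, "I-" + label)
        if starts and continues:
            biluo.append("B-" + label)
        elif starts:
            biluo.append("U-" + label)
        elif continues:
            biluo.append("I-" + label)
        else:
            biluo.append("L-" + label)
    return biluo


iob = ["O", "B-GPE", "I-GPE", "I-GPE", "O", "B-PERSON"]
converted = iob_to_biluo(iob)
print(converted)  # ['O', 'B-GPE', 'I-GPE', 'L-GPE', 'O', 'U-PERSON']
```

The three-token GPE span becomes B, I, L, while the single-token PERSON entity becomes U.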
📖 Documentation and examples
- NEW: Edit and execute code examples in your browser – all across the documentation!
- NEW: The spaCy Universe, a collection of plugins, extensions and other resources for spaCy.
- NEW: Experimental rule-based Matcher Explorer demo – create token patterns interactively, test them against your text and copy-paste the Python pattern code.
- NEW: Document Cython API.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @mollerhoj, @howl-anderson, @pktippa, @skrcode, @miroli, @ivyleavedtoadflax, @5hirish, @therealronnie, @alexvy86, @mn3mos, @polm, @knoxdw, @bellabie, @mauryaland, @LRAbbade, @janimo, @vishnumenon, @tzano, @cclauss, @armsp, @aristorinjuang, @BigstickCarpet, @idealley, @ansgar-t, @mpszumowski, @91ns, @msklvsk, @himkt, @DanielRuf, @nathanathan, @GolanLevy, @nipunsadvilkar, @cjhurst, @aliiae, @mirfan899, @ohenrik, @btrungchi, @kleinay, @DuyguA, @stefan-it, @Eleni170, @datascouting, @tjkemp, @x-ji, @giannisdaras, @kororo and @katarkor for the pull requests and contributions.
v2.1.0a0: New models, joint word segmentation and parsing, better Matcher, bug fixes & more
🌙 This is an alpha pre-release of spaCy v2.1.0 and available on pip as spacy-nightly. It's not intended for production use.
pip install -U spacy-nightly
If you want to test the new version, we recommend using a new virtual environment. Also make sure to download the new models – see below for details and benchmarks.
✨ New features and improvements
Tagger, Parser & NER
- NEW: Allow parser to do joint word segmentation and parsing. If you pass in data where the tokenizer over-segments, the parser now learns to merge the tokens.
- Make parser, tagger and NER faster, through better hyperparameters.
- Fix bugs in beam-search training objective.
- Remove document length limit during training, by implementing faster Levenshtein alignment.
- Use Thinc v6.11, which defaults to single-thread with fast OpenBLAS kernel. Parallelisation should be performed at the task level, e.g. by running more containers.
Models & Language Data
- NEW: Small accuracy improvements for parsing, tagging and NER for 6+ languages.
- NEW: The English and German models are now available under the MIT license.
CLI
- NEW: New ud-train command, to train and evaluate using the CoNLL 2017 shared task data.
- Check if model is already installed before downloading it via spacy download.
- Pass additional arguments of download command to pip to customise installation.
- Improve train command by letting GoldCorpus stream data, instead of loading into memory.
- Improve init-model command, including support for lexical attributes and word-vectors, using a variety of formats. This replaces the spacy vocab command, which is now deprecated.
Other
- NEW: Doc.retokenize context manager for merging tokens more efficiently.
- NEW: Add support for custom pipeline component factories via entry points (#2348).
- NEW: Implement fastText vectors with subword features.
- Add warnings if .similarity method is called with empty vectors or without word vectors.
- Improve rule-based Matcher and add return_matches keyword argument to Matcher.pipe to yield (doc, matches) tuples instead of only Doc objects, and as_tuples to add context to the Doc objects.
- Make stop words via Token.is_stop and Lexeme.is_stop case-insensitive.
🚧 Under construction
This section includes new features and improvements that are planned for the stable v2.1.x release, but aren't included in the nightly yet.
- Enhanced pattern API for rule-based Matcher (see #1971).
- Built-in rule-based NER component to add entities based on match patterns (see #2513).
- Improve tokenizer performance (see #1642).
- Allow retokenizer to update Lexeme attributes on merge (see #2390).
- md and lg models and new, pre-trained word vectors for German, French, Spanish, Italian, Portuguese and Dutch.
🔴 Bug fixes
- Fix issue #1487: Add Doc.retokenize() context manager.
- Fix issue #1574: Make sure stop words are available in medium and large English models.
- Fix issue #1665: Correct typos in symbol Animacy_inan and add Animacy_nhum.
- Fix issue #1865: Correct licensing of it_core_news_sm model.
- Fix issue #1889: Make stop words case-insensitive.
- Fix issue #1903: Add relcl dependency label to symbols.
- Fix issue #2014: Make Token.pos_ writeable.
- Fix issue #2369: Respect pre-defined warning filters.
- Fix serialization of custom tokenizer if not all functions are defined.
⚠️ Backwards incompatibilities
- This version of spaCy requires downloading new models. You can use the spacy validate command to find out which models need updating, and print update instructions.
- If you've been training your own models, you'll need to retrain them with the new version.
- While the Matcher API is fully backwards compatible, its algorithm has changed to fix a number of bugs and performance issues. This means that the Matcher in v2.1.x may produce different results compared to the Matcher in v2.0.x.
- Also note that some of the model licenses have changed: it_core_news_sm is now correctly licensed under CC BY-NC-SA 3.0, and all English and German models are now published under the MIT license.
📈 Benchmarks
Model | Version | UAS | LAS | POS | NER F | Vec | Size |
---|---|---|---|---|---|---|---|
en_core_web_sm | 2.1.0a0 | 91.8 | 90.0 | 96.8 | 85.6 | 𐄂 | 28 MB |
en_core_web_md | 2.1.0a0 | 92.0 | 90.2 | 97.0 | 86.2 | ✓ | 107 MB |
en_core_web_lg | 2.1.0a0 | 92.1 | 90.3 | 97.0 | 86.2 | ✓ | 805 MB |
de_core_news_sm | 2.1.0a0 | 92.0 | 90.1 | 97.2 | 83.8 | 𐄂 | 26 MB |
de_core_news_md | 2.1.0a0 | 92.4 | 90.7 | 97.4 | 84.2 | ✓ | 228 MB |
es_core_news_sm | 2.1.0a0 | 90.1 | 87.2 | 96.9 | 89.4 | 𐄂 | 28 MB |
es_core_news_md | 2.1.0a0 | 90.7 | 88.0 | 97.2 | 89.5 | ✓ | 88 MB |
pt_core_news_sm | 2.1.0a0 | 89.4 | 86.3 | 80.1 | 82.7 | 𐄂 | 29 MB |
fr_core_news_sm | 2.1.0a0 | 88.8 | 85.7 | 94.4 | 67.3 [1] | 𐄂 | 32 MB |
fr_core_news_md | 2.1.0a0 | 88.7 | 86.0 | 95.0 | 70.4 [1] | ✓ | 100 MB |
it_core_news_sm | 2.1.0a0 | 90.7 | 87.1 | 96.1 | 81.3 | 𐄂 | 27 MB |
nl_core_news_sm | 2.1.0a0 | 83.5 | 77.6 | 91.5 | 87.3 | 𐄂 | 27 MB |
xx_ent_wiki_sm | 2.1.0a0 | - | - | - | 83.8 | 𐄂 | 9 MB |

[1] We're currently investigating this, as the results are anomalously low.
💬 UAS: Unlabelled dependencies (parser). LAS: Labelled dependencies (parser). POS: Part-of-speech tags (fine-grained tags, i.e. Token.tag_). NER F: Named entities (F-score). Vec: Model contains word vectors. Size: Model file size (zipped archive).
📖 Documentation and examples
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @DuyguA for the pull requests and contributions.
v2.0.11: Alpha Vietnamese support, fixes to vectors, improved errors and more
📊 Help us improve spaCy and take the User Survey 2018!
✨ New features and improvements
- NEW: Alpha Vietnamese support with tokenization via Pyvi.
- NEW: Improved system for error messages and warnings. Errors now have unique error codes and are referenced in one place, and all unspecified asserts have been replaced with descriptive errors. See #2163 for implementation details, and let us know if you have any suggestions for errors and warnings in #2164!
- Improve language data for Polish.
- Tidy up dependencies and drop six, html5lib, ftfy and requests.
- Improve efficiency (and potentially accuracy) of beam-search training, by randomly using greedy updates for some sentences. This can be controlled by changing the beam_update_prob entry in nlp.parser.cfg. The default value is 0.5, so 50% of beam updates will be done as greedy updates.
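The effect of beam_update_prob can be sketched in plain Python: for each update, a random draw decides whether the cheaper greedy update is used in place of the beam update. This is a simplified illustration with hypothetical function names; the real logic lives in spaCy's parser. Which direction the probability applies in is an assumption here, although at the default of 0.5 both readings give the same split.

```python
import random


def choose_update(beam_update_prob, rng):
    """Return 'greedy' with probability beam_update_prob, else 'beam'."""
    return "greedy" if rng.random() < beam_update_prob else "beam"


rng = random.Random(0)  # seeded so the sketch is reproducible
choices = [choose_update(0.5, rng) for _ in range(10000)]
greedy_share = choices.count("greedy") / float(len(choices))
print(round(greedy_share, 2))
```

With the default value, roughly half of the updates come out as greedy, which is the trade-off between training speed and beam-search supervision described above.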
🔴 Bug fixes
- Fix issue #1554, #1752, #2159: Fix Token.ent_iob after Doc.merge(), and ensure consistency in Doc.ents.
- Fix issue #1660: Fix loading of multiple vector models.
- Fix issue #1967: Allow entity types with dashes.
- Fix issue #2032: Fix accidentally quadratic runtime in Vocab.set_vector.
- Fix issue #2050: Correct mistakes in Italian lemmatizer data.
- Fix issue #2073: Make Token.set_extension work as expected.
- Fix issue #2100, #2151, #2181: Drop six and html5lib and prevent dependency conflict with TensorFlow / Keras.
- Fix issue #2101: Improve error message if token text is empty string.
- Fix issue #2121: Fix Language.to_bytes and pickling in Thinc.
- Fix issue #2156: Fix hashtag example in Matcher docs.
- Fix issue #2177: Don't raise error in set_extension if getter and setter are specified or if default=None, and add error if setter is specified with no getter.
📖 Documentation and examples
- Add example for TensorBoard's standalone embedding projector.
- Improve example for training a new entity type.
- Add formal CITATION for assigning a DOI via Zenodo.
👥 Contributors
Thanks to @jimregan, @justindujardin, @trungtv, @katrinleinweber and @skrcode for the pull requests and contributions.
v2.0.10: Built-in factories to merge spans, small improvements and bug fixes
✨ New features and improvements
- Improve language data for Turkish and Croatian.
- Add built-in factories for merge_entities and merge_noun_chunks to allow models to specify those components as part of their pipeline.
import spacy
nlp = spacy.load('en_core_web_sm')  # assumes this model is installed
merge_entities = nlp.create_pipe('merge_entities')
nlp.add_pipe(merge_entities, after='ner')
🔴 Bug fixes
- Fix issue #2012: Fix Spanish noun_chunks failure caused by typo.
- Fix issue #2040: Make sure Token.lemma always returns a hash value.
- Fix issue #2063: Correct typo in English lookup lemmatization table.
- Fix issue #2103: Correct typo in documentation.
- Fix pickling of Vectors class.
📖 Documentation and examples
- Add example for visualizing spaCy vectors with the TensorBoard Embedding Projector.
- Fix various typos and inconsistencies.
👥 Contributors
Thanks to @thomasopsomer, @alldefector, @DuyguA, @dejanmarich, @justindujardin, @calumcalder, @SebastinSanty, @iann0036, @doug-descombaz and @willismonroe for the pull requests and contributions.
v1.10.1: Fix compatibility with pip
🔴 Bug fixes
- Fix issue #2112: Avoid import pip to ensure compatibility with pip v9.0.2, which deprecated this usage. See pypa/pip#5081 for more details.
👥 Contributors
Thanks to @mdcclv for the pull request!
v2.0.9: Fix issue with msgpack dependency
🔴 Bug fixes
- Fix issue #2015: Pin msgpack-python to 0.5.4 to avoid conflict with new msgpack release.