Release v1.6.0: Improvements to tokenizer and tests · explosion/spaCy

✨ Major features and improvements

Update the token exception handling mechanism to allow arbitrary functions to be used as token exception matchers.
Improve how tokenizer exceptions for English contractions and punctuation are generated.
Update language data for Hungarian and Swedish tokenization.
Update to use Thinc v6 to prepare for spaCy v2.0.
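
To illustrate the idea of using a function as a token exception matcher, here is a minimal pure-Python sketch. It is not spaCy's implementation: the `tokenize` function, its `token_match` parameter and the `is_initialism` helper are hypothetical names chosen for this example. The toy tokenizer splits trailing punctuation off each whitespace-delimited chunk, unless the matcher function says the whole chunk should stay as one token.

```python
import re

# Hypothetical matcher: two or more "letter + period" pairs, e.g. "U.S." or "e.g."
INITIALISM = re.compile(r"(?:[A-Za-z]\.){2,}")


def is_initialism(chunk):
    """Return True if the whole chunk looks like a dotted abbreviation."""
    return INITIALISM.fullmatch(chunk) is not None


def tokenize(text, token_match=None):
    """Toy tokenizer (illustration only, not spaCy's algorithm).

    Splits on whitespace and peels trailing punctuation off each chunk,
    unless token_match(chunk) returns True, in which case the chunk is
    kept as a single token.
    """
    tokens = []
    for chunk in text.split():
        if token_match is not None and token_match(chunk):
            tokens.append(chunk)
            continue
        # Peel trailing punctuation off one character at a time.
        suffixes = []
        while len(chunk) > 1 and chunk[-1] in ".,!?":
            suffixes.append(chunk[-1])
            chunk = chunk[:-1]
        tokens.append(chunk)
        tokens.extend(reversed(suffixes))
    return tokens


with_matcher = tokenize("The U.S. is big.", token_match=is_initialism)
without_matcher = tokenize("The U.S. is big.")
```

With the matcher, "U.S." survives as one token; without it, the final period is split off. Because the matcher is an ordinary callable, any predicate can be plugged in, not only a fixed lookup table of exception strings.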

Fix issue #326: Tokenizer is now more consistent and handles abbreviations correctly.
Fix issue #344: Tokenizer now handles URLs correctly.
Fix issue #483: Period after two or more uppercase letters is split off in tokenizer exceptions.
Fix issue #631: Add richcmp method to Token.
Fix issue #718: Contractions with She are now handled correctly.
Fix issue #736: Times are now tokenized with correct string values.
Fix issue #743: Token is now hashable.
Fix issue #744: were and Were are now excluded correctly from contractions.
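
As a rough illustration of what hashable and comparable tokens make possible (issues #631 and #743), the sketch below uses a hypothetical stand-in class, `ToyToken`, rather than spaCy's `Token`. The choice to identify a token by a document id and an index is an assumption made for this example; the release notes do not say how spaCy compares or hashes tokens.

```python
from functools import total_ordering


@total_ordering
class ToyToken:
    """Hypothetical stand-in for a token, identified by (doc_id, index)."""

    def __init__(self, doc_id, i, text):
        self.doc_id = doc_id
        self.i = i
        self.text = text

    def _key(self):
        # Assumption for this sketch: identity is the document plus position.
        return (self.doc_id, self.i)

    def __eq__(self, other):
        if not isinstance(other, ToyToken):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, ToyToken):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        # Must agree with __eq__: equal tokens hash to the same value.
        return hash(self._key())


tokens = [ToyToken(0, i, w) for i, w in enumerate(["She", "'s", "here"])]

# Hashable tokens can be stored in sets and used as dictionary keys.
seen = {tokens[0], tokens[2], ToyToken(0, 0, "She")}

# Comparable tokens can be sorted back into document order.
ordered = sorted(seen)
```

Here `seen` collapses the duplicate first token, and `sorted` restores document order.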

Modernise and reorganise all tests and remove model dependencies where possible.
Improve test speed to ~20s for basic tests (from previously >80s) and ~100s including models (from previously >200s).
Add fixtures for spaCy components and test utilities, e.g. to create a Doc object manually.
Add documentation for tests to explain conventions and organisation.

Thanks to @oroszgy, @magnusburton, @guyrosin and @danielhers for the pull requests!