v1.6.0: Improvements to tokenizer and tests

@honnibal released this 16 Jan 13:14

✨ Major features and improvements

  • Update the token exception handling mechanism to allow arbitrary functions to be used as token exception matchers.
  • Improve how tokenizer exceptions for English contractions and punctuation are generated.
  • Update language data for Hungarian and Swedish tokenization.
  • Update to use Thinc v6 to prepare for spaCy v2.0.
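
To illustrate the idea of using an arbitrary function as a token exception matcher, here is a minimal sketch in plain Python. It is not spaCy's tokenizer: `toy_tokenize`, `url_match` and the `token_match` parameter are hypothetical names chosen for this example, and the splitting rule is deliberately simplistic.

```python
import re

# Hypothetical matcher: any chunk that looks like a URL should stay whole.
URL_PATTERN = re.compile(r"^https?://\S+$")


def url_match(chunk):
    """Return a truthy value if the chunk should be kept as a single token."""
    return URL_PATTERN.match(chunk)


def toy_tokenize(text, token_match=None):
    """Split on whitespace, then split each chunk into word and punctuation
    pieces, unless token_match claims the chunk as one token."""
    tokens = []
    for chunk in text.split():
        if token_match is not None and token_match(chunk):
            # The matcher function overrides the default splitting rules.
            tokens.append(chunk)
        else:
            # Default rule: runs of word characters, or single punctuation marks.
            tokens.extend(re.findall(r"\w+|[^\w\s]", chunk))
    return tokens


with_matcher = toy_tokenize("Visit http://example.com now!", token_match=url_match)
without_matcher = toy_tokenize("Visit http://example.com now!")
```

With the matcher, the URL survives as one token; without it, the default rule breaks the URL apart at every punctuation mark. Because the matcher is just a callable, any predicate (a regex, a lookup table, a custom function) could be plugged in the same way.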

🔴 Bug fixes

  • Fix issue #326: Tokenizer is now more consistent and handles abbreviations correctly.
  • Fix issue #344: Tokenizer now handles URLs correctly.
  • Fix issue #483: Period after two or more uppercase letters is split off in tokenizer exceptions.
  • Fix issue #631: Add richcmp method to Token.
  • Fix issue #718: Contractions with She are now handled correctly.
  • Fix issue #736: Times are now tokenized with correct string values.
  • Fix issue #743: Token is now hashable.
  • Fix issue #744: were and Were are now excluded correctly from contractions.
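
As a rough illustration of what the fixes for #631 and #743 make possible, the sketch below shows a toy class that supports rich comparison and hashing. `ToyToken` is a hypothetical stand-in written in pure Python, and the choice to compare by document id and position is an assumption for the example, not a description of how spaCy's `Token` is implemented.

```python
from functools import total_ordering


@total_ordering
class ToyToken:
    """Hypothetical token identified by its document id and position."""

    def __init__(self, doc_id, i, text):
        self.doc_id = doc_id
        self.i = i
        self.text = text

    def _key(self):
        # Assumption: identity is (document, position), not the text itself.
        return (self.doc_id, self.i)

    def __eq__(self, other):
        if not isinstance(other, ToyToken):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, ToyToken):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        # Must agree with __eq__ so equal tokens land in the same bucket.
        return hash(self._key())


she = ToyToken(0, 0, "She")
is_ = ToyToken(0, 1, "'s")
same_she = ToyToken(0, 0, "She")

unique_tokens = {she, is_, same_she}          # hashable: usable in sets
ordered = sorted([is_, she])                  # comparable: usable in sorted()
```

Objects that are both comparable and hashable can be used as dictionary keys, collected into sets, and sorted, which is the practical benefit of the two fixes.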

📋 Tests

  • Modernise and reorganise all tests and remove model dependencies where possible.
  • Improve test speed to ~20s for basic tests (down from >80s) and ~100s including models (down from >200s).
  • Add fixtures for spaCy components and test utilities, e.g. to create a Doc object manually.
  • Add documentation for tests to explain conventions and organisation.

👥 Contributors

Thanks to @oroszgy, @magnusburton, @guyrosin and @danielhers for the pull requests!