v1.2.0: Alpha tokenizers for Chinese, French, Spanish, Italian and Portuguese

@honnibal released this 05 Nov 01:35

✨ Major features and improvements

  • NEW: Support Chinese tokenization, via Jieba.
  • NEW: Alpha support for French, Spanish, Italian and Portuguese tokenization.

🔴 Bug fixes

  • Fix issue #376: POS tags for "and/or" are now correct.
  • Fix issue #578: --force argument on download command now operates correctly.
  • Fix issue #595: Lemmatization corrected for some base forms.
  • Fix issue #588: Matcher now rejects empty patterns.
  • Fix issue #592: Added exception rule for tokenization of "Ph.D."
  • Fix issue #599: Empty documents now considered tagged and parsed.
  • Fix issue #600: Add missing token.tag and token.tag_ setters.
  • Fix issue #596: Added the unicode import that was missing when compiling regexes; its absence led to incorrect tokenization.
  • Fix issue #587: Resolved bug that caused Matcher to sometimes segfault.
  • Fix issue #429: Ensure missing entity types are added to the entity recognizer.
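
To illustrate what a tokenizer exception rule such as the one for "Ph.D." (issue #592) does in principle, here is a minimal pure-Python sketch. It is not spaCy's implementation: the `EXCEPTIONS` table, the `SUFFIX` pattern and the `tokenize` function are hypothetical names invented for this example, and the splitting logic is deliberately simplistic.

```python
import re

# Hypothetical exception table: strings that must survive as a single token.
EXCEPTIONS = {"Ph.D."}

# Trailing punctuation that this toy tokenizer normally splits off.
SUFFIX = re.compile(r"[.,!?;:]+$")


def tokenize(text):
    """Whitespace-split text, then split trailing punctuation unless excepted."""
    tokens = []
    for chunk in text.split():
        if chunk in EXCEPTIONS:
            # Exception rule: keep the chunk intact, periods included.
            tokens.append(chunk)
            continue
        match = SUFFIX.search(chunk)
        if match and match.start() > 0:
            tokens.append(chunk[:match.start()])
            # Emit each trailing punctuation character as its own token.
            tokens.extend(match.group())
        else:
            tokens.append(chunk)
    return tokens


print(tokenize("She has a Ph.D. in physics."))
```

Without the entry in the exception table, the same toy tokenizer would split the final period off "Ph.D." like any other sentence-final punctuation, which is the kind of behaviour an exception rule is meant to prevent.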