John Snow Labs Spark-NLP 1.8.1: ML SentenceDetector, improved ContextSpellChecker and bugfixes

@saif-ellafi saif-ellafi released this 26 Jan 00:20

Overview

This hotfix release of Spark-NLP improves framework support by adding Maven coordinates for OCR and allowing files to be retrieved from S3.
We have also included code for generating graphs for NerDL and for creating your own metadata files for a private model downloader.
As a new feature, we are including an experimental machine-learning-based sentence detector, which uses NER to detect sentence boundaries.
Aside from this, we are including a few bug fixes and OCR improvements. Enjoy, and thanks again for the community contributions!


New Features

  • New DeepSentenceDetector annotator takes Spark-NLP's NER Deep Learning models as a base to improve sentence detection

Enhancements

  • Improved accuracy of ContextSpellChecker by enabling re-ranking of candidate words according to a weighted Levenshtein distance
  • The OCR process now splits content into rows by default, whether paragraphs or pages are identified, for improved parallelism. This can be turned off
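
To illustrate the idea behind re-ranking by a weighted Levenshtein distance, here is a minimal standalone sketch in plain Python. It is not the ContextSpellChecker implementation: the function names, the weight table and the costs are all hypothetical and chosen only for illustration. The assumption is that substituting commonly confused characters costs less than an arbitrary substitution, so candidates that differ only by such a confusion rank higher.

```python
from typing import Dict, List, Tuple

# Hypothetical substitution weights (an assumption for this sketch):
# commonly confused character pairs cost less than the default 1.0.
SUBSTITUTION_WEIGHTS: Dict[Tuple[str, str], float] = {
    ("a", "e"): 0.5,
    ("e", "a"): 0.5,
    ("i", "y"): 0.5,
    ("y", "i"): 0.5,
}


def weighted_levenshtein(
    source: str,
    target: str,
    weights: Dict[Tuple[str, str], float] = SUBSTITUTION_WEIGHTS,
    insert_cost: float = 1.0,
    delete_cost: float = 1.0,
) -> float:
    """Edit distance where each substitution cost is looked up per character pair."""
    # previous[j] holds the cost of turning the processed prefix of source
    # into the first j characters of target.
    previous = [j * insert_cost for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [i * delete_cost]
        for j, t_char in enumerate(target, start=1):
            if s_char == t_char:
                substitute = previous[j - 1]
            else:
                # Unknown pairs fall back to the default cost of 1.0.
                substitute = previous[j - 1] + weights.get((s_char, t_char), 1.0)
            current.append(
                min(
                    previous[j] + delete_cost,     # delete s_char
                    current[j - 1] + insert_cost,  # insert t_char
                    substitute,                    # match or substitute
                )
            )
        previous = current
    return previous[-1]


def rerank_candidates(word: str, candidates: List[str]) -> List[str]:
    """Order candidates from the lowest to the highest weighted distance to word."""
    return sorted(candidates, key=lambda candidate: weighted_levenshtein(word, candidate))
```

With these hypothetical weights, `rerank_candidates("thet", ["thot", "that", "then"])` places "that" first, because the "e" to "a" substitution is cheaper than the other single-character edits.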

Examples and use cases

  • Added Scala examples for Sentiment analysis and Lemmatizer in Italian (Thanks Vincenzo Gaudenzi from DXC.technology for dataset and model contribution!!!)

Bugfixes

  • Fixed a bug in the Norvig and Symmetric SpellCheckers where the pattern parameter was not provided properly on the Scala side (Thanks @johnmccain for reporting!)

Framework

  • Added hadoop-aws dependency for remote download capabilities (e.g. word embedding sets)

Other

  • Code for generating metadata files for pretrained model downloads is now included. This may be useful if anyone wants to set up their own private local model downloader service
  • NerDL Graphs generation code is now included in the library. This allows the usage of custom word embedding dimensions and feature counts.

Special mentions

  • Vincenzo Gaudenzi (DXC.technology) for contributing Italian datasets and models. @maziyarpanahi for creating examples with them.
  • @correlator from Deep6.ai for contributing feedback on Slack and feature feedback in general
  • @johnmccain for reporting bugs in the spell checkers
  • @rohit-nlp for delivering maven coordinates for OCR
  • @haimco10 for contributing a sentence detector improvement for the apostrophe use case. Not merged due to specific issues involved.