Release John Snow Labs Spark-NLP 2.0.2: DL Annotators performance improvements, Word Embeddings enhancements and better parallelism · JohnSnowLabs/spark-nlp

Thank you for joining us in this exciting Spark NLP year! We continue to make progress towards a better-performing library, in both speed and accuracy.
This release focuses strongly on the quality and stability of the library, making sure it works well in most cluster environments
and improving compatibility across systems. Word Embeddings continue to be improved for better performance and a lower memory footprint.
Context Spell Checker continues to receive enhancements in concurrency and in its usage of Spark. Finally, TensorFlow-based annotators
have been significantly improved by refactoring the serialization design. Please help us with feedback; we welcome any issue reports!

New Features

The NerCrf annotator now has an includeConfidence param that includes confidence scores for predictions in the metadata

Enhancements

Improved cluster-mode performance in TensorFlow annotators by serializing internal information to bytes
Doc2Chunk annotator has new params startCol, startColByTokenIndex, failOnMissing and lowerCase that allow better chunking of documents
All annotations that derive from sentence or chunk types now contain metadata information referring to the sentence or chunk ID they belong to
ContextSpellChecker now creates a window around the token to improve computation performance
Improved WordEmbeddings matching accuracy by trying alternative case sensitive tokens
WordEmbeddings won't load twice if already loaded
WordEmbeddings can use embeddingsRef if a source was not provided, improving reuse of embeddings in a pipeline
New WordEmbeddings param includeEmbeddings allows annotators not to save the entire embeddings source along with them
Contrib TensorFlow dependencies now load only if necessary
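
The ContextSpellChecker item above mentions building a window around the token. As a rough illustration of that general idea (not Spark NLP's actual implementation), the sketch below limits the context to a few neighbouring tokens. The function name `context_window` and the `radius` parameter are hypothetical and chosen only for this example.

```python
from typing import List


def context_window(tokens: List[str], index: int, radius: int = 2) -> List[str]:
    """Return the tokens within `radius` positions of tokens[index], inclusive.

    Hypothetical helper: restricting work to a small window keeps the cost
    per token bounded instead of growing with the sentence length.
    """
    if not 0 <= index < len(tokens):
        raise IndexError("token index out of range")
    start = max(0, index - radius)                # clamp at the sentence start
    end = min(len(tokens), index + radius + 1)    # clamp at the sentence end
    return tokens[start:end]


sentence = "the quick brwon fox jumps over the lazy dog".split()
window = context_window(sentence, sentence.index("brwon"), radius=2)
print(window)
```

Only the tokens inside the window would then be scored as context for the candidate correction; the window size used by the library itself is not stated in these notes.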

Bugfixes

Added missing Symmetric delete pretrained model
Fixed a broken param name in Normalizer (thanks @RobertSassen)
Fixed Cloudera cluster support
Fixed concurrent access in ContextSpellChecker when using a high number of partitions and in LightPipelines
Fixed POS dataset creator to better handle corrupted pairs
Fixed a bug where WordEmbeddings did not match exact case-sensitive tokens in some scenarios
Fixed OCR Tess4J initialization problems in concurrent scenarios
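
The WordEmbeddings case-matching items in this release (exact case-sensitive matches, plus alternative casings as a fallback) can be pictured with a small lookup sketch. This is an assumption about the general technique, not the library's code: the function `lookup_embedding`, the order in which casings are tried, and the toy table are all hypothetical.

```python
from typing import Dict, List, Optional


def lookup_embedding(token: str, table: Dict[str, List[float]]) -> Optional[List[float]]:
    """Look up a vector, preferring the exact token and then other casings.

    Hypothetical order: exact, lowercase, uppercase, capitalized.
    """
    candidates = [token, token.lower(), token.upper(), token.capitalize()]
    for candidate in candidates:
        vector = table.get(candidate)
        if vector is not None:
            return vector
    return None  # token is out of vocabulary under every casing tried


# Toy embeddings table used only for this illustration.
table = {"Paris": [0.1, 0.2], "nlp": [0.3, 0.4]}

exact = lookup_embedding("Paris", table)     # found by exact match
fallback = lookup_embedding("paris", table)  # found via the capitalized form
missing = lookup_embedding("berlin", table)  # not found under any casing
```

Trying the exact token first means a case-sensitive entry is never shadowed by a differently cased one, while the fallbacks recover tokens whose casing differs from the embeddings source.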

Models and Pipelines

Renaming of models and pipelines (work in progress)
Better output column naming in pipelines

Developer API

Further unified the WordEmbeddings interface with dimension params and individual setters
Improved unit tests for better compatibility on Windows
Python embeddings moved to sparknlp.embeddings
