Releases · JohnSnowLabs/spark-nlp

29 May 19:46

2.0.5

8324a2f

John Snow Labs Spark-NLP 2.0.5: NerDL customizable graphs and cluster fixes

This release bumps Spark NLP's default Apache Spark version to 2.4.3. Spark had been undergoing testing with Scala 2.12 and is now back on 2.11, so this should be a working release.
In this version, we fixed a number of pretrained models and focused on improving the flexibility of the NerDL annotator, which is among the most popular ones, if not the most popular, based on user feedback.
Users can point to graphs they create without having to recompile the library. Graph options, as well as whether to use Tensorflow contrib, are now user defined.
Particular thanks to @CyborgDroid for reporting important, well-documented bugs that helped us improve Spark NLP.
Thank you for reporting issues and sending feedback; we always welcome more. Join us on Slack!

Enhancements

ViveknSentiment annotator now includes confidence score in metadata
NerDL now has setGraphFolder, which accepts a path to a folder with custom graphs generated using python/tensorflow code
NerDL now has setConfigProtoBytes, which lets users submit their own (serialized) ConfigProto for the graph settings
NerDLApproach now has setUseContrib, which lets the user decide at training time whether or not to use contrib. Contrib LSTM Cells have been shown to return more accurate results, but do not work on Windows yet.
Updated default tensorflow settings to enable GPU allow_growth and disabled the spammy log device placement messages
Spark version bumped to 2.4.3

Bugfixes

Fixed contrib NerDL models not working properly in clusters such as Databricks (Thanks @CyborgDroid)
Fixed sparknlp.start(include_ocr=True) missing dependencies for OCR
Fixed DependencyParser pretrained models not working properly in Python

Models and Pipelines

NerDL will download the noncontrib model if Windows is detected, for better compatibility
Noncontrib versions of pipelines with NerDL have been uploaded, as well as new models. Check the documentation for the complete list
Improved the error message shown when a user on Windows tries to load a contrib NerDL model
Fixed ViveknSentimentModel not working properly (Thanks @CyborgDroid)
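
The Windows fallback described in this list can be pictured with a small sketch. It is not Spark NLP's actual code: the function `pick_ner_model` and the model names `ner_dl_contrib` and `ner_dl` are hypothetical placeholders, and the sketch only assumes that the operating system name drives the choice.

```python
import platform

# Hypothetical model names, used only for illustration.
CONTRIB_MODEL = "ner_dl_contrib"
NONCONTRIB_MODEL = "ner_dl"


def pick_ner_model(os_name=None):
    """Return the noncontrib model name on Windows, the contrib one elsewhere."""
    if os_name is None:
        os_name = platform.system()  # e.g. "Windows", "Linux", "Darwin"
    if os_name.lower().startswith("windows"):
        return NONCONTRIB_MODEL
    return CONTRIB_MODEL


# The result depends on the machine running the sketch.
print(pick_ner_model())
```

Passing `os_name` explicitly makes the decision easy to check without changing machines.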

Developer API

Embeddings in python moved to annotator module for consistency
SourceStream ResourceHelper class now properly handles cluster files for Dependency Parser
Metadata model reader now ignores empty lines instead of failing
Unified lang instead of language attribute name in pretrained API

Contributors

CyborgDroid

22 May 19:46

saif-ellafi

2.0.4

d919814

John Snow Labs Spark-NLP 2.0.4: Fixes for dependency parser and pretrained models

We are excited about the Spark NLP workshop (spark-nlp-workshop repository) being so useful for many users.
We also took a step forward by moving the website's documentation to an easy-to-maintain Jekyll template with Markdown. The Spark NLP library received key bug fixes
in this release. Thanks to the community for reporting issues on GitHub. Much more to come, as always.

Bugfixes

Fixed DependencyParser and TypedDependencyParser working inaccurately
Fixed a bug preventing the WordEmbeddingsModel class from being loaded in Python
Fixed wrong pretrained model names that prevented some pretrained models from working properly
Fixed BertEmbeddings not being capable of loading from file due to a reader exception

Documentation

Website documentation migrated to GitHub wiki page (WIP)

Developer API

OcrHelper now reports failed file name when throwing exceptions (Thanks @kgeis)
Fixed Annotation function explodeAnnotations to consider replacing output column scenarios
Fixed TRAVIS CI unit tests

Contributors

kgeis

29 Apr 21:13

saif-ellafi

2.0.3

c41f341

John Snow Labs Spark-NLP 2.0.3: Hotfix for tensorflow models in cluster improvements

Overview

Shortly after 2.0.2, a hotfix release was made to address two bugs that prevented users from using pretrained tensorflow models in clusters.
Please read release notes for 2.0.2 to catch up!

Bugfixes

Fixed logger serialization, which caused issues when executors serialized TensorflowWrapper
Fixed contrib loading in cluster, when retrieving a Tensorflow session

29 Apr 08:58

saif-ellafi

2.0.2

e41ca09

John Snow Labs Spark-NLP 2.0.2: DL Annotators performance improvements, Word Embedding enhancements and better parallelism

Thank you for joining us in this exciting Spark NLP year! We continue to make progress towards a better performing library, both in speed and in accuracy.
This release focuses strongly on the quality and stability of the library, making sure it works well in most cluster environments
and improving compatibility across systems. Word Embeddings continue to be improved for better performance and a lower memory footprint.
Context Spell Checker continues to receive enhancements in concurrency and usage of Spark. Finally, tensorflow based annotators
have been significantly improved by refactoring the serialization design. Help us with feedback; we welcome any issue reports!

New Features

NerCrf annotator now has an includeConfidence param that includes confidence scores for predictions in metadata

Enhancements

Cluster mode performance improved in tensorflow annotators by serializing internal information to bytes
Doc2Chunk annotator has new params startCol, startColByTokenIndex, failOnMissing and lowerCase, which allow better chunking of documents
All annotations that derive from sentence or chunk types now contain metadata information referring to the sentence or chunk ID they belong to
ContextSpellChecker now creates a window around the token to improve computation performance
Improved WordEmbeddings matching accuracy by trying alternative case sensitive tokens
WordEmbeddings won't load twice if already loaded
WordEmbeddings can use embeddingsRef if source was not provided, improving reuse of embeddings in a pipeline
New WordEmbeddings param includeEmbeddings allows annotators not to save the entire embeddings source along with them
Contrib tensorflow dependencies now only load if necessary
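
To illustrate the idea of "trying alternative case sensitive tokens" when matching embeddings, here is a minimal sketch. It is not the library's implementation: the function `lookup_embedding`, the order in which casings are tried, and the toy vectors are all assumptions made for the example.

```python
def lookup_embedding(token, vectors):
    """Try the exact token first, then alternative casings; None if nothing matches."""
    # Assumed fallback order: exact, lower case, upper case, capitalized.
    candidates = [token, token.lower(), token.upper(), token.capitalize()]
    for candidate in candidates:
        if candidate in vectors:
            return vectors[candidate]
    return None


# Toy 2-dimensional vectors, made up for the example.
vectors = {"paris": [0.1, 0.2], "NASA": [0.3, 0.4]}

print(lookup_embedding("Paris", vectors))    # falls back to "paris"
print(lookup_embedding("nasa", vectors))     # falls back to "NASA"
print(lookup_embedding("unknown", vectors))  # no casing matches
```

Trying the exact token first keeps case-sensitive vocabularies precise, while the fallbacks reduce the number of tokens left without a vector.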

Bugfixes

Added missing Symmetric delete pretrained model
Fixed a broken param name in Normalizer (thanks @RobertSassen)
Fixed Cloudera cluster support
Fixed concurrent access in ContextSpellChecker in high partition number use cases and LightPipelines
Fixed POS dataset creator to better handle corrupted pairs
Fixed a bug in Word Embeddings not matching exact case sensitive tokens in some scenarios
Fixed OCR Tess4J initialization problems in concurrent scenarios

Models and Pipelines

Renaming of models and pipelines (work in progress)
Better output column naming in pipelines

Developer API

Unified more WordEmbeddings interface with dimension params and individual setters
Improved unit tests for better compatibility on Windows
Python embeddings moved to sparknlp.embeddings

Contributors

RobertSassen

31 Mar 08:09

saif-ellafi

1.8.4

6f77952

John Snow Labs Spark-NLP 1.8.4: Chunk annotators match content by sentence, sentences include id

This release is meant to push downstream a few improvements from 2.0.x to the 1.8.x branch, mostly with the objective of keeping the stable branch line stable, and solving a few serious issues that were pending. This makes 1.8.4 an ideal version for stable deployments.

Enhancements

CHUNK type annotators now match content within sentence bounds, which improves accuracy
Improved CHUNK type annotators to include sentence index information in metadata. May be used to improve matching accuracy.
Doc2Chunk annotator now has new params failOnMissing and lowerCase (for matching), and can treat startCol as token indexed
SentenceDetector and DeepSentenceDetector now have maxLength disabled by default; when set, it splits appropriately on whitespace
SentenceDetector now includes the sentence ID in metadata
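
As a rough picture of what matching "within sentence bounds" with a sentence index in metadata can look like, consider the sketch below. It is plain Python rather than Spark NLP code: the function `match_chunks`, the dictionary layout, the inclusive `end` offset and the `"sentence"` metadata key are assumptions chosen for illustration.

```python
def match_chunks(document, sentences, phrases):
    """Find phrases inside each sentence separately.

    sentences is a list of (begin, end) character offsets, end exclusive.
    Each returned chunk carries the index of the sentence it was found in.
    """
    chunks = []
    for sentence_id, (begin, end) in enumerate(sentences):
        sentence_text = document[begin:end]
        for phrase in phrases:
            position = sentence_text.find(phrase)
            while position != -1:
                chunks.append({
                    "begin": begin + position,
                    "end": begin + position + len(phrase) - 1,  # inclusive end (assumed)
                    "result": phrase,
                    "metadata": {"sentence": str(sentence_id)},
                })
                position = sentence_text.find(phrase, position + 1)
    return chunks


document = "New York is big. I love New York."
sentences = [(0, 16), (17, 33)]

for chunk in match_chunks(document, sentences, ["New York"]):
    print(chunk)
```

Because each sentence is searched on its own, a phrase that would straddle two sentences (for example `"big. I"`) is never matched.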

24 Mar 06:47

saif-ellafi

2.0.1

8060bc2

John Snow Labs Spark-NLP 2.0.1: Performance improvements, serialization refactors and fixed cluster mode support

Thanks for following up after our 2.0.0 release! This release covers a few holes left by the immense 2.0.0 release
and addresses high priority issues found after it. More importantly, the library should now behave correctly when using
Spark cluster modes, and memory and CPU utilization should be back to normal levels after some serious profiling of serialization
revealed a bunch of problems. Aside from performance and resource management improvements, we include an OCR dependency handler in the start() function
and improve GPU support for NER Deep Learning models. Finally, check out our spark-nlp-workshop repo, it has cool features!

Enhancements

Improved serialization of Deep Learning models, shows performance boosts of up to 2.5 times over 1.8.3
Tensorflow contrib libraries now managed correctly across a cluster
Reverted useFeatureBroadcasting after internal benchmarks proved it was performing better
SparkNLP.start() and sparknlp.start() now accept an includeOCR parameter, which automatically includes the OCR library
Recreated NerDL Graphs to allow GPU allow_growth in tensorflow to improve memory management with GPU
Expanded GPU coverage in NerDL graph
Reduced NerDL Batch Size for better compatibility with GPUs

Bugfixes

Fixed deep learning models not working across a cluster due to a bug in inputBuffers from graph reading
Fixed a bug in POS() training function which did not work correctly from Python
Fixed a bug in OCR where page number and intersection were not correctly matched
Correctly handle exceptions when training Norvig and Symmetric Spell Checkers from dataframes

Developer API

ContextSpellChecker now follows Features API correctly

Documentation

spark-nlp-workshop repository has been expanded with better documentation and new notebooks
We are still catching up with the 2.x release!

21 Mar 19:33

saif-ellafi

2.0.0

1adcd70

John Snow Labs Spark-NLP 2.0.0: Bert embeddings, embeddings as annotators, better OCR, new pretrained pipelines

Thank you for following up with the biggest changelog ever on Spark NLP: Spark NLP 2.0.0! Where to begin?
We have no fewer than 50 Pull Requests merged this time. Most importantly, we have become the first library to have a production
ready implementation of BERT embeddings. Along with this interesting deep learning and context based embeddings algorithm, here is a quick overview of new things:

Word Embeddings as well as Bert Embeddings are now annotators, just like any other component in the library. This means embeddings can be
cached in memory through DataFrames, saved on disk and shared as part of pipelines!
We revamped and enhanced Named Entity Recognition (NER) Deep Learning models to a new state of the art level, reaching up to 93% micro-averaged F1 in the industry standard.
We upgraded the tensorflow version and also started using contrib LSTM Cells.
Performance and memory usage improvements also tag along: serialization throughput of Deep Learning annotators improved thanks to feedback from Apache Spark contributor Davies Liu.
We revamped and expanded our pretrained pipelines list and added new pretrained models for different languages, together with
tons of new example notebooks, which include changes that aim to make the library easier to use. The API overall was modified to help newcomers get started.
The OCR module comes with a handful of improvements that increase accuracy.
All of this comes together with a full range of bug fixes and annotator improvements; see the details below!
Bear with us, since documentation is still catching up, as are new models yet to be made available. Stay tuned on Slack!

New Features

BertEmbeddings annotator, with four Google models ready to be used through Spark NLP as part of your pipelines. Includes Wordpiece tokenization.
WordEmbeddings, our previous embeddings system, is now an Annotator that can be serialized along with Spark ML pipelines
Created training helper functions that create spark datasets from files, such as CoNLL and POS tagging
NER DL has been revamped by using contrib LSTM Cells. Added library handling for different OS.

Enhancements

OCR improved handling of images by adding binarizing of buffered segments
OCR now allows automatic adaptive scaling
SentenceDetector params merged between DL and Rule based annotators
SentenceDetector max length has been disabled by default, and now truncates by whitespace
Part of Speech, NER, Spell Checking and Vivekn Sentiment Analysis annotators now train from dataset passed to fit() using Spark in the process
Tokens and Chunks now hold metadata information regarding which sentence they belong to by sentence ID
AnnotatorApproach annotators now accept a param trainingCols, allowing them to use different inputs in training and in prediction. Improves Pipeline versatility.
LightPipelines now allow the transform() method to be called against a DataFrame
Noticeable performance gains by improving serialization performance in annotators through removal of transient variables
Spark NLP in 30 seconds now provides a function SparkNLP.start() and sparknlp.start() (python) that automatically creates a local Spark session.
Improved DateMatcher accuracy
Improved Normalizer annotator by supporting and tokenizing a slang dictionary, with case sensitivity matching option
ContextSpellChecker now is capable of handling multiple sentences in a row
PretrainedPipeline feature now allows handling John Snow Labs remote pretrained pipelines to make it easy to update and access new models
Symmetric Delete spell checking model improved training performance
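
The note that SentenceDetector's max length "now truncates by whitespace" can be sketched as follows. This is an illustration under assumptions, not the annotator's code: the function `split_by_max_length` is hypothetical, `None` stands for "disabled", and a single word longer than the limit is kept whole.

```python
def split_by_max_length(sentence, max_length):
    """Split a sentence into pieces of at most max_length characters.

    Breaks only on whitespace; max_length=None disables splitting.
    A single word longer than max_length is kept whole (a simplification).
    """
    if max_length is None or len(sentence) <= max_length:
        return [sentence]
    pieces, current = [], ""
    for word in sentence.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_length or not current:
            current = candidate
        else:
            pieces.append(current)
            current = word
    if current:
        pieces.append(current)
    return pieces


print(split_by_max_length("the quick brown fox jumps", 10))
print(split_by_max_length("the quick brown fox jumps", None))
```

Breaking on whitespace avoids cutting a token in half, which would otherwise confuse downstream annotators such as the tokenizer.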

Models and Pipelines

Added more than 15 pretrained pipelines that cover a huge range of use cases. To be documented
Improved multi language support by adding French and Italian pipelines and models. More to come!
Dependency Parser annotators now include a pretrained English model based on CoNLL-U 2009

Bugfixes

Fixed python classname reference when deserializing pipelines
Fixed serialization in ContextSpellChecker
Fixed a bug in LightPipeline that caused output from embedded pipelines in a PipelineModel not to be included
Fixed a wrong param name in DateMatcher that prevented accessing it properly
Fixed a bug where DateMatcher didn't know how to handle dash in dates where year had two digits instead of four
Fixed a ContextSpellChecker bug that prevented it from being used repeatedly with collections in LightPipeline
Fixed a bug in OCR that made it blow up with some image formats when using text preferred method
Fixed a bug in OCR that made params not work in cluster mode
Fixed OCR setSplitPages and setSplitRegions to work properly if tesseract detected multiple regions
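
For readers curious about the DateMatcher fix involving dashes and two-digit years, the sketch below shows one way such dates can be parsed. It does not reproduce DateMatcher's rules: the month-first ordering, the `pivot` used to expand two-digit years, and the function name `parse_date` are all assumptions.

```python
import re
from datetime import date

# Month, day and year separated by "-" or "/"; the year has two or four digits.
DATE_PATTERN = re.compile(r"\b(\d{1,2})[-/](\d{1,2})[-/](\d{2}|\d{4})\b")


def parse_date(text, pivot=50):
    """Return the first date found in text, or None.

    Two-digit years at or above the pivot are read as 19xx, the rest as 20xx
    (an assumed convention). Impossible dates such as month 13 raise ValueError.
    """
    match = DATE_PATTERN.search(text)
    if match is None:
        return None
    month, day, year = (int(group) for group in match.groups())
    if len(match.group(3)) == 2:
        year += 1900 if year >= pivot else 2000
    return date(year, month, day)


print(parse_date("signed on 03-21-19"))
print(parse_date("signed on 03/21/2019"))
```

Both calls describe the same day, so they print the same date.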

Developer API

AnnotatorType params renamed to inputAnnotatorTypes and outputAnnotatorTypes
Embeddings now serialize along a FloatArray in Annotation class
Disabled useFeatureBroadcasting, which showed better performance numbers when training large models in annotators that use Features
OCR must be instantiated
OCR works best with 4.0.0-beta.1

Build and release

Added GPU build with tensorflow-gpu to Maven coordinates
Removed .jar file from pip package

24 Feb 05:34

saif-ellafi

1.8.3

62c78fc

John Snow Labs Spark-NLP 1.8.3: Revisited DeepSentenceDetector, embeddings from S3, fixed python deserialization modules

Overview

We're glad to announce a new release of Spark NLP. This one is a call-out to the community, who contributed
immensely by reporting bugs and giving feedback on the library. This release focuses on various bugfixes around DeepSentenceDetector
and on Python deserialization of some specific pipelines. It also improves the DeepSentenceDetector, allowing further fine-tuning
and customization. In addition, embeddings are now cached in the models folder, with further improvements towards accessing
them through S3 storage. Finally, we have made serious improvements in notebooks and documentation around the library.
Special thanks to @Tshimanga and @haimco10 for very interesting contributions. See you on Slack!

Enhancements

Improved OCR performance in skew detection
SentenceDetector now better handles single quote protections (Thanks @haimco10)
DeepSentenceDetector now can explodeSentences (Thanks @Tshimanga from Deep6.ai)
EmbeddingsHelper now is capable of caching downloaded embeddings to avoid re-downloading
Application.conf file may now be read from an s3 location
DeepSentenceDetector has now access to all pragmatic SentenceDetector params in order to fine-tune it

Bugfixes

Fixed ambiguous classpath resolution in pyspark, causing errors in deserializing some models
Fixed DeepSentenceDetector not being deserializable in PySpark
Fixed Chunk2Doc and Doc2Chunk annotators not being loadable in PySpark
Fixed a bug where DeepSentenceDetector wouldn't correctly denote start and end offsets (Thanks @Tshimanga from Deep6.ai)
Fixed a bug where DeepSentenceDetector would miss sentence parts when NER model missed header sentence (Thanks @Tshimanga from Deep6.ai)
Cleaned and optimized DeepSentenceDetector code (Thanks @danilojsl)
Fixed a missing dependency for OCR

Documentation and notebooks

Added support and instructions for Anaconda deployment (Thanks @maziyarpanahi)
Updated various python notebooks to show utilization of spark packages instead of jars
Added a new conference talk with Spark NLP in French at XebiCon'18
Updated documentation towards less use of jars in favor of dependency solving

Contributors

maziyarpanahi, haimco10, and 2 other contributors

08 Feb 04:15

saif-ellafi

1.8.2

95134da

John Snow Labs Spark-NLP 1.8.2: OCR Autorotation, Embeddings bugfixes, new utility annotators and languages

Overview

This release aims to improve performance and resource usage in some pipelines that use word embeddings. It also comes
together with a very interesting autorotation feature in OCR, and a couple of additions to solve particular needs, including the ChunkTokenizer annotator
and a Param to limit sentence lengths. Finally, we are starting to organize our multilingual store of models and data for training models.
Check the examples for some Italian notebooks! Thanks again to the whole community for such quick feedback all the time.

New Features

OCR now capable of automatic rotation, significantly improving accuracy in some scenarios
ChunkTokenizer is a new annotator that Tokenizes CHUNK type annotations. Extends Tokenizer algorithm and stores chunk ID for reference.
SentenceDetector new Param maxLength now cuts off sentences longer than (by default) 240 characters. It avoids Deep Learning annotator issues and may improve performance in some scenarios.
NerConverter new Param whiteList now allows a list of NER labels to be considered, while discarding the rest. May be useful for selective CHUNKing pipelines.
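
The whiteList behaviour described for NerConverter can be illustrated with a small stand-alone sketch. It is not the annotator's code: the function `iob_to_chunks` and its argument names are hypothetical, and the sketch assumes tags follow the common IOB scheme (`B-`, `I-`, `O`).

```python
def iob_to_chunks(tokens, tags, white_list=None):
    """Group IOB-tagged tokens into (text, label) chunks.

    When white_list is given, only chunks whose label is in it are kept.
    """
    spans = []  # each span is [label, [words...]]
    inside = False
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "I" and inside and spans[-1][0] == label:
            spans[-1][1].append(token)      # continue the current entity
        elif prefix in ("B", "I"):
            spans.append([label, [token]])  # start a new entity
            inside = True
        else:
            inside = False                  # "O" closes any open entity
    return [
        (" ".join(words), label)
        for label, words in spans
        if white_list is None or label in white_list
    ]


tokens = ["John", "Snow", "works", "at", "Google", "in", "London"]
tags = ["B-PER", "I-PER", "O", "O", "B-ORG", "O", "B-LOC"]

print(iob_to_chunks(tokens, tags))
print(iob_to_chunks(tokens, tags, white_list=["PER", "LOC"]))
```

With the whitelist set to `PER` and `LOC`, the `ORG` chunk is discarded while the other two are kept, which is the selective chunking the note refers to.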

Enhancements

Pipelines using Word Embeddings should now perform faster due to a group of RocksDB optimizations allowing annotators to reuse current open connections to DB

Bugfixes

Fixed a bug where DeepSentenceDetector was missing the load() interface (Thanks @Tshimanga from Deep6!)
Fixed a bug where RocksDB opened too many files at once causing pipelines to fail or to work very slowly
Fixed NerCrfModel when prefetching RocksDB causing slower performance

Framework

Added missing artifact resolution dependencies for OCR Module
Started adding and organizing multilanguage models (Thanks @maziyarpanahi)
Updated RocksDB to 5.17.2

Contributors

maziyarpanahi and Tshimanga

26 Jan 00:20

saif-ellafi

1.8.1

acd4c09

John Snow Labs Spark-NLP 1.8.1: ML SentenceDetector, improved ContextSpellChecker and bugfixes

Overview

This hotfix version of Spark-NLP improves framework support by adding Maven coordinates for OCR and allowing S3 retrieval of files.
We also included code for generating Graphs for NerDL and for creating your own metadata files for a private model downloader.
As a new feature, we are including an experimental machine learning based sentence detector, which uses NER for bounds detection.
Aside from this, we are including a few bug fixes and OCR improvements. Enjoy, and thanks again for the community contributions!

New Features

New DeepSentenceDetector annotator takes Spark-NLP's NER Deep Learning models as a base to improve sentence detection

Enhancements

Improved accuracy of ContextSpellChecker by enabling re-ranking of candidate words according to a weighted Levenshtein distance
OCR process now defaults to splitting content into rows when paragraphs or pages are identified, for improved parallelism. May be turned off
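
Re-ranking candidates by a weighted Levenshtein distance can be sketched as below. This is a generic illustration, not ContextSpellChecker's implementation: the weights in `CONFUSABLE`, the unit cost for insertions and deletions, and the function names are assumptions.

```python
def weighted_levenshtein(source, target, substitution_cost):
    """Edit distance with unit insert/delete cost and a custom substitution cost."""
    previous = [float(j) for j in range(len(target) + 1)]
    for i, s_char in enumerate(source, start=1):
        current = [float(i)]
        for j, t_char in enumerate(target, start=1):
            if s_char == t_char:
                substitute = previous[j - 1]
            else:
                substitute = previous[j - 1] + substitution_cost(s_char, t_char)
            current.append(min(previous[j] + 1.0, current[j - 1] + 1.0, substitute))
        previous = current
    return previous[-1]


# Hypothetical weights: easily confused character pairs are cheaper to swap.
CONFUSABLE = {frozenset("il"), frozenset("o0"), frozenset("mn")}


def substitution_cost(a, b):
    return 0.5 if frozenset((a, b)) in CONFUSABLE else 1.0


def rerank(word, candidates):
    """Order candidates by weighted distance to word, ties broken alphabetically."""
    return sorted(
        candidates,
        key=lambda c: (weighted_levenshtein(word, c, substitution_cost), c),
    )


print(rerank("cam", ["cat", "can"]))
```

With plain Levenshtein distance both candidates are one edit away from "cam"; the cheaper `m`/`n` substitution is what moves "can" ahead of "cat".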

Examples and use cases

Added Scala examples for Sentiment analysis and Lemmatizer in Italian (Thanks Vincenzo Gaudenzi from DXC.technology for dataset and model contribution!!!)

Bugfixes

Fixed a bug in Norvig and Symmetric SpellCheckers where the pattern parameter was not provided properly on the Scala side (Thanks @johnmccain for reporting!)

Framework

Added hadoop-aws dependency for remote download capabilities (e.g. word embeddings sets)

Other

Metadata files for pretrained model downloads code is now included. This may be useful if anyone wants to set up their own private local model downloader service
NerDL Graphs generation code is now included in the library. This allows the usage of custom word embedding dimensions and feature counts.

Special mentions

Vincenzo Gaudenzi (DXC.technology) for contributing Italian datasets and models. @maziyarpanahi for creating examples with them.
@correlator from Deep6.ai for contributing feedback on Slack and feature feedback in general
@johnmccain for reporting bugs in spell checker
@rohit-nlp for delivering maven coordinates for OCR
@haimco10 for contributing a sentence detector improvement for the apostrophe use case. Not merged due to specific issues involved.

Contributors

correlator, maziyarpanahi, and 3 other contributors
