
Releases: JohnSnowLabs/spark-nlp

John Snow Labs Spark-NLP 1.5.3: Retroactive version matching, fixed Sentence Detector param and Symmetric pretrained

02 May 18:11

Overview

This quick release is a hotfix for issues found in 1.5.2 after its release. Thanks to the users who quickly tested it out.
It fixes the Symmetric spell checker not being able to read its pretrained model and a missing SentenceDetector default value, and adds retroactive version matching to the downloader.


Bug fixes

  • Fixed a bug causing the library to fail when trying to save or read an annotator with an unset Feature that has no default value
  • Added missing default Param value to SentenceDetector. Thanks @superman24-7
  • Symmetric spell checker now utilizes List instead of ListBuffer in its prediction layer
  • Fixed Vivekn Sentiment Analysis failing when training with a sentiment column

Models

  • Symmetric Spell Checker pretrained model now works correctly and can be downloaded
  • Vivekn Sentiment pretrained model now defaults to "token" input column instead of "spell"

Other

  • Downloader now works retroactively: a newer library version can find models published for a previous release
  • Renamed the downloader's folder argument to remote_loc for the remote location, since the old name caused confusion. Thanks @AtulSehgal
  • Added a new Scala example in the example folder, also available on the website

John Snow Labs Spark-NLP 1.5.2: Downloader uses distributed fs, new spell checker and better assertion status

30 Apr 17:33

Overview

This release focuses on improving model downloader stability, fixing word embedding reading issues, and properly joining the Spark ecosystem's filesystem configuration by using Spark's defined default filesystem, so the library works correctly in cluster and multi-node environments. This includes Databricks cloud clusters and Amazon EMR YARN HDFS nodes.

Aside from that, we bring exciting new features, including a brand new Spell Checker with higher accuracy, inspired by the Symmetric Delete algorithm.

Finally, Assertion Status can now be trained and predicted on top of NER output; previously it only worked by providing Start and End boundaries for the target to assert.


New Features

  • Assertion status annotators can now be trained and predict against NER output instead of start and end boundaries, so entities can be asserted directly
  • Brand new Symmetric Delete annotator (SymmetricDeleteApproach) with close to state-of-the-art accuracy (around 80%)

Enhancements

  • Model downloader now uses the proper Spark filesystem and works seamlessly with distributed storage, Databricks cloud clusters and Amazon EMR
  • Fixed several race conditions when loading word embeddings from disk or downloading resources, making the library more stable
  • Improved several assertion status validations and error messages

Bug fixes

  • Standalone Annotator models are now properly read from disk in Python

Models

  • New Symmetric Delete Spell checker pretrained model
  • Vivekn Sentiment annotator may now be downloaded standalone with pretrained(), as sketched below
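
A minimal Scala sketch of what the standalone download might look like (the ViveknSentimentModel class name, package path and input columns are assumptions; check the documentation for the exact identifiers):

    // Hypothetical sketch: download and configure the standalone pretrained Vivekn model.
    import com.johnsnowlabs.nlp.annotators.sda.vivekn.ViveknSentimentModel

    val vivekn = ViveknSentimentModel.pretrained()
      .setInputCols(Array("document", "token")) // assumed input annotation columns
      .setOutputCol("sentiment")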

John Snow Labs Spark-NLP 1.5.1: Better pretrained models, downloader improvements

16 Apr 20:48

Overview

This is an enhancement release over 1.5.0, including improved downloader properties and better annotator defaults.
Assertion status models are also now included as pretrained models, trained on top of Stanford's GloVe word embeddings.


Enhancements

  • SentenceDetector now has a useCustomOnly param which enforces using only the custom bounds provided (thanks @atomobianco); see the sketch after this list
  • Normalizer now defaults to not lowercasing words, which leads to better implicit accuracy in pipelines (thanks @marek.modry)
  • SpellChecker now defaults to being case sensitive, which leads to better accuracy
  • DateMatcher speed performance improved
  • com.johnsnowlabs.annotator._ in Scala now also includes RecursivePipelines and LightPipelines for easier imports
  • ModelDownloader has been improved with better directory management
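
A minimal Scala sketch of the new param (setUseCustomOnly is assumed to follow the usual setX naming for the useCustomOnly param; setCustomBounds predates this release):

    // Sketch: restrict sentence splitting to the custom bounds only.
    import com.johnsnowlabs.nlp.annotators.sbd.pragmatic.SentenceDetector

    val sentenceDetector = new SentenceDetector()
      .setInputCols(Array("document"))
      .setOutputCol("sentence")
      .setCustomBounds(Array("\n\n")) // split on double newlines...
      .setUseCustomOnly(true)         // ...and skip the default pragmatic rules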

Models

  • New Assertion Status (LogisticRegression and DeepLearning) pretrained models are now available
  • Vivekn, Basic and Advanced pretrained Pipelines have improved accuracy (thanks @marek.modry)

Other

  • S3 library dependencies updated

John Snow Labs Spark-NLP 1.5.0: Deep Learning, Light Pipelines and pretrained models

30 Mar 07:57

Overview

We are proud to announce what may be the biggest release of Spark-NLP yet in terms of content!
This release makes the library far easier to use for newcomers, with easier-to-import
annotators and extended use of the model downloader through pretrained models and pipelines.
It also includes two new annotators that use deep learning algorithms with TensorFlow graphs,
a first for the library.
Apart from this, we include new Light Pipelines that are more than 10x faster when working with
datasets smaller than about 50,000 rows.
Finally, we included several bugfixes across the library, from algorithms to the developer API.
We'll gladly welcome any feedback! The website has been extensively updated.


New features

  • Light Pipelines are Annotator Pipelines created from Spark ML pipelines that run more than 10x faster on small datasets; see the sketch after this list
  • Deep Learning NER based on Bi-LSTM and Convolutional Neural Networks from word embeddings datasets
  • Deep Learning Assertion Status model based on LSTM to compute status identification from word embeddings
  • Easier to use Spark-NLP:
  1. Imports have been made easy in the Scala API (com.johnsnowlabs.annotator._) to bring in all annotators
  2. BasicPipeline and AdvancedPipeline downloadable pipelines created for quick annotation of text
  3. Light Pipelines are easy to use and accept simple strings, annotating through a Spark ML Pipeline without Spark datasets
  • New Downloadable models: CRF NER, Lemmatizer, POS and Spell checker
  • New Downloadable pipelines: Vivekn Sentiment analysis, BasicPipeline and AdvancedPipeline
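
To illustrate the Light Pipelines feature, here is a minimal Scala sketch, assuming a SparkSession named spark is in scope (the two stages are placeholders for any annotator pipeline):

    // Sketch: fit a tiny pipeline, then wrap it in a LightPipeline
    // to annotate plain strings without creating a Spark dataset.
    import com.johnsnowlabs.nlp.{DocumentAssembler, LightPipeline}
    import com.johnsnowlabs.nlp.annotators.Tokenizer
    import org.apache.spark.ml.Pipeline
    import spark.implicits._

    val documentAssembler = new DocumentAssembler()
      .setInputCol("text")
      .setOutputCol("document")

    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")

    val pipeline = new Pipeline().setStages(Array(documentAssembler, tokenizer))

    // These stages learn nothing, so fitting on an empty dataset is enough
    val emptyData = Seq.empty[String].toDF("text")
    val lightPipeline = new LightPipeline(pipeline.fit(emptyData))

    // Returns a Map from output column name to the produced annotations
    val annotations = lightPipeline.annotate("Spark-NLP makes pipelines easy to use")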

Enhancements

  • Model downloader significantly improved in terms of usability

Documentation

  • Website extensively improved
  • Added an invite to our first Slack chat channel

Bugfixes

  • Fixed positional index wrong value when creating Annotations from constructor
  • Fixed hamming distance calculation in spell checker
  • Fixed Downloadable NER model failing sporadically due to missing temporary files
  • Fixed the SearchTrie algorithm used in TextMatcher (formerly EntityExtractor); thanks @avenka11 for reporting and proposing a solution
  • Fixed some model deserialization issues happening on Windows

Other

  • Thanks to @showy, we now have TravisCI automatic integration testing
  • Finisher now outputs to an array by default
  • Training example resources removed in favor of using the model downloader

John Snow Labs Spark-NLP 1.4.2: Fixed protocol reading, improved Windows support and more bug fixes

12 Mar 06:23

Overview

This release does not include any new improvements or features, but is instead focused on fixing bugs and consolidating the 1.4.0 release. Among the bug fixes, we improved Windows support across the library by fixing a few End of Line character issues. We also fixed an issue affecting word embeddings and some annotators, which prevented reading from external sources located in different storage types, such as S3 or HDFS. Finally, this release reorganizes Model Downloader content and functions in order to have a more consistent API.


Bugfixes

  • Filesystem protocols are now properly read across the library, fixing the use case for the s3:// protocol (thanks @avenka11)
  • Library now works properly in Windows environments
  • PySpark annotator param getters now work properly when retrieving default values
  • Fixed stemmer serialization failing due to a misspelled param name
  • Fixed the Tokenizer param name infixPattern to infixPatterns, which had broken PySpark serialization of that param
  • Added the missing addInfixPattern() function to PySpark, to allow adding patterns to the current value
  • Model Downloader clearCache now properly removes both .zip files and extracted content
  • Model Downloader is now capable of reading all types of models properly
  • Added the missing clearCache function to PySpark

Developer API

  • Function names in the model downloader code have been refactored for consistency

Other

  • RocksDB rolled back to a previous version to support Windows
  • NerCRF unit test modified to reduce testing time
  • Removed training scripts from the repository
  • Updated build Spark and Scala versions

John Snow Labs Spark-NLP 1.4.1: Model Downloader and easier to use External Resource API

25 Feb 23:51

Overview

Here we present an exciting release: for the first time, the library includes the base code for a model and pipeline downloader. We will use it ourselves to provide quality pre-trained models and pipelines that allow users to quickly predict or tag a dataset with NLP annotators out of the box, according to what the pipeline or model was trained for.

The next important enhancement is how we deal with external sources for training annotators. This was unified in 1.4.0 and is now further improved: it is easier to provide reading properties, such as how the source should be read (depending on the size of the target, reading line by line or as a Spark dataset has a significant impact on performance), and protocol prefixes such as hdfs:// or file:// for local files are supported, following Spark's native HadoopConfiguration settings.

The rest of the release improves and fixes issues in the new 1.4.0 Tokenizer, along with a few critical bugs in CRF NER. Many users contributed by reporting these bugs, and we are thankful. There were also improvements to the PySpark API to make annotators easier to extend and maintain.

New features

  • Model and Pipeline Downloader
    We are glad to announce our first experimental model downloader, working in both Python and Scala.
    It allows downloading pre-trained models from our public storage. This release does not include any pre-trained models yet, just the logic to support them.

Enhancements

  • Improved the ExternalResource API (introduced in 1.4.0) to make it easier to provide external corpus and resource information
    to annotators, including readAs (which sets how you would like SparkNLP to read your source), delimiters and parse settings, among
    other options that may be passed directly to the Spark reader. All annotators using external sources now share this
    functionality; see the sketch after this list. WordEmbeddings are not yet supported in this format.
  • All Python annotators now properly have getter functions to retrieve param values
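
As an illustration, a hedged Scala sketch using the Lemmatizer (the ReadAs enum value, the option keys and the ExternalResource overload shown are assumptions for this version):

    // Sketch: describe an external source and how it should be read.
    import com.johnsnowlabs.nlp.annotators.Lemmatizer
    import com.johnsnowlabs.nlp.util.io.{ExternalResource, ReadAs}

    val lemmaSource = ExternalResource(
      "hdfs:///data/lemmas.txt",   // file://, hdfs:// or s3:// style paths
      ReadAs.LINE_BY_LINE,         // assumed enum value; or read as a Spark dataset
      Map("keyDelimiter" -> "->", "valueDelimiter" -> "\t")) // assumed option keys

    val lemmatizer = new Lemmatizer()
      .setInputCols(Array("token"))
      .setOutputCol("lemma")
      .setDictionary(lemmaSource) // assumed overload accepting an ExternalResource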

Bugfixes

  • Fixed some annotators in Python not being deserializable on their own outside a Pipeline
  • Fixed CRF NER not working when not using word embeddings (thanks @Crisliu for reporting)
  • Fixed Tokenizer not properly recognizing some stop words (thanks @easimadi)
  • Fixed Tokenizer not properly recognizing composite tokens when changing target pattern param (thanks @easimadi)
  • The readAs parameter is now properly read from a string in all ExternalResource setters

Developer API

  • Further PySpark API improvements within AnnotatorApproach, AnnotatorModel, and a new private internal _AnnotatorModel for representing fit() results
  • Automated getters have been written so that getter functions no longer need to be written manually in every annotator

Other

  • RocksDB dependency rolled back to 5.2.1 for better universal compatibility, particularly to support the Databricks platform
  • Tests jar is now available in maven central (Thanks @lorenz-nlp for the idea)

Documentation

  • Updated website components page to match 1.4.x
  • Replaced the notebooks site with a placeholder linking to the current Python notebooks, for lower maintenance

John Snow Labs Spark-NLP 1.4.0: Unified external resources, use Hadoop accordingly, improved resources performance

21 Feb 19:29

New features

  • All annotator external sources have been unified through an ExternalResource component.
    It represents external data information and deals with content in HDFS or locally, just as Spark deals with data.
    It also improves performance globally and allows customization
    of how these sources are read (e.g. as an RDD or as line-by-line sequences)
  • NorvigSweeting SpellChecker, ViveknSentiment and POS Perceptron can now train from the dataset passed to fit().
    For the Spell Checker, this is applied if the user did not supply a corpus, forcing fit() to learn from words in the data column.
    For ViveknSentiment and POS Perceptron, this strategy is applied if the sentimentCol and posCol params have been set, respectively; see the sketch after this list.
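
For example, a minimal Scala sketch of training ViveknSentiment from the fit() dataset (the label column name is a placeholder, and setSentimentCol is assumed to be the setter for the sentimentCol param):

    // Sketch: ViveknSentiment learns from per-row labels in the training dataset
    // instead of an external corpus when sentimentCol is set.
    import com.johnsnowlabs.nlp.annotators.sda.vivekn.ViveknSentimentApproach

    val vivekn = new ViveknSentimentApproach()
      .setInputCols(Array("document", "token"))
      .setOutputCol("sentiment")
      .setSentimentCol("sentiment_label") // placeholder label column

    val model = vivekn.fit(trainingData) // trainingData: a DataFrame with text and labels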

Enhancements

  • ResourceHelper now has an improved SourceStream class which allows more consistent HDFS/filesystem reading, by making
    greater use of the Hadoop APIs
  • application.conf is now a global setting and can be overridden at run-time through ConfigLoader.setConfigPath() (see the sketch after this list). It may also be accessed from PySpark
  • EntityMatcher now uses recursive Pipelines
  • Part-of-Speech tagging performance has been improved throughout the prediction algorithm
  • EntityMatcher may now use a RecursivePipeline in order to tokenize external data with the same pipeline provided by the user
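
A minimal Scala sketch of the run-time override (the com.johnsnowlabs.util.ConfigLoader path is an assumption for this version):

    // Sketch: point the library at a custom application.conf before using annotators.
    import com.johnsnowlabs.util.ConfigLoader

    ConfigLoader.setConfigPath("/path/to/custom/application.conf")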

Developer API

  • The PySpark API has been substantially improved to make it easier to extend JVM classes
  • PySpark API improved for extending annotator approaches and models appropriately

Bugfixes

  • Fixed a regression that caused NER not to read datasets properly from HDFS
  • Fixed EntityMatcher wrongly normalizing external content (thanks @sofianeh)

Documentation

  • Fixed obsolete params in the EntityMatcher documentation (thanks @sofianeh)
  • Fixed NER CRF documentation on the website

John Snow Labs Spark-NLP 1.3.0: Better tokenizer, assertion status annotator, and more

27 Jan 23:01


IMPORTANT: Pipelines from 1.2.6 or older cannot be loaded from 1.3.0

We are happy to announce a big release this time. 1.3.0 includes a brand new annotator for assertion status and an improved tokenizer, along with many enhancements that have positive side effects throughout the library.


New features

  • #94
    Tokenizer annotator has been revamped. It now follows standard NLP rules, matching above 90% of StanfordNLP tokens.
    The annotator now has more complex rules, allowing custom composite words to be set as exceptions (e.g. to not break New York)
    as well as custom Prefix, Infix, Suffix and Breaking rules. It uses regular expression groups in order to match various tokens per target word.
    Defaults have been updated to be language agnostic and to support foreign characters from the Unicode charset. See the sketch after this list.
  • #93
    Assertion Status. This annotator identifies negated sequences within a target scope. Assertion status is a machine learning
    annotator and works on top of a set of Word Embeddings, one of which is provided as part of our Python notebook examples.
  • #90
    Recursive Pipelines. We have created our own Pipeline class to take greater advantage of Spark-NLP annotators.
    Although this Pipeline is completely optional and works well with default Apache Spark estimators and transformers, it allows
    training our annotators more efficiently by giving annotator approaches access to the previous state of the Pipeline,
    which they can use to tokenize or transform their own external content. Using such Pipelines is recommended.
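
A hedged Scala sketch of the revamped Tokenizer (all setter names below are assumptions derived from the params described above; consult the documentation for the exact API):

    // Hypothetical sketch: composite-word exceptions plus custom prefix/suffix rules.
    import com.johnsnowlabs.nlp.annotators.Tokenizer

    val tokenizer = new Tokenizer()
      .setInputCols(Array("document"))
      .setOutputCol("token")
      .setCompositeTokens(Array("New York")) // assumed setter: keep composites unbroken
      .setPrefixPattern("\\A([^\\s\\w]*)")   // assumed setter: leading punctuation group
      .setSuffixPattern("([^\\s\\w]*)\\z")   // assumed setter: trailing punctuation group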

Enhancements

  • #83
    Part-of-Speech training has been improved in both performance and quality, and now makes better use of the input corpus provided.
    New params give more control over training: corpusFormat selects whether training data is read as a Dataset or as raw text files,
    and corpusLimit bounds the number of files read when a folder is provided
  • #84
    Thanks to @lambdaofgod, Normalizer can now optionally lowercase tokens
  • Thanks to Lorenz Bernauer, the Normalizer default pattern is now language agnostic, no longer breaking Unicode characters such as Spanish or German letters
  • Features now have appropriate default values which are lazy by nature and executed only once upon request. As a side effect, this improves Lemmatizer performance.
  • RuleFactory (a regex rule factory) performance has been improved by using a Factory pattern instead of re-checking its strategy on every transformation at run-time.
    This may have positive side effects in SentenceDetector, DateMatcher and RegexMatcher, which use this class extensively.

Class Renames

RegexTokenizer -> Tokenizer (it is not just regex anymore)
SentenceDetectorModel -> SentenceDetector (it is not a model, it is a rule-based algorithm)
SentimentDetectorModel -> SentimentDetector (it is not a model, it is a rule-based algorithm)


User Utilities

  • ResourceHelper has a createDatasetFromText function which allows the user to more
    easily read one or multiple text files from a path into a dataset, with various options,
    including filename-by-row or by-file aggregation. This class should be more widely
    used since it helps with parsing local files. It shall be better documented.
  • com.johnsnowlabs.util now contains a Benchmark class which allows measuring the time of
    any function easily, by using it as Benchmark.time("Description of measured") {someFunction()}; see the sketch below
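
For instance, a minimal Scala sketch (assuming Benchmark.time simply prints the elapsed time of the enclosed block):

    // Sketch: measure and print how long an arbitrary block takes.
    import com.johnsnowlabs.util.Benchmark

    Benchmark.time("tokenize a small string") {
      "a quick benchmark example".split(" ").length
    }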

Developer API

  • https://github.com/JohnSnowLabs/spark-nlp/pull/89/files
    Word embedding traits have been generalized. Now any annotator that wants to use them can easily access their properties
  • Recursive pipelines now allow injecting a PipelineModel object into the train() stage. It is an optional parameter. If the user
    utilizes a RecursivePipeline, the annotator may use this pipeline for transforming secondary data inputs.
  • The Annotator abstract class has been split, introducing a RawAnnotator class which contains all annotator properties
    and validations but does not make use of the annotate() function. This allows annotators that need to work directly with
    the transform() call while still participating alongside other annotators in a pipeline

Bugfixes

  • Fixed a bug where annotators with word embeddings were not correctly serialized to disk
  • Fixed a bug that created temporary folders in the home folder
  • Fixed a broken geospatial pattern in sentence detection

John Snow Labs Spark-NLP 1.2.6: Improved Serialization Performance

12 Jan 04:45

Enhancements

  • #82
    Vivekn Sentiment Analysis improved memory consumption and training performance.
    The pruneCorpus parameter is now adjustable and defaults to 1; higher values lead to better performance
    but are meant for larger corpora (see the sketch after this list). The tokenPattern params allow different tokenization regexes
    within the corpora provided to the Vivekn and Norvig models.
  • #81
    Serialization improvements. The new default format is RDD objects (the earlier parquet default was short-lived), which proved lighter on
    heap memory. Also added lazier default values for Feature containers. New application.conf performance tuning
    settings allow customizing whether Features are broadcast or not, and whether parquet or object serialization is used.
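
A minimal Scala sketch of the tunable param (setPruneCorpus is assumed to be the setter for pruneCorpus):

    // Sketch: prune infrequent tokens from the Vivekn training corpus.
    import com.johnsnowlabs.nlp.annotators.sda.vivekn.ViveknSentimentApproach

    val vivekn = new ViveknSentimentApproach()
      .setInputCols(Array("document", "token"))
      .setOutputCol("sentiment")
      .setPruneCorpus(2) // assumed setter; higher values suit larger corpora (default 1)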

John Snow Labs Spark-NLP 1.2.5

08 Jan 22:11

Note: Pipelines from 1.2.4 or older cannot be loaded from 1.2.5

New features

  • #70
    Word embeddings parameter for CRF NER annotator
  • #78
    Annotator Features replace Spark Params and are now serialized using Kryo and partitioned parquet files, which increases performance and reduces Driver memory consumption when saving and loading pipelines with large corpora. Such Features are now also broadcast for better performance in distributed environments. This enhancement is a breaking change: older pipelines cannot be loaded.

Bug fixes

  • cb9aa43
    Stemmer was not capable of being deserialized (now implements DefaultParamsReadable)
  • #75
    Sentence Boundary detector was not properly setting bounds

Documentation (thanks @maziyarpanahi)

  • #79
    Typo in code
  • #74
    Bad description