Commit 9656fc0
- Updated assertion notebook
- Updated content for release 1.3.0
saif-ellafi committed Jan 27, 2018
1 parent 7a396b2 commit 9656fc0
Showing 4 changed files with 106 additions and 18 deletions.
70 changes: 70 additions & 0 deletions CHANGELOG
@@ -1,3 +1,73 @@
========
1.3.0
========
IMPORTANT: Pipelines saved with 1.2.6 or older cannot be loaded in 1.3.0
---------------
New features
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/94
The Tokenizer annotator has been revamped. It now follows standard NLP rules, matching over 90% of StanfordNLP tokens.
The annotator now supports more complex rules, allowing custom composite words to be set as exceptions (e.g., so that "New York" is not broken apart),
as well as custom prefix, infix, suffix and breaking rules. It uses regular expression groups to match various tokens per target word.
Defaults have been updated to be language agnostic and to support foreign characters from the Unicode charset (first sketch after this list).
* https://github.com/JohnSnowLabs/spark-nlp/pull/93
Assertion Status. This annotator identifies negated sequences within a target scope. Assertion status is a machine learning
annotator and works with a set of word embeddings; one such set is provided as part of our Python notebook examples (second sketch after this list).
* https://github.com/JohnSnowLabs/spark-nlp/pull/90
Recursive Pipelines. We have created our own Pipeline class that takes better advantage of Spark-NLP annotators.
Although this Pipeline is completely optional and works well alongside default Apache Spark estimators and transformers, it makes
training our annotators more efficient by giving annotator approaches access to the previous state of the Pipeline,
which they can use to tokenize or transform their own external content. Using such Pipelines is recommended (third sketch after this list).
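
Below are three minimal Scala sketches for the features above. First, the revamped Tokenizer; the setter names (setCompositeTokens, setPrefixPattern, setSuffixPattern) mirror the parameters described in the PR but should be read as illustrative rather than a verified 1.3.0 API:

```scala
import com.johnsnowlabs.nlp.annotators.Tokenizer

// Illustrative sketch: setter names follow the PR description, not a verified API.
val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
  .setCompositeTokens(Array("New York"))      // composite words kept as single tokens
  .setPrefixPattern("\\A([^\\s\\p{L}\\d]*)")  // regex group matched before the target word
  .setSuffixPattern("([^\\s\\p{L}\\d]*)\\z")  // regex group matched after it
```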
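Second, the assertion status annotator; AssertionLogRegApproach and the three-argument setEmbeddingsSource call follow the accompanying notebook example, with the embeddings file from that same notebook:

```scala
import com.johnsnowlabs.nlp.annotators.assertion.logreg.AssertionLogRegApproach

// Sketch based on the notebook example: detects negated sequences in the target scope.
val assertionStatus = new AssertionLogRegApproach()
  .setLabelCol("label")
  .setInputCols("document")
  .setOutputCol("assertion")
  .setEmbeddingsSource("PubMed-shuffle-win-2.bin", 200, 3) // 200 dimensions; 3 = binary format
```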
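Third, a RecursivePipeline is declared exactly like a regular Spark ML Pipeline (the trainingData DataFrame is assumed to exist with a "text" column):

```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, RecursivePipeline}
import com.johnsnowlabs.nlp.annotators.Tokenizer

// Drop-in replacement for org.apache.spark.ml.Pipeline: annotators fitted inside it
// may access the already-fitted earlier stages to process their own external content.
val pipeline = new RecursivePipeline()
  .setStages(Array(
    new DocumentAssembler().setInputCol("text").setOutputCol("document"),
    new Tokenizer().setInputCols("document").setOutputCol("token")
  ))

val model = pipeline.fit(trainingData) // trainingData: DataFrame with a "text" column
```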

----------------
Enhancements
----------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/83
Part-of-speech training has been improved in both performance and quality, and it now makes better use of the input corpus provided.
New parameters, corpusFormat and corpusLimit, give more control over training: they select whether training data is read as a Dataset
or as raw text files, and limit the number of files read when a folder is provided (first sketch after this list).
* https://github.com/JohnSnowLabs/spark-nlp/pull/84
Thanks to @lambdaofgod, the Normalizer can now optionally lowercase tokens (second sketch after this list).
* Thanks to Lorenz Bernauer, the Normalizer default pattern is now language agnostic and no longer breaks Unicode characters such as Spanish or German letters.
* Features now have appropriate default values, which are lazy by nature and computed only once upon request. As a side effect, this improves Lemmatizer performance.
* RuleFactory (a regex rule factory) performance has been improved: it now uses a factory pattern and no longer re-checks its strategy on every transformation at runtime.
This may have positive side effects in SentenceDetector, DateMatcher and RegexMatcher, which use this class extensively.
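
Two Scala sketches for the enhancements above. First, part-of-speech training with the new corpus controls; PerceptronApproach is the POS trainer, but the exact setter names for corpusFormat and corpusLimit are assumptions derived from the parameter names:

```scala
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach

// Assumed setters mirroring the corpusFormat / corpusLimit params described above.
val posTagger = new PerceptronApproach()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")
  .setCorpusPath("/path/to/pos/corpus") // hypothetical path to training files
  .setCorpusFormat("TXT")               // read raw text files instead of a Dataset
  .setCorpusLimit(50)                   // cap the number of files read from a folder
```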
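Second, the Normalizer enhancements; setLowercase reflects the optional lowercasing contributed in PR 84, and the pattern shown is only an example of a Unicode-friendly default:

```scala
import com.johnsnowlabs.nlp.annotators.Normalizer

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")
  .setLowercase(true)     // optional lowercasing (PR 84)
  .setPattern("[^\\pL]")  // example pattern: strips non-letters without breaking ñ, ü, ß
```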

----------------
Class Renames
----------------
RegexTokenizer -> Tokenizer (it is not just regex anymore)
SentenceDetectorModel -> SentenceDetector (it is not a model, it is a rule-based algorithm)
SentimentDetectorModel -> SentimentDetector (it is not a model, it is a rule-based algorithm)

----------------
User Utilities
----------------
* ResourceHelper has a createDatasetFromText function which lets the user easily read one
or multiple text files from a path into a dataset, with various options including
filename-by-row or by-file aggregation. This class helps with parsing local files and
should be more widely used; it will be better documented (first sketch after this list).
* com.johnsnowlabs.util now contains a Benchmark class which makes it easy to measure the
running time of any function, used as Benchmark.time("Description of measured") {someFunction()} (expanded in the second sketch after this list)
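
A sketch of createDatasetFromText; the class path is real, while the parameter names below are assumptions standing in for the per-row / per-file filename options described above:

```scala
import com.johnsnowlabs.nlp.util.io.ResourceHelper

// Illustrative call: reads one file or a folder of text files into a Dataset,
// optionally tagging rows with their source filename.
val corpus = ResourceHelper.createDatasetFromText(
  "/path/to/text/files",  // hypothetical path: a single file or a folder
  includeFilename = true, // assumed flag: attach the source filename to each row
  aggregateByFile = false // assumed flag: one row per line rather than one per file
)
```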
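And the Benchmark helper, expanded from the usage shown above (pipeline and data are assumed to exist):

```scala
import com.johnsnowlabs.util.Benchmark

// Prints the description together with the elapsed time of the enclosed block.
Benchmark.time("Fitting the pipeline") {
  pipeline.fit(data)
}
```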

----------------
Developer API
----------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/89/files
Word embedding traits have been generalized. Any annotator that needs them can now easily access their properties.
* Recursive pipelines now allow injecting a PipelineModel object into the train() stage as an optional parameter. If the user
utilizes a RecursivePipeline, the annotator may use that pipeline to transform secondary data inputs (see the sketch after this list).
* The Annotator abstract class has been split: a new RawAnnotator class contains all annotator properties
and validations but does not use the annotate() function. This enables annotators that need to work directly with
the transform() call while still participating alongside other annotators in the pipeline.
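
A schematic of the recursive train() hook described above; the Option[PipelineModel] parameter follows the changelog, while the surrounding annotator and the loadDictionary() / trainFrom() helpers are hypothetical:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.Dataset

// Schematic: an AnnotatorApproach-style train() that reuses the pipeline state
// fitted so far (e.g. its tokenizer) to prepare secondary input data.
override def train(dataset: Dataset[_], recursivePipeline: Option[PipelineModel]): MyModel = {
  recursivePipeline match {
    case Some(fittedSoFar) =>
      val tokenizedDictionary = fittedSoFar.transform(loadDictionary()) // hypothetical helper
      trainFrom(tokenizedDictionary)                                    // hypothetical helper
    case None =>
      trainFrom(dataset.toDF)
  }
}
```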

----------------
Bugfixes
----------------
* Fixed a bug where annotators with word embeddings were not correctly serialized to disk
* Fixed a bug that created temporary folders in the home folder
* Fixed a broken geospatial pattern in sentence detection

========
1.2.6
========
14 changes: 7 additions & 7 deletions README.md
@@ -10,18 +10,18 @@ Take a look at our official spark-nlp page: http://nlp.johnsnowlabs.com/ for use

This library has been uploaded to the spark-packages repository https://spark-packages.org/package/JohnSnowLabs/spark-nlp .

- To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.2.6` to your spark command
+ To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.3.0` to your spark command

```sh
- spark-shell --packages JohnSnowLabs:spark-nlp:1.2.6
+ spark-shell --packages JohnSnowLabs:spark-nlp:1.3.0
```

```sh
- pyspark --packages JohnSnowLabs:spark-nlp:1.2.6
+ pyspark --packages JohnSnowLabs:spark-nlp:1.3.0
```

```sh
- spark-submit --packages JohnSnowLabs:spark-nlp:1.2.6
+ spark-submit --packages JohnSnowLabs:spark-nlp:1.3.0
```

If you want to use an older version, check the spark-packages website to see all the releases.
@@ -36,19 +36,19 @@ Our package is deployed to maven central. In order to add this package as a dependency
```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.11</artifactId>
-   <version>1.2.6</version>
+   <version>1.3.0</version>
</dependency>
```

#### SBT
```sbtshell
- libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.2.6"
+ libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.3.0"
```

If you are using `scala 2.11`

```sbtshell
- libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.2.6"
+ libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.3.0"
```

## Using the jar manually
2 changes: 1 addition & 1 deletion build.sbt
@@ -7,7 +7,7 @@ name := "spark-nlp"

organization := "com.johnsnowlabs.nlp"

version := "1.2.6"
version := "1.3.0"

scalaVersion := scalaVer

38 changes: 28 additions & 10 deletions python/example/logreg-assertion/assertion.ipynb
@@ -18,6 +18,8 @@
"from sparknlp.common import *\n",
"from sparknlp.base import *\n",
"\n",
"from pathlib import Path\n",
"\n",
"if sys.version_info[0] < 3:\n",
" from urllib import urlretrieve\n",
"else:\n",
@@ -53,7 +55,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import time\n",
@@ -62,7 +66,8 @@
"embeddingsFile = 'PubMed-shuffle-win-2.bin'\n",
"embeddingsUrl = 'https://s3.amazonaws.com/auxdata.johnsnowlabs.com/PubMed-shuffle-win-2.bin'\n",
"# this may take a couple minutes\n",
"urlretrieve(embeddingsUrl, embeddingsFile)\n",
"if not Path(embeddingsFile).is_file():\n",
" urlretrieve(embeddingsUrl, embeddingsFile)\n",
"\n",
"documentAssembler = DocumentAssembler()\\\n",
" .setInputCol(\"sentence\")\\\n",
@@ -109,19 +114,24 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": false
},
"outputs": [],
"source": [
"start = time.time()\n",
"print(\"Start fitting\")\n",
"model = pipeline.fit(data)\n",
"print(\"Fitting is ended\")"
"print(\"Fitting is ended\")\n",
"print (time.time() - start)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"result = model.transform(data)\n",
@@ -144,7 +154,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": false
},
"outputs": [],
@@ -155,6 +164,15 @@
"sameModel = PipelineModel.read().load(\"./assertion_model\")"
]
},
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "metadata": {},
+  "outputs": [],
+  "source": [
+   "sameModel.transform(data).select(\"sentence\", \"target\", \"finished_assertion\").show()"
+  ]
+ },
{
"cell_type": "code",
"execution_count": null,
@@ -168,21 +186,21 @@
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python 3",
"language": "python",
"name": "python2"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
