Commit 9656fc0
- Updated assertion notebook
- Updated content for release 1.3.0
saif-ellafi committed Jan 27, 2018
1 parent 7a396b2 commit 9656fc0
Showing 4 changed files with 106 additions and 18 deletions.
70 changes: 70 additions & 0 deletions CHANGELOG
@@ -1,3 +1,73 @@
========
1.3.0
========
IMPORTANT: Pipelines saved with 1.2.6 or older cannot be loaded in 1.3.0
---------------
New features
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/94
The Tokenizer annotator has been revamped. It now follows standard NLP rules, matching over 90% of StanfordNLP tokens.
The annotator now supports more complex rules, allowing custom composite words to be set as exceptions (e.g., so that "New York" is not broken apart),
as well as custom prefix, infix, suffix and breaking rules. It uses regular expression groups to match various tokens per target word.
Defaults have been updated to be language agnostic and to support foreign characters from the Unicode charset (first sketch after this list).
* https://github.com/JohnSnowLabs/spark-nlp/pull/93
Assertion Status. This annotator identifies negated sequences within a target scope. Assertion status is a machine learning
annotator and works with a set of word embeddings; one such set is provided as part of our Python notebook examples (second sketch after this list).
* https://github.com/JohnSnowLabs/spark-nlp/pull/90
Recursive Pipelines. We have created our own Pipeline class that takes better advantage of Spark-NLP annotators.
Although this Pipeline is completely optional and works well alongside default Apache Spark estimators and transformers, it makes
training our annotators more efficient by giving annotator approaches access to the previous state of the Pipeline,
which they can use to tokenize or transform their own external content. Using such Pipelines is recommended (third sketch after this list).
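
Below are three minimal Scala sketches for the features above. First, the revamped Tokenizer; the setter names (setCompositeTokens, setPrefixPattern, setSuffixPattern) mirror the parameters described in the PR but should be read as illustrative rather than a verified 1.3.0 API:

```scala
import com.johnsnowlabs.nlp.annotators.Tokenizer

// Illustrative sketch: setter names follow the PR description, not a verified API.
val tokenizer = new Tokenizer()
  .setInputCols("sentence")
  .setOutputCol("token")
  .setCompositeTokens(Array("New York"))      // composite words kept as single tokens
  .setPrefixPattern("\\A([^\\s\\p{L}\\d]*)")  // regex group matched before the target word
  .setSuffixPattern("([^\\s\\p{L}\\d]*)\\z")  // regex group matched after it
```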
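Second, the assertion status annotator; AssertionLogRegApproach and the three-argument setEmbeddingsSource call follow the accompanying notebook example, with the embeddings file from that same notebook:

```scala
import com.johnsnowlabs.nlp.annotators.assertion.logreg.AssertionLogRegApproach

// Sketch based on the notebook example: detects negated sequences in the target scope.
val assertionStatus = new AssertionLogRegApproach()
  .setLabelCol("label")
  .setInputCols("document")
  .setOutputCol("assertion")
  .setEmbeddingsSource("PubMed-shuffle-win-2.bin", 200, 3) // 200 dimensions; 3 = binary format
```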
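Third, a RecursivePipeline is declared exactly like a regular Spark ML Pipeline (the trainingData DataFrame is assumed to exist with a "text" column):

```scala
import com.johnsnowlabs.nlp.{DocumentAssembler, RecursivePipeline}
import com.johnsnowlabs.nlp.annotators.Tokenizer

// Drop-in replacement for org.apache.spark.ml.Pipeline: annotators fitted inside it
// may access the already-fitted earlier stages to process their own external content.
val pipeline = new RecursivePipeline()
  .setStages(Array(
    new DocumentAssembler().setInputCol("text").setOutputCol("document"),
    new Tokenizer().setInputCols("document").setOutputCol("token")
  ))

val model = pipeline.fit(trainingData) // trainingData: DataFrame with a "text" column
```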

----------------
Enhancements
----------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/83
Part-of-speech training has been improved in both performance and quality, and it now makes better use of the input corpus provided.
New parameters, corpusFormat and corpusLimit, give more control over training: they select whether training data is read as a Dataset
or as raw text files, and limit the number of files read when a folder is provided (first sketch after this list).
* https://github.com/JohnSnowLabs/spark-nlp/pull/84
Thanks to @lambdaofgod, the Normalizer can now optionally lowercase tokens (second sketch after this list).
* Thanks to Lorenz Bernauer, the Normalizer default pattern is now language agnostic and no longer breaks Unicode characters such as Spanish or German letters.
* Features now have appropriate default values, which are lazy by nature and computed only once upon request. As a side effect, this improves Lemmatizer performance.
* RuleFactory (a regex rule factory) performance has been improved: it now uses a factory pattern and no longer re-checks its strategy on every transformation at runtime.
This may have positive side effects in SentenceDetector, DateMatcher and RegexMatcher, which use this class extensively.
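
Two Scala sketches for the enhancements above. First, part-of-speech training with the new corpus controls; PerceptronApproach is the POS trainer, but the exact setter names for corpusFormat and corpusLimit are assumptions derived from the parameter names:

```scala
import com.johnsnowlabs.nlp.annotators.pos.perceptron.PerceptronApproach

// Assumed setters mirroring the corpusFormat / corpusLimit params described above.
val posTagger = new PerceptronApproach()
  .setInputCols("sentence", "token")
  .setOutputCol("pos")
  .setCorpusPath("/path/to/pos/corpus") // hypothetical path to training files
  .setCorpusFormat("TXT")               // read raw text files instead of a Dataset
  .setCorpusLimit(50)                   // cap the number of files read from a folder
```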
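Second, the Normalizer enhancements; setLowercase reflects the optional lowercasing contributed in PR 84, and the pattern shown is only an example of a Unicode-friendly default:

```scala
import com.johnsnowlabs.nlp.annotators.Normalizer

val normalizer = new Normalizer()
  .setInputCols("token")
  .setOutputCol("normalized")
  .setLowercase(true)     // optional lowercasing (PR 84)
  .setPattern("[^\\pL]")  // example pattern: strips non-letters without breaking ñ, ü, ß
```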

----------------
Class Renames
----------------
RegexTokenizer -> Tokenizer (it is not just regex anymore)
SentenceDetectorModel -> SentenceDetector (it is not a model, it is a rule-based algorithm)
SentimentDetectorModel -> SentimentDetector (it is not a model, it is a rule-based algorithm)

----------------
User Utilities
----------------
* ResourceHelper has a createDatasetFromText function which lets the user easily read one
or multiple text files from a path into a dataset, with various options including
filename-by-row or by-file aggregation. This class helps with parsing local files and
should be more widely used; it will be better documented (first sketch after this list).
* com.johnsnowlabs.util now contains a Benchmark class which makes it easy to measure the
running time of any function, used as Benchmark.time("Description of measured") {someFunction()} (expanded in the second sketch after this list)
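
A sketch of createDatasetFromText; the class path is real, while the parameter names below are assumptions standing in for the per-row / per-file filename options described above:

```scala
import com.johnsnowlabs.nlp.util.io.ResourceHelper

// Illustrative call: reads one file or a folder of text files into a Dataset,
// optionally tagging rows with their source filename.
val corpus = ResourceHelper.createDatasetFromText(
  "/path/to/text/files",  // hypothetical path: a single file or a folder
  includeFilename = true, // assumed flag: attach the source filename to each row
  aggregateByFile = false // assumed flag: one row per line rather than one per file
)
```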
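And the Benchmark helper, expanded from the usage shown above (pipeline and data are assumed to exist):

```scala
import com.johnsnowlabs.util.Benchmark

// Prints the description together with the elapsed time of the enclosed block.
Benchmark.time("Fitting the pipeline") {
  pipeline.fit(data)
}
```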

----------------
Developer API
----------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/89/files
Word embedding traits have been generalized. Any annotator that needs them can now easily access their properties.
* Recursive pipelines now allow injecting a PipelineModel object into the train() stage as an optional parameter. If the user
utilizes a RecursivePipeline, the annotator may use that pipeline to transform secondary data inputs (see the sketch after this list).
* The Annotator abstract class has been split: a new RawAnnotator class contains all annotator properties
and validations but does not use the annotate() function. This enables annotators that need to work directly with
the transform() call while still participating alongside other annotators in the pipeline.
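
A schematic of the recursive train() hook described above; the Option[PipelineModel] parameter follows the changelog, while the surrounding annotator and the loadDictionary() / trainFrom() helpers are hypothetical:

```scala
import org.apache.spark.ml.PipelineModel
import org.apache.spark.sql.Dataset

// Schematic: an AnnotatorApproach-style train() that reuses the pipeline state
// fitted so far (e.g. its tokenizer) to prepare secondary input data.
override def train(dataset: Dataset[_], recursivePipeline: Option[PipelineModel]): MyModel = {
  recursivePipeline match {
    case Some(fittedSoFar) =>
      val tokenizedDictionary = fittedSoFar.transform(loadDictionary()) // hypothetical helper
      trainFrom(tokenizedDictionary)                                    // hypothetical helper
    case None =>
      trainFrom(dataset.toDF)
  }
}
```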

----------------
Bugfixes
----------------
* Fixed a bug where annotators with word embeddings were not correctly serialized to disk
* Fixed a bug that created temporary folders in the home folder
* Fixed a broken geospatial pattern in sentence detection

========
1.2.6
========
14 changes: 7 additions & 7 deletions README.md
@@ -10,18 +10,18 @@ Take a look at our official spark-nlp page: http://nlp.johnsnowlabs.com/ for use

This library has been uploaded to the spark-packages repository https://spark-packages.org/package/JohnSnowLabs/spark-nlp .

- To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.2.6` to your spark command
+ To use the most recent version just add the `--packages JohnSnowLabs:spark-nlp:1.3.0` to your spark command

```sh
- spark-shell --packages JohnSnowLabs:spark-nlp:1.2.6
+ spark-shell --packages JohnSnowLabs:spark-nlp:1.3.0
```

```sh
- pyspark --packages JohnSnowLabs:spark-nlp:1.2.6
+ pyspark --packages JohnSnowLabs:spark-nlp:1.3.0
```

```sh
- spark-submit --packages JohnSnowLabs:spark-nlp:1.2.6
+ spark-submit --packages JohnSnowLabs:spark-nlp:1.3.0
```

If you want to use an older version, check the spark-packages website to see all the releases.
@@ -36,19 +36,19 @@ Our package is deployed to maven central. In order to add this package as a dependency
```xml
<dependency>
    <groupId>com.johnsnowlabs.nlp</groupId>
    <artifactId>spark-nlp_2.11</artifactId>
-   <version>1.2.6</version>
+   <version>1.3.0</version>
</dependency>
```

#### SBT
```sbtshell
- libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.2.6"
+ libraryDependencies += "com.johnsnowlabs.nlp" % "spark-nlp_2.11" % "1.3.0"
```

If you are using `scala 2.11`

```sbtshell
- libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.2.6"
+ libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "1.3.0"
```

## Using the jar manually
2 changes: 1 addition & 1 deletion build.sbt
@@ -7,7 +7,7 @@ name := "spark-nlp"

organization := "com.johnsnowlabs.nlp"

version := "1.2.6"
version := "1.3.0"

scalaVersion := scalaVer

38 changes: 28 additions & 10 deletions python/example/logreg-assertion/assertion.ipynb
@@ -18,6 +18,8 @@
"from sparknlp.common import *\n",
"from sparknlp.base import *\n",
"\n",
"from pathlib import Path\n",
"\n",
"if sys.version_info[0] < 3:\n",
" from urllib import urlretrieve\n",
"else:\n",
@@ -53,7 +55,9 @@
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import time\n",
@@ -62,7 +66,8 @@
"embeddingsFile = 'PubMed-shuffle-win-2.bin'\n",
"embeddingsUrl = 'https://s3.amazonaws.com/auxdata.johnsnowlabs.com/PubMed-shuffle-win-2.bin'\n",
"# this may take a couple minutes\n",
"urlretrieve(embeddingsUrl, embeddingsFile)\n",
"if not Path(embeddingsFile).is_file():\n",
" urlretrieve(embeddingsUrl, embeddingsFile)\n",
"\n",
"documentAssembler = DocumentAssembler()\\\n",
" .setInputCol(\"sentence\")\\\n",
@@ -109,19 +114,24 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": false
},
"outputs": [],
"source": [
"start = time.time()\n",
"print(\"Start fitting\")\n",
"model = pipeline.fit(data)\n",
"print(\"Fitting is ended\")"
"print(\"Fitting is ended\")\n",
"print (time.time() - start)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"result = model.transform(data)\n",
@@ -144,7 +154,6 @@
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": false
},
"outputs": [],
@@ -155,6 +164,15 @@
"sameModel = PipelineModel.read().load(\"./assertion_model\")"
]
},
+ {
+  "cell_type": "code",
+  "execution_count": null,
+  "metadata": {},
+  "outputs": [],
+  "source": [
+   "sameModel.transform(data).select(\"sentence\", \"target\", \"finished_assertion\").show()"
+  ]
+ },
{
"cell_type": "code",
"execution_count": null,
@@ -168,21 +186,21 @@
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 2",
"display_name": "Python 3",
"language": "python",
"name": "python2"
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
"pygments_lexer": "ipython3",
"version": "3.5.2"
}
},
"nbformat": 4,
