
Release candidate 2.0.2 #494

Merged
merged 28 commits on Apr 29, 2019
Changes from all commits
28 commits
80326c1
release candidate build up
saif-ellafi Apr 25, 2019
73095eb
Open Slack
saif-ellafi Apr 26, 2019
f54ecf5
Preload embeddings function, sort embeddings unpack
saif-ellafi Apr 26, 2019
eab99a1
Version bump
saif-ellafi Apr 26, 2019
853efbc
Word embeddings push ref metadata
saif-ellafi Apr 26, 2019
026dcb7
Tokenizer revert
saif-ellafi Apr 27, 2019
75d398d
Fixed tokenizer error handling
saif-ellafi Apr 27, 2019
cfb4b54
Fixed load from python bert
saif-ellafi Apr 28, 2019
24a88ec
Load contrib libraries only if necessary
saif-ellafi Apr 28, 2019
8d7289a
load contrib to cluster only if requested
saif-ellafi Apr 28, 2019
9c720d1
Upgraded bert generation, vocab before model
saif-ellafi Apr 28, 2019
04a017d
Bump to newest spark
saif-ellafi Apr 28, 2019
46b27d0
pretrained names update
saif-ellafi Apr 28, 2019
7c0c7f1
fixed bundle scenario var paths
saif-ellafi Apr 28, 2019
9434880
Rollback to Spark 2.4.0 due to scala 2.11
saif-ellafi Apr 28, 2019
285a5f3
Fix misspelling
maziyarpanahi Apr 28, 2019
95946cd
Ignore DS_Store files
maziyarpanahi Apr 28, 2019
2399f7f
Rename ContextSpellCheckerModel pretrained model
maziyarpanahi Apr 28, 2019
a68acaa
Load on saving embeddings if not loaded
saif-ellafi Apr 29, 2019
1782463
Rolledback embeddings as approach
saif-ellafi Apr 29, 2019
c5e614f
Include embeddings param
saif-ellafi Apr 29, 2019
2c2cb92
berp
saif-ellafi Apr 29, 2019
cbfbb1d
Python tensorflow rearrangement
saif-ellafi Apr 29, 2019
c21d1d2
Optionally get sentence metadata
saif-ellafi Apr 29, 2019
9f2aaef
Slightly improved error message
saif-ellafi Apr 29, 2019
9745147
Load embeddings if not loaded already
saif-ellafi Apr 29, 2019
adb0df0
Merge branch 'master' into 202-release-candidate
saif-ellafi Apr 29, 2019
ec47017
Changelog
saif-ellafi Apr 29, 2019
1 change: 1 addition & 0 deletions .gitignore
@@ -314,3 +314,4 @@ test_crf_pipeline/
test_*_pipeline/
*metastore_db*
python/src/
.DS_Store
54 changes: 54 additions & 0 deletions CHANGELOG
@@ -1,3 +1,57 @@
========
2.0.2
========
---------------
Overview
---------------
Thank you for joining us in this exciting Spark NLP year! We continue to make progress towards a better-performing library, both in speed and in accuracy.
This release focuses strongly on the quality and stability of the library, making sure it works well in most cluster environments
and improving compatibility across systems. Word embeddings continue to be improved for better performance and a lower memory footprint.
Context Spell Checker continues to receive enhancements in concurrency and in its usage of Spark. Finally, TensorFlow-based annotators
have been significantly improved by refactoring the serialization design. Help us with feedback; we welcome any issue reports!

---------------
New Features
---------------
* NerCrf annotator now has an includeConfidence param that adds confidence scores for predictions to the metadata (see the sketch below)
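
A minimal Python sketch of how this might be used; the annotator and param names come from this changelog, while the input columns, label column, and the assumption that the setter is named `setIncludeConfidence` are illustrative:

```python
from sparknlp.annotator import NerCrfApproach

# Hypothetical configuration: the column names are assumptions;
# only includeConfidence is the new param from this release.
ner_tagger = NerCrfApproach() \
    .setInputCols(["sentence", "token", "pos"]) \
    .setOutputCol("ner") \
    .setLabelColumn("label") \
    .setIncludeConfidence(True)  # confidence scores land in each annotation's metadata
```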

---------------
Enhancements
---------------
* Cluster mode performance improved in TensorFlow annotators by serializing internal information to bytes
* Doc2Chunk annotator added new params startCol, startColByTokenIndex, failOnMissing and lowerCase, allowing better chunking of documents
* All annotations that derive from sentence or chunk types now contain metadata information referring to the sentence or chunk ID they belong to
* ContextSpellChecker now creates a window around the token to improve computation performance
* Improved WordEmbeddings matching accuracy by trying alternative case sensitive tokens
* WordEmbeddings won't load twice if already loaded
* WordEmbeddings can use embeddingsRef if a source was not provided, improving reuse of embeddings in a pipeline
* New WordEmbeddings param includeEmbeddings allows annotators to skip saving the entire embeddings source alongside them (see the sketch after this list)
* Contrib TensorFlow dependencies now load only if necessary
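
To make the WordEmbeddings items concrete, a hedged sketch follows; `setEmbeddingsRef` and `setIncludeEmbeddings` match the setters added in this PR's `python/sparknlp/common.py`, while everything else is an illustrative assumption:

```python
from sparknlp.annotator import NerCrfApproach

# Reuse embeddings already loaded under a reference name elsewhere in the
# pipeline, and avoid bundling the full embeddings source with the saved model.
ner = NerCrfApproach() \
    .setInputCols(["sentence", "token", "pos"]) \
    .setOutputCol("ner") \
    .setLabelColumn("label") \
    .setEmbeddingsRef("glove_100d") \
    .setIncludeEmbeddings(False)
```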

---------------
Bugfixes
---------------
* Added missing Symmetric delete pretrained model
* Fixed a broken param name in Normalizer (thanks @RobertSassen)
* Fixed Cloudera cluster support
* Fixed concurrent access in ContextSpellChecker in high partition number use cases and LightPipelines
* Fixed POS dataset creator to better handle corrupted pairs
* Fixed a bug in Word Embeddings not matching exact case sensitive tokens in some scenarios
* Fixed OCR Tess4J initialization problems in concurrent scenarios

---------------
Models and Pipelines
---------------
* Renaming of models and pipelines (work in progress)
* Better output column naming in pipelines

---------------
Developer API
---------------
* Further unified the WordEmbeddings interface with dimension params and individual setters
* Improved unit tests for better compatibility on Windows
* Python embeddings moved to sparknlp.embeddings

========
2.0.1
========
Expand Down
30 changes: 15 additions & 15 deletions README.md
@@ -43,14 +43,14 @@ Take a look at our official spark-nlp page: http://nlp.johnsnowlabs.com/ for use

## Apache Spark Support

Spark-NLP *2.0.1* has been built on top of Apache Spark 2.4.0
Spark-NLP *2.0.2* has been built on top of Apache Spark 2.4.0

Note that this build is not backward compatible with Spark 2.3.x, so models and environments might not work.

If you are still stuck on Spark 2.3.x, feel free to use [this assembly jar](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-2.3.2-nlp-assembly-1.8.0.jar) instead; support is limited (see the sketch below).
For the OCR module, [this jar](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-2.3.2-nlp-ocr-assembly-1.8.0.jar) is the Spark `2.3.x` build.
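
One hedged way to wire that assembly jar in (the local path is illustrative):

```sh
# Assumes the assembly jar was downloaded into the working directory.
spark-shell --jars ./spark-2.3.2-nlp-assembly-1.8.0.jar
```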

| Spark NLP | Spark 2.0.1 / Spark 2.3.x | Spark 2.4 |
| Spark NLP | Spark 2.0.2 / Spark 2.3.x | Spark 2.4 |
|-------------|-------------------------------------|--------------|
| 2.x.x |NO |YES |
| 1.8.x |Partially |YES |
@@ -68,18 +68,18 @@ This library has been uploaded to the [spark-packages repository](https://spark-

The benefit of spark-packages is that it makes the library available for both Scala/Java and Python

To use the most recent version, just add `--packages JohnSnowLabs:spark-nlp:2.0.1` to your spark command
To use the most recent version, just add `--packages JohnSnowLabs:spark-nlp:2.0.2` to your spark command

```sh
spark-shell --packages JohnSnowLabs:spark-nlp:2.0.1
spark-shell --packages JohnSnowLabs:spark-nlp:2.0.2
```

```sh
pyspark --packages JohnSnowLabs:spark-nlp:2.0.1
pyspark --packages JohnSnowLabs:spark-nlp:2.0.2
```

```sh
spark-submit --packages JohnSnowLabs:spark-nlp:2.0.1
spark-submit --packages JohnSnowLabs:spark-nlp:2.0.2
```

This can also be used to create a SparkSession manually by using the `spark.jars.packages` option in both Python and Scala
@@ -147,7 +147,7 @@ Our package is deployed to maven central. In order to add this package as a depe
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp_2.11</artifactId>
<version>2.0.1</version>
<version>2.0.2</version>
</dependency>
```

@@ -158,22 +158,22 @@ and
<dependency>
<groupId>com.johnsnowlabs.nlp</groupId>
<artifactId>spark-nlp-ocr_2.11</artifactId>
<version>2.0.1</version>
<version>2.0.2</version>
</dependency>
```

### SBT

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.0.1"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp" % "2.0.2"
```

and

```sbtshell
// https://mvnrepository.com/artifact/com.johnsnowlabs.nlp/spark-nlp-ocr
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-ocr" % "2.0.1"
libraryDependencies += "com.johnsnowlabs.nlp" %% "spark-nlp-ocr" % "2.0.2"
```

Maven Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https://mvnrepository.com/artifact/com.johnsnowlabs.nlp)
@@ -187,7 +187,7 @@ Maven Central: [https://mvnrepository.com/artifact/com.johnsnowlabs.nlp](https:/
If you installed pyspark through pip, you can install `spark-nlp` through pip as well.

```bash
pip install spark-nlp==2.0.1
pip install spark-nlp==2.0.2
```

PyPI [spark-nlp package](https://pypi.org/project/spark-nlp/)
@@ -210,7 +210,7 @@ spark = SparkSession.builder \
.master("local[4]")\
.config("spark.driver.memory","4G")\
.config("spark.driver.maxResultSize", "2G") \
.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.1")\
.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.2")\
.config("spark.kryoserializer.buffer.max", "500m")\
.getOrCreate()
```
@@ -224,7 +224,7 @@ Use either one of the following options
* Add the following Maven Coordinates to the interpreter's library list

```bash
com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.1
com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.2
```

* Add the path to the pre-built jar from [here](#pre-compiled-spark-nlp-and-spark-nlp-ocr) to the interpreter's library list, making sure the jar is available on the driver path
@@ -234,7 +234,7 @@ com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.1
Apart from the previous step, install the Python module through pip

```bash
pip install spark-nlp==2.0.1
pip install spark-nlp==2.0.2
```

Or you can install `spark-nlp` from inside Zeppelin by using Conda:
@@ -260,7 +260,7 @@ export PYSPARK_PYTHON=python3
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

pyspark --packages JohnSnowLabs:spark-nlp:2.0.1
pyspark --packages JohnSnowLabs:spark-nlp:2.0.2
```

Alternatively, you can mix in the `--jars` option for pyspark with `pip install spark-nlp`
4 changes: 2 additions & 2 deletions build.sbt
@@ -16,7 +16,7 @@ if(is_gpu.equals("false")){

organization:= "com.johnsnowlabs.nlp"

version := "2.0.1"
version := "2.0.2"

scalaVersion in ThisBuild := scalaVer

@@ -178,7 +178,7 @@ assemblyMergeStrategy in assembly := {
lazy val ocr = (project in file("ocr"))
.settings(
name := "spark-nlp-ocr",
version := "2.0.1",
version := "2.0.2",

test in assembly := {},

18 changes: 9 additions & 9 deletions docs/quickstart.html
@@ -112,14 +112,14 @@ <h2 class="section-title">Requirements & Setup</h2>
To start using the library, execute any of the following lines
depending on your desired use case:
</p>
<pre><code class="language-javascript">spark-shell --packages JohnSnowLabs:spark-nlp:2.0.1
pyspark --packages JohnSnowLabs:spark-nlp:2.0.1
spark-submit --packages JohnSnowLabs:spark-nlp:2.0.1
<pre><code class="language-javascript">spark-shell --packages JohnSnowLabs:spark-nlp:2.0.2
pyspark --packages JohnSnowLabs:spark-nlp:2.0.2
spark-submit --packages JohnSnowLabs:spark-nlp:2.0.2
</code></pre>
<p/>
<h3><b>Straightforward Python on Jupyter notebook</b></h3>
<p>Use pip to install (after you have pip-installed numpy and pyspark)</p>
<pre><code class="language-javascript">pip install spark-nlp==2.0.1
<pre><code class="language-javascript">pip install spark-nlp==2.0.2
jupyter notebook</code></pre>
<p>The easiest way to get started is to run the following code:</p>
<pre><code class="python">import sparknlp
@@ -131,21 +131,21 @@ <h3><b>Straightforward Python on Jupyter notebook</b></h3>
.appName('OCR Eval') \
.config("spark.driver.memory", "6g") \
.config("spark.executor.memory", "6g") \
.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.1") \
.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.2") \
.getOrCreate()</code></pre>
<h3><b>Databricks cloud cluster</b> & <b>Apache Zeppelin</b></h3>
<p>Add the following maven coordinates in the dependency configuration page: </p>
<pre><code class="language-javascript">com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.1</code></pre>
<pre><code class="language-javascript">com.johnsnowlabs.nlp:spark-nlp_2.11:2.0.2</code></pre>
<p>
For Python in <b>Apache Zeppelin</b> you may need to set up <i><b>SPARK_SUBMIT_OPTIONS</b></i> using the --packages instruction shown above, like this
</p>
<pre><code class="language-javascript">export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:2.0.1"</code></pre>
<pre><code class="language-javascript">export SPARK_SUBMIT_OPTIONS="--packages JohnSnowLabs:spark-nlp:2.0.2"</code></pre>
<h3><b>Python Jupyter Notebook with PySpark</b></h3>
<pre><code class="language-javascript">export SPARK_HOME=/path/to/your/spark/folder
export PYSPARK_DRIVER_PYTHON=jupyter
export PYSPARK_DRIVER_PYTHON_OPTS=notebook

pyspark --packages JohnSnowLabs:spark-nlp:2.0.1</code></pre>
pyspark --packages JohnSnowLabs:spark-nlp:2.0.2</code></pre>
<h3>S3 based standalone cluster (No Hadoop)</h3>
<p>
If your distributed storage is S3 and you don't have a standard Hadoop configuration (i.e. fs.defaultFS)
@@ -442,7 +442,7 @@ <h2 class="section-title">Utilizing Spark NLP OCR Module</h2>
<p>
Spark NLP OCR Module is not included within Spark NLP. It is not an annotator and not an extension to Spark ML.
You can include it with the following coordinates for Maven:
<pre><code class="java">com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.0.1</code></pre>
<pre><code class="java">com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.0.2</code></pre>
</p>
<h3 class="block-title">Creating Spark datasets from PDF (To be used with Spark NLP)</h3>
<p>
2 changes: 1 addition & 1 deletion project/assembly.sbt
@@ -1 +1 @@
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.5")
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")
2 changes: 1 addition & 1 deletion project/build.properties
@@ -1 +1 @@
sbt.version=0.13.16
sbt.version=0.13.18
4 changes: 2 additions & 2 deletions python/run-tests.py
@@ -19,7 +19,7 @@
unittest.TextTestRunner().run(PipelineTestSpec())
unittest.TextTestRunner().run(SpellCheckerTestSpec())
unittest.TextTestRunner().run(SymmetricDeleteTestSpec())
unittest.TextTestRunner().run(ContextSpellCheckerTestSpec())
# unittest.TextTestRunner().run(ContextSpellCheckerTestSpec())
unittest.TextTestRunner().run(ParamsGettersTestSpec())
unittest.TextTestRunner().run(DependencyParserTreeBankTestSpec())
unittest.TextTestRunner().run(DependencyParserConllUTestSpec())
@@ -31,4 +31,4 @@
unittest.TextTestRunner().run(UtilitiesTestSpec())
unittest.TextTestRunner().run(ConfigPathTestSpec())
unittest.TextTestRunner().run(SerializersTestSpec())
unittest.TextTestRunner().run(OcrTestSpec())
unittest.TextTestRunner().run(OcrTestSpec())
2 changes: 1 addition & 1 deletion python/setup.py
@@ -40,7 +40,7 @@
# For a discussion on single-sourcing the version across setup.py and the
# project code, see
# https://packaging.python.org/en/latest/single_source_version.html
version='2.0.1', # Required
version='2.0.2', # Required

# This is a one-line description or tagline of what your project does. This
# corresponds to the "Summary" metadata field:
4 changes: 2 additions & 2 deletions python/sparknlp/__init__.py
@@ -36,8 +36,8 @@ def start(include_ocr=False):
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")

if include_ocr:
builder.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.1,com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.0.1")
builder.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.2,com.johnsnowlabs.nlp:spark-nlp-ocr_2.11:2.0.2")
else:
builder.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.1") \
builder.config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.0.2") \

return builder.getOrCreate()
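
As a usage note, the `start()` helper above is the one-line entry point; this sketch relies only on the signature visible in this diff:

```python
import sparknlp

# Creates (or reuses) a SparkSession with spark-nlp 2.0.2 on the classpath.
spark = sparknlp.start()

# Same, but also pulling in the OCR module.
spark_with_ocr = sparknlp.start(include_ocr=True)
```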
21 changes: 18 additions & 3 deletions python/sparknlp/annotator.py
@@ -130,13 +130,28 @@ def getIncludeDefaults(self):
return self.getOrDefault("includeDefaults")

def getInfixPatterns(self):
return self.getOrDefault("infixPatterns")
try:
if self.getOrDefault("includeDefaults"):
return self.getOrDefault("infixPatterns") + self.getDefaultPatterns()
else:
return self.getOrDefault("infixPatterns")
except KeyError:
if self.getOrDefault("includeDefaults"):
return self.getDefaultPatterns()
else:
return self.getOrDefault("infixPatterns")

def getSuffixPattern(self):
return self.getOrDefault("suffixPattern")
try:
return self.getOrDefault("suffixPattern")
except KeyError:
return self.getDefaultSuffix()

def getPrefixPattern(self):
return self.getOrDefault("prefixPattern")
try:
return self.getOrDefault("prefixPattern")
except KeyError:
return self.getDefaultPrefix()

def getDefaultPatterns(self):
return Tokenizer.infixDefaults
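
A hedged sketch of what the new fallback logic above means in practice; it assumes `includeDefaults` defaults to true and that a `setInfixPatterns` setter exists, neither of which is shown in this diff:

```python
import sparknlp
from sparknlp.annotator import Tokenizer

sparknlp.start()  # annotators need a live Spark session behind them
tokenizer = Tokenizer()

# With no custom patterns set, the getter now falls back to the built-in
# defaults instead of raising KeyError.
print(tokenizer.getInfixPatterns())  # -> Tokenizer.infixDefaults

# With custom patterns set and includeDefaults enabled, the custom patterns
# are returned together with the defaults.
tokenizer.setInfixPatterns(["([\\w]+)('s)"])
print(tokenizer.getInfixPatterns())  # custom patterns + defaults
```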
11 changes: 11 additions & 0 deletions python/sparknlp/common.py
@@ -102,9 +102,20 @@ class HasWordEmbeddings(HasEmbeddings):
"if sourceEmbeddingsPath was provided, name them with this ref. Otherwise, use embeddings by this ref",
typeConverter=TypeConverters.toString)

includeEmbeddings = Param(Params._dummy(),
"includeEmbeddings",
"whether or not to save indexed embeddings along this annotator",
typeConverter=TypeConverters.toBoolean)

def setEmbeddingsRef(self, value):
return self._set(embeddingsRef=value)

def setIncludeEmbeddings(self, value):
return self._set(includeEmbeddings=value)

def getIncludeEmbeddings(self):
return self.getOrDefault("includeEmbeddings")


class AnnotatorApproach(JavaEstimator, JavaMLWritable, AnnotatorJavaMLReadable, AnnotatorProperties,
ParamsGettersSetters):