forked from JohnSnowLabs/spark-nlp
CHANGELOG
========
2.7.4
========
----------------
Bugfixes
----------------
* Fix Tensors with a 0 dimension issue in ClassifierDL and SentimentDL
* Fix index error in TokenAssembler
* Fix MatchError in DateMatcher and MultiDateMatcher annotators
* Fix setOutputAsArray and its default value for valueSplitSymbol in Finisher annotator
----------------
Enhancements
----------------
* Implement missing frequencyThreshold and ambiguityThreshold params in WordSegmenterApproach annotator
* Downgrade Hadoop from 3.2 to 2.7 which caused an issue with S3
* Update Apache HTTP Client
========
2.7.3
========
---------------
New Features
---------------
* Add anchorDateYear, anchorDateMonth, and anchorDateDay to DateMatcher and MultiDateMatcher to be used for relative dates extraction
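To illustrate what an anchor date means for relative date extraction, here is a minimal sketch in plain Python. It does not use Spark NLP: the helper `resolve_relative_date` and the `RELATIVE_OFFSETS` table are hypothetical names invented for this example, and the real annotators likely handle many more expressions.

```python
from datetime import date, timedelta

# Hypothetical lookup table (not part of Spark NLP): day offsets for a few
# relative expressions.
RELATIVE_OFFSETS = {
    "yesterday": -1,
    "today": 0,
    "tomorrow": 1,
}


def resolve_relative_date(text, anchor_year, anchor_month, anchor_day):
    """Resolve a relative expression against an explicit anchor date.

    Returns a datetime.date, or None when the expression is not recognized.
    """
    anchor = date(anchor_year, anchor_month, anchor_day)
    offset = RELATIVE_OFFSETS.get(text.strip().lower())
    if offset is None:
        return None
    return anchor + timedelta(days=offset)


if __name__ == "__main__":
    # With a fixed anchor the result no longer depends on the day the job runs.
    print(resolve_relative_date("tomorrow", 2021, 1, 11))
```

Fixing the anchor makes relative expressions reproducible, which matters when documents are processed long after they were written.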
----------------
Bugfixes
----------------
* Fix the default value for action parameter in Python wrapper for DocumentNormalizer annotator
* Fix Lemmatizer pretrained models published in 2021
----------------
Enhancements
----------------
* Improve T5Transformer performance on documents with many sentences
========
2.7.2
========
----------------
Bugfixes
----------------
* Fix causal mask calculations resulting in bad translation in MarianTransformer
* Fix Serialization issue in the cluster while training ContextSpellChecker
* Fix calculating CHUNK spans based on the sentences' boundaries in RegexMatcher
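For readers unfamiliar with the term used in the MarianTransformer fix above: in Transformer decoders, a causal mask generally prevents a position from attending to later positions. The sketch below shows that idea in plain Python. The function name `causal_mask` is an assumption made for this example and is unrelated to the actual Spark NLP code.

```python
def causal_mask(seq_len):
    """Build a seq_len x seq_len mask as nested lists.

    mask[i][j] is 1 when position i may attend to position j (j <= i),
    and 0 when j lies in the future relative to i.
    """
    return [[1 if j <= i else 0 for j in range(seq_len)] for i in range(seq_len)]


if __name__ == "__main__":
    # Each row exposes one more position than the row before it.
    for row in causal_mask(4):
        print(row)
```

An off-by-one error in such a mask lets the decoder see tokens it has not generated yet, or hides tokens it should see, and either mistake can degrade translation quality.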
----------------
Enhancements
----------------
* Add GPU support for training ContextSpellChecker
* Add the ability to control ScalaTest tests by tags
========
2.7.1
========
----------------
Bugfixes
----------------
* Fix default pretrained model T5Transformer
* Fix default pretrained model WordSegmenter
* Fix missing reference to WordSegmenter in ResourceDownloader
* Fix T5Transformer models crashing due to unknown task
* Fix the issue of saving and reading ClassifierDL, SentimentDL, and MultiClassifierDL models introduced in the 2.7.0 release
----------------
Enhancements
----------------
* Export new T5 models with optimized Encoder/Decoder
* Add support for alternative tagging with the positional parser in RegexTokenizer
* Refactor AssertAnnotations
----------------
Backward compatibility
----------------
* To fix the issue with classifiers on clusters, we had to export new TF models and change the read/write functions of these annotators. As a result, any model trained prior to the 2.7.0 release, including pre-trained models, is not compatible with 2.7.1 and requires retraining. (We are re-training all the existing text classification models with 2.7.1.)
========
2.7.0
========
------------------------------
Major features and improvements
------------------------------
* Introducing MarianTransformer annotator for machine translation based on MarianNMT models. Marian is an efficient, free Neural Machine Translation framework mainly being developed by the Microsoft Translator team (646+ pretrained models & pipelines in 192+ languages)
* Introducing T5Transformer annotator for Text-To-Text Transfer Transformer (Google T5) models to achieve state-of-the-art results on multiple NLP tasks such as Translation, Summarization, Question Answering, Sentence Similarity, and so on
* Introducing brand new and refactored language detection and identification models. The new LanguageDetectorDL is faster, more accurate, and supports up to 375 languages
* Introducing WordSegmenter annotator, a trainable annotator for word segmentation of languages without any rule-based tokenization such as Chinese, Japanese, or Korean
* Introducing DocumentNormalizer annotator for cleaning content from HTML or XML documents, applying either data cleansing using an arbitrary number of custom regular expressions or data extraction following the different parameters
* [Spark NLP Display](https://github.com/JohnSnowLabs/spark-nlp-display) for visualization of different types of annotations
* Add support for new multi-lingual models in UniversalSentenceEncoder annotator
* Add support to Lemmatizer to be trained directly from a DataFrame instead of a text file
* Add training helper to transform CoNLL-U into Spark NLP annotator type columns
----------------
Bugfixes and Enhancements
----------------
* Fix all the known issues in ClassifierDL, SentimentDL, and MultiClassifierDL annotators in a Cluster
* NerDL enhancements for memory optimization and logging during the training with the test dataset
* SentenceEmbeddings annotator now reuses the storageRef of any embeddings used earlier in the pipeline
* Fix dropout in SentenceDetectorDL models for more deterministic results. Both English and Multi-lingual models are retrained for the 2.7.0 release
* Fix Python dataType Annotation
* Upgrade to Apache Spark 2.4.7
========
2.6.5
========
----------------
Bugfixes
----------------
* Fix a bug in batching sentences in BertSentenceEmbeddings
* Fix AttributeError when trying to load a saved EmbeddingsFinisher in Python
----------------
Enhancements
----------------
* Improve exception handling in DocumentAssembler when the user passes a corrupted DataFrame
========
2.6.4
========
----------------
Bugfixes
----------------
* Fix loading from a local folder with no access to the cache folder
* Fix NullPointerException in DocumentAssembler when rows contain null values
* Fix dynamic padding in BertSentenceEmbeddings
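As background for the dynamic padding fix above, dynamic padding usually means padding each batch only to the length of its longest sequence rather than to a fixed global maximum. A minimal sketch, assuming token ids are plain Python lists. The `pad_batch` helper and its `pad_id` parameter are hypothetical and do not come from Spark NLP.

```python
def pad_batch(batch, pad_id=0):
    """Pad every token-id sequence to the longest length found in this batch.

    The input lists are not modified; new lists are returned.
    """
    max_len = max((len(seq) for seq in batch), default=0)
    return [seq + [pad_id] * (max_len - len(seq)) for seq in batch]


if __name__ == "__main__":
    # The batch is padded to length 3, not to a fixed model-wide maximum.
    print(pad_batch([[101, 7, 102], [101, 102]]))
```

Padding per batch keeps tensors small when most sentences are short, which is one reason it tends to be faster than fixed-length padding.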
========
2.6.3
========
---------------
New Features
---------------
* Add enableMemoryOptimizer to allow training NerDLApproach on a dataset larger than the memory
* Add option to explode sentences in SentenceDetectorDL
----------------
Enhancements
----------------
* Improve POS (AveragedPerceptron) performance
* Improve Norvig Spell Checker performance
----------------
Bugfixes
----------------
* Fix SentenceDetectorDL unsupported model error in pretrained function
* Fix a race condition in Lru that can cause NullPointerException during LightPipeline operations with embeddings
* Fix max sequence length calculation in BertEmbeddings and BertSentenceEmbeddings
* Fix threshold in YakeModel on Python side
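The changelog does not say how the Lru race condition was resolved. As a general illustration only, one common way to make an LRU cache safe under concurrent access is to guard every read and write with a single lock. The class `LockedLru` below is a hypothetical Python sketch, not the library's Scala implementation.

```python
import threading
from collections import OrderedDict


class LockedLru:
    """A small LRU cache whose operations are serialized by one lock."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key, default=None):
        with self._lock:
            if key not in self._items:
                return default
            # Mark the key as most recently used before returning it.
            self._items.move_to_end(key)
            return self._items[key]

    def put(self, key, value):
        with self._lock:
            self._items[key] = value
            self._items.move_to_end(key)
            # Evict the least recently used entry when over capacity.
            if len(self._items) > self.capacity:
                self._items.popitem(last=False)


if __name__ == "__main__":
    cache = LockedLru(capacity=2)
    cache.put("a", 1)
    cache.put("b", 2)
    cache.get("a")     # "a" becomes the most recently used key
    cache.put("c", 3)  # evicts "b"
    print(cache.get("b"), cache.get("a"), cache.get("c"))
```

Without the lock, one thread could evict an entry between another thread's membership check and its read, which is the kind of interleaving that typically surfaces as a null lookup.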
========
2.6.2
========
---------------
New Features
---------------
* Introducing a new SentenceDetectorDL
----------------
Enhancements
----------------
* Improved BioBERT models quality for BertEmbeddings (it achieves higher accuracy in sequence classification)
* Improved Sentence BioBERT models quality for BertSentenceEmbeddings (it achieves higher accuracy in text classification)
* Add unit test to MultiClassifierDL annotator
* Better error handling in SentimentDLApproach
* Improve loadSavedModel in BertEmbeddings and BertSentenceEmbeddings
----------------
Bugfixes
----------------
* Fix BERT LaBSE model for BertSentenceEmbeddings
* Fix loadSavedModel for BertSentenceEmbeddings in Python
---------------
Deprecations
---------------
* DeepSentenceDetector is deprecated in favor of SentenceDetectorDL
========
2.6.1
========
----------------
Bugfixes
----------------
* Fix a bug in ClassifierDL that resulted in low accuracy during the training
========
2.6.0
========
------------------------------
Major features and improvements
------------------------------
* **NEW:** A new MultiClassifierDL annotator for multi-label text classification
* **NEW:** A new BertSentenceEmbeddings annotator with 41 available pre-trained models for sentence embeddings used in SentimentDL, ClassifierDL, and MultiClassifierDL annotators
* **NEW:** A new YakeModel annotator for an unsupervised, corpus-independent, domain- and language-independent, single-document keyword extraction algorithm
* Integrate 24 new Small BERT models where the smallest model is 24x smaller and 28x faster compared to BERT base models
* Add 3 new ELECTRA small, base, and large models
* Add 4 new Finnish BERT models for BertEmbeddings and BertSentenceEmbeddings
* Improve BertEmbeddings memory consumption by 30%
* Improve BertEmbeddings performance by more than 70% with new built-in dynamic shape inputs
* Remove the poolingLayer parameter in BertEmbeddings in favor of sequence_output that is provided by TF Hub models for new BERT models
* Add validation loss, validation accuracy, validation F1, and validation True Positive Rate during the training in MultiClassifierDL
* Add parameter to enable/disable list detection in SentenceDetector
* Unify the loggings in ClassifierDL and SentimentDL during training
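The validation metrics listed above for MultiClassifierDL can be illustrated with a small multi-label example. The sketch below computes micro-averaged precision, True Positive Rate (recall), and F1 over sets of labels. Micro-averaging is an assumption made for this example, since the changelog does not state which averaging the annotator uses, and `multilabel_scores` is a hypothetical helper.

```python
def multilabel_scores(true_sets, pred_sets):
    """Compute micro-averaged precision, TPR (recall), and F1.

    true_sets and pred_sets are equally long sequences of label sets,
    one pair per document.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # labels predicted and correct
        fp += len(pred - truth)   # labels predicted but wrong
        fn += len(truth - pred)   # labels missed
    tpr = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * tpr / (precision + tpr) if precision + tpr else 0.0
    return {"precision": precision, "tpr": tpr, "f1": f1}


if __name__ == "__main__":
    truth = [{"toxic", "insult"}, {"threat"}]
    predicted = [{"toxic"}, {"threat", "insult"}]
    print(multilabel_scores(truth, predicted))
```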
----------------
Bugfixes
----------------
* Fix Tokenization bug with Bigrams in the exception list
* Fix the versioning error in second SBT projects that caused models not to be found via the pretrained function
* Fix logging to file in NerDLApproach, ClassifierDL, SentimentDL, and MultiClassifierDL on HDFS
* Fix ignored modified tokens in BertEmbeddings, now it will consider modified tokens instead of originals
========
2.5.5
========
---------------
New Features
---------------
- Add getClasses() function to NerDLModel
- Add getClasses() function to ClassifierDLModel
- Add getClasses() function to SentimentDLModel
---------------------
Enhancements
---------------------
- Improve max sequence length calculation in BertEmbeddings and XlnetEmbeddings
----------------
Bugfixes
----------------
- Fix a bug in RegexTokenizer in Python
- Fix StopWordsCleaner exception in Python when pretrained() is used
- Fix max sequence length issue in AlbertEmbeddings and SentencePiece generation
- Fix HDFS support for setGraphFolder param in NerDLApproach
========
2.5.4
========
---------------
New Features
---------------
* Add support for Apache Spark 2.3.x including new Maven artifacts and full support of all pre-trained models/pipelines
* Add 43 new pre-trained models in 43 languages to StopWordsCleaner annotator
* Introduce a new RegexTokenizer to split text by regex pattern
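Splitting text by a regex pattern, as the new RegexTokenizer is described to do, can be pictured with Python's standard `re` module. This is a conceptual sketch only: the `regex_tokenize` function and its default pattern are assumptions for illustration and say nothing about the annotator's actual parameters.

```python
import re


def regex_tokenize(text, pattern=r"\s+"):
    """Split text wherever the pattern matches and drop empty pieces."""
    return [token for token in re.split(pattern, text) if token]


if __name__ == "__main__":
    # Default: split on runs of whitespace.
    print(regex_tokenize("  Spark   NLP  rocks "))
    # Custom pattern: split on commas or semicolons plus trailing spaces.
    print(regex_tokenize("alpha, beta;gamma", r"[,;]\s*"))
```

Dropping empty strings matters because `re.split` returns them whenever the text starts or ends with a match.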
---------------------
Enhancements
---------------------
* Retrained 6 new BioBERT and ClinicalBERT models
* Add a new param to `start()` function to start the session for Apache Spark 2.3.x
----------------
Bugfixes
----------------
* Add missing library for SentencePiece used by AlbertEmbeddings and XlnetEmbeddings on Windows
* Fix ModuleNotFoundError in LanguageDetectorDL pipelines in Python
========
2.5.3
========
---------------
New Features
---------------
* TextMatcher now can construct the chunks from tokens instead of the original documents via buildFromTokens param
* CoNLLGenerator now is accessible in Python
----------------
Bugfixes
----------------
* Fix a bug in ContextSpellChecker resulting in IllegalArgumentException
---------------------
Enhancements
---------------------
* Improve RocksDB connection to support different storage capabilities
* Improve parameters naming convention in ContextSpellChecker
---------------------
Documentation
---------------------
* Add NerConverter to documentation
* Fix multi-language tabs in documentation
========
2.5.2
========
---------------
New Features
---------------
* Introducing a new LanguageDetectorDL state-of-the-art annotator to detect and identify languages in documents and sentences
* Add a new param entityValue to TextMatcher to add custom value inside metadata. Useful in post-processing when there are multiple TextMatcher annotators with multiple dictionaries https://github.com/JohnSnowLabs/spark-nlp/issues/920
----------------
Bugfixes
----------------
* Add missing TensorFlow graphs to train ContextSpellChecker annotator https://github.com/JohnSnowLabs/spark-nlp/issues/912
* Fix misspelled classThreshold param in ContextSpellChecker annotator https://github.com/JohnSnowLabs/spark-nlp/issues/911
* Fix a bug where setGraphFolder in NerDLApproach annotator couldn't find a graph on Databricks (DBFS) https://github.com/JohnSnowLabs/spark-nlp/issues/739
* Fix a bug in NerDLApproach when includeConfidence was set to true https://github.com/JohnSnowLabs/spark-nlp/issues/917
* Fix a bug in BertEmbeddings https://github.com/JohnSnowLabs/spark-nlp/issues/906 https://github.com/JohnSnowLabs/spark-nlp/issues/918
---------------------
Enhancements
---------------------
* Improve TF backend in ContextSpellChecker annotator
========
2.5.1
========
---------------
New Features
---------------
* Add Python support for PubTator reader to convert automatic annotations of the biomedical datasets into DataFrame
* Add 6 new pre-trained BERT models from BioBERT and ClinicalBERT
---------------------
Enhancements
---------------------
* Add unit tests for XlnetEmbeddings
* Add unit tests for AlbertEmbeddings
* Add unit tests for ContextSpellChecker
========
2.5.0
========
---------------
New Features
---------------
* A new AlbertEmbeddings annotator with 4 available pre-trained models
* A new XlnetEmbeddings annotator with 2 available pre-trained models
* A new ContextSpellChecker annotator, the state-of-the-art annotator for spell checking
* A new SentimentDL annotator for multi-class sentiment analysis. This annotator comes with 2 available pre-trained models trained on IMDB and Twitter datasets
* Add new PubTator reader to convert automatic annotations of the biomedical datasets into DataFrame
* Introducing a new outputLogsPath param for NerDLApproach, ClassifierDLApproach and SentimentDLApproach annotators
* Refactored CoNLLGenerator to actually use NER labels from the DataFrame
* Unified params in NerDLModel in both Scala and Python
* Extend and complete Scaladoc APIs for all the annotators
----------------
Bugfixes
----------------
* Fix position of tokens in Normalizer
* Fix Lemmatizer exception on a bad input
* Fix annotator logs failing on object storage file systems like DBFS
----------------
Documentation
----------------
* Update documentation for release of Spark NLP 2.5.x
* Update the entire [spark-nlp-workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop) notebooks for Spark NLP 2.5.x
* Update the entire [spark-nlp-models](https://github.com/JohnSnowLabs/spark-nlp-models) repository with new pre-trained models and pipelines
========
2.4.5
========
---------------
Overview
---------------
We are very excited to extend Spark NLP support to 6 new Databricks runtimes and add support to Cloudera and EMR YARN cluster-mode.
As always, we thank our community for their feedback and questions in our Slack channel.
---------------
New Features
---------------
* Extend Spark NLP support for Databricks runtimes:
  * 6.2
  * 6.2 ML
  * 6.3
  * 6.3 ML
  * 6.4
  * 6.4 ML
  * 6.5
  * 6.5 ML
* Add support for cluster-mode in Cloudera and EMR YARN clusters
* New splitPattern param in Tokenizer to split tokens by regex rules
----------------
Bugfixes
----------------
* Fix ClassifierDLModel save and load in Python
* Fix ClassifierDL TensorFlow session reuse
* Fix Normalizer positions of new tokens
----------------
Documentation
----------------
* Update documentation for release of Spark NLP 2.4.x
* Update the entire [spark-nlp-workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop) notebooks for Spark NLP 2.4.x
* Update the entire [spark-nlp-models](https://github.com/JohnSnowLabs/spark-nlp-models) repository with new pre-trained models and pipelines
========
2.4.4
========
---------------
Overview
---------------
* We are very excited to release the very first multi-class text classifier in Spark NLP v2.4.4! We have built a generic ClassifierDL annotator that uses the state-of-the-art Universal Sentence Encoder as an input for text classifications. The ClassifierDL annotator uses a deep learning model (DNNs) we have built inside TensorFlow and supports up to 50 classes.
* We are also happy to announce the support of yet another language: Russian! We have trained and prepared 5 pre-trained models and 6 pre-trained pipelines in Russian.
**NOTE**: ClassifierDL is an experimental feature in 2.4.4 release. We have worked hard to aim for simplicity and we are looking forward to your feedback as always.
---------------
New Features
---------------
* Introducing an experimental multi-class text classification by using the DNNs model in TensorFlow called `ClassifierDL`. This annotator can train any dataset from 2 up to 50 classes.
* 5 new pretrained Russian models (Lemma, POS, 3x NER)
* 6 new pretrained Russian pipelines
---------------
Enhancements
---------------
* Add param to NerConverter to override modified tokens instead of original tokens
----------------
Bugfixes
----------------
* Fix TokenAssembler
* Fix NerConverter exception when NerDL is trained with different tagging style than IOB/IOB2
========
2.4.3
========
---------------
Overview
---------------
This minor release fixes a bug on our Python side that was introduced in 2.4.2 release.
As always, we thank our community for their feedback and questions in our Slack channel.
----------------
Bugfixes
----------------
* Fix Python imports which resulted in AttributeError: module 'sparknlp' has no attribute
========
2.4.2
========
---------------
Overview
---------------
This minor release fixes a few bugs in some of our annotators reported by our community.
As always, we thank our community for their feedback and questions in our Slack channel.
----------------
Bugfixes
----------------
* Fix UniversalSentenceEncoder.pretrained() that failed in Python
* Fix ElmoEmbeddings.pretrained() that failed in Python
* Fix ElmoEmbeddings poolingLayer param to be a string as expected
* Fix ChunkEmbeddings to preserve chunk's index
* Fix NGramGenerator and missing chunk metadata
---------------
New Features
---------------
* Add GPU support param in Spark NLP start function: `sparknlp.start(gpu=True)`
* Improve create_model.py to create custom TF graph for NerDLApproach
----------------
Documentation
----------------
* Update documentation for release of Spark NLP 2.4.x
* Update the entire [spark-nlp-workshop](https://github.com/JohnSnowLabs/spark-nlp-workshop) notebooks for Spark NLP 2.4.x
* Update the entire [spark-nlp-models](https://github.com/JohnSnowLabs/spark-nlp-models) repository with new pre-trained models and pipelines
========
2.4.1
========
---------------
Overview
---------------
This minor release fixes a few bugs in some of our annotators reported by our community.
As always, we thank our community for their feedback and questions in our Slack channel.
----------------
Bugfixes
----------------
* Improve ChunkEmbeddings annotator and fix the empty chunk result
* Fix UniversalSentenceEncoder crashing on empty Tensor
* Fix NorvigSweetingModel missing sentenceId that results in NGramsGenerator crashing
* Fix missing storageRef in embeddings' column for ElmoEmbeddings annotator
----------------
Documentation
----------------
* Update documentation for release of Spark NLP 2.4.x
* Add new features such as ElmoEmbeddings and UniversalSentenceEncoder
* Add multiple programming languages for demos and examples
* Update the entire [spark-nlp-models](https://github.com/JohnSnowLabs/spark-nlp-models) repository with new pre-trained models and pipelines
========
2.4.0
========
---------------
Overview
---------------
We are very excited to finally release Spark NLP v2.4.0! This has been one of the largest releases we have ever made since the inception of the library!
The new release of Spark NLP `2.4.0` has been migrated to TensorFlow `1.15.0` which takes advantage of the latest deep learning technologies and pre-trained models.
As always, thanks to the community for the feedback and questions in our Slack channel.
Please be aware that this release breaks backwards compatibility with previously saved models, particularly those involving TensorFlow and Embeddings, and also includes code-breaking changes in the API.
We will be working on our documentation to ease the learning curve.
---------------
New Features
---------------
* TensorFlow 1.15.0 now works behind Spark NLP. This brings implicit improvements in performance, accuracy and functionalities
* New Annotator UniversalSentenceEncoder with 2 pre-trained models from TF Hub. Check our spark-nlp-models repo for updates
* New Annotator MultiDateMatcher capable of matching more than one date per sentence (Extends DateMatcher algorithm)
* New Annotator NGramGenerator with Param tweaks for customization
* New Annotator BigTextMatcher works best with large amounts of input data
* New Annotator ElmoEmbeddings with a pre-trained model from TF Hub. Check our spark-nlp-models repo for updates
* BertEmbeddings improvements with 5 new models from TF Hub
* RecursivePipelineModel as an enhanced PipelineModel allows Annotators to access previous annotators in the pipeline for more ML strategies
* LazyAnnotators: A new Param in Annotators allows them to stand idle in the Pipeline and do nothing. They can be called by other Annotators in a RecursivePipeline
---------------
Enhancements
---------------
* RocksDB is now available as a flexible API called `Storage`. It allows any annotator to have its own distributed local index database
* Our TensorFlow pre-trained models are now cross-platform, enabling multi-language models and other improvements for Windows users
* Improved IO performance in general for handling embeddings
* Improved cache cleanup and GC by liberating open files utilized in RocksDB (to be improved further)
* Tokenizer and SentenceDetector Params minLength and maxLength to filter out annotations outside these bounds
* Tokenizer improvements in splitChars and simplified rules
* DateMatcher improvements
* TextMatcher improvements: preload algorithm information within the model for faster prediction
* Annotators that utilize embeddings now have strict validation to ensure they use exactly the embeddings they were trained with
* Improvements in the API allow Annotators with Storage to save and load their RocksDB database independently and let it be shared across Annotators
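The length-bound filtering mentioned above for Tokenizer and SentenceDetector amounts to discarding annotations shorter or longer than given limits. A minimal sketch on plain strings, where the function `filter_by_length` and its parameter names are hypothetical and only mirror the idea, not the library API:

```python
def filter_by_length(tokens, min_length=0, max_length=None):
    """Keep tokens whose length lies within [min_length, max_length].

    max_length=None means there is no upper bound.
    """
    return [
        token
        for token in tokens
        if len(token) >= min_length
        and (max_length is None or len(token) <= max_length)
    ]


if __name__ == "__main__":
    words = ["a", "to", "token", "extraordinarily"]
    print(filter_by_length(words, min_length=2, max_length=5))
```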
----------------
Bugfixes
----------------
* Fixes in Chunk and SentenceEmbeddings to better deal with empty cleaned-up Annotations
* Fixed PretrainedPipeline in Python to allow accessing the inner PipelineModel in the instance
* Probably a bunch of uncommented bugfixes along the way :)
========
2.3.6
========
---------------
Overview
---------------
This minor release fixes a bug in ChunkEmbeddings causing an out-of-bounds exception in some scenarios. We
also switch to maven coordinates as default source for start() function since spark-packages has not been responsive
on their package approval process. Thank you all for your consistent feedback.
---------------
Bugfixes
---------------
* Fixed a bug in ChunkEmbeddings causing an out-of-bounds exception in some scenarios
---------------
Other
---------------
* start() function switched to use maven coordinates instead
========
2.3.5
========
---------------
Overview
---------------
We would like to thank you all for your valuable feedback via our Slack channels and our GitHub repositories.
Spark NLP `2.3.4` is a very stable and rock-solid release. However, we wanted to fix the few remaining minor bugs before moving to our bigger release `2.4.0`!
---------------
Bugfixes
---------------
* https://github.com/JohnSnowLabs/spark-nlp/pull/702 Date matcher fixes flexible dates
* https://github.com/JohnSnowLabs/spark-nlp/pull/718 Fixed a bug in a pragmatic sentence detector where a sub matched group contained a dollar sign.
* https://github.com/JohnSnowLabs/spark-nlp/pull/719 Move import to top-level to avoid import fail in Spark NLP functions
* https://github.com/JohnSnowLabs/spark-nlp/pull/709 https://github.com/JohnSnowLabs/spark-nlp/pull/716 Some improvements in our documentation thanks to @marcinic @howmuchcomputer
========
2.3.4
========
---------------
Overview
---------------
Thank you, as always, for the feedback given at Slack and our repos. The most important part of this release
is how we internally started organizing models. We'll be deploying our model news in
https://github.com/JohnSnowLabs/spark-nlp-models . The models repo will be kept up to date.
As for this release, it improves various internal API functionalities, allowing for positive side-effects across
the library. As an important enhancement, we have added user UDFs and functions for both Scala and Python users
to be able to easily manipulate annotations on DataFrames. Finally, we have fixed various bugs in embeddings
metadata to make sure we provide accurate offsetting information for other annotators to consume it successfully.
---------------
Enhancements
---------------
* Revamped functions in Scala and python to help users deal with annotations from dataframes or in UDF form, such as `map_annotations` and `filter_by_annotations`
---------------
Bugfixes
---------------
* Fixed bugs in ChunkEmbeddings and SentenceEmbeddings causing them to report wrong metadata and offset values
* Fixed a nested import issue in Python causing LightPipelines not to work in some environments
---------------
Developer API
---------------
* downloadModel is now flexible as to which inner downloader class is being used to access AnnotatorModel reference
* pretrained API now deals with defaultModelName as an Option to allow non default pretrained models
---------------
Other
---------------
* version() now returns the version string instead of just printing it
========
2.3.3
========
---------------
Overview
---------------
We are very glad to announce this release, which ended up much bigger than we expected.
Thanks to the community feedback, we arranged many bugfixes. We also spent some time and started building
models for the TextMatcher, so it got various improvements and bugfixes when dealing with empty sentences or cleaned up tokens.
We also added UDF ready functions in Python to easily deal with Annotations. Finally, we fixed a few bugs when loading models from disk.
Thank you very much for constant feedback on Slack.
---------------
New Features
---------------
* TextMatcher new param `mergeOverlapping` allows for handling overlapping output chunks when matching entities share keywords
* NER overwriter annotator allows for overwriting NER output with custom entities
* Added `map_annotations`, `map_annotations_strict`, `map_annotations_col`, `filter_by_annotations_col` and `explode_annotations_col` functions to python side. Allows dealing with Annotations easily.
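To show the general idea behind merging overlapping chunks, as the `mergeOverlapping` param name suggests, here is a sketch that merges character spans given as inclusive `(begin, end)` offsets. The exact merge semantics in TextMatcher may differ, and `merge_overlapping` is a hypothetical function written for this example.

```python
def merge_overlapping(chunks):
    """Merge spans that overlap, given as (begin, end) inclusive offsets.

    Returns a new list of spans sorted by their begin offset.
    """
    merged = []
    for begin, end in sorted(chunks):
        if merged and begin <= merged[-1][1]:
            # The span starts inside the previous one, so extend it.
            last_begin, last_end = merged[-1]
            merged[-1] = (last_begin, max(last_end, end))
        else:
            merged.append((begin, end))
    return merged


if __name__ == "__main__":
    # Two matches sharing a keyword collapse into one span.
    print(merge_overlapping([(0, 4), (3, 8), (10, 12)]))
```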
---------------
Enhancement
---------------
* Made ChunkEmbeddings output to be compatible with SentenceEmbeddings for better flexibility in pipelines
---------------
Bugfixes
---------------
* Fixed BertEmbeddings crashing on empty input sentences
* Fixed missing load API and import shortcuts on the new Embeddings annotators
* Added missing metadata fields in ChunkEmbeddings
* Fixed wrong sentence IDs in sentences or tokens that got a cleanup during the pipeline
* Fixed typos in docs. Thanks @marcinic
* Fixed bad deprecated OCR and SpellChecker python classpath
========
2.3.2
========
---------------
Overview
---------------
This release addresses multiple bug fixes and some enhancements regarding memory consumption in BertEmbeddings annotator.
Thanks for your feedback and reports!
---------------
Bugfixes
---------------
* Fix missing EmbeddingsFinisher in Scala and Python
* Reverted embeddings move to copy due to CRC issue
* Fix IndexOutOfBoundsException in SentenceEmbeddings
---------------
Enhancement
---------------
* Optimize BertEmbeddings memory consumption
========
2.3.1
========
---------------
Overview
---------------
This quick release addresses a bug in Lemmatizer loading/pretrained function causing it not to work in 2.3.0.
We took the chance to include a feature which did not make it for base 2.3.0 and slightly changed protected variables for
better Java API, also including a pretrained compatible function with Java. Thanks for the quick issue feedback again!
---------------
New Features
---------------
* New EmbeddingsFinisher specializes in dealing with embedding annotators output. Traditional finisher still behaves the same as 2.3.0
---------------
Bugfixes
---------------
* Fixed a bug in the previous release that prevented LemmatizerModel from being loaded, either from disk or via pretrained
* Fixed pretrained() function to return proper type in Java
---------------
Developer API
---------------
* defaultModelName, defaultLang and defaultLoc static pretrained properties are now public
========
2.3.0
========
---------------
Overview
---------------
Thanks for your contributions and feedback on Slack. This amazing release comes with many new features in the embeddings scope,
allowing pipeline builders to retrieve embeddings for specific bodies of texts in any form given, from sentences to chunks or n-grams.
We also worked a lot on making sure Spark NLP on Java works as intended. Finally, we improved AWS profile compatibility for frameworks
that utilize multiple credential profiles. Unfortunately, we have deprecated Eval and OCR due to internal patents in some of the latest improvements
John Snow Labs has contributed to.
---------------
New Features
---------------
* New SentenceEmbeddings annotator utilizes WordEmbeddings or BertEmbeddings to generate sentence or document embeddings
* New ChunkEmbeddings annotator utilizes WordEmbeddings or BertEmbeddings to generate chunk embeddings from Chunker or NGramGenerator outputs
* New StopWordsCleaner integrates Spark ML StopWordsRemover into Spark NLP pipelines
* New NGramGenerator annotator integrates Spark ML NGram function into Spark NLP with a new cumulative feature to also generate range ngrams like the scikit-learn library
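The cumulative n-gram idea mentioned for NGramGenerator can be sketched in plain Python. This is only an illustration of the concept, not the annotator's implementation; the function name `generate_ngrams` and its parameters are hypothetical.

```python
from typing import List


def generate_ngrams(tokens: List[str], n: int, cumulative: bool = False,
                    delimiter: str = " ") -> List[str]:
    """Return n-grams of size n; with cumulative=True, also sizes 1..n-1.

    Hypothetical helper for illustration only, not part of Spark NLP.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    sizes = range(1, n + 1) if cumulative else [n]
    ngrams = []
    for size in sizes:
        # Slide a window of `size` tokens across the sentence.
        for start in range(len(tokens) - size + 1):
            ngrams.append(delimiter.join(tokens[start:start + size]))
    return ngrams


tokens = ["spark", "nlp", "rocks"]
bigrams = generate_ngrams(tokens, 2)
# bigrams == ["spark nlp", "nlp rocks"]
up_to_bigrams = generate_ngrams(tokens, 2, cumulative=True)
# up_to_bigrams == ["spark", "nlp", "rocks", "spark nlp", "nlp rocks"]
```

With the cumulative flag off, only n-grams of exactly size n are produced; with it on, every size from 1 up to n is included, which is what a range of n-grams means here.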
---------------
Enhancements
---------------
* Improved Java intercompatibility on Pretrained and LightPipeline APIs. Examples added.
* Finisher and LightPipelines Parse Embeddings Vector flag allows for optional vector processing to save memory and improve performance
* setInputCols in Python now accepts column names as *args
* new Param enableScore in SentimentDetector to switch output types between confidence score and results (Thanks @maxwellpaulm)
* spark_nlp is now the default profile name looked up in the AWS config, for compatibility with setups that use multiple profiles for downloads
---------------
Bugfixes
---------------
* Fixed POS training dataset creator to improve performance
---------------
Deprecations
---------------
* OCR Module dropped from open source support
* Eval Module dropped from open source support
========
2.2.2
========
---------------
Overview
---------------
Thank you again for all your feedback and questions in our Slack channel. Such feedback from users and contributors
(thank you Stuart Lynn @sllynn) helped to find several python module bugs. We also fixed and improved OCR support
for extracting page coordinates and fixed the NerDL evaluator in Python.
---------------
Enhancements
---------------
* Added a create_models.py python script to generate Graphs for NerDL without needing Jupyter
* Added a new annotator Token2Chunk to convert all tokens to chunk types (useful for extracting token coordinates from OCR)
* Added OCR Page Dimensions
* Python setInputCols now accepts *args, so there is no need to pass a list
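A setter that accepts either a single list or several positional names can be sketched as follows. The class `ColumnSetter` is a hypothetical stand-in used only to show the pattern; it is not the Spark NLP annotator class, and the library's own code may differ.

```python
class ColumnSetter:
    """Hypothetical stand-in for an annotator; not a Spark NLP class."""

    def __init__(self):
        self.input_cols = []

    def setInputCols(self, *value):
        # A single list (or tuple) argument is unpacked; otherwise the
        # positional arguments themselves are taken as the column names.
        if len(value) == 1 and isinstance(value[0], (list, tuple)):
            self.input_cols = list(value[0])
        else:
            self.input_cols = list(value)
        return self  # returning self allows call chaining


from_list = ColumnSetter().setInputCols(["document", "token"]).input_cols
from_args = ColumnSetter().setInputCols("document", "token").input_cols
# Both calls store ["document", "token"].
```

Accepting both forms keeps existing code that passes a list working while allowing the shorter call.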
---------------
Bugfixes
---------------
* Fixed python support of NerDL evaluation not taking all params appropriately
* Fixed a bug in case sensitivity matching of embeddings format in python (Thanks @sllynn)
* Fixed a bug in python DateMatcher with dateFormat param not working (Thanks @sllynn)
* Fixed a bug in PositionFinder reporting duplicate coordinate elements
----------------
Developer API
----------------
* Renamed trainValidationProp to validationSplit in NerDLApproach
----------------
Documentation
----------------
* Added documentation for several missing annotators to the docs page
========
2.2.1
========
---------------
Overview
---------------
This short release addresses a few issues uncovered in the previous 2.2.0 release. Thank you all for the quick feedback.
---------------
Enhancements
---------------
* NerDLApproach new param includeValidationProp allows partitioning the training set and excluding a fraction
* NerDLApproach trainValidationProp now randomly samples the data as opposed to head first
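The difference between head-first and random selection of a validation fraction can be sketched in plain Python. This is an illustration under assumptions, not the NerDLApproach code: the function `split_validation`, its `seed` parameter and the rounding rule are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_validation(rows: Sequence, validation_prop: float,
                     seed: int = 0) -> Tuple[List, List]:
    """Randomly hold out a fraction of rows for validation.

    Hypothetical helper for illustration only, not part of Spark NLP.
    """
    if not 0.0 <= validation_prop <= 1.0:
        raise ValueError("validation_prop must be between 0 and 1")
    indices = list(range(len(rows)))
    # Shuffling the indices is what makes this a random sample; taking
    # indices[:n_val] without shuffling would be the head-first behaviour.
    random.Random(seed).shuffle(indices)
    n_val = int(len(rows) * validation_prop)
    val_idx = set(indices[:n_val])
    train = [row for i, row in enumerate(rows) if i not in val_idx]
    validation = [row for i, row in enumerate(rows) if i in val_idx]
    return train, validation


rows = list(range(10))
train, validation = split_validation(rows, 0.2)
# len(train) == 8 and len(validation) == 2; which rows land in the
# validation set depends on the seed.
```

A random sample avoids a biased validation set when the training file is ordered, for example by document or by label.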
---------------
Bugfixes
---------------
* Fixed a bug in ResourceHelper causing folder resources to fail when a folder is empty (affects various annotators)
* Fixed a bug in python embeddings format not parsed to upper case
* Fixed a bug in python that prevented PipelineModels from being loaded after loading embeddings
========
2.2.0
========
---------------
Overview
---------------
Last time, following a release candidate schedule proved quite effective at avoiding silly bugs right after release!
Carefully testing releases alongside the community resulted in various pull requests and, fortunately, no breaking bugs.
This huge release features OCR based coordinate highlighting, BERT embeddings refactor and tuning, more tools for accuracy evaluation in python, and much more.
We welcome your feedback in our Slack channels, as always!
---------------
New Features
---------------
* OCRHelper now returns coordinate positions matrix for text converted from PDF
* New annotator PositionFinder consumes OCRHelper positions to return rectangle coordinates for CHUNK annotator types
* Evaluation module now also ported to Python
* WordEmbeddings now include coverage metadata information and new static functions `withCoverageColumn` and `overallCoverage` offer metric analysis
* NerDL now has `includeConfidence` param that enables confidence scores on prediction metadata
* NerDLApproach now has `enableOutputLog`, which outputs training metric logs to a file
* New Param in BERT `poolingLayer` allows for pooling layer selection
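As a rough illustration of what an embeddings coverage metric measures, the sketch below computes the share of tokens that have an entry in a vocabulary. The function `coverage` is hypothetical, and the exact definition used by `withCoverageColumn` and `overallCoverage` (for example, how case is handled) is not stated in these notes.

```python
from typing import Iterable, Set


def coverage(tokens: Iterable[str], vocabulary: Set[str]) -> float:
    """Fraction of tokens found in the embeddings vocabulary.

    Hypothetical helper for illustration only, not part of Spark NLP.
    """
    tokens = list(tokens)
    if not tokens:
        return 0.0  # assumption: an empty input counts as zero coverage
    covered = sum(1 for token in tokens if token in vocabulary)
    return covered / len(tokens)


vocab = {"the", "cat", "sat"}
sentence_coverage = coverage(["the", "cat", "sat", "quietly"], vocab)
# sentence_coverage == 0.75: three of the four tokens are in the vocabulary
```

A low value suggests that many tokens have no entry in the embeddings, which can be worth checking before training.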
---------------
Enhancements
---------------
* BERT Embeddings now merges much better with Spark NLP, returning state of the art accuracy numbers for NER (Details will be expanded). Thank you for community feedback.
* Progress bar and size estimate report when downloading pretrained models and loading embeddings
* Models and pipeline cache now more efficiently managed and includes CRC (not retroactive)
* Finisher and LightPipeline now deal with embeddings properly, including them in pre processed result (Thank you Will Held)
* Tokenizer now allows regular expressions in the list of Exceptions (Thank you @atomobianco)
* PretrainedPipelines now allow function `fullAnnotate` to retrieve the full information of Annotations
* DocumentAssembler new cleanup modes: each, each_full and delete_full allow more control over text cleaning up (different ways of dealing with new lines and tabs)
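The idea of regular expressions in a tokenizer's exception list can be sketched with the standard `re` module: spans matching an exception pattern are kept as one token, and everything else falls back to whitespace splitting. This is a simplified illustration, not the Spark NLP Tokenizer; the function `tokenize` and its behaviour are assumptions.

```python
import re
from typing import List


def tokenize(text: str, exceptions: List[str]) -> List[str]:
    """Whitespace tokenizer that keeps regex exception matches whole.

    Hypothetical helper for illustration only, not part of Spark NLP.
    """
    # Exceptions come first in the alternation so that they win over the
    # generic "run of non-whitespace characters" fallback.
    alternatives = [f"(?:{pattern})" for pattern in exceptions] + [r"\S+"]
    combined = re.compile("|".join(alternatives))
    # group(0) is the whole match, even if a pattern has its own groups.
    return [match.group(0) for match in combined.finditer(text)]


exceptions = [r"New York", r"\d+ (?:km|mi)"]
tokens = tokenize("I ran 10 km in New York", exceptions)
# tokens == ["I", "ran", "10 km", "in", "New York"]
```

Without the exceptions, the same text would be split into seven whitespace-separated tokens.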
---------------
Bugfixes
---------------
* Fixed a bug in NerConverter caused by empty entities, returning an error when flushing entities
* Fixed a bug when creating BERT Models from python, where contrib libraries were not loaded
* Fixed missing setters for whitelist param in NerConverter
* Fixed a bug where parameters from a BERT model were read incorrectly from python because they were not serialized correctly
* Fixed a bug where ResourceDownloader conflicted S3 credentials with public model access (Thank you Dimitris Manikis)
* Fixed Context Spell Checker bugs with performance improvements (pretrained model disabled until we get a better one)
========
2.1.1
========
---------------
Overview
---------------
Thank you so much for your feedback on Slack. This release extends the life of the 2.1.x release line with important bugfixes from upstream.
---------------
Bugfixes
---------------
* Fixed a bug in NerConverter caused by empty entities, returning an error when flushing entities
* Fixed a bug when creating BERT Models from python, where contrib libraries were not loaded
* Fixed missing setters for whitelist param in NerConverter
========
2.1.0
========
---------------
Overview
---------------
Thank you for following up with release candidates. This release is backwards breaking because two basic annotators have been redesigned.
The tokenizer now has easier to customize params and simplified exception management.
DocumentAssembler `trimAndClearNewLines` was redesigned into a `cleanupMode` for further control over the cleanup process.
Tokenizer now supports pretrained models, meaning you'll be capable of accessing any of our language based Tokenizers.
Another big introduction is the `eval` module, an optional Spark NLP sub-module that provides evaluation scripts to
make it easier to measure how your own models perform against a validation dataset, now using MLFlow.
Some work also began on metrics during training, starting now with the `NerDLApproach`.
Finally, we'll have Scaladocs ready for easy library reference.
Thank you for your feedback in our Slack channels.
Particular thanks to @csnardi for fixing a bug in one of the release candidates.
---------------
New Features
---------------
* Spark NLP Eval module, includes functions to evaluate NER and Spell Checkers with MLFlow (Python support and more annotators to come)
---------------
Enhancements
---------------
* DocumentAssembler new param `cleanupMode` allows user to decide what kind of cleanup to apply to source
* Tokenizer has been significantly enhanced to allow easier and more intuitive customization
* Norvig and Symmetric spell checkers now report confidence scores in metadata
* NerDLApproach now reports metrics and f1 scores with an automated dataset splitting through `setTrainValidationProp`
* Began making progress towards OCR reporting more meaningful metadata (noise levels, confidence score, etc), laying the groundwork for further development
---------------
Bugfixes
---------------
* Fixed Dependency Parser not reporting offsets correctly
* Dependency Parser now only shows head token as part of the result, instead of pairs
* Fixed NerDLModel not allowing noncontrib versions to be picked on Linux
* Fixed a bug in embeddingsRef validation allowing the user to override ref when not possible
* Removed unintentional gc calls causing some performance issues
---------------
Framework
---------------
* ResourceDownloader now capable of utilizing credentials from aws standard means (variables, credentials folder)
---------------
Documentation
---------------
* Scaladocs for Spark NLP reference
* Added Google Colab walkthrough guide
* Added Approach and Model class names in reference documentation
* Fixed various typos and outdated pieces in documentation
========
2.0.8
========
---------------
Overview
---------------
This release fixes a few tiny but meaningful issues, preventing newly trained models from having internal compatibility issues.
---------------
Bugfixes
---------------
* Fixed wrong logic when checking embeddingsRef is being overwritten in a WordEmbeddingsModel
* Deleted unnecessary chunk index from tokens
* Fixed some of the new trained models compatibility issues when python API had mismatching pretrained models compared to scala
========
2.0.7
========
---------------
Overview
---------------
This release addresses bugs related to cluster support, improving error messages and fixing various potential bugs depending
on the cluster configuration, such as Kryo serialization or non-default filesystems.
---------------
Bugfixes
---------------
* Fixed a bug introduced in 2.0.5 that caused NerDL not to work in clusters with Kryo serialization enabled
* NerDLModel was not properly reading user provided config proto bytes during prediction
* Improved cluster embeddings message to hint users of cluster mode without shared filesystems
* Removed lazy model downloading on PretrainedPipeline to download the model at instantiation
* Fixed URI construction for cluster embeddings on non defaultFS configurations, improves cluster compatibility
========
2.0.6
========
---------------
Overview
---------------
Following the 2.0.5 (read notes below), this release fixes a bug when disabling contrib param in NerDLApproach on non-windows OS
---------------
Bugfixes
---------------
* Fixed NerDLApproach failing when training with setUseContrib(false)
========
2.0.5
========
---------------
Overview
---------------
This release bumps Spark NLP by default to Apache Spark 2.4.3. Spark has been undergoing testing with Scala 2.12 and they are back in 2.11 now, so this should be a working release.
In this version, we fixed a series of Pretrained models and focused on improving the flexibility of the NerDL annotator, which is perhaps the most popular one based on user feedback.
Users can point to graphs they create without having to re-compile the library. Graph options, as well as whether to use Tensorflow contrib, are now user defined.
Particular thanks to @CyborgDroid for reporting important bugs in such detail, which helped us improve Spark NLP.
Thank you for reporting issues and feedback, and we always welcome more. Join us on Slack!
---------------
Enhancements
---------------
* ViveknSentiment annotator now includes confidence score in metadata
* NerDL now has setGraphFolder to allow a path to folder with custom generated graphs using python/tensorflow code
* NerDL now has setConfigProtoBytes to allow users to submit their own ConfigProto (serialized) to the graph settings
* NerDLApproach now has setUseContrib to let the training user decide whether or not to use contrib. Contrib LSTM Cells have been shown to return more accurate results, but do not work on Windows yet.
* Updated default tensorflow settings to include GPU allow_growth by default, disabled log device placement spamming message
* Spark version bumped to 2.4.3
---------------
Bugfixes
---------------
* Fixed contrib NerDL models not working properly in clusters such as Databricks (Thanks @CyborgDroid)
* Fixed sparknlp.start(include_ocr=True) missing dependencies for OCR
* Fixed DependencyParser pretrained models not working properly in Python
---------------
Models and Pipelines
---------------
* NerDL will download noncontrib model if windows is detected, for better compatibility
* noncontrib version of pipelines with NerDL have been uploaded, as well as new models. Check documentation for complete list
* Improved error message when user is under windows and trying to load a contrib NerDL model
* Fixed ViveknSentimentModel not working properly (Thanks @CyborgDroid)
---------------
Developer API
---------------
* Embeddings in python moved to annotator module for consistency
* SourceStream ResourceHelper class now properly handles cluster files for Dependency Parser
* Metadata model reader now ignores empty lines instead of failing
* Unified lang instead of language attribute name in pretrained API