Feature/doc similarity ranker #13858

Merged
44 commits
439ba66
Added doc similarity ranker annotator template
Jan 21, 2023
fda156c
Created ranker model
Feb 16, 2023
e3e7c01
gitignore modified
Feb 16, 2023
d5f6d60
Merge branch 'master' into feature/doc-similarity-ranker
Feb 16, 2023
528de51
Merge branch 'master' into feature/doc-similarity-ranker
Feb 23, 2023
7439062
Merge branch 'master' into feature/doc-similarity-ranker
Feb 27, 2023
7a25fa3
Merge branch 'master' into feature/doc-similarity-ranker
Feb 27, 2023
ee81d37
Added params to LSH models
Feb 27, 2023
b5b9a63
Added BRP LSH as annotator engine
Feb 28, 2023
9053be4
Added replace features col with embeddings
Feb 28, 2023
7dd00bb
Added LSH logic on vector cast
Mar 2, 2023
06a2940
Added skeleton for lsh doc sim ranker - WIP
Mar 3, 2023
ed3edad
Merge branch 'master' into feature/doc-similarity-ranker
Mar 3, 2023
8a91494
Fixed mh3 hash calculation
Mar 4, 2023
6c7c475
Fixed dataset assertions id vs neghbours
Mar 4, 2023
b7742f0
Converting neighbours result string to map
Mar 6, 2023
6969526
Added finisher to extract lsh id and neighbors
Mar 7, 2023
d970632
Labels refactoring
Mar 7, 2023
da50d61
Added distance param to show in rankings
Mar 11, 2023
5c924fc
Added logic to select nearest neighbor
Mar 13, 2023
4c87fd0
Merge branch 'master' into feature/doc-similarity-ranker
Mar 13, 2023
2d3451d
Added identity ranking for debugging
Mar 15, 2023
610dbd0
Merge branch 'master' into feature/doc-similarity-ranker
Mar 15, 2023
632b3ca
Adding Python interface to doc sim ranker approach and model
Mar 18, 2023
04956ca
Merge branch 'master' into feature/doc-similarity-ranker
Mar 18, 2023
abc3ff4
WIP - Python interface
Mar 25, 2023
e66d8f2
Merge branch 'master' into feature/doc-similarity-ranker
Mar 25, 2023
625c643
WIP - fixed umbalanced embeddings Py test
Mar 27, 2023
a227406
Merge branch 'master' into feature/doc-similarity-ranker
Mar 27, 2023
eef709e
Added MinHash engine to doc sim ranker
Mar 28, 2023
fcb7068
Merge branch 'master' into feature/doc-similarity-ranker
Apr 5, 2023
12aa799
Merge branch 'master' into feature/doc-similarity-ranker
Apr 12, 2023
5a38c8e
Fixed serde for ranker map params
Apr 13, 2023
d7e5f43
Merge branch 'master' into feature/doc-similarity-ranker
Apr 13, 2023
72ebae7
Clean up pytests
Apr 18, 2023
cca03f0
Merge branch 'master' into feature/doc-similarity-ranker
Apr 18, 2023
e6e9497
Added doc sim ranker finisher Python interface
Apr 20, 2023
5b61a13
Merge branch 'master' into feature/doc-similarity-ranker
Apr 29, 2023
f0a8688
Merge branch 'master' into feature/doc-similarity-ranker
May 6, 2023
1488c8f
Merge branch 'master' into feature/doc-similarity-ranker
May 29, 2023
d8adc5c
Merge branch 'master' into feature/doc-similarity-ranker
Jun 15, 2023
f1a8e38
stabilized tests for doc sim ranker
Jun 19, 2023
d8f4ed9
Moved and enriched test for doc sim ranker
Jul 1, 2023
62a2945
Bumped version 5.0.0 in doc sim ranker test
Jul 1, 2023
232 changes: 232 additions & 0 deletions python/sparknlp/annotator/similarity/document_similarity_ranker.py
@@ -0,0 +1,232 @@
# Copyright 2017-2023 John Snow Labs
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Contains classes for DocumentSimilarityRanker."""

from sparknlp.common import *
from pyspark import keyword_only
from pyspark.ml.param import TypeConverters, Params, Param
from sparknlp.internal import AnnotatorTransformer


class DocumentSimilarityRankerApproach(AnnotatorApproach, HasEnableCachingProperties):
    inputAnnotatorTypes = [AnnotatorType.SENTENCE_EMBEDDINGS]

    outputAnnotatorType = AnnotatorType.DOC_SIMILARITY_RANKINGS

    similarityMethod = Param(Params._dummy(),
                             "similarityMethod",
                             "The similarity method used to calculate the neighbours. (Default: 'brp', "
                             "Bucketed Random Projection for Euclidean Distance)",
                             typeConverter=TypeConverters.toString)

    numberOfNeighbours = Param(Params._dummy(),
                               "numberOfNeighbours",
                               "The number of neighbours the model will return (Default: `10`)",
                               typeConverter=TypeConverters.toInt)

    bucketLength = Param(Params._dummy(),
                         "bucketLength",
                         "The bucket length that controls the average size of hash buckets. "
                         "A larger bucket length (i.e., fewer buckets) increases the probability of features "
                         "being hashed to the same bucket (increasing the numbers of true and false positives).",
                         typeConverter=TypeConverters.toFloat)

    numHashTables = Param(Params._dummy(),
                          "numHashTables",
                          "number of hash tables, where increasing number of hash tables lowers the "
                          "false negative rate, and decreasing it improves the running performance.",
                          typeConverter=TypeConverters.toInt)

    visibleDistances = Param(Params._dummy(),
                             "visibleDistances",
                             "Whether to set visibleDistances in ranking output (Default: `false`).",
                             typeConverter=TypeConverters.toBoolean)

    identityRanking = Param(Params._dummy(),
                            "identityRanking",
                            "Whether to include identity in ranking result set. Useful for debug. (Default: `false`).",
                            typeConverter=TypeConverters.toBoolean)

    def setSimilarityMethod(self, value):
        """Sets the similarity method used to calculate the neighbours.
        (Default: `"brp"`, Bucketed Random Projection for Euclidean Distance)

        Parameters
        ----------
        value : str
            The similarity method used to calculate the neighbours.
        """
        return self._set(similarityMethod=value)

    def setNumberOfNeighbours(self, value):
        """Sets the number of neighbours the model will return for each document (Default: `10`).

        Parameters
        ----------
        value : int
            The number of neighbours the model will return for each document.
        """
        return self._set(numberOfNeighbours=value)

    def setBucketLength(self, value):
        """Sets the bucket length that controls the average size of hash buckets (Default: `2.0`).

        Parameters
        ----------
        value : float
            The bucket length that controls the average size of hash buckets.
        """
        return self._set(bucketLength=value)

    def setNumHashTables(self, value):
        """Sets the number of hash tables (Default: `3`).

        Parameters
        ----------
        value : int
            The number of hash tables.
        """
        return self._set(numHashTables=value)

    def setVisibleDistances(self, value):
        """Sets whether the document distances are visible in the result set (Default: `False`).

        Parameters
        ----------
        value : bool
            Whether the document distances are visible in the result set.
        """
        return self._set(visibleDistances=value)

    def setIdentityRanking(self, value):
        """Sets whether the document's own identity ranking is included in the result set
        (Default: `False`). Useful for debugging.

        Parameters
        ----------
        value : bool
            Whether the document's own identity ranking is included in the result set.
        """
        return self._set(identityRanking=value)

    @keyword_only
    def __init__(self):
        super(DocumentSimilarityRankerApproach, self)\
            .__init__(classname="com.johnsnowlabs.nlp.annotators.similarity.DocumentSimilarityRankerApproach")
        self._setDefault(
            similarityMethod="brp",
            numberOfNeighbours=10,
            bucketLength=2.0,
            numHashTables=3,
            visibleDistances=False,
            identityRanking=False
        )

    def _create_model(self, java_model):
        return DocumentSimilarityRankerModel(java_model=java_model)


class DocumentSimilarityRankerModel(AnnotatorModel, HasEmbeddingsProperties):

    name = "DocumentSimilarityRankerModel"
    inputAnnotatorTypes = [AnnotatorType.SENTENCE_EMBEDDINGS]
    outputAnnotatorType = AnnotatorType.DOC_SIMILARITY_RANKINGS

    def __init__(self, classname="com.johnsnowlabs.nlp.annotators.similarity.DocumentSimilarityRankerModel",
                 java_model=None):
        super(DocumentSimilarityRankerModel, self).__init__(
            classname=classname,
            java_model=java_model
        )


class DocumentSimilarityRankerFinisher(AnnotatorTransformer):

    inputCols = Param(Params._dummy(),
                      "inputCols",
                      "name of input annotation cols containing document similarity ranker results",
                      typeConverter=TypeConverters.toListString)
    outputCols = Param(Params._dummy(),
                       "outputCols",
                       "output DocumentSimilarityRankerFinisher output cols",
                       typeConverter=TypeConverters.toListString)
    extractNearestNeighbor = Param(Params._dummy(), "extractNearestNeighbor",
                                   "whether to extract the nearest neighbor document",
                                   typeConverter=TypeConverters.toBoolean)

    name = "DocumentSimilarityRankerFinisher"

    @keyword_only
    def __init__(self):
        super(DocumentSimilarityRankerFinisher, self).__init__(classname="com.johnsnowlabs.nlp.finisher.DocumentSimilarityRankerFinisher")
        self._setDefault(
            extractNearestNeighbor=False
        )

    @keyword_only
    def setParams(self):
        kwargs = self._input_kwargs
        return self._set(**kwargs)

    def setInputCols(self, *value):
        """Sets names of input annotation columns containing document similarity ranker results.

        Parameters
        ----------
        *value : str
            Input columns for the annotator
        """

        # Accept either a single list argument or several string arguments.
        if len(value) == 1 and isinstance(value[0], list):
            return self._set(inputCols=value[0])
        else:
            return self._set(inputCols=list(value))

    def setOutputCols(self, *value):
        """Sets names of finished output columns.

        Parameters
        ----------
        *value : str
            Output columns for the annotator
        """

        # Accept either a single list argument or several string arguments.
        if len(value) == 1 and isinstance(value[0], list):
            return self._set(outputCols=value[0])
        else:
            return self._set(outputCols=list(value))

    def setExtractNearestNeighbor(self, value):
        """Sets whether to extract the nearest neighbor document, by default False.

        Parameters
        ----------
        value : bool
            Whether to extract the nearest neighbor document
        """

        return self._set(extractNearestNeighbor=value)

    def getInputCols(self):
        """Gets the names of the input annotation columns."""
        return self.getOrDefault(self.inputCols)

    def getOutputCols(self):
        """Gets the names of the output columns, falling back to "finished_" + input column name."""
        if len(self.getOrDefault(self.outputCols)) == 0:
            return ["finished_" + input_col for input_col in self.getInputCols()]
        else:
            return self.getOrDefault(self.outputCols)
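
The column handling in `DocumentSimilarityRankerFinisher` can be checked without a Spark session. The following is a minimal sketch of that logic only: `normalize_cols` and `default_output_cols` are hypothetical helper names introduced for illustration and are not part of this PR or of Spark NLP.

```python
from typing import List, Sequence, Union


def normalize_cols(*value: Union[str, List[str]]) -> List[str]:
    """Mirror setInputCols/setOutputCols: accept one list or several strings.

    Hypothetical helper, written only to illustrate the varargs handling.
    """
    if len(value) == 1 and isinstance(value[0], list):
        return value[0]
    return list(value)


def default_output_cols(input_cols: Sequence[str], output_cols: Sequence[str]) -> List[str]:
    """Mirror getOutputCols: prefix input names with "finished_" when no outputs are set.

    Hypothetical helper, written only to illustrate the fallback.
    """
    if len(output_cols) == 0:
        return ["finished_" + input_col for input_col in input_cols]
    return list(output_cols)


# Both call styles produce the same list of column names.
from_varargs = normalize_cols("finished_id", "finished_neighbors")
from_list = normalize_cols(["finished_id", "finished_neighbors"])

# With no explicit output columns, names are derived from the input columns.
derived = default_output_cols(["doc_similarity_rankings"], [])
```

Supporting both call styles lets `setOutputCols("a", "b")` and `setOutputCols(["a", "b"])` behave identically, which is how the test in this PR calls it.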
1 change: 1 addition & 0 deletions python/sparknlp/common/annotator_type.py
@@ -35,3 +35,4 @@ class AnnotatorType(object):
    NODE = "node"
    TABLE = "table"
    DUMMY = "dummy"
    DOC_SIMILARITY_RANKINGS = "doc_similarity_rankings"
223 changes: 223 additions & 0 deletions python/sparknlp/lib/test_doc_sim_ranker.ipynb

Large diffs are not rendered by default.

90 changes: 90 additions & 0 deletions python/test/annotator/similarity/doc_similarity_ranker_test.py
@@ -0,0 +1,90 @@
# Copyright 2017-2022 John Snow Labs
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import unittest

import pytest

from sparknlp.annotator import *
from sparknlp.annotator.similarity.document_similarity_ranker import *
from sparknlp.base import *
from test.util import SparkSessionForTest


@pytest.mark.slow
class DocumentSimilarityRankerTestSpec(unittest.TestCase):
    def setUp(self):
        self.spark = SparkSessionForTest.spark

        self.data = SparkSessionForTest.spark.createDataFrame([
            ["First document, this is my first sentence. This is my second sentence."],
            ["Second document, this is my second sentence. This is my second sentence."],
            ["Third document, climate change is arguably one of the most pressing problems of our time."],
            ["Fourth document, climate change is definitely one of the most pressing problems of our time."],
            ["Fifth document, Florence in Italy, is among the most beautiful cities in Europe."],
            ["Sixth document, Florence in Italy, is a very beautiful city in Europe like Lyon in France."],
            ["Seventh document, the French Riviera is the Mediterranean coastline of the southeast corner of France."],
            ["Eighth document, the warmest place in France is the French Riviera coast in Southern France."]
        ]).toDF("text")

    def runTest(self):
        document_assembler = DocumentAssembler() \
            .setInputCol("text") \
            .setOutputCol("document")
        sentence_detector = SentenceDetector() \
            .setInputCols(["document"]) \
            .setOutputCol("sentence")
        tokenizer = Tokenizer() \
            .setInputCols(["sentence"]) \
            .setOutputCol("token")

        sentence_embeddings = RoBertaSentenceEmbeddings.pretrained() \
            .setInputCols(["document"]) \
            .setOutputCol("sentence_embeddings")

        document_similarity_ranker = DocumentSimilarityRankerApproach() \
            .setInputCols("sentence_embeddings") \
            .setOutputCol("doc_similarity_rankings") \
            .setSimilarityMethod("brp") \
            .setNumberOfNeighbours(10) \
            .setBucketLength(2.0) \
            .setNumHashTables(3) \
            .setVisibleDistances(True) \
            .setIdentityRanking(True)

        document_similarity_ranker_finisher = DocumentSimilarityRankerFinisher() \
            .setInputCols("doc_similarity_rankings") \
            .setOutputCols(
                "finished_doc_similarity_rankings_id",
                "finished_doc_similarity_rankings_neighbors") \
            .setExtractNearestNeighbor(True)

        pipeline = Pipeline(stages=[
            document_assembler,
            sentence_detector,
            tokenizer,
            sentence_embeddings,
            document_similarity_ranker,
            document_similarity_ranker_finisher
        ])

        model = pipeline.fit(self.data)

        (
            model
            .transform(self.data)
            .select("text",
                    "finished_doc_similarity_rankings_id",
                    "finished_doc_similarity_rankings_neighbors")
            .show(10, False)
        )
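
For readers reviewing the `bucketLength` and `numHashTables` settings used in this test, here is a rough sketch of how Bucketed Random Projection hashing works in general: each hash table projects the vector onto a random unit vector and takes `floor(x · v / bucketLength)`. This is plain Python written for illustration; it is not the Scala code added in this PR, and `random_unit_vectors` and `brp_hash` are hypothetical names.

```python
import math
import random
from typing import List, Sequence


def random_unit_vectors(num_hash_tables: int, dim: int, seed: int = 42) -> List[List[float]]:
    """Draw one random unit vector per hash table (illustrative helper, not from the PR)."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(num_hash_tables):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(c * c for c in v))
        vectors.append([c / norm for c in v])
    return vectors


def brp_hash(x: Sequence[float],
             projections: Sequence[Sequence[float]],
             bucket_length: float) -> List[int]:
    """Hash x once per projection: floor(dot(x, v) / bucket_length)."""
    return [
        math.floor(sum(a * b for a, b in zip(x, v)) / bucket_length)
        for v in projections
    ]


# Axis-aligned projections keep the arithmetic easy to follow by hand.
axes = [[1.0, 0.0], [0.0, 1.0]]
narrow = brp_hash([3.0, -1.0], axes, bucket_length=2.0)   # floor(1.5), floor(-0.5)
wide = brp_hash([3.0, -1.0], axes, bucket_length=10.0)    # floor(0.3), floor(-0.1)

# Random projections, one per hash table (3 mirrors setNumHashTables(3) in the test).
tables = random_unit_vectors(num_hash_tables=3, dim=2)
signature = brp_hash([3.0, -1.0], tables, bucket_length=2.0)
```

A larger `bucket_length` maps more vectors to the same bucket, which matches the parameter description in the annotator: more candidate pairs, so more true and false positives. More hash tables give each pair more chances to collide, which lowers the false negative rate at the cost of extra work.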
2 changes: 1 addition & 1 deletion src/main/scala/com/johnsnowlabs/nlp/AnnotatorType.scala
@@ -38,5 +38,5 @@ object AnnotatorType {
  val NODE = "node"
  val TABLE = "table"
  val DUMMY = "dummy"

  val DOC_SIMILARITY_RANKINGS = "doc_similarity_rankings"
}