LaBSE Sentence Embeddings model output vectors are not equal to original #2846
We can see differences in embeddings on examples that mean the same thing, taken from the Spark NLP 20-language multilingual classification demo:

Spark NLP LaBSE embeddings cosine similarity heatmap (Colab code to reproduce attached). As we can see, the maximum difference between Spark NLP and the original TF-Hub model appears in sentence 4 in Vietnamese. This suggests that the difference between the Spark NLP and original models lies not only in max_seq_length and lowercased sentences, but also in the tokenization process and other details (use of tf.float64, padding short sequences with 0.0f, ...)?

To simplify debugging, attached are a CSV with sentences sorted by the distance between the Spark NLP embeddings and the original TF-Hub model embeddings, and a Colab to reproduce:

These results were obtained with tf.float64 enabled, max_seq_length = 128, and lowercased input sentences fed to the original TF-Hub model (lowercased input sentences only; don't touch the do_lower_case option in the original model example code, because with do_lower_case=True this model produces much worse embeddings than with simply lowercased input) on the tasks referenced above. Maybe the main difference from the original model is in Spark NLP's multilingual tokenization procedure, and the embedding differences appear only when input sentences are in certain languages: Vietnamese, Japanese, Urdu, ...?
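The per-sentence comparison above boils down to cosine similarity between the two models' vectors. A minimal sketch with NumPy (the vectors here are toy placeholders; real LaBSE embeddings are 768-dimensional):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholders for one sentence's embedding from each implementation.
spark_nlp_vec = [0.12, -0.30, 0.45, 0.08]
tf_hub_vec    = [0.11, -0.29, 0.46, 0.07]

sim = cosine_similarity(spark_nlp_vec, tf_hub_vec)
print(round(sim, 4))  # close to 1.0 when the two implementations agree
```

Running this over every sentence pair and plotting the resulting matrix gives the heatmap mentioned above.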
These are very interesting findings, thanks for sharing them. A couple of things come to my mind: Some explanations:
Some strong possibilities:
I suspect the +2/−2 swing in the English F1-score is caused by the custom tokenization (maybe use RegexTokenizer for simple whitespace tokenization instead of Tokenizer), and the big F1-score difference in the multilingual setting is caused by poor multilingual tokenization (to be confirmed if you can compute the metrics per language, to compare how well or badly each fares in Spark NLP). Will continue debugging since this is a very useful multilingual embedding.
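To illustrate why the choice of tokenizer matters, here is a small sketch (plain Python `re`, not Spark NLP's actual annotators) contrasting whitespace-only splitting with a rule that also breaks off punctuation; any mismatch with the tokenization the model saw at training time shifts the embeddings:

```python
import re

def whitespace_tokenize(text):
    # Split on runs of whitespace only, in the spirit of a RegexTokenizer
    # configured with a "\s+" pattern.
    return re.split(r"\s+", text.strip())

def naive_tokenize(text):
    # Also splits off punctuation, as a rule-based word tokenizer might.
    return re.findall(r"\w+|[^\w\s]", text, re.UNICODE)

sent = "Tôi yêu xử lý ngôn ngữ tự nhiên."  # Vietnamese: "I love NLP."
print(whitespace_tokenize(sent))  # final token keeps the period attached
print(naive_tokenize(sent))       # period becomes its own token
```

For languages written without spaces between words, the gap between tokenization schemes is far larger, which is consistent with the biggest embedding differences showing up in those languages.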
Just tagging you @C-K-Loan as an FYI, to be sure.
Thanks for your reply. For this reason, and in order to avoid such hard-to-find differences and inaccuracies when integrating multilingual models in the future, and to assess the quality of multilingual embeddings, I propose two methods:

I think these techniques will be useful not only for resolving the differences in the LaBSE model, but also for integrating any multilingual model in the future (USE, LASER, Sentence Transformers models, etc.) as simple, fast, standard tests of the correctness of the integration process.
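The proposed correctness test could be sketched like this: given embeddings of the same sentence in several languages, assert that every translation pair is more similar than any cross-meaning pair. The vectors below are toy stand-ins for model output:

```python
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Keyed by (meaning, language); in a real test these come from the model
# under test (e.g. the Spark NLP pipeline) for a parallel sentence set.
emb = {
    ("hello", "en"):   np.array([0.90, 0.10, 0.00]),
    ("hello", "vi"):   np.array([0.88, 0.12, 0.02]),
    ("goodbye", "en"): np.array([0.10, 0.90, 0.00]),
}

same  = cos(emb[("hello", "en")], emb[("hello", "vi")])
cross = cos(emb[("hello", "en")], emb[("goodbye", "en")])
assert same > cross, "translations should be closer than unrelated sentences"
```

Run once per language pair, such a check would flag languages where the integration (tokenization, casing, padding) diverges from the original model, before the model is released.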
For the test examples from the section "Model with 20 languages!" of the 2 Class Stock Market Sentiment Classifier Training notebook, I got the following results with the original TF-Hub model embeddings (with max_seq_length = 128):
Thanks, we will investigate this further. The big gap between macro and micro in Spark NLP indicates some classes did well but some did very badly, which means the tokenizer may not be doing so well on some languages. @C-K-Loan Could you please run the test separating those languages into two different groups? (Some with Tokenizer and the rest with WordSegmenter.)
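The macro/micro gap mentioned above is easy to reproduce on a toy example: macro-F1 weights every class equally, so one badly handled class (think: one badly tokenized language) drags it down, while micro-F1 (which equals accuracy in single-label tasks) stays high. A minimal sketch in plain Python:

```python
def f1_per_class(y_true, y_pred, label):
    """Per-class F1 from raw label lists."""
    tp = sum(t == p == label for t, p in zip(y_true, y_pred))
    fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
    fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Toy data: class "c" (a rare, badly tokenized language, say) is always missed.
y_true = ["a"] * 8 + ["b"] * 8 + ["c"] * 2
y_pred = ["a"] * 8 + ["b"] * 8 + ["a"] * 2

labels = ["a", "b", "c"]
macro = sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)
micro = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(round(macro, 3), round(micro, 3))  # -> 0.63 0.889
```

Two completely missed examples out of eighteen barely dent the micro score but cost a full third of the macro score, which is why the gap points at class-specific (here, language-specific) failures.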
This option is very useful, but as I understand it, by default in the Spark NLP model config this model is not caseSensitive:

As far as I know:

My test results show that in this model, same-meaning sentences in many languages are closer to each other when not lowercased, and the distance between a sentence and its lowercased version is sometimes not so small. I think caseSensitive: false by default for this model may give better results on small train/test datasets (if so, it could be shown separately in the usage examples for this model), but in general this may not be so good, since it disables an advantage of the original model and may lead to the inaccuracies above. Or am I wrong?
That’s true; I think that param was set to false by mistake when the model was saved. Since the current version is v1 for TF 1.x, we will make a new one for TF 2.x from version 2 of the model, and make sure that param is set to true when we upload it.
Using the power of Wikipedia interlanguage links, I have prepared a multilingual test for the future, with sentences about the same thing in 171 languages; 108 of these languages are supported by LaBSE (I could not find a sentence in the wo (Wolof) language).
*The test sentence's language is indicated at the top of the images.
I have updated the above multilingual dataset with code for debugging and an idea for future auto-testing. Thank you for the great work on this amazing project and the many interesting workshops. Looking forward to Multilingual T5 (mT5) and future releases.
Facebook AI has open-sourced the FLORES-101 dataset, consisting of 3001 sentences translated into 101 languages by professional translators (more info). It may be a great dataset for measuring the quality of multilingual embeddings, and a starting point for creating the auto-tests I described earlier.
This issue is stale because it has been open 120 days with no activity. Remove the stale label or comment, or this will be closed in 5 days.
I'm using embeddings from the example https://nlp.johnsnowlabs.com/2020/09/23/labse.html, and the output vectors, although close, are not equal to the original vectors from https://tfhub.dev/google/LaBSE/1. Why?

How were the original vectors converted to this model? Maybe the original model was modified or fine-tuned, or Spark NLP uses a different normalization? On example multi-class/multi-label tasks, Spark NLP LaBSE embeddings behave differently from the original vectors; how can those vectors be obtained from the original model? The original vectors are case sensitive, but in the Spark NLP config these vectors are case insensitive; the original model's max_seq_length = 64, but in Spark NLP it is 128... any other differences? To reproduce the problem: run the code from the original model page with ["I love NLP", "Many thanks"] and compare the outputs to the outputs from the Spark NLP model page.

How can the Spark NLP model be opened with tf.saved_model.load after unzipping the .pb file, to run inference in TensorFlow for a "low-level" comparison of outputs and analysis of the problem? (TensorFlow 2.x can't load the unzipped bert_sentence_tensorflow model.) Maybe the difference/problem occurs in the converted TF model?
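The "close but not equal" observation can be made precise with NumPy: check strict equality, equality within a tolerance, and the worst per-component gap. The arrays below are placeholders for the two models' outputs for the two test sentences:

```python
import numpy as np

# Placeholder outputs; in practice these come from the Spark NLP pipeline
# and from the TF-Hub model for the same two input sentences.
spark_nlp_out = np.array([[0.021, -0.113, 0.507], [0.301, 0.044, -0.220]])
tf_hub_out    = np.array([[0.020, -0.115, 0.509], [0.300, 0.046, -0.221]])

exact = np.array_equal(spark_nlp_out, tf_hub_out)          # strict equality
close = np.allclose(spark_nlp_out, tf_hub_out, atol=1e-2)  # equal within tolerance
max_abs_diff = float(np.max(np.abs(spark_nlp_out - tf_hub_out)))

print(exact, close, round(max_abs_diff, 3))  # -> False True 0.002
```

Reporting the maximum absolute difference per sentence (rather than just pass/fail) also shows which inputs diverge most, which is how the Vietnamese outlier above was found.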
Notebook to reproduce:
Compare_outputs_of_Spark_nlp_LaBSE_embeddings_and_Original_TF_hub_LaBSE_embeddings.zip