FLORES-101 benchmark and Alternative Spelling rules in some languages, using FLORES-101 for benchmarking embeddings models #26
Hi @Fikavec, thank you for providing such extensive and detailed feedback. I'm not an expert in alternative spelling, but I will consult with some of our linguists to get a more informed opinion. On your specific proposals:
Thanks for your reply and your work, @guzmanhe. My experiments with LASER, USE, LaBSE, distiluse, XLM-R, and M2M100 show that the problem of alternative spelling rules has not yet been solved in AI training and evaluation. Alternative spelling rules are, in effect, the replacement of some characters by others according to tables officially adopted in some countries, nothing more. A word in its alternative spelling and the same word without it are not synonyms or merely close words: they are exactly the same word. But modern multilingual AI systems do not know this (and do not learn it during training), so they make mistakes (lose quality) on sentences containing such words, whether they are embedding models, pre-trained language models, or translators, because they treat the two spellings as synonymous or similar words rather than as identical words obtained by replacing characters according to a rule table. For people, however, the situation is different and no less interesting.
Using German as an example: is the reader a native speaker, or did they learn the language in another country? If a native speaker, are they German or Swiss, and are they old or young? Are they reading an old book or a modern newspaper? All of this matters when it comes to the reader's expectations. From Wikipedia:
Thus a Swiss reader of a modern German-language newspaper certainly does not expect to see Straße, and so on. It seems to me that solving this problem for every language, together with linguists, will be the next step in the development of multilingual AI, and the first step could be creating a specialized test for this problem. In your opinion, should the solution be applied at the preprocessing/tokenization stage (replace all alternative spellings so that the model always receives words in a single spelling, both during training and during inference), via augmentation (balance the training data between words with and without alternative spellings), via the training process itself (fine-tuning on alternative spelling rules), or via the network architecture (special "alternative spelling layers" that learn the replacement tables and produce identical outputs for both spellings)?
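The preprocessing option mentioned above can be sketched very simply. The snippet below is a minimal illustration, assuming a hand-written replacement table; `SPELLING_TABLE` and `normalize_spelling` are hypothetical names, not part of any FLORES tooling, and the table covers only the German ß → ss case (the direction that is unambiguous, since the reverse mapping ss → ß does not hold for every word, e.g. Masse vs. Maße):

```python
# Hypothetical sketch: normalize alternative spellings with a replacement
# table before the model ever sees the text, so training and inference
# always receive a single canonical spelling.
SPELLING_TABLE = {
    "ß": "ss",  # Straße -> Strasse (matches Swiss standard orthography)
}

def normalize_spelling(text: str) -> str:
    """Apply every character replacement in the table, left to right."""
    for old, new in SPELLING_TABLE.items():
        text = text.replace(old, new)
    return text

print(normalize_spelling("Die Straße ist lang."))  # Die Strasse ist lang.
```

Note that the ambiguity of the reverse direction is exactly why a per-language table built with linguists would be needed, rather than a blind character swap.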
How can FLORES be used to evaluate multilingual embedding models (LASER, USE, LaBSE, distiluse)? Could you suggest a suitable metric for this?
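One candidate metric, sketched under stated assumptions: take each sentence, produce its alternative-spelling variant, embed both, and check how close the two embeddings are. A spelling-robust model should score near 1.0 in cosine similarity. The vectors below are placeholders standing in for real model outputs (in practice they would come from LASER, LaBSE, etc.); the function name `cosine_similarity` is just an illustrative helper:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings: imagine embed("... Straße ...") and
# embed("... Strasse ...") from a real multilingual encoder.
emb_original = np.array([0.2, 0.70, 0.10])
emb_variant = np.array([0.2, 0.69, 0.12])

print(cosine_similarity(emb_original, emb_variant))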
Thanks for open-sourcing the FLORES-101 data set. While working with it, I noticed a feature that I want to share here. Some languages have alternative spelling rules, so some words have more than one accepted spelling. This is a well-known feature of German, Danish, and Swedish, as well as of the Traditional-to-Simplified Chinese conversion rules, etc. For example, in German the alternative spelling rules are:
Consider sentence № 991 from the FLORES-101 dev set (deu.dev and eng.dev):
In this sentence:
Therefore the sentence:
is fully equivalent to sentence (1), but not for AI (I have seen this with many AI translation and embedding models):
As far as I can see for German (and other languages?), FLORES-101 was created without examples covering these alternative spelling rules. And today AI is also trained without considering them (I mean via augmentation methods). When training on big datasets like CommonCrawl + Wikipedia, AI receives an unbalanced mix of alternative spellings: the spelling reforms in many languages are relatively recent (on average around 1960-1990, alongside modern dictionaries and the spread of English and the QWERTY keyboard layout), and texts contain the alternative spellings unevenly, for example old books and newspapers without the alternative spelling versus modern books, newspapers, and internet text with it. This leads to the following (where German Straße and its alternative spelling Strasse both mean street in English):
- words != words in alternative spelling (Straße != Strasse) for the AI
- during training, the meaning of the context around Straße and Strasse gets distorted or lost
- after training on a dataset unbalanced in alternative spellings (CC + Wiki), the AI ends up with: Straße = street and Strasse != street, OR Straße != street and Strasse = street, OR Straße != street and Strasse != street
How about extending FLORES-101 (or creating an additional dataset) with sentences in alternative spellings for the languages that have such rules, for benchmarking (and creating a metric to measure quality) in two cases:
How about extending FLORES-101 (or creating an additional dataset) with a metric or test cases to measure the quality of the alignment of language spaces for embedding models like LASER, USE, etc., for tasks other than machine translation: multilingual AI tasks (scientific problems) like classification, similarity measurement, BUCC, and few-shot multilingual learning, as discussed there?
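A companion "alternative spelling" set could be generated mechanically, as a starting point for the proposal above. The sketch below is a hypothetical illustration, not existing FLORES tooling: it applies a per-language replacement table to each dev sentence and keeps only sentences that actually changed, so each benchmark pair differs exactly by the spelling rule. The `RULES` table here covers only the German ß → ss case as an assumed example:

```python
# Hypothetical generator for an alternative-spelling companion set.
# RULES maps a language code to (old, new) character replacements;
# only the unambiguous German direction is included as an example.
RULES = {"deu": [("ß", "ss")]}

def make_variants(lang, sentences):
    """Return (original, variant) pairs for sentences the rules change."""
    pairs = []
    for sent in sentences:
        variant = sent
        for old, new in RULES.get(lang, []):
            variant = variant.replace(old, new)
        if variant != sent:  # keep only sentences affected by the rules
            pairs.append((sent, variant))
    return pairs

dev = ["Die Straße ist lang.", "Das Auto ist rot."]
print(make_variants("deu", dev))
```

A linguist-vetted rule table per language would replace the toy `RULES` dict; the filtering step matters because sentences without affected words tell the benchmark nothing.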