FLORES-101 benchmark and Alternative Spelling rules in some languages, using FLORES-101 for benchmarking embeddings models #26
Hi @Fikavec, thank you for providing such extensive and detailed feedback. I'm not an expert in alternative spelling, but I will consult with some of our linguists to get a more informed opinion. On your specific proposals:
Thanks for your reply and your work, @guzmanhe. My experiments with LASER, USE, LaBSE, distiluse, XLM-R, and M2M100 show that the problem of alternative spelling rules has not yet been solved in AI training and evaluation. Alternative spelling rules are, in effect, the replacement of some characters by others according to tables officially adopted in some countries, nothing more. A word in its alternative spelling and the same word without it are not synonyms or merely close words: they are exactly the same word. But modern multilingual AI systems do not know this (and do not learn it during training), so they make mistakes (lose quality) on sentences containing such words, whether they are embedding models, pre-trained language models, or translators, because they treat the two spellings as synonymous or similar words rather than as identical words obtained by replacing characters according to a rule table. For people, however, the situation is different and no less interesting.
Using German as an example: is the reader a native speaker, or did they learn the language in another country? If a native speaker, are they German or Swiss, and are they old or young? Are they reading an old book or a modern newspaper? All of this matters when it comes to the reader's expectations. From Wikipedia:
Thus a Swiss reader of a modern German-language newspaper certainly does not expect to see Straße, and so on. It seems to me that solving this problem for every language, together with linguists, will be the next step in the development of multilingual AI, and the first step could be creating a specialized test for this problem. In your opinion, should the solution be applied at the preprocessing/tokenization stage (replace all alternative spellings so that the model always receives words in a single spelling, both during training and during inference), via augmentation (balance the training data between words with and without alternative spellings), via the training process itself (fine-tuning on alternative spelling rules), or via the network architecture (special "alternative spelling layers" that learn the replacement tables and produce identical outputs for both spellings)?
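The preprocessing option mentioned above can be sketched very simply. The snippet below is a minimal illustration, assuming a hand-written replacement table; `SPELLING_TABLE` and `normalize_spelling` are hypothetical names, not part of any FLORES tooling, and the table covers only the German ß → ss case (the direction that is unambiguous, since the reverse mapping ss → ß does not hold for every word, e.g. Masse vs. Maße):

```python
# Hypothetical sketch: normalize alternative spellings with a replacement
# table before the model ever sees the text, so training and inference
# always receive a single canonical spelling.
SPELLING_TABLE = {
    "ß": "ss",  # Straße -> Strasse (matches Swiss standard orthography)
}

def normalize_spelling(text: str) -> str:
    """Apply every character replacement in the table, left to right."""
    for old, new in SPELLING_TABLE.items():
        text = text.replace(old, new)
    return text

print(normalize_spelling("Die Straße ist lang."))  # Die Strasse ist lang.
```

Note that the ambiguity of the reverse direction is exactly why a per-language table built with linguists would be needed, rather than a blind character swap.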
How can FLORES be used to evaluate multilingual embedding models (LASER, USE, LaBSE, distiluse)? Could you suggest a suitable metric for this?
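One candidate metric, sketched under stated assumptions: take each sentence, produce its alternative-spelling variant, embed both, and check how close the two embeddings are. A spelling-robust model should score near 1.0 in cosine similarity. The vectors below are placeholders standing in for real model outputs (in practice they would come from LASER, LaBSE, etc.); the function name `cosine_similarity` is just an illustrative helper:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Placeholder embeddings: imagine embed("... Straße ...") and
# embed("... Strasse ...") from a real multilingual encoder.
emb_original = np.array([0.2, 0.70, 0.10])
emb_variant = np.array([0.2, 0.69, 0.12])

print(cosine_similarity(emb_original, emb_variant))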
Thanks for open-sourcing the FLORES-101 data set. While working with it, I noticed a feature that I want to share here. Some languages have alternative spelling rules, so some words have more than one accepted spelling. This is a well-known feature of German, Danish, and Swedish, as well as of the Traditional-to-Simplified Chinese conversion rules, etc. For example, in German the alternative spelling rules are:
Consider sentence № 991 from the FLORES-101 dev set (deu.dev and eng.dev):
In this sentence:
Therefore the sentence:
is fully equivalent to sentence (1), but not for AI (I have seen this with many AI translation and embedding models):
As far as I can see for German (and other languages?), FLORES-101 was created without examples covering these alternative spelling rules. And today AI is also trained without considering them (I mean via augmentation methods). When training on big datasets like CommonCrawl + Wikipedia, AI receives an unbalanced mix of alternative spellings: the spelling reforms in many languages are relatively recent (on average around 1960-1990, alongside modern dictionaries and the spread of English and the QWERTY keyboard layout), and texts contain the alternative spellings unevenly, for example old books and newspapers without the alternative spelling versus modern books, newspapers, and internet text with it. This leads to the following (where German Straße and its alternative spelling Strasse both mean street in English):
- words != words in alternative spelling (Straße != Strasse) for the AI
- during training, the meaning of the context around Straße and Strasse gets distorted or lost
- after training on a dataset unbalanced in alternative spellings (CC + Wiki), the AI ends up with: Straße = street and Strasse != street, OR Straße != street and Strasse = street, OR Straße != street and Strasse != street
How about extending FLORES-101 (or creating an additional dataset) with sentences in alternative spellings for the languages that have such rules, for benchmarking (and creating a metric to measure quality) in two cases:
How about extending FLORES-101 (or creating an additional dataset) with a metric or test cases to measure the quality of the alignment of language spaces for embedding models like LASER, USE, etc., for tasks other than machine translation: multilingual AI tasks (scientific problems) like classification, similarity measurement, BUCC, and few-shot multilingual learning, as discussed there?
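A companion "alternative spelling" set could be generated mechanically, as a starting point for the proposal above. The sketch below is a hypothetical illustration, not existing FLORES tooling: it applies a per-language replacement table to each dev sentence and keeps only sentences that actually changed, so each benchmark pair differs exactly by the spelling rule. The `RULES` table here covers only the German ß → ss case as an assumed example:

```python
# Hypothetical generator for an alternative-spelling companion set.
# RULES maps a language code to (old, new) character replacements;
# only the unambiguous German direction is included as an example.
RULES = {"deu": [("ß", "ss")]}

def make_variants(lang, sentences):
    """Return (original, variant) pairs for sentences the rules change."""
    pairs = []
    for sent in sentences:
        variant = sent
        for old, new in RULES.get(lang, []):
            variant = variant.replace(old, new)
        if variant != sent:  # keep only sentences affected by the rules
            pairs.append((sent, variant))
    return pairs

dev = ["Die Straße ist lang.", "Das Auto ist rot."]
print(make_variants("deu", dev))
```

A linguist-vetted rule table per language would replace the toy `RULES` dict; the filtering step matters because sentences without affected words tell the benchmark nothing.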