This repository has been archived by the owner on Nov 21, 2023. It is now read-only.

The Cantonese (Yue Chinese, yue_Hant) data in FLORES-200 is not Cantonese at all #61

Open
ayaka14732 opened this issue Jun 1, 2023 · 1 comment

ayaka14732 commented Jun 1, 2023

The Cantonese (Yue Chinese, yue_Hant) data in FLORES-200 is completely wrong. It is not Cantonese at all, but Mandarin Chinese in Traditional Chinese script (zho_Hant), and it differs only stylistically from the zho_Hant data already in the dataset.

Furthermore, the paper mentions that yue_Hant and zho_Hant data tend to be predicted as each other. It turns out that both datasets actually consist exclusively of zho_Hant data. Genuine yue_Hant and zho_Hant text should be very easy to distinguish from each other.

Here is what correct yue_Hant data would look like:

| Language Code | Sentence |
| --- | --- |
| eng_Latn | They found the Sun operated on the same basic principles as other stars: The activity of all stars in the system was found to be driven by their luminosity, their rotation, and nothing else. |
| zho_Hant | 他們發現太陽的運作與其他恆星的基本原理相同:系統中所有恆星的活動均受其光度、自轉所推動,就是這麼簡單。 |
| yue_Hant (wrong) | 他們發現,太陽和其他恆星的運行原理是一樣的:系統中所有恆星的活動都是由它們的亮度、自轉驅動的,而並非其他因素。 |
| yue_Hant (corrected) | 佢哋發現,太陽其他恆星運行原理分別:系統入面所有恆星活動都淨係佢哋嘅亮度自轉推動,而包括其他因素。 |

(Bold denotes words that are used exclusively in yue_Hant)
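
To illustrate why the two varieties should be easy to tell apart, here is a minimal sketch that counts a few characters typical of written Cantonese against a few function words typical of written Mandarin. The marker lists, the function names `count_markers` and `guess_variety`, and the decision rule are all assumptions made for this illustration. This is not how FLORES-200 or any existing classifier works, and a real language-identification model would be far more robust.

```python
# Illustrative marker lists (assumptions; neither exhaustive nor authoritative).
CANTONESE_MARKERS = ("佢", "哋", "嘅", "咗", "唔", "喺", "冇", "啲", "嘢", "嗰")
MANDARIN_MARKERS = ("的", "是", "這", "他們", "它們", "沒有")


def count_markers(text, markers):
    """Sum the occurrences of every marker substring in text."""
    return sum(text.count(marker) for marker in markers)


def guess_variety(text):
    """Return 'yue', 'zho', or 'unknown' depending on which marker set dominates."""
    yue = count_markers(text, CANTONESE_MARKERS)
    zho = count_markers(text, MANDARIN_MARKERS)
    if yue > zho:
        return "yue"
    if zho > yue:
        return "zho"
    return "unknown"


# Sentences copied from the table in this issue.
current_yue = "他們發現,太陽和其他恆星的運行原理是一樣的:系統中所有恆星的活動都是由它們的亮度、自轉驅動的,而並非其他因素。"
corrected_yue = "佢哋發現,太陽其他恆星運行原理分別:系統入面所有恆星活動都淨係佢哋嘅亮度自轉推動,而包括其他因素。"

print(guess_variety(current_yue))
print(guess_variety(corrected_yue))
```

Even this crude count separates the sentence currently labelled yue_Hant from the corrected one, since the former contains no Cantonese-specific characters at all.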

@laubonghaudoi

Others have been complaining about this for a long time: https://twitter.com/chaakming/status/1555246138105614336

I guess nobody on the FLORES team knows Cantonese and Mandarin well enough to understand the unique situation of this language. The current data collected for yue is Hong Kong Chinese, NOT Cantonese. We recommend using this classifier to filter for real Cantonese data: https://github.com/CanCLID/cantonese-classifier
