Verify that the locale data use the correct characters #1934
Comments
My only concern is that if this is maintained in a separate project (validatorjs), then any time we need to whitelist a new character for a specific language, we need to wait for validatorjs to accept the change and make a new release before we can release the related change in faker.js.
Also, it's not very obvious which letters to allow for different languages. For example, do you allow é for
Also, this might catch the case where, say, an English locale uses a Cyrillic A, but not where a Russian locale uses a Latin A (because you'll certainly have to allow Latin a-zA-Z in all locales).
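One partial way around the Latin-in-every-locale problem is to look for words that mix scripts, rather than checking each character against a per-locale whitelist. The sketch below is only an illustration of that idea, not something proposed in this thread: `findMixedScriptWords` is a hypothetical helper, and it relies on the Unicode script property escapes built into JavaScript regular expressions. It would not flag a word written entirely in Latin letters inside a Russian locale, which has to stay allowed anyway.

```javascript
// Hypothetical helper: returns the words in `text` that contain
// both Latin and Cyrillic letters (a likely homoglyph mistake).
function findMixedScriptWords(text) {
  const latin = /\p{Script=Latin}/u;
  const cyrillic = /\p{Script=Cyrillic}/u;
  // Split on anything that is not a letter or a combining mark.
  return text
    .split(/[^\p{L}\p{M}]+/u)
    .filter((word) => latin.test(word) && cyrillic.test(word));
}

// "\u0410lice" starts with Cyrillic А (U+0410); "Bob" is pure Latin.
console.log(findMixedScriptWords('\u0410lice Bob'));

// A Cyrillic name with a Latin "a" (U+0061) as its second letter,
// followed by a purely Cyrillic name.
console.log(
  findMixedScriptWords('\u041C\u0061\u0440\u0438\u044F \u0418\u0432\u0430\u043D')
);
```

Values such as "2A" pass this check because digits are not letters, so only the Latin "A" is inspected.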
We still have to do some char stripping on our side anyway (e.g. the pattern placeholders), so temporarily stripping that one character as well won't hurt. It will at least raise our awareness that the character might be a special case.
For simplicity reasons, I would remove that value then (because that sounds like a French name to me).
Why would we? (We would allow them in special locations like patterns, but otherwise I think we won't.)
English has many loanwords that keep their diacritics in normal usage, such as crêpe or naïve, or when spelling other country names like Côte d'Ivoire. We shouldn't pretend English is ASCII only. And languages with other scripts use Latin characters for things like building numbers or scientific uses, even if they are not used much in regular text. My apartment number in Thai is 2A even though the rest of the address uses Thai characters.
#1520 is another good example. Rather than remove jalapeño, we modified the code that relied on it to strip accents when they can't be used in ASCII contexts.
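For illustration, accent stripping for ASCII-only contexts can be sketched with Unicode normalization. This is an assumption about one common approach, not necessarily what the change for #1520 does; `stripAccents` is a hypothetical name.

```javascript
// Hypothetical helper: decomposes accented letters (NFD) and then
// removes the combining marks, leaving the base letters behind.
function stripAccents(text) {
  return text.normalize('NFD').replace(/\p{M}/gu, '');
}

console.log(stripAccents('jalape\u00F1o')); // ñ becomes n
console.log(stripAccents("C\u00F4te d'Ivoire")); // ô becomes o
```

Letters without a canonical decomposition, such as Ø, are left unchanged by this approach, so a caller that needs strict ASCII would still have to handle them separately.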
Yeah, you are right. We shouldn't remove them, but we should check whether they are correct.
You could generate some kind of snapshot test, such as a dict of all unique characters used in each locale, like the one below. Then it would be obvious if new characters were accidentally added.
{
af_ZA: ' #-.01234568ABCDEFGHIJKLMNOPRSTUVWYZ_abcdefghijklmnopqrstuvwxyz{}',
ar: ' #()-.13T_acefilmnoprstuxy{}،ءآأؤإئابةتثجحخدذرزسشصضطظعغـفقكلمنهوىيَُِّْ',
az: ' #(),-.289ABCDEFGHJKLMNOPQRSTUVXYZ_abcdefghijklmnopqrstuvxyz{}ÇÖÜçöüğİıŞşƏəадеимнрсту’',
cz: ' #()+-./0123456789ABCDEFGHIJKLMNOPQRSTUVWXZ_abcdefghijklmnopqrstuvwxyz{}ÁÍÚáéíóöúüýČčĎďěňŘřŠšťůűŽž',
de: " #&'()+,-.0149:ABCDEFGHIJKLMNOPQRSTUVWXYZ\\_abcdefghijklmnopqrstuvwxyz{}ÄÖÜßàãäéíöúü",
de_AT: ' #&()+,-.01346ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz{}ÄÖÜßãäéíöúü',
de_CH: ' #&+,-./0123456789ABCDEFGHIJKLMNOPRSTUVWYZ_abcdefghijklmnopqrstuvwxyz{}äçèéôöü',
dv: ' #&+-.03456789_acdefijlmnoprstuvxy{}ހށނރބޅކއވމފދތލގޏސޑޒޓޔޕޖޗޘޙޚޛޝޞޟޠޡޢޣޤަާިީުޫެޭޮޯް',
el: ' #&(),-./0123456789ABCDEFGHIJLMNOPQRSTUVXY_abcdefghiklmnopqrstuvwxyz{}ΆΈΌΐΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩάέήίαβγδεζηθικλμνξοπρςστυφχψωϊόύώ',
en: " !#%&'()*+,-./0123456789?ABCDEFGHIJKLMNOPQRSTUVWXYZ[]_abcdefghijklmnopqrstuvwxyz{}ÉØáãäåçèéíïñóöøüýăđēńőřşŠšţūŻŽž’",
en_AU: " #'+.0146ABCDEFGHIJKLMNOPQRSTVWXYZ_abcdefghijklmnopqrstuvwxyz{}",
en_AU_ocker: ' #+.01234567ABCDEFGHIJKLMNOPQRSTVWXYZ_abcdefghijklmnopqrstuvwxyz{}',
en_BORK: '-BINTUabcdefghijklmnopqrstuvxyz',
en_CA: ' !#()-.1?ABCDEFGHIJKLMNOPQRSTUVWXY_abcdefghijklmnopqrstuvwxyz{}Îâèéô–’',
en_GB: ' #-.0123456789?ABCDEFGHIKLMNOPRSTUVWY_abcdefghiklmnopqrstuvwxyz{}',
en_GH: ' #+-.012345678?ABCDEFGHIJKLMNOPQRSTUVWYZ_abcdefghijklmnoprstuvwxyz{}ɔε',
en_IE: ' #.0123456789CDGIKLMNORSTW_abcdefghiklmnoprstuvwxy{}',
en_IN: " #'()+,-.16789ABCDEFGHIJKLMNOPQRSTUVWYZ_abcdefghijklmnopqrstuvwxyz{}",
en_NG: ' #+-.02345789ABCDEFGHIJKLMNOPRSTUVWYZ_abcdefghijklmnopqrstuvwxyz{}',
en_US: ' .0123456789ANSU_abcdefgilmnoprstuxyz{}',
en_ZA: " #'()+-.0123456789ABCDEFGHIJKLMNOPRSTVWXYZ_abcdefghijklmnopqrstuvwxyz{}",
es: ' #,-./2345679ABCDEFGHIJKLMNOPQRSTUVYZ_abcdefghijklmnopqrstuvwxyz{}ÁÓáéíñóúüý',
es_MX: ' #,-./0234567ABCDEFGHIJKLMNOPQRSTUVXYZ_abcdefghijklmnopqrstuvwxyz{}ÁÑÓáéíñóúü',
fa: ' #+,-./0123456789_acefghilmnoprstuxy{}،ءآئابتثجحخدذرزسشصضطظعغفقلمنهوئًپچژکگی',
fi: 'AEHIJKLMNOPRSTVabefhijklmnoprstuvyä',
fr: " #%'()+,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz{}ÆÇÈÉÎÑØàáâãäåçèéêëíîïñóôøùúûüāđğıłńœŽ’“”",
fr_BE: " #'+-./0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz{}ÈÉàâçèéêëîïôû",
fr_CA: ' #,-.1?ABCEGHJKLMNOPQRSTUVXY_abcdefghiklmnopqrstuvwxyz{}ÉÎé',
fr_CH: " #%'()+,-./0123456789:;?ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz{}ÆÇÉÎÑØàáâãäåçèéêëíîïñóôöøùúûüÿāđğıłńœšŽſ’“”",
fr_LU: ' #+-.1235679ACDEGLMRVW_abcdefghiklmnoprstuvxyz{}',
ge: ' #()+-.012359_acefghilmnoprstuxy{}აბგდევზთიკლმნოპჟრსტუფქღყშჩცძწჭხჯჰ’',
he: ` "#%&'()+,-.0123456789ABCDEFGHIJKLMNOPRSTUVWXYZ_abcdefghilmnopqrstuvxy{}ִֹאבגדהוזחטיךכלםמןנסעףפץצקרשת׳`,
hr: ' #()+-.03589ABCDEFGHIJKLMNOPQRSTUVWYZ_abcdefghijklmnopqrstuvwxyz{}äöüĆćČčĐ𩹮ž',
hu: ' #%&()+,-./0123456789:ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz{}ÁÉÍÓÖÚÜáãäåçèéíóöúüıŐőůű́–’',
hy: " #'()+-./347_abcefgilmnoprstuxyz{}ԱԲԳԴԵԶԷԹԻԼԽԿՀՁՂՃՄՅՆՇՈՉՊՋՌՍՎՏՓՔՕՖաբգդեզէըթժիլխծկհձղճմյնշոչպջռսվտրցւփքօֆև",
id_ID: ' #()+,-.023456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz{}',
it: " #&'()+,-./0234679ABCDEFGHIJKLMNOPQRSTUVWYZ_abcdefghijklmnopqrstuvwxyz{}°àèéìòù�",
ja: '#-.0123456789_acefgilmnoprstuxy{}々〜あいうえおかがきぎくぐけげこごさざしじすずせぜそぞただちぢっつづてでとどなにぬねのはばぱひびぴふぶぷへべぺほぼぽまみむめもゃやゅゆょよらりるれろわをんアィイウェエオカガキギクグケコゴサザシジスズセゼソタダチッツテデトドナニネノハバパヒビピフブプヘベペホボポマミムメモャヤュヨラリルレロワン・ー一七丈三上下不世並中主丼乃久之乗乙九也乱乳乾事二井亜交京亭亮人仁介仏仕付代以仰伊伐休会伝似佐体何作佳依価保信修俵俸倒値健傑備僚儀優先光児入全八公六共兵具典写冬冷凛凜凝出分切刊列初判利制刷則前剛剤副割劇力功加劣助勇勉動勝勲化北匠区匿十千半協南博印原及友双反取受口古可台右号合吉同名向君否味命和咲哀品哉員哲哺唄問啓喜営噌器回因困囲図固国地型城基埼堀報塾境墓墟墨壁壊壌壮夏外夢大天太夫央失奇奈奉奔奨女奴好妃妥妻始姓委娘婚媒媛嫌子字孝学宅守安官定宜宝実室宮家富察審寮寺対封専尉尋小尚屈屋山岐岡岩島崇崎川巡左差巻市希帝帯帳幣平年幸広床序底店府度庫康廃廉延建式弔弘弥弱張当形彦彩彼待律復徳徹心忍忘応怒怖思急性恥恨息恵悔悠悦悪悲情惑愛慮慶憂憶懇成戦戸所扇手承抑投拒拓拘括持指捕掛採接推提揺携摘撃操擬放敏救教敬数敵文斎斗料斬断新方施旅既日旧早昇明星春昭普智暇暗暴曲書月有服望朝木未本杏材村来東松板枕林果枢架柄柱査栃栄栞株核根格桃案桑桜梨棄棒森椅検楓業極構模樹橋機欠次欧歌正武歩歯死殊残段殺殻母毎比毛氏気水汚江沖沙没油治況法波泥泰泳洋洗洲活浄浅浩浮海浸消液涼淳清済減渡渦測港湊湖湾満源溶滋滝漂漠漬潔潜潟潮濃濠濯瀬火点無焦然照煩熊燃燥版牙牛物牲特犠犬犯状狂独猿獣玉玲理琴瑛瓶生産用田由甲男町番異疎疾病白百的皇盆盛盤盲直県真着瞬瞳知石破碁磨礎社祉祐神禅禍福秀秋秘移程税稔穂積空窒窓立竜競筒箸節簿米粧糖糸系紀約紋純紛素紬累紺終結統絵絹継緊総締縄縛縮繁織缶置羊美群義羽翔翼老者耐聡聴育胃背脱腸自臭舗舞船艇良色花芳芸芽英茂茜茨莉菊菜華萌落葉葬葵蒙蒼蓮薬藤虚虫蛇血行術街衣裁装裏裕製襲西見視親観解設評詞試詰話誇誉誓誘語誠誤説談請諸謙謡譜警議譲豊象豪貞貨貫貴買費賀資賓賛賢赤走起超路踏車軒軸較輔輝輪輸辛辞辰農辺込近返迫迷退送逆通連逮週進遇運道達違遣遥遮遺避邦郎部郭都配酷酸里重野量金鈍鈴鉱銀鋭錠錯鍋鎮長門閉開間阜阪防限院陳陸険陽隆隔雄雅雇雑難雰零電霊青靖静非面韓音頂頃順頑領頭題額風颯食飽館首香馬駄駆駿騎験騰高髪魅魔鮮鳥鳴鶴鹿麻黒黙齢龍',
ko: ' #,-.03_abcefghiklmnoprstuvxyz{}·가간갈감강같개거건검겁게겨격견결겸경계고곡곤골공과곽관광굉교구국군굴굵권귀규균그극근금급긋기길김깊까깨꾸끄끊끔끗끼나난날남내냉넓년노녹놀농뇌누눈느는늘늙능니닉다단닫달닮담답당대댑더덕데도독돈동되된될두드든들듭때또똑뚤뛰뜸라락란람랑래략량러럽레렉려력련렬령로록론롭뢰료룡루룩룬룸룹류륜률륭르른를름릉리릭린림립마막만많맛망매맹먹먼멍메며면멸명모목몬몹무묵문물미민믿밀바박반받발밝방배백버벌범법벤벳벼벽변별병보복본봉부북분불붐붙브비빈빛빠빨쁘쁜쁨사삭산살상새색샘생서석선설섬섭성세센셉셰소속손솔송쇠수숙순술쉬슈스슬습승시식신실심싼쌍써씬아악안알암애앤앨야약얀양어억언얼엄업없에엘여역연열염엽영예오옥온올옹와완왕외요욕용우욱운울움웅워원월위윗윙유육윤율으은을음읍응의이익인일임입있자작잔잠잡장재저적전절젊점정제젠조족존종좋좌죄주죽준중즈즐증지직진집짧찍차착찬참창채처천철청체초총최추축출충취치친침카케코쾌쿨크큰타탁탄탈탐태택터테토통트특파판팽편평포표푸품풍프픈피필하학한할함합항해행향허헌험헨혁현협형혜호혹혼홍화확환활황회효후훈훌훤휘휼흔흠흥희히힌힘',
lv: ' #()+,-.12367ABCDEFGHIJKLMNOPRSTUVZ_abcdefghijklmnopqrstuvxyz{}ĀāČčĒēĢģĪīĶķĻļņŠšŪūŽžайкнопрсуы',
mk: ' #()+,-.012345789I_acefghijklmnoprstuvxy{}ЃЅЈЉЊЌЏАБВГДЕЖЗИКЛМНОПРСТУФХЦЧШабвгдежзиклмнопрстуфхцчшѓјљњќџ’',
nb_NO: ' #+,-.047ABCDEFGHIJKLMNOPRSTUVW_abcdefghijklmnoprstuvxy{}Øåæéø',
ne: ' #+-.79ABCDGHIJKLMNPRST_abcdefghijklmnoprstuvwxy{}',
nl: " !#'(),-.01236;?ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz{}\x96Ãâãéêëïöúû",
nl_BE: " #'+-./01234ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz{}éë",
pl: ` !"#'(),-./0123456789ABCDEFGHIJKLMNOPQRSTUVWYZ[]_abcdefghijklmnopqrstuvwxyz{}äçéóöüąĆćꣳńŚśźŻż–`,
pt_BR: ' #&()+,-.5ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz{}ÁÍáâãçéêíóôõú',
pt_PT: ' #()+-./123569ABCDEFGHIJKLMNOPQRSTUVWXZ_abcdefghijklmnopqrstuvwxyz{}ªºÁÂÉÍáâãçéêíóôõú',
ro: ' #-.0123456789ABCDEFGHIJKLMNOPRSTUVXZ_abcdefghijklmnoprstuvwxyz{}ÎâăȘșȚț',
ru: " !#'(),-.0189ABCDEFGHIJLMNOPRSTUX_abcdefghijlmnoprstuvwxyz{}АБВГДЕЖЗИЙКЛМНОПРСТУФХЦЧШЩЭЮЯабвгдежзийклмнопрстуфхцчшщъыьэюяё—",
sk: ' "#()+-./012345679ABCDEFGHIJKLMNOPQRSTUVWXZ_abcdefghijklmnopqrstuvwxyz{}ÍÚáäéíóôöúýČčĎď켾ňŕřŠšťŽž',
sv: ' #,-.ABCDEFGHIJKLMNOPQRSTUVWYZ_abcdefghijklmnopqrstuvwxyz{}ÄÅÖãäåçéíö',
tr: " #%&'()+,-.0123456789ABCDEFGHIJKLMNOPRSTUVWYZ_abcdefghijklmnopqrstuvwxyz{}ÂÇÖÜâçéîöûüğİıŞş",
uk: ' #()-.0123456789_abcdefghiklmnoprstuvxyz{}ЄІАБВГДЕЖЗЙКЛМНОПРСТУФХЦЧШЩЮЯабвгдежзийклмнопрстуфхцчшщьюяєіїґ’',
ur: ' #&+-.0239ABCDGIJKPST_abcdefghiklmnoprstuvwxy{}ؑءئابتثجحخدرزسشصطظعغفقلمنهوَُْٰٓٗٹپچڈڑژکگںھہیے',
vi: ' #-.0235789ABCDEFGHIJKLMNOPQRSTUVXYZ_abcdefghijklmnopqrstuvwxyz{}ÁÂÚÝàáâãèéêìíòóôõùúýăĐđĩũơưạẢảẤấầẨẩẫậắằẵặẽếềểễệỉịọỏốồổỗộớờởợụủứừửữựỳỷỹ',
zh_CN: '#-.012369_acefghilmnopqrstuvxy{}一丁万上东严中丹丽乐乡于云京令仲任伟何余依侬侯俊修倩健傅兰冀内军冯凤凯刘刚勇包北华南博卢原厦县口古台史叶司吉吕君吴周哲唐啸嘉四国城堂夏天太头奕妍姚姜娜娟婷子孔孙孟宁宇安宋家宸容尧尹展山峻崔川州巷市帅平广庆廖建弘张强彤彬彭徐徒徽心志思怡悦愚慕懿成戴振擎敏文斌新方旁旭昊明昕晋晓晟晨智曹曾朱李杜杨杰松林果柏栋桂桐桥梁梅梓楷欣欧正武段毛民汉汐江汪沈沐沙沪沫河波泽洁洋津洪济浙浩海涛涵淼渊渝港湖湘湾源滇潇潘澳瀚炎炫烨焱然煊煜熊熙熠燕狐玉王玥玲珍珠琪琴琼瑜瑞瑶瑾甘田疆白皓皖省睿石码磊祥祺福离秀秦程空立笑粤红绍罗翊耀聪肃胡胤致航良艳艺芬芳苍苏苑苡若英范茗荣莫萍萧萱董蒋蒙蔡薛藏街衡袁西覃许诚语诸诺谢谭豪豫贵贺贾赖赣赵超越路轩辉辰辽远邓邱邵邹郑郝郭都鄂重金鑫钟钰钱锦长门闽阎阳陆陈陕陶雄雨雪雷霖霞青靖静韦韩顾颖风飞香馨马驰驹骞高魏鲁鲜鸿鹏鹤鹭黄黎黑黔默齐龙龚',
zh_TW: ' #()-.029CORT_acefilmnoprstuwxy{}丁中任何侯俊修偉健傅傑凱劉化北南博台史君吳呂哲唐嘉嘯嚴園城基堂堯夏天姚姜子孔孟孫宇宋宜家宸尹屏展峻崔市廖建弘張強彤彬彭彰徐志思愚懿戴投振擎文新方於旭昊明昕晉晟晨智曉曹曾朱李杜杰東松林果柏栗桃梁梓楊楷榮樂正武段毛江汪沈洋洪浩涵淵淼湖源潔潘澎澤濤瀚瀟灣炎炫焱然煊煜熊熙熠燁王琪瑜瑞瑾田白皓盧省睿石磊祥祺福秦程立竹笑範紹縣羅義翊耀聰胡胤致臺航花苑苗莫華萬葉董蒼蓮蔡蔣蕭薛蘇蘭街袁西覃許語誠謝譚豪賀賈賴超越趙路軒輝連週遠邱邵郝郭鄒鄧鄭金鈺錢錦鐘鑫門閻陳陶陸隆雄雨雪雲雷霖靖韋韓顧風飛餘馬馮馳駒騫高魏鴻鵬鶴鷺黃黎默齊龍龔',
zu_ZA: ' #.01234568ABCDFGHIJKLMNPRSTUVWYZ_abcdefghijklmnopqrstuvwxyz{}'
}
(code to generate)
const { allLocales } = require('@faker-js/faker');

// Map each locale name to a sorted string of its unique characters.
const hash = {};
for (const locale of Object.keys(allLocales)) {
  const uniqueChars = new Set();
  const definitions = allLocales[locale];
  for (const module of Object.keys(definitions)) {
    for (const key of Object.keys(definitions[module])) {
      const trValues = definitions[module][key];
      // Only top-level arrays of strings are scanned; nested objects are skipped.
      if (Array.isArray(trValues)) {
        for (const val of trValues) {
          if (typeof val === 'string') {
            // Iterating a string yields whole code points.
            for (const char of val) {
              uniqueChars.add(char);
            }
          }
        }
      }
    }
  }
  hash[locale] = [...uniqueChars].sort().join('');
}
console.dir(hash);
Team Decision: We will use @matthewmayer's suggestion as a snapshot test. We have to adjust the loop to search recursively so that we also check person.first_name.female in the future. The internet.emoji data should be ignored in all locales.
I created PR #3276 to tackle this.
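A rough sketch of what the recursive variant could look like is shown here. It is not the code from the eventual PR: `collectChars`, the dotted-path ignore list, and the sample data are all made up for illustration, and the sample object only stands in for one locale's definitions.

```javascript
// Hypothetical helper: walks nested objects and arrays, adding every
// character of every string to `chars`. Subtrees whose dotted path is
// listed in `ignoredPaths` are skipped entirely.
function collectChars(node, ignoredPaths = [], path = '', chars = new Set()) {
  if (ignoredPaths.includes(path)) {
    return chars;
  }
  if (typeof node === 'string') {
    for (const char of node) chars.add(char);
  } else if (Array.isArray(node)) {
    // Array items keep the path of the array itself.
    for (const item of node) collectChars(item, ignoredPaths, path, chars);
  } else if (node !== null && typeof node === 'object') {
    for (const [key, value] of Object.entries(node)) {
      const childPath = path === '' ? key : `${path}.${key}`;
      collectChars(value, ignoredPaths, childPath, chars);
    }
  }
  return chars;
}

// Made-up sample data standing in for one locale's definitions.
const sampleLocale = {
  person: { first_name: { female: ['Zo\u00EB', 'Ann'], male: ['Bob'] } },
  internet: { emoji: { smiley: ['\u{1F600}'] } },
};

// Sorted string of unique characters, with the emoji subtree ignored.
const summary = [...collectChars(sampleLocale, ['internet.emoji'])]
  .sort()
  .join('');
console.log(summary);
```

The resulting string per locale is what a snapshot test would store, so any newly introduced character shows up as a diff in review.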
Clear and concise description of the problem
We had several issues where invalid characters were in the locale data for a particular locale.
E.g.
Suggested solution
Verify the characters in the locale data using a test.
Alternative
Wait for community reports to fix the bad characters.
Additional context
Requires:
[Feature Request] Add method to check if a text belongs to a given locale validatorjs/validator.js#2201
I'm willing to create a PR.