Currently, language detection only works for Korean and Japanese written in native script.
It would be incredibly helpful to be able to detect these languages in both native and romanised script.
(Example use case: I'm currently writing a web scraper to grab metadata for manga and manhwa. Many sources list the romanised version of the title while others use the English version. Knowing which one they used would be helpful to me.)
Behaviour
With the following script...
from lingua import Language, LanguageDetectorBuilder

# Create a language detector
detector = LanguageDetectorBuilder.from_languages(Language.JAPANESE, Language.ENGLISH).build()

# Sample texts
text_jp = "これは日本語の文章です。"
text_ro_jp = "kore wa nihongo no bunshou desu"
text_en = "This is an English sentence."

# Detect the language of the sample texts
result_jp = detector.detect_language_of(text_jp)
result_ro_jp = detector.detect_language_of(text_ro_jp)
result_en = detector.detect_language_of(text_en)

print(f"Detected language for text_jp: {result_jp}")
print(f"Detected language for text_ro_jp: {result_ro_jp}")
print(f"Detected language for text_en: {result_en}")
Current output:
Detected language for text_jp: Language.JAPANESE
Detected language for text_ro_jp: Language.ENGLISH
Detected language for text_en: Language.ENGLISH
Expected output:
Detected language for text_jp: Language.JAPANESE
Detected language for text_ro_jp: Language.JAPANESE
Detected language for text_en: Language.ENGLISH
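
For illustration only, here is a rough sketch of the kind of pre-check a scraper could run until something like this is supported. It is not part of lingua and not a proposal for how the library should implement it. The names `romaji_ratio` and `looks_like_romaji` and the 0.8 threshold are all hypothetical. It only tests whether each Latin-script word can be split into Hepburn-style syllables, so it will misfire on English words that happen to fit the pattern (e.g. "banana"), and it does not cover romanised Korean at all.

```python
import re

# One Hepburn-style mora: optional consonant onset + vowel, or a lone "n".
# This is a simplification (assumption): no macrons, no apostrophes.
_MORA = r"(?:(?:ky|gy|sh|ch|ny|hy|by|py|my|ry|ts|[kgsztdnhbpmyrwjf])?[aeiou]|n)"
_WORD = re.compile(rf"(?:{_MORA})+")

# Doubled consonants (as in "gakkou") are collapsed before matching.
_DOUBLED = re.compile(r"([bcdfghjkpstz])\1")


def romaji_ratio(text: str) -> float:
    """Return the fraction of Latin-script words that parse as romaji morae."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        # No Latin-script words at all (e.g. native script): nothing to judge.
        return 0.0
    hits = sum(1 for w in words if _WORD.fullmatch(_DOUBLED.sub(r"\1", w)))
    return hits / len(words)


def looks_like_romaji(text: str, threshold: float = 0.8) -> bool:
    """Hypothetical heuristic: True if most words look like romanised Japanese."""
    return romaji_ratio(text) >= threshold


if __name__ == "__main__":
    for sample in (
        "これは日本語の文章です。",
        "kore wa nihongo no bunshou desu",
        "This is an English sentence.",
    ):
        print(f"{sample!r}: looks_like_romaji={looks_like_romaji(sample)}")
```

A check like this could gate the call to `detect_language_of`, treating text that passes as romanised Japanese and passing everything else to the detector as usual. It remains a guess based on spelling alone, which is why proper support in the detector would be preferable.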