Support for Latin/Romanised Japanese and Korean #252

Open
PuukiKuma opened this issue Jan 14, 2025 · 0 comments
Currently, language detection only works for Korean and Japanese written in native script.

It would be incredibly helpful to be able to detect these languages in both native and romanised script.

(Example use case: I'm currently writing a web scraper to grab metadata for manga and manhwa. Many sources list the romanised version of the title while others use the English version. Knowing which one a source used would be helpful to me.)

Behaviour

With the following script...

from lingua import Language, LanguageDetectorBuilder

# Create a language detector
detector = LanguageDetectorBuilder.from_languages(Language.JAPANESE, Language.ENGLISH).build()

# Sample texts
text_jp = "これは日本語の文章です。"
text_ro_jp = "kore wa nihongo no bunshou desu"
text_en = "This is an English sentence."

# Detect the language of the sample texts
result_jp = detector.detect_language_of(text_jp)
result_ro_jp = detector.detect_language_of(text_ro_jp)
result_en = detector.detect_language_of(text_en)

print(f"Detected language for text_jp: {result_jp}")
print(f"Detected language for text_ro_jp: {result_ro_jp}")
print(f"Detected language for text_en: {result_en}")

Current output:

Detected language for text_jp: Language.JAPANESE
Detected language for text_ro_jp: Language.ENGLISH
Detected language for text_en: Language.ENGLISH

Expected output:

Detected language for text_jp: Language.JAPANESE
Detected language for text_ro_jp: Language.JAPANESE
Detected language for text_en: Language.ENGLISH
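
For illustration only, here is a rough sketch of the kind of check that could tell the two cases apart outside the library. It does not use lingua at all: it just tests whether each Latin-script word can be split into Hepburn-style syllables. The names `romaji_fraction` and `looks_romanised_japanese` and the 0.8 threshold are made up for this example, and the syllable pattern is a simplification (no macrons, no apostrophes, Japanese only, nothing for Korean).

```python
import re

# One Hepburn-style mora: an optional consonant onset plus a vowel,
# or a syllabic "n". This pattern is a simplification, not a full romaji grammar.
_MORA = r"(?:(?:[kgnhbpmr]y|sh|ch|ts|[kgsztdnhbpmyrwjf])?[aiueo]|n)"
_ROMAJI_WORD = re.compile(rf"{_MORA}+")


def romaji_fraction(text: str) -> float:
    """Return the share of Latin-script words that parse as romaji syllables."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0  # no Latin-script words at all, e.g. native-script input
    hits = 0
    for word in words:
        # Collapse doubled consonants (sokuon), e.g. "gakkou" -> "gakou".
        word = re.sub(r"([kstpgdzbj])\1", r"\1", word)
        word = word.replace("tch", "ch")  # e.g. "matcha" -> "macha"
        if _ROMAJI_WORD.fullmatch(word):
            hits += 1
    return hits / len(words)


def looks_romanised_japanese(text: str, threshold: float = 0.8) -> bool:
    """Hypothetical helper: True if most words fit the romaji syllable pattern."""
    return romaji_fraction(text) >= threshold


print(looks_romanised_japanese("kore wa nihongo no bunshou desu"))
print(looks_romanised_japanese("This is an English sentence."))
```

With the sample texts above, every word of the romanised sentence fits the pattern, while only "an" does in the English one. Short English words and loanwords such as "banana" or "sake" also fit, so this is only a heuristic for longer titles, not a substitute for proper support in the detector.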