Replace cld2 to whatlanggo #47
base: master
Can you share how you're using this? We also plan to deprecate all of blevex soon, because no one is maintaining it. As for
I don't actually use this feature (detect_lang_filter), but I am actively trying to support Blevex.
BTW, you mentioned deprecating all of blevex soon. Will the Japanese tokenizer also be deprecated?
So, unless we can confirm that the detect_lang filter has some actual use, I would prefer to get rid of it rather than change which library it uses. The Japanese tokenizer is the only thing in blevex that I think makes sense to save. Most likely it would move to be its own top-level module. Do you think there is anything else of value in blevex?
Yes, I think you're right about Basically, I'd like to keep the language analysis modules. For example,
Does the ICU tokenizer still work? Does it work with a recent version of ICU, or only with specific older ones? It hasn't been touched for 5 years, and it was difficult to get working back then, so I'd be surprised if it does. I believe all the languages supported by libstemmer (using cgo) are also supported by our pure Go snowball stemmers: https://github.com/blevesearch/snowballstem The only 2 languages not covered there are Japanese, which we plan to continue supporting, and Thai, which uses a dictionary-based tokenizer as part of ICU. So it seems like Thai is the only language we would lose support for. Are you aware of any alternative tokenizers for Thai?
I was not aware of the existence of snowballstem. With this, I don't need to use libstemmer. Thank you for letting me know! How about this for a Thai tokenizer?
I would like to replace cld2 with whatlanggo, as cld2 seems to be archived and no longer maintained.
What do you think about this?