Replace cld2 to whatlanggo #47

Open
wants to merge 2 commits into
base: master

Conversation

mosuka
Contributor

@mosuka mosuka commented Apr 9, 2020

I would like to replace cld2 with whatlanggo, since cld2 appears to be archived and no longer maintained.
What do you think about this?

@mschoch
Contributor

mschoch commented Apr 9, 2020

Can you share how you're using this? We also plan to deprecate all of blevex soon, because no one is maintaining it.

As for detectlang in particular, we originally thought it might be useful to do text analysis based on the language that was detected. But bleve offers no way to make use of this info, because the mappings are static and the language isn't detected until index time. So, can you explain how you're using it?

@mosuka
Contributor Author

mosuka commented Apr 9, 2020

I don't actually use this feature (detect_lang_filter), but I am actively trying to support Blevex.
I knew that CLD2 was deprecated, so I suggested replacing it with an alternative library.
Users of my product have long been asking to replace CLD2 with something that doesn't depend on cgo, so this time I sent you a PR.

@mosuka
Contributor Author

mosuka commented Apr 9, 2020

BTW, you mentioned deprecating all of blevex soon. Will the Japanese tokenizer also be deprecated?

@mschoch
Contributor

mschoch commented Apr 9, 2020

So, unless we can better understand that the detect_lang filter has some actual use, I would prefer to get rid of it, rather than change which library it uses.

The Japanese tokenizer is the only thing in blevex that I think makes sense to save. Most likely it would move to be its own top-level module. Do you think there is anything else of value in blevex?

@mosuka
Contributor Author

mosuka commented Apr 10, 2020

Yes, I think you're right about detectlang. I can't think of a good use case either...

Basically, I'd like to keep the language analysis modules, for example icu, lang, and stemmer. It would be helpful if you could keep these so that I can support as many languages as Lucene does.

@mschoch
Contributor

mschoch commented Apr 10, 2020

Does the icu tokenizer still work? Does it work with recent versions of icu, or only with some specific old ones? It hasn't been touched for 5 years, and it was difficult to get working back then, so I'd be surprised if it does.

I believe all the languages supported by libstemmer (using cgo) are also supported by our pure Go snowball stemmers: https://github.com/blevesearch/snowballstem

The only 2 languages not covered there are Japanese, which we plan to continue supporting, and Thai, which uses a dictionary-based tokenizer as part of ICU. So it seems like Thai is the only language we would lose support for. Are you aware of any alternative tokenizers for Thai?

@mosuka
Contributor Author

mosuka commented Apr 11, 2020

I was not aware of the existence of snowballstem. With this, I don't need to use libstemmer. Thank you for letting me know!

How about this for a Thai tokenizer?
https://github.com/veer66/mapkha
