Emoticons tokenization problem #1088
Thanks for pointing this out! I agree, the tokenizer should always split emoji into single tokens. My first instinct was to just add a character class for emoji and then treat them like punctuation/special characters in the prefix, suffix and infix rules. Edit: This actually seems to work pretty well so far, with some regex fiddling and a simple class for
This would make it much nicer going forward, because in spaCy v2.0, it'll finally be easy to add more flags. I'll start implementing this on the develop branch 👍
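The idea sketched above can be illustrated with plain `re` from the standard library. Note this is a toy approximation, not spaCy's actual rules: the character ranges are illustrative, and a real tokenizer would plug the class into its prefix/suffix/infix regexes rather than split like this.

```python
import re

# Illustrative emoji character class (common emoji blocks plus the
# miscellaneous-symbols range); the real class would be more complete.
EMOJI = r"[\U0001F300-\U0001F9FF\u2600-\u27BF]"

# Treat an emoji like punctuation: split it off as its own token
# wherever it touches other characters (prefix, suffix or infix).
_split = re.compile(rf"({EMOJI})")

def tokenize(text):
    # Split on emoji, then on whitespace, dropping empty fragments.
    return [t for chunk in _split.split(text) for t in chunk.split() if t]

print(tokenize("lol😂ok"))  # -> ['lol', '😂', 'ok']
```

The point is just that once emoji are their own character class, the existing punctuation machinery can handle them with no special-casing per emoji.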
Currently doesn't work for Hungarian, because of conflicts with the custom punctuation rules. Also doesn't take multi-character emoji like 👩🏽💻 into account.
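For the multi-character case, keeping a ZWJ sequence together mostly comes down to matching the joined sequence before its individual codepoints. A stdlib sketch of that idea (the ranges and the sequence grammar here are illustrative, not what spaCy ships):

```python
import re

# Single emoji codepoints (illustrative ranges only).
EMOJI = r"[\U0001F300-\U0001F9FF\u2600-\u27BF]"
# Optional skin-tone modifier, then any number of ZWJ-joined continuations,
# so sequences like woman + modifier + ZWJ + laptop stay one token.
SEQUENCE = rf"{EMOJI}[\U0001F3FB-\U0001F3FF]?(?:\u200D{EMOJI}[\U0001F3FB-\U0001F3FF]?)*"

_split = re.compile(rf"({SEQUENCE})")

def tokenize(text):
    return [t for chunk in _split.split(text) for t in chunk.split() if t]

woman_technologist = "\U0001F469\U0001F3FD\u200D\U0001F4BB"  # 👩🏽‍💻
print(tokenize(f"code{woman_technologist}life"))
```

Because the sequence pattern is greedy, the whole ZWJ cluster matches as one unit instead of falling apart into 👩, 🏽 and 💻.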
Thank you for the quick fix, I really like it :) I think it's one of the best solutions, but I really miss Hungarian support, because I also want to use the Hungarian tokenizer.
@Andras7 Yeah, it's still a pretty rough fix and I haven't spent that much time debugging it yet. So I'm sure we'll be able to make Hungarian work as well. Luckily, the tokenizer tests for Hungarian are very good and include a lot of complex punctuation examples. So if you're working with Hungarian and happen to find a good way to integrate the new symbols class, that'd be cool. (Fiddling with punctuation rules is often a little easier if you actually know the language well.) It'd be nice to get this stable – I've had a lot of fun experimenting with emoji, and there's definitely a lot of potential in this. (For instance, here's an example from the v2.0 alpha docs using the new
@ines I really appreciate your help. First I'm trying to understand spaCy's source, because I'm new here; after that, maybe I can help improve the Hungarian tokenizer. I really like this library :) I have another observation: should we improve smiley detection, or is it too expensive?
@oroszgy I've tried this for Hungarian, but it caused a bunch of punctuation tests to fail. I'm not sure what the problem here is... maybe the
The :DDDD stuff would probably be a case for the
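Variable-length smileys like :DDDD are hard to enumerate as fixed tokenizer exceptions, but a whole-token pattern check (in the spirit of the tokenizer's token_match hook) could catch them. A rough stdlib sketch, with a made-up and nowhere-near-exhaustive pattern:

```python
import re

# Illustrative emoticon pattern: ':' or ';' plus one or more "mouth"
# characters, so ':D', ':DDDD', ':)))' and ':3' all stay one token.
EMOTICON = re.compile(r"[:;][()DdPp3]+")

def token_match(text):
    """Return True if the whole string should be kept as a single token."""
    return EMOTICON.fullmatch(text) is not None

print(token_match(":DDDD"))  # True
print(token_match("hello"))  # False
```

A pattern check like this scales to arbitrary repetition, which a fixed exception list can't.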
Yes, I think so :) Thank you for your help!
@ines thanks for the insights. I'll have a look at the problem, and will keep you updated.
Closing this as it all seems to be working now on
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Here is an example: "hey playa!:):3.....@shaq can you still dunk?#old🍕🍔😵LOL"
I think we can use a pretty obvious rule there :)
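For illustration, a quick stdlib sketch of how such a rule might carve up this example: split off emoticons, emoji and punctuation runs, then whitespace. The regexes are made up for this comment and are not spaCy's actual punctuation rules.

```python
import re

EMOJI = r"[\U0001F300-\U0001F9FF\u2600-\u27BF]"   # illustrative ranges
EMOTICON = r"[:;][()Dd3]+"                        # toy smiley pattern
# Emoticons first (so ':' starts a smiley, not punctuation), then emoji,
# then runs of sentence punctuation; '@' and '#' are left attached so
# mentions and hashtags survive as single tokens.
_split = re.compile(rf"({EMOTICON}|{EMOJI}|[.!?]+)")

def tokenize(text):
    return [t for chunk in _split.split(text) for t in chunk.split() if t]

print(tokenize("hey playa!:):3.....@shaq can you still dunk?#old🍕🍔😵LOL"))
```

Even this naive version separates the smileys, the ellipsis, the hashtag and each pizza/burger/dizzy-face emoji into their own tokens.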