Emoicons tokenization problem #1088

Closed
Andras7 opened this issue May 26, 2017 · 10 comments
Labels
🌙 nightly Discussion and contributions related to nightly builds

Comments

Andras7 commented May 26, 2017

Here is an example: "hey playa!:):3.....@shaq can you still dunk?#old🍕🍔😵LOL"

I think we can use a pretty obvious rule here :)

Member

ines commented May 26, 2017

Thanks for pointing this out! I agree, the tokenizer should always split emoji into single tokens.

My first instinct was to just add a character class for emoji, and then treat them like punctuation/special characters in the prefix, suffix and infix rules.

Edit: This actually seems to work pretty well so far with some regex fiddling and a simple class for Other_Symbols. It'll also match a bunch of other non-emoji symbols, but that's no problem, as they should also be included in the icons/symbols. The approach currently isn't available for Hungarian, though, as it needs more custom punctuation rules. Keeping multi-character emoji like 👩🏽‍💻 as one token is also not possible at the moment. The alternative would be a more complex emoji regex using token_match – just discussed this with @honnibal and the performance risks may outweigh the benefits here...
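
The symbol-class idea can be sketched in plain Python without spaCy. This is only an illustration, not the code that went into the tokenizer: the helper names `is_other_symbol` and `split_symbols` are made up here, and the sketch assumes that the Unicode general category "So" (Other_Symbol) is a reasonable stand-in for the Other_Symbols class mentioned above.

```python
import unicodedata


def is_other_symbol(ch):
    # Unicode general category "So" (Other_Symbol) covers most emoji
    # and many other pictographic symbols.
    return unicodedata.category(ch) == "So"


def split_symbols(text):
    # Hypothetical helper: split on whitespace, then break every
    # Other_Symbol character out into its own token.
    tokens = []
    for chunk in text.split():
        current = ""
        for ch in chunk:
            if is_other_symbol(ch):
                if current:
                    tokens.append(current)
                    current = ""
                tokens.append(ch)
            else:
                current += ch
        if current:
            tokens.append(current)
    return tokens


print(split_symbols("dunk?#old🍕🍔😵LOL"))
# ['dunk?#old', '🍕', '🍔', '😵', 'LOL']
```

Because it works one code point at a time, a sketch like this also breaks up multi-character sequences such as 👩🏽‍💻, which matches the limitation described above.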

On the other hand, there are only 1851 emoji in total – so we could simply use the unicode range to generate them as tokenizer exceptions in the emoticons. The advantage here is that we can actually add annotations and flags to the tokens.
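
A rough sketch of what generating exceptions from a code point range could look like. The range used is the Unicode "Emoticons" block (U+1F600 to U+1F64F), which is only a small part of all emoji. The function name `build_emoji_exceptions` and the `IS_ICON` key are assumptions for illustration, and the dict layout merely imitates the shape of tokenizer exceptions rather than reproducing spaCy's actual data.

```python
import unicodedata

# Unicode "Emoticons" block; range() excludes the upper bound.
EMOTICONS_BLOCK = range(0x1F600, 0x1F650)


def build_emoji_exceptions(codepoints):
    # Hypothetical helper: map each symbol to a single-token analysis
    # that can carry extra annotations or flags.
    exceptions = {}
    for cp in codepoints:
        ch = chr(cp)
        # Skip code points that are unassigned in this Python's Unicode data.
        if unicodedata.category(ch) == "So":
            exceptions[ch] = [{"ORTH": ch, "IS_ICON": True}]
    return exceptions


emoji_exceptions = build_emoji_exceptions(EMOTICONS_BLOCK)
print("😵" in emoji_exceptions)
```

The upside of a lookup table over a regex is that each entry is a plain dictionary, so extra attributes can be attached per symbol.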

This would make it much nicer going forward, because in spaCy v2.0, it'll finally be easy to add more flags. So there could be an IS_ICON flag that includes emoticons and emoji, and that users could add custom bindings to for other symbols. On a token, you could check token.is_icon, or create a Matcher pattern with {IS_ICON: True} tokens etc.
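
To make the flag idea concrete, here is a spaCy-free sketch of the kind of check an IS_ICON flag could be based on. Everything in it is an assumption: the `EMOTICONS` set is a tiny illustrative subset, and `is_icon` is a plain function, not the `token.is_icon` attribute proposed above.

```python
import unicodedata

# Tiny illustrative subset of emoticons, not a real lexicon.
EMOTICONS = {":)", ":(", ":D", ":3", ";)"}


def is_icon(text):
    # Hypothetical flag getter: true for known emoticons and for
    # single characters in the Unicode category "So" (Other_Symbol).
    if text in EMOTICONS:
        return True
    return len(text) == 1 and unicodedata.category(text) == "So"


tokens = ["hey", "playa", "!", ":)", ":3", "🍕", "LOL"]
icons = [t for t in tokens if is_icon(t)]
print(icons)
# [':)', ':3', '🍕']
```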

I'll start implementing this on the develop branch 👍

ines added a commit that referenced this issue May 27, 2017
Currently doesn't work for Hungarian, because of conflicts with the
custom punctuation rules. Also doesn't take multi-character emoji like
👩🏽‍💻 into account.
@ines ines added the 🌙 nightly Discussion and contributions related to nightly builds label May 28, 2017
Author

Andras7 commented May 29, 2017

Thank you for the quick fix, I really like it :) I think it is one of the best solutions, but I really miss Hungarian support because I also want to use the Hungarian tokenizer.

Member

ines commented May 29, 2017

@Andras7 Yeah, it's still a pretty rough fix and I haven't spent that much time debugging it yet. So I'm sure we'll be able to make Hungarian work as well. Luckily, the tokenizer tests for Hungarian are very good and include a lot of complex punctuation examples.

So if you're working with Hungarian and happen to find a good way to integrate the new symbols class, that'd be cool. (Fiddling with punctuation rules is often a little easier if you actually know the language well.)

It'd be nice to get this stable – I've had a lot of fun experimenting with emoji, and there's definitely a lot of potential in this. (For instance, here's an example from the v2.0 alpha docs using the new Matcher API and emoji for basic sentiment analysis 😄 )

Author

Andras7 commented May 29, 2017

@ines I really appreciate your help. I'm new here, so first I'm trying to understand spaCy's source; after that, maybe I can help improve the Hungarian tokenizer. I really like this library :)

I have another observation: should we improve smiley detection, or is it too expensive?
e.g. "It was funny :DD": the double D isn't detected. We could extend the smiley dictionary with this new symbol, but it would be more robust if we could use regexps to detect any number of Ds and so on.
"It was funny:)))) and interesting:):):)": here we also can't detect the smileys because there is no space between the tokens.

Contributor

oroszgy commented May 29, 2017

Szia @Andras7, I haven't managed to look into the issue as the current dev branch is not compiling for me, but my guess would be that adding LIST_ICONS to _prefixes, _suffixes and _infixes here could do the trick.

Member

ines commented May 30, 2017

@oroszgy I've tried this for Hungarian, but it caused a bunch of punctuation tests to fail. I'm not sure what the problem here is... maybe the Other_Symbols class contains some punctuation, maybe I've made some logic mistake when I set this up.

The :DDDD stuff would probably be a case for the token_match - however, we do have to think about whether it's worth it to include all possible emoticon options with regular expressions, or whether to just hard-code the very common ones and leave the rest up to the user.
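
For reference, a regex along these lines could pick up the repeated variants from the examples above. It is a minimal sketch using only Python's `re` module: the pattern `EMOTICON_RE` and the helper `find_emoticons` are invented for illustration, cover just a handful of emoticon shapes, and are not wired into token_match.

```python
import re

# Eyes, an optional nose, then a mouth that may repeat (":)))", ":DD").
EMOTICON_RE = re.compile(r"[:;=][-']?(?:\)+|\(+|D+|P+|3)")


def find_emoticons(text):
    # Hypothetical helper: return every emoticon-like match, even when
    # it is glued to the neighbouring word or to another emoticon.
    return [m.group(0) for m in EMOTICON_RE.finditer(text)]


print(find_emoticons("It was funny :DD"))
# [':DD']
print(find_emoticons("It was funny:)))) and interesting:):):)"))
# [':))))', ':)', ':)', ':)']
```

A pattern this loose also has false positives: it finds ":3" inside a time like "10:30", which is part of why hard-coding the common emoticons may be the safer default.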

@23454365546

Yes, I think so :) Thank you for your help!

Contributor

oroszgy commented May 30, 2017

@ines thanks for the insights. I'll have a look at the problem, and will keep you updated.

Member

ines commented Jun 2, 2017

Closing this as it all seems to be working now on develop, including Hungarian – thanks @oroszgy! The update will be included in v2.0.0 alpha (released as soon as we fix the last serialization issues).

@ines ines closed this as completed Jun 2, 2017

lock bot commented May 8, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators May 8, 2018
4 participants