Emoicons tokenization problem #1088

Andras7 · 2017-05-26T14:49:15Z

Here is an example: "hey playa!:):3.....@shaq can you still dunk?#old🍕🍔😵LOL"

I think we can use a pretty obvious rule here :)

ines · 2017-05-26T15:17:17Z

Thanks for pointing this out! I agree, the tokenizer should always split emoji into single tokens.

My first instinct was to just add a character class for emoji, and then treat them like punctuation/special characters in the prefix, suffix and infix rules.

Edit: This actually seems to work pretty well so far with some regex fiddling and a simple class for Other_Symbols. It'll also match a bunch of other non-emoji symbols, but that's no problem, as they should also be included in the icons/symbols. The approach currently isn't available for Hungarian, though, as it needs more custom punctuation rules. Keeping multi-character emoji like 👩🏽‍💻 as one token is also not possible at the moment. The alternative would be a more complex emoji regex using token_match – just discussed this with @honnibal and the performance risks may outweigh the benefits here...
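As a rough illustration of the character-class idea above (not the actual spaCy change), the sketch below uses Python's standard `unicodedata` module to treat every character in the Unicode "Other Symbol" category (`So`) as its own token. The function name `split_symbols` is hypothetical, and whitespace splitting stands in for the real tokenizer.

```python
import unicodedata


def split_symbols(text):
    """Split on whitespace, then make every 'So' (Other Symbol) character a token."""
    tokens = []
    for chunk in text.split():
        buf = ""
        for ch in chunk:
            if unicodedata.category(ch) == "So":
                # Flush any pending non-symbol text, then emit the symbol alone.
                if buf:
                    tokens.append(buf)
                    buf = ""
                tokens.append(ch)
            else:
                buf += ch
        if buf:
            tokens.append(buf)
    return tokens


print(split_symbols("can you still dunk?#old🍕🍔😵LOL"))
```

The same sketch also shows the limitation mentioned above: a sequence like 👩🏽‍💻 consists of several code points (two `So` symbols, a skin-tone modifier and a zero-width joiner), so a per-character rule breaks it apart.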

On the other hand, there are only 1851 emoji in total – so we could simply use the unicode range to generate them as tokenizer exceptions in the emoticons. The advantage here is that we can actually add annotations and flags to the tokens.
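A minimal sketch of what generating such exceptions could look like, assuming the exceptions are a dict mapping each string to a list of attribute dicts. The code point range used here (the Unicode "Emoticons" block, U+1F600 to U+1F64F) is only an illustrative assumption and covers far fewer than all emoji; the `"ORTH"` and `"IS_ICON"` keys are plain-string stand-ins, and `IS_ICON` is the hypothetical flag discussed in this thread.

```python
import unicodedata


def emoji_exceptions(start=0x1F600, end=0x1F64F):
    """Build one exception entry per symbol character in a code point range."""
    exceptions = {}
    for codepoint in range(start, end + 1):
        char = chr(codepoint)
        # Skip unassigned or non-symbol code points in the range.
        if unicodedata.category(char) == "So":
            exceptions[char] = [{"ORTH": char, "IS_ICON": True}]
    return exceptions


exceptions = emoji_exceptions()
print(exceptions["😄"])
```

Because each entry is an explicit record, extra annotations can be attached per emoji, which is the advantage over a pure regex rule.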

This would make it much nicer going forward, because in spaCy v2.0, it'll finally be easy to add more flags. So there could be an IS_ICON flag that includes emoticons and emoji, and that users could add custom bindings to for other symbols. On a token, you could check token.is_icon, or create a Matcher pattern with {IS_ICON: True} tokens etc.

I'll start implementing this on the develop branch 👍

Currently doesn't work for Hungarian, because of conflicts with the custom punctuation rules. Also doesn't take multi-character emoji like 👩🏽‍💻 into account.

Andras7 · 2017-05-29T07:46:36Z

Thank you for the quick fix, I really like it :) I think it is one of the best solutions, but I really miss Hungarian support because I also want to use the Hungarian tokenizer.

ines · 2017-05-29T08:20:28Z

@Andras7 Yeah, it's still a pretty rough fix and I haven't spent that much time debugging it yet. So I'm sure we'll be able to make Hungarian work as well. Luckily, the tokenizer tests for Hungarian are very good and include a lot of complex punctuation examples.

So if you're working with Hungarian and happen to find a good way to integrate the new symbols class, that'd be cool. (Fiddling with punctuation rules is often a little easier if you actually know the language well.)

It'd be nice to get this stable – I've had a lot of fun experimenting with emoji, and there's definitely a lot of potential in this. (For instance, here's an example from the v2.0 alpha docs using the new Matcher API and emoji for basic sentiment analysis 😄 )

Andras7 · 2017-05-29T13:04:28Z

@ines I really appreciate your help. I'm new here, so first I'm trying to understand spaCy's source; after that, maybe I can help improve the Hungarian tokenizer. I really like this library :)

I have another observation: should we improve smiley detection, or is it too expensive?
e.g. "It was funny :DD" - there is a double D which isn't detected. We could extend the smiley dictionary with this new symbol, but it would be more robust if we could use regexps to detect any number of D's and so on.
"It was funny:)))) and interesting:):):)" - here we also can't detect the smileys because there is no space between tokens.
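To make the regexp suggestion concrete, here is a small stand-alone sketch using only Python's `re` module. The pattern and the `split_emoticons` helper are assumptions for illustration and cover only a few emoticon shapes (eyes, optional nose, a repeated mouth character); they are not part of spaCy.

```python
import re

# Eyes (: ; =), optional nose (-), then one or more repeated mouth characters.
EMOTICON_RE = re.compile(r"[:;=]-?(?:\)+|\(+|D+|P+)")


def split_emoticons(text):
    """Split on whitespace and cut emoticons out of each chunk, even without spaces."""
    tokens = []
    for chunk in text.split():
        pos = 0
        for match in EMOTICON_RE.finditer(chunk):
            if match.start() > pos:
                tokens.append(chunk[pos:match.start()])
            tokens.append(match.group())
            pos = match.end()
        if pos < len(chunk):
            tokens.append(chunk[pos:])
    return tokens


print(split_emoticons("It was funny :DD"))
print(split_emoticons("It was funny:)))) and interesting:):):)"))
```

A pattern this permissive will also fire inside ordinary text (for example `ID:DEF` contains `:D`), which is part of the cost being asked about.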

oroszgy · 2017-05-29T19:34:54Z

Szia @Andras7, I haven't managed to look into the issue as the current dev branch is not compiling for me, but my guess would be that adding LIST_ICONS to _prefixes, _suffixes and _infixes here could do the trick.
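A hedged sketch of that suggestion, using plain `re` rather than spaCy's own helpers. The contents of `LIST_PUNCT` and `LIST_ICONS` below are made-up stand-ins (the icon class is a single code point range chosen for illustration), and the way the three regexes are compiled is an assumption about the general approach, not the Hungarian language data itself.

```python
import re

# Hypothetical rule lists; the real language data is much larger.
LIST_PUNCT = [r"\.", ",", "!", r"\?"]
LIST_ICONS = ["[\U0001F300-\U0001F64F]"]  # illustrative range only

_prefixes = LIST_PUNCT + LIST_ICONS
_suffixes = LIST_PUNCT + LIST_ICONS
_infixes = LIST_ICONS

# Prefix rules anchor at the start, suffix rules at the end, infix rules anywhere.
prefix_re = re.compile("|".join("^" + piece for piece in _prefixes))
suffix_re = re.compile("|".join(piece + "$" for piece in _suffixes))
infix_re = re.compile("|".join(_infixes))

print(prefix_re.match("🍕pizza").group())
print(suffix_re.search("LOL😵").group())
print(infix_re.findall("old🍕🍔😵LOL"))
```

If a language's custom punctuation rules overlap with the icon class, the order and content of these lists start to matter, which may be where conflicts come from.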

ines · 2017-05-30T07:24:54Z

@oroszgy I've tried this for Hungarian, but it caused a bunch of punctuation tests to fail. I'm not sure what the problem here is... maybe the Other_Symbols class contains some punctuation, maybe I've made some logic mistake when I set this up.

The :DDDD stuff would probably be a case for the token_match - however, we do have to think about whether it's worth it to include all possible emoticon options with regular expressions, or whether to just hard-code the very common ones and leave the rest up to the user.
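For readers unfamiliar with the idea, a token_match-style check protects a whole substring from being split by the punctuation rules. The toy tokenizer below is a sketch under that assumption: `token_match`, `SUFFIX_RE` and `tokenize` are illustrative names, and the loop is a simplification rather than spaCy's actual algorithm.

```python
import re

# Protect emoticons with repeated mouth characters, e.g. :DDDD or :))).
token_match = re.compile(r"[:;=]-?(?:\)+|\(+|D+|P+)").fullmatch
# Toy suffix rule: a single trailing punctuation character.
SUFFIX_RE = re.compile(r"[.,!?:;()]$")


def tokenize(text):
    tokens = []
    for chunk in text.split():
        suffixes = []
        # Peel off trailing punctuation one character at a time, unless the
        # remaining substring is protected by token_match.
        while chunk and not token_match(chunk) and SUFFIX_RE.search(chunk):
            suffixes.append(chunk[-1])
            chunk = chunk[:-1]
        if chunk:
            tokens.append(chunk)
        tokens.extend(reversed(suffixes))
    return tokens


print(tokenize("That was funny :DDDD!"))
```

Every chunk now pays for an extra regex check, which is the kind of performance trade-off raised earlier in the thread.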

23454365546 · 2017-05-30T07:25:37Z

Yes, I think so :) Thank you for your help!

oroszgy · 2017-05-30T07:39:36Z

@ines thanks for the insights. I'll have a look at the problem, and will keep you updated.

ines · 2017-06-02T10:54:39Z

Closing this as it all seems to be working now on develop, including Hungarian – thanks @oroszgy! The update will be included in v2.0.0 alpha (released as soon as we fix the last serialization issues).

lock · 2018-05-08T20:39:05Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

ines added the performance label May 26, 2017

ines added the 🌙 nightly label May 28, 2017

oroszgy mentioned this issue May 30, 2017: Fixed emoji handling for Hungarian #1093 (merged)

ines closed this as completed Jun 2, 2017

lock bot locked as resolved and limited conversation to collaborators May 8, 2018
