A Java implementation of the UAX #29 text segmentation algorithm, plus token types for URLs, emoji, email addresses, hashtags, cashtags, and mentions.
The tokenizer produces the following token types:
ALPHANUM
-- A sequence of alphabetic and numeric characters, e.g., hello, test123
NUM
-- A number, e.g., 123
SOUTHEAST_ASIAN
-- A sequence of characters from South and Southeast Asian languages, including Thai, Lao, Myanmar, and Khmer
IDEOGRAPHIC
-- A single CJKV ideographic character
HIRAGANA
-- A single hiragana character
KATAKANA
-- A sequence of katakana characters
HANGUL
-- A sequence of Hangul characters
URL
-- A URL, e.g., https://www.example.com/
EMAIL
-- An email address or mailto link, e.g., info@example.com
EMOJI
-- A sequence of emoji characters, e.g., 🙂
HASHTAG
-- A social media hashtag, e.g., #hashtag
CASHTAG
-- A social media cashtag, e.g., $CASH
MENTION
-- A social media mention, e.g., @twitter
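As an illustration of how downstream code might act on these token types, here is a minimal, self-contained sketch that keeps only the social media tokens. The names `TokenTypeSketch`, `Kind`, `Tok`, and `socialTexts` are hypothetical stand-ins invented for this example. They are not the library's API, and the library's own token and type classes may look different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

final class TokenTypeSketch {
    // Hypothetical mirror of the token types listed above (assumption: the
    // library's own type representation may differ).
    enum Kind {
        ALPHANUM, NUM, SOUTHEAST_ASIAN, IDEOGRAPHIC, HIRAGANA, KATAKANA, HANGUL,
        URL, EMAIL, EMOJI, HASHTAG, CASHTAG, MENTION
    }

    // Hypothetical token holder: a type plus the matched text.
    static final class Tok {
        final Kind kind;
        final String text;

        Tok(Kind kind, String text) {
            this.kind = kind;
            this.text = text;
        }
    }

    // The three social media token types from the list above.
    static final Set<Kind> SOCIAL = EnumSet.of(Kind.HASHTAG, Kind.CASHTAG, Kind.MENTION);

    // Returns the text of every social media token, in input order.
    static List<String> socialTexts(List<Tok> tokens) {
        List<String> out = new ArrayList<>();
        for (Tok t : tokens) {
            if (SOCIAL.contains(t.kind)) {
                out.add(t.text);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample tokens built by hand from the examples in the list above.
        List<Tok> tokens = Arrays.asList(
            new Tok(Kind.ALPHANUM, "hello"),
            new Tok(Kind.HASHTAG, "#hashtag"),
            new Tok(Kind.URL, "https://www.example.com/"),
            new Tok(Kind.CASHTAG, "$CASH"),
            new Tok(Kind.MENTION, "@twitter"));
        System.out.println(socialTexts(tokens)); // prints [#hashtag, $CASH, @twitter]
    }
}
```

An `EnumSet` keeps the membership check cheap and makes the grouping easy to adjust, for example to also keep URL and EMAIL tokens.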
To process text into tokens, use code like the following:
    try (UAX29URLEmailTokenizer tokenizer = new UAX29URLEmailTokenizer("example text")) {
        for (Token token = tokenizer.nextToken(); token != null; token = tokenizer.nextToken(token)) {
            // Process the token here
        }
    }