Word Tokenization - Unexpected Output #139

albertnanda · 2022-01-24T14:25:41Z

Is this expected?

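import blingfire
import spacy
nlp = spacy.load("en_core_web_sm")  # assumption: the issue does not say which spaCy model was loaded
text = '''Mr. G. B. Shaw, known at his insistence simply as Bernard Shaw, was an Irish playwright.'''
print(blingfire.text_to_words(text).split())
print(list(nlp(text)))  # spaCy, for comparison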

['Mr', '.', 'G', '.', 'B', '.', 'Shaw', ',', 'known', 'at', 'his', 'insistence', 'simply', 'as', 'Bernard', 'Shaw', ',', 'was', 'an', 'Irish', 'playwright', '.']
[Mr., G., B., Shaw, ,, known, at, his, insistence, simply, as, Bernard, Shaw, ,, was, an, Irish, playwright, .]

The dot (.) in "Mr." and "G." should not be treated as a separate token. Each abbreviation, including its dot, should be a single token, as in the spaCy output.
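If this turns out to be the intended behavior, one possible stopgap is to re-attach the dots in a post-processing step. The following is a minimal sketch only: `merge_abbreviation_dots` and `ABBREVIATIONS` are hypothetical names, not part of blingfire or spaCy, and the abbreviation list is illustrative rather than complete. It operates on the token list printed above, so it does not depend on either library.

```python
# Hypothetical post-processing helper; not part of blingfire or spaCy.
# Illustrative abbreviation list, not exhaustive.
ABBREVIATIONS = {"Mr", "Mrs", "Ms", "Dr", "Prof"}


def merge_abbreviation_dots(tokens):
    """Re-attach a '.' token to a preceding known abbreviation or single uppercase initial."""
    merged = []
    for tok in tokens:
        prev = merged[-1] if merged else None
        # A single uppercase letter such as 'G' is treated as an initial.
        is_initial = prev is not None and len(prev) == 1 and prev.isupper()
        if tok == "." and prev is not None and (prev in ABBREVIATIONS or is_initial):
            merged[-1] = prev + "."
        else:
            merged.append(tok)
    return merged


# Token list copied from the blingfire output in this issue.
tokens = ['Mr', '.', 'G', '.', 'B', '.', 'Shaw', ',', 'known', 'at', 'his',
          'insistence', 'simply', 'as', 'Bernard', 'Shaw', ',', 'was', 'an',
          'Irish', 'playwright', '.']
print(merge_abbreviation_dots(tokens))
```

The heuristic is deliberately crude: a sentence ending in a single uppercase letter (for example "... and I.") would have its final period merged as well, so it is not a substitute for handling abbreviations in the tokenizer itself.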
