Vocabulary is not really sorted by frequency rank #932
Comments
You could sort by
Thanks, it does get rid of the structural strings.
There still seems to be an over-representation of American politics, though. Obama, McCain and Palin are still in the top 1000 words.
That over-representation isn't great :(. I used Reddit comments in 2015 for the counts. I'll be generating new frequencies for v2.0. I'll keep an eye out for this sort of problem, thanks.
I want to get word vectors for the N most frequent words, for use in a NN model.
According to vocab.pyx, the vocabulary consists of structural strings (POS tags, dependency labels, etc.) and then actual words ordered by frequency rank. I see two major problems with this:
I can use has_vector to get only lexemes with vectors, but some of the structural strings are also valid lexemes ("LOWER", "agent", etc.), albeit very infrequent ones. How do I really get just the most common lexemes?
and this is a sample from what I got back:
I find it hard to believe that "McCain" is really in the 1000 most common words.
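One way to approach the "N most frequent words with vectors" goal is sketched below. It assumes each lexeme exposes a log-probability attribute named prob (higher means more frequent) alongside has_vector, as spaCy's Lexeme does; the FakeLexeme class, the sample values, and the helper name most_frequent_with_vectors are hypothetical and exist only to keep the example self-contained.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class FakeLexeme:
    """Hypothetical stand-in for a spaCy Lexeme, for illustration only."""
    orth_: str         # the surface string
    prob: float        # log probability; higher means more frequent (assumed)
    has_vector: bool   # whether a word vector is available


def most_frequent_with_vectors(lexemes: Iterable, n: int) -> List:
    """Return the n highest-probability lexemes that have a vector."""
    candidates = [lex for lex in lexemes if lex.has_vector]
    # Sort by estimated probability rather than relying on vocabulary order,
    # so that rare structural strings fall to the bottom of the list.
    candidates.sort(key=lambda lex: lex.prob, reverse=True)
    return candidates[:n]


# Made-up values, chosen only to show the ordering behaviour.
vocab = [
    FakeLexeme("LOWER", -20.0, True),
    FakeLexeme("the", -3.5, True),
    FakeLexeme("McCain", -12.0, True),
    FakeLexeme("of", -4.2, True),
    FakeLexeme("xyzzy", -4.0, False),
]

top = most_frequent_with_vectors(vocab, 2)
print([lex.orth_ for lex in top])  # ['the', 'of']
```

With a real pipeline, the same helper could presumably be given nlp.vocab, since iterating a spaCy Vocab yields Lexeme objects; that is worth checking against the installed version. Sorting this way only moves infrequent structural strings down the ranking. It does not correct any bias in the underlying frequency counts.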
Info about spaCy