Vocabulary is not really sorted by frequency rank #932

Closed
danielhers opened this issue Mar 27, 2017 · 4 comments
Labels
usage General spaCy usage

Comments

@danielhers
Contributor

I want to get word vectors for the N most frequent words, for use in an NN model.
According to vocab.pyx, the vocabulary consists of structural strings (POS tags, dependency labels, etc.) followed by the actual words, ordered by frequency rank. I see two major problems with this:

  1. I couldn't find a way to skip the structural strings. I'm using has_vector to get only lexemes with vectors, but some of the structural strings are also valid lexemes ("LOWER", "agent", etc.), albeit very infrequent ones. How do I really get just the most common lexemes?
  2. Looking at the lexemes that supposedly come after the structural strings, there still seem to be quite a few infrequent strings very early on. Here is what I did:
import spacy
nlp = spacy.load('en')
# Look up the first 1000 vocabulary entries by their string-store ID and print those that have a vector
print("\n".join('"' + nlp.vocab[i].orth_ + '"' for i in range(1000) if nlp.vocab[i].has_vector))

and this is a sample from what I got back:

"ID"
"ORTH"
"LOWER"
...
"agent"
"attr"
"aux"
...
"en"
"the"
"xxx"
...
"McCain"
"mccain"
"M"
...
"yeah"
"eah"
"high"

I find it hard to believe that "McCain" is really among the 1000 most common words.

Info about spaCy

  • Python version: 3.5.2+
  • Installed models: en_glove_cc_300_1m_vectors-1.0.0, en, cache, en-1.1.0
  • Platform: Linux-4.8.4-aufs-1-x86_64-with-debian-stretch-sid
  • spaCy version: 1.7.2
@honnibal
Member

honnibal commented Mar 27, 2017

You could sort by .prob if you need the true ordering. Agree that this is currently less convenient than it should be.

@danielhers
Contributor Author

danielhers commented Mar 27, 2017

Thanks, sorting by .prob does get rid of the structural strings:

from operator import attrgetter
import spacy
nlp = spacy.load('en')
# Sort the lexemes that have a vector by descending log-probability and print the top 1000
print("\n".join(map(attrgetter("orth_"),
                    sorted(filter(attrgetter("has_vector"), nlp.vocab),
                           key=attrgetter("prob"), reverse=True)[:1000])))

There still seems to be an over-representation of American politics, though: Obama, McCain and Palin are all still in the top 1000 words.
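
For reference, here is a rough sketch of what I plan to do with the sorted lexemes, i.e. build the embedding matrix for the NN model. This assumes Lexeme.vector returns a NumPy array; embeddings and word_index are just illustrative names.

import numpy as np
from operator import attrgetter
import spacy

nlp = spacy.load('en')

# Take the N most probable lexemes that actually have a vector,
# which skips the structural strings as above.
N = 1000
top_lexemes = sorted((lex for lex in nlp.vocab if lex.has_vector),
                     key=attrgetter("prob"), reverse=True)[:N]

# Stack the vectors into an (N, dim) matrix and keep a word -> row mapping
# so the model can look up an embedding row by surface form.
embeddings = np.vstack([lex.vector for lex in top_lexemes])
word_index = {lex.orth_: i for i, lex in enumerate(top_lexemes)}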

honnibal added the usage (General spaCy usage) label on Mar 31, 2017
@honnibal
Member

honnibal commented Apr 7, 2017

That over-representation isn't great :(. I used Reddit comments from 2015 for the counts. I'll be generating new frequencies for v2.0, and I'll keep an eye out for this sort of problem, thanks.

honnibal closed this as completed on Apr 7, 2017
@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on May 9, 2018