Vocabulary is not really sorted by frequency rank #932

Closed
danielhers opened this issue Mar 27, 2017 · 4 comments
Labels
usage General spaCy usage

Comments

@danielhers
Contributor

I want to get word vectors for the N most frequent words, for use in an NN model.
According to vocab.pyx, the vocabulary consists of structural strings (POS tags, dependency labels, etc.) followed by the actual words, ordered by frequency rank. I see two major problems with this:

  1. I couldn't find a way to skip the structural strings. I'm using has_vector to get only lexemes with vectors, but some of the structural strings are also valid lexemes ("LOWER", "agent", etc.), albeit very infrequent ones. How do I really get just the most common lexemes?
  2. Looking at the lexemes that supposedly come after the structural strings, there still seem to be quite a few infrequent strings very early on. Here is what I did:
import spacy
nlp = spacy.load('en')
# Look up the first 1000 vocabulary entries by their string-store ID and print those that have a vector
print("\n".join('"' + nlp.vocab[i].orth_ + '"' for i in range(1000) if nlp.vocab[i].has_vector))

and this is a sample from what I got back:

"ID"
"ORTH"
"LOWER"
...
"agent"
"attr"
"aux"
...
"en"
"the"
"xxx"
...
"McCain"
"mccain"
"M"
...
"yeah"
"eah"
"high"

I find it hard to believe that "McCain" is really among the 1000 most common words.

Info about spaCy

  • Python version: 3.5.2+
  • Installed models: en_glove_cc_300_1m_vectors-1.0.0, en, cache, en-1.1.0
  • Platform: Linux-4.8.4-aufs-1-x86_64-with-debian-stretch-sid
  • spaCy version: 1.7.2
@honnibal
Member

honnibal commented Mar 27, 2017

You could sort by .prob if you need the true ordering. Agree that this is currently less convenient than it should be.

@danielhers
Contributor Author

danielhers commented Mar 27, 2017

Thanks, sorting by .prob does get rid of the structural strings:

from operator import attrgetter
import spacy
nlp = spacy.load('en')
# Sort the lexemes that have a vector by descending log-probability and print the top 1000
print("\n".join(map(attrgetter("orth_"),
                    sorted(filter(attrgetter("has_vector"), nlp.vocab),
                           key=attrgetter("prob"), reverse=True)[:1000])))

There still seems to be an over-representation of American politics, though: Obama, McCain and Palin are all still in the top 1000 words.
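
For reference, here is a rough sketch of what I plan to do with the sorted lexemes, i.e. build the embedding matrix for the NN model. This assumes Lexeme.vector returns a NumPy array; embeddings and word_index are just illustrative names.

import numpy as np
from operator import attrgetter
import spacy

nlp = spacy.load('en')

# Take the N most probable lexemes that actually have a vector,
# which skips the structural strings as above.
N = 1000
top_lexemes = sorted((lex for lex in nlp.vocab if lex.has_vector),
                     key=attrgetter("prob"), reverse=True)[:N]

# Stack the vectors into an (N, dim) matrix and keep a word -> row mapping
# so the model can look up an embedding row by surface form.
embeddings = np.vstack([lex.vector for lex in top_lexemes])
word_index = {lex.orth_: i for i, lex in enumerate(top_lexemes)}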

honnibal added the usage (General spaCy usage) label on Mar 31, 2017
@honnibal
Member

honnibal commented Apr 7, 2017

That over-representation isn't great :(. I used Reddit comments from 2015 for the counts. I'll be generating new frequencies for v2.0, and I'll keep an eye out for this sort of problem, thanks.

honnibal closed this as completed on Apr 7, 2017
@lock

lock bot commented May 9, 2018

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked as resolved and limited conversation to collaborators on May 9, 2018