Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

implement separate functions to load FT embeddings and models #2376

Merged
merged 7 commits into from
Mar 7, 2019
Merged

implement separate functions to load FT embeddings and models #2376

merged 7 commits into from
Mar 7, 2019

Conversation

mpenkov
Copy link
Collaborator

@mpenkov mpenkov commented Feb 6, 2019

Introduced two new pure functions to the gensim.models.fasttext module:

  1. load_facebook_vectors: loads embeddings from binaries in FB's fastText .bin format
  2. load_facebook_model: loads the full model from binaries in FB's fastText .bin format

The existing FastText.load_fasttext_format method loads full models only. I've placed a deprecation warning around it. The full_model parameter is gone - it was only introduced in 3.7.1, so it's not too late to just rip it out, IMHO.

When releasing 3.7.2, we should include the above in the change log, as it changes the behavior wrt to 3.6.0

Fixes #2372

@mpenkov mpenkov changed the title [WIP] implement KeyedVectors.load_fasttext_format method implement KeyedVectors.load_fasttext_format method Feb 7, 2019
Copy link
Owner

@piskvorky piskvorky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is having both methods share the same name a good idea?

I'd prefer something more clearly distinct, like KeyedVectors.load_fasttext_format vs FastText.load_fasttext_full_model or something similarly "obvious".

Plus FastText.load_fasttext_format could probably proxy/alias to KeyedVectors.load_fasttext_format, for backward compatibility.

@mpenkov
Copy link
Collaborator Author

mpenkov commented Feb 7, 2019

Is having both methods share the same name a good idea?

Yes, I think it's a good idea, because it adheres to the Zen of Python (20. Namespaces are one honking great idea -- let's do more of those!).

For example, have a look at how many times the open function is present in the Python standard library. There is no confusion there because the namespaces are different.

Similarly, in this PR, KeyedVectors.load_fasttext_format loads KeyedVectors (word embeddings) and FastText.load_fasttext_format loads FastText instances (models).

That said, I think load_fasttext_format is an uninformative name. Something like load_facebook_format would be more helpful.

I'd prefer something more clearly distinct, like KeyedVectors.load_fasttext_format vs FastText.load_fasttext_full_model or something similarly "obvious".

I don't think exposing the user to the "full model" detail is a good idea. We already distinguish between word embeddings and models in the documentation. Adding "full model" or "partial model" to the mix would only make things worse, in my opinion.

Plus FastText.load_fasttext_format could probably proxy/alias to KeyedVectors.load_fasttext_format, for backward compatibility.

I strongly recommend against this, because it would not achieve backwards compatibility: the alias would point to a function with a different return type, and existing code using that function name will still break.

@piskvorky
Copy link
Owner

OK. I don't feel strongly about this, up to you Misha.

It's just that having two methods with the same name doing different things rang the "support headache" alarm bell for me.

@gojomo
Copy link
Collaborator

gojomo commented Feb 8, 2019

If one only loads the vectors – maybe even just the full-word vectors, not even the char-ngrams – while another loads a full traininable model, it looks like a situation ripe for confusion to me (and not at all in keeping with the open() precedent, as this method is parsing a specific format delivering an object of certain characteristics).

For better or worse, there are a lot of low-attention-to-detail users of gensim. More explicit method names (and variable names in examples) can only help. I'd suggest a descriptive distinction like.load_fasttext_model() and .load_fasttext_vectors().

(And, just to moan about #1777 some more, having many distinct models - w2v, ft, d2v, etc – all have their implementation classes inside the same keyedvectors.py file has made browsing the source, and reading the autogenerated docs to find just one relevant implementation, or just the high-order important methods amongst the internal cruft, much harder.)

@mpenkov
Copy link
Collaborator Author

mpenkov commented Feb 9, 2019

For better or worse, there are a lot of low-attention-to-detail users of gensim. More explicit method names (and variable names in examples) can only help. I'd suggest a descriptive distinction like.load_fasttext_model() and .load_fasttext_vectors().

@gojomo @piskvorky I can sort of see where you're coming from (those name are distinguishable without any additional context), but look at the fully qualified method name that you're proposing:

gensim.models.keyedvectors.KeyedVectors.load_fasttext_vectors

The word vectors appears a total of three times! Do we really need to hammer that point? It's more than obvious: you're loading vectors. Now, you might argue that people won't use the module name, so we're dealing with:

KeyedVectors.load_fasttext_vectors

@gojomo Does that not seem redundant to you? I understand that you're probably not responsible for the module naming and the class naming, and there's little we can do about them right now, but do we really need to add more redundancy into the naming?

Finally, it's worth pointing out that it's virtually impossible to call those methods without the class name. It's always going to be KeyedVectors.load_foo, never just .load_foo by itself, because it's a class method or a static method.

One thing we could do is make those pure functions instead of static/class methods, so you have:

  • gensim.models.keyedvectors.load_fasttext_vectors
  • gensim.models.fasttext.load_fasttext_model

The current static/class method approach has zero benefits, other than consistency with the rest of the x2v hierarchy.

@piskvorky @gojomo What do you think?

Just thinking out loud here.

(And, just to moan about #1777 some more, having many distinct models - w2v, ft, d2v, etc – all have their implementation classes inside the same keyedvectors.py file has made browsing the source, and reading the autogenerated docs to find just one relevant implementation, or just the high-order important methods amongst the internal cruft, much harder.)

Totally on board with you there. The current division of the functionality into packages, modules and classes is quite poor. It's hard to fix that would breaking backwards compatibility in a big way. I'm thinking of a three-stage process to resolve this over the long term:

  1. Write a thin wrapper to hide most of the insanity from the users
  2. Wait for the majority of people to start using the wrapper
  3. Quietly refactor the insanity behind the wrapper so people don't notice

but that's out of scope for the current ticket.

@piskvorky
Copy link
Owner

It's always going to be KeyedVectors.load_foo, never just .load_foo by itself, because it's a class method or a static method.

You're underestimating how popular "data science" is, and the level of general cluelessness that brings. People call class/static methods on already-instantiated objects all the time. Many don't even have a concept of a "class method", they're in full Stackoverflow copy-paste mode.

Which is not to say we have to cater to them, or make our design ugly. But given an option to make something "obvious and fool-proof", we should go for it. For our own sake, saving time and nerves.

@gojomo
Copy link
Collaborator

gojomo commented Feb 12, 2019

I personally don't find the three occurrences of vectors in the proposed fully-specified name excessive. The common convention of putting classes into modules of the same name already dictates two repetitions. Then, having an extra-qualifying method name, that contrasts against other possible (and confusable) kinds-of-loads of the same files, would have to use the same word. And in practice, imports will mean most invocations are just KeyedVectors.load_fasttext_vectors().

It's tough to armor the code against all the misconceptions of unsophisticated users, but I find every bit of explicitness in names can help head off frustrations (and then questions). And perhaps, helps slightly-more-sophisticated users also improve their clarity-of-understanding, even if their struggles show up less often in forums.

Moving such 'utility load' functionality to pure functions might make sense, as a design/style direction. (Especially, for non-native formats, as opposed to a 'native-to-gensim load()' where the user would be expected to know the exact class they previously save()d. I suspect the current practice was influenced somewhat by habits from the Java world, where such utility functions need to live on a related class, but in Python it does lead to naive users calling static methods on instances. and being confused about the errors/effects that has. (I forget where, but somewhere in gensim we added an explicit warning about that.)

(Regarding a potential 'clean-up' beyond the scope of this ticket: my concern with a "simplifying wrapper" as the 1st step is that it adds even more code, and more permutations of ways to call the existing code, in existing versions, before any compelling functionality-benefits pull people to change their code/habits/docs. As mentioned in my comments at #1623 (comment) and nearby, I'd prefer a "greenfield" implementation, free to discard API compatibility, which is informed by the frustrations of heavy-users and some checklist of the kinds of new/experimental extension points that have been commonly-requested or appear in research papers. Whatever bubbles from that drives the new API – so the API is less dictated by either the need to 'wrap' the old, or by hopes-without-real-features-yet-underneath. When that novel implementations reaches certain levels of parity with the old, then porting guides or possible compatibility wrappers could be made, if appropriate.)

@mpenkov
Copy link
Collaborator Author

mpenkov commented Feb 13, 2019

@gojomo I agree with most of your points. Here's the main one I want to clarify:

Moving such 'utility load' functionality to pure functions might make sense, as a design/style direction. (Especially, for non-native formats, as opposed to a 'native-to-gensim load()' where the user would be expected to know the exact class they previously save()d. I suspect the current practice was influenced somewhat by habits from the Java world, where such utility functions need to live on a related class, but in Python it does lead to naive users calling static methods on instances. and being confused about the errors/effects that has. (I forget where, but somewhere in gensim we added an explicit warning about that.)

So, should I go ahead and introduce pure functions? We're breaking backwards compatibility in this PR, so it may be a good opportunity to make things a little better.

Everything below is me thinking out loud about some of your arguments. On one hand, I don't want to turn this PR into a debate, so I'm OK with maintaining the current naming convetion. On the other, I think it's worth capturing the stuff below in case we ever revisit this decision. So, read on only if you have the time/interest :)


I personally don't find the three occurrences of vectors in the proposed fully-specified name excessive. The common convention of putting classes into modules of the same name already dictates two repetitions.

Is that really a common convention in the Python world, though? Here's a popular talk advising against that sort of practice, among other things. I remember it being a thing in Java (many moons ago), but the naming conventions there give birth to monstrosities like ProjectContractChargingPeriodProjectAccountReferenceVM.

Then, having an extra-qualifying method name, that contrasts against other possible (and confusable) kinds-of-loads of the same files, would have to use the same word. And in practice, imports will mean most invocations are just KeyedVectors.load_fasttext_vectors().

Yeah, I agree that most people will do:

from gensim.models.keyedvectors import KeyedVectors
v = KeyedVectors.load_fasttext_vectors()

instead of

import gensim.models.keyedvectors
v = gensim.models.keyedvectors.KeyedVectors.load_fasttext_vectors(...)

but I think one of the main reasons why is our long, nested package names. We're forcing the user to abandon the idea of namespaces because the alternative is impractical. Ideally, it'd be something short and simple like:

import gensim.fasttext
v = gensim.fasttext.load_vectors(...)

Wouldn't that be a better and more Pythonic API?

Unfortunately, we're stuck with the existing package scheme for the foreseable future.

It's tough to armor the code against all the misconceptions of unsophisticated users, but I find every bit of explicitness in names can help head off frustrations (and then questions). And perhaps, helps slightly-more-sophisticated users also improve their clarity-of-understanding, even if their struggles show up less often in forums.

👍

(Regarding a potential 'clean-up' beyond the scope of this ticket: my concern with a "simplifying wrapper" as the 1st step is that it adds even more code, and more permutations of ways to call the existing code, in existing versions, before any compelling functionality-benefits pull people to change their code/habits/docs. As mentioned in my comments at #1623 (comment) and nearby, I'd prefer a "greenfield" implementation, free to discard API compatibility, which is informed by the frustrations of heavy-users and some checklist of the kinds of new/experimental extension points that have been commonly-requested or appear in research papers.

@piskvorky raised the idea of a green-field refactoring in some of our other discussions. I agree that there are benefits in starting from scratch, but I think there are also significant risks and costs. Do we have the ability and motivation to rewrite the fastText (and the other Word2Vec models) from scratch, making sure that we don't break things along the way? Does any one person on the team understand the entire code base sufficiently enough to improve it?

My preference is to gradually improve things, both from the users and developers' perspective. In contrast to a green-field refactoring, adding a wrapper has much less potential to do harm. If the wrapper sucks, people will simply not use it. Yeah, there'll be an additional code path, but I think that'&s a drop in the ocean compared to the problems we're currently facing. Of course, I'm new around here, so I may be misjudging the situation.

Whatever bubbles from that drives the new API – so the API is less dictated by either the need to 'wrap' the old, or by hopes-without-real-features-yet-underneath. When that novel implementations reaches certain levels of parity with the old, then porting guides or possible compatibility wrappers could be made, if appropriate.)

I think there's a catch 22 here:

  1. People won't start using the new API until it contains the functionality that they need
  2. We won't get any feedback about what the needed parts are until people start using the new API

@piskvorky
Copy link
Owner

piskvorky commented Feb 13, 2019

Do we have the ability and motivation to rewrite the fastText (and the other Word2Vec models) from scratch, making sure that we don't break things along the way? Does any one person on the team understand the entire code base sufficiently enough to improve it?

Sure. None of it is rocket science. The core was written in a weekend or two, and then kept accruing both useful functionality and cruft over the years (esp. in the last few years where I focused on other things and allowed student contributions). The result is complex because too many cooks spoiled the broth, but not complex in terms of core workflows.

2. We won't get any feedback about what the needed parts are until people start using the new API

Yes, this is tricky to put into exact specs.

On a high level, users want to train unsupervised text models (incrementally, streaming), persist them (save/load, continue training) and use them in other apps (import inputs from NLTK/spacy; export outputs into sklearn, keras etc).

Gensim started with those goals and still provides them, although the code structure got a bit tangled up, and some of the functionality is of sub-par quality (both code, algorithms and documentation).

But it's not like we're in the dark about what people use or what's inside Gensim. See also Gensim Survey 2018.

@gojomo
Copy link
Collaborator

gojomo commented Feb 14, 2019

I'm unsure how prevalent the "module-named-after-its-main-public-facing-class" convention is across Python; I know it's common in gensim. At least here, I see it as somewhat justifiable by the ease/isolation it offers to a user of just one or a few algorithms. Source, and docs, for what they're focused-on aren't grouped in larger files/pages with a lot of other vaguely-related algorithms.

Obviously existing things should be gradually improved, in-place, as possible (and necessary when doing overlapping work). But as the sole strategy, that also places a drag the potential wins, because:

  • it depends on long-term attention of one or more people keeping the end-in-mind (at least in rough form), at all times navigating/negotiating other incoming/simultaenous piecemeal work
  • it "taxes" any work on breakthrough new capabilities with the requirement to integrate/clean-up older cruft & maintain backward-compatibility at the same time
  • it may close off pursuit of radically-better approaches, because there's no set of incremental-steps that get there
  • every incremental step requires doc-updates, and compatibility-harnesses/deprecation-warnings, and version-sensitive clarifications/qualifications – and generates a "bug-tail" or "support-tail" as outdated online examples or deployed systems are run against the "same-name-but-subtly-different" code. (The requirement to import completely different modules, and use completely new code, to get certain benefits presents a more explicit opt-in threshold.)

Co-Authored-By: mpenkov <penkov@pm.me>
@mpenkov mpenkov added the 3.7.2 label Feb 21, 2019
@mpenkov mpenkov changed the title implement KeyedVectors.load_fasttext_format method implement separate functions to load FT embeddings and models Mar 7, 2019
@mpenkov mpenkov merged commit 58f91d1 into piskvorky:develop Mar 7, 2019
@mpenkov mpenkov deleted the lff branch March 7, 2019 07:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants