Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example Cython vocabulary count code (depends on external libraries, may complicate build) #556

Closed
wants to merge 5 commits into from

Conversation

honnibal
Copy link

@honnibal honnibal commented Dec 5, 2015

This PR passes tests but is not ready to merge. Added as part of discussion in PR #406 and Issue #400

This provides some Cython code to do fast vocabulary counting, with fall-back to the previous Python implementation if the code can't compile. Uses libraries developed for spaCy, but I've avoided depending on spaCy itself.

While this implementation is passing the tests, and preliminary experiments suggest it's much faster, this code comes with some significant caveats. It depends on my MurmurHash wrapper, which means there's this problem of finding the C header files to compile the code. This is causing us a lot of installation headaches for spaCy. It might be that your fall-back strategy avoids the problem. But, maybe not. Beware :).

…nder-document Cython modules, preshed and MurmurHash
…ds. Currently the's a likely bug in the counts in the same of mixed unicode and string data, as the two will receive different keys.
@honnibal
Copy link
Author

honnibal commented Dec 5, 2015

Ran some benchmarks, using the POS tagged and chunked Reddit data I'm using for some "sense2vec" experiments. These run a min_count of 5, and the data is atypical — the word distribution is far sparser than normal, because the vectors are over entities, chunks, and pos-tagged words.

Sentences Original Cython
10^6 6s 3s
10^7 90s 28s
10^8 3540s 301s

The 10^8 dictionary was occupying 50-60% of the system's memory, so about 8-10gb of data. I don't think the system was swapping. My experience is that past a certain size, the Python dictionary starts to perform really badly. I've never looked at the code to try to figure out why.

The Cython code only remembers the strings for the words that cross the minimum count threshold, so memory requirements are much smaller. The extent of that will depend on the minimum threshold selected, and the data being processed.

I would normally run something like this with simple multi-processing. I like the little wrapper provided by joblib for this. Example: https://github.com/honnibal/spaCy/blob/master/examples/pos_tag.py

@piskvorky
Copy link
Owner

This is great, thanks @honnibal !

Not sure why the large Python dict is so disproportionately slower, I've never observed/noticed that before.

Regarding multiprocessing -- how would you do this? It looks like the counting is way too fast to benefit from coarse-grained message passing between processes. What kind of inter-process communication do you have in mind?

@janrygl Also relevant to our new "counter abstraction API" (hyperloglog, MinCountSketch) in #508.

@piskvorky
Copy link
Owner

@cscorley @gojomo We should also decide how to go about our "pure Python" restriction going forward.

It used to be that everything in gensim was pure Python, using NumPy where performance was critical. This is no longer true and we offer alternative, optional C code paths in word2vec and doc2vec (and LDA soon).

Advantages of requiring a mandatory compiler:

  • simpler code, drop alternative code paths
  • more freedom to optimize
  • drop compiled dependencies that are only there for performance (looking at scipy in particular, Get rid of scipy #557 )

Disadvantages:

  • gensim harder to install for non-power users
  • "for humans" motto eroded
  • slippery slope -- may end up with a pure-C package instead :)

I probably forgot other important points; feel free to add / extend.

@honnibal
Copy link
Author

honnibal commented Dec 6, 2015

Please benchmark this on your own machines and data. As I said, I only ran this on my "nlp2vec" experiments, where the vocab is super sparse. So, your mileage may vary.

My take on the Python vs Cython thing is, if you're depending on numpy your users need either conda, wheel install, or C compiler anyway. Otherwise how are they using numpy? Most of numpy is written in C and Cython. So already you have tonnes of compiled code in your library :)

Personally I'd also rather fail explicitly if the library is not installed correctly. If Gensim somehow fails to use the compiled path, the word2vec and doc2vec code is actually non-functional for most use-cases.

@honnibal
Copy link
Author

honnibal commented Dec 6, 2015

Re multi-processing: send the data to the workers in large batches, like a batch-size of 100k or 1m or whatever. I use toolz.partition for this. You can see this in the POS tagging example: https://github.com/honnibal/spaCy/blob/master/examples/pos_tag.py

@piskvorky
Copy link
Owner

NumPy is everywhere, that's no issue :)

Re. chunking -- sure, but I suspect the communication overhead (pickling, unpickling) would be greater than the actual work done then (just incrementing counts). Worth checking though.

@cscorley
Copy link
Contributor

cscorley commented Dec 6, 2015

I starting to think that the advantages that Cython gives are worth dropping pure Python paths (maybe this warrants opening a separate issue thread?)

First, as @honnibal mentioned already, to install numpy from scratch requires compilers for both C and Fortran. That's already one more compiler than Cython needs 😉 Both packages are available through Anaconda and the like, and also various Linux distributions. I'm not sure users are losing out much here. A question here is how is Anaconda currently handling gensim with/without Cython?

Second, maintaining two code paths that do the same thing is work (shoutout to @gojomo). Related, triaging bug reports between actual bugs and simply not having Cython working correctly is also work.

Third, the "for humans" part should be expressed by flexible APIs around this code. Humans shouldn't see any of these details, whether it is in Cython shouldn't matter past day 0. For instance, something that needs work is the difference between using Similarity for building an index with LSI/LDA vs how most_similar works in Doc2vec. Also, making it clear/obvious as to why inference can lead to slightly different results is also a common pain-point. These are just a couple of (more pressing) examples where "for humans" isn't working well.

Finally, one point against Cython: it will discourage newcomers from contributing bug fixes and features. This is one reason why I started using this package; to be able to read and modify code as I needed since it was in Python. Cython will be another step up for a lot of people.

@gojomo
Copy link
Collaborator

gojomo commented Dec 7, 2015

The big slowdown in the 10^8 sentences case is suspicious to me as well; it may indicate something that can be fixed in pure-python.

For a multiprocessing speed win, we could also work with text sources that can provide ranges to be read from each process independently. (Either: split over many files, or a single-file in a format that allows random-seeks to aligned records.) Then each process is assigned files/ranges (rather than fed giant chunks of text from a bottleneck master reading process), does its own open/seek/read, and only reports back its full tallies at the end, for a final merge. That could allow nearly-linear speed improvement with the number of processors (memory permitting).

There are some rare cases where @honnibal's map-of-hashes technique could result in a mis-count – 64-bit hash collisions. (Barring intentionally-crafted collisions, you'd only expect a first collision when going over 5 billion tokens.) In practice, no one might care, but my preference would be to always offer a precise count as a prominent option (and perhaps the default), and make it clear when any approximation/optimization for speed (or compactness) might be used instead.

Performance can be very critical for this problem domain, for usual datasets – so I think there's no escaping some uses of closer-to-native code... especially when it offers 10x+ speedups. In all cases keeping APIs as 'pythonic' as possible still serves the main goal. (Maybe at some point the cython could be further minimized, retaining similar performance, with tools like Numba? I've seen nice demos but haven't yet tried it...)

@piskvorky
Copy link
Owner

I think this new paradigm, using multiple stream readers rather than a single one, could be useful. That scenario is fairly common in practice.

Question is, how do we wrap that in a sane API? Start offering corpora ctr parameter, instead of the current corpus (sentences, etc)? For example lda = LdaModel(corpora=[corpus1, corpus2, corpus2], num_topics=100)?

We're getting slightly off-topic, but it seems a major API facelift in gensim is in order :)

@syllog1sm
Copy link

There are some rare cases where @honnibal's map-of-hashes technique could result in a mis-count – 64-bit hash collisions. (Barring intentionally-crafted collisions, you'd only expect a first collision when going over 5 billion tokens.) In practice, no one might care, but my preference would be to always offer a precise count as a prominent option (and perhaps the default), and make it clear when any approximation/optimization for speed (or compactness) might be used instead.

Well, everything about the operation of this algorithm is approximation :). I mean, even for these counts,
what we're using them for is to create a frequency estimate. It would make sense to smooth the counts, but we don't, because it doesn't matter — we just use the literal ones.

So, I think it would actually be misleading and weird to talk about this source of imprecision, or worse, to surface it to the user as a choice. Why should the user have to think through the logic of the birthday paradox, check your estimate of the hash collision probability, and then see that indeed the hash collision probability totally doesn't matter?

@tmylk
Copy link
Contributor

tmylk commented Jan 10, 2016

@piskvorky Specifically about this PR - is it ready to merge? Are we taking on the new dependencies?

@piskvorky
Copy link
Owner

Do not merge yet; I owe Matthew a more thorough API plan and roadmap as discussed. Didn't get to that yet.

@honnibal
Copy link
Author

Currently revisiting this.

It's possible to do much better than the Cython code above, but only by accepting a much narrower API. I have a version now that seems to be about 5x faster. It seems to be disk bound at this point.

Specifically if you require the data to be passed in as a byte-string with space/newline delimited tokens, the program can be very fast. But if users can pass in tokens as a sequence of strings, I think performance will always be sort of bad.

I think it's often hard to add performance back in after you've committed to a certain (Python) API. For the use-cases I'm interested in, performance is super important. I'm often telling people that doing NLP is like computing in a world where Moore's law never happened: job sizes outpace improvements in computational resources.

So, for my own use of these things, I've found myself wanting much more performance oriented, and much more restrictive, versions of some of the Gensim functionality. The question is whether these should be merged into Gensim, or whether it's more natural to break these out into another project.

@honnibal honnibal closed this Jan 26, 2016
@honnibal honnibal reopened this Jan 26, 2016
@honnibal
Copy link
Author

(btw the closing this was a misclick =/)

@piskvorky
Copy link
Owner

No worries :)

Regarding non-Python, constrained APIs: definitely not the path we want to take with gensim, generally speaking.

I mean, we could simulate the "raw string" approach by concatenating all tokens into a single string, then passing that on, rather trivially. But I suspect that would negate any performance gains again, right :)

Or, we could offer both options. Python-friendly and flexible for normal users and use-cases, and this "input comes in very restricted C-like form" for power users.

Re. stand alone word-counting library: I like that idea. We're also working on algorithmic improvements in parallel to such low-level optimizations (approximate counts, parallelization; #508), and I think this particular task is universal enough, and hard enough, to be useful as a standalone lib.

My only constraint there is that the lib is sane to install and maintain across platforms, so we can add it as a dependency from gensim.

@menshikh-iv
Copy link
Contributor

Ping @honnibal, what status of this PR? Will you finish it soon?

@piskvorky
Copy link
Owner

@menshikh-iv this is one of our GSoC projects this year, I believe.

I'm in favour of making this "fast, parallelized, memory efficient counting" functionality a separate package, rather than a subpackage of gensim.

@menshikh-iv
Copy link
Contributor

Connected with #1446

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants