Summarisation optimisations #441
Conversation
While the paper showed best performance was obtained with nouns and adjectives, there are applications where including verbs, adverbs, or even prepositions in the keywords list might be desirable.
Looks good, thanks! Is there a way to make use of the sparsity of the adjacency matrix? That would save us a LOT of memory and allow processing much larger inputs. @fbarrios is the adjacency matrix supposed to be symmetric or not? I didn't check the algo, but I see @erbas added some comments around complex numbers coming out of the eigendecomposition, which would suggest it's not symmetric...?
Well, it's packed efficiently in the csr_matrix format, but then unpacked and made dense at full dimension by the addition of the To really scale this up I think you need a different algorithmic approach. All I've done is some obvious optimisations. Some form of count-min sketch is probably the way to go.
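For context on why densifying hurts, the memory gap between a CSR matrix and its dense equivalent is easy to quantify. The numbers below are illustrative, not measurements from this PR:

```python
import numpy as np
from scipy.sparse import random as sparse_random

n = 10_000
# A sparse 10k x 10k matrix with 0.1% nonzeros, stored in CSR format.
A = sparse_random(n, n, density=0.001, format="csr", dtype=np.float64)

# CSR stores only the nonzeros (data + column indices + row pointers).
csr_bytes = A.data.nbytes + A.indices.nbytes + A.indptr.nbytes
# A dense float64 matrix of the same shape stores every entry.
dense_bytes = n * n * 8

print(csr_bytes, dense_bytes)  # the dense form is orders of magnitude larger
```

As soon as a dense matrix (e.g. a damping/teleportation term added to every entry) is summed onto the CSR matrix, the result is dense and the savings are gone, which is the problem described above.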
Yes, unless we find a way to make use of the adjacency matrix's sparsity (and I don't see an easy way), there's no point in the CSR step. We should simply construct it as a dense array to start with, and avoid the pointless conversions (wasting both memory and CPU). Or implement the iterative algo for finding the principal component ourselves. It's very simple AFAIR, via power iterations. Then we can perhaps make use of both sparse @erbas Can you say a little more around what
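For reference, the power-iteration approach mentioned above can be sketched in a few lines. This is a generic sketch, not gensim's implementation; it works on dense arrays and scipy sparse matrices alike, since both support `@`:

```python
import numpy as np
from scipy import sparse

def principal_eigenvector(A, n_iter=100, tol=1e-10):
    """Power iteration: repeatedly apply A and renormalise until the
    direction of the vector stops changing. Converges to the eigenvector
    of the dominant eigenvalue (assuming it is unique in magnitude)."""
    n = A.shape[0]
    v = np.ones(n) / np.sqrt(n)  # arbitrary non-zero starting vector
    for _ in range(n_iter):
        w = A @ v
        w /= np.linalg.norm(w)
        if np.linalg.norm(w - v) < tol:
            return w
        v = w
    return v

# Toy symmetric matrix with eigenvalues 3 and 1; the dominant
# eigenvector is [1, 1] / sqrt(2).
A = np.array([[2.0, 1.0], [1.0, 2.0]])
v = principal_eigenvector(sparse.csr_matrix(A))
print(v)
```

The key point for memory: the only operation on `A` is a matrix-vector product, which CSR supports natively, so the matrix never needs to be densified.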
I looked at
Yes, I agree with your analysis of the code. The graph should be undirected, but when I step through the code the adjacency matrix is not symmetric. Could be a bug? Could be "conceptually symmetric", i.e. numerical issues mean it's not symmetric to many decimal places? The latter seems unlikely to me, as the differences were ~1, not ~1e-4. You asked about
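Distinguishing a genuinely asymmetric matrix from one that is only symmetric up to numerical noise is a one-liner in numpy. The matrix below is made up purely for illustration:

```python
import numpy as np

A = np.array([[0.0, 0.5, 1.0],
              [0.5, 0.0, 0.2],
              [0.0, 0.2, 0.0]])  # A[0, 2] != A[2, 0]: genuinely asymmetric

exact = np.array_equal(A, A.T)           # bit-for-bit symmetric?
approx = np.allclose(A, A.T, atol=1e-8)  # "conceptually symmetric"?
max_diff = np.abs(A - A.T).max()         # ~1 here, not ~1e-4

print(exact, approx, max_diff)
```

A `max_diff` on the order of 1, as reported above, rules out floating-point noise and points at an actual asymmetry in how the matrix is built.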
Let's wait for @fedelopez77's comment, so we don't get too ahead of ourselves :)
I'm sorry I can't confirm this until I get home and run some tests, but I think the matrix is not symmetric.
Ah ok, that explains it. Cheers @fedelopez77 :)
@fedelopez77 My unit tests are failing on Travis because different hardware/library combinations give different keyword lists. I would not have expected the numerical differences between platforms to make such a difference. Any suggestions?
@erbas Maybe you're hitting some singular inputs (degenerate matrix spectrum)? Otherwise we want the algo to be stable, of course: small numerical imprecisions should have only a proportionally small impact on the result.
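To illustrate the degenerate-spectrum concern: when the largest eigenvalue is repeated, the "principal" eigenvector is not unique, and different BLAS builds can legitimately return different (equally valid) vectors. A made-up minimal example:

```python
import numpy as np

# Eigenvalue 2 has multiplicity 2, so any unit vector in the span of the
# first two axes is a valid dominant eigenvector; the answer depends on
# tie-breaking inside the eigensolver.
A = np.diag([2.0, 2.0, 1.0])
vals, vecs = np.linalg.eigh(A)
top = np.isclose(vals, vals.max())
print(int(top.sum()))  # the top eigenvalue is degenerate (multiplicity 2)
```

On such inputs, identical code can produce different rankings across platforms without any code being "wrong".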
@piskvorky I'm developing on OS X and Travis is running Linux inside Docker, so I would expect small differences from different BLAS libraries at least. But emphasis on "small". The really odd error is the mismatched number of keywords: ask for 15, get 16. I don't understand :(
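One plausible mechanism for the 15-vs-16 mismatch (purely hypothetical; not verified against the gensim code): if keywords are selected by a score cutoff rather than a fixed slice, a BLAS-level perturbation near the cutoff changes the count:

```python
import numpy as np

# Hypothetical keyword scores; one score sits just below the cutoff.
scores = np.array([0.90, 0.80, 0.50, 0.49999999])
cutoff = 0.5

n_keywords = int((scores >= cutoff).sum())

# Simulate tiny platform-dependent numerical noise near the cutoff.
perturbed = scores + 2e-8
n_keywords_perturbed = int((perturbed >= cutoff).sum())

print(n_keywords, n_keywords_perturbed)  # the count differs by one
```

If that is what happens, the fix is less about numerical precision and more about making the selection rule robust to ties near the boundary.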
This reverts commit 45e6c5b.
@erbas what's the status here? We'll make a new release this week, would be great to get this in.
Hi @piskvorky, so it all works on Python 2. I don't understand Unicode on Python 3 well enough to understand the issue there. The unit tests I wrote are failing because of the previously noted instabilities in keywords. Again, I'm not sure what's going on; I would have expected the principal eigenvector to be pretty robust. It seems more likely to me that there's something in the calculation of the scores which is platform-dependent.
@tmylk any ideas? Can you check / fix?
Merged just a keywords test with Linux keywords in PR #474. It passes in Python 2 in pre-BM25, pre-PR441 and post-PR441 code, on my Fedora and on the Travis docker. This shows that this PR hasn't deviated. The instability due to platform/BLAS version is a separate issue, but a pretty important one. And of course @erbas actually gets different keywords on OS X.
Hi @tmylk, I'm on numpy 1.9.2 and Xcode 7.1 for the BLAS (contained in the OS X Accelerate framework). It would be worth checking that the matrix doesn't have different values on different platforms too. Thanks for patching the Python 3 issues; not my strong point.
@erbas do you have any advice on how to get the keywords in the file test_data/mihalcea_tarau.kw.txt? I haven't had any luck producing them. Everywhere I try, I get a different set of keywords. I created a test with a new set in #474 and it passes on unixes and OS X (python=2.7.9, numpy=1.9.2, Xcode=7.0). I'm thinking about removing the mihalcea_tarau.kw.txt changes from this PR and using the ones in #474 instead.
Nice work! I'm afraid I really don't have any better ideas. In some ways it's reassuring to think this sample text is somehow pathological for this algo, but that raises the question of which other texts might be problematic. Seems reasonable to me to replace the tests.
Summarisation and TextRank optimisations. Updated version of PR #441
@erbas All merged in. Thanks for the optimisations! Ratio parameters in tests updated to values more resistant to numeric fluctuations.
@tmylk Thanks for digging into this, and sorry for not figuring it out :(
Added speed improvements to keyword and summarisation methods.