Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Summarisation optimisations #441

Merged
merged 13 commits into from Oct 7, 2015
Merged

Summarisation optimisations #441

merged 13 commits into from Oct 7, 2015

Conversation

erbas
Copy link
Contributor

@erbas erbas commented Aug 29, 2015

Added speed improvements to keyword and summarisation methods.

erbas added 6 commits August 29, 2015 19:31
While the paper showed best performance was obtained with nouns and
adjectives, there are applications where including verbs, adverbs, or
even prepositions in the keywords list might be desirable.
@piskvorky
Copy link
Owner

Looks good, thanks!

Is there a way to make use of sparsity of the adjacency matrix? That would save us a LOT of memory and allow processing much larger inputs.

@fbarrios is the adjacency matrix supposed to be symmetric or not? I didn't check the algo, but I see @erbas added some comments around complex numbers coming out of the eigendecompostion, which would suggest it's not symmetric...?

@erbas
Copy link
Contributor Author

erbas commented Aug 29, 2015

Well, it's packed efficiently in the csr_matrix format, but then unpacked and made dense at full dimension by the addition of the probability_matrix. I don't have good theory on this but debugging it looked like the adjacency_matrix was not symmetric.

To really scale this up I think you need a different algorithmic approach. All I've done is some obvious optimisations. Some form of count-min-sketch is probably the way to go.

@piskvorky
Copy link
Owner

Yes, unless we find a way to make use of adjacency matrix' sparsity (and I don't see an easy way), there's no point in the CSR step. We should simply construct it as a dense array to start with, and avoid the pointless conversions (wasting both memory and CPU).

Or implement the iterative algo for finding the principal component ourselves. It's very simple AFAIR, via power iterations. Then we can perhaps make use of both sparse adjacency_matrix and the dense probability matrix (scalar?) in an efficient way, without having to combine them explicitly.

@erbas Can you say a little more around what WEIGHT_THRESHOLD does? What are the tradeoffs here? Worth making it a user-tunable parameter (rather than a hardwired constant)? Add documentation?

@piskvorky
Copy link
Owner

I looked at graph and as far as I can see, it should be undirected. Which means the adjacency matrix should be symmetrical, so we can simply use eigh / eigsh. @fedelopez77 @fbarrios can you please confirm?

@erbas
Copy link
Contributor Author

erbas commented Aug 31, 2015

Yes, I agree with your analysis of the code. The graph should be undirected, but when I step through the code the adjacency matrix is not symmetric. Could be a bug? Could be "conceptually symmetric", i.e. numerical issues mean it's not symmetric to many decimal places? The latter seems unlikely to me as the differences were ~ 1, not 1.e-4.

You asked about WEIGHT_THRESHOLD. The smallest non-zero entry in the matrix for a small sample of random texts was ~ 0.1, so 0.001 seemed a reasonable threshold. Perhaps it's safer to set it to zero? I can't see much advantage in making it user tuneable.

@piskvorky
Copy link
Owner

Let's wait for @fedelopez77's comment, so we don't get too ahead of ourselves :)

@fedelopez77
Copy link
Contributor

I'm sorry I can't confirm this until I get home and run some tests, but I think the matrix is not symmetric.
In the original TextRank paper, it is symmetric, but in the implementation that was merged, we use a variation of the Okapi BM25 function to set the edge's weights. Given that this function uses the results of the tf-idf function on the words of only one of the BM25 arguments, the results aren't the same (i.e. if A and B are sentences, BM25(A, B) != BM25(B, A).
And that's is also why we need a directed graph.

@piskvorky
Copy link
Owner

Ah ok, that explains it. Cheers @fedelopez77 :)

@erbas
Copy link
Contributor Author

erbas commented Sep 1, 2015

@fedelopez77 My unit tests are failing on Travis because different hardware/library combinations give different keyword lists. I would not have expected the numerical differences between platforms to make such a difference, any suggestions?

@piskvorky
Copy link
Owner

@erbas Maybe you're hitting some singular inputs (degenerate matrix spectrum)?

Otherwise we want the algo to be stable of course = small numerical imprecisions should have only proportional (small) impact on the result.

@erbas
Copy link
Contributor Author

erbas commented Sep 1, 2015

@piskvorky I'm developing on OS X and Travis is running linux inside Docker, so I would expect small differences from different BLAS libraries at least. But emphasis on "small". The really odd error is the mismatch number of keywords: ask for 15, get 16. I don't understand :(

@piskvorky
Copy link
Owner

@erbas what's the status here? We'll make a new release this week, would be great to get this in.

@erbas
Copy link
Contributor Author

erbas commented Sep 22, 2015

Hi @piskvorky, so it all works on python 2. I don't understand unicode on python 3 well enough to understand the issue there. The unit tests I wrote are failing because of the previously noted instabilities in keywords. Again I'm not sure what's going on, I would have expected the largest eigenvector to be pretty robust. Seems more likely to me that there's something in the calculation of the scores which is platform dependent.

@piskvorky
Copy link
Owner

@tmylk any ideas? Can you check / fix?

@tmylk tmylk mentioned this pull request Sep 30, 2015
@tmylk
Copy link
Contributor

tmylk commented Sep 30, 2015

Merged just a Keywords test with Linux keywords in PR-474 It passes in Python 2 in pre-BM25, pre-PR441 and post-PR441 on my Fedora and Travis docker. This shows that this PR hasn't deviated.

The instability due to platform/BLAS version is a separate issue but pretty important.
For example, the test_keywords_ratio passes on my Fedora but fails on Travis.

And of course @erbas actually gets different keywords on OSX.
@erbas What is the BLAS and numpy versions where the tests are passing? Will try to reproduce it in some OSX VM.

tmylk added a commit that referenced this pull request Sep 30, 2015
@erbas
Copy link
Contributor Author

erbas commented Oct 2, 2015

Hi @tmylk, I'm on numpy 1.9.2 and Xcode 7.1 for the BLAS (contained in the OS X Accelerate Framework). Would be worth checking the matrix doesn't have different values on different platforms too.

Thanks for patching the Python3 issues, not my strong point.

@tmylk
Copy link
Contributor

tmylk commented Oct 6, 2015

@erbas do you have any advice on how to get the keywords in this file test_data/mihalcea_tarau.kw.txt ?

I didn't have any luck producing them. Everywhere I try I get one different set of keywords. I created a test with this new set in #474 and it passes on unix'es and OSX (python=2.7.9 numpy=1.9.2 Xcode=7.0)

Thinking about removing mihalcea_tarau.kw.txt changes from this PR and using the ones in #474 instead.

@erbas
Copy link
Contributor Author

erbas commented Oct 6, 2015

Nice work! I'm afraid I really don't have any better ideas. In some ways it's reassuring to think this sample text is somehow pathological for this algo but that does beg the question of which other texts might be problematic. Seems reasonable to me to replace the tests.

On 6 Oct 2015, at 7:39 AM, Lev Konstantinovskiy notifications@github.com wrote:

@erbas do you have any advice on how to get the keywords in this file test_data/mihalcea_tarau.kw.txt ?

I didn't have any luck producing them. Everywhere I try I get one different set of keywords. I created a test with this new set in #474 and it passes on unix'es and OSX (python=2.7.9 numpy=1.9.2 Xcode=7.0)

Thinking about removing mihalcea_tarau.kw.txt changes from this PR and using the ones in #474 instead.


Reply to this email directly or view it on GitHub.

@tmylk tmylk mentioned this pull request Oct 7, 2015
@tmylk tmylk merged commit 8fdfff7 into piskvorky:develop Oct 7, 2015
tmylk added a commit that referenced this pull request Oct 7, 2015
Summarisation and TextRank optimisations. Updated version of PR #441
@tmylk
Copy link
Contributor

tmylk commented Oct 7, 2015

@erbas All merged in. Thanks for the optimisations!

Ratio parameters in tests updated to values more resistant to numeric fluctuations.
The cut-off used to be in the middle of a flat patch at 0.08 and that made it very unstable. There were differences even between consequent Travis runs.
scores

@erbas
Copy link
Contributor Author

erbas commented Oct 8, 2015

@tmylk Thanks for digging into this, and sorry for not figuring it out :(

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants