Error while summarizing text #805
Comments
I am also affected by this bug...
@madewild could you post a code snippet to reproduce?
Unfortunately the text fed to the summarizer is confidential, but my guess is that the error was triggered by an unusually high repetition of some sentences... I also notice now that the error raised was not exactly the same:
@tmylk The failure seems to come from the number of nodes in the graph used to compute the PageRank of the corpus graph. After unreachable nodes are removed, the graph is left with only 2 nodes, so a 2 × 2 matrix is built, for which scipy.sparse.linalg.eigs() fails with k=1. We should probably raise an error if the number of nodes (after removing unreachable nodes) drops below 3.
@MridulS could you submit a PR for this?
@tmylk What kind of error should I raise?
The error should say "Please add more sentences to the text. The number of reachable nodes is below 3"
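A minimal sketch of the check proposed above. The helper name `check_reachable_nodes` and the use of a plain collection of node ids are assumptions made for illustration; they are not gensim's actual graph type or API.

```python
# ARPACK-based eigs() requires k < n - 1, so k=1 needs at least 3 nodes.
MIN_REACHABLE_NODES = 3


def check_reachable_nodes(nodes):
    """Raise ValueError if too few reachable nodes remain for eigs(k=1).

    `nodes` is any sized collection of node ids (hypothetical input,
    not gensim's internal graph class).
    """
    if len(nodes) < MIN_REACHABLE_NODES:
        raise ValueError(
            "Please add more sentences to the text. "
            "The number of reachable nodes is below 3"
        )
    return nodes
```

Such a check would run after unreachable nodes are removed and before the PageRank matrix is built, so the caller sees a readable message instead of the ARPACK error.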
Hi, I worked on this issue and have sent a pull request for it. Please review.
…words (piskvorky#885)
* Added check in summarize_corpus to fix bug in summarizer
* Fix piskvorky#805: Added check in summarizing text
* Added test for checking low number of distinct words in text
* Text split method changed to allow running in Python 3.3 and above.
* Change to fix test in python versions 3.3 and higher
* Added blank line test_wikicorpus.py file Added blank line to fix issue with travis CI
Hello, I think the problem is still open. I reproduced this error with document 1403 from the Hulth2003 dataset:
Traceback (most recent call last):
Looking at the document, it seems plausible that the problem is the term frequencies: all the terms have a frequency of 1.
Hmm, that's not good; it looks like a bug. Can you suggest a fix, @vitordouzi?
@piskvorky, no, I don't, sorry! Maybe this TODO in the pagerank_weighted.py file can help.
File "/gensim/summarization/pagerank_weighted.py", line 24, in pagerank_weighted
What exactly are the complex eigenvectors?
Hello everyone. I started investigating this issue and, basically, it is the same one that @MridulS described, but in a different function:
On the text given by @vitordouzi, we end up with the graph:
which results in a 2x2 matrix, and pagerank fails. But I'm not sure how to fix this. @vitordouzi, @menshikh-iv, any ideas on the desired outcome? An exception doesn't feel right this time. Maybe set some predefined scores instead of running pagerank? Or maybe add a special case to pagerank?
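One way to picture the "special case" option is a PageRank routine that does not depend on ARPACK at all. The sketch below uses plain power iteration, which is a swapped-in technique rather than the `eigs`-based code in pagerank_weighted.py; the function name, the dict-of-dicts graph format, and the damping value of 0.85 are all assumptions made for illustration.

```python
def pagerank_power_iteration(weights, damping=0.85, tol=1e-10, max_iter=1000):
    """Compute PageRank scores by power iteration (works for any graph size).

    `weights` maps each node to a dict {neighbor: edge_weight}. This format
    is hypothetical and is not gensim's graph class.
    """
    nodes = list(weights)
    n = len(nodes)
    if n == 0:
        return {}

    # Start from a uniform distribution over the nodes.
    scores = {node: 1.0 / n for node in nodes}
    out_sum = {node: sum(weights[node].values()) for node in nodes}

    for _ in range(max_iter):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(scores[node] for node in nodes if out_sum[node] == 0)
        new_scores = {}
        for node in nodes:
            rank = (1.0 - damping) / n + damping * dangling / n
            for src in nodes:
                w = weights[src].get(node, 0.0)
                if w:
                    rank += damping * scores[src] * w / out_sum[src]
            new_scores[node] = rank

        # Stop once the scores change by less than the tolerance.
        delta = sum(abs(new_scores[node] - scores[node]) for node in nodes)
        scores = new_scores
        if delta < tol:
            break
    return scores
```

For a two-node graph such as `{"a": {"b": 1.0}, "b": {"a": 1.0}}`, this returns equal scores for both nodes, which matches the "predefined scores" idea for the degenerate case without raising an exception.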
Anyway, some notes about
About (1) and (2), @xelez: we need to handle the special case; the comment from (3) should be useful too.
* added a regression test for summarization.keywords()
* handled case with graph smaller than 3 nodes
* removed TODO about complex eigenvectors
* added more comments
Hi,
I've received the following error when trying to summarize the body of this news article:
https://www.theguardian.com/media/2016/jun/19/sun-times-brexit-in-out-shake-it-all-about
The error follows:
File "/home/apps/comment_parser/venv/local/lib/python2.7/site-packages/gensim/summarization/summarizer.py", line 202, in summarize
most_important_docs = summarize_corpus(corpus, ratio=ratio if word_count is None else 1)
File "/home/apps/comment_parser/venv/local/lib/python2.7/site-packages/gensim/summarization/summarizer.py", line 161, in summarize_corpus
pagerank_scores = _pagerank(graph)
File "/home/apps/comment_parser/venv/local/lib/python2.7/site-packages/gensim/summarization/pagerank_weighted.py", line 24, in pagerank_weighted
vals, vecs = eigs(pagerank_matrix.T, k=1) # TODO raise an error if matrix has complex eigenvectors?
File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 1271, in eigs
ncv, v0, maxiter, which, tol)
File "/usr/lib/python2.7/dist-packages/scipy/sparse/linalg/eigen/arpack/arpack.py", line 685, in __init__
raise ValueError("k must be less than ndim(A)-1, k=%d" % k)
ValueError: k must be less than ndim(A)-1, k=1
Regards,