[SPARK-2199] [mllib] topic modeling #1269
Conversation
Jenkins, add to whitelist.
Jenkins, test this please.
QA tests have started for PR 1269. This patch merges cleanly.
QA results for PR 1269:
QA tests have started for PR 1269. This patch merges cleanly.
QA results for PR 1269:
@akopich Thanks for working on PLSA! This is a big feature and it introduces many public traits/classes. Could you please summarize the public methods? Some of them may be unnecessary to expose to end users and we should hide them. The other issue is about the code style. Please follow the guide at https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide and update the PR, for example:
QA tests have started for PR 1269. This patch merges cleanly.
Thank you for your recommendations. Sorry about the code style; I believed it was OK since sbt/sbt scalastyle found no issues. I hope it's OK now. I've removed the "created by ..." comments generated by IntelliJ, organized imports, added docs for every public class/trait and method, fixed indentation, and hidden all the classes/traits that end users don't need.
QA tests have started for PR 1269. This patch merges cleanly.
QA results for PR 1269:
QA results for PR 1269:
I probably need your help. I have no idea why the tests fail somewhere in spark.streaming. Could you please have a look at Jenkins' log and give me a hint?
Jenkins, retest this please. |
testPLSA(plsa)
  }
}
Add a blank line
Is this a problem? You mentioned it several times but I don't think a trailing newline is required in Java or Scala.
It seems that git needs a newline at the end of the file.
QA tests have started for PR 1269. This patch merges cleanly.
@akopich The failed tests might be irrelevant to this PR. It would be nice if you could make the public interfaces minimal and provide a summary of them. For example, you can make "robust" a parameter of
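One way to read this suggestion: instead of exposing PLSA and RobustPLSA as separate public entry points, keep a single trainer and fold robustness into its parameters. A hypothetical sketch (all names here are invented for illustration, not taken from the PR):

```scala
// Hypothetical API sketch: one public trainer, with "robust" folded into
// the parameters instead of a separate public class. PLSAParams and
// TopicModel are made-up names, not the PR's actual API.
case class PLSAParams(numTopics: Int, maxIterations: Int = 30, robust: Boolean = false)

class TopicModel(val topicsOverWords: Array[Array[Double]])

object PLSA {
  // docs: per document, word id -> count
  def train(docs: Seq[Map[Int, Int]], params: PLSAParams): TopicModel =
    if (params.robust) trainRobust(docs, params) else trainPlain(docs, params)

  // Stubs standing in for the real EM loops.
  private def trainPlain(docs: Seq[Map[Int, Int]], p: PLSAParams): TopicModel =
    new TopicModel(Array.fill(p.numTopics, 4)(1.0 / 4))

  private def trainRobust(docs: Seq[Map[Int, Int]], p: PLSAParams): TopicModel =
    trainPlain(docs, p) // a robust variant would add noise/background components
}
```

With this shape, end users see one entry point and the robust/plain distinction stays an implementation detail behind the flag.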
QA results for PR 1269:
import com.esotericsoftware.kryo.io.{Input, Output}
import com.esotericsoftware.kryo.{Kryo, Serializer}
import gnu.trove.map.hash.TObjectIntHashMap
This can be replaced with breeze.util.Index
Thank you very much! I wish I knew earlier about breeze.util.Index.
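For readers unfamiliar with it, breeze.util.Index keeps a bidirectional mapping between objects and dense integer ids, which is what the Trove map provided here. A rough sketch, assuming breeze's mutable Index API (treat the method names as indicative, not authoritative):

```scala
import breeze.util.Index

// Sketch only: replacing TObjectIntHashMap[String] with breeze.util.Index.
// Assumption: Index.index(t) assigns a fresh dense id on first sight and
// returns the existing id on later calls, so it doubles as the word-to-id map.
val wordIndex = Index[String]()
val tokens = Seq("topic", "model", "topic")
val ids = tokens.map(wordIndex.index)   // repeated words reuse their id
// wordIndex.get(i) maps an id back to its word, giving both directions
// without maintaining a second map.
```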
It looks like the higher the requested number of topics, the larger the documents have to be, or else perplexity goes to
QA tests have started for PR 1269 at commit
@chazchandler Probably it's possible to handle this situation in a better manner. I'll think about it. All suggestions are welcome.
QA tests have finished for PR 1269 at commit
@mengxr Unfortunately, I'm not sure I understand what you suggest for robust PLSA. Do you mean there should be some kind of facade for PLSA and RobustPLSA?
QA tests have started for PR 1269 at commit
QA tests have started for PR 1269 at commit
QA tests have finished for PR 1269 at commit
QA tests have finished for PR 1269 at commit
Is there a straightforward way to get the

Also, once infer has been run, what is the next expected step in the workflow for evaluating the relevancy of additional documents to the corpus and/or training the corpus on new documents? That is, how would one take the results of infer and use them for these purposes?
QA tests have started for PR 1269 at commit
I've been looking at the various topic modeling PRs (3 currently) to try to get a sense of how they compare in terms of accuracy and speed. By "scaling," I really meant speed, or comparing running times across implementations to get a sense of what is fastest & why. I'm envisioning the comparison on a small cluster at least; I'm hoping to run some such tests myself. Computing scaling curves as the # of machines increases would be awesome but should probably come later. For the test failures, I'll wait a little bit and then re-run the tests. |
How do you compare accuracy? Perplexity means nothing but perplexity: topic models can be reliably compared only via an application task (e.g. classification, recommendation...). Should I add the dataset for a "perplexity sanity check" to the repo? I am about to use 1000 arXiv papers. This dataset is about 20 MB (5.5 MB zipped).
Yes, "accuracy" meant some kind of metric like perplexity. I agree perplexity does not correlate exactly with human perception, but it's as good as it gets (assuming no one here has the resources to run real experiments right now). You don't need to add a dataset to the repo, but posting any results you get would be helpful. I'll post some results once I have some. |
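For concreteness, the perplexity being compared in these comments is conventionally defined as exp of the negative average log-likelihood per token, where p(w|d) mixes the topic-word and document-topic distributions. A generic in-memory sketch (not the PR's code; the function shape is an assumption for illustration):

```scala
// Perplexity as typically defined for topic models:
//   perplexity = exp( - sum_{d,w} n(d,w) * log p(w|d) / totalTokens )
// where p(w|d) = sum_t phi(w|t) * theta(t|d).
// docs: per document, word id -> count; wordGivenDoc: p(w|d) by doc position.
def perplexity(
    docs: Seq[Map[Int, Int]],
    wordGivenDoc: (Int, Int) => Double
): Double = {
  var logLik = 0.0
  var tokens = 0L
  for ((doc, d) <- docs.zipWithIndex; (w, n) <- doc) {
    logLik += n * math.log(wordGivenDoc(w, d))
    tokens += n
  }
  math.exp(-logLik / tokens)
}
```

A quick sanity check of the definition: a model that assigns every word a uniform probability 1/V has perplexity exactly V, so lower values mean the model concentrates probability on the words actually observed.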
I've performed a sanity check on the dataset I've described above.

PLSA: the tm project obtains perplexity of

RobustPLSA:

Seems to be sane.
Test build #24647 has finished for PR 1269 at commit
And tests fail again in an obscure manner... |
@akopich I've filed a JIRA to investigate that test failure, since it looks like a flaky streaming test: https://issues.apache.org/jira/browse/SPARK-4905 |
I've fixed perplexity for robust PLSA and updated the perplexity value in the comment above. Now they are almost the same.
By the way, maybe it's off topic, but this is related to initial approximation generation. Suppose one has

But

What's a proper way?
collection length computation was wrong
Test build #24650 has finished for PR 1269 at commit
@akopich The right way to do pseudo-randomness is to do:
You can find examples in, e.g., mllib/tree/impl/BaggedPoint.scala |
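The pattern in BaggedPoint.scala is, roughly: create one RNG per partition inside mapPartitionsWithIndex, deriving its seed from a base seed plus the partition index, so the result is reproducible and workers never share a driver-side Random. A hedged sketch of applying that to generating initial random topic mixtures (the function and variable names are made up for illustration):

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

// Sketch of per-partition seeding (cf. mllib/tree/impl/BaggedPoint.scala):
// each partition gets its own deterministic RNG, seeded from the base seed
// and the partition index.
def withRandomTopics(docs: RDD[Array[Int]], numTopics: Int, seed: Long)
  : RDD[(Array[Int], Array[Double])] =
  docs.mapPartitionsWithIndex { (partIdx, iter) =>
    val rng = new Random(seed + partIdx)   // deterministic per partition
    iter.map { doc =>
      val theta = Array.fill(numTopics)(rng.nextDouble())
      val s = theta.sum
      (doc, theta.map(_ / s))              // normalized random topic mixture
    }
  }
```

The key point is that the seed depends only on `seed` and the partition index, not on task scheduling, so re-running the job reproduces the same initial approximation.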
@akopich I had hoped to get this into MLlib, but after more consideration, I believe it is too complex. Currently, what would be ideal is a simple implementation of LDA since that is all that most users need. While generalizations like robust PLSA may outperform LDA with proper tuning, it’s somewhat of a research area, and it may be better to go with LDA since it has been very widely tested and used. However, I am sure some users would want to use your implementation of Robust PLSA, so it would be valuable for you to make it available as a package for Spark.

The best path right now, I believe, will be to create a simple PR with a minimal public API, where that API should be extensible with (a) extra parameters/features and (b) alternate optimization/learning algorithms. I've posted a public design doc on the LDA JIRA here, and I’m going to submit such a PR. I would of course appreciate your feedback on it. Thanks very much for your understanding.

When we merge the initial LDA PR, @mengxr will be sure to include all of those who have participated as authors of Spark LDA PRs: @akopich @witgo @yinxusen @dlwh @EntilZha @jegonzal CC: @mengxr
@jkbradley, @mengxr, please include @IlyaKozlov as an author too. He's helped a lot with the implementation.
@akopich We'll make sure to do that. Thanks for letting us know. |
@IlyaKozlov Would you like your email included in the git commit for the initial LDA PR? If so, please let me (or @mengxr ) know ASAP. Thanks! |
@akopich How do you assign document ids?
Thanks.
@renchengchang |
@akopich Since this is no longer an active PR, could you please close it? It was very helpful to have this PR as a major basis for the initial LDA PR. If you do end up using the merged LDA or future versions which may be added, it would be great to get your input about further improvements, especially if they can be added incrementally. There are people actively working on online variational Bayes and on Gibbs sampling, which should have very different behavior from EM. |
I have implemented Probabilistic Latent Semantic Analysis (PLSA) and Robust PLSA with support for additive regularization (which effectively means I've implemented Latent Dirichlet Allocation too).
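For context, the EM iteration at the heart of plain PLSA fits in a few lines: the E-step computes p(t|d,w) ∝ phi(w|t)·theta(t|d), and the M-step re-normalizes the expected counts. This is a generic in-memory sketch for illustration, not the PR's distributed implementation:

```scala
// One EM pass of plain PLSA (sketch; no smoothing or regularization).
//   E-step: p(t | d, w) proportional to phi(w | t) * theta(t | d)
//   M-step: re-estimate phi and theta from the expected counts.
object PlsaSketch {
  // docs: per document, word id -> count.
  // phi: numTopics x vocabSize rows of p(w|t); theta: numDocs x numTopics rows of p(t|d).
  def emStep(
      docs: Array[Map[Int, Int]],
      phi: Array[Array[Double]],
      theta: Array[Array[Double]]
  ): (Array[Array[Double]], Array[Array[Double]]) = {
    val numTopics = phi.length
    val vocab = phi(0).length
    val phiNew = Array.fill(numTopics, vocab)(0.0)
    val thetaNew = Array.fill(docs.length, numTopics)(0.0)
    for (d <- docs.indices; (w, n) <- docs(d)) {
      val joint = Array.tabulate(numTopics)(t => phi(t)(w) * theta(d)(t))
      val z = joint.sum
      for (t <- 0 until numTopics) {
        val expected = n * joint(t) / z   // n(d,w) * p(t|d,w)
        phiNew(t)(w) += expected
        thetaNew(d)(t) += expected
      }
    }
    // Normalize each row back into a probability distribution.
    (phiNew.map(row => { val s = row.sum; row.map(_ / s) }),
     thetaNew.map(row => { val s = row.sum; row.map(_ / s) }))
  }
}
```

A production version would add smoothing for empty rows, and additive regularization would add per-distribution penalty terms to the M-step counts before normalizing.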