-
Notifications
You must be signed in to change notification settings - Fork 28.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-4894][mllib] Added Bernoulli option to NaiveBayes model in mllib #4087
Conversation
…model type parameter for training. When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html
Can one of the admins verify this patch? |
|
||
{ | ||
// Need to put an extra pair of braces to prevent Scala treating `i` as a member. | ||
def populateMatrix(arrayIn: Array[Array[Double]], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function seems excessive. Does the Breeze library support element-wise log/exp and addition/subtraction with matrices? If so, that would be cleaner and less verbose.
Thanks for the patch! A few comments:
Thanks! |
Thanks for the comments! The JIRA for the python API is: I will get the rest fixed tonight or tomorrow. |
ok to test |
Test build #25855 has started for PR 4087 at commit
|
…. Public api now has string instead of enumeration. Docs are updated."
Test build #25856 has started for PR 4087 at commit
|
Test build #25856 has finished for PR 4087 at commit
|
Test FAILed. |
Test build #25855 has finished for PR 4087 at commit
|
Test FAILed. |
Test build #25894 has started for PR 4087 at commit
|
@leahmcguire The updated patch looks great to me. :) |
Test build #25894 has finished for PR 4087 at commit
|
Test FAILed. |
Test build #26099 has started for PR 4087 at commit
|
Test build #26099 has finished for PR 4087 at commit
|
Test FAILed. |
val testRDD = sc.parallelize(testData, 2) | ||
testRDD.cache() | ||
|
||
val model = NaiveBayes.train(testRDD) | ||
val model = NaiveBayes.train(testRDD, 1.0, "Bernoulli") ///!!! this gives same result on both models check the math |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just wondering--- is the bug listed here still happening?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this was resolved before the commit. I just forgot to remove the comment
Test build #28010 has started for PR 4087 at commit
|
There have been a lot of changes, which must be causing the merge issues. Could you please fix them? (Sorry for the slow review; I'll try to review ASAP once it merges cleanly.) |
Test build #28010 has finished for PR 4087 at commit
|
Test FAILed. |
Test PASSed. |
@leahmcguire It looks like the unclean merge came from the PR earlier today for adding Python save/load. I think rebasing and fixing conflicts should be straightforward. I'll update the save/load for versions ASAP if you can fix the merge issues. Thanks very much! |
Test build #28936 has started for PR 4087 at commit
|
Test build #28936 has finished for PR 4087 at commit
|
Test PASSed. |
…ype parameter was added. Updated tests. Also updated ModelType enum-like type.
Added model save/load version to support NaiveBayes ModelType
Test build #29119 has started for PR 4087 at commit
|
Test build #29119 has finished for PR 4087 at commit
|
Test PASSed. |
So...that discussion on the mailing list about enum-like types just keeps going with no decision yet. Speaking with @mengxr , it might be best to support only String values for the model type instead of something nicer like enum. I don't want to make you change that, so would you mind if I sent 1 more PR to your PR? |
(I was about to merge this, but then this issue came up.) After that adjustment, it should be fine. (And feel free to make this change yourself, but I'm offering to do it since the dev list discussion keeps going back and forth.) |
Either version is fine. If you have time to make the change on tomorrow go On Wed, Mar 25, 2015 at 12:41 PM, jkbradley notifications@github.com
|
If you have time, I'd really appreciate it---thank you! We can eliminate the special enum-like types entirely and just use String. |
Test build #29336 has started for PR 4087 at commit
|
Test build #29336 has finished for PR 4087 at commit
|
Test PASSed. |
if (supportedModelTypes.contains(modelType)) { | ||
new NaiveBayes(lambda, modelType).run(input) | ||
} else { | ||
throw new UnknownError(s"NaiveBayes was created with an unknown ModelType: $modelType") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please use require? Since this is an entry point, the parameter check should throw an IllegalArgumentException (which require does). Elsewhere, in the internals, we can throw UnknownErrors since those errors should never actually happen.
require(supportedModelTypes.contains(modelType), s"NaiveBayes was created with an unknown ModelType: $modelType")
@leahmcguire Thanks for updating the enum type. I just made 2 tiny comments; other than that, it looks fine. |
Test build #29438 has started for PR 4087 at commit |
Test build #29438 has finished for PR 4087 at commit
|
Test PASSed. |
LGTM Thanks very much for bearing with the issues in getting this in! Merging into master |
Added optional model type parameter for NaiveBayes training. Can be either Multinomial or Bernoulli.
When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction as per: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html.
Default for model is original Multinomial fit and predict.
Added additional testing for Bernoulli and Multinomial models.