-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Spark-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel #7388
Conversation
Test build #37187 has finished for PR 7388 at commit
|
|
||
/** | ||
* :: Experimental :: | ||
* Converts a text document to a sparse vector of token counts. | ||
* @param vocabulary An Array over terms. Only the terms in the vocabulary will be counted. | ||
* Extracts a vocabulary from document collections and generates a CountVectorizerModel. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"[[CountVectorizerModel]]"
That's all for now! |
Test build #38415 has finished for PR 7388 at commit
|
@feynmanliang Thanks for helping review. Sent an update |
I'll make a pass now. Sorry for the long delay! |
private[feature] trait CountVectorizerParams extends Params with HasInputCol with HasOutputCol { | ||
|
||
/** | ||
* size of the vocabulary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"size" --> "Max size"
Why did you remove minTermFreq? I think it's a useful way to make things sparser. Also, could you change "minCount" to instead be "minDocFreq" (which has slightly different semantics, operating on doc frequency instead of word counts)? That will match the semantics in sklearn ("min_df"). If it's easier, I'm also OK with not providing this option for the initial version, and adding it for the next release instead. |
Those are my initial comments. I'll check back! |
One more comment: When you move a file, it helps to use git mv so that Github understands it's the same file (for the diff). No big deal here though. |
Thanks for taking the time review. About minTermFreq, I renamed it to minCount since I thought users may misunderstand minTermFreq as a double value (like 0.01, 0.05). Anyway, minDocFreq is a great idea. I'll use it to replace minTermFreq. The hint about git mv is noted. Sorry for the extra effort during review. |
I actually agree with you about "freq" being misleading, but we decided to stick with it since it's used to mean "count" in many libraries. (This discussion was on some JIRA or PR, but I forget which one..) No problem about git mv : ) |
Actually I think I have time to help a little now. I'll send some updates soon. |
Please. Thanks a lot. |
* Default: 1 | ||
* @group param | ||
*/ | ||
val minTermFreq: IntParam = new IntParam(this, "minTermFreq", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should mention that this doesn't affect fitting.
Test build #40940 has finished for PR 7388 at commit
|
Talked with @mengxr and have slightly different plan for freq vs. count. We'll switch to sklearn's method and create new params call minTF and minDF as needed. I'll send a PR to update this one accordingly. I'll also include:
ETA 30 min |
ETA 9 more min |
waiting for tests to run before I send it... |
…save broadcast as private var
OK sent |
docFreq, termFreq, broadcast update
Merged commit of @jkbradley from hhbyyh#4. Thanks Joseph, the code looks much better. |
Btw, I was wrong about "git mv". Github still doesn't do the diff properly if you make any changes to the file after moving it. |
val minDf: Long = if ($(minDF) >= 1.0) { | ||
$(minDF).toLong | ||
} else { | ||
math.ceil($(minDF) * input.cache().count()).toLong |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really like this extra cache, but I'd guess it's better than not caching.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's good...
Test build #41070 has finished for PR 7388 at commit
|
val vocSize = $(vocabSize) | ||
val input = dataset.select($(inputCol)).map(_.getAs[Seq[String]](0)) | ||
val minDf: Long = if ($(minDF) >= 1.0) { | ||
$(minDF).toLong |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
math.ceil($(minDF)).toLong
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The update will "eat" the unexpected input of minDF from users, like 3.1 or 2.5. Will it be misleading?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a user set minDF
to 5.5
, the expected minDF
should be 6
. I don't think the user would expect anything else. Btw, a simpler implementation is to keep minDf
as a Double
:
val minDf = if ($(minDF) >= 1.0) {
$(minDF)
} else {
$(minDF) * input.cache().count()
}
@mengxr @jkbradley Sent an update addressing the comment. |
Modifying the LDA example sounds good to me. It does mean using spark.ml classes in spark.mllib examples, but I don't really see a problem with that for examples. (I would not want to do that in the main code though.) |
Test build #41146 has finished for PR 7388 at commit
|
I'll merge this with master and branch-1.5. @hhbyyh Thanks a lot! |
…ntVectorizerModel jira: https://issues.apache.org/jira/browse/SPARK-9028 Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency. I changed the meaning of minCount as a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn. Author: Yuhao Yang <hhbyyh@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #7388 from hhbyyh/cvEstimator. (cherry picked from commit 354f458) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
jira: https://issues.apache.org/jira/browse/SPARK-9028
Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency.
I changed the meaning of minCount as a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn.