[Spark-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel #7388

hhbyyh · 2015-07-14T03:20:34Z

jira: https://issues.apache.org/jira/browse/SPARK-9028

Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency.

I changed the meaning of minCount so that it acts as a filter across the whole corpus. This aligns with Word2Vec and the similar parameter in scikit-learn.
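
To illustrate the behaviour described above, here is a minimal sketch using plain Scala collections rather than Spark. The names `VocabularySketch` and `buildVocabulary`, and the tie-breaking rule, are assumptions made for illustration and are not taken from the patch. Terms are counted across the whole corpus, terms below `minCount` are dropped, and the most frequent remaining terms are kept.

```scala
object VocabularySketch {
  /** Hypothetical helper: selects up to vocabSize terms whose corpus-wide count is >= minCount. */
  def buildVocabulary(docs: Seq[Seq[String]], vocabSize: Int, minCount: Int): Array[String] = {
    // Count every occurrence of every term across all documents.
    val counts: Map[String, Int] =
      docs.flatten.groupBy(identity).map { case (term, occ) => term -> occ.size }
    counts.toSeq
      .filter { case (_, c) => c >= minCount }
      .sortBy { case (term, c) => (-c, term) } // most frequent first, ties broken alphabetically
      .take(vocabSize)
      .map(_._1)
      .toArray
  }
}
```

With the documents `Seq("a", "b", "a")`, `Seq("a", "b", "c")` and `Seq("d")`, a `minCount` of 2 keeps only `a` and `b`, since `c` and `d` each occur once in the corpus.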

SparkQA · 2015-07-14T03:58:36Z

Test build #37187 has finished for PR 7388 at commit 93e1ad4.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class CountVectorizer(override val uid: String)
- class CountVectorizerModel(override val uid: String, val vocabulary: Array[String])
- case class Least(children: Expression*) extends Expression
- case class Greatest(children: Expression*) extends Expression

feynmanliang · 2015-07-15T22:31:36Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizerModel.scala


 /**
 * :: Experimental ::
- * Converts a text document to a sparse vector of token counts.
- * @param vocabulary An Array over terms. Only the terms in the vocabulary will be counted.
+ * Extracts a vocabulary from document collections and generates a CountVectorizerModel.


"[[CountVectorizerModel]]"

feynmanliang · 2015-07-15T22:40:33Z

That's all for now!

SparkQA · 2015-07-25T04:18:40Z

Test build #38415 has finished for PR 7388 at commit 589e93d.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class CountVectorizer(override val uid: String)
- class CountVectorizerModel(override val uid: String, val vocabulary: Array[String])

hhbyyh · 2015-07-25T04:53:59Z

@feynmanliang Thanks for helping review. Sent an update

jkbradley · 2015-08-13T01:37:33Z

I'll make a pass now. Sorry for the long delay!

jkbradley · 2015-08-13T01:56:08Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizerModel.scala

+private[feature] trait CountVectorizerParams extends Params with HasInputCol with HasOutputCol {
+
+  /**
+   * size of the vocabulary.


"size" --> "Max size"

jkbradley · 2015-08-13T01:56:53Z

Why did you remove minTermFreq? I think it's a useful way to make things sparser.

Also, could you change "minCount" to instead be "minDocFreq" (which has slightly different semantics, operating on doc frequency instead of word counts)? That will match the semantics in sklearn ("min_df"). If it's easier, I'm also OK with not providing this option for the initial version, and adding it for the next release instead.
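
The difference between the two semantics can be sketched with plain Scala collections. `FrequencySketch`, `tokenCounts` and `docFreqs` are hypothetical names used only for this illustration: a token count sums every occurrence across the corpus, while a document frequency counts each document at most once per term.

```scala
object FrequencySketch {
  /** Total number of occurrences of each term across all documents. */
  def tokenCounts(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatten.groupBy(identity).map { case (t, occ) => t -> occ.size }

  /** Number of documents that contain each term at least once. */
  def docFreqs(docs: Seq[Seq[String]]): Map[String, Int] =
    docs.flatMap(_.distinct).groupBy(identity).map { case (t, occ) => t -> occ.size }
}
```

For the documents `Seq("a", "a", "a")`, `Seq("b")` and `Seq("b")`, the term `a` has the higher token count (3 against 2) but the lower document frequency (1 against 2), so a threshold of 2 keeps `a` under count semantics and drops it under document-frequency semantics.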

jkbradley · 2015-08-13T01:57:11Z

Those are my initial comments. I'll check back!

jkbradley · 2015-08-13T01:58:36Z

One more comment: When you move a file, it helps to use git mv so that GitHub understands it's the same file (for the diff). No big deal here though.

hhbyyh · 2015-08-14T03:19:20Z

Thanks for taking the time to review.

About minTermFreq, I renamed it to minCount since I thought users may misunderstand minTermFreq as a double value (like 0.01, 0.05). Anyway, minDocFreq is a great idea. I'll use it to replace minTermFreq.

The hint about git mv is noted. Sorry for the extra effort during review.

jkbradley · 2015-08-14T04:29:24Z

I actually agree with you about "freq" being misleading, but we decided to stick with it since it's used to mean "count" in many libraries. (This discussion was on some JIRA or PR, but I forget which one.)

No problem about git mv : )

jkbradley · 2015-08-14T04:33:33Z

Actually I think I have time to help a little now. I'll send some updates soon.

hhbyyh · 2015-08-14T05:52:17Z

Please. Thanks a lot.

* Renamed "minCount" to "minTokenCount"
* Added "minTermFreq" back, including unit test
* Moved all Params to include in both Estimator and Model so that they can be viewed in either.

mengxr · 2015-08-14T15:54:50Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

+   * Default: 1
+   * @group param
+   */
+  val minTermFreq: IntParam = new IntParam(this, "minTermFreq",


Should mention that this doesn't affect fitting.
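
A minimal sketch of why this parameter leaves fitting untouched, assuming `minTermFreq` is a per-document threshold on a term's count as the comment implies. `TransformSketch` and `countVector` are hypothetical names, and the sparse vector is represented as a plain `Map` from vocabulary index to count rather than a Spark vector. The vocabulary is an input here, so changing `minTermFreq` only changes which counts survive in the output.

```scala
object TransformSketch {
  /** Hypothetical transform step: counts in-vocabulary terms of one document,
    * then drops entries whose count is below minTermFreq. */
  def countVector(doc: Seq[String], vocabulary: Array[String], minTermFreq: Int): Map[Int, Int] = {
    val index: Map[String, Int] = vocabulary.zipWithIndex.toMap
    doc.flatMap(t => index.get(t)) // keep only in-vocabulary terms, as indices
      .groupBy(identity)
      .map { case (i, occ) => i -> occ.size }
      .filter { case (_, c) => c >= minTermFreq }
  }
}
```

Calling `countVector` twice with the same vocabulary and different thresholds produces different output vectors from the same fitted vocabulary.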

SparkQA · 2015-08-15T02:45:11Z

Test build #40940 has finished for PR 7388 at commit 17b3009.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-08-17T22:26:09Z

Talked with @mengxr and we have a slightly different plan for freq vs. count. We'll switch to sklearn's method and create new params called minTF and minDF as needed. I'll send a PR to update this one accordingly. I'll also include:

- setMinTermFreq in the Estimator so that users can specify all parameters before fitting and have them passed to the model.
- private var for broadcast

ETA 30 min

jkbradley · 2015-08-17T23:10:29Z

ETA 9 more min

jkbradley · 2015-08-17T23:19:03Z

waiting for tests to run before I send it...

jkbradley · 2015-08-17T23:30:06Z

OK sent

docFreq, termFreq, broadcast update

hhbyyh · 2015-08-17T23:43:57Z

Merged commit of @jkbradley from hhbyyh#4. Thanks Joseph, the code looks much better.

jkbradley · 2015-08-17T23:51:48Z

Btw, I was wrong about "git mv". GitHub still doesn't do the diff properly if you make any changes to the file after moving it.

jkbradley · 2015-08-17T23:53:25Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

+    val minDf: Long = if ($(minDF) >= 1.0) {
+      $(minDF).toLong
+    } else {
+      math.ceil($(minDF) * input.cache().count()).toLong


I don't really like this extra cache, but I'd guess it's better than not caching.

I think it's good...

SparkQA · 2015-08-18T00:23:43Z

Test build #41070 has finished for PR 7388 at commit a5a8532.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-08-18T00:27:13Z

If this looks good to @hhbyyh , I'll wait for a final API check by @mengxr before merging this

mengxr · 2015-08-18T00:50:30Z

mllib/src/main/scala/org/apache/spark/ml/feature/CountVectorizer.scala

+    val vocSize = $(vocabSize)
+    val input = dataset.select($(inputCol)).map(_.getAs[Seq[String]](0))
+    val minDf: Long = if ($(minDF) >= 1.0) {
+      $(minDF).toLong


math.ceil($(minDF)).toLong

The update will "eat" the unexpected input of minDF from users, like 3.1 or 2.5. Will it be misleading?

If a user set minDF to 5.5, the expected minDF should be 6. I don't think the user would expect anything else. Btw, a simpler implementation is to keep minDf as a Double:

val minDf = if ($(minDF) >= 1.0) { $(minDF) } else { $(minDF) * input.cache().count() }
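
The claim that 5.5 behaves like 6 follows because document frequencies are integers: `df >= 5.5` holds exactly when `df >= 6`. The sketch below shows this with plain Scala and no Spark dependency. `MinDFSketch`, `resolveMinDF` and `filterByDocFreq` are hypothetical names, and `numDocs` stands in for `input.cache().count()` from the diff.

```scala
object MinDFSketch {
  /** Resolves minDF to a threshold: values >= 1.0 are absolute document counts,
    * values below 1.0 are a fraction of the number of documents. */
  def resolveMinDF(minDF: Double, numDocs: Long): Double =
    if (minDF >= 1.0) minDF else minDF * numDocs

  /** Keeps terms whose (integer) document frequency meets the resolved threshold. */
  def filterByDocFreq(docFreqs: Map[String, Long], minDF: Double, numDocs: Long): Set[String] = {
    val threshold = resolveMinDF(minDF, numDocs)
    docFreqs.collect { case (term, df) if df >= threshold => term }.toSet
  }
}
```

Keeping the threshold as a `Double` avoids the explicit `math.ceil(...).toLong` conversion while selecting the same terms, because the comparison against an integer frequency rounds up implicitly.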

hhbyyh · 2015-08-18T17:04:58Z

@mengxr @jkbradley Sent an update addressing the comment.
Next, I plan to change the LDA example implementation if it's OK.

jkbradley · 2015-08-18T17:40:21Z

Modifying the LDA example sounds good to me. It does mean using spark.ml classes in spark.mllib examples, but I don't really see a problem with that for examples. (I would not want to do that in the main code though.)

SparkQA · 2015-08-18T17:52:20Z

Test build #41146 has finished for PR 7388 at commit a370816.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2015-08-18T17:58:56Z

I'll merge this with master and branch-1.5. @hhbyyh Thanks a lot!

…ntVectorizerModel

jira: https://issues.apache.org/jira/browse/SPARK-9028

Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency.

I changed the meaning of minCount as a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn.

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7388 from hhbyyh/cvEstimator.

(cherry picked from commit 354f458)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>

hhbyyh added 3 commits July 13, 2015 18:30

count vectorizer estimator

3193c77

Merge remote-tracking branch 'upstream/master' into cvEstimator

1f919dc

add more ut for estimator

93e1ad4

feynmanliang reviewed Jul 15, 2015

hhbyyh added 3 commits July 17, 2015 16:51

Merge remote-tracking branch 'upstream/master' into cvEstimator

dc4bd27

Merge remote-tracking branch 'upstream/master' into cvEstimator

099952b

minor fix

589e93d

jkbradley reviewed Aug 13, 2015

hhbyyh added 2 commits August 14, 2015 10:09

Merge remote-tracking branch 'upstream/master' into cvEstimator

cf6d591

Merge branch 'cvEstimator' of https://github.com/hhbyyh/spark into cv…

458d297

…Estimator

Updates:

0fe9f96

* Renamed "minCount" to "minTokenCount"
* Added "minTermFreq" back, including unit test
* Moved all Params to include in both Estimator and Model so that they can be viewed in either.

hhbyyh mentioned this pull request Aug 14, 2015

Updated CountVectorizer hhbyyh/spark#3

Merged

mengxr reviewed Aug 14, 2015

replace minTokenCount with minDocFreq

17b3009

renamed docFreq to DF, termFreq to TF, and added fractional support. …

a9a9485

…save broadcast as private var

Merge pull request #4 from jkbradley/cntvec-update

a5a8532

docFreq, termFreq, broadcast update

jkbradley reviewed Aug 17, 2015

mengxr reviewed Aug 18, 2015

hhbyyh added 3 commits August 19, 2015 00:28

Merge remote-tracking branch 'upstream/master' into cvEstimator

c2b936b

Merge branch 'cvEstimator' of https://github.com/hhbyyh/spark into cv…

8a9971a

…Estimator

use minDF as Double

a370816

asfgit closed this in 354f458 Aug 18, 2015
