Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spark-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel #7388

Closed
wants to merge 16 commits into from

Conversation

hhbyyh
Copy link
Contributor

@hhbyyh hhbyyh commented Jul 14, 2015

jira: https://issues.apache.org/jira/browse/SPARK-9028

Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency.

I changed the meaning of minCount as a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn.

@SparkQA
Copy link

SparkQA commented Jul 14, 2015

Test build #37187 has finished for PR 7388 at commit 93e1ad4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CountVectorizer(override val uid: String)
    • class CountVectorizerModel(override val uid: String, val vocabulary: Array[String])
    • case class Least(children: Expression*) extends Expression
    • case class Greatest(children: Expression*) extends Expression


/**
* :: Experimental ::
* Converts a text document to a sparse vector of token counts.
* @param vocabulary An Array over terms. Only the terms in the vocabulary will be counted.
* Extracts a vocabulary from document collections and generates a CountVectorizerModel.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"[[CountVectorizerModel]]"

@feynmanliang
Copy link
Contributor

That's all for now!

@SparkQA
Copy link

SparkQA commented Jul 25, 2015

Test build #38415 has finished for PR 7388 at commit 589e93d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class CountVectorizer(override val uid: String)
    • class CountVectorizerModel(override val uid: String, val vocabulary: Array[String])

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Jul 25, 2015

@feynmanliang Thanks for helping review. Sent an update

@jkbradley
Copy link
Member

I'll make a pass now. Sorry for the long delay!

private[feature] trait CountVectorizerParams extends Params with HasInputCol with HasOutputCol {

/**
* size of the vocabulary.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"size" --> "Max size"

@jkbradley
Copy link
Member

Why did you remove minTermFreq? I think it's a useful way to make things sparser.

Also, could you change "minCount" to instead be "minDocFreq" (which has slightly different semantics, operating on doc frequency instead of word counts)? That will match the semantics in sklearn ("min_df"). If it's easier, I'm also OK with not providing this option for the initial version, and adding it for the next release instead.

@jkbradley
Copy link
Member

Those are my initial comments. I'll check back!

@jkbradley
Copy link
Member

One more comment: When you move a file, it helps to use git mv so that Github understands it's the same file (for the diff). No big deal here though.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Aug 14, 2015

Thanks for taking the time review.

About minTermFreq, I renamed it to minCount since I thought users may misunderstand minTermFreq as a double value (like 0.01, 0.05). Anyway, minDocFreq is a great idea. I'll use it to replace minTermFreq.

The hint about git mv is noted. Sorry for the extra effort during review.

@jkbradley
Copy link
Member

I actually agree with you about "freq" being misleading, but we decided to stick with it since it's used to mean "count" in many libraries. (This discussion was on some JIRA or PR, but I forget which one..)

No problem about git mv : )

@jkbradley
Copy link
Member

Actually I think I have time to help a little now. I'll send some updates soon.

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Aug 14, 2015

Please. Thanks a lot.

* Renamed "minCount" to "minTokenCount"
* Added "minTermFreq" back, including unit test
* Moved all Params to include in both Estimator and Model so that they can be viewed in either.
* Default: 1
* @group param
*/
val minTermFreq: IntParam = new IntParam(this, "minTermFreq",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should mention that this doesn't affect fitting.

@SparkQA
Copy link

SparkQA commented Aug 15, 2015

Test build #40940 has finished for PR 7388 at commit 17b3009.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

Talked with @mengxr and have slightly different plan for freq vs. count. We'll switch to sklearn's method and create new params call minTF and minDF as needed. I'll send a PR to update this one accordingly. I'll also include:

  • setMinTermFreq in the Estimator so that users can specify all parameters before fitting and have them passed to the model.
  • private var for broadcast

ETA 30 min

@jkbradley
Copy link
Member

ETA 9 more min

@jkbradley
Copy link
Member

waiting for tests to run before I send it...

@jkbradley
Copy link
Member

OK sent

docFreq, termFreq, broadcast update
@hhbyyh
Copy link
Contributor Author

hhbyyh commented Aug 17, 2015

Merged commit of @jkbradley from hhbyyh#4. Thanks Joseph, the code looks much better.

@jkbradley
Copy link
Member

Btw, I was wrong about "git mv". Github still doesn't do the diff properly if you make any changes to the file after moving it.

val minDf: Long = if ($(minDF) >= 1.0) {
$(minDF).toLong
} else {
math.ceil($(minDF) * input.cache().count()).toLong
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really like this extra cache, but I'd guess it's better than not caching.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's good...

@SparkQA
Copy link

SparkQA commented Aug 18, 2015

Test build #41070 has finished for PR 7388 at commit a5a8532.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

If this looks good to @hhbyyh , I'll wait for a final API check by @mengxr before merging this

val vocSize = $(vocabSize)
val input = dataset.select($(inputCol)).map(_.getAs[Seq[String]](0))
val minDf: Long = if ($(minDF) >= 1.0) {
$(minDF).toLong
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

math.ceil($(minDF)).toLong

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The update will "eat" the unexpected input of minDF from users, like 3.1 or 2.5. Will it be misleading?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a user set minDF to 5.5, the expected minDF should be 6. I don't think the user would expect anything else. Btw, a simpler implementation is to keep minDf as a Double:

val minDf = if ($(minDF) >= 1.0) {
  $(minDF)
} else {
  $(minDF) * input.cache().count()
}

@hhbyyh
Copy link
Contributor Author

hhbyyh commented Aug 18, 2015

@mengxr @jkbradley Sent an update addressing the comment.
Next, I plan to change the LDA example implementation if it's OK.

@jkbradley
Copy link
Member

Modifying the LDA example sounds good to me. It does mean using spark.ml classes in spark.mllib examples, but I don't really see a problem with that for examples. (I would not want to do that in the main code though.)

@SparkQA
Copy link

SparkQA commented Aug 18, 2015

Test build #41146 has finished for PR 7388 at commit a370816.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Copy link
Member

I'll merge this with master and branch-1.5. @hhbyyh Thanks a lot!

asfgit pushed a commit that referenced this pull request Aug 18, 2015
…ntVectorizerModel

jira: https://issues.apache.org/jira/browse/SPARK-9028

Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency.

I changed the meaning of minCount as a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn.

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7388 from hhbyyh/cvEstimator.

(cherry picked from commit 354f458)
Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
@asfgit asfgit closed this in 354f458 Aug 18, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants