Skip to content

Commit

Permalink
[SPARK-5563] [MLLIB] LDA with online variational inference
Browse files Browse the repository at this point in the history
JIRA: https://issues.apache.org/jira/browse/SPARK-5563
The PR contains the implementation for [Online LDA] (https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf) based on the research of  Matt Hoffman and David M. Blei, which provides an efficient option for LDA users. Major advantages for the algorithm are the stream compatibility and economic time/memory consumption due to the corpus split. For more details, please refer to the jira.

Online LDA can act as a fast option for LDA, and will be especially helpful for the users who needs a quick result or with large corpus.

 Correctness test.
I have tested current PR with https://github.com/Blei-Lab/onlineldavb and the results are identical. I've uploaded the result and code to https://github.com/hhbyyh/LDACrossValidation.

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes apache#4419 from hhbyyh/ldaonline and squashes the following commits:

1045eec [Yuhao Yang] Merge pull request #2 from jkbradley/hhbyyh-ldaonline2
cf376ff [Joseph K. Bradley] For private vars needed for testing, I made them private and added accessors.  Java doesn’t understand package-private tags, so this minimizes the issues Java users might encounter.
6149ca6 [Yuhao Yang] fix for setOptimizer
cf0007d [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
54cf8da [Yuhao Yang] some style change
68c2318 [Yuhao Yang] add a java ut
4041723 [Yuhao Yang] add ut
138bfed [Yuhao Yang] Merge pull request #1 from jkbradley/hhbyyh-ldaonline-update
9e910d9 [Joseph K. Bradley] small fix
61d60df [Joseph K. Bradley] Minor cleanups: * Update *Concentration parameter documentation * EM Optimizer: createVertices() does not need to be a function * OnlineLDAOptimizer: typos in doc * Clean up the core code for online LDA (Scala style)
a996a82 [Yuhao Yang] respond to comments
b1178cf [Yuhao Yang] fit into the optimizer framework
dbe3cff [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
15be071 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
b29193b [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
d19ef55 [Yuhao Yang] change OnlineLDA to class
97b9e1a [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
e7bf3b0 [Yuhao Yang] move to seperate file
f367cc9 [Yuhao Yang] change to optimization
8cb16a6 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
62405cc [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
02d0373 [Yuhao Yang] fix style in comment
f6d47ca [Yuhao Yang] Merge branch 'ldaonline' of https://github.com/hhbyyh/spark into ldaonline
d86cdec [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
a570c9a [Yuhao Yang] use sample to pick up batch
4a3f27e [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline
e271eb1 [Yuhao Yang] remove non ascii
581c623 [Yuhao Yang] seperate API and adjust batch split
37af91a [Yuhao Yang] iMerge remote-tracking branch 'upstream/master' into ldaonline
20328d1 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline i
aa365d1 [Yuhao Yang] merge upstream master
3a06526 [Yuhao Yang] merge with new example
0dd3947 [Yuhao Yang] kMerge remote-tracking branch 'upstream/master' into ldaonline
0d0f3ee [Yuhao Yang] replace random split with sliding
fa408a8 [Yuhao Yang] ssMerge remote-tracking branch 'upstream/master' into ldaonline
45884ab [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s
f41c5ca [Yuhao Yang] style fix
26dca1b [Yuhao Yang] style fix and make class private
043e786 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into ldaonline s Conflicts: 	mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
d640d9c [Yuhao Yang] online lda initial checkin
  • Loading branch information
hhbyyh authored and jkbradley committed May 4, 2015
1 parent 9646018 commit 3539cb7
Show file tree
Hide file tree
Showing 4 changed files with 438 additions and 74 deletions.
65 changes: 26 additions & 39 deletions mllib/src/main/scala/org/apache/spark/mllib/clustering/LDA.scala
Original file line number Diff line number Diff line change
Expand Up @@ -78,35 +78,29 @@ class LDA private (
*
* This is the parameter to a symmetric Dirichlet distribution.
*/
def getDocConcentration: Double = {
if (this.docConcentration == -1) {
(50.0 / k) + 1.0
} else {
this.docConcentration
}
}
def getDocConcentration: Double = this.docConcentration

/**
* Concentration parameter (commonly named "alpha") for the prior placed on documents'
* distributions over topics ("theta").
*
* This is the parameter to a symmetric Dirichlet distribution.
* This is the parameter to a symmetric Dirichlet distribution, where larger values
* mean more smoothing (more regularization).
*
* This value should be > 1.0, where larger values mean more smoothing (more regularization).
* If set to -1, then docConcentration is set automatically.
* (default = -1 = automatic)
*
* Automatic setting of parameter:
* - For EM: default = (50 / k) + 1.
* - The 50/k is common in LDA libraries.
* - The +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
*
* Note: The restriction > 1.0 may be relaxed in the future (allowing sparse solutions),
* but values in (0,1) are not yet supported.
* Optimizer-specific parameter settings:
* - EM
* - Value should be > 1.0
* - default = (50 / k) + 1, where 50/k is common in LDA libraries and +1 follows
* Asuncion et al. (2009), who recommend a +1 adjustment for EM.
* - Online
* - Value should be >= 0
* - default = (1.0 / k), following the implementation from
* [[https://github.com/Blei-Lab/onlineldavb]].
*/
def setDocConcentration(docConcentration: Double): this.type = {
require(docConcentration > 1.0 || docConcentration == -1.0,
s"LDA docConcentration must be > 1.0 (or -1 for auto), but was set to $docConcentration")
this.docConcentration = docConcentration
this
}
Expand All @@ -126,13 +120,7 @@ class LDA private (
* Note: The topics' distributions over terms are called "beta" in the original LDA paper
* by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
*/
def getTopicConcentration: Double = {
if (this.topicConcentration == -1) {
1.1
} else {
this.topicConcentration
}
}
def getTopicConcentration: Double = this.topicConcentration

/**
* Concentration parameter (commonly named "beta" or "eta") for the prior placed on topics'
Expand All @@ -143,21 +131,20 @@ class LDA private (
* Note: The topics' distributions over terms are called "beta" in the original LDA paper
* by Blei et al., but are called "phi" in many later papers such as Asuncion et al., 2009.
*
* This value should be > 0.0.
* If set to -1, then topicConcentration is set automatically.
* (default = -1 = automatic)
*
* Automatic setting of parameter:
* - For EM: default = 0.1 + 1.
* - The 0.1 gives a small amount of smoothing.
* - The +1 follows Asuncion et al. (2009), who recommend a +1 adjustment for EM.
*
* Note: The restriction > 1.0 may be relaxed in the future (allowing sparse solutions),
* but values in (0,1) are not yet supported.
* Optimizer-specific parameter settings:
* - EM
* - Value should be > 1.0
* - default = 0.1 + 1, where 0.1 gives a small amount of smoothing and +1 follows
* Asuncion et al. (2009), who recommend a +1 adjustment for EM.
* - Online
* - Value should be >= 0
* - default = (1.0 / k), following the implementation from
* [[https://github.com/Blei-Lab/onlineldavb]].
*/
def setTopicConcentration(topicConcentration: Double): this.type = {
require(topicConcentration > 1.0 || topicConcentration == -1.0,
s"LDA topicConcentration must be > 1.0 (or -1 for auto), but was set to $topicConcentration")
this.topicConcentration = topicConcentration
this
}
Expand Down Expand Up @@ -223,14 +210,15 @@ class LDA private (

/**
* Set the LDAOptimizer used to perform the actual calculation by algorithm name.
* Currently "em" is supported.
* Currently "em", "online" is supported.
*/
def setOptimizer(optimizerName: String): this.type = {
this.ldaOptimizer =
optimizerName.toLowerCase match {
case "em" => new EMLDAOptimizer
case "online" => new OnlineLDAOptimizer
case other =>
throw new IllegalArgumentException(s"Only em is supported but got $other.")
throw new IllegalArgumentException(s"Only em, online are supported but got $other.")
}
this
}
Expand All @@ -245,8 +233,7 @@ class LDA private (
* @return Inferred LDA model
*/
def run(documents: RDD[(Long, Vector)]): LDAModel = {
val state = ldaOptimizer.initialState(documents, k, getDocConcentration, getTopicConcentration,
seed, checkpointInterval)
val state = ldaOptimizer.initialize(documents, this)
var iter = 0
val iterationTimes = Array.fill[Double](maxIterations)(0)
while (iter < maxIterations) {
Expand Down
Loading

0 comments on commit 3539cb7

Please sign in to comment.