- Recall the regularized loss in MAP: we can convert back to probabilities by taking the exponential of the negative loss:
  - $\prod_{i=1}^{N} P(t_i \mid x_i)\,\exp\!\left(-\frac{\lambda}{2}\tilde{w}^T\tilde{w}\right)$
- The factor $\exp\!\left(-\frac{\lambda}{2}\tilde{w}^T\tilde{w}\right)$ is proportional to a Gaussian probability density function (PDF): we can write it as $p(\tilde{w})$.
- Minimizing the regularized loss is therefore equivalent to maximizing: $\prod_{i=1}^{N} P(t_i \mid x_i)\,p(\tilde{w})$
- Take a look back at Bayes' rule: $p(\tilde{w} \mid \text{data}) \propto p(\text{data} \mid \tilde{w})\,p(\tilde{w})$
  - prior $p(\tilde{w})$: how likely $\tilde{w}$ is before observing the data.
  - likelihood $p(\text{data} \mid \tilde{w})$: how likely the data set is if the model parameter is $\tilde{w}$.
  - posterior $p(\tilde{w} \mid \text{data})$: how likely $\tilde{w}$ is after observing the data set.
- Estimating $\tilde{w}$ (learning the model) by maximizing the posterior distribution is called maximum a posteriori (MAP) estimation.
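- To make the equivalence explicit, here is the intermediate step written out (a sketch; the constant absorbs the normalization of the Gaussian prior):

$$
\log\!\left(\prod_{i=1}^{N} P(t_i \mid x_i)\,p(\tilde{w})\right)
= \sum_{i=1}^{N} \log P(t_i \mid x_i) - \frac{\lambda}{2}\tilde{w}^T\tilde{w} + \text{const},
$$

so maximizing the posterior over $\tilde{w}$ is the same as minimizing the regularized negative log-likelihood from before.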
- Example: is a movie review positive (+) or negative (-)?
- Unbelievably disappointing.
- This is the greatest screwball comedy ever filmed.
- It was pathetic. The worst part about it was the boxing scenes.
- Input:
  - a document $d$
  - a fixed set of classes $C = \{c_1, c_2, \ldots, c_J\}$
  - a training set of $N$ hand-labeled documents $(d_1, c_1), \ldots, (d_N, c_N)$
- Output:
  - a learned classifier $\gamma: d \to c$
- Idea: Represent a text document as a feature vector in order to use machine learning methods.
- vocabulary: the set of all different feature words that occur in the training set, together with a count of how often each occurs.
- ignore word order ("bag of words")
- assume word occurrences are independent given the class (the "naive" Bayes assumption), even though in reality "hello" tends to be followed by "world".
- Documents:
- D1: “Unbelievably disappointing.”
- D2: “This is the greatest screwball comedy ever filmed.”
- D3: “It was pathetic. The worst part about it was the boxing scenes.”
- D4: “Greatest film ever.”
- Vocabulary
- V = {disappointing: 1, greatest: 2, pathetic: 1, worst: 1}
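- A minimal sketch of this feature extraction in Python, using the four documents above; the helper name `bag_of_words` and the tokenization are illustrative assumptions, not taken from the slides:

```python
from collections import Counter
import re

# Feature words from the vocabulary V above.
FEATURE_WORDS = ["disappointing", "greatest", "pathetic", "worst"]

docs = [
    "Unbelievably disappointing.",
    "This is the greatest screwball comedy ever filmed.",
    "It was pathetic. The worst part about it was the boxing scenes.",
    "Greatest film ever.",
]

def bag_of_words(doc):
    """Count how often each feature word occurs in the document (word order is ignored)."""
    counts = Counter(re.findall(r"[a-z]+", doc.lower()))
    return [counts[w] for w in FEATURE_WORDS]

# Vocabulary counts over the whole training set: V = {disappointing: 1, greatest: 2, ...}
total = Counter()
for d in docs:
    total.update(w for w in re.findall(r"[a-z]+", d.lower()) if w in FEATURE_WORDS)
print(total)                   # Counter({'greatest': 2, 'disappointing': 1, 'pathetic': 1, 'worst': 1})
print(bag_of_words(docs[1]))   # [0, 1, 0, 0]
```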
- (Training) Dataset
  - 12 samples, 2 different classes: +, -.
  - 2 features: color, geometrical shape.
- Denote
  - let $c_j \in \{+, -\}$ be the class labels.
  - let $x_j = [x_{j1}, x_{j2}]$ be the 2-dimensional feature vectors, with $x_{j1} \in \{\text{blue, green, red, yellow}\}$ (color) and $x_{j2} \in \{\text{circle, square}\}$ (shape).
- New Sample
  - features: a (color, shape) pair $x = [x_1, x_2]$
  - class? (ground truth: +)
- decision rule (MAP)
  - $\hat{c} = \arg\max_{c \in \{+,-\}} P(c \mid x) = \arg\max_{c \in \{+,-\}} \frac{P(x \mid c)\,P(c)}{P(x)}$
- computing the terms
  - (prior) $P(+)$ and $P(-)$: the fraction of the 12 training samples in each class
  - (likelihood, +) $P(x \mid +) = P(x_1 \mid +)\,P(x_2 \mid +)$ (i.i.d.)
  - (likelihood, -) $P(x \mid -) = P(x_1 \mid -)\,P(x_2 \mid -)$ (i.i.d.)
  - (posterior, +) $P(+ \mid x) = \frac{P(x \mid +)\,P(+)}{P(x)}$
  - (posterior, -) $P(- \mid x) = \frac{P(x \mid -)\,P(-)}{P(x)}$
- on dropping the denominator
  - $P(x)$ is called the evidence
  - it is the same for both classes, so it has no effect on the final result
- classification
  - the posterior for + is larger than the posterior for -, so the sample is classified as +.
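- The counting steps above can be sketched in Python. The 12 training samples are not reproduced in these notes, so the tiny dataset in the usage example below is purely illustrative, not the one from the slides:

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples):
    """samples: list of ((color, shape), label) pairs. Returns priors and per-feature counts."""
    n = len(samples)
    class_counts = Counter(label for _, label in samples)
    # feature_counts[label][feature_index][value] = how often value occurs with label
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for features, label in samples:
        for i, value in enumerate(features):
            feature_counts[label][i][value] += 1
    priors = {c: class_counts[c] / n for c in class_counts}
    return priors, feature_counts, class_counts

def posterior_scores(x, priors, feature_counts, class_counts):
    """Unnormalized posteriors P(x|c) * P(c); the evidence P(x) is dropped."""
    scores = {}
    for c, prior in priors.items():
        likelihood = 1.0
        for i, value in enumerate(x):
            likelihood *= feature_counts[c][i][value] / class_counts[c]  # P(x_i | c), i.i.d.
        scores[c] = likelihood * prior
    return scores

# Illustrative data only (not the 12 samples from the slides).
samples = [
    (("blue", "square"), "+"), (("blue", "circle"), "+"), (("green", "square"), "+"),
    (("red", "circle"), "-"), (("red", "square"), "-"), (("yellow", "circle"), "-"),
]
priors, feature_counts, class_counts = train_naive_bayes(samples)
scores = posterior_scores(("blue", "square"), priors, feature_counts, class_counts)
print(max(scores, key=scores.get))  # decision rule: pick the class with the larger score
```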
- New Sample
  - features: one feature value that never occurs together with one of the classes in the training set
  - likelihood: the corresponding count is zero, so the whole likelihood (and hence the posterior) is zero. Is that reasonable?
- Laplace (add-1) smoothing
  - $\hat{P}(x_i \mid c) = \frac{\mathrm{count}(x_i, c) + 1}{\sum_{x \in V}\left(\mathrm{count}(x, c) + 1\right)} = \frac{\mathrm{count}(x_i, c) + 1}{\left(\sum_{x \in V}\mathrm{count}(x, c)\right) + |V|}$
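- A minimal sketch of the add-1 estimate in Python; the function name and argument names are just for illustration:

```python
def laplace_likelihood(count_xi_c, counts_c, vocab_size):
    """Add-1 smoothed estimate of P(x_i | c).

    count_xi_c : count(x_i, c), how often feature value x_i occurs with class c
    counts_c   : sum over the vocabulary of count(x, c) for class c
    vocab_size : |V|, the number of distinct feature values
    """
    return (count_xi_c + 1) / (counts_c + vocab_size)

# A feature value never seen with class c no longer gets probability zero:
print(laplace_likelihood(0, 10, 4))  # 1/14 instead of 0
```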
- Naive Bayes and Language Modeling: the full pipeline
  - (training set) feature extraction (bag of words)
  - estimate the prior for each class
  - estimate the likelihoods (i.i.d. features, Laplace smoothing)
  - drop the evidence term
  - compute the posterior for each class
  - apply the decision rule (see the end-to-end sketch below)
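- An end-to-end sketch of this pipeline in Python on the movie-review documents above. The +/- labels attached to the documents are my assumption (negative for D1 and D3, positive for D2 and D4); they are not stated in the notes:

```python
import re
from collections import Counter

# Training documents from the example above; labels are assumed for illustration.
train = [
    ("Unbelievably disappointing.", "-"),
    ("This is the greatest screwball comedy ever filmed.", "+"),
    ("It was pathetic. The worst part about it was the boxing scenes.", "-"),
    ("Greatest film ever.", "+"),
]
VOCAB = ["disappointing", "greatest", "pathetic", "worst"]

def tokens(doc):
    """Bag-of-words tokenization restricted to the feature vocabulary."""
    return [w for w in re.findall(r"[a-z]+", doc.lower()) if w in VOCAB]

# 1) priors and per-class word counts (feature extraction)
labels = [c for _, c in train]
priors = {c: labels.count(c) / len(train) for c in set(labels)}
word_counts = {c: Counter() for c in priors}
for doc, c in train:
    word_counts[c].update(tokens(doc))

# 2) add-1 smoothed likelihood P(w | c)
def likelihood(w, c):
    return (word_counts[c][w] + 1) / (sum(word_counts[c].values()) + len(VOCAB))

# 3) score a new document: P(c) * prod_w P(w | c), with the evidence dropped
def classify(doc):
    scores = {}
    for c in priors:
        score = priors[c]
        for w in tokens(doc):
            score *= likelihood(w, c)
        scores[c] = score
    return max(scores, key=scores.get)  # 4) decision rule: largest posterior score

print(classify("the worst, most disappointing film ever"))  # expected: "-"
```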