
Naive Bayes and Text Classification

Review of A1

Bayes Rule in A1 (from Lec04)

  • Recall E in MAP:
    • $E(\tilde{w}) = L(\tilde{w}) + \frac{\lambda}{2}\tilde{w}^T\tilde{w}$
  • convert E back to probabilities by taking $\exp(-E)$:
    • $\exp(-E(\tilde{w})) = \exp(-L(\tilde{w}) - \frac{\lambda}{2}\tilde{w}^T\tilde{w})$
      $= \exp(-L(\tilde{w}))\,\exp(-\frac{\lambda}{2}\tilde{w}^T\tilde{w})$
      $= \prod_{i=1}^{N} P(t_i|x_i)\,\exp(-\frac{\lambda}{2}\tilde{w}^T\tilde{w})$

Bayes Rule in A1 (from Lec04)

  • $\exp(-\frac{\lambda}{2}\tilde{w}^T\tilde{w})$ is proportional to a Gaussian probability density function (PDF):
  • We can write this as $p(\tilde{w}) \propto \exp(-\frac{\lambda}{2}\tilde{w}^T\tilde{w})$
  • Minimizing E is equivalent to maximizing:
    • $\prod_{i=1}^{N} P(t_i|x_i)\, p(\tilde{w})$
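  • Connecting this to Bayes' rule (a brief restatement; D denotes the training data, and $\tilde{w}$ plays the role of θ on the next slide):
    • $\prod_{i=1}^{N} P(t_i|x_i)\, p(\tilde{w}) \propto p(D|\tilde{w})\, p(\tilde{w}) \propto p(\tilde{w}|D)$
    • so the minimizer of the regularized E is exactly the MAP estimate of $\tilde{w}$.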

Bayes Rule in A1 (from Lec04)

  • Take a look back at Bayes' rule:
    • $p(\theta|D) = \frac{p(D|\theta)\,p(\theta)}{p(D)} \propto p(D|\theta)\,p(\theta)$
  • prior $p(\theta)$: how likely θ is before observing the data.
  • likelihood $p(D|\theta)$: how likely the data set D is if the model parameter is θ.
  • posterior $p(\theta|D)$: how likely θ is after observing the data set D.
  • Estimating θ (learning the model) by maximizing the posterior distribution is called maximum a posteriori (MAP) estimation.
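  • As a compact formula (restating the definition above):
    • $\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta|D) = \arg\max_{\theta} p(D|\theta)\,p(\theta)$
    • the evidence $p(D)$ does not depend on θ, so it can be dropped from the maximization.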

The Problem of Text Classification

Positive or negative movie review? [Dan Jurafsky]

  • Unbelievably disappointing.
  • This is the greatest screwball comedy ever filmed.
  • It was pathetic. The worst part about it was the boxing scenes.

Classification Methods: Supervised Machine Learning

  • Input:
    • a document d.
    • a fixed set of classes $C = \{c_1, c_2, \ldots, c_j\}$
    • a training set of m hand-labeled documents $(d_1, c_1), \ldots, (d_m, c_m)$
  • Output:
    • a learned classifier $\gamma: d \rightarrow c$
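
A minimal sketch of this input/output contract in Python (the names Document, Label, and train are illustrative, and the returned classifier is only a majority-label baseline, not yet naive Bayes):

#+BEGIN_SRC python
# Sketch of the supervised interface: train on (d_i, c_i) pairs and
# return a classifier gamma: d -> c. The learner here is a trivial
# majority-label baseline, used only to show the shape of the API.
from collections import Counter
from typing import Callable, List, Tuple

Document = str
Label = str

def train(labeled_docs: List[Tuple[Document, Label]]) -> Callable[[Document], Label]:
    majority = Counter(c for _, c in labeled_docs).most_common(1)[0][0]
    return lambda d: majority

# gamma = train([("Greatest film ever.", "+"),
#                ("This is the greatest screwball comedy ever filmed.", "+"),
#                ("Unbelievably disappointing.", "-")])
# gamma("It was pathetic.")  # -> '+' (majority baseline, ignores the document)
#+END_SRC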

The Bag of words model

  • Idea: Represent a text document as a feature vector in order to use machine learning methods.
  • vocabulary: the set of all distinct feature words that occur in the training set, together with a count of how often each one occurs.
    • word order is ignored
    • word occurrences are treated as independent of each other (the naive Bayes assumption), even though in reality "hello" tends to be followed by "world"

Example of Bag of words model

  • Documents:
    • D1: “Unbelievably disappointing.”
    • D2: “This is the greatest screwball comedy ever filmed.”
    • D3: “It was pathetic. The worst part about it was the boxing scenes.”
    • D4: “Greatest film ever.”
  • Vocabulary
    • V = {disappointing: 1, greatest: 2, pathetic: 1, worst: 1}
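
A small Python sketch of this counting step (the slide's vocabulary keeps only the four sentiment-bearing words, so the sketch filters the full token counts down to those; an unrestricted bag of words would count every token):

#+BEGIN_SRC python
# Count word occurrences over the four example documents, then restrict
# to the four sentiment words to reproduce the slide's V.
from collections import Counter
import re

docs = [
    "Unbelievably disappointing.",
    "This is the greatest screwball comedy ever filmed.",
    "It was pathetic. The worst part about it was the boxing scenes.",
    "Greatest film ever.",
]

tokens = [w for d in docs for w in re.findall(r"[a-z]+", d.lower())]
counts = Counter(tokens)

keywords = ["disappointing", "greatest", "pathetic", "worst"]
V = {w: counts[w] for w in keywords}
print(V)  # {'disappointing': 1, 'greatest': 2, 'pathetic': 1, 'worst': 1}
#+END_SRC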

Naive Bayes Classifier

A Toy Example [Sebastian Raschka]

./images/toy_dataset_1.png

  • (Training) Dataset
    • 12 samples, 2 different classes +,-.
    • 2 features: color, geometrical shape.
  • Denote
    • $c_j$: the class label, $c_j = +$ for + and $c_j = -$ for -.
    • $x_j$: the 2-dimensional feature vector $x_j = [x_{j1}, x_{j2}]$, where $x_{j1} \in \{blue, green, red, yellow\}$ and $x_{j2} \in \{circle, square\}$

Classify a new sample

./images/toy_dataset_2.png

  • New Sample
    • features $x = [blue, square]$
    • class? (ground truth: +)
  • decision rule
    • (MAP) if $P(c=+ \mid x=[blue, square]) \geq P(c=- \mid x=[blue, square])$, predict +; otherwise predict -

Classify a new sample

./images/toy_dataset_1.png ./images/toy_dataset_2.png

  • computing
    • (prior) $P(+) = \frac{7}{12} \approx 0.58$, $P(-) = \frac{5}{12} \approx 0.42$
    • (likelihood, +) $P(x|+) = P(blue|+)\,P(square|+) = \frac{3}{7} \cdot \frac{5}{7} \approx 0.31$ (features assumed conditionally independent given the class)
    • (likelihood, -) $P(x|-) = P(blue|-)\,P(square|-) = \frac{3}{5} \cdot \frac{3}{5} = 0.36$ (same independence assumption)
    • (posterior, +) $P(+|x) \propto P(x|+)\,P(+) = 0.31 \cdot 0.58 \approx 0.18$
    • (posterior, -) $P(-|x) \propto P(x|-)\,P(-) = 0.36 \cdot 0.42 \approx 0.15$
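
The same arithmetic in a few lines of Python (the counts 7/12, 5/12, 3/7, 5/7, 3/5, 3/5 are read directly off the slide; the variable names are mine):

#+BEGIN_SRC python
# Reproduce the toy computation: priors, class-conditional likelihoods,
# and the unnormalized posteriors P(c|x) ∝ P(x|c) P(c).
prior = {"+": 7 / 12, "-": 5 / 12}
likelihood = {
    "+": {"blue": 3 / 7, "square": 5 / 7},
    "-": {"blue": 3 / 5, "square": 3 / 5},
}

x = ["blue", "square"]
score = {c: prior[c] * likelihood[c][x[0]] * likelihood[c][x[1]] for c in ("+", "-")}
print(score)                      # {'+': 0.178..., '-': 0.15}
print(max(score, key=score.get))  # '+'
#+END_SRC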

Classify a new sample

  • on dropping $p(x)$
    • $p(x)$ is called the evidence
    • it does not depend on the class, so it has no effect on the final decision
  • classification
    • $P(+|x) \propto 0.18 > 0.15 \propto P(-|x)$, so the sample is classified as +.
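  • sanity check: normalizing by the evidence (using the slide's numbers) changes the values but not the ranking:
    • $p(x) \approx 0.18 + 0.15 = 0.33$
    • $P(+|x) \approx \frac{0.18}{0.33} \approx 0.55$, $P(-|x) \approx \frac{0.15}{0.33} \approx 0.45$
    • dividing both scores by the same $p(x)$ cannot change which one is larger.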

A trickier case

./images/toy_dataset_3.png

  • New Sample
    • features $x = [yellow, square]$
    • likelihood $P(x|+) = P(yellow|+)\,P(square|+) = 0 \cdot \frac{5}{7} = 0$? A single unseen feature value zeroes out the whole product.
  • Laplace (add-1) smoothing
    • $\hat{P}(x_i|c)$
    • $= \frac{count(x_i, c) + 1}{\sum_{x \in V}(count(x, c) + 1)}$
    • $= \frac{count(x_i, c) + 1}{\sum_{x \in V} count(x, c) + |V|}$
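
A short Python sketch of add-1 smoothing for the color feature within class + (blue = 3 and yellow = 0 are given on the slides; the green/red split of 2/2 is a hypothetical filler so the counts sum to the 7 positive samples):

#+BEGIN_SRC python
# Add-1 (Laplace) smoothing for one categorical feature within one class.
def laplace_smooth(counts, vocab):
    total = sum(counts.get(v, 0) for v in vocab)
    return {v: (counts.get(v, 0) + 1) / (total + len(vocab)) for v in vocab}

# Color counts within class +. blue=3 and yellow=0 come from the slides;
# green=2, red=2 is a hypothetical split just to make the counts sum to 7.
color_counts_pos = {"blue": 3, "green": 2, "red": 2, "yellow": 0}
colors = ["blue", "green", "red", "yellow"]

smoothed = laplace_smooth(color_counts_pos, colors)
print(smoothed["yellow"])  # (0 + 1) / (7 + 4) ≈ 0.09, no longer exactly 0
print(smoothed["blue"])    # (3 + 1) / (7 + 4) ≈ 0.36
#+END_SRC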

Summary: Applying Naive Bayes to Text Classification

  • (training set) feature extraction (bag of words)
  • Naive Bayes and Language Modeling
    • prior (one per class)
    • likelihood (conditional independence, Laplace smoothing)
    • drop the evidence term
    • compute the posterior
    • apply the decision rule
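
Putting the pipeline together, here is a compact sketch of a multinomial naive Bayes text classifier (the function and variable names are mine; it follows the steps listed above: bag-of-words counts, class priors, add-1 smoothed likelihoods, log-space scoring to avoid underflow, and the MAP decision rule):

#+BEGIN_SRC python
import math
import re
from collections import Counter, defaultdict

def tokenize(doc):
    return re.findall(r"[a-z]+", doc.lower())

def train_nb(labeled_docs):
    class_docs = defaultdict(int)       # number of documents per class
    word_counts = defaultdict(Counter)  # word counts per class
    vocab = set()
    for doc, c in labeled_docs:
        class_docs[c] += 1
        for w in tokenize(doc):
            word_counts[c][w] += 1
            vocab.add(w)
    n_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        # add-1 smoothing over the whole vocabulary
        log_likelihood[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                             for w in vocab}
    return log_prior, log_likelihood, vocab

def classify_nb(doc, log_prior, log_likelihood, vocab):
    scores = {}
    for c in log_prior:
        s = log_prior[c]                # log prior
        for w in tokenize(doc):
            if w in vocab:              # ignore words never seen in training
                s += log_likelihood[c][w]
        scores[c] = s
    return max(scores, key=scores.get)  # MAP decision rule (evidence dropped)

# Usage with the movie-review snippets from the earlier slides:
train_set = [
    ("Unbelievably disappointing.", "-"),
    ("This is the greatest screwball comedy ever filmed.", "+"),
    ("It was pathetic. The worst part about it was the boxing scenes.", "-"),
    ("Greatest film ever.", "+"),
]
model = train_nb(train_set)
print(classify_nb("the greatest film", *model))  # '+'
#+END_SRC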

A Worked Example [Dan Jurafsky]

./images/text_class_eg.png