
Naive Bayes and Text Classification

Review of A1

Bayes Rule in A1 (from Lec04)

  • Recall E in MAP:
    • $E(\tilde{w}) = L(\tilde{w}) + \frac{\lambda}{2}\tilde{w}^T\tilde{w}$
  • convert E back to probabilities by taking $\exp(-E)$:
    • $\exp(-E(\tilde{w})) = \exp(-L(\tilde{w}) - \frac{\lambda}{2}\tilde{w}^T\tilde{w})$
      $= \exp(-L(\tilde{w}))\,\exp(-\frac{\lambda}{2}\tilde{w}^T\tilde{w})$
      $= \prod_{i=1}^{N} P(t_i|x_i)\,\exp(-\frac{\lambda}{2}\tilde{w}^T\tilde{w})$

Bayes Rule in A1 (from Lec04)

  • $\exp(-\frac{\lambda}{2}\tilde{w}^T\tilde{w})$ is proportional to a Gaussian probability density function (PDF):
  • We can write this as $p(\tilde{w}) \propto \exp(-\frac{\lambda}{2}\tilde{w}^T\tilde{w})$
  • Minimizing E is equivalent to maximizing:
    • $\prod_{i=1}^{N} P(t_i|x_i)\, p(\tilde{w})$
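  • Connecting this to Bayes' rule (a brief restatement; D denotes the training data, and $\tilde{w}$ plays the role of θ on the next slide):
    • $\prod_{i=1}^{N} P(t_i|x_i)\, p(\tilde{w}) \propto p(D|\tilde{w})\, p(\tilde{w}) \propto p(\tilde{w}|D)$
    • so the minimizer of the regularized E is exactly the MAP estimate of $\tilde{w}$.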

Bayes Rule in A1 (from Lec04)

  • Take a look back at Bayes' rule:
    • $p(\theta|D) = \frac{p(D|\theta)\,p(\theta)}{p(D)} \propto p(D|\theta)\,p(\theta)$
  • prior $p(\theta)$: how likely θ is before observing the data.
  • likelihood $p(D|\theta)$: how likely the data set D is if the model parameter is θ.
  • posterior $p(\theta|D)$: how likely θ is after observing the data set D.
  • Estimating θ (learning the model) by maximizing the posterior distribution is called maximum a posteriori (MAP) estimation.
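  • As a compact formula (restating the definition above):
    • $\hat{\theta}_{MAP} = \arg\max_{\theta} p(\theta|D) = \arg\max_{\theta} p(D|\theta)\,p(\theta)$
    • the evidence $p(D)$ does not depend on θ, so it can be dropped from the maximization.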

The Problem of Text Classification

Positive or negative movie review? [Dan Jurafsky]

  • Unbelievably disappointing.
  • This is the greatest screwball comedy ever filmed.
  • It was pathetic. The worst part about it was the boxing scenes.

Classification Methods: Supervised Machine Learning

  • Input:
    • a document d.
    • a fixed set of classes $C = \{c_1, c_2, \ldots, c_j\}$
    • a training set of m hand-labeled documents $(d_1, c_1), \ldots, (d_m, c_m)$
  • Output:
    • a learned classifier $\gamma: d \rightarrow c$
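
A minimal sketch of this input/output contract in Python (the names Document, Label, and train are illustrative, and the returned classifier is only a majority-label baseline, not yet naive Bayes):

#+BEGIN_SRC python
# Sketch of the supervised interface: train on (d_i, c_i) pairs and
# return a classifier gamma: d -> c. The learner here is a trivial
# majority-label baseline, used only to show the shape of the API.
from collections import Counter
from typing import Callable, List, Tuple

Document = str
Label = str

def train(labeled_docs: List[Tuple[Document, Label]]) -> Callable[[Document], Label]:
    majority = Counter(c for _, c in labeled_docs).most_common(1)[0][0]
    return lambda d: majority

# gamma = train([("Greatest film ever.", "+"),
#                ("This is the greatest screwball comedy ever filmed.", "+"),
#                ("Unbelievably disappointing.", "-")])
# gamma("It was pathetic.")  # -> '+' (majority baseline, ignores the document)
#+END_SRC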

The Bag of words model

  • Idea: Represent a text document as a feature vector in order to use machine learning methods.
  • vocabulary: the set of all distinct feature words that occur in the training set, together with a count of how often each one occurs.
    • word order is ignored
    • word occurrences are treated as independent of each other (the naive Bayes assumption), even though in reality "hello" tends to be followed by "world"

Example of Bag of words model

  • Documents:
    • D1: “Unbelievably disappointing.”
    • D2: “This is the greatest screwball comedy ever filmed.”
    • D3: “It was pathetic. The worst part about it was the boxing scenes.”
    • D4: “Greatest film ever.”
  • Vocabulary
    • V = {disappointing: 1, greatest: 2, pathetic: 1, worst: 1}
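
A small Python sketch of this counting step (the slide's vocabulary keeps only the four sentiment-bearing words, so the sketch filters the full token counts down to those; an unrestricted bag of words would count every token):

#+BEGIN_SRC python
# Count word occurrences over the four example documents, then restrict
# to the four sentiment words to reproduce the slide's V.
from collections import Counter
import re

docs = [
    "Unbelievably disappointing.",
    "This is the greatest screwball comedy ever filmed.",
    "It was pathetic. The worst part about it was the boxing scenes.",
    "Greatest film ever.",
]

tokens = [w for d in docs for w in re.findall(r"[a-z]+", d.lower())]
counts = Counter(tokens)

keywords = ["disappointing", "greatest", "pathetic", "worst"]
V = {w: counts[w] for w in keywords}
print(V)  # {'disappointing': 1, 'greatest': 2, 'pathetic': 1, 'worst': 1}
#+END_SRC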

Naive Bayes Classifier

A Toy Example [Sebastian Raschka]

./images/toy_dataset_1.png

  • (Training) Dataset
    • 12 samples, 2 different classes +,-.
    • 2 features: color, geometrical shape.
  • Denote
    • $c_j$: the class label, $c_j = +$ for + and $c_j = -$ for -.
    • $x_j$: the 2-dimensional feature vector $x_j = [x_{j1}, x_{j2}]$, where $x_{j1} \in \{blue, green, red, yellow\}$ and $x_{j2} \in \{circle, square\}$

Classify a new sample

./images/toy_dataset_2.png

  • New Sample
    • features $x = [blue, square]$
    • class? (ground truth: +)
  • decision rule
    • (MAP) if $P(c=+ \mid x=[blue, square]) \geq P(c=- \mid x=[blue, square])$, predict +; otherwise predict -

Classify a new sample

./images/toy_dataset_1.png ./images/toy_dataset_2.png

  • computing
    • (prior) $P(+) = \frac{7}{12} \approx 0.58$, $P(-) = \frac{5}{12} \approx 0.42$
    • (likelihood, +) $P(x|+) = P(blue|+)\,P(square|+) = \frac{3}{7} \cdot \frac{5}{7} \approx 0.31$ (features assumed conditionally independent given the class)
    • (likelihood, -) $P(x|-) = P(blue|-)\,P(square|-) = \frac{3}{5} \cdot \frac{3}{5} = 0.36$ (same independence assumption)
    • (posterior, +) $P(+|x) \propto P(x|+)\,P(+) = 0.31 \cdot 0.58 \approx 0.18$
    • (posterior, -) $P(-|x) \propto P(x|-)\,P(-) = 0.36 \cdot 0.42 \approx 0.15$
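
The same arithmetic in a few lines of Python (the counts 7/12, 5/12, 3/7, 5/7, 3/5, 3/5 are read directly off the slide; the variable names are mine):

#+BEGIN_SRC python
# Reproduce the toy computation: priors, class-conditional likelihoods,
# and the unnormalized posteriors P(c|x) ∝ P(x|c) P(c).
prior = {"+": 7 / 12, "-": 5 / 12}
likelihood = {
    "+": {"blue": 3 / 7, "square": 5 / 7},
    "-": {"blue": 3 / 5, "square": 3 / 5},
}

x = ["blue", "square"]
score = {c: prior[c] * likelihood[c][x[0]] * likelihood[c][x[1]] for c in ("+", "-")}
print(score)                      # {'+': 0.178..., '-': 0.15}
print(max(score, key=score.get))  # '+'
#+END_SRC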

Classify a new sample

  • on dropping $p(x)$
    • $p(x)$ is called the evidence
    • it does not depend on the class, so it has no effect on the final decision
  • classification
    • $P(+|x) \propto 0.18 > 0.15 \propto P(-|x)$, so the sample is classified as +.
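  • sanity check: normalizing by the evidence (using the slide's numbers) changes the values but not the ranking:
    • $p(x) \approx 0.18 + 0.15 = 0.33$
    • $P(+|x) \approx \frac{0.18}{0.33} \approx 0.55$, $P(-|x) \approx \frac{0.15}{0.33} \approx 0.45$
    • dividing both scores by the same $p(x)$ cannot change which one is larger.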

A trickier case

./images/toy_dataset_3.png

  • New Sample
    • features $x = [yellow, square]$
    • likelihood $P(x|+) = P(yellow|+)\,P(square|+) = 0 \cdot \frac{5}{7} = 0$? A single unseen feature value zeroes out the whole product.
  • Laplace (add-1) smoothing
    • $\hat{P}(x_i|c)$
    • $= \frac{count(x_i, c) + 1}{\sum_{x \in V}(count(x, c) + 1)}$
    • $= \frac{count(x_i, c) + 1}{\sum_{x \in V} count(x, c) + |V|}$
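
A short Python sketch of add-1 smoothing for the color feature within class + (blue = 3 and yellow = 0 are given on the slides; the green/red split of 2/2 is a hypothetical filler so the counts sum to the 7 positive samples):

#+BEGIN_SRC python
# Add-1 (Laplace) smoothing for one categorical feature within one class.
def laplace_smooth(counts, vocab):
    total = sum(counts.get(v, 0) for v in vocab)
    return {v: (counts.get(v, 0) + 1) / (total + len(vocab)) for v in vocab}

# Color counts within class +. blue=3 and yellow=0 come from the slides;
# green=2, red=2 is a hypothetical split just to make the counts sum to 7.
color_counts_pos = {"blue": 3, "green": 2, "red": 2, "yellow": 0}
colors = ["blue", "green", "red", "yellow"]

smoothed = laplace_smooth(color_counts_pos, colors)
print(smoothed["yellow"])  # (0 + 1) / (7 + 4) ≈ 0.09, no longer exactly 0
print(smoothed["blue"])    # (3 + 1) / (7 + 4) ≈ 0.36
#+END_SRC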

Summary: Applying Naive Bayes to Text Classification

  • (training set) feature extraction (bag of words)
  • Naive Bayes and Language Modeling
    • prior (one per class)
    • likelihood (conditional independence, Laplace smoothing)
    • drop the evidence term
    • compute the posterior
    • apply the decision rule
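
Putting the pipeline together, here is a compact sketch of a multinomial naive Bayes text classifier (the function and variable names are mine; it follows the steps listed above: bag-of-words counts, class priors, add-1 smoothed likelihoods, log-space scoring to avoid underflow, and the MAP decision rule):

#+BEGIN_SRC python
import math
import re
from collections import Counter, defaultdict

def tokenize(doc):
    return re.findall(r"[a-z]+", doc.lower())

def train_nb(labeled_docs):
    class_docs = defaultdict(int)       # number of documents per class
    word_counts = defaultdict(Counter)  # word counts per class
    vocab = set()
    for doc, c in labeled_docs:
        class_docs[c] += 1
        for w in tokenize(doc):
            word_counts[c][w] += 1
            vocab.add(w)
    n_docs = sum(class_docs.values())
    log_prior = {c: math.log(n / n_docs) for c, n in class_docs.items()}
    log_likelihood = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        # add-1 smoothing over the whole vocabulary
        log_likelihood[c] = {w: math.log((word_counts[c][w] + 1) / (total + len(vocab)))
                             for w in vocab}
    return log_prior, log_likelihood, vocab

def classify_nb(doc, log_prior, log_likelihood, vocab):
    scores = {}
    for c in log_prior:
        s = log_prior[c]                # log prior
        for w in tokenize(doc):
            if w in vocab:              # ignore words never seen in training
                s += log_likelihood[c][w]
        scores[c] = s
    return max(scores, key=scores.get)  # MAP decision rule (evidence dropped)

# Usage with the movie-review snippets from the earlier slides:
train_set = [
    ("Unbelievably disappointing.", "-"),
    ("This is the greatest screwball comedy ever filmed.", "+"),
    ("It was pathetic. The worst part about it was the boxing scenes.", "-"),
    ("Greatest film ever.", "+"),
]
model = train_nb(train_set)
print(classify_nb("the greatest film", *model))  # '+'
#+END_SRC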

A Worked Example [Dan Jurafsky]

./images/text_class_eg.png