---
title: "Lesson: Part-of-Speech Tagging"
output:
  html_document:
    toc: yes
    toc_float: yes
    number_sections: true
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Parts of Speech (PoS)
- Ryan
***
# PoS Tagging

**What is it?**

"Part-of-speech tagging is the process of assigning a part-of-speech marker to each word in an input text" (Jurafsky, 156).

**Why is it useful?**

- Information Extraction/Retrieval
- Phrase Identification
- Named Entity Recognition
- Word Sense Disambiguation
- Text-to-Speech (how should a word sound?)

\vspace{14cm}

**Why is it challenging?**

The challenge in POS tagging is the ambiguity of words. "The goal of POS-tagging is to *resolve* the ambiguities, choosing the proper tag for the context" (Jurafsky, 156). Ambiguous words account for only 14-15% of the vocabulary, but because they are common words, 55-67% of word tokens in running text are ambiguous (Jurafsky, 156).
Examples of frequent ambiguous words: *that*, *back*, *down*, *put*, and *set*.\

-- The **calm** lasted for three days. (Noun)\
-- **Calm** words show quiet minds. (Adjective)\
-- **Calm** your angry friend. (Verb)

\vspace{14cm}
**How are words disambiguated?**

- Probability: for a given word, each part-of-speech tag isn't equally likely.
- A set of hand-written rules.
- Most Frequent Class Baseline: for each ambiguous word, choose the tag that is *most frequent* in the training corpus (see the sketch below).
- Aside from the *baseline*, we will discuss four other methods for POS tagging: Hidden Markov Models, Maximum Entropy Markov Models, Conditional Random Fields, and rule-based systems.

\vspace{14cm}

**Accuracy**: the standard measure of part-of-speech tagger performance (the percentage of tags labeled correctly).
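
To make the baseline (and the accuracy measure) concrete, here is a minimal sketch in R. The tiny word/tag vectors are invented for illustration only; they are not drawn from a real corpus.

```{r most-frequent-class-baseline}
# Toy tagged training data (invented for illustration).
train <- data.frame(
  word = c("the", "calm", "lasted", "the", "calm", "words", "calm", "your", "friend"),
  tag  = c("DT",  "NN",   "VBD",    "DT",  "NN",   "NNS",   "VB",   "PRP$", "NN"),
  stringsAsFactors = FALSE
)

# For each word, pick the tag it receives most often in the training data.
most_frequent_tag <- sapply(split(train$tag, train$word), function(tags) {
  names(sort(table(tags), decreasing = TRUE))[1]
})

# Toy test data (also invented); tag it with the baseline and score accuracy.
test <- data.frame(
  word = c("calm", "words", "the", "friend"),
  tag  = c("JJ",   "NNS",   "DT",  "NN"),
  stringsAsFactors = FALSE
)
predicted <- most_frequent_tag[test$word]
accuracy  <- mean(predicted == test$tag)
accuracy
```

Note that the baseline has no way to use context, so "calm" always receives the same tag regardless of how it is used in the sentence; that is exactly the gap the sequence models below are meant to close.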
***
## PoS Tagging Tools

### Part-of-Speech Tagsets

A key component of tagging text for part-of-speech is the tagset used to label corpora. Several tagsets exist; however, the **Penn Treebank** tagset is the most common and is sufficient for the purposes of this lesson. Standard notation is to place the tag after each word, delimited by a slash, e.g., `The/DT calm/NN lasted/VBD for/IN three/CD days/NNS ./.`\
\
\vspace{12pt}
### Tagged Corpora

Corpora labeled with parts-of-speech provide crucial training (and testing) data for statistical tagging algorithms. For English text, three tagged corpora are typically used for training and testing part-of-speech taggers:

- **Brown corpus**: 1 million words sampled from 500 written texts from different genres published in the U.S. in 1961.
- **WSJ corpus**: 1 million words published in the Wall Street Journal in 1989.
- **Switchboard corpus**: 2 million words of telephone conversations from 1990-1991.

\vspace{12pt}
*For more information on labeled corpora, check out:* http://www.helsinki.fi/varieng/CoRD/corpora/index.html
\
\vspace{14cm}
The tagged corpora were created by running an automatic part-of-speech tagger on the texts, after which human annotators **hand-corrected** each tag. Minor differences exist between the tagsets used by these corpora; this is worth noting, since taggers trained on different tagged corpora can produce different part-of-speech tags for the same text. *(We will explore this further on Thursday.)*
# Maximum Entropy Markov Models (MEMM)

One limitation of the HMM is that it needs *massaging* to deal with unknown words, backoff, suffixes, and the like. Maximum Entropy Markov Models (MEMMs) deal with this issue by cleanly adding arbitrary features directly into the model.

**How is this done?**

- With a logistic regression model.
- A logistic regression becomes a discriminative sequence model by running it on successive words and allowing the previous output to be a feature for the current instance.

Let's see how this works... (Jurafsky, 168)

- HMMs compute the likelihood (observation words conditioned on tags), whereas MEMMs compute the posterior (tags conditioned on observation words); the two decoding objectives are contrasted below.
\
- The reason to use a discriminative sequence model is that it is easier to incorporate a lot of features.
\
\
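
Roughly, following the formulation in Jurafsky (ch. 8), the two models choose a tag sequence $\hat{T} = t_1 \ldots t_n$ for a word sequence $w_1 \ldots w_n$ as follows:

$$\hat{T}_{\mathrm{HMM}} = \arg\max_{T} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})$$

$$\hat{T}_{\mathrm{MEMM}} = \arg\max_{T} \prod_{i=1}^{n} P(t_i \mid w_i, t_{i-1})$$

Because the MEMM conditions directly on the observation, the context $(w_i, t_{i-1})$ can be replaced by any set of features of the words and previous tags (prefixes, suffixes, capitalization, and so on), which is what makes it easy to add arbitrary features.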
# Example

First, let's install the necessary packages for this example.
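
The specific packages for this example are not listed in this excerpt, so the chunk below is a minimal sketch under an assumption: it uses the `udpipe` package (one of several R options for part-of-speech tagging) and tags a short sentence with a pre-trained English model, reporting Penn Treebank-style tags in the `xpos` column. Swap in whatever packages the lesson actually uses.

```{r install-packages, eval=FALSE}
# Package choice is an assumption; udpipe is one convenient option for POS tagging in R.
install.packages("udpipe")

library(udpipe)

# Download and load a pre-trained English model (only needs to be done once).
model_info <- udpipe_download_model(language = "english")
tagger <- udpipe_load_model(file = model_info$file_model)

# Tag a short example sentence; upos holds universal tags, xpos Penn Treebank-style tags.
tagged <- as.data.frame(udpipe_annotate(tagger, x = "The calm lasted for three days."))
tagged[, c("token", "upos", "xpos")]
```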