---
title: Introduction to Python for Social Science
subtitle: Lecture 8 - Intro to Natural Language Processing
author: Musashi Harukawa, DPIR
date: 8th Week Hilary 2020
---

# This Week

## Roadmap

This week we dive into my favourite field, _Natural Language Processing_ (NLP), and look at the opportunities and difficulties that come with trying to harness text-as-data.

Points covered:

- What is NLP?
- Considerations about language as a data source
- Representations of Language and Relevant Metrics
- Application: POS-taggers, NER, and noun extraction.

# Intro to Natural Language Processing

## What is NLP?

_Natural Language Processing_ (NLP) is a cross-disciplinary field, drawing on linguistics, computer science, information retrieval, machine learning and artificial intelligence (among other fields), that focuses on computational methods for language. Applications include _speech recognition_, _natural language generation_, and _natural language understanding_.

- _Natural Language_ is most easily defined in contrast to artificial or constructed languages. It includes all world languages, but excludes languages such as programming languages or Esperanto.

## Social Science Applications

Applications in social science are usually focused on the _information retrieval_/_understanding_ aspect of NLP. In other words, our focus is on using language as a data source, rather than building a working model for languages. As such, some major applications include:

- _Topic segmentation_
- _Summarisation_
- _Scaling_
- _Sentiment analysis_

## Terminology

NLP has its own terminology, related to but separate from machine learning. Key terms include:

- _Text-as-data_: Applications of NLP focused on the extraction of information from textual data.
- _String_: A sequence of characters.
- _Token_: A string with a semantic function, often delineated by spaces in English; in practice, a word or a word fragment.
- _Document_: Sentences aggregated to a unit of analysis; can be a paragraph, a speech, a Tweet, etc.
- _Corpus_: A set of documents.
- _Vocabulary_: The set of all tokens. Usually the unique set of tokens contained within a given corpus.
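
To make these terms concrete, here is a minimal sketch in plain Python; the two-document corpus and the crude regex tokenizer are illustrative assumptions rather than anything from the slides:

```python
import re

# A toy corpus: two "documents", each a string.
corpus = [
    "The fox jumped over the fence.",
    "The buffalo kicked the fence.",
]

def tokenize(document):
    """A crude tokenizer: lowercase and keep alphabetic runs."""
    return re.findall(r"[a-z]+", document.lower())

documents = [tokenize(doc) for doc in corpus]               # lists of tokens
vocabulary = {token for doc in documents for token in doc}  # unique tokens

print(documents[0])        # ['the', 'fox', 'jumped', 'over', 'the', 'fence']
print(sorted(vocabulary))
```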

## Further Terminology

- _Lemmatization_: The reduction of a word to its dictionary form, or _lemma_.
- _Stemming_: A set of rules for removing suffixes (and prefixes) from a word to reduce it to a word "stem".
- _Part-of-Speech Tagging_: The process of identifying the syntactic function of each token in a sentence, e.g. noun, verb, quantifier, etc.
- _Named Entity Recognition_: Automatic tagging of "named entities" within a text, usually proper nouns such as companies, people or places.

# Language as a Data Source

## Text-as-Data?

In general, as social scientists, we are not intrinsically interested in the process by which utterances and texts are generated (natural language generation) but rather the conditions which influence the generation of one text over another.

Put differently, we use language as a proxy to measure the states and processes that both produce the text and are produced by it.


## Example Question

- Imagine we are interested in the following question: _What is the effect of electoral system on legislators' Twitter usage?_
- How do we go about answering this question?

## Formalizing Language

A tweet by legislator $i$ of party $j$ at time $t$ can be described as an ordered set of tokens, drawn from the vocabulary $V$:

$$
d_{ijt} := \langle w_1, w_2, ..., w_n, w_\omega \rangle \; \forall w_i \in V
$$

## Language Automata

Given this chain of tokens, we can describe the tokens as being probabilistically generated by an _automaton_: a process that places a probability on each token given the token(s) that came before it.
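
For illustration, a first-order Markov chain is about the simplest such automaton: it places a probability on each token given only the previous token, and can then generate new sequences. A minimal sketch over a hypothetical toy corpus (not from the slides):

```python
import random
from collections import defaultdict, Counter

# Toy corpus of tokenised "documents" (hypothetical example).
docs = [
    ["the", "fox", "jumped", "over", "the", "fence"],
    ["the", "buffalo", "kicked", "the", "fence"],
]

# Count bigram transitions: how often does each token follow another?
transitions = defaultdict(Counter)
for doc in docs:
    for prev, nxt in zip(doc, doc[1:]):
        transitions[prev][nxt] += 1

# Generate a short sequence by sampling each token given the previous one.
token = "the"
sequence = [token]
for _ in range(4):
    followers = transitions.get(token)
    if not followers:
        break
    token = random.choices(list(followers), weights=followers.values())[0]
    sequence.append(token)

print(sequence)  # e.g. ['the', 'fox', 'jumped', 'over', 'the']
```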



## Generative Process

The probabilities in these automata can be described as a function placing probabilities over all tokens (the vocabulary) given the current state of the automaton. Aggregating this process up to the level of the _document_, we can think of each document as the output of a _document-generating process_.

## The Tweet-Generating Process

$$
d_{ijt} \leftarrow M(t, s_i, u_i, \lambda(\mathbf{w}_j, e_i, [...]), \mathcal{L}, [...])
$$

Where:

- $t$ is the substantive _topic_ of the Tweet.
- $s_i$ is the authorial _style_ of legislator $i$.
- $u_i$ is the preferences, or _position_ of legislator $i$.
  - $\lambda()$ is a constraint function on the preferences of legislator $i$.
  - $\mathbf{w}_j$ is the aggregated preferences, or _position_ of party $j$'s elite.
  - $e_i$ is the electoral formula faced by legislator $i$.
- $\mathcal{L}$ is the linguistic constraints on $d$, i.e. the language rules.


## Sources of Variance

Each of these inputs and constraints contributes _variance_ to the document-generating process (DGP). [Lauderdale and Herzog (_PA_, 2016)](http://benjaminlauderdale.net/files/papers/2016LauderdaleHerzogPA.pdf) provide a rough hierarchy of these sources of variance:

1. Language
2. Style
3. Topic
4. Position

## Our Goal

As we are social scientists, and not computational linguists, we are usually not interested in _recreating_ the document-generating process.

- Rather, we are usually interested in _quantifying_ the effect of exogenous constraints on political processes.
- Thinking of text as a proxy for the states that produced those texts, we are interested in quantifying the effect of particular aspects of those states.
- Ultimately, our use case involves using statistical and machine learning methods to disentangle and quantify the heterogeneous sources of variance in our data.

## Social Science Examples

- "A Scaling Model for Estimating Time-Series Party Positions from Texts", [Slapin and Proksch, AJPS 2008](http://www.wordfish.org/uploads/1/2/9/8/12985397/slapin_proksch_ajps_2008.pdf): Looks to infer party positions along a single ideological dimension with a model that assumes word counts are generated by a Poisson distribution.
- "How Words and Money Cultivate a Personal Vote: The Effect of Legislator Credit Claiming on Constituent Credit Allocation" [(Grimmer, Messing and Westwood 2012)](https://projects.iq.harvard.edu/ptr/files/grimmercreditclaiming.pdf)

# Languages as Numbers

## Representing Language

- We are already familiar with one computational representation of language: _strings_.
- All of the models we have encountered thus far require data that is both _numeric_ and _tabular_.
- String data is _non-numeric_ and _non-tabular_.
- Therefore we either need new kinds of models that can handle this structure of data, or a representation of language that can work with the kinds of models we have seen thus far.

## Frequency-Based Approaches

The simplest numeric representation of language data is a count of word occurrences.

The document "The fox jumped over the fence." can be represented as:

| the | fox | jumped | over | fence |
| --- | --- | ------ | ---- | ----- |
| 2 | 1 | 1 | 1 | 1 |
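
A quick sketch of this count using only the Python standard library (the regex tokenizer is an assumed convenience, not a fixed rule):

```python
import re
from collections import Counter

document = "The fox jumped over the fence."
tokens = re.findall(r"[a-z]+", document.lower())

counts = Counter(tokens)
print(counts)  # Counter({'the': 2, 'fox': 1, 'jumped': 1, 'over': 1, 'fence': 1})
```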

## Bag-of-Words

When we apply word counts to a _corpus_, we get a representation of language known as _bag-of-words_. Given two documents $d_1$ and $d_2$, "the fox jumped over the fence" and "the buffalo kicked the fence", we can write:

| **Document** | buffalo | fence | fox | jumped | kicked | over | the |
| ------------ | ------- | ----- | --- | ------ | ------ | ---- | --- |
| $d_1$ | 0 | 1 | 1 | 1 | 0 | 1 | 2 |
| $d_2$ | 1 | 1 | 0 | 0 | 1 | 0 | 2 |
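
One way to build this document-term matrix in Python is `scikit-learn`'s `CountVectorizer`; a minimal sketch (columns come out in alphabetical order, as in the table above; `get_feature_names_out` assumes scikit-learn 1.0 or later):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the fox jumped over the fence",   # d1
    "the buffalo kicked the fence",    # d2
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)     # sparse document-term matrix

print(vectorizer.get_feature_names_out())
# ['buffalo' 'fence' 'fox' 'jumped' 'kicked' 'over' 'the']
print(X.toarray())
# [[0 1 1 1 0 1 2]
#  [1 1 0 0 1 0 2]]
```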

## Variations

Some common variations of bag-of-words include:

- _Bag-of-ngrams_: An _ngram_ is an ordered sequence of $n$ words, so a bag of bi-grams counts the frequency of word pairs.
- _tf-idf_: Words are weighted by the product of two statistics (see the sketch below):
  - _Term Frequency_: The frequency of the term in the document, as in bag-of-words.
  - _Inverse Document Frequency_: $\log(\frac{n\;documents}{n\;documents\;containing\;w})$
  - This weighting penalises words that occur in many documents (e.g. "the", "and"), favouring words that are frequent within a document but rare across the corpus.
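
Both variations are available in `scikit-learn`; a minimal sketch (note that scikit-learn's idf adds smoothing constants, so its weights differ slightly from the textbook formula above):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["the fox jumped over the fence", "the buffalo kicked the fence"]

# Bag-of-bigrams: count pairs of adjacent words instead of single words.
bigrams = CountVectorizer(ngram_range=(2, 2))
print(bigrams.fit_transform(docs).toarray())
print(bigrams.get_feature_names_out())

# tf-idf: down-weight words that appear in every document ("the", "fence").
tfidf = TfidfVectorizer()
print(tfidf.fit_transform(docs).toarray().round(2))
```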

## Limitations

- _Syntactic Information Loss_: Frequency-based approaches are purely _semantic_ representations of language, and lose all _syntactic_ information. i.e. they measure _what words people choose_, and not _what these words mean in conjunction_ or _how these words are used_.
- _False equivalence_: Sentences containing the same words can have radically different meanings due to singular differences, or word order.
  - e.g. "The panda eats shoots and leaves" and "The panda eats, shoots and leaves".

<p class="fragment">Keep all this in mind when using models based on word frequencies: they often do not measure what you might think!</p>

## Vector Representation

_Word_ and _document embeddings_ refer to the vector representation of words and documents respectively.

Bag-of-words provides vector representations of words and documents in a corpus:

| **Document** | buffalo | fence | fox | jumped | kicked | over | the |
| ------------ | ------- | ----- | --- | ------ | ------ | ---- | --- |
| $d_1$ | 0 | 1 | 1 | 1 | 0 | 1 | 2 |
| $d_2$ | 1 | 1 | 0 | 0 | 1 | 0 | 2 |

- $vector(d_1) = [0, 1, 1, 1, 0, 1, 2]$
- $vector(buffalo) = [0, 1]$
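
These embeddings are simply the rows and columns of the document-term matrix, so they can be read straight off the `CountVectorizer` output; a self-contained sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the fox jumped over the fence", "the buffalo kicked the fence"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).toarray()
vocab = list(vectorizer.get_feature_names_out())

doc_vector = X[0]                           # row = document embedding for d1
word_vector = X[:, vocab.index("buffalo")]  # column = word embedding for "buffalo"

print(doc_vector)   # [0 1 1 1 0 1 2]
print(word_vector)  # [0 1]
```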

## `word2vec`

[Mikolov et al (2013)](https://arxiv.org/pdf/1301.3781.pdf) provide a method for word embeddings that is in many senses superior to the bag-of-words model.

`word2vec` works by training a neural network to predict the middle word of a moving window of words through the corpus. The resultant vector space has some very attractive properties: distances between word vectors track semantic similarity, and simple vector arithmetic captures analogies (the classic example being _king_ - _man_ + _woman_ ≈ _queen_).
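
The most common implementation is in `gensim`; a minimal sketch (the toy corpus is far too small for meaningful vectors, and the parameter names assume gensim 4.x, where `size` became `vector_size`):

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenised sentences (far too small in practice).
sentences = [
    ["the", "fox", "jumped", "over", "the", "fence"],
    ["the", "buffalo", "kicked", "the", "fence"],
    ["the", "fox", "ran", "from", "the", "buffalo"],
]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1, epochs=100)

print(model.wv["fox"][:5])           # first 5 dimensions of the "fox" vector
print(model.wv.most_similar("fox"))  # nearest neighbours in the embedding space
# With a large corpus, analogies also work, e.g.:
# model.wv.most_similar(positive=["king", "woman"], negative=["man"])
```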



## `GloVe`

Given the social sciences' (arguably justified) aversion to neural networks, you may also prefer [Pennington et al (2014)](https://www.aclweb.org/anthology/D14-1162.pdf), who take a more principled approach to word embeddings.

In essence, `GloVe` trains a statistical model to predict the co-occurrence of proximate terms. For further detail, see [this excellent blog post](https://mlexplained.com/2018/04/29/paper-dissected-glove-global-vectors-for-word-representation-explained/).

In terms of accuracy, `GloVe` performs as well as, and often _better_ than, `word2vec`, although the difference is limited for downstream analysis.
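
In practice you rarely train `GloVe` yourself: pretrained vectors can be loaded through `gensim`'s downloader. A sketch, assuming the `glove-wiki-gigaword-100` model from `gensim-data` and an internet connection for the first download:

```python
import gensim.downloader as api

# Download (once) and load 100-dimensional GloVe vectors trained on Wikipedia + Gigaword.
glove = api.load("glove-wiki-gigaword-100")

print(glove.most_similar("parliament", topn=5))
print(glove.similarity("labour", "conservative"))
```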

## Applications for Word Embeddings

- As seen from the `word2vec` example, _distances_ in the high-dimensional vector space generated by a model can accord with intuitive conceptual distances.
- With this space, we want to be able to _scale_ or _cluster_ words and documents, with the same intuitions that apply to other _scaling_ and _clustering_ tasks:
  - The ordering of and distance between observations on _latent dimensions_ should correspond with the concept we are aiming to measure, such as ideology.
  - Words and documents classified as similar should be similar in some meaningful way, by topic, position, etc.
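
As a sketch of the clustering idea, here is k-means run over a handful of hypothetical words using the pretrained GloVe vectors from above; this is purely illustrative, not a validated measurement strategy:

```python
import gensim.downloader as api
from sklearn.cluster import KMeans

glove = api.load("glove-wiki-gigaword-100")

# A hypothetical word list spanning a few intuitive topics.
words = ["tax", "budget", "deficit", "war", "army", "troops", "school", "teacher"]
vectors = [glove[w] for w in words]

# Cluster the word vectors; nearby words should land in the same cluster.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(vectors)
for word, label in zip(words, kmeans.labels_):
    print(label, word)
```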

## Measuring Similarity

- The standard metric for distance in Euclidean spaces is _Euclidean distance_.
- This is unsuitable for linguistic data for a number of reasons, the foremost being that word frequencies follow a [_power law_](https://en.wikipedia.org/wiki/Zipf%27s_law).
- The distribution of word frequencies is highly skewed, meaning that most of the distance between documents, measured by magnitude, will be the result of relatively meaningless differences in the occurrence of common words.

## Cosine Similarity

- One common alternative to Euclidean distances is _cosine similarity_.
- The cosine similarity between two vectors is a function of the angle created by the two vectors from the origin.
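
A quick sketch with `numpy`, reusing the two bag-of-words document vectors from the earlier table:

```python
import numpy as np

d1 = np.array([0, 1, 1, 1, 0, 1, 2])  # "the fox jumped over the fence"
d2 = np.array([1, 1, 0, 0, 1, 0, 2])  # "the buffalo kicked the fence"

# Cosine similarity: 1.0 = identical direction, 0.0 = orthogonal vectors.
cosine = d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))
print(round(float(cosine), 3))  # ~0.668
```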



## Word Mover's Distance

[Kusner et al (2015)](http://proceedings.mlr.press/v37/kusnerb15.pdf) apply [Earth Mover's Distance](https://en.wikipedia.org/wiki/Earth_mover%27s_distance), a measure of the minimum cost of transforming one distribution into another, to sentences, calling the result Word Mover's Distance.
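
`gensim` exposes this as `wmdistance` on a set of word vectors; a sketch reusing the pretrained GloVe vectors (the example sentences are from Kusner et al, and the optional `POT`/`pyemd` dependency must be installed):

```python
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-100")

sentence_1 = "the president greets the press in chicago".split()
sentence_2 = "obama speaks to the media in illinois".split()

# Lower distance = more similar documents, even with little word overlap.
print(glove.wmdistance(sentence_1, sentence_2))
```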



# Applications

## `spaCy`

`spaCy` provides industry-grade natural language processing resources with powerful out-of-the-box functionality. In particular, we can make quick use of its:

- Part-of-Speech Tagger
- Lemmatizer
- Named Entity Recognition

<p class="fragment">To a lesser extent, we may also be interested in:</p>

- Document and Word Similarity
- Token Matching
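
A minimal sketch of getting started (this assumes the small English model has been installed with `python -m spacy download en_core_web_sm`; similarity scores are more meaningful with `en_core_web_md`, which ships with word vectors):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The fox jumped over the fence.")

# Each token carries its text, lemma, and coarse part-of-speech tag.
for token in doc:
    print(token.text, token.lemma_, token.pos_)
```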

## Part-of-Speech Tagging

- `spaCy` provides a POS tagger that automatically labels tokens by their function within the sentence.
- We can use this to extract all of the nouns, adjectives, verbs, etc. from a text, which is a straightforward way to get an intuitive comparison across a large collection of documents.
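
For example, a sketch of noun extraction using the coarse `pos_` attribute (the example sentence is made up, and the exact output depends on the model):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The committee questioned the minister about the budget deficit.")

# Keep common nouns and proper nouns only.
nouns = [token.text for token in doc if token.pos_ in {"NOUN", "PROPN"}]
print(nouns)  # e.g. ['committee', 'minister', 'budget', 'deficit']
```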

## Lemmatization

- Often we do not care about the exact form of a word:
  - _investigate(s|d)_, _investigator(s)_, _investigation(s)_
- Lemmatization reduces each word to its dictionary form, or _lemma_:
  - _investigate_, _investigator_, _investigation_
- Stemming is similar, but uses a series of rules to simply cut off the ends of words, leaving a stem such as _investig_.
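
A sketch contrasting the two, using `spaCy`'s lemmatizer and NLTK's Porter stemmer (both the spaCy model and `nltk` are assumed to be installed; exact outputs vary by version):

```python
import spacy
from nltk.stem import PorterStemmer

nlp = spacy.load("en_core_web_sm")
stemmer = PorterStemmer()

for word in ["investigates", "investigated", "investigation", "investigators"]:
    lemma = nlp(word)[0].lemma_   # dictionary form from spaCy
    stem = stemmer.stem(word)     # rule-based truncation from NLTK
    print(f"{word:15} lemma={lemma:14} stem={stem}")
```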

## Named Entity Recognition

NER is the automatic tagging of _named entities_. These include people, places, companies, events, and so on.

- This functionality can also be useful when combing through large quantities of data, but be aware that NER is less accurate than POS tagging, and only recognises named entities that appeared in, or are similar to, the training data.
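
A sketch of NER with the same small English model (entity labels such as `PERSON`, `ORG` and `GPE` come from the model's training data, so treat them as noisy):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Boris Johnson met executives from Google in Oxford on Monday.")

# Each recognised span carries its text and entity label.
for ent in doc.ents:
    print(ent.text, ent.label_)
# e.g. Boris Johnson PERSON / Google ORG / Oxford GPE / Monday DATE
```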

# Things Not Covered Because They're Implemented in `R`

## Structural Topic Model

_Topic models_ are a class of dimensionality reduction models that discover latent substantive "topics" in a corpus. They have powerful applications in classifying documents and measuring attention.

A dominant variant of the topic model is the _structural topic model_, which unfortunately is only implemented in `R`.

## Wordfish and Wordshoals

Wordfish (Slapin and Proksch 2008) and its more recent context-aware variant Wordshoals (Lauderdale and Herzog 2016) provide _scaling_ estimates of texts. Though originally applied to party manifestos, these methods have a multitude of applications for quantifying one-dimensional variation between political actors.



# References

## Word Embedding Approaches

- ["Concept Mover’s Distance: measuring concept engagement via word embeddings in texts"](https://link.springer.com/article/10.1007/s42001-019-00048-6), Stoltz \& Taylor (2019)

## Topic Models

- ["Structural Topic Model"](https://www.structuraltopicmodel.com/), Roberts et al (2015)
- ["How Words and Money Cultivate a Personal Vote: The Effect of Legislator Credit Claiming on Constituent Credit Allocation"](https://projects.iq.harvard.edu/ptr/files/grimmercreditclaiming.pdf), Grimmer, Messing and Westwood (2012)