<!DOCTYPE html>
<html lang="" xml:lang="">
<head>
<title>Machine Learning (ML) Module: Intro. Presentation</title>
<meta charset="utf-8" />
<meta name="author" content="Joshua Rosenberg" />
<meta name="date" content="2022-07-11" />
<script src="libs/header-attrs-2.14/header-attrs.js"></script>
<link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
<link href="libs/panelset-0.2.6/panelset.css" rel="stylesheet" />
<script src="libs/panelset-0.2.6/panelset.js"></script>
<script src="libs/clipboard-2.0.6/clipboard.min.js"></script>
<link href="libs/xaringanExtra-clipboard-0.2.6/xaringanExtra-clipboard.css" rel="stylesheet" />
<script src="libs/xaringanExtra-clipboard-0.2.6/xaringanExtra-clipboard.js"></script>
<script>window.xaringanExtraClipboard(null, {"button":"<i class=\"fa fa-clipboard\"><\/i>","success":"<i class=\"fa fa-check\" style=\"color: #90BE6D\"><\/i>","error":"Press Ctrl+C to Copy"})</script>
<link href="libs/font-awesome-5.1.0/css/all.css" rel="stylesheet" />
<link href="libs/font-awesome-5.1.0/css/v4-shims.css" rel="stylesheet" />
<link href="libs/tile-view-0.2.6/tile-view.css" rel="stylesheet" />
<script src="libs/tile-view-0.2.6/tile-view.js"></script>
<link rel="stylesheet" href="css/laser.css" type="text/css" />
<link rel="stylesheet" href="css/laser-fonts.css" type="text/css" />
</head>
<body>
<textarea id="source">
class: clear, title-slide, inverse, center, top, middle
# Machine Learning (ML) Module: Intro. Presentation
----
### Joshua Rosenberg
### https://laser-institute.github.io/machine-learning/introductory-presentation.html#1
### July 11, 2022
---
class: clear, inverse, center, middle
# Front Matter
---
class: clear, inverse, center, middle
background-image: url(https://imgs.xkcd.com/comics/machine_learning.png)
background-position: center
background-size: contain
---
# Goals for this module
### Over-arching goal:
- **Get started with applying machine learning methods in R**
### Specific aims are to learn about:
- A small but important set of ideas about machine learning
- How these ideas are instantiated through the *tidymodels* R package
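As a first taste of *tidymodels*, a minimal sketch (assuming the tidymodels meta-package is installed; the built-in `mtcars` data stands in for the module's actual data):

```r
# A minimal tidymodels sketch: specify a model, pick an engine, fit, predict
library(tidymodels)

lm_fit <- linear_reg() %>%              # declare the type of model
  set_engine("lm") %>%                  # choose how it is estimated
  fit(mpg ~ hp + wt, data = mtcars)     # train on built-in data

predict(lm_fit, new_data = head(mtcars))  # predictions for "new" rows
```

The same specify-engine-fit-predict pattern carries through everything we do in this module.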
---
# One example of ML . . .
<img src="img/gpt-3.png" width="62%" style="display: block; margin: auto;" />
https://beta.openai.com/playground
---
# How I came to ML
**A magical moment**
- Learned about supervised machine learning from a colleague
- Used these methods to discover patterns in students' embedded assessments, Rosenberg et al. (2021), *Journal of Science Education and Technology*, https://link.springer.com/article/10.1007/s10956-020-09862-4
**Using Twitter users' profile descriptions to predict whether or not they were teachers**
- Manually coded > 500 profiles for whether they were teachers or non-teachers and then trained a ML model to predict 1000s of teachers' roles as teachers or non-teachers (with an accuracy of 85%), [Rosenberg et al., *AERA Open* (2021)](https://journals.sagepub.com/doi/full/10.1177/23328584211024261)
---
class: clear, inverse, center, middle
# Overview of Machine Learning (ML)
---
# Defining ML
- *Artificial Intelligence (AI)* (e.g., [GPT-3](https://openai.com/api/))
: Simulating human intelligence through the use of computers
- *Machine learning (ML)*: A subset of AI focused on how computers acquire new information/knowledge
This definition leaves a lot of space for a range of approaches to ML
---
# Supervised & unsupervised
## Supervised ML
- Requires coded data or data with a known outcome
- Uses coded/outcome data to train an algorithm
- Uses that algorithm to **predict the codes/outcomes for new data** (data not used during the training)
- Can take the form of a *classification* (predicting a dichotomous or categorical outcome) or a *regression* (predicting a continuous outcome)
- Algorithms include:
- [Linear regression (really!)](https://web.stanford.edu/~hastie/ElemStatLearn/)
- Logistic regression
- Decision tree
- Support Vector Machine
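To make the *classification* case concrete, a base R sketch (illustrative only, using the built-in `mtcars` data): logistic regression predicting a dichotomous outcome from a single predictor.

```r
# Classification sketch: predict whether a car has a manual
# transmission (am = 1) from its weight
model <- glm(am ~ wt, data = mtcars, family = binomial)

# predicted probabilities for the training data
probs <- predict(model, type = "response")

# turn probabilities into dichotomous predictions
preds <- ifelse(probs > 0.5, 1, 0)

# simple accuracy: proportion of correct predictions
mean(preds == mtcars$am)
```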
---
# What kind of coded data?
> Want to detect spam? Get samples of spam messages. Want to forecast stocks? Find the price history. Want to find out user preferences? Parse their activities on Facebook (no, Mark, stop collecting it, enough!) (from [ML for Everyone](https://vas3k.com/blog/machine_learning/))
In educational research:
- Assessment data (e.g., [1](https://link.springer.com/article/10.1007/s10956-020-09895-9))
- Data from log files ("trace data") (e.g., [1](https://www.tandfonline.com/doi/full/10.1080/10508406.2013.837391?casa_token=-8Fm2KCFJ30AAAAA%3Altbc8Y8ci_z-uLJx4se9tgvru9mzm3yqCTFi12ndJ5uM6RDl5YJGG6_4KpUgIK5BYa_Ildeh2qogoQ))
- Open-ended written responses (e.g., [1](https://link.springer.com/article/10.1007/s10956-020-09889-7), [2](https://doi.org/10.1007/s11423-020-09761-w))
- Achievement data (such as end-of-course grades) (e.g., [1](https://link.springer.com/article/10.1007/s10956-020-09888-8), [2](https://search.proquest.com/docview/2343734516?pq-origsite=gscholar&fromopenview=true))
---
# How is this different from regression?
The _aim_ is different; the algorithms and methods of estimation are not (or differ in degree rather than in kind).
In a linear regression, our aim is to estimate parameters, such as `\(\beta_0\)` (intercept) and `\(\beta_1\)` (slope), and to make inferences about them that are not biased by our particular sample.
In an ML approach, we can use the same linear regression model, but with a goal other than making unbiased inferences about the `\(\beta\)` parameters:
<h4><center>In supervised ML, our goal is to minimize the difference between a known `\(y\)` and our predictions, `\(\hat{y}\)`</center></h4>
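One common way to quantify that difference is the root mean squared error, `\(\sqrt{\frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2}\)`. A base R sketch (again with the built-in `mtcars` data as a stand-in):

```r
# Same linear model as before, but evaluated on a predictive criterion
fit   <- lm(mpg ~ hp + wt, data = mtcars)
y     <- mtcars$mpg        # known outcome
y_hat <- predict(fit)      # our predictions

rmse <- sqrt(mean((y - y_hat)^2))  # root mean squared error
rmse
```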
---
# So, how is this really different?
This _predictive goal_ means that we can do things differently:
- Multicollinearity is not an issue because we do not care to make inferences about parameters
- Because interpreting specific parameters is less of an interest, we can use a great deal more predictors
- We focus on how accurately a _trained_ model can predict the values in _test_ data
- We can make our models very complex!
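The train/test distinction can be sketched in a few lines of base R (tidymodels provides this via `rsample::initial_split()`; here the split is done by hand on built-in data for illustration):

```r
# Train/test split sketch: fit on one portion, evaluate on the other
set.seed(123)
n         <- nrow(mtcars)
train_idx <- sample(n, size = round(0.75 * n))  # 75% for training

train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

fit   <- lm(mpg ~ hp + wt, data = train)        # train
y_hat <- predict(fit, newdata = test)           # predict held-out rows
sqrt(mean((test$mpg - y_hat)^2))                # test-set RMSE
```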
---
# Okay, _really_ complex
- Neural/deep networks
  - e.g., GPT-3 (175 B parameters), GPT-4 (>1 T parameters)
- And, some models can take a different form than familiar regressions:
- *k*-nearest neighbors
- Decision trees (and their extensions of bagged and random forests)
- Last, the modeling process can look different:
- Ensemble models that combine or improve on ("boosting") the predictions of individual models
---
# Supervised & unsupervised
## Unsupervised ML
- Does not require coded data; one way to think about unsupervised ML is that its purpose is to discover codes/labels
- Can be used in an _exploratory mode_ (see [Nelson, 2020](https://journals.sagepub.com/doi/full/10.1177/0049124118769114?casa_token=EV5XH31qbyAAAAAA%3AFg09JQ1XHOOzlxYT2SSJ06vZv0jG-s4Qfz8oDIQwh2jrZ-jrHNr7xZYL2FwnZtZiokhPalvV1RL2Bw))
- **Warning**: The results of unsupervised ML _cannot_ directly be used to provide codes/outcomes for supervised ML techniques
- Algorithms include:
- Cluster analysis and Latent Profile Analysis
- [Principal Components Analysis (really!)](https://web.stanford.edu/~hastie/ElemStatLearn/)
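A minimal sketch of the unsupervised case in base R: *k*-means finds groups without any coded outcome (here, the built-in `iris` measurements, ignoring the `Species` labels, purely as an illustration):

```r
# Unsupervised sketch: discover groups from patterns in the data alone
set.seed(123)
features <- scale(iris[, 1:4])        # standardize the four measurements
km <- kmeans(features, centers = 3)   # ask for three clusters

table(km$cluster)                     # observations per discovered cluster
```

Note that the algorithm never sees the species labels; any correspondence between clusters and species is discovered, not supervised.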
---
# What technique should I choose?
Do you have coded data or data with a known outcome -- let's say about K-12 students -- and do you want to:
- _Predict_ how other students with similar data (but without a known outcome) perform?
- _Scale_ coding that you have done for a sample of data to a larger sample?
- _Provide timely or instantaneous feedback_, like in many learning analytics systems?
**Supervised methods may be your best bet**
---
# What technique should I choose?
Do you not yet have codes/outcomes -- and do you want to?
- _Achieve a starting point_ for qualitative coding, perhaps in a ["computational grounded theory"](https://journals.sagepub.com/doi/full/10.1177/0049124117729703) mode?
- _Discover groups or patterns in your data_ that may be of interest?
- _Reduce the number of variables in your dataset_ to a smaller set that is perhaps nearly as explanatory/predictive?
**Unsupervised methods may be helpful**
---
# Examples of machine learning in STEM Ed Research
.panelset[
.panel[.panel-name[Example 1]
**Using digital log-trace data and supervised ML**
Gobert, J. D., Sao Pedro, M., Raziuddin, J., & Baker, R. S. (2013). From log files to assessment metrics: Measuring students' science inquiry skills using educational data mining. *Journal of the Learning Sciences, 22*(4), 521-563.
- Utilized *replay tagging* to code sequences of students' activity within digital *log-trace* data
- Then trained a supervised ML algorithm to **automate** the prediction of the presence or absence of students' engagement in inquiry
]
.panel[.panel-name[Example 2]
**Combining best practices in assessment with supervised ML**
Maestrales, S., Zhai, X., Touitou, I., Baker, Q., Schneider, B., & Krajcik, J. (2021). Using machine learning to score multi-dimensional assessments of chemistry and physics. *Journal of Science Education and Technology, 30*(2), 239-254.
- Carrying out a careful process of qualitative coding (and the establishment of validity and interrater reliability) of students' written constructed responses
- Training a supervised ML algorithm in advance of being able to **scale up** their coding to a larger number of responses
]
.panel[.panel-name[Example 3]
**Combining qualitative methods with unsupervised ML**
Sherin, B. (2013). A computational study of commonsense science: An exploration in the automated analysis of clinical interview data. *Journal of the Learning Sciences, 22*(4), 600-638.
- Used unsupervised machine learning methods to identify topics within students' interviews **from patterns in the data alone**
- Interpreted those topics in light of rigorous qualitative coding with the aim of boosting the validity of the use of both the machine learning and qualitative approaches
]
]
---
class: clear, inverse, center, middle
# Learning Labs!
---
# Learning Labs Overviews
.panelset[
.panel[.panel-name[Overview]
**What will I learn in this topic area?**
We'll work to answer these four questions:
- LL 1: Can we predict something we would have coded by hand?
- LL 2: How much do new predictors improve the prediction quality?
- LL 3: How much of a difference does a more complex model make?
- LL 4: What if we do not have training data?
]
.panel[.panel-name[LL 1]
**Machine Learning Learning Lab 1: Focusing on prediction**
We have some data, but we want to use a computer to **automate** or scale up the relationships between predictor (independent) and outcome (dependent) variables. Supervised machine learning is suited to this aim. In particular, in this learning lab, we explore how we can train a computer to learn and reproduce qualitative coding we carried out---though the principles extend to other types of variables.
]
.panel[.panel-name[LL 2]
**Machine Learning Learning Lab 2: Feature engineering**
Once we have trained a model using supervised machine learning methods, we may realize we can improve it by adding new variables or changing how the variables we are using were prepared. In this learning lab, we consider **feature engineering** with a variety of types of data from many online science courses to predict students' success after only a few weeks of the course.
]
.panel[.panel-name[LL 3]
**Machine Learning Learning Lab 3: Fine-tuning the model**
Even with optimal feature engineering, we may be able to specify an even more predictive model by selecting and *tuning* a more sophisticated algorithm. While in the first two learning labs we used logistic regression as the "engine" (or algorithm), in this learning lab, we use a random forest as the engine for our model. This more sophisticated type of model requires us to specify **tuning parameters**, specifications that determine how the model works.
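A hedged sketch of what specifying such a tunable model looks like in tidymodels (assuming the tidymodels and ranger packages are installed; `mtry` and `min_n` are the tuning parameters, left to be chosen later via resampling):

```r
# Random forest specification with tuning parameters left open
library(tidymodels)

rf_spec <- rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_spec
```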
]
.panel[.panel-name[LL 4]
**Machine Learning Learning Lab 4: Finding groups (or codes) in data**
The previous three learning labs involved the use of data with known outcome variables (coded for the substantive or transactional nature of the conversations taking place through #NGSSchat in learning labs 1 and 3 and students' grades in learning lab 2). Accordingly, we explored different aspects of supervised machine learning. What if we have data without something that we can consider to be a dependent variable? **Unsupervised machine learning methods** can be used in such cases.
]
]
---
# Thanks
I hope to see you in the ML topic area learning labs!
Slides [here](https://laser-institute.github.io/machine-learning/introductory-presentation.html#1)
## .font130[.center[**Thank you!**]]
<br/>
.center[<img style="border-radius: 80%;" src="img/jr-cycling.jpeg" height="200px"/><br/>**Dr. Joshua Rosenberg**<br/><mailto:jmrosenberg@utk.edu>]
Joshua Rosenberg
jmrosenberg@utk.edu
@jrosenberg6432
Slides created via the R package [xaringan](https://github.com/yihui/xaringan)
Available:
</textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "default",
"highlightLines": true,
"highlightLanguage": "r",
"countIncrementalSlides": false,
"ratio": "16:9",
"slideNumberFormat": "<div class=\"progress-bar-container\">\n <div class=\"progress-bar\" style=\"width: calc(%current% / %total% * 100%);\">\n </div>\n</div>"
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
window.dispatchEvent(new Event('resize'));
});
(function(d) {
var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
if (!r) return;
s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
d.head.appendChild(s);
})(document);
(function(d) {
var el = d.getElementsByClassName("remark-slides-area");
if (!el) return;
var slide, slides = slideshow.getSlides(), els = el[0].children;
for (var i = 1; i < slides.length; i++) {
slide = slides[i];
if (slide.properties.continued === "true" || slide.properties.count === "false") {
els[i - 1].className += ' has-continuation';
}
}
var s = d.createElement("style");
s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
var deleted = false;
slideshow.on('beforeShowSlide', function(slide) {
if (deleted) return;
var sheets = document.styleSheets, node;
for (var i = 0; i < sheets.length; i++) {
node = sheets[i].ownerNode;
if (node.dataset["target"] !== "print-only") continue;
node.parentNode.removeChild(node);
}
deleted = true;
});
})();
// add `data-at-shortcutkeys` attribute to <body> to resolve conflicts with JAWS
// screen reader (see PR #262)
(function(d) {
let res = {};
d.querySelectorAll('.remark-help-content table tr').forEach(tr => {
const t = tr.querySelector('td:nth-child(2)').innerText;
tr.querySelectorAll('td:first-child .key').forEach(key => {
const k = key.innerText;
if (/^[a-z]$/.test(k)) res[k] = t; // must be a single letter (key)
});
});
d.body.setAttribute('data-at-shortcutkeys', JSON.stringify(res));
})(document);
(function() {
"use strict"
// Replace <script> tags in slides area to make them executable
var scripts = document.querySelectorAll(
'.remark-slides-area .remark-slide-container script'
);
if (!scripts.length) return;
for (var i = 0; i < scripts.length; i++) {
var s = document.createElement('script');
var code = document.createTextNode(scripts[i].textContent);
s.appendChild(code);
var scriptAttrs = scripts[i].attributes;
for (var j = 0; j < scriptAttrs.length; j++) {
s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
}
scripts[i].parentElement.replaceChild(s, scripts[i]);
}
})();
(function() {
var links = document.getElementsByTagName('a');
for (var i = 0; i < links.length; i++) {
if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
links[i].target = '_blank';
}
}
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
const hlines = d.querySelectorAll('.remark-code-line-highlighted');
const preParents = [];
const findPreParent = function(line, p = 0) {
if (p > 1) return null; // traverse up no further than grandparent
const el = line.parentElement;
return el.tagName === "PRE" ? el : findPreParent(el, ++p);
};
for (let line of hlines) {
let pre = findPreParent(line);
if (pre && !preParents.includes(pre)) preParents.push(pre);
}
preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>
<script>
slideshow._releaseMath = function(el) {
var i, text, code, codes = el.getElementsByTagName('code');
for (i = 0; i < codes.length;) {
code = codes[i];
if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
text = code.textContent;
if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
/^\$\$(.|\s)+\$\$$/.test(text) ||
/^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
code.outerHTML = code.innerHTML; // remove <code></code>
continue;
}
}
i++;
}
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
if (location.protocol !== 'file:' && /^https?:/.test(script.src))
script.src = script.src.replace(/^https?:/, '');
document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
</body>
</html>