<!DOCTYPE html>
<html lang="" xml:lang="">
<head>
<title>Machine Learning (ML) Module: Intro. Presentation</title>
<meta charset="utf-8" />
<meta name="author" content="Joshua Rosenberg" />
<meta name="date" content="2022-07-11" />
<script src="libs/header-attrs-2.14/header-attrs.js"></script>
<link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
<link href="libs/panelset-0.2.6/panelset.css" rel="stylesheet" />
<script src="libs/panelset-0.2.6/panelset.js"></script>
<script src="libs/clipboard-2.0.6/clipboard.min.js"></script>
<link href="libs/xaringanExtra-clipboard-0.2.6/xaringanExtra-clipboard.css" rel="stylesheet" />
<script src="libs/xaringanExtra-clipboard-0.2.6/xaringanExtra-clipboard.js"></script>
<script>window.xaringanExtraClipboard(null, {"button":"<i class=\"fa fa-clipboard\"><\/i>","success":"<i class=\"fa fa-check\" style=\"color: #90BE6D\"><\/i>","error":"Press Ctrl+C to Copy"})</script>
<link href="libs/font-awesome-5.1.0/css/all.css" rel="stylesheet" />
<link href="libs/font-awesome-5.1.0/css/v4-shims.css" rel="stylesheet" />
<link href="libs/tile-view-0.2.6/tile-view.css" rel="stylesheet" />
<script src="libs/tile-view-0.2.6/tile-view.js"></script>
<link rel="stylesheet" href="css/laser.css" type="text/css" />
<link rel="stylesheet" href="css/laser-fonts.css" type="text/css" />
</head>
<body>
<textarea id="source">
class: clear, title-slide, inverse, center, top, middle
# Machine Learning (ML) Module: Intro. Presentation
----
### Joshua Rosenberg
### https://laser-institute.github.io/machine-learning/introductory-presentation.html#1
### July 11, 2022
---
class: clear, inverse, center, middle
# Front Matter
---
class: clear, inverse, center, middle
background-image: url(https://imgs.xkcd.com/comics/machine_learning.png)
background-position: center
background-size: contain
---
# Goals for this module
### Over-arching goal:
- **Get started with applying machine learning methods in R**
### Specific aims are to learn about:
- A small but important set of ideas about machine learning
- How these ideas are instantiated through the *tidymodels* R package
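As a first taste of *tidymodels*, a minimal sketch (assuming the tidymodels meta-package is installed; the built-in `mtcars` data stands in for the module's actual data):

```r
# A minimal tidymodels sketch: specify a model, pick an engine, fit, predict
library(tidymodels)

lm_fit <- linear_reg() %>%              # declare the type of model
  set_engine("lm") %>%                  # choose how it is estimated
  fit(mpg ~ hp + wt, data = mtcars)     # train on built-in data

predict(lm_fit, new_data = head(mtcars))  # predictions for "new" rows
```

The same specify-engine-fit-predict pattern carries through everything we do in this module.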
---
# One example of ML . . .
<img src="img/gpt-3.png" width="62%" style="display: block; margin: auto;" />
https://beta.openai.com/playground
---
# How I came to ML
**A magical moment**
- Learned about supervised machine learning from a colleague
- Used these methods to discover patterns in students' embedded assessments, Rosenberg et al. (2021), *Journal of Science Education and Technology*, https://link.springer.com/article/10.1007/s10956-020-09862-4
**Using Twitter users' profile descriptions to predict whether or not they were teachers**
- Manually coded > 500 profiles for whether they were teachers or non-teachers and then trained a ML model to predict 1000s of teachers' roles as teachers or non-teachers (with an accuracy of 85%), [Rosenberg et al., *AERA Open* (2021)](https://journals.sagepub.com/doi/full/10.1177/23328584211024261)
---
class: clear, inverse, center, middle
# Overview of Machine Learning (ML)
---
# Defining ML
- *Artificial Intelligence (AI)* (e.g., [GPT-3](https://openai.com/api/))
: Simulating human intelligence through the use of computers
- *Machine learning (ML)*: A subset of AI focused on how computers acquire new information/knowledge
This definition leaves a lot of space for a range of approaches to ML
---
# Supervised & unsupervised
## Supervised ML
- Requires coded data or data with a known outcome
- Uses coded/outcome data to train an algorithm
- Uses that algorithm to **predict the codes/outcomes for new data** (data not used during the training)
- Can take the form of a *classification* (predicting a dichotomous or categorical outcome) or a *regression* (predicting a continuous outcome)
- Algorithms include:
- [Linear regression (really!)](https://web.stanford.edu/~hastie/ElemStatLearn/)
- Logistic regression
- Decision tree
- Support Vector Machine
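To make the *classification* case concrete, a base R sketch (illustrative only, using the built-in `mtcars` data): logistic regression predicting a dichotomous outcome from a single predictor.

```r
# Classification sketch: predict whether a car has a manual
# transmission (am = 1) from its weight
model <- glm(am ~ wt, data = mtcars, family = binomial)

# predicted probabilities for the training data
probs <- predict(model, type = "response")

# turn probabilities into dichotomous predictions
preds <- ifelse(probs > 0.5, 1, 0)

# simple accuracy: proportion of correct predictions
mean(preds == mtcars$am)
```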
---
# What kind of coded data?
> Want to detect spam? Get samples of spam messages. Want to forecast stocks? Find the price history. Want to find out user preferences? Parse their activities on Facebook (no, Mark, stop collecting it, enough!) (from [ML for Everyone](https://vas3k.com/blog/machine_learning/))
In educational research:
- Assessment data (e.g., [1](https://link.springer.com/article/10.1007/s10956-020-09895-9))
- Data from log files ("trace data") (e.g., [1](https://www.tandfonline.com/doi/full/10.1080/10508406.2013.837391?casa_token=-8Fm2KCFJ30AAAAA%3Altbc8Y8ci_z-uLJx4se9tgvru9mzm3yqCTFi12ndJ5uM6RDl5YJGG6_4KpUgIK5BYa_Ildeh2qogoQ))
- Open-ended written responses (e.g., [1](https://link.springer.com/article/10.1007/s10956-020-09889-7), [2](https://doi.org/10.1007/s11423-020-09761-w))
- Achievement data (such as end-of-course grades) (e.g., [1](https://link.springer.com/article/10.1007/s10956-020-09888-8), [2](https://search.proquest.com/docview/2343734516?pq-origsite=gscholar&fromopenview=true))
---
# How is this different from regression?
The _aim_ is different; the algorithms and methods of estimation are not (or differ in degree rather than in kind).
In a linear regression, our aim is to estimate parameters, such as `\(\beta_0\)` (intercept) and `\(\beta_1\)` (slope), and to make inferences about them that are not biased by our particular sample.
In an ML approach, we can use the same linear regression model, but with a goal other than making unbiased inferences about the `\(\beta\)` parameters:
<h4><center>In supervised ML, our goal is to minimize the difference between a known `\(y\)` and our predictions, `\(\hat{y}\)`</center></h4>
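One common way to quantify that difference is the root mean squared error, `\(\sqrt{\frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2}\)`. A base R sketch (again with the built-in `mtcars` data as a stand-in):

```r
# Same linear model as before, but evaluated on a predictive criterion
fit   <- lm(mpg ~ hp + wt, data = mtcars)
y     <- mtcars$mpg        # known outcome
y_hat <- predict(fit)      # our predictions

rmse <- sqrt(mean((y - y_hat)^2))  # root mean squared error
rmse
```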
---
# So, how is this really different?
This _predictive goal_ means that we can do things differently:
- Multicollinearity is not an issue because we do not care to make inferences about parameters
- Because interpreting specific parameters is less of an interest, we can use a great deal more predictors
- We focus on how accurately a _trained_ model can predict the values in _test_ data
- We can make our models very complex!
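The train/test distinction can be sketched in a few lines of base R (tidymodels provides this via `rsample::initial_split()`; here the split is done by hand on built-in data for illustration):

```r
# Train/test split sketch: fit on one portion, evaluate on the other
set.seed(123)
n         <- nrow(mtcars)
train_idx <- sample(n, size = round(0.75 * n))  # 75% for training

train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

fit   <- lm(mpg ~ hp + wt, data = train)        # train
y_hat <- predict(fit, newdata = test)           # predict held-out rows
sqrt(mean((test$mpg - y_hat)^2))                # test-set RMSE
```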
---
# Okay, _really_ complex
- Neural/deep networks
  - e.g., GPT-3 (175 B parameters), GPT-4 (>1 T parameters)
- And, some models can take a different form than familiar regressions:
- *k*-nearest neighbors
- Decision trees (and their extensions of bagged and random forests)
- Last, the modeling process can look different:
- Ensemble models that combine or improve on ("boosting") the predictions of individual models
---
# Supervised & unsupervised
## Unsupervised ML
- Does not require coded data; one way to think about unsupervised ML is that its purpose is to discover codes/labels
- Can be used in an _exploratory mode_ (see [Nelson, 2020](https://journals.sagepub.com/doi/full/10.1177/0049124118769114?casa_token=EV5XH31qbyAAAAAA%3AFg09JQ1XHOOzlxYT2SSJ06vZv0jG-s4Qfz8oDIQwh2jrZ-jrHNr7xZYL2FwnZtZiokhPalvV1RL2Bw))
- **Warning**: The results of unsupervised ML _cannot_ directly be used to provide codes/outcomes for supervised ML techniques
- Algorithms include:
- Cluster analysis and Latent Profile Analysis
- [Principal Components Analysis (really!)](https://web.stanford.edu/~hastie/ElemStatLearn/)
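A minimal sketch of the unsupervised case in base R: *k*-means finds groups without any coded outcome (here, the built-in `iris` measurements, ignoring the `Species` labels, purely as an illustration):

```r
# Unsupervised sketch: discover groups from patterns in the data alone
set.seed(123)
features <- scale(iris[, 1:4])        # standardize the four measurements
km <- kmeans(features, centers = 3)   # ask for three clusters

table(km$cluster)                     # observations per discovered cluster
```

Note that the algorithm never sees the species labels; any correspondence between clusters and species is discovered, not supervised.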
---
# What technique should I choose?
Do you have coded data or data with a known outcome -- let's say about K-12 students -- and do you want to:
- _Predict_ how other students with similar data (but without a known outcome) perform?
- _Scale_ coding that you have done for a sample of data to a larger sample?
- _Provide timely or instantaneous feedback_, like in many learning analytics systems?
**Supervised methods may be your best bet**
---
# What technique should I choose?
Do you not yet have codes/outcomes -- and do you want to?
- _Achieve a starting point_ for qualitative coding, perhaps in a ["computational grounded theory"](https://journals.sagepub.com/doi/full/10.1177/0049124117729703) mode?
- _Discover groups or patterns in your data_ that may be of interest?
- _Reduce the number of variables in your dataset_ to a smaller set that is perhaps nearly as explanatory/predictive?
**Unsupervised methods may be helpful**
---
# Examples of machine learning in STEM Ed Research
.panelset[
.panel[.panel-name[Example 1]
**Using digital log-trace data and supervised ML**
Gobert, J. D., Sao Pedro, M., Raziuddin, J., & Baker, R. S. (2013). From log files to assessment metrics: Measuring students' science inquiry skills using educational data mining. *Journal of the Learning Sciences, 22*(4), 521-563.
- Utilized *replay tagging* to code sequences of students' activity within digital *log-trace* data
- Then trained a supervised ML algorithm to **automate** the prediction of the presence or absence of students' engagement in inquiry
]
.panel[.panel-name[Example 2]
**Combining best practices in assessment with supervised ML**
Maestrales, S., Zhai, X., Touitou, I., Baker, Q., Schneider, B., & Krajcik, J. (2021). Using machine learning to score multi-dimensional assessments of chemistry and physics. *Journal of Science Education and Technology, 30*(2), 239-254.
- Carrying out a careful process of qualitative coding (and the establishment of validity and interrater reliability) of students' written constructed responses
- Training a supervised ML algorithm in advance of being able to **scale up** their coding to a larger number of responses
]
.panel[.panel-name[Example 3]
**Combining qualitative methods with unsupervised ML**
Sherin, B. (2013). A computational study of commonsense science: An exploration in the automated analysis of clinical interview data. *Journal of the Learning Sciences, 22*(4), 600-638.
- Used unsupervised machine learning methods to identify topics within students' interviews **from patterns in the data alone**
- Interpreted those topics in light of rigorous qualitative coding with the aim of boosting the validity of the use of both the machine learning and qualitative approaches
]
]
---
class: clear, inverse, center, middle
# Learning Labs!
---
# Learning Labs Overviews
.panelset[
.panel[.panel-name[Overview]
**What will I learn in this topic area?**
We'll work to answer these four questions:
- LL 1: Can we predict something we would have coded by hand?
- LL 2: How much do new predictors improve the prediction quality?
- LL 3: How much of a difference does a more complex model make?
- LL 4: What if we do not have training data?
]
.panel[.panel-name[LL 1]
**Machine Learning Learning Lab 1: Focusing on prediction**
We have some data, but we want to use a computer to **automate** or scale up the relationships between predictor (independent) and outcome (dependent) variables. Supervised machine learning is suited to this aim. In particular, in this learning lab, we explore how we can train a computer to learn and reproduce qualitative coding we carried out---though the principles extend to other types of variables.
]
.panel[.panel-name[LL 2]
**Machine Learning Learning Lab 2: Feature engineering**
Once we have trained a model using supervised machine learning methods, we may realize we can improve it by adding new variables or changing how the variables we are using were prepared. In this learning lab, we consider **feature engineering** with a variety of types of data from many online science courses to predict students' success after only a few weeks of the course.
]
.panel[.panel-name[LL 3]
**Machine Learning Learning Lab 3: Fine-tuning the model**
Even with optimal feature engineering, we may be able to specify an even more predictive model by selecting and *tuning* a more sophisticated algorithm. While in the first two learning labs we used logistic regression as the "engine" (or algorithm), in this learning lab, we use a random forest as the engine for our model. This more sophisticated type of model requires us to specify **tuning parameters**, specifications that determine how the model works.
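A hedged sketch of what specifying such a tunable model looks like in tidymodels (assuming the tidymodels and ranger packages are installed; `mtry` and `min_n` are the tuning parameters, left to be chosen later via resampling):

```r
# Random forest specification with tuning parameters left open
library(tidymodels)

rf_spec <- rand_forest(mtry = tune(), min_n = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_spec
```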
]
.panel[.panel-name[LL 4]
**Machine Learning Learning Lab 4: Finding groups (or codes) in data**
The previous three learning labs involved the use of data with known outcome variables (coded for the substantive or transactional nature of the conversations taking place through #NGSSchat in learning labs 1 and 3 and students' grades in learning lab 2). Accordingly, we explored different aspects of supervised machine learning. What if we have data without something that we can consider to be a dependent variable? **Unsupervised machine learning methods** can be used in such cases.
]
]
---
# Thanks
I hope to see you in the ML topic area learning labs!
Slides [here](https://laser-institute.github.io/machine-learning/introductory-presentation.html#1)
## .font130[.center[**Thank you!**]]
<br/>
.center[<img style="border-radius: 80%;" src="img/jr-cycling.jpeg" height="200px"/><br/>**Dr. Joshua Rosenberg**<br/><mailto:jmrosenberg@utk.edu>]
Joshua Rosenberg
jmrosenberg@utk.edu
@jrosenberg6432
Slides created via the R package [xaringan](https://github.com/yihui/xaringan)
Available:
</textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "default",
"highlightLines": true,
"highlightLanguage": "r",
"countIncrementalSlides": false,
"ratio": "16:9",
"slideNumberFormat": "<div class=\"progress-bar-container\">\n <div class=\"progress-bar\" style=\"width: calc(%current% / %total% * 100%);\">\n </div>\n</div>"
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
window.dispatchEvent(new Event('resize'));
});
(function(d) {
var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
if (!r) return;
s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
d.head.appendChild(s);
})(document);
(function(d) {
var el = d.getElementsByClassName("remark-slides-area");
if (!el) return;
var slide, slides = slideshow.getSlides(), els = el[0].children;
for (var i = 1; i < slides.length; i++) {
slide = slides[i];
if (slide.properties.continued === "true" || slide.properties.count === "false") {
els[i - 1].className += ' has-continuation';
}
}
var s = d.createElement("style");
s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
var deleted = false;
slideshow.on('beforeShowSlide', function(slide) {
if (deleted) return;
var sheets = document.styleSheets, node;
for (var i = 0; i < sheets.length; i++) {
node = sheets[i].ownerNode;
if (node.dataset["target"] !== "print-only") continue;
node.parentNode.removeChild(node);
}
deleted = true;
});
})();
// add `data-at-shortcutkeys` attribute to <body> to resolve conflicts with JAWS
// screen reader (see PR #262)
(function(d) {
let res = {};
d.querySelectorAll('.remark-help-content table tr').forEach(tr => {
const t = tr.querySelector('td:nth-child(2)').innerText;
tr.querySelectorAll('td:first-child .key').forEach(key => {
const k = key.innerText;
if (/^[a-z]$/.test(k)) res[k] = t; // must be a single letter (key)
});
});
d.body.setAttribute('data-at-shortcutkeys', JSON.stringify(res));
})(document);
(function() {
"use strict"
// Replace <script> tags in slides area to make them executable
var scripts = document.querySelectorAll(
'.remark-slides-area .remark-slide-container script'
);
if (!scripts.length) return;
for (var i = 0; i < scripts.length; i++) {
var s = document.createElement('script');
var code = document.createTextNode(scripts[i].textContent);
s.appendChild(code);
var scriptAttrs = scripts[i].attributes;
for (var j = 0; j < scriptAttrs.length; j++) {
s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
}
scripts[i].parentElement.replaceChild(s, scripts[i]);
}
})();
(function() {
var links = document.getElementsByTagName('a');
for (var i = 0; i < links.length; i++) {
if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
links[i].target = '_blank';
}
}
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
const hlines = d.querySelectorAll('.remark-code-line-highlighted');
const preParents = [];
const findPreParent = function(line, p = 0) {
if (p > 1) return null; // traverse up no further than grandparent
const el = line.parentElement;
return el.tagName === "PRE" ? el : findPreParent(el, ++p);
};
for (let line of hlines) {
let pre = findPreParent(line);
if (pre && !preParents.includes(pre)) preParents.push(pre);
}
preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>
<script>
slideshow._releaseMath = function(el) {
var i, text, code, codes = el.getElementsByTagName('code');
for (i = 0; i < codes.length;) {
code = codes[i];
if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
text = code.textContent;
if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
/^\$\$(.|\s)+\$\$$/.test(text) ||
/^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
code.outerHTML = code.innerHTML; // remove <code></code>
continue;
}
}
i++;
}
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
var script = document.createElement('script');
script.type = 'text/javascript';
script.src = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
if (location.protocol !== 'file:' && /^https?:/.test(script.src))
script.src = script.src.replace(/^https?:/, '');
document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
</body>
</html>