10 Chapter: Machine Learning

Overview

This unit covers core machine learning (ML) techniques. It explores supervised and unsupervised learning algorithms and best practices for applying machine learning. The curriculum in this unit represents a balance between technical rigor and practical applications with ML techniques that are most widely used today.

You’ll learn a number of machine learning techniques in this unit and apply them to mini-projects for practice. You’ll also apply some of them to your capstone project. Keep your data stories in mind while you build your predictive models. Working through this unit will also help you develop a better sense of the learning track you want to choose.

What You’ll Learn: Learning Objectives

  • Learn the basic machine learning algorithms such as:
  • **Supervised Learning:** linear and logistic regression, SVM, decision trees and random forests, Bayesian methods and text analysis
  • **Unsupervised Learning:** K-Means clustering
  • Practice applying these algorithms using scikit-learn, and other Python packages, as needed, through mini-projects
  • Explain the strengths and limitations of each machine learning algorithm
  • Identify the assumptions underlying each machine learning algorithm
  • Determine how to evaluate the performance of each learning algorithm
  • Demonstrate how to arrive at the right algorithm to use in different problem scenarios

Words to Know: Key Terms & Concepts

  • Supervised Learning: Algorithms that create a model of the world by looking at labeled examples
  • Unsupervised Learning: Algorithms that create a model of the world using examples without labels
  • Bayesian Analysis: Algorithms based on Bayes' Theorem, which make inferences about the world by combining domain knowledge or assumptions with observed evidence
  • Clustering: A family of unsupervised learning algorithms used to automatically find groups in datasets

Chapter 10.1 Linear and Logistic Regression

Imagine you have some initial data that’s labeled “True/False” or “Spam/Not Spam” and you want to extract “features” from the data that, when passed through a function, generate the labels as accurately as possible. Once you’ve learned this function, you can use it to automatically generate labels for data points that don’t have one. Algorithms that learn such functions are called classification algorithms, and you’ll study several of them in this unit.

1 Video: Bias and Regression - To get started with machine learning, you’ll learn about regression, a technique to predict unknown values when the values are real numbers. For example, you’d use regression to predict the amount of time a customer spends on a website, given data about the characteristics and behavior of past customers. In this module, you’ll study the simplest regression approach, linear regression, through a Harvard University course. View the presentation slides here.

Please pay close attention both to the different types of bias that can arise, which are discussed during the first 15 minutes of the talk, and to the derivation of the linear model, which starts at 50:00.
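
To experiment alongside the lecture, here's a minimal sketch of fitting a linear model with scikit-learn. The data is synthetic, standing in for the customer example above, and the feature interpretations in the comments are invented for illustration:

```python
# A minimal linear regression sketch. The synthetic features stand in for
# customer attributes (e.g. age, visits, pages viewed), and the true
# coefficients are known, so we can check what the model recovers.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 4.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # learns intercept + coefficients
print(model.intercept_, model.coef_)               # should be near 4.0 and [2, -1, 0.5]
print(model.score(X_test, y_test))                 # R^2 on held-out data
```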

Students typically spend 2 - 3 Hours

2 Video: Regression (continued) - We finish up with linear regression and start exploring logistic regression, one of the simplest approaches to classification. For example, given data about the characteristics and behavior of past customers, you might use classification to predict whether a customer will actually make a purchase, a binary outcome instead of a real number. This Harvard University lecture covers concepts that are critical both to your understanding of machine learning and to job interviews. Please pay close attention to concepts such as collinearity (15:00), odds ratios (25:00), the Curse of Dimensionality (40:00), and Lasso vs. Ridge regularization (1:00:00). View the presentation slides here.

Questions related to these topics come up frequently in interviews, so make sure to understand them well and discuss with your mentor if you have further questions.
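
To see what Lasso vs. Ridge regularization actually does, here's a hedged sketch using scikit-learn's LogisticRegression on synthetic data; the regularization strength C=0.1 is illustrative, not tuned:

```python
# L1 (Lasso-style) vs. L2 (Ridge-style) regularization in logistic regression.
# L1 drives many coefficients exactly to zero (sparse feature selection);
# L2 shrinks them toward zero without eliminating them.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=4, random_state=0)
X = StandardScaler().fit_transform(X)   # penalties assume comparable feature scales

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", C=0.1).fit(X, y)

print("L1 zeroed coefficients:", int(np.sum(l1.coef_ == 0)))
print("L2 zeroed coefficients:", int(np.sum(l2.coef_ == 0)))
```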

Students typically spend 2 - 3 Hours

3 Video: Classification, kNN, Cross-validation, Dimensionality Reduction - How do you know how well your model does on data it hasn’t seen before? After all, a model needs to generalize beyond the examples it’s already been shown. In this module, we’ll cover some important techniques to estimate the generalization capability of a model and the metrics used to evaluate it. We’ll also cover dimensionality reduction, an important technique for creating simpler models and for visualizing complex, high-dimensional data.

In this lecture, please pay close attention to the concepts of validation (35:00) and dimensionality reduction (55:00). Both concepts are frequently assessed in interviews, so make sure you understand why they’re useful. View the presentation slides here.
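
As a quick companion to the lecture, here's a minimal sketch of cross-validation with a kNN classifier, plus a one-line PCA dimensionality reduction; the dataset (iris) and the values of k are chosen only for illustration:

```python
# Estimate generalization with 5-fold cross-validation for several k.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for k in (1, 5, 15):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)   # accuracy on each held-out fold
    print(f"k={k}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")

# Dimensionality reduction: project the 4-D iris features onto the two
# directions of greatest variance (handy for visualizing high-D data).
X_2d = PCA(n_components=2).fit_transform(X)
print(X_2d.shape)   # (150, 2)
```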

4 Interactive Exercises: Supervised Learning with Scikit-Learn - In this DataCamp resource, you'll learn how to use Python to perform supervised learning, an essential component of machine learning. Using real datasets, you’ll learn how to build predictive models, tune their parameters, and tell how well they’ll perform on unseen data. Along the way, you’ll become familiar with scikit-learn, one of the most popular and user-friendly machine learning libraries for Python.

Students typically spend 4 - 6 Hours

Boston Housing Mini Project Help:


Chapter 10.2 SVM and Trees

The algorithms that we’ve studied so far are only the simplest ones in machine learning. These algorithms assume that the dataset they work on has a relatively straightforward structure. For example, both linear and logistic regression assume that the data can mostly be described by a straight line. But what if that’s not true?

  • Video: SVM and Evaluation - There are many advanced algorithms that handle more complex datasets, and Support Vector Machines (SVM) are among the most popular. In this Harvard University lecture, you'll learn about SVMs and the fundamentals of model evaluation. You can skip the first five minutes of the lecture; pay close attention to kernel functions (37:00) and the kernel trick. At the end of the lecture, starting at 1:05:00, you’ll learn about error measures, such as true and false positive rates, ROC curves, precision, and recall. These concepts are critical for both your work as a data scientist and your job interviews. According to hiring managers, many candidates underrate the importance of these ideas and stumble over them in interviews. Make sure you’re a cut above the rest by mastering these concepts! View the presentation slides here.

  • Video: Decision Trees - Tree-based algorithms (e.g. decision trees and random forests) are some of the most popular and effective classification and regression algorithms, especially for complex datasets. In this Harvard University lecture, you’ll learn about decision trees (5:00) and get an introduction to ensemble methods, starting with bagging (1:00:00). By the end of the lecture, you’ll be able to summarize how decision trees work, an important question that comes up during job interviews. View the presentation slides here.

  • Video: Using Random Forests in Python - This talk covers the internals of how random forests are implemented in scikit-learn, applications that are well-suited to the use of random forests, and Python code samples to demonstrate their use.

  • Video: Ensemble Methods - How do you select which learning algorithm to use? It’s often said that people like having their cake and eating it, too. The machine learning equivalent of that proverb is ensemble methods: learning algorithms that construct a set of classifiers and classify new data points by combining the individual predictions. Ensembles tend to be more accurate and robust than single classifiers. This Harvard University lecture starts with random forests (8:00), which are essentially bagging applied to decision trees. The talk then covers a different approach to ensemble learning called boosting (25:00), specifically a technique called AdaBoost (30:00). View the slides here. A short code sketch comparing these approaches follows this list.

  • Article: Gradient Boosting from Scratch - Gradient Boosting is a relatively new ensemble algorithm that has performed very well in many Kaggle competitions. This article is a great, easy-to-follow summary of how the technique works.
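
To tie these lectures together, here's a minimal sketch, on synthetic data with illustrative (untuned) hyperparameters, comparing a single decision tree, a kernel SVM, and two ensembles:

```python
# Compare a single decision tree, an RBF-kernel SVM, a random forest
# (bagging), and AdaBoost (boosting) with 5-fold cross-validation.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

models = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "random forest (bagging)": RandomForestClassifier(n_estimators=200, random_state=0),
    "AdaBoost (boosting)": AdaBoostClassifier(n_estimators=200, random_state=0),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean cross-validated accuracy {scores.mean():.3f}")
```

On data like this, the ensembles tend to outperform the single tree, which is exactly the intuition the lectures build.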


Chapter 10.3 Bayesian Methods and Text Data

Bayesian methods are a powerful suite of techniques that are gaining traction in the world of data science. Unlike most other classification methods, which are discriminative (i.e. they give you a classification boundary), Bayesian methods are generative (i.e. they give you a model to generate the data, allowing you to infer statistical properties of the data). In practice, Bayesian methods are often used in text analysis and spam/fraud detection.
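
As a concrete (if toy) illustration of Bayesian methods applied to text, here's a minimal Naive Bayes spam filter sketched with scikit-learn; the four-document corpus is invented purely for demonstration:

```python
# A tiny Naive Bayes spam filter: CountVectorizer turns text into word
# counts, and MultinomialNB applies Bayes' theorem to the per-class
# word frequencies it learns from the labeled examples.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "win a free prize now", "limited offer, claim your reward",
    "meeting rescheduled to friday", "please review the attached report",
]
labels = ["spam", "spam", "not spam", "not spam"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["claim your free reward"]))   # likely ['spam']
```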


Chapter 10.4 Best Practices

At this point, we’ve learned different techniques and methods, both for performing supervised learning and for evaluating machine learning models. How do we put them all together? What are some common tips and tricks for building well-designed and effective models?
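
One common pattern, sketched here under the assumption that scikit-learn is your toolkit, is to chain preprocessing and a model into a Pipeline, then tune hyperparameters with cross-validated grid search; the parameter grid below is illustrative:

```python
# Pipeline + GridSearchCV: preprocessing and the model are tuned together,
# and the search only ever sees the training data, avoiding leakage.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([("scale", StandardScaler()), ("svm", SVC())])
param_grid = {"svm__C": [0.1, 1, 10], "svm__gamma": ["scale", 0.01, 0.1]}

grid = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)
print(grid.best_params_)
print(grid.score(X_test, y_test))   # final check on truly held-out data
```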

Some of the concepts in the following sections require a basic understanding of linear algebra. If you’d like a refresher, here's a quick summary: Linear Algebra refresher.


Chapter 10.5 Introduction to Unsupervised Learning

What do you do if you have data but no labels, and you want to find some structure in the data by defining your own classes? It’s time for unsupervised learning!

Linear and logistic regression are both examples of “supervised” learning algorithms (i.e. they create a model of the world by looking at labeled examples). In contrast, clustering is an “unsupervised” algorithm (i.e. it does not need labeled examples). Clustering is used to automatically find groups in datasets. For example, given a dataset of customer characteristics, like age, gender, or education, we might use clustering to automatically discover interesting customer segments.

This Harvard University lecture is rich in concepts that are fundamental to learning data science and acing your interviews. Begin watching at 9:55, and pay close attention to the definition of unsupervised learning (10:00), the K-Means algorithm (17:00), the “Elbow method” for evaluating K-Means (31:00), and the Hierarchical Clustering algorithm (50:00). View the presentation slides here.
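
Here's a minimal sketch of K-Means plus the "Elbow method" with scikit-learn; the blob data is synthetic, so the true number of clusters (4) is known in advance:

```python
# Fit K-Means for a range of k and plot inertia (within-cluster sum of
# squares); the "elbow" in the curve suggests a natural number of clusters.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)  # labels unused

ks = range(1, 10)
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
            for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()   # expect a visible elbow near k=4
```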


Chapter 10.6 Storytelling Presentation
