Chapter 10: Machine Learning
This unit covers core machine learning (ML) techniques. It explores supervised and unsupervised learning algorithms and best practices for applying machine learning. The curriculum in this unit represents a balance between technical rigor and practical applications with ML techniques that are most widely used today.
You’ll learn a number of machine learning techniques in this unit and apply them to mini-projects for practice. You’ll also apply some of them to your capstone project. Keep your data stories in mind while you build your predictive models. Working through this unit will also help you develop a better sense of the learning track you want to choose.
- Learn the basic machine learning algorithms such as:
- **Supervised Learning:** linear and logistic regression, SVM, decision trees and random forests, Bayesian methods, and text analysis
- **Unsupervised Learning:** K-Means clustering
- Practice applying these algorithms through mini-projects, using scikit-learn and other Python packages as needed
- Explain the strengths and limitations of each machine learning algorithm
- Identify the assumptions underlying each machine learning algorithm
- Determine how to evaluate the performance of each learning algorithm
- Demonstrate how to arrive at the right algorithm to use in different problem scenarios
- Supervised Learning: Algorithms that create a model of the world by looking at labeled examples
- Unsupervised Learning: Algorithms that create a model of the world using examples without labels
- Bayesian Analysis: Algorithms based on Bayes Theorem, which makes inferences about the world by combining domain knowledge or assumptions and observed evidence
- Clustering: A family of unsupervised learning algorithms used to automatically find groups in datasets
Imagine you have some initial data labeled “True/False” or “Spam/Not Spam,” and you want to extract “features” from the data that, when passed through a function, generate the labels as accurately as possible. A classification algorithm learns this function from the labeled examples, so it can then automatically assign labels to new data points that don’t have one. In this unit, you’ll learn several classification algorithms.
1 Video: Bias and Regression
To get started with machine learning, you’ll learn about regression, a technique to predict unknown values when those values are real numbers. For example, you’d use regression to predict the amount of time a customer spends on a website, given data about the characteristics and behavior of past customers. In this module, you’ll study the simplest regression approach, linear regression, through a Harvard University course. View the presentation slides here.
Please pay close attention both to the different types of bias that can arise, which are discussed during the first 15 minutes of the talk, and to the derivation of the linear model, which starts at 50:00.
Students typically spend 2 - 3 Hours
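Before moving on, it can help to see linear regression in code. The sketch below fits a line to hypothetical data (the variable names and the "minutes on site" scenario are made up for illustration) and recovers the slope and intercept it was generated with:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: minutes spent on a site as a function of pages viewed
rng = np.random.default_rng(0)
pages_viewed = rng.uniform(1, 20, size=(100, 1))
minutes_on_site = 2.5 * pages_viewed[:, 0] + 3.0 + rng.normal(0, 1.0, size=100)

# Fit y = slope * x + intercept by ordinary least squares
model = LinearRegression()
model.fit(pages_viewed, minutes_on_site)

print(f"slope:     {model.coef_[0]:.2f}")    # close to the true slope of 2.5
print(f"intercept: {model.intercept_:.2f}")  # close to the true intercept of 3.0
```

Because the noise is small relative to the signal, the estimated coefficients land very close to the true values used to generate the data.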
2 Video: Regression (continued)
We finish up with linear regression and start exploring logistic regression, one of the simplest approaches to classification. For example, given data about the characteristics and behavior of past customers, you might use classification to predict whether a customer will actually make a purchase, a binary outcome instead of a real number. This Harvard University lecture covers concepts that are critical to your understanding of machine learning and come up often in job interviews. Please pay close attention to concepts such as collinearity (15:00), odds ratios (25:00), the curse of dimensionality (40:00), and Lasso vs. Ridge regularization (1:00:00). View the presentation slides here.
Questions related to these topics come up frequently in interviews, so make sure to understand them well and discuss with your mentor if you have further questions.
Students typically spend 2 - 3 Hours
- https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=c322c0d5-9cf9-4deb-b59f-d6741064ba8a
- https://github.com/cs109/2015/blob/master/Lectures/08-RegressionContinued.pdf
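To make the Lasso vs. Ridge distinction from the lecture concrete, here is a small sketch on synthetic data (the feature counts and alpha values are arbitrary choices for illustration): Ridge shrinks all coefficients toward zero, while Lasso can drive irrelevant coefficients exactly to zero, yielding a sparse model.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

# Hypothetical data: 10 features, but only the first 2 actually matter
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(0, 0.5, size=200)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)   # L1 penalty: zeroes out weak ones

print("ridge nonzero coefficients:", int(np.sum(ridge.coef_ != 0)))
print("lasso nonzero coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-8)))
```

Running this, Ridge keeps all 10 coefficients (small but nonzero), while Lasso typically retains only the informative features; this built-in feature selection is why Lasso is popular for high-dimensional data.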
3 Video: Classification, kNN, Cross-validation, Dimensionality Reduction
How do you know how well your model does on data it hasn’t seen before? After all, a model needs to generalize beyond the examples it’s already been shown. In this unit, we’ll cover some important techniques to estimate the generalization capability of a model, and the metrics used to evaluate a model. We’ll also cover dimensionality reduction, an important technique to create simpler models, and we’ll visualize more complex models with high-dimensional data.
In this lecture, please pay close attention to the concepts of validation (35:00) and dimensionality reduction (55:00). Both concepts are frequently assessed in interviews, so make sure you understand why they’re useful. View the presentation slides here.
- https://matterhorn.dce.harvard.edu/engage/player/watch.html?id=c322c0d5-9cf9-4deb-b59f-d6741064ba8a
- https://github.com/cs109/2015/blob/master/Lectures/09-ClassificationPCA.pdf
Students typically spend 2 - 3 Hours
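The ideas from this lecture combine naturally in a few lines of scikit-learn: a minimal sketch (the 2-component PCA and k=5 are arbitrary illustrative choices) that chains dimensionality reduction with a kNN classifier and uses 5-fold cross-validation to estimate generalization on the classic iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# PCA reduces the 4 features to 2 before the kNN classifier sees them;
# cross_val_score trains and evaluates on 5 different train/validation splits
model = make_pipeline(PCA(n_components=2), KNeighborsClassifier(n_neighbors=5))
scores = cross_val_score(model, X, y, cv=5)

print(f"mean cross-validated accuracy: {scores.mean():.3f}")
```

The mean of the five fold scores is a far more honest estimate of performance on unseen data than accuracy on the training set itself.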
4 Interactive Exercises: Supervised Learning with Scikit-Learn
In this DataCamp resource, you'll learn how to use Python to perform supervised learning, an essential component of machine learning. Using real datasets, you’ll learn how to build predictive models, tune their parameters, and tell how well they’ll perform on unseen data. In this module, you’ll become familiar with scikit-learn, one of the most popular and user-friendly machine learning libraries for Python.
Students typically spend 4 - 6 Hours
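A common workflow you'll practice in these exercises is tuning a model's hyperparameters with cross-validation and then checking it on held-out data. A minimal sketch (the dataset and candidate neighbor counts are illustrative choices, not part of the DataCamp material):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grid search tries each candidate n_neighbors with 5-fold cross-validation
# on the training set only, so the test set stays untouched until the end
grid = GridSearchCV(KNeighborsClassifier(), {"n_neighbors": [3, 5, 7, 9]}, cv=5)
grid.fit(X_train, y_train)

print("best n_neighbors:", grid.best_params_["n_neighbors"])
print(f"accuracy on held-out test set: {grid.score(X_test, y_test):.3f}")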
Boston Housing Mini Project Help:
- log transformation
- more log transform
- What Is the F-test of Overall Significance in Regression Analysis?
- Is it OK to remove the intercept in a regression model?
- How is y normally distributed?
- Statistical forecasting: notes on regression and time series analysis (Duke)
- There is an sklearn metrics module
- Why use a t-test in a linear regression model?
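On the log-transformation links above: a log transform is useful when a target variable (such as house prices) is right-skewed, because pulling in the long tail gives a roughly symmetric distribution that better matches linear regression's assumptions. A small sketch on synthetic log-normal data (the parameters are arbitrary and chosen only to mimic a skewed price distribution):

```python
import numpy as np

# Hypothetical right-skewed values (e.g. house prices): log-normal draws
rng = np.random.default_rng(1)
prices = np.exp(rng.normal(loc=12.0, scale=0.5, size=10_000))

def skewness(x):
    # Sample skewness: E[(x - mean)^3] / std^3; positive = long right tail
    return np.mean((x - x.mean()) ** 3) / x.std() ** 3

print(f"skew before log transform: {skewness(prices):.2f}")
print(f"skew after  log transform: {skewness(np.log(prices)):.2f}")
```

The raw values are strongly right-skewed, while their logs are approximately normal with skewness near zero, which is exactly why a log transform is a standard first move for price-like targets.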
The algorithms that we’ve studied so far are only the simplest ones in machine learning. The algorithms assume that the dataset they work on has a relatively straightforward structure. For example, both linear and logistic regression assume that data is mostly described by drawing a straight line. But, what if that’s not true?
- Video: SVM and Evaluation - There are many advanced algorithms that handle more complex datasets, and Support Vector Machines (SVMs) are among the most popular. In this Harvard University lecture, you'll learn about SVMs and the fundamentals of model evaluation. Skip the first five minutes of the lecture, then pay close attention to kernel functions (37:00) and the kernel trick. Also, at the end of the lecture, starting at 1:05, you’ll learn about error measures, such as true and false positive rates, ROC curves, precision, and recall. These concepts are critical for both your work as a data scientist and your job interviews. According to hiring managers, many candidates underrate the importance of these ideas and stumble over them in interviews. Make sure you’re a cut above the rest by mastering these concepts! View the presentation slides here.
- Video: Decision Trees - Tree-based algorithms (e.g. decision trees and random forests) are some of the most popular and effective classification and regression algorithms, especially for complex datasets. In this Harvard University lecture, you’ll learn about decision trees (5:00) and understand ensemble methods, specifically starting with bagging (1:00). By the end of the lecture, you’ll be able to summarize how decision trees work, which is an important question that comes up during job interviews. View the presentation slides here.
- Video: Using Random Forests in Python - This talk covers the internals of how random forests are implemented in scikit-learn, applications that are well-suited to random forests, and Python code samples demonstrating their use.
- Video: Ensemble Methods - How do you select which learning algorithm to use? It’s often said that people like having their cake and eating it, too. The machine learning equivalent of that proverb is ensemble methods: algorithms that construct a set of classifiers and classify new data points by combining their predictions. Ensembles tend to be more accurate and robust than single classifiers. This Harvard University lecture starts with random forests (8:00), which are essentially bagging applied to decision trees. The talk then covers a different approach to ensemble learning called boosting (25:00), specifically a technique called AdaBoost (30:00). View the slides here.
- Article: Gradient Boosting from Scratch - Gradient boosting is a relatively new ensemble algorithm that has done very well in many Kaggle competitions. This article is a great, easy-to-follow summary of how the technique works.
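The algorithms and evaluation metrics covered in the videos above can be compared side by side in a few lines of scikit-learn. The sketch below trains an SVM, a random forest, and a gradient boosting model on a synthetic classification task (the dataset and model settings are illustrative, not from the lectures) and reports the precision, recall, and ROC AUC metrics emphasized for interviews:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hypothetical binary classification task
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for model in (SVC(kernel="rbf", probability=True),
              RandomForestClassifier(n_estimators=100, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)                 # hard class labels
    prob = model.predict_proba(X_test)[:, 1]     # scores for the ROC curve
    print(f"{type(model).__name__}: "
          f"precision={precision_score(y_test, pred):.2f} "
          f"recall={recall_score(y_test, pred):.2f} "
          f"ROC AUC={roc_auc_score(y_test, prob):.2f}")
```

Note that ROC AUC is computed from the predicted probabilities rather than the hard labels, since the ROC curve sweeps over all classification thresholds.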
Bayesian methods are a powerful suite of techniques that are gaining traction in the world of data science. Unlike most other classification methods, which are discriminative (i.e., they give you a classification boundary), Bayesian methods are generative (i.e., they give you a model of how the data is generated, allowing you to infer statistical properties of the data). In practice, Bayesian methods are often used in text analysis and spam/fraud detection.
- Video: Bayes Theorem and Bayesian Methods - In this lecture, you’ll learn about Bayes Theorem and algorithms like Naive Bayes in the context of text analysis. Overall, Bayesian methods are a powerful set of machine learning tools with many applications. As you watch this lecture, please pay close attention to the introduction of Naive Bayes (30:00) and the independence assumptions that it makes. View the presentation slides here.
- Video: Sentiment Classification Using Scikit-Learn - Ryan Rosario, a Springboard mentor and Facebook Data Scientist, demonstrates practical text analysis and machine learning.
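A minimal sketch of Naive Bayes text classification in scikit-learn (the toy corpus below is entirely made up for illustration): the bag-of-words counts feed a multinomial Naive Bayes model, which assumes word occurrences are independent given the class — exactly the independence assumption the lecture highlights.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus of labeled reviews
texts = ["great movie loved it", "wonderful acting great plot",
         "terrible movie hated it", "awful plot terrible acting",
         "loved the wonderful acting", "hated the awful movie"]
labels = [1, 1, 0, 0, 1, 0]  # 1 = positive, 0 = negative

# Turn each document into word counts, then fit Naive Bayes on the counts
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts)
clf = MultinomialNB().fit(X, labels)

print(clf.predict(vectorizer.transform(["loved the great plot"])))  # [1]
print(clf.predict(vectorizer.transform(["terrible awful movie"])))  # [0]
```

Despite the strong (and clearly false) independence assumption, Naive Bayes is fast, needs little data, and remains a very competitive baseline for spam filtering and sentiment analysis.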
At this point, we’ve learned different techniques and methods, both for performing supervised learning as well as for evaluating machine learning models. How do we put them all together? What are some common tips and tricks for well-designed and effective models?
Some of the concepts in the following sections require a basic understanding of linear algebra. If you’d like a refresher, here's a quick summary: Linear Algebra refresher.
- Video: Best Practices in Supervised Learning - This video explores the best practices in supervised learning, an important skill to have not only on the job but also in interviews. [Slide deck here](https://github.com/cs109/2015/raw/master/Lectures/13-BestPractices_Recommendations.pdf)
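One best practice worth internalizing now: keep every preprocessing step inside a scikit-learn Pipeline so that nothing is fit on validation data. A minimal sketch (the dataset and SVC model are illustrative choices) showing why this prevents data leakage during cross-validation:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# Because the scaler lives inside the pipeline, it is re-fit on each
# training fold only; the validation fold never influences the scaling
pipe = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipe, X, y, cv=5)

print(f"mean cross-validated accuracy: {scores.mean():.3f}")
```

Scaling the full dataset before splitting, by contrast, quietly leaks information from the validation folds into training and inflates the reported score.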
What do you do if you have data but no labels, and you want to find some structure in the data by defining your own classes? It’s time for unsupervised learning!
Linear and logistic regression are both examples of “supervised” learning algorithms (i.e. they create a model of the world by looking at labeled examples). In contrast, clustering is an “unsupervised” algorithm (i.e. it does not need labeled examples). Clustering is used to automatically find groups in datasets. For example, given a data set about customer characteristics, like age, gender, or education, we might use clustering to automatically discover interesting customer segments.
This Harvard University lecture is rich in concepts that are fundamental to learning data science and acing your interviews. Begin watching at 9:55, and pay close attention to the definition of unsupervised learning (10:00), the K-Means algorithm (17:00), the “elbow method” for evaluating K-Means (31:00), and the hierarchical clustering algorithm (50:00). View the presentation slides here.
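The K-Means and elbow-method ideas above fit in a short sketch (the blob data stands in for hypothetical customer segments and is generated for illustration): inertia, the within-cluster sum of squared distances, drops sharply until k reaches the true number of groups and then flattens out — the "elbow".

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical customer data with 3 natural groups
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=0)

# Elbow method: plot (or print) inertia for increasing k and look for the
# point where adding more clusters stops paying off
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(f"k={k}: inertia={km.inertia_:.0f}")
```

Running this, the inertia falls steeply from k=1 to k=3 and only marginally afterward, which is the elbow signal for picking k=3. Remember that K-Means always returns k clusters whether or not they are meaningful, so this kind of evaluation is essential.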
- https://scikit-learn.org/stable/tutorial/basic/tutorial.html
- https://github.com/ogrisel/sklearn_pycon2014
- https://www.youtube.com/watch?v=HjAB45qsx_c