From 486e2c549d8abf7eeb4968fd5de7b9d33b1f4796 Mon Sep 17 00:00:00 2001 From: Sanders Kleinfeld Date: Tue, 13 Aug 2024 19:35:12 -0400 Subject: [PATCH] Added new MLCC exercises --- .../binary_classification_rice.ipynb | 1024 +++++++++ ml/cc/exercises/fairness_income.ipynb | 944 ++++++++ ml/cc/exercises/linear_regression_taxi.ipynb | 1012 +++++++++ .../exercises/numerical_data_bad_values.ipynb | 1987 +++++++++++++++++ ml/cc/exercises/numerical_data_stats.ipynb | 177 ++ 5 files changed, 5144 insertions(+) create mode 100644 ml/cc/exercises/binary_classification_rice.ipynb create mode 100644 ml/cc/exercises/fairness_income.ipynb create mode 100644 ml/cc/exercises/linear_regression_taxi.ipynb create mode 100644 ml/cc/exercises/numerical_data_bad_values.ipynb create mode 100644 ml/cc/exercises/numerical_data_stats.ipynb diff --git a/ml/cc/exercises/binary_classification_rice.ipynb b/ml/cc/exercises/binary_classification_rice.ipynb new file mode 100644 index 0000000..032d518 --- /dev/null +++ b/ml/cc/exercises/binary_classification_rice.ipynb @@ -0,0 +1,1024 @@ +{ + "cells": [ + { + "cell_type": "code", + "source": [ + "#@title Copyright 2023 Google LLC. Double-click for license information.\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." 
+ ], + "metadata": { + "id": "kYmgbnGytC9h", + "cellView": "form" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CeNGK50ZP5pR" + }, + "source": [ + "# Colabs\n", + "\n", + "Machine Learning Crash Course uses Colaboratories (Colabs) for all programming exercises. Colab is Google's implementation of [Jupyter Notebook](https://jupyter.org/). For more information about Colabs and how to use them, go to [Welcome to Colaboratory](https://research.google.com/colaboratory).\n", + "\n", + "# Binary classification\n", + "\n", + "In this Colab, you'll examine a dataset containing measurements derived from images of two species of Turkish rice, create a binary classifier to sort grains of rice into the two species, and evaluate the performance of that model.\n", + "\n", + "## Learning objectives\n", + "\n", + "By completing this Colab, you'll learn:\n", + "- How to train a binary classifier.\n", + "- How to calculate metrics for a binary classifier at different thresholds.\n", + "- How to compare AUC and ROC of two different models.\n", + "\n", + "## Dataset\n", + "\n", + "This Colab uses the Cinar and Koklu 2019 Osmancik and Cammeo rice dataset.\n", + "\n", + "The dataset is provided under a CC0 license (see [Kaggle](https://www.kaggle.com/datasets/muratkokludataset/rice-dataset-commeo-and-osmancik) for more documentation; lengths and areas are given in pixels). Cinar and Koklu also provide datasets for multiclass rice classification (5 species), pistachios, raisins, grape leaves, and so on, at their [repository](https://www.muratkoklu.com/datasets/).\n", + "\n", + "### Citation\n", + "\n", + "Cinar, I. and Koklu, M., (2019). 
“Classification of Rice Varieties Using Artificial Intelligence Methods.” *International Journal of Intelligent Systems and Applications in Engineering*, 7(3), 188-194.\n", + "\n", + "DOI: https://doi.org/10.18201/ijisae.2019355381\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D4JwY1X5iryL" + }, + "source": [ + "# Load Imports" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "irD_75BI97__" + }, + "outputs": [], + "source": [ + "# @title Load the imports\n", + "\n", + "import io\n", + "import keras\n", + "from matplotlib import pyplot as plt\n", + "from matplotlib.lines import Line2D\n", + "import numpy as np\n", + "import pandas as pd\n", + "import plotly.express as px\n", + "\n", + "# The following lines adjust the granularity of reporting.\n", + "pd.options.display.max_rows = 10\n", + "pd.options.display.float_format = \"{:.1f}\".format\n", + "\n", + "print(\"Ran the import statements.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "aDobhxERWPD1" + }, + "outputs": [], + "source": [ + "# @title Load the dataset\n", + "rice_dataset_raw = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/Rice_Cammeo_Osmancik.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1IqvqOvaQqlK" + }, + "source": [ + "Once the dataset has been loaded via the cell above, select specific columns to show summary statistics of the numerical features in the dataset.\n", + "\n", + "See the Kaggle [dataset documentation](https://www.kaggle.com/datasets/muratkokludataset/rice-dataset-commeo-and-osmancik), especially the **Provenance** section, for explanations of what each feature means and how they were calculated." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "XKakOMCmHp-E" + }, + "outputs": [], + "source": [ + "# @title\n", + "# Read and provide statistics on the dataset.\n", + "rice_dataset = rice_dataset_raw[[\n", + " 'Area',\n", + " 'Perimeter',\n", + " 'Major_Axis_Length',\n", + " 'Minor_Axis_Length',\n", + " 'Eccentricity',\n", + " 'Convex_Area',\n", + " 'Extent',\n", + " 'Class',\n", + "]]\n", + "\n", + "rice_dataset.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ynv9WwTPG9oU" + }, + "source": [ + "## Task 1: Describe the data\n", + "\n", + "From the summary statistics above, answer the following questions:\n", + "- What are the min and max lengths (major axis length, given in pixels) of the rice grains?\n", + "- What is the range of areas between the smallest and largest rice grains?\n", + "- How many standard deviations (`std`) is the largest rice grain's perimeter from the mean?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "y36kQm1vJ7n3" + }, + "outputs": [], + "source": [ + "# @title Solutions (run the cell to get the answers)\n", + "\n", + "print(\n", + " f'The shortest grain is {rice_dataset.Major_Axis_Length.min():.1f}px long,'\n", + " f' while the longest is {rice_dataset.Major_Axis_Length.max():.1f}px.'\n", + ")\n", + "print(\n", + " f'The smallest rice grain has an area of {rice_dataset.Area.min()}px, while'\n", + " f' the largest has an area of {rice_dataset.Area.max()}px.'\n", + ")\n", + "print(\n", + " 'The largest rice grain, with a perimeter of'\n", + " f' {rice_dataset.Perimeter.max():.1f}px, is'\n", + " f' ~{(rice_dataset.Perimeter.max() - rice_dataset.Perimeter.mean())/rice_dataset.Perimeter.std():.1f} standard'\n", + " f' deviations ({rice_dataset.Perimeter.std():.1f}) from the mean'\n", + " f' ({rice_dataset.Perimeter.mean():.1f}px).'\n", + ")\n", + "print(\n", + " f'This is calculated as: 
({rice_dataset.Perimeter.max():.1f} -'\n", + " f' {rice_dataset.Perimeter.mean():.1f})/{rice_dataset.Perimeter.std():.1f} ='\n", + " f' {(rice_dataset.Perimeter.max() - rice_dataset.Perimeter.mean())/rice_dataset.Perimeter.std():.1f}'\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Bc8JVgN2j6h3" + }, + "source": [ + "# Explore the dataset\n", + "\n", + "Plot some of the features against each other, including in 3D.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QNw5U7-4NFLR" + }, + "outputs": [], + "source": [ + "# Create five 2D plots of the features against each other, color-coded by class.\n", + "for x_axis_data, y_axis_data in [\n", + " ('Area', 'Eccentricity'),\n", + " ('Convex_Area', 'Perimeter'),\n", + " ('Major_Axis_Length', 'Minor_Axis_Length'),\n", + " ('Perimeter', 'Extent'),\n", + " ('Eccentricity', 'Major_Axis_Length'),\n", + "]:\n", + " px.scatter(rice_dataset, x=x_axis_data, y=y_axis_data, color='Class').show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "G6xJ0HQxLB4N" + }, + "source": [ + "## Task 2: Visualize samples in 3D\n", + "\n", + "Try graphing three of the features in 3D against each other." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "qvpUsZF1LDWM" + }, + "outputs": [], + "source": [ + "#@title Plot three features in 3D by entering their names and running this cell\n", + "\n", + "x_axis_data = 'Enter a feature name here' # @param {type: \"string\"}\n", + "y_axis_data = 'Enter a feature name here' # @param {type: \"string\"}\n", + "z_axis_data = 'Enter a feature name here' # @param {type: \"string\"}\n", + "\n", + "px.scatter_3d(\n", + " rice_dataset,\n", + " x=x_axis_data,\n", + " y=y_axis_data,\n", + " z=z_axis_data,\n", + " color='Class',\n", + ").show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "r5WBGiJChXt-" + }, + "outputs": [], + "source": [ + "# @title One possible solution\n", + "\n", + "# Plot eccentricity, area, and major axis length, with observations\n", + "# color-coded by class.\n", + "px.scatter_3d(\n", + " rice_dataset,\n", + " x='Eccentricity',\n", + " y='Area',\n", + " z='Major_Axis_Length',\n", + " color='Class',\n", + ").show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ch82395CSBMR" + }, + "source": [ + "If we were to pick three features, it seems that major axis length, area, and eccentricity might contain most of the information that differentiates the two classes. Other combinations may work as well.\n", + "\n", + "Run the previous code cell to graph those three features if you haven't already.\n", + "\n", + "It seems like a distinct class boundary appears in the space spanned by these three features. We'll train a model on just these features, then another model on the complete set of features, and compare their performance." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_G6y-XcEmk6r" + }, + "source": [ + "## Normalize data\n", + "\n", + "When creating a model with multiple features, the values of each feature should span roughly the same range. 
If one feature's values range from 500 to 100,000 and another feature's values range from 2 to 12, the model will need to have weights of extremely low or extremely high values to be able to combine these features effectively. This could result in a low-quality model. To avoid this,\n", + "[normalize](https://developers.google.com/machine-learning/glossary/#normalization) features in a multi-feature model.\n", + "\n", + "This can be done by converting each raw value to its Z-score. The **Z-score** for a given value is how many standard deviations away from the mean the value is.\n", + "\n", + "Consider a feature with a mean of 60 and a standard deviation of 10.\n", + "\n", + "The raw value 75 would have a Z-score of +1.5:\n", + "\n", + "```\n", + " Z-score = (75 - 60) / 10 = +1.5\n", + "```\n", + "\n", + "The raw value 38 would have a Z-score of -2.2:\n", + "\n", + "```\n", + " Z-score = (38 - 60) / 10 = -2.2\n", + "```\n", + "\n", + "Now normalize the numerical values in the rice dataset by converting them to Z-scores." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "hSUjPSwNiyBP" + }, + "outputs": [], + "source": [ + "# Calculate the Z-scores of each numerical column in the raw data and write\n", + "# them into a new DataFrame named normalized_dataset.\n", + "\n", + "feature_mean = rice_dataset.mean(numeric_only=True)\n", + "feature_std = rice_dataset.std(numeric_only=True)\n", + "numerical_features = rice_dataset.select_dtypes('number').columns\n", + "normalized_dataset = (\n", + " rice_dataset[numerical_features] - feature_mean\n", + ") / feature_std\n", + "\n", + "# Copy the class to the new dataframe\n", + "normalized_dataset['Class'] = rice_dataset['Class']\n", + "\n", + "# Examine some of the values of the normalized dataset. 
Notice that most\n", + "# Z-scores fall between -2 and +2.\n", + "normalized_dataset.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D5aXjXq-YIkL" + }, + "source": [ + "# Set the random seeds\n", + "\n", + "To make experiments reproducible, we set the seed of the random number generators. This means that the order in which the data is shuffled, the values of the random weight initializations, etc., will all be the same each time the Colab is run." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bu257GAFYH-N" + }, + "outputs": [], + "source": [ + "keras.utils.set_random_seed(42)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p7M9I-ekT1dV" + }, + "source": [ + "## Label and split data\n", + "\n", + "To train the model, we'll arbitrarily assign the Cammeo species a label of '1' and the Osmancik species a label of '0'." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "F4_yTxWdvPqz" + }, + "outputs": [], + "source": [ + "# Create a column setting the Cammeo label to '1' and the Osmancik label to '0'\n", + "# then show 10 randomly selected rows.\n", + "normalized_dataset['Class_Bool'] = (\n", + " # Returns true if class is Cammeo, and false if class is Osmancik\n", + " normalized_dataset['Class'] == 'Cammeo'\n", + ").astype(int)\n", + "normalized_dataset.sample(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VBY8b0akUqiQ" + }, + "source": [ + "We can then randomize and partition the dataset into train, validation, and test splits, consisting of 80%, 10%, and 10% of the dataset respectively." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XE-RAq0av1wv" + }, + "outputs": [], + "source": [ + "# Create indices at the 80th and 90th percentiles\n", + "number_samples = len(normalized_dataset)\n", + "index_80th = round(number_samples * 0.8)\n", + "index_90th = index_80th + round(number_samples * 0.1)\n", + "\n", + "# Randomize order and split into train, validation, and test with a .8, .1, .1 split\n", + "shuffled_dataset = normalized_dataset.sample(frac=1, random_state=100)\n", + "train_data = shuffled_dataset.iloc[0:index_80th]\n", + "validation_data = shuffled_dataset.iloc[index_80th:index_90th]\n", + "test_data = shuffled_dataset.iloc[index_90th:]\n", + "\n", + "# Show the first five rows of the last split\n", + "test_data.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7Iq_haqJYeSH" + }, + "source": [ + "It's important to prevent the model from getting the label as input during training, which is called label leakage. This can be done by storing features and labels as separate variables." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_Gi0VaAAYiaO" + }, + "outputs": [], + "source": [ + "label_columns = ['Class', 'Class_Bool']\n", + "\n", + "train_features = train_data.drop(columns=label_columns)\n", + "train_labels = train_data['Class_Bool'].to_numpy()\n", + "validation_features = validation_data.drop(columns=label_columns)\n", + "validation_labels = validation_data['Class_Bool'].to_numpy()\n", + "test_features = test_data.drop(columns=label_columns)\n", + "test_labels = test_data['Class_Bool'].to_numpy()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "U-kTF2rTY-K8" + }, + "source": [ + "## Train the model\n", + "\n", + "### Choose the input features\n", + "\n", + "To start with, we'll train a model on `Eccentricity`, `Major_Axis_Length,` and `Area`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7v_UNIPBtjoz" + }, + "outputs": [], + "source": [ + "# Name of the features we'll train our model on.\n", + "input_features = [\n", + " 'Eccentricity',\n", + " 'Major_Axis_Length',\n", + " 'Area',\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cSBegHR5rSEn" + }, + "source": [ + "## Define functions that build and train a model\n", + "\n", + "The following code cell defines two functions:\n", + "\n", + " * `create_model(settings, metrics)`, which defines the model's architecture.\n", + " * `train_model(experiment_name, model, dataset, labels, settings)`, which uses input features and labels to train the model.\n", + "\n", + "Note: `create_model` applies the sigmoid function to perform [logistic regression](https://developers.google.com/machine-learning/crash-course/logistic-regression).\n", + "\n", + "We also define two helpful data structures: `ExperimentSettings` and `Experiment`. We use these simple classes to keep track of our experiments, allowing us to know what hyperparameters were used and what the results were. In `ExperimentSettings`, we store all values describing an experiment (i.e., hyperparameters). Then, we store the results of a training run (i.e., the model and the training metrics) into an `Experiment` instance, along with the `ExperimentSettings` used for that experiment." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "8B2VArcKH6UX" + }, + "outputs": [], + "source": [ + "# @title Define the functions that create and train a model.\n", + "\n", + "import dataclasses\n", + "\n", + "\n", + "@dataclasses.dataclass()\n", + "class ExperimentSettings:\n", + " \"\"\"Lists the hyperparameters and input features used to train a model.\"\"\"\n", + "\n", + " learning_rate: float\n", + " number_epochs: int\n", + " batch_size: int\n", + " classification_threshold: float\n", + " input_features: list[str]\n", + "\n", + "\n", + "@dataclasses.dataclass()\n", + "class Experiment:\n", + " \"\"\"Stores the settings used for a training run and the resulting model.\"\"\"\n", + "\n", + " name: str\n", + " settings: ExperimentSettings\n", + " model: keras.Model\n", + " epochs: list[int]\n", + " metrics_history: pd.DataFrame\n", + "\n", + " def get_final_metric_value(self, metric_name: str) -> float:\n", + " \"\"\"Gets the final value of the given metric for this experiment.\"\"\"\n", + " if metric_name not in self.metrics_history:\n", + " raise ValueError(\n", + " f'Unknown metric {metric_name}: available metrics are'\n", + " f' {list(self.metrics_history.columns)}'\n", + " )\n", + " return self.metrics_history[metric_name].iloc[-1]\n", + "\n", + "\n", + "def create_model(\n", + " settings: ExperimentSettings,\n", + " metrics: list[keras.metrics.Metric],\n", + ") -> keras.Model:\n", + " \"\"\"Create and compile a simple classification model.\"\"\"\n", + " model_inputs = [\n", + " keras.Input(name=feature, shape=(1,))\n", + " for feature in settings.input_features\n", + " ]\n", + " # Use a Concatenate layer to assemble the different inputs into a single\n", + " # tensor which will be given as input to the Dense layer.\n", + " # For example: [input_1[0][0], input_2[0][0]]\n", + "\n", + " concatenated_inputs = keras.layers.Concatenate()(model_inputs)\n", + " dense = 
keras.layers.Dense(\n", + " units=1, input_shape=(1,), name='dense_layer', activation=keras.activations.sigmoid\n", + " )\n", + " model_output = dense(concatenated_inputs)\n", + " model = keras.Model(inputs=model_inputs, outputs=model_output)\n", + " # Call the compile method to transform the layers into a model that\n", + " # Keras can execute. Notice that we're using a different loss\n", + " # function for classification than for regression.\n", + " model.compile(\n", + " optimizer=keras.optimizers.RMSprop(\n", + " settings.learning_rate\n", + " ),\n", + " loss=keras.losses.BinaryCrossentropy(),\n", + " metrics=metrics,\n", + " )\n", + " return model\n", + "\n", + "\n", + "def train_model(\n", + " experiment_name: str,\n", + " model: keras.Model,\n", + " dataset: pd.DataFrame,\n", + " labels: np.ndarray,\n", + " settings: ExperimentSettings,\n", + ") -> Experiment:\n", + " \"\"\"Feed a dataset into the model in order to train it.\"\"\"\n", + "\n", + " # The x parameter of keras.Model.fit can be a dict mapping each input\n", + " # name to an array containing the data for that feature.\n", + " features = {\n", + " feature_name: np.array(dataset[feature_name])\n", + " for feature_name in settings.input_features\n", + " }\n", + "\n", + " history = model.fit(\n", + " x=features,\n", + " y=labels,\n", + " batch_size=settings.batch_size,\n", + " epochs=settings.number_epochs,\n", + " )\n", + "\n", + " return Experiment(\n", + " name=experiment_name,\n", + " settings=settings,\n", + " model=model,\n", + " epochs=history.epoch,\n", + " metrics_history=pd.DataFrame(history.history),\n", + " )\n", + "\n", + "\n", + "print('Defined the create_model and train_model functions.')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ak_TMAzGOIFq" + }, + "source": [ + "## Define a plotting function\n", + "\n", + "The following [matplotlib](https://developers.google.com/machine-learning/glossary/#matplotlib) function plots one or more curves, showing how various classification 
metrics change with each epoch." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QF0BFRXTOeR3" + }, + "outputs": [], + "source": [ + "# @title Define the plotting function.\n", + "def plot_experiment_metrics(experiment: Experiment, metrics: list[str]):\n", + " \"\"\"Plot a curve of one or more metrics for different epochs.\"\"\"\n", + " plt.figure(figsize=(12, 8))\n", + "\n", + " for metric in metrics:\n", + " plt.plot(\n", + " experiment.epochs, experiment.metrics_history[metric], label=metric\n", + " )\n", + "\n", + " plt.xlabel(\"Epoch\")\n", + " plt.ylabel(\"Metric value\")\n", + " plt.grid()\n", + " plt.legend()\n", + "\n", + "\n", + "print(\"Defined the plot_experiment_metrics function.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D-IXYVfvM4gD" + }, + "source": [ + "## Invoke the creating, training, and plotting functions\n", + "\n", + "The following code specifies the hyperparameters, invokes the\n", + "functions to create and train the model, then plots the results, including accuracy, precision, and recall.\n", + "\n", + "The classification threshold is set at 0.35. Try playing with the threshold, then the learning rate, to see what changes." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Q82E6tS13O_2" + }, + "outputs": [], + "source": [ + "# Let's define our first experiment settings.\n", + "settings = ExperimentSettings(\n", + " learning_rate=0.001,\n", + " number_epochs=60,\n", + " batch_size=100,\n", + " classification_threshold=0.35,\n", + " input_features=input_features,\n", + ")\n", + "\n", + "metrics = [\n", + " keras.metrics.BinaryAccuracy(\n", + " name='accuracy', threshold=settings.classification_threshold\n", + " ),\n", + " keras.metrics.Precision(\n", + " name='precision', thresholds=settings.classification_threshold\n", + " ),\n", + " keras.metrics.Recall(\n", + " name='recall', thresholds=settings.classification_threshold\n", + " ),\n", + " keras.metrics.AUC(num_thresholds=100, name='auc'),\n", + "]\n", + "\n", + "# Establish the model's topology.\n", + "model = create_model(settings, metrics)\n", + "\n", + "# Train the model on the training set.\n", + "experiment = train_model(\n", + " 'baseline', model, train_features, train_labels, settings\n", + ")\n", + "\n", + "# Plot metrics vs. epochs\n", + "plot_experiment_metrics(experiment, ['accuracy', 'precision', 'recall'])\n", + "plot_experiment_metrics(experiment, ['auc'])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RfxkB-_vwUwq" + }, + "source": [ + "AUC is calculated across all possible thresholds (in practice in the code above, 100 thresholds), while accuracy, precision, and recall are calculated for only the specified threshold. For this reason they are shown separately above." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u8y8vKBGsv0m" + }, + "source": [ + "## Evaluate the model against the test set\n", + "\n", + "At the end of model training, you ended up with a certain accuracy against the *training set*. Invoke the following code cell to determine your model's accuracy against the *test set*." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bHh53BX44R94" + }, + "outputs": [], + "source": [ + "def evaluate_experiment(\n", + " experiment: Experiment, test_dataset: pd.DataFrame, test_labels: np.ndarray\n", + ") -> dict[str, float]:\n", + " features = {\n", + " feature_name: np.array(test_dataset[feature_name])\n", + " for feature_name in experiment.settings.input_features\n", + " }\n", + " return experiment.model.evaluate(\n", + " x=features,\n", + " y=test_labels,\n", + " batch_size=experiment.settings.batch_size,\n", + " verbose=0, # Hide progress bar\n", + " return_dict=True,\n", + " )\n", + "\n", + "\n", + "def compare_train_test(experiment: Experiment, test_metrics: dict[str, float]):\n", + " print('Comparing metrics between train and test:')\n", + " for metric, test_value in test_metrics.items():\n", + " print('------')\n", + " print(f'Train {metric}: {experiment.get_final_metric_value(metric):.4f}')\n", + " print(f'Test {metric}: {test_value:.4f}')\n", + "\n", + "\n", + "# Evaluate test metrics\n", + "test_metrics = evaluate_experiment(experiment, test_features, test_labels)\n", + "compare_train_test(experiment, test_metrics)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ku6nq9KDtL6u" + }, + "source": [ + "It appears that the model, which achieved ~92% accuracy on the training data, still shows an accuracy of about 90% on the test data. Can we do better? Let's train a model using all seven available features and compare the AUC." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "72Mfkp3RxQii" + }, + "outputs": [], + "source": [ + "# Features used to train the model on.\n", + "# Specify all features.\n", + "all_input_features = [\n", + " 'Eccentricity',\n", + " 'Major_Axis_Length',\n", + " 'Minor_Axis_Length',\n", + " ? 
Your code here\n", + "]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "xfdeJjoUNmTv" + }, + "outputs": [], + "source": [ + "#@title Solution\n", + "# Features used to train the model on.\n", + "# Specify all features.\n", + "all_input_features = [\n", + " 'Eccentricity',\n", + " 'Major_Axis_Length',\n", + " 'Minor_Axis_Length',\n", + " 'Area',\n", + " 'Convex_Area',\n", + " 'Perimeter',\n", + " 'Extent',\n", + "]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Hql2nxXqxuBg" + }, + "source": [ + "## Train the full-featured model and calculate metrics" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-85dcJ3ntocd" + }, + "outputs": [], + "source": [ + "settings_all_features = ExperimentSettings(\n", + " learning_rate=0.001,\n", + " number_epochs=60,\n", + " batch_size=100,\n", + " classification_threshold=0.5,\n", + " input_features=all_input_features,\n", + ")\n", + "\n", + "# Define the metrics to track: accuracy, precision, and recall at the\n", + "# classification threshold, plus AUC across thresholds:\n", + "metrics = [\n", + " keras.metrics.BinaryAccuracy(\n", + " name='accuracy',\n", + " threshold=settings_all_features.classification_threshold,\n", + " ),\n", + " keras.metrics.Precision(\n", + " name='precision',\n", + " thresholds=settings_all_features.classification_threshold,\n", + " ),\n", + " keras.metrics.Recall(\n", + " name='recall', thresholds=settings_all_features.classification_threshold\n", + " ),\n", + " keras.metrics.AUC(num_thresholds=100, name='auc'),\n", + "]\n", + "\n", + "# Establish the model's topology.\n", + "model_all_features = create_model(settings_all_features, metrics)\n", + "\n", + "# Train the model on the training set.\n", + "experiment_all_features = train_model(\n", + " 'all features',\n", + " model_all_features,\n", + " train_features,\n", + " train_labels,\n", + " settings_all_features,\n", + ")\n", + "\n", + "# Plot 
metrics vs. epochs\n", + "plot_experiment_metrics(\n", + " experiment_all_features, ['accuracy', 'precision', 'recall']\n", + ")\n", + "plot_experiment_metrics(experiment_all_features, ['auc'])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "E5ndvrnjzXCo" + }, + "source": [ + "## Evaluate full-featured model on test split" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-BklcY6pyDrY" + }, + "outputs": [], + "source": [ + "test_metrics_all_features = evaluate_experiment(\n", + " experiment_all_features, test_features, test_labels\n", + ")\n", + "compare_train_test(experiment_all_features, test_metrics_all_features)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wTr_boLBze2k" + }, + "source": [ + "This second model has very similar train and test metrics, suggesting it overfit less to the training data." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EqgyfbXXawq4" + }, + "source": [ + "# Comparing our two models\n", + "\n", + "With our simple experimentation framework, we can keep track of which experiments we ran, and what the results were. We can also define a helper function below which allows us to easily compare two or more models, both during training and when evaluated on the test set." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "6Td7twEDa8t3" + }, + "outputs": [], + "source": [ + "#@title Define function to compare experiments\n", + "\n", + "def compare_experiment(experiments: list[Experiment],\n", + " metrics_of_interest: list[str],\n", + " test_dataset: pd.DataFrame,\n", + " test_labels: np.array):\n", + " # Make sure that we have all the data we need.\n", + " for metric in metrics_of_interest:\n", + " for experiment in experiments:\n", + " if metric not in experiment.metrics_history:\n", + " raise ValueError(f'Metric {metric} not available for experiment {experiment.name}')\n", + "\n", + " fig = plt.figure(figsize=(12, 12))\n", + " ax = fig.add_subplot(2, 1, 1)\n", + "\n", + " colors = [f'C{i}' for i in range(len(experiments))]\n", + " markers = ['.', '*', 'd', 's', 'p', 'x']\n", + " marker_size = 10\n", + "\n", + " ax.set_title('Train metrics')\n", + " for i, metric in enumerate(metrics_of_interest):\n", + " for j, experiment in enumerate(experiments):\n", + " plt.plot(experiment.epochs, experiment.metrics_history[metric], markevery=4,\n", + " marker=markers[i], markersize=marker_size, color=colors[j])\n", + "\n", + " # Add custom legend to show what the colors and markers mean\n", + " legend_handles = []\n", + " for i, metric in enumerate(metrics_of_interest):\n", + " legend_handles.append(Line2D([0], [0], label=metric, marker=markers[i],\n", + " markersize=marker_size, c='k'))\n", + " for i, experiment in enumerate(experiments):\n", + " legend_handles.append(Line2D([0], [0], label=experiment.name, color=colors[i]))\n", + "\n", + " ax.set_xlabel(\"Epoch\")\n", + " ax.set_ylabel(\"Metric value\")\n", + " ax.grid()\n", + " ax.legend(handles=legend_handles)\n", + "\n", + " ax = fig.add_subplot(2, 1, 2)\n", + " spacing = 0.3\n", + " n_bars = len(experiments)\n", + " bar_width = (1 - spacing)/n_bars\n", + " for i, experiment in enumerate(experiments):\n", + " test_metrics = 
evaluate_experiment(experiment, test_dataset, test_labels)\n", + " x = np.arange(len(metrics_of_interest)) + bar_width * (i + 1/2 - n_bars/2)\n", + " ax.bar(x, [test_metrics[metric] for metric in metrics_of_interest], width=bar_width, label=experiment.name)\n", + " ax.set_xticks(np.arange(len(metrics_of_interest)), metrics_of_interest)\n", + "\n", + " ax.set_title('Test metrics')\n", + " ax.set_ylabel('Metric value')\n", + " ax.set_axisbelow(True) # Put the grid behind the bars\n", + " ax.grid()\n", + " ax.legend()\n", + "\n", + "print('Defined function to compare experiments.')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JhbgA_FEayYU" + }, + "outputs": [], + "source": [ + "compare_experiment([experiment, experiment_all_features],\n", + " ['accuracy', 'auc'],\n", + " test_features, test_labels)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sKIuJGOTbNWz" + }, + "source": [ + "Comparing the two models, both have AUC of ~.97-.98. There does not seem to be a large gain in model quality when adding the other four features, which makes sense, given that many of the features (area, perimeter, and convex area, for example) are interrelated." + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/ml/cc/exercises/fairness_income.ipynb b/ml/cc/exercises/fairness_income.ipynb new file mode 100644 index 0000000..1d7c109 --- /dev/null +++ b/ml/cc/exercises/fairness_income.ipynb @@ -0,0 +1,944 @@ +{ + "cells": [ + { + "cell_type": "code", + "source": [ + "#@title Copyright 2024 Google LLC. 
Double-click for license information.\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ], + "metadata": { + "cellView": "form", + "id": "4zAh2rJSuao0" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T4r2z30vJSbA" + }, + "source": [ + "# Colabs\n", + "\n", + "Machine Learning Crash Course uses Colaboratory (Colab) notebooks for all programming exercises. Colab is Google's implementation of [Jupyter Notebook](https://jupyter.org/). For more information about Colabs and how to use them, go to [Welcome to Colaboratory](https://research.google.com/colaboratory)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TL5y5fY9Jy_x" + }, + "source": [ + "# Addressing Bias and Fairness Issues in ML Models\n", + "\n", + "This notebook uses an updated and reconstructed version of the [UCI Adult](https://archive.ics.uci.edu/dataset/2/adult) dataset called [ACSIncome](https://github.com/socialfoundations/folktables) to:\n", + "\n", + "* Train a model that predicts whether an individual's income is above $50,000 USD based on education, employment, and marital status, as well as *sensitive* attributes, such as age, sex, and race.\n", + "* Compute commonly-identified fairness metrics to evaluate how the model performs across demographic groups.\n", + "* Apply a model remediation technique to minimize the difference in error rates between demographic groups.\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h8wtceyJj2uX" + }, + "source": [ + "## Learning Objectives\n", + "\n", + " * Learn how to evaluate the performance of a trained model for fairness using [TensorFlow Model Analysis](https://www.tensorflow.org/tfx/model_analysis/get_started) and [Fairness Indicators](https://www.tensorflow.org/tfx/guide/fairness_indicators).\n", + " * Address bias and fairness issues in a trained model with [TensorFlow Model Remediation](https://www.tensorflow.org/responsible_ai/model_remediation)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mW9OHUMHr7Jy" + }, + "source": [ + "## Intended Use & Considerations\n", + "\n", + "This notebook demonstrates how to mitigate undesirable biases in a trained model. In practice, the model and the prediction task in this notebook are not realistic. 
Rather, the emphasis here is the approach to evaluating the performance of a trained model and minimizing the difference in error rates between demographic groups.\n", + "\n", + "With regard to the prediction task, there may be circumstances where an individual's income may be used as a determining factor, such as obtaining a loan, acquiring insurance, applying for assistance programs, or possibly to advertise products. But in those cases, such businesses, institutions or organizations would not be looking to infer one's income using a machine learning model; they would instead obtain that information directly from the applicant, if possible, or rely on other signals to decide on the outcome.\n", + "\n", + "With that said, **this notebook is for educational purposes only and it is not advisable to use this model in any context outside of this exercise**." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2aALNeO9Rvk7" + }, + "source": [ + "## Setup\n", + "To run this exercise, TensorFlow Model Analysis, TensorFlow Model Remediation and Fairness Indicators will need to be installed." + ] + }, + { + "metadata": { + "id": "Rqy7tAHmW6Pu" + }, + "cell_type": "code", + "source": [ + "!pip install --quiet --upgrade \\\n", + " tensorflow-model-remediation \\\n", + " fairness-indicators \\\n", + " tensorflow-model-analysis" + ], + "outputs": [], + "execution_count": null + }, + { + "metadata": { + "id": "V1hu8K637Sio" + }, + "cell_type": "markdown", + "source": [ + "With the libraries installed, all necessary components can now be imported — including MinDiff for addressing unfair bias in models and Fairness Indicators for evaluating and improving models for fairness concerns."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "2_64QdLt2Fwo" + }, + "outputs": [], + "source": [ + "import pandas as pd\n", + "import tensorflow as tf\n", + "\n", + "import tensorflow_model_analysis as tfma\n", + "from google.protobuf import text_format\n", + "\n", + "from tensorflow_model_remediation import min_diff" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JJZEgJQSjyK4" + }, + "source": [ + "## About the ACSIncome Dataset\n", + " \n", + "ACSIncome is one of several datasets created by [Ding et al.](https://proceedings.neurips.cc/paper_files/paper/2021/file/32e54441e6382a7fbacbbbaf3c450059-Paper.pdf) as an alternative to [UCI Adult](https://archive.ics.uci.edu/dataset/2/adult). A few key details about ACSIncome:\n", + "* The dataset contains 1,664,500 datapoints pulled from the 2018 United States–wide [American Community Survey](https://www.census.gov/programs-surveys/acs) (ACS) [Public Use Microdata Sample](https://www.census.gov/programs-surveys/acs/microdata.html) (PUMS) data sample.\n", + "* All fifty US states and Puerto Rico are represented in this dataset.\n", + "* Each row represents a person described by various features, including age, race, and sex, which correspond to protected categories in different domains under US anti-discrimination laws.\n", + "* The dataset only includes individuals above 16 years old who worked at least 1 hour per week in the past year and had an income of at least $100 USD.\n", + "\n", + "For more information on the dataset and how it was created to reconstruct UCI Adult, check out the following citations:\n", + "\n", + "> Ding, Frances, Moritz Hardt, John Miller, and Ludwig Schmidt. 
\"[Retiring adult: New datasets for fair machine learning.](https://proceedings.neurips.cc/paper_files/paper/2021/hash/32e54441e6382a7fbacbbbaf3c450059-Abstract.html)\" Advances in neural information processing systems 34 (2021): 6478-6490.\n", + "\n", + "> Sarah Flood, Miriam King, Renae Rodgers, Steven Ruggles, and J. Robert Warren (2020). Integrated Public Use Microdata Series, Current Population Survey: Version 8.0 [dataset]. Minneapolis, MN: IPUMS. https://doi.org/10.18128/D030.V8.0\n", + "\n", + "\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8pWWcKN7Lmv7" + }, + "outputs": [], + "source": [ + "# Import the dataset\n", + "acs_df = pd.read_csv(filepath_or_buffer=\"https://download.mlcc.google.com/mledu-datasets/acsincome_raw_2018.csv\")\n", + "\n", + "# Print five random rows of the pandas DataFrame.\n", + "acs_df.sample(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AQ1aYynYXS3g" + }, + "source": [ + "## Features\n", + "\n", + "After importing the dataset, five random samples appear in a table in the output cell. Each sample represents an individual, with each column representing an aspect of the individual, such as their age, occupation, place of birth, and so forth.\n", + "\n", + "The following table describes each feature column:\n", + "\n", + "| Feature | Description |\n", + "| -------- | ------- |\n", + "| AGEP | Age |\n", + "| COW | Class of worker (government employee, self-employed, private employee) |\n", + "| SCHL | Educational attainment (high school diploma, bachelor's degree, doctorate degree) |\n", + "| MAR | Marital status |\n", + "| OCCP | Occupation |\n", + "| POBP | Place of birth |\n", + "| RELP | Relationship to householder (husband or wife, housemate or roommate, nursing home, group home, etc.) 
|\n", + "| WKHP | Usual hours worked per week in the past 12 months |\n", + "| SEX | Male or female |\n", + "| RAC1P | Recorded detailed race code |\n", + "| ST | US state code that represents the individual's location |\n", + "| PINCP | Total person's yearly income |\n", + "\n", + "All of these features are represented numerically, though some of them correspond to a coded value. For example, for the `COW` (Class of worker) feature, `1.0` represents *an employee of a private for-profit company or business, or of an individual, for wages, salary, or commissions* and `2.0` represents *an employee of a private not-for-profit, tax-exempt, or charitable organization*. See [the supplemental section](https://proceedings.neurips.cc/paper_files/paper/2021/file/32e54441e6382a7fbacbbbaf3c450059-Supplemental.pdf) of [Ding et al.](https://proceedings.neurips.cc/paper_files/paper/2021/file/32e54441e6382a7fbacbbbaf3c450059-Paper.pdf) and the [ACS PUMS 2018 Data Dictionary](https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2018.pdf) for the full mapping of codes." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hRd-tmUPa4tB" + }, + "source": [ + "## Change Target Value to Binary\n", + "[As stated earlier](#scrollTo=TL5y5fY9Jy_x), the task is to predict whether the annual income of a US working adult is more than $50,000. The `PINCP` (total person's yearly income) column in the dataset represents the target variable; however, the value will need to be converted into a binary label. 
For each sample, an individual’s target label will be `1` if `PINCP` > `50000.0`, otherwise `0`.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "N9MgdxLtDLpk" + }, + "outputs": [], + "source": [ + "LABEL_KEY = 'PINCP'\n", + "LABEL_THRESHOLD = 50000.0\n", + "\n", + "acs_df[LABEL_KEY] = acs_df[LABEL_KEY].apply(\n", + " lambda income: 1 if income > LABEL_THRESHOLD else 0)\n", + "\n", + "acs_df.sample(10)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eP3Gj4ntYqWb" + }, + "source": [ + "## Defining Base Model\n", + "For the purposes of this exercise, a simple, lightly-tuned [`keras.Model`](https://www.tensorflow.org/api_docs/python/tf/keras/Model) (using the [Functional API](https://www.tensorflow.org/guide/keras/functional) for preprocessing) will be created and serve as the base model. Note that while the focus for this exercise is the technique involved in addressing fairness concerns, in practice the model architecture — as taught throughout MLCC — would be thoughtfully chosen and hyperparameter tuning would be performed before attempting to address any fairness concerns that might arise in the model."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "SPnLdxApoux2" + }, + "outputs": [], + "source": [ + "inputs = {}\n", + "features = acs_df.copy()\n", + "features.pop(LABEL_KEY)\n", + "\n", + "# Instantiate a Keras input node for each column in the dataset.\n", + "for name, column in features.items():\n", + "  if name != LABEL_KEY:\n", + "    inputs[name] = tf.keras.Input(\n", + "        shape=(1,), name=name, dtype=tf.float64)\n", + "\n", + "# Stack the inputs as a dictionary and preprocess them.\n", + "def stack_dict(inputs, fun=tf.stack):\n", + "  values = []\n", + "  for key in sorted(inputs.keys()):\n", + "    values.append(tf.cast(inputs[key], tf.float64))\n", + "\n", + "  return fun(values, axis=-1)\n", + "\n", + "x = stack_dict(inputs, fun=tf.concat)\n", + "\n", + "# Collect the features from the DataFrame, stack them together and normalize\n", + "# their values by passing them to the normalization layer.\n", + "normalizer = tf.keras.layers.Normalization(axis=-1)\n", + "normalizer.adapt(stack_dict(dict(features)))\n", + "\n", + "# Build the main body of the model using a normalization layer, two dense\n", + "# rectified-linear layers, and a single output node for classification.\n", + "x = normalizer(x)\n", + "x = tf.keras.layers.Dense(64, activation='relu')(x)\n", + "x = tf.keras.layers.Dense(32, activation='relu')(x)\n", + "outputs = tf.keras.layers.Dense(1, activation='sigmoid')(x)\n", + "\n", + "# Put it all together using the Keras Functional API\n", + "base_model = tf.keras.Model(inputs, outputs)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tWi8a1O5OnG8" + }, + "source": [ + "## Configuring Base Model\n", + "Since this is a binary classification task, computing the cross-entropy loss between true labels and predicted labels will be sufficient for this exercise."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MrFw_DQtFG7y" + }, + "outputs": [], + "source": [ + "# Define the metrics used to monitor model performance while training.\n", + "METRICS = [\n", + " tf.keras.metrics.BinaryAccuracy(name='accuracy'),\n", + " tf.keras.metrics.AUC(name='auc'),\n", + "]\n", + "\n", + "# Configure the model for training using the Adam optimizer (a stochastic\n", + "# gradient method), cross-entropy loss between true and predicted labels, and\n", + "# the metrics defined above to evaluate the base model during training.\n", + "base_model.compile(\n", + " optimizer='adam',\n", + " loss=tf.keras.losses.BinaryCrossentropy(),\n", + " metrics=METRICS)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6FyuAZ-BYRXU" + }, + "source": [ + "## Convert to tf.data.Dataset\n", + "Most of the exercises throughout MLCC use a [pandas DataFrame](https://developers.google.com/machine-learning/glossary/#pandas) directly as an input argument for [`Model.fit`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#fit). But in this exercise, the dataset must be converted to [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) because that's the requirement for [`MinDiffModel`](https://www.tensorflow.org/responsible_ai/model_remediation/api_docs/python/model_remediation/min_diff/keras/MinDiffModel), which will be introduced later. 
Fortunately, this conversion is simple to do, thanks to [`tf.data.Dataset.from_tensor_slices`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices).\n", + "\n", + "The following helper function will be useful for preparing the dataset in subsequent code cells:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jJ6CWSMJc59H" + }, + "outputs": [], + "source": [ + "# Helper function to convert a pandas DataFrame into a tf.data.Dataset object\n", + "# necessary for the purposes of this exercise.\n", + "def dataframe_to_dataset(dataframe):\n", + "  dataframe = dataframe.copy()\n", + "  labels = dataframe.pop(LABEL_KEY)\n", + "  dataset = tf.data.Dataset.from_tensor_slices(\n", + "      ((dict(dataframe), labels)))\n", + "  return dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "meL30tAbcZHM" + }, + "source": [ + "## Finalize Training Set & Train Base Model\n", + "At this point, all that remains is splitting the dataset before training the base model.\n", + "\n", + "**NOTE:** *The following cell may take approximately 10—15 minutes to run.*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zoHosWJ1gZjd" + }, + "outputs": [], + "source": [ + "RANDOM_STATE = 200\n", + "BATCH_SIZE = 100\n", + "EPOCHS = 10\n", + "\n", + "# Use the sample() method in pandas to split the dataset into a training set\n", + "# that represents 80% of the original dataset, then convert it to a\n", + "# tf.data.Dataset object, and finally train the model using the\n", + "# converted training set.\n", + "acs_train_df = acs_df.sample(frac=0.8, random_state=RANDOM_STATE)\n", + "acs_train_ds = dataframe_to_dataset(acs_train_df)\n", + "acs_train_batches = acs_train_ds.batch(BATCH_SIZE)\n", + "\n", + "base_model.fit(acs_train_batches, epochs=EPOCHS)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ksj1-ycTe9qW" + }, + "source": [ + "## Evaluate Base Model\n", + "Consistent with [Ding et 
al.](https://proceedings.neurips.cc/paper_files/paper/2021/file/32e54441e6382a7fbacbbbaf3c450059-Paper.pdf), the overall accuracy for the base model should be at around 80% with minimal tuning and a basic model architecture. The following code cell uses the test set to evaluate the performance of the base model:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jmcHhuTb1bG5" + }, + "outputs": [], + "source": [ + "# Use the indices from the training set to create the test set, which represents\n", + "# 20% of the original dataset; then convert it to a tf.data.Dataset object, and\n", + "# evaluate the base model using the converted test set.\n", + "acs_test_df = acs_df.drop(acs_train_df.index).sample(frac=1.0)\n", + "acs_test_ds = dataframe_to_dataset(acs_test_df)\n", + "acs_test_batches = acs_test_ds.batch(BATCH_SIZE)\n", + "\n", + "base_model.evaluate(acs_test_batches, batch_size=BATCH_SIZE)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JPqttzkuOY0g" + }, + "source": [ + "## Evaluating for Fairness\n", + "With the base model trained, now would be a good opportunity to evaluate performance across demographic groups. For ease of analysis, [Fairness Indicators](https://www.tensorflow.org/responsible_ai/fairness_indicators/guide) will be used to compute fairness metrics across demographic groups and visualize results.\n", + "\n", + "To begin, a column containing all the base model's predictions from the test set will be needed in order to configure Fairness Indicators. [`model.predict(test_set)`](https://www.tensorflow.org/api_docs/python/tf/keras/Model#predict) will be used to generate the output predictions." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XwJ257tXhL8k" + }, + "outputs": [], + "source": [ + "# Generate output predictions using the test set.\n", + "base_model_predictions = base_model.predict(\n", + " acs_test_batches, batch_size=BATCH_SIZE)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vGG9x0aqnPtE" + }, + "source": [ + "## A Note on Sensitive Attributes\n", + "There are several features in ACSIncome that can be used to evaluate for fairness. For this exercise, the `SEX` attribute was chosen. This is in part to keep the config terse.\n", + "\n", + "The `SEX` attribute used by the [US Census Bureau surveys](https://www.census.gov/glossary/?term=Sex) to construct ACSIncome was specifically intended to capture an individual's biological sex and not gender. As such, possible ambiguity of these concepts could have tampered with their intended data collection, which could result in some misrepresentation. Ideally, there would be a separate category in such surveys that allow for an individual to express their gender identity (male, female, non-binary, agender, and so forth). 
But for the purposes of this exercise, the `SEX` attribute available in ACSIncome will suffice for demonstrating how to perform fairness evaluations.\n", + "\n", + "In practice, the recommended approach is to evaluate across any group that may be negatively impacted by the trained model and is accessible in the dataset, which can include race, ethnicity, working status, educational background, and even the US state that the individual is in.\n", + "\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "g0qTAaFKnOV-" + }, + "outputs": [], + "source": [ + "SENSITIVE_ATTRIBUTE_VALUES = {1.0: \"Male\", 2.0: \"Female\"}\n", + "SENSITIVE_ATTRIBUTE_KEY = 'SEX'\n", + "PREDICTION_KEY = 'PRED'\n", + "\n", + "# Make a copy of the test set, replace sensitive attribute values with\n", + "# categorical strings (for ease of visualization), and add predictions\n", + "# from the test set to the copied DataFrame as a separate column.\n", + "base_model_analysis = acs_test_df.copy()\n", + "base_model_analysis[SENSITIVE_ATTRIBUTE_KEY] = base_model_analysis[\n", + "    SENSITIVE_ATTRIBUTE_KEY].replace(SENSITIVE_ATTRIBUTE_VALUES)\n", + "base_model_analysis[PREDICTION_KEY] = base_model_predictions\n", + "\n", + "# Show five random examples to ensure that it looks correct.\n", + "base_model_analysis.sample(5)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w1mUUFJoyx9V" + }, + "source": [ + "## Configure Fairness Indicators\n", + "With a column of predictions now included in the test set, an [`eval_config`](https://www.tensorflow.org/tfx/model_analysis/api_docs/python/tfma/EvalConfig) must be created to use Fairness Indicators. 
This config must include: the names of the prediction and target label columns in the test set, a list of metrics to compute, and the sensitive attribute to designate how the metrics should be computed.\n", + "\n", + "As far as metrics go, [there are several to choose from](https://www.tensorflow.org/tfx/model_analysis/metrics#binary_classification_metrics). For now, `ConfusionMatrixPlot` will provide everything needed to evaluate for fairness.\n", + "\n", + "**NOTE:** *The following cell may take 5—10 minutes to run.*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "x00hquzW0wyi" + }, + "outputs": [], + "source": [ + "# Specify Fairness Indicators using eval_config.\n", + "eval_config_pbtxt = \"\"\"\n", + " model_specs {\n", + " prediction_key: \"%s\"\n", + " label_key: \"%s\" }\n", + " metrics_specs {\n", + " metrics { class_name: \"ExampleCount\" }\n", + " metrics { class_name: \"BinaryAccuracy\" }\n", + " metrics { class_name: \"AUC\" }\n", + " metrics { class_name: \"ConfusionMatrixPlot\" }\n", + " metrics {\n", + " class_name: \"FairnessIndicators\"\n", + " config: '{\"thresholds\": [0.50]}'\n", + " }\n", + " }\n", + " slicing_specs {\n", + " feature_keys: \"%s\"\n", + " }\n", + " slicing_specs {}\n", + "\"\"\" % (PREDICTION_KEY, LABEL_KEY, SENSITIVE_ATTRIBUTE_KEY)\n", + "eval_config = text_format.Parse(eval_config_pbtxt, tfma.EvalConfig())\n", + "\n", + "# Run TensorFlow Model Analysis.\n", + "base_model_eval_result = tfma.analyze_raw_data(base_model_analysis, eval_config)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AbEyFfAdhmeI" + }, + "source": [ + "### Task 1: Identify Fairness Concerns\n", + "Run the code cell below and take a moment to explore the results by selecting several metrics to display on the left pane. 
Individual graphs for each of the metrics selected will appear in the widget to the right.\n", + "\n", + "For each individual graph, you should see a bar that represents the overall performance, followed by bars that each correspond to a demographic group based on the sensitive attribute defined in the configuration.\n", + "\n", + "After looking at performance across different metrics:\n", + "\n", + "1. Is there a metric that performed equally well across demographic groups?\n", + "2. Is there a metric that was disproportionate across demographic groups, despite overall performance along that metric seeming promising?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ij4xK5xoDkV0" + }, + "outputs": [], + "source": [ + "# Render Fairness Indicators.\n", + "tfma.addons.fairness.view.widget_view.render_fairness_indicator(\n", + " base_model_eval_result)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "x8_LU8tw394S" + }, + "outputs": [], + "source": [ + "#@title Double-Click to View a Possible Answer { display-mode: \"form\" }\n", + "\n", + "# 1. The overall AUC for the base model was around 0.88, with male and female\n", + "# groups performing just as well at 0.87 and 0.88, respectively. A performance\n", + "# metric like the AUC would lead one to believe that the model performs well\n", + "# across groups.\n", + "#\n", + "# 2. However, when evaluating with respect to the false negative rate, the\n", + "# results show that performance disproportionately favors males: female\n", + "# performance is nearly 27% worse than the overall baseline, while the graphs\n", + "# reveal that males perform better than the baseline by around 16%."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XvNTD2vx9EOn" + }, + "source": [ + "## The Fairness Issue\n", + "Using Fairness Indicators to evaluate the base model performance across `SEX`, the false negative rates (FNR) between `Male` and `Female` groups revealed disproportionate outcomes. For context, the FNR represents the percentage of positive examples (individuals earning \\$50,000 USD or more annually) that are incorrectly predicted as negative (earning less than \\$50,000 USD annually). This relates to [equality of opportunity](https://developers.google.com/machine-learning/glossary#equality-of-opportunity).\n", + "\n", + "In order for opportunities to be equal within a group, the goal when training the model should be to reduce the gap in the FNR between `Male` and `Female` groups. A tool like [TensorFlow Model Remediation](https://www.tensorflow.org/responsible_ai/model_remediation) can be used at training time to intervene and minimize the error rates between the groups.\n", + "\n", + "To set this up, a subset of the dataset must first be created with only positively labeled `Female` examples (which is referred to as the sensitive group, or the protected class to improve model performance on) and another with only positively labeled `Male` examples (which is referred to as the non-sensitive group, or any group that is not the protected class)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZUcX6jv0OroC" + }, + "source": [ + "### Task 2: Create Positively-Labeled Subsets\n", + "Using the training set:\n", + "\n", + "\n", + "1. Add a line of code that creates a subset of the training set only containing positive `Female` examples.\n", + "2. 
Add another line of code that creates a subset of the training set but only containing positive `Male ` examples.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DhwUgAxcDO_x" + }, + "outputs": [], + "source": [ + "sensitive_group_pos = acs_train_df[ ? ] # Replace the ? with a way to filter\n", + " # positively labeled Female examples.\n", + "\n", + "non_sensitive_group_pos = acs_train_df[ ? ] # Replace the ? with a way to filter\n", + " # positively labeled Male examples." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Vdu6s_s_yGcI" + }, + "outputs": [], + "source": [ + "#@title Double-Click to View a Possible Answer { display-mode: \"form\" }\n", + "\n", + "# A pandas DataFrame offers many approaches when it comes to indexing and\n", + "# selecting rows. One approach is by using boolean indexing\n", + "# (e.g., df[df['col'] == value]) as demonstrated in the following code:\n", + "sensitive_group_pos = acs_train_df[\n", + " (acs_train_df[SENSITIVE_ATTRIBUTE_KEY] == 2.0) & (acs_train_df[LABEL_KEY] == 1)]\n", + "non_sensitive_group_pos = acs_train_df[\n", + " (acs_train_df[SENSITIVE_ATTRIBUTE_KEY] == 1.0) & (acs_train_df[LABEL_KEY] == 1)]\n", + "\n", + "# To learn more, visit: https://pandas.pydata.org/docs/user_guide/indexing.html\n", + "\n", + "print(len(sensitive_group_pos),\n", + " 'positively labeled sensitive group examples')\n", + "print(len(non_sensitive_group_pos),\n", + " 'positively labeled non-sensitive group examples')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JAGQtLVyHsq8" + }, + "source": [ + "## Preparing MinDiff Data\n", + "With the sensitive and non-sensitive subsets defined, attempts can now be made at equalizing the distributions between them using [MinDiff](https://www.tensorflow.org/responsible_ai/model_remediation/min_diff/guide/mindiff_overview). 
MinDiff is one of the [TensorFlow Model Remediation](https://www.tensorflow.org/responsible_ai/model_remediation) techniques used to balance error rates (or, in this exercise, the FNR) between demographic groups (`Male` and `Female`) by penalizing distributional differences during training; hence, MinDiff, or *minimizing the differences*.\n", + "\n", + "As [specified earlier](#scrollTo=6FyuAZ-BYRXU) in this exercise, [`MinDiffModel`](https://www.tensorflow.org/responsible_ai/model_remediation/api_docs/python/model_remediation/min_diff/keras/MinDiffModel) requires [`tf.data.Dataset`](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) as the input. The following code cell will use that same helper function to convert the subsets into `tf.data.Dataset`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qumF1dmJVd_3" + }, + "outputs": [], + "source": [ + "# Convert sensitive and non-sensitive subsets into tf.data.Dataset.\n", + "MIN_DIFF_BATCH_SIZE = 50\n", + "sensitive_group_ds = dataframe_to_dataset(sensitive_group_pos)\n", + "non_sensitive_group_ds = dataframe_to_dataset(non_sensitive_group_pos)\n", + "\n", + "# Batch the subsets.\n", + "sensitive_group_batches = sensitive_group_ds.batch(\n", + " MIN_DIFF_BATCH_SIZE, drop_remainder=True)\n", + "non_sensitive_group_batches = non_sensitive_group_ds.batch(\n", + " MIN_DIFF_BATCH_SIZE, drop_remainder=True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lg-7EXgOK52M" + }, + "source": [ + "### Task 3: Packing the Datasets for MinDiff Model\n", + "Now that the subsets are prepared, they must be packed into a single dataset along with the original training set, which will then be passed along to the `MinDiffModel` for training.\n", + "\n", + "To advance, add a line of code that uses 
[`min_diff.keras.utils.pack_min_diff_data()`](https://www.tensorflow.org/responsible_ai/model_remediation/api_docs/python/model_remediation/min_diff/keras/utils/pack_min_diff_data) to pack the original dataset, the sensitive subset, and the nonsensitive subset.\n", + "\n", + "**Note:** All datasets must be batched before packing them together to prevent errors downstream as a consequence of unintended input tensor shapes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "agxgBOtyLfSb" + }, + "outputs": [], + "source": [ + "acs_train_min_diff_ds = min_diff.keras.utils.pack_min_diff_data(\n", + " original_dataset = ?, # Replace ? with the original training set\n", + " sensitive_group_dataset = ?, # Replace ? with the sensitive subset\n", + " nonsensitive_group_dataset = ?) # Replace ? with the non-sensitive subset" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dT4b6Xn-RPdn" + }, + "outputs": [], + "source": [ + "#@title Double-Click to View a Possible Answer { display-mode: \"form\" }\n", + "\n", + "# All you have to do is include each of the three batched tf.data.Datasets into\n", + "# the arguments: the training set, the sensitive subset (female) and\n", + "# the non-sensitive subset (male).\n", + "acs_train_min_diff_ds = min_diff.keras.utils.pack_min_diff_data(\n", + " original_dataset = acs_train_batches,\n", + " sensitive_group_dataset = sensitive_group_batches,\n", + " nonsensitive_group_dataset = non_sensitive_group_batches)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H0I9MOIBNtsc" + }, + "source": [ + "## Train MinDiff Model\n", + "With the data preparation complete, the base model can now be wrapped in the [`MinDiffModel`](https://www.tensorflow.org/responsible_ai/model_remediation/api_docs/python/model_remediation/min_diff/keras/MinDiffModel) and compiled. 
Configuring the MinDiff model is no different from how the base model was configured [earlier](#scrollTo=tWi8a1O5OnG8)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5K0N0QcNWDyS" + }, + "outputs": [], + "source": [ + "# Wrap the original model in a MinDiffModel.\n", + "min_diff_model = min_diff.keras.MinDiffModel(\n", + " original_model=base_model,\n", + " loss=min_diff.losses.MMDLoss(),\n", + " loss_weight=1)\n", + "\n", + "# Compile the model after wrapping the original model.\n", + "min_diff_model.compile(\n", + " optimizer='adam',\n", + " loss=tf.keras.losses.BinaryCrossentropy(from_logits=False),\n", + " metrics=METRICS)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mb1oByIMP32S" + }, + "source": [ + "As for training, just remember to pass in the packed dataset.\n", + "\n", + "**NOTE:** *The following cell may take approximately 5 minutes to run.*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LAPWTBtpWEb_" + }, + "outputs": [], + "source": [ + "# Train MinDiff model using the packed dataset instead of the training set.\n", + "min_diff_model.fit(acs_train_min_diff_ds, epochs=EPOCHS)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cjKFEMtvQKY8" + }, + "source": [ + "## Evaluate MinDiff Model\n", + "\n", + "When it comes to overall performance, the results of the MinDiff model should be somewhat similar to the base model. Of course, the accuracy or AUC might slightly vary when compared to the base model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gnBXH4qZWV76" + }, + "outputs": [], + "source": [ + "min_diff_model.evaluate(acs_test_batches, batch_size=BATCH_SIZE)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sRJM9Jt4RxHe" + }, + "source": [ + "## Evaluating for Fairness Using Remediated Model\n", + "With the MinDiff model now trained, the same steps used to evaluate the base model with Fairness Indicators can be applied to the MinDiff model as well.\n", + "\n", + "Begin by generating the predictions — this time passing the test set as an input argument to the MinDiff model.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "avtyLfTEomzR" + }, + "outputs": [], + "source": [ + "# Generate MinDiff output predictions using the test set.\n", + "min_diff_model_predictions = min_diff_model.predict(\n", + " acs_test_batches, batch_size=BATCH_SIZE)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "S_Kk52L2S7Jr" + }, + "source": [ + "Same as before, append the predictions as a column onto the DataFrame, then pass it onto Fairness Indicators for evaluation. 
Note that the configuration used for the base model remains the same for the MinDiff model, which is why it is not being redefined below.\n", + "\n", + "**NOTE:** *The following cell may take 5-10 minutes to run.*" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8hnRYjw4F1sy" + }, + "outputs": [], + "source": [ + "# Make a copy of the test set, replace attribute with categorical values,\n", + "# and add the MinDiff test set predictions to the copied DataFrame as a separate\n", + "# column.\n", + "min_diff_model_analysis = acs_test_df.copy()\n", + "min_diff_model_analysis[SENSITIVE_ATTRIBUTE_KEY].replace(\n", + " SENSITIVE_ATTRIBUTE_VALUES, inplace=True)\n", + "min_diff_model_analysis[PREDICTION_KEY] = min_diff_model_predictions\n", + "\n", + "# Run TensorFlow Model Analysis on the MinDiff model.\n", + "min_diff_model_eval_result = tfma.analyze_raw_data(\n", + " min_diff_model_analysis, eval_config)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jwXkfG1HT-7Y" + }, + "source": [ + "### Task 4: Reviewing MinDiff Results\n", + "Run the cell below to visualize the MinDiff results. Then click on AUC and False Negative Rate in the left pane to reveal their respective graphs in the widget to the right.\n", + "\n", + "1. Looking at the AUC, did model performance increase or decrease relative to the base model as a result of penalizing the model during training for differences in error rates?\n", + "2. 
Looking at the false negative rate, are there any noticeable differences between the MinDiff model and the base model?\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MGUTnhHVMj4I" + }, + "outputs": [], + "source": [ + "tfma.addons.fairness.view.widget_view.render_fairness_indicator(\n", + " min_diff_model_eval_result)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "zr1ibvxoVTCu" + }, + "outputs": [], + "source": [ + "#@title Double-Click to View a Possible Answer\n", + "# 1. Though applying MinDiff may come with some performance tradeoffs in\n", + "# comparison to the original task, in this exercise, the MinDiff model performed\n", + "# nearly equally as well as the base model, at least in terms of AUC. What this\n", + "# is suggesting is that, in this context, MinDiff can be effective while not\n", + "# worsening overall performance.\n", + "#\n", + "# 2. Here is where we see the MinDiff performing better than the base model. Not\n", + "# only is the gap in error rates between male and female smaller, but the FNR\n", + "# for female went down drastically from nearly 35% all the way to 30%, while the\n", + "# overall performance still remains relatively the same (29% in MinDiff vs. 28%)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ACnCUYH4Xks7" + }, + "source": [ + "## Considerations\n", + "This exercise demonstrated an approachable way to train and evaluate a model for fairness by expanding on key concepts taught throughout MLCC. The tasks in this notebook provided an example of how a gap in a performance metric between two demographic groups could be a signal that the model may have unfair skews.\n", + "\n", + "However, as discussed in other sections in MLCC, real-world production ML systems are large ecosystems, and the model is just one component of them. 
That means there are a lot of contributing factors (or confounds) that could affect model performance. Furthermore, there are numerous social and technical processes, known and unknown, that underpin and surround ML and AI technologies. Achieving equality on a particular metric alone does not ensure that the model itself is overall fair.\n", + "\n", + "In practice, there is an assumption that the features used to compare performance across sensitive attributes are readily available, or can be accurately inferred. In actuality, datasets do not often include sensitive attributes, and it is generally not a good idea to impute them.\n", + "\n", + "\n", + "\n" + ] + } + ], + "metadata": { + "colab": { + "private_outputs": true, + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/ml/cc/exercises/linear_regression_taxi.ipynb b/ml/cc/exercises/linear_regression_taxi.ipynb new file mode 100644 index 0000000..ed9bd34 --- /dev/null +++ b/ml/cc/exercises/linear_regression_taxi.ipynb @@ -0,0 +1,1012 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "X53vZqc7PxCA" + }, + "outputs": [], + "source": [ + "#@title Copyright 2023 Google LLC. 
Double-click here for license information.\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mWCXBrPgQD0P" + }, + "source": [ + "# Colabs\n", + "\n", + "Machine Learning Crash Course uses Colaboratories (Colabs) for all programming exercises. Colab is Google's implementation of [Jupyter Notebook](https://jupyter.org/). For more information about Colabs and how to use them, go to [Welcome to Colaboratory](https://research.google.com/colaboratory).\n", + "\n", + "# Linear Regression\n", + "In this Colab you will use a real dataset to train a model to predict the fare of a taxi ride in Chicago, IL.\n", + "\n", + "## Learning Objectives\n", + "After completing this Colab, you'll be able to:\n", + "\n", + " * Read a .csv file into a [pandas](https://developers.google.com/machine-learning/glossary/#pandas) DataFrame.\n", + " * Explore a [dataset](https://developers.google.com/machine-learning/glossary/#data_set) with Python visualization libraries.\n", + " * Experiment with different [features](https://developers.google.com/machine-learning/glossary/#feature) to build a linear regression model.\n", + " * Tune the model's [hyperparameters](https://developers.google.com/machine-learning/glossary/#hyperparameter).\n", + " * Compare training runs using [root mean squared 
error](https://developers.google.com/machine-learning/glossary/#root-mean-squared-error-rmse) and [loss curves](https://developers.google.com/machine-learning/glossary/#loss-curve).\n", + "\n", + "## Dataset Description\n", + "The [dataset for this exercise](https://storage.mtls.cloud.google.com/mlcc-nextgen-internal/chicago_taxi_train.csv) is derived from the [City of Chicago Taxi Trips dataset](https://data.cityofchicago.org/Transportation/Taxi-Trips/wrvz-psew). The data for this exercise is a subset of the Taxi Trips data, and focuses on a two-day period in May of 2022." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bBJQc5TgRrFx" + }, + "source": [ + "# Part 1 - Setup Exercise\n", + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gi1pwamg_sWX" + }, + "source": [ + "## Copy this Colab\n", + "\n", + "***IMPORTANT:*** *If you plan to edit any code or text cells in this Colab, make a copy first and work from the copy.*\n", + "\n", + "**Instructions**\n", + "\n", + "1. From the menu select **File > Save copy in Drive**\n", + "1. A new copy will open in a new tab.\n", + "1. Update the name of the file to give your copy a new name.\n", + "1. Proceed with the Colab from your new copy." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V9pkosc63-63" + }, + "source": [ + "## Load required modules\n", + "\n", + "This exercise depends on several Python libraries to help with data manipulation, machine learning tasks, and data visualization.\n", + "\n", + "**Instructions**\n", + "1. Run the **Load dependencies** code cell (below)." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "wHBXW8ob16z3" + }, + "outputs": [], + "source": [ + "#@title Code - Load dependencies\n", + "\n", + "#general\n", + "import io\n", + "\n", + "# data\n", + "import numpy as np\n", + "import pandas as pd\n", + "\n", + "# machine learning\n", + "import keras\n", + "\n", + "# data visualization\n", + "import plotly.express as px\n", + "from plotly.subplots import make_subplots\n", + "import plotly.graph_objects as go\n", + "import seaborn as sns" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sgR4YRjj5T-b" + }, + "source": [ + "## Load the dataset\n", + "\n", + "\n", + "The following code cell loads the dataset and creates a pandas DataFrame.\n", + "\n", + "You can think of a DataFrame like a spreadsheet with rows and columns. The rows represent individual data examples, and the columns represent the attributes associated with each example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "abmswn6USJjQ" + }, + "outputs": [], + "source": [ + "# @title\n", + "chicago_taxi_dataset = pd.read_csv(\"https://download.mlcc.google.com/mledu-datasets/chicago_taxi_train.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iKE0s1hNQ4H9" + }, + "source": [ + "## Update the dataframe\n", + "\n", + "The following code cell updates the DataFrame to use only specific columns from the dataset.\n", + "\n", + "Notice that the output shows just a sample of the dataset, but there should be enough information for you to identify the features associated with the dataset, and have a look at the actual data for a few examples." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "YuLz6IgGP2LE" + }, + "outputs": [], + "source": [ + "#@title Code - Read dataset\n", + "\n", + "# Updates dataframe to use specific columns.\n", + "training_df = chicago_taxi_dataset[['TRIP_MILES', 'TRIP_SECONDS', 'FARE', 'COMPANY', 'PAYMENT_TYPE', 'TIP_RATE']]\n", + "\n", + "print('Read dataset completed successfully.')\n", + "print('Total number of rows: {0}\\n\\n'.format(len(training_df.index)))\n", + "training_df.head(200)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RUL471vSR28O" + }, + "source": [ + "# Part 2 - Dataset Exploration\n", + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7mhqzPIS9nFv" + }, + "source": [ + "## View dataset statistics\n", + "\n", + "A large part of most machine learning projects is getting to know your data. In this step, you will use the ``DataFrame.describe`` method to view descriptive statistics about the dataset and answer some important questions about the data.\n", + "\n", + "**Instructions**\n", + "1. Run the **View dataset statistics** code cell.\n", + "1. Inspect the output and answer these questions:\n", + " * What is the maximum fare?\n", + " * What is the mean distance across all trips?\n", + " * How many cab companies are in the dataset?\n", + " * What is the most frequent payment type?\n", + " * Are any features missing data?\n", + "1. Run the code **View answers to dataset statistics** code cell to check your answers.\n", + "\n", + "\n", + "You might be wondering why there are groups of `NaN` (not a number) values listed in the output. When working with data in Python, you may see this value if the result of a calculation can not be computed or if there is missing information. 
For example, in the taxi dataset `PAYMENT_TYPE` and `COMPANY` are non-numeric, categorical features; numeric information such as mean and max do not make sense for categorical features so the output displays `NaN`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "pkuQNjgoAKYt" + }, + "outputs": [], + "source": [ + "#@title Code - View dataset statistics\n", + "\n", + "print('Total number of rows: {0}\\n\\n'.format(len(training_df.index)))\n", + "training_df.describe(include='all')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "VQ9R5o7CcFzY" + }, + "outputs": [], + "source": [ + "#@title Double-click or run to view answers about dataset statistics\n", + "\n", + "answer = '''\n", + "What is the maximum fare? \t\t\t\t Answer: $159.25\n", + "What is the mean distance across all trips? \t\tAnswer: 8.2895 miles\n", + "How many cab companies are in the dataset? \t\t Answer: 31\n", + "What is the most frequent payment type? \t\t Answer: Credit Card\n", + "Are any features missing data? \t\t\t\t Answer: No\n", + "'''\n", + "\n", + "# You should be able to find the answers to the questions about the dataset\n", + "# by inspecting the table output after running the DataFrame describe method.\n", + "#\n", + "# Run this code cell to verify your answers.\n", + "\n", + "# What is the maximum fare?\n", + "max_fare = training_df['FARE'].max()\n", + "print(\"What is the maximum fare? \\t\\t\\t\\tAnswer: ${fare:.2f}\".format(fare = max_fare))\n", + "\n", + "# What is the mean distance across all trips?\n", + "mean_distance = training_df['TRIP_MILES'].mean()\n", + "print(\"What is the mean distance across all trips? \\t\\tAnswer: {mean:.4f} miles\".format(mean = mean_distance))\n", + "\n", + "# How many cab companies are in the dataset?\n", + "num_unique_companies = training_df['COMPANY'].nunique()\n", + "print(\"How many cab companies are in the dataset? 
\\t\\tAnswer: {number}\".format(number = num_unique_companies))\n", + "\n", + "# What is the most frequent payment type?\n", + "most_freq_payment_type = training_df['PAYMENT_TYPE'].value_counts().idxmax()\n", + "print(\"What is the most frequent payment type? \\t\\tAnswer: {type}\".format(type = most_freq_payment_type))\n", + "\n", + "# Are any features missing data?\n", + "missing_values = training_df.isnull().sum().sum()\n", + "print(\"Are any features missing data? \\t\\t\\t\\tAnswer:\", \"No\" if missing_values == 0 else \"Yes\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-StQ4-wbBpIP" + }, + "source": [ + "## Generate a correlation matrix\n", + "\n", + "An important part of machine learning is determining which [features](https://developers.google.com/machine-learning/glossary/#feature) correlate with the [label](https://developers.google.com/machine-learning/glossary/#label). If you have ever taken a taxi ride before, your experience is probably telling you that the fare is typically associated with the distance traveled and the duration of the trip. But, is there a way for you to learn more about how well these features correlate to the fare (label)?\n", + "\n", + "In this step, you will use a **correlation matrix** to identify features whose values correlate well with the label. Correlation values have the following meanings:\n", + "\n", + " * **`1.0`**: perfect positive correlation; that is, when one attribute rises, the other attribute rises.\n", + " * **`-1.0`**: perfect negative correlation; that is, when one attribute rises, the other attribute falls.\n", + " * **`0.0`**: no correlation; the two columns [are not linearly related](https://en.wikipedia.org/wiki/Correlation_and_dependence#/media/File:Correlation_examples2.svg).\n", + "\n", + "In general, the higher the absolute value of a correlation value, the greater its predictive power.\n", + "\n", + "**Instructions**\n", + "\n", + "1. 
Inspect the code in the **View correlation matrix** code cell.\n", + "1. Run the **View correlation matrix** code cell and inspect the output.\n", + "1. **Check your understanding** by answering these questions:\n", + " * Which feature correlates most strongly to the label FARE?\n", + " * Which feature correlates least strongly to the label FARE?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "-1kFmfdFDVmv" + }, + "outputs": [], + "source": [ + "#@title Code - View correlation matrix\n", + "training_df.corr(numeric_only = True)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "ExPq1h6wIzvR" + }, + "outputs": [], + "source": [ + "#@title Double-click to view answers about the correlation matrix\n", + "\n", + "# Which feature correlates most strongly to the label FARE?\n", + "# ---------------------------------------------------------\n", + "answer = '''\n", + "The feature with the strongest correlation to the FARE is TRIP_MILES.\n", + "As you might expect, TRIP_MILES looks like a good feature to start with to train\n", + "the model. Also, notice that the feature TRIP_SECONDS has a strong correlation\n", + "with fare too.\n", + "'''\n", + "print(answer)\n", + "\n", + "\n", + "# Which feature correlates least strongly to the label FARE?\n", + "# -----------------------------------------------------------\n", + "answer = '''The feature with the weakest correlation to the FARE is TIP_RATE.'''\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rqklIw96G7JA" + }, + "source": [ + "## Visualize relationships in dataset\n", + "\n", + "Sometimes it is helpful to visualize relationships between features in a dataset; one way to do this is with a pair plot. 
A **pair plot** generates a grid of pairwise plots to visualize the relationship of each feature with all other features all in one place.\n", + "\n", + "**Instructions**\n", + "1. Run the **View pair plot** code cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "ph0FE7ZxHY36" + }, + "outputs": [], + "source": [ + "#@title Code - View pairplot\n", + "sns.pairplot(training_df, x_vars=[\"FARE\", \"TRIP_MILES\", \"TRIP_SECONDS\"], y_vars=[\"FARE\", \"TRIP_MILES\", \"TRIP_SECONDS\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zrereRcYR9KG" + }, + "source": [ + "# Part 3 - Train Model\n", + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PfRhSs_RR2VI" + }, + "source": [ + "## Define functions to view model information\n", + "\n", + "To help visualize the results of each training run you will generate two plots at the end of each experiment:\n", + "\n", + "* a scatter plot of the features vs. the label with a line showing the output of the trained model\n", + "* a loss curve\n", + "\n", + "For this exercise, the plotting functions are provided for you. Unless you are interested, it is not important for you to understand how these plotting functions work.\n", + "\n", + "**Instructions**\n", + "1. Run the **Define plotting functions** code cell." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "EE7nBxoMUtE9" + }, + "outputs": [], + "source": [ + "#@title Define plotting functions\n", + "\n", + "def make_plots(df, feature_names, label_name, model_output, sample_size=200):\n", + "\n", + " random_sample = df.sample(n=sample_size).copy()\n", + " random_sample.reset_index()\n", + " weights, bias, epochs, rmse = model_output\n", + "\n", + " is_2d_plot = len(feature_names) == 1\n", + " model_plot_type = \"scatter\" if is_2d_plot else \"surface\"\n", + " fig = make_subplots(rows=1, cols=2,\n", + " subplot_titles=(\"Loss Curve\", \"Model Plot\"),\n", + " specs=[[{\"type\": \"scatter\"}, {\"type\": model_plot_type}]])\n", + "\n", + " plot_data(random_sample, feature_names, label_name, fig)\n", + " plot_model(random_sample, feature_names, weights, bias, fig)\n", + " plot_loss_curve(epochs, rmse, fig)\n", + "\n", + " fig.show()\n", + " return\n", + "\n", + "def plot_loss_curve(epochs, rmse, fig):\n", + " curve = px.line(x=epochs, y=rmse)\n", + " curve.update_traces(line_color='#ff0000', line_width=3)\n", + "\n", + " fig.append_trace(curve.data[0], row=1, col=1)\n", + " fig.update_xaxes(title_text=\"Epoch\", row=1, col=1)\n", + " fig.update_yaxes(title_text=\"Root Mean Squared Error\", row=1, col=1, range=[rmse.min()*0.8, rmse.max()])\n", + "\n", + " return\n", + "\n", + "def plot_data(df, features, label, fig):\n", + " if len(features) == 1:\n", + " scatter = px.scatter(df, x=features[0], y=label)\n", + " else:\n", + " scatter = px.scatter_3d(df, x=features[0], y=features[1], z=label)\n", + "\n", + " fig.append_trace(scatter.data[0], row=1, col=2)\n", + " if len(features) == 1:\n", + " fig.update_xaxes(title_text=features[0], row=1, col=2)\n", + " fig.update_yaxes(title_text=label, row=1, col=2)\n", + " else:\n", + " fig.update_layout(scene1=dict(xaxis_title=features[0], yaxis_title=features[1], zaxis_title=label))\n", + "\n", + " return\n", + "\n", + "def 
plot_model(df, features, weights, bias, fig):\n", + " df['FARE_PREDICTED'] = bias[0]\n", + "\n", + " for index, feature in enumerate(features):\n", + " df['FARE_PREDICTED'] = df['FARE_PREDICTED'] + weights[index][0] * df[feature]\n", + "\n", + " if len(features) == 1:\n", + " model = px.line(df, x=features[0], y='FARE_PREDICTED')\n", + " model.update_traces(line_color='#ff0000', line_width=3)\n", + " else:\n", + " z_name, y_name = \"FARE_PREDICTED\", features[1]\n", + " z = [df[z_name].min(), (df[z_name].max() - df[z_name].min()) / 2, df[z_name].max()]\n", + " y = [df[y_name].min(), (df[y_name].max() - df[y_name].min()) / 2, df[y_name].max()]\n", + " x = []\n", + " for i in range(len(y)):\n", + " x.append((z[i] - weights[1][0] * y[i] - bias[0]) / weights[0][0])\n", + "\n", + " plane=pd.DataFrame({'x':x, 'y':y, 'z':[z] * 3})\n", + "\n", + " light_yellow = [[0, '#89CFF0'], [1, '#FFDB58']]\n", + " model = go.Figure(data=go.Surface(x=plane['x'], y=plane['y'], z=plane['z'],\n", + " colorscale=light_yellow))\n", + "\n", + " fig.add_trace(model.data[0], row=1, col=2)\n", + "\n", + " return\n", + "\n", + "def model_info(feature_names, label_name, model_output):\n", + " weights = model_output[0]\n", + " bias = model_output[1]\n", + "\n", + " nl = \"\\n\"\n", + " header = \"-\" * 80\n", + " banner = header + nl + \"|\" + \"MODEL INFO\".center(78) + \"|\" + nl + header\n", + "\n", + " info = \"\"\n", + " equation = label_name + \" = \"\n", + "\n", + " for index, feature in enumerate(feature_names):\n", + " info = info + \"Weight for feature[{}]: {:.3f}\\n\".format(feature, weights[index][0])\n", + " equation = equation + \"{:.3f} * {} + \".format(weights[index][0], feature)\n", + "\n", + " info = info + \"Bias: {:.3f}\\n\".format(bias[0])\n", + " equation = equation + \"{:.3f}\\n\".format(bias[0])\n", + "\n", + " return banner + nl + info + nl + equation\n", + "\n", + "print(\"SUCCESS: defining plotting functions complete.\")" + ] + }, + { + "cell_type": "markdown", + 
"metadata": { + "id": "iRluiQhNvTwc" + }, + "source": [ + "## Define functions to build and train a model\n", + "\n", + "The code you need to build and train your model is in the **Define ML functions** code cell. If you would like to explore this code, expand the code cell and take a look.\n", + "\n", + "**Instructions**\n", + "1. Run the **Define ML functions** code cell." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "W6a7dtcCob-n" + }, + "outputs": [], + "source": [ + "#@title Code - Define ML functions\n", + "\n", + "def build_model(my_learning_rate, num_features):\n", + " \"\"\"Create and compile a simple linear regression model.\"\"\"\n", + " # Most simple keras models are sequential.\n", + " model = keras.models.Sequential()\n", + "\n", + " # Describe the topography of the model.\n", + " # The topography of a simple linear regression model\n", + " # is a single node in a single layer.\n", + " model.add(keras.layers.Dense(units=1,\n", + " input_shape=(num_features,)))\n", + "\n", + " # Compile the model topography into code that Keras can efficiently\n", + " # execute. 
Configure training to minimize the model's mean squared error.\n", + " model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=my_learning_rate),\n", + " loss=\"mean_squared_error\",\n", + " metrics=[keras.metrics.RootMeanSquaredError()])\n", + "\n", + " return model\n", + "\n", + "\n", + "def train_model(model, df, features, label, epochs, batch_size):\n", + " \"\"\"Train the model by feeding it data.\"\"\"\n", + "\n", + " # Feed the model the feature and the label.\n", + " # The model will train for the specified number of epochs.\n", + " # input_x = df.iloc[:,1:3].values\n", + " # df[feature]\n", + " history = model.fit(x=features,\n", + " y=label,\n", + " batch_size=batch_size,\n", + " epochs=epochs)\n", + "\n", + " # Gather the trained model's weight and bias.\n", + " trained_weight = model.get_weights()[0]\n", + " trained_bias = model.get_weights()[1]\n", + "\n", + " # The list of epochs is stored separately from the rest of history.\n", + " epochs = history.epoch\n", + "\n", + " # Isolate the error for each epoch.\n", + " hist = pd.DataFrame(history.history)\n", + "\n", + " # To track the progression of training, we're going to take a snapshot\n", + " # of the model's root mean squared error at each epoch.\n", + " rmse = hist[\"root_mean_squared_error\"]\n", + "\n", + " return trained_weight, trained_bias, epochs, rmse\n", + "\n", + "\n", + "def run_experiment(df, feature_names, label_name, learning_rate, epochs, batch_size):\n", + "\n", + " print('INFO: starting training experiment with features={} and label={}\\n'.format(feature_names, label_name))\n", + "\n", + " num_features = len(feature_names)\n", + "\n", + " features = df.loc[:, feature_names].values\n", + " label = df[label_name].values\n", + "\n", + " model = build_model(learning_rate, num_features)\n", + " model_output = train_model(model, df, features, label, epochs, batch_size)\n", + "\n", + " print('\\nSUCCESS: training experiment complete\\n')\n", + " 
print('{}'.format(model_info(feature_names, label_name, model_output)))\n", + " make_plots(df, feature_names, label_name, model_output)\n", + "\n", + " return model\n", + "\n", + "print(\"SUCCESS: defining linear regression functions complete.\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "m3DQCE2OpH4-" + }, + "source": [ + "## Train a model with one feature\n", + "\n", + "In this step you will train a model to predict the cost of the fare using a **single feature**. Earlier, you saw that `TRIP_MILES` (distance) correlates most strongly with the ``FARE``, so let's start with `TRIP_MILES` as the feature for your first training run.\n", + "\n", + "**Instructions**\n", + "\n", + "1. Run the **Experiment 1** code cell to build your model with one feature.\n", + "1. Review the output from the training run.\n", + "1. **Check your understanding** by answering these questions:\n", + " * How many epochs did it take to converge on the final model?\n", + " * How well does the model fit the sample data?\n", + "\n", + "During training, you should see the root mean squared error (RMSE) in the output. The units for RMSE are the same as the units for the label (dollars). In other words, you can use the RMSE to determine how far off, on average, the predicted fares are in dollars from the observed values." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "F_17Aum6IG1F" + }, + "outputs": [], + "source": [ + "#@title Code - Experiment 1\n", + "\n", + "# The following variables are the hyperparameters.\n", + "learning_rate = 0.001\n", + "epochs = 20\n", + "batch_size = 50\n", + "\n", + "# Specify the feature and the label.\n", + "features = ['TRIP_MILES']\n", + "label = 'FARE'\n", + "\n", + "model_1 = run_experiment(training_df, features, label, learning_rate, epochs, batch_size)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "y8Qnmb0wZ_pQ" + }, + "outputs": [], + "source": [ + "#@title Double-click to view answers for training model with one feature\n", + "\n", + "# How many epochs did it take to converge on the final model?\n", + "# -----------------------------------------------------------------------------\n", + "answer = \"\"\"\n", + "Use the loss curve to see where the loss begins to level off during training.\n", + "\n", + "With this set of hyperparameters:\n", + "\n", + " learning_rate = 0.001\n", + " epochs = 20\n", + " batch_size = 50\n", + "\n", + "it takes about 5 epochs for the training run to converge to the final model.\n", + "\"\"\"\n", + "print(answer)\n", + "\n", + "# How well does the model fit the sample data?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "It appears from the model plot that the model fits the sample data fairly well.\n", + "'''\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MYmWW0a9p1ro" + }, + "source": [ + "## Experiment with hyperparameters\n", + "\n", + "It is common with machine learning to run multiple experiments to find the best set of hyperparameters to train your model. 
In this step, try varying the hyperparameters one by one with this set of experiments:\n", + "\n", + "* *Experiment 1:* **Increase** the learning rate to **``1``** (batch size at ``50``).\n", + "* *Experiment 2:* **Decrease** the learning rate to **``0.0001``** (batch size at ``50``).\n", + "* *Experiment 3:* **Increase** the batch size to **``500``** (learning rate at ``0.001``).\n", + "\n", + "**Instructions**\n", + "1. Update the hyperparameter values in the **Experiment 2** code cell according to the experiment.\n", + "2. Run the **Experiment 2** code cell.\n", + "3. After the training run, examine the output and note any differences you see in the loss curve or model output.\n", + "4. Repeat steps 1 - 3 for each hyperparameter experiment.\n", + "5. **Check your understanding** by answering these questions:\n", + " * How did raising the learning rate impact your ability to train the model?\n", + " * How did lowering the learning rate impact your ability to train the model?\n", + " * Did changing the batch size affect your training results?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "PdUXEm1xeWcK" + }, + "outputs": [], + "source": [ + "#@title Code - Experiment 2\n", + "\n", + "# The following variables are the hyperparameters.\n", + "# TODO - Adjust these hyperparameters to see how they impact a training run.\n", + "learning_rate = 0.001\n", + "epochs = 20\n", + "batch_size = 50\n", + "\n", + "# Specify the feature and the label.\n", + "features = ['TRIP_MILES']\n", + "label = 'FARE'\n", + "\n", + "model_1 = run_experiment(training_df, features, label, learning_rate, epochs, batch_size)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "Od7vJJpHiHYB" + }, + "outputs": [], + "source": [ + "#@title Double-click to view answers for hyperparameter experiments\n", + "\n", + "# How did raising the learning rate impact your ability to train 
the model?\n", + "# -----------------------------------------------------------------------------\n", + "answer = \"\"\"\n", + "When the learning rate is too high, the loss curve bounces around and does not\n", + "appear to be moving towards convergence with each iteration. Also, notice that\n", + "the predicted model does not fit the data very well. With a learning rate that\n", + "is too high, it is unlikely that you will be able to train a model with good\n", + "results.\n", + "\"\"\"\n", + "print(answer)\n", + "\n", + "# How did lowering the learning rate impact your ability to train the model?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "When the learning rate is too small, it may take longer for the loss curve to\n", + "converge. With a small learning rate the loss curve decreases slowly, but does\n", + "not show a dramatic drop or leveling off. With a small learning rate you could\n", + "increase the number of epochs so that your model will eventually converge, but\n", + "it will take longer.\n", + "'''\n", + "print(answer)\n", + "\n", + "# Did changing the batch size affect your training results?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "Increasing the batch size makes each epoch run faster, but as with the smaller\n", + "learning rate, the model does not converge with just 20 epochs. If you have\n", + "time, try increasing the number of epochs and eventually you should see the\n", + "model converge.\n", + "'''\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o27u0JRj_gJr" + }, + "source": [ + "## Train a model with two features\n", + "\n", + "The model you trained with the feature ``TRIP_MILES`` demonstrates fairly strong predictive power, but is it possible to do better? 
In this step, try training the model with two features, ``TRIP_MILES`` and ``TRIP_MINUTES``, to see if you can improve the model. You may recall that the original dataset does not include a feature ``TRIP_MINUTES``, but this feature can be easily derived from ``TRIP_SECONDS`` as shown in the code below.\n", + "\n", + "**Instructions**\n", + "1. Review the code in the **Experiment 3** code cell.\n", + "1. Run the **Experiment 3** code cell.\n", + "1. Review the output from the training run and answer these questions:\n", + " * Does the model with two features produce better results than one using a single feature?\n", + " * Does it make a difference if you use ``TRIP_SECONDS`` instead of ``TRIP_MINUTES``?\n", + " * How close do you think the model comes to the ground truth fare calculation for Chicago Taxi Trips?\n", + "\n", + "\n", + "Notice that the scatter plot of the features vs. the label is a three-dimensional (3-D) plot. This representation allows you to visualize both features and the label all together. The two features (TRIP_MILES and TRIP_MINUTES) are on the x and y axes, and the label (FARE) is on the z axis. The plot shows individual examples in the dataset as circles, and the model as a surface (plane). With this 3-D model, if the trained model is good you would expect most of the examples to land on the plane surface. 
The 3-D plot is interactive so you can explore the data further by clicking or dragging the plot.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "Mg3gUYOoBAtd" + }, + "outputs": [], + "source": [ + "#@title Code - Experiment 3\n", + "\n", + "# The following variables are the hyperparameters.\n", + "learning_rate = 0.001\n", + "epochs = 20\n", + "batch_size = 50\n", + "\n", + "training_df['TRIP_MINUTES'] = training_df['TRIP_SECONDS']/60\n", + "\n", + "features = ['TRIP_MILES', 'TRIP_MINUTES']\n", + "label = 'FARE'\n", + "\n", + "model_2 = run_experiment(training_df, features, label, learning_rate, epochs, batch_size)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "uFkKK5t33xSX" + }, + "outputs": [], + "source": [ + "#@title Double-click to view answers for training with two features\n", + "\n", + "# Does the model with two features produce better results than one using a\n", + "# single feature?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "To answer this question for your specific training runs, compare the RMSE for\n", + "each model. For example, if the RMSE for the model trained with one feature was\n", + "3.7457 and the RMSE for the model with two features is 3.4787, that means that\n", + "on average the model with two features makes predictions that are about $0.27\n", + "closer to the observed fare.\n", + "\n", + "'''\n", + "print(answer)\n", + "\n", + "# Does it make a difference if you use TRIP_SECONDS instead of TRIP_MINUTES?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "When training a model with more than one feature, it is important that all\n", + "numeric values are roughly on the same scale. In this case, TRIP_SECONDS and\n", + "TRIP_MILES do not meet this criterion. 
The mean value for TRIP_MILES is 8.3 and\n", + "the mean for TRIP_SECONDS is 1320; that is two orders of magnitude difference.\n", + "Converting the trip duration to minutes helps during training because it puts\n", + "values for both features on a more comparable scale. Of course, this is not the\n", + "only way to scale values before training, but you will learn about that in\n", + "another module.\n", + "\n", + "'''\n", + "print(answer)\n", + "\n", + "# How close do you think the model comes to the ground truth fare calculation for\n", + "# Chicago taxi trips?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "In reality, Chicago taxi cabs use a documented formula to determine cab fares.\n", + "For a single passenger paying cash, the fare is calculated like this:\n", + "\n", + "FARE = 2.25 * TRIP_MILES + 0.12 * TRIP_MINUTES + 3.25\n", + "\n", + "Typically with machine learning problems you would not know the 'correct'\n", + "formula, but in this case you can use this knowledge to evaluate your model. Take a\n", + "look at your model output (the weights and bias) and determine how well it\n", + "matches the ground truth fare calculation. You should find that the model is\n", + "roughly close to this formula.\n", + "'''\n", + "print(answer)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MzMfgxldSMGK" + }, + "source": [ + "# Part 4 - Validate Model\n", + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_yW7nVxlO1WY" + }, + "source": [ + "## Use the model to make predictions\n", + "\n", + "Now that you have a trained model, you can use the model to make predictions. In practice, you should make predictions on examples that are not used during training. However, for this exercise, you'll just work with a subset of the same training dataset. 
In another Colab exercise you will explore ways to make predictions on examples not used in training.\n", + "\n", + "**Instructions**\n", + "\n", + "1. Run the **Define functions to make predictions** code cell.\n", + "1. Run the **Make predictions** code cell.\n", + "1. Review the predictions in the output.\n", + "1. **Check your understanding** by answering these questions:\n", + " * How close is the predicted value to the label value? In other words, does your model accurately predict the fare for a taxi ride?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "XdNxv3j8PGnr" + }, + "outputs": [], + "source": [ + "#@title Code - Define functions to make predictions\n", + "def format_currency(x):\n", + " return \"${:.2f}\".format(x)\n", + "\n", + "def build_batch(df, batch_size):\n", + " batch = df.sample(n=batch_size).copy()\n", + " batch.set_index(np.arange(batch_size), inplace=True)\n", + " return batch\n", + "\n", + "def predict_fare(model, df, features, label, batch_size=50):\n", + " batch = build_batch(df, batch_size)\n", + " predicted_values = model.predict_on_batch(x=batch.loc[:, features].values)\n", + "\n", + " data = {\"PREDICTED_FARE\": [], \"OBSERVED_FARE\": [], \"L1_LOSS\": [],\n", + " features[0]: [], features[1]: []}\n", + " for i in range(batch_size):\n", + " predicted = predicted_values[i][0]\n", + " observed = batch.at[i, label]\n", + " data[\"PREDICTED_FARE\"].append(format_currency(predicted))\n", + " data[\"OBSERVED_FARE\"].append(format_currency(observed))\n", + " data[\"L1_LOSS\"].append(format_currency(abs(observed - predicted)))\n", + " data[features[0]].append(batch.at[i, features[0]])\n", + " data[features[1]].append(\"{:.2f}\".format(batch.at[i, features[1]]))\n", + "\n", + " output_df = pd.DataFrame(data)\n", + " return output_df\n", + "\n", + "def show_predictions(output):\n", + " header = \"-\" * 80\n", + " banner = header + \"\\n\" + \"|\" + \"PREDICTIONS\".center(78) + 
\"|\" + \"\\n\" + header\n", + " print(banner)\n", + " print(output)\n", + " return" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "PK3oO2kYV8m0" + }, + "outputs": [], + "source": [ + "#@title Code - Make predictions\n", + "\n", + "output = predict_fare(model_2, training_df, features, label)\n", + "show_predictions(output)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "cellView": "form", + "id": "6sjix7lXI7xT" + }, + "outputs": [], + "source": [ + "#@title Double-click to view answers for validate model\n", + "\n", + "# How close is the predicted value to the label value?\n", + "# -----------------------------------------------------------------------------\n", + "answer = '''\n", + "Based on a random sampling of examples, the model seems to do pretty well\n", + "predicting the fare for a taxi ride. Most of the predicted values do not vary\n", + "significantly from the observed value. You should be able to see this by looking\n", + "at the column L1_LOSS = |observed - predicted|.\n", + "'''\n", + "print(answer)" + ] + } + ], + "metadata": { + "colab": { + "provenance": [], + "collapsed_sections": [ + "sgR4YRjj5T-b" + ] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/ml/cc/exercises/numerical_data_bad_values.ipynb b/ml/cc/exercises/numerical_data_bad_values.ipynb new file mode 100644 index 0000000..52622c2 --- /dev/null +++ b/ml/cc/exercises/numerical_data_bad_values.ipynb @@ -0,0 +1,1987 @@ +{ + "cells": [ + { + "cell_type": "code", + "source": [ + "#@title Copyright 2023 Google LLC. 
Double-click for license information.\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ], + "metadata": { + "cellView": "form", + "id": "-o-X5uIc7WJS" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-j78VQT0Jq5H" + }, + "source": [ + "# Colabs\n", + "\n", + "Machine Learning Crash Course uses Colaboratory (Colab) notebooks for all programming exercises. Colab is Google's implementation of [Jupyter Notebook](https://jupyter.org/). For more information about Colabs and how to use them, go to [Welcome to Colaboratory](https://research.google.com/colaboratory).\n", + "\n", + "# Numerical data - Find the bad part of the dataset\n", + "\n", + "This Colab programming exercise (final of two) is part of the Machine Learning Crash Course module [Working with numerical data](https://developers.google.com/machine-learning/crash-course/numerical-data)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aKs82rqT1Mfe" + }, + "source": [ + "## What to expect\n", + "\n", + "In the section, [First steps with numerical data](https://developers.google.com/machine-learning/crash-course/numerical-data/first-steps), you learned how to visualize your data in plots or graphs, evaluate potential features and labels mathematically, and find [**outliers**](https://developers.google.com/machine-learning/glossary/#outliers) in the dataset.\n", + "\n", + "This exercise (final of two) guides you through visual and mathematical ways to find hidden _bad_ values in a dataset. You'll use scatter plots and basic statistics to locate unreliable regions of a dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JJZEgJQSjyK4" + }, + "source": [ + "## The Dataset\n", + "\n", + "The dataset for this exercise is fictitious. It contains 1,400 rows. Each row contains the following two columns:\n", + "\n", + "* **calories**, which is the number of breakfast calories as determined by a nutritionist.\n", + "* **test_score**, which is the student's score on a math test. A reliable program determines the value of `test_score`.\n", + "\n", + "The data pool consists of 50 students, each evaluated on 28 consecutive days. So, rows 0-49 show how each of the students performed on the first day, rows 50-99 show how each of the students performed on the second day, and so on.\n", + "\n", + "The dataset aims to determine the relationship between the number of calories a student ate for breakfast and their test scores. However, your goal is simply to determine whether some values in the dataset are bad." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9n9_cTveKmse", + "cellView": "form" + }, + "outputs": [], + "source": [ + "# @title Setup - Import relevant modules\n", + "\n", + "# The following code imports relevant modules that\n", + "# allow you to run the colab.\n", + "# - If you encounter technical issues running some of the code sections\n", + "# that follow, try running this section again.\n", + "\n", + "import pandas as pd\n", + "from matplotlib import pyplot as plt\n", + "import io\n", + "\n", + "# The following lines adjust the granularity of reporting.\n", + "pd.options.display.max_rows = 10\n", + "pd.options.display.float_format = \"{:.1f}\".format" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "STcOxRtjlwGP", + "cellView": "form" + }, + "outputs": [], + "source": [ + "# @title Setup - Define the dataset\n", + "\n", + "# The following code defines the dataset that\n", + "# you'll use in the colab.\n", + "# - If you encounter technical issues running some of the code sections\n", + "# that follow, try running this section again.\n", + "\n", + "dataset = '''calories,test_score\n", + "201,76\n", + "142,72\n", + "397,84\n", + "294,79\n", + "334,76\n", + "173,60\n", + "117,59\n", + "174,60\n", + "333,80\n", + "383,83\n", + "77,59\n", + "39,62\n", + "242,73\n", + "7,51\n", + "140,63\n", + "73,66\n", + "102,61\n", + "5,48\n", + "339,88\n", + "388,92\n", + "374,86\n", + "358,81\n", + "292,76\n", + "167,70\n", + "300,80\n", + "348,77\n", + "68,49\n", + "1,55\n", + "198,63\n", + "154,74\n", + "220,75\n", + "9,58\n", + "129,56\n", + "108,55\n", + "356,81\n", + "383,96\n", + "299,86\n", + "107,62\n", + "117,67\n", + "103,54\n", + "110,56\n", + "80,57\n", + "125,57\n", + "132,71\n", + "176,73\n", + "390,98\n", + "199,63\n", + "5,56\n", + "99,56\n", + "39,49\n", + "251,67\n", + "71,57\n", + "7,52\n", + "28,49\n", + "139,61\n", + "227,77\n", + "268,82\n", + "113,70\n", + "32,57\n", + 
"1,51\n", + "217,64\n", + "248,67\n", + "76,60\n", + "388,90\n", + "9,43\n", + "315,86\n", + "49,47\n", + "97,53\n", + "251,72\n", + "224,65\n", + "299,75\n", + "325,81\n", + "45,47\n", + "203,67\n", + "326,83\n", + "329,82\n", + "282,79\n", + "203,61\n", + "117,63\n", + "218,68\n", + "262,75\n", + "73,64\n", + "205,69\n", + "54,51\n", + "296,83\n", + "132,63\n", + "16,45\n", + "363,80\n", + "138,62\n", + "181,64\n", + "13,49\n", + "294,86\n", + "374,85\n", + "338,79\n", + "375,87\n", + "260,80\n", + "375,93\n", + "234,74\n", + "103,64\n", + "322,75\n", + "210,64\n", + "280,79\n", + "110,65\n", + "329,87\n", + "94,55\n", + "399,88\n", + "264,81\n", + "88,61\n", + "34,48\n", + "373,81\n", + "268,80\n", + "333,87\n", + "208,71\n", + "109,68\n", + "142,57\n", + "39,63\n", + "84,66\n", + "263,78\n", + "247,74\n", + "172,74\n", + "303,79\n", + "92,60\n", + "107,62\n", + "49,54\n", + "293,73\n", + "238,72\n", + "341,85\n", + "48,54\n", + "25,52\n", + "189,69\n", + "230,74\n", + "206,78\n", + "190,63\n", + "237,68\n", + "305,72\n", + "22,55\n", + "223,70\n", + "62,62\n", + "25,50\n", + "115,67\n", + "220,81\n", + "123,62\n", + "210,72\n", + "39,57\n", + "13,53\n", + "88,63\n", + "56,62\n", + "285,83\n", + "50,51\n", + "369,94\n", + "174,60\n", + "206,63\n", + "236,77\n", + "370,90\n", + "66,53\n", + "178,72\n", + "167,71\n", + "133,71\n", + "157,64\n", + "298,78\n", + "1,58\n", + "99,55\n", + "324,91\n", + "389,97\n", + "107,68\n", + "371,93\n", + "36,55\n", + "365,90\n", + "131,71\n", + "132,63\n", + "149,70\n", + "126,68\n", + "254,78\n", + "294,71\n", + "75,57\n", + "224,67\n", + "221,78\n", + "219,71\n", + "307,81\n", + "205,64\n", + "209,70\n", + "382,90\n", + "396,98\n", + "118,60\n", + "367,89\n", + "367,83\n", + "203,65\n", + "33,49\n", + "395,97\n", + "154,60\n", + "205,79\n", + "188,63\n", + "345,76\n", + "258,80\n", + "258,70\n", + "353,79\n", + "282,86\n", + "356,91\n", + "367,81\n", + "91,67\n", + "183,82\n", + "47,58\n", + "118,72\n", + "157,83\n", + 
"70,58\n", + "172,93\n", + "45,58\n", + "28,50\n", + "92,74\n", + "187,80\n", + "10,60\n", + "76,70\n", + "174,81\n", + "111,79\n", + "91,75\n", + "168,83\n", + "29,48\n", + "181,90\n", + "122,80\n", + "63,71\n", + "193,85\n", + "186,82\n", + "145,71\n", + "77,62\n", + "193,87\n", + "100,61\n", + "71,71\n", + "154,78\n", + "52,59\n", + "46,65\n", + "66,55\n", + "62,62\n", + "51,57\n", + "41,67\n", + "24,57\n", + "186,80\n", + "4,59\n", + "77,70\n", + "29,63\n", + "28,49\n", + "61,65\n", + "26,58\n", + "127,83\n", + "199,97\n", + "68,61\n", + "145,84\n", + "102,62\n", + "50,67\n", + "93,70\n", + "169,85\n", + "119,67\n", + "312,78\n", + "341,78\n", + "67,66\n", + "17,48\n", + "45,52\n", + "24,48\n", + "47,58\n", + "102,55\n", + "280,82\n", + "220,73\n", + "192,61\n", + "269,83\n", + "373,81\n", + "160,71\n", + "290,84\n", + "264,76\n", + "48,55\n", + "269,75\n", + "204,76\n", + "282,80\n", + "112,68\n", + "390,87\n", + "270,74\n", + "247,72\n", + "16,46\n", + "363,89\n", + "133,65\n", + "185,64\n", + "331,92\n", + "16,53\n", + "310,87\n", + "66,48\n", + "128,71\n", + "160,72\n", + "308,76\n", + "225,64\n", + "34,56\n", + "222,77\n", + "392,85\n", + "97,53\n", + "222,68\n", + "302,87\n", + "378,85\n", + "61,61\n", + "161,63\n", + "129,57\n", + "43,54\n", + "116,62\n", + "173,71\n", + "189,65\n", + "361,80\n", + "187,77\n", + "314,73\n", + "341,91\n", + "165,59\n", + "44,63\n", + "184,75\n", + "341,82\n", + "119,55\n", + "70,63\n", + "165,68\n", + "394,92\n", + "80,54\n", + "65,61\n", + "29,55\n", + "64,65\n", + "310,83\n", + "384,91\n", + "304,88\n", + "216,66\n", + "6,53\n", + "55,57\n", + "249,84\n", + "395,81\n", + "41,50\n", + "334,86\n", + "394,97\n", + "100,56\n", + "125,61\n", + "14,49\n", + "61,53\n", + "143,72\n", + "373,86\n", + "238,77\n", + "138,59\n", + "388,88\n", + "357,79\n", + "20,54\n", + "82,52\n", + "261,75\n", + "210,73\n", + "15,52\n", + "210,69\n", + "364,92\n", + "365,95\n", + "132,55\n", + "143,68\n", + "40,50\n", + "88,64\n", + "35,55\n", + 
"18,44\n", + "56,47\n", + "109,55\n", + "268,78\n", + "178,67\n", + "399,82\n", + "11,45\n", + "235,74\n", + "215,73\n", + "200,69\n", + "220,78\n", + "249,82\n", + "250,79\n", + "121,68\n", + "210,71\n", + "165,70\n", + "265,79\n", + "290,87\n", + "384,88\n", + "207,69\n", + "16,54\n", + "193,62\n", + "307,75\n", + "292,76\n", + "312,78\n", + "345,83\n", + "67,56\n", + "180,75\n", + "126,69\n", + "299,86\n", + "143,60\n", + "251,66\n", + "371,86\n", + "25,59\n", + "27,56\n", + "159,59\n", + "125,69\n", + "114,56\n", + "66,50\n", + "240,82\n", + "184,68\n", + "196,74\n", + "21,44\n", + "254,73\n", + "364,86\n", + "127,59\n", + "347,93\n", + "157,71\n", + "161,65\n", + "315,83\n", + "57,50\n", + "276,79\n", + "69,49\n", + "300,79\n", + "83,56\n", + "199,75\n", + "367,95\n", + "247,83\n", + "15,49\n", + "136,72\n", + "341,87\n", + "129,67\n", + "284,82\n", + "248,70\n", + "68,55\n", + "320,81\n", + "280,80\n", + "36,62\n", + "272,85\n", + "171,76\n", + "161,74\n", + "307,77\n", + "365,84\n", + "159,58\n", + "299,86\n", + "45,58\n", + "244,73\n", + "215,68\n", + "60,50\n", + "259,75\n", + "269,76\n", + "382,88\n", + "61,48\n", + "185,74\n", + "40,61\n", + "373,87\n", + "326,86\n", + "373,93\n", + "77,55\n", + "358,94\n", + "70,48\n", + "303,87\n", + "220,68\n", + "85,64\n", + "224,81\n", + "94,55\n", + "167,63\n", + "329,86\n", + "137,60\n", + "246,76\n", + "112,62\n", + "22,47\n", + "99,53\n", + "58,55\n", + "170,61\n", + "196,73\n", + "105,58\n", + "241,80\n", + "259,84\n", + "248,71\n", + "357,92\n", + "262,75\n", + "105,60\n", + "37,58\n", + "347,88\n", + "106,54\n", + "102,57\n", + "184,70\n", + "166,68\n", + "63,64\n", + "301,85\n", + "306,72\n", + "44,52\n", + "331,90\n", + "159,67\n", + "72,53\n", + "208,77\n", + "284,83\n", + "168,74\n", + "198,66\n", + "291,84\n", + "218,80\n", + "74,49\n", + "279,82\n", + "244,83\n", + "263,74\n", + "287,79\n", + "194,77\n", + "359,84\n", + "364,85\n", + "391,82\n", + "278,78\n", + "13,51\n", + "111,60\n", + "169,72\n", + 
"339,78\n", + "213,69\n", + "95,65\n", + "159,58\n", + "214,64\n", + "3,47\n", + "234,77\n", + "332,75\n", + "308,87\n", + "196,73\n", + "95,59\n", + "350,77\n", + "29,60\n", + "220,69\n", + "187,77\n", + "50,52\n", + "91,64\n", + "326,82\n", + "236,70\n", + "247,70\n", + "174,70\n", + "213,63\n", + "184,68\n", + "79,53\n", + "121,67\n", + "363,89\n", + "149,72\n", + "275,77\n", + "320,77\n", + "319,80\n", + "128,54\n", + "319,83\n", + "361,88\n", + "49,57\n", + "374,92\n", + "333,83\n", + "188,68\n", + "242,82\n", + "376,93\n", + "107,58\n", + "282,72\n", + "0,42\n", + "26,58\n", + "209,79\n", + "58,55\n", + "182,68\n", + "227,68\n", + "48,54\n", + "347,82\n", + "28,60\n", + "79,49\n", + "155,69\n", + "193,78\n", + "282,77\n", + "180,95\n", + "176,87\n", + "51,69\n", + "137,76\n", + "158,78\n", + "102,62\n", + "170,92\n", + "101,69\n", + "44,63\n", + "199,84\n", + "39,50\n", + "43,59\n", + "15,57\n", + "55,64\n", + "162,91\n", + "39,61\n", + "12,45\n", + "84,64\n", + "13,48\n", + "171,93\n", + "127,73\n", + "1,50\n", + "66,65\n", + "2,53\n", + "92,60\n", + "193,91\n", + "36,51\n", + "31,54\n", + "199,90\n", + "5,56\n", + "103,65\n", + "124,76\n", + "80,58\n", + "0,49\n", + "51,67\n", + "108,81\n", + "66,66\n", + "96,72\n", + "54,61\n", + "85,71\n", + "122,72\n", + "99,75\n", + "135,82\n", + "30,59\n", + "58,56\n", + "116,81\n", + "78,60\n", + "119,79\n", + "47,63\n", + "74,66\n", + "224,64\n", + "237,81\n", + "267,71\n", + "262,68\n", + "314,76\n", + "354,84\n", + "232,71\n", + "72,55\n", + "98,57\n", + "55,56\n", + "244,78\n", + "222,65\n", + "364,80\n", + "4,46\n", + "342,82\n", + "341,85\n", + "200,71\n", + "208,65\n", + "339,80\n", + "128,65\n", + "189,64\n", + "10,45\n", + "278,74\n", + "208,79\n", + "257,79\n", + "232,69\n", + "148,58\n", + "146,61\n", + "158,64\n", + "215,69\n", + "344,84\n", + "352,85\n", + "65,55\n", + "254,67\n", + "273,85\n", + "256,71\n", + "107,66\n", + "169,75\n", + "34,61\n", + "360,90\n", + "282,78\n", + "36,46\n", + "289,80\n", + 
"186,62\n", + "18,45\n", + "190,60\n", + "244,74\n", + "191,74\n", + "389,82\n", + "355,90\n", + "70,57\n", + "10,52\n", + "120,61\n", + "204,70\n", + "199,70\n", + "131,58\n", + "399,88\n", + "111,67\n", + "36,60\n", + "22,46\n", + "385,91\n", + "59,57\n", + "41,56\n", + "181,60\n", + "338,81\n", + "335,77\n", + "390,97\n", + "271,75\n", + "167,63\n", + "349,76\n", + "325,77\n", + "90,67\n", + "292,81\n", + "298,86\n", + "185,74\n", + "25,51\n", + "218,69\n", + "42,57\n", + "377,91\n", + "332,90\n", + "1,53\n", + "9,43\n", + "298,85\n", + "186,63\n", + "40,51\n", + "74,49\n", + "259,76\n", + "375,87\n", + "51,60\n", + "165,70\n", + "280,78\n", + "92,55\n", + "316,79\n", + "358,92\n", + "43,58\n", + "294,74\n", + "199,76\n", + "121,57\n", + "311,88\n", + "205,79\n", + "87,64\n", + "3,59\n", + "128,55\n", + "183,69\n", + "339,91\n", + "101,63\n", + "181,68\n", + "361,83\n", + "371,94\n", + "76,49\n", + "252,67\n", + "102,55\n", + "1,53\n", + "234,80\n", + "217,64\n", + "351,92\n", + "360,95\n", + "336,81\n", + "19,50\n", + "353,76\n", + "154,59\n", + "263,84\n", + "249,76\n", + "118,65\n", + "187,70\n", + "277,72\n", + "293,71\n", + "220,76\n", + "289,72\n", + "250,71\n", + "136,66\n", + "96,55\n", + "284,73\n", + "270,82\n", + "238,75\n", + "347,77\n", + "322,83\n", + "13,50\n", + "79,57\n", + "33,58\n", + "11,58\n", + "34,58\n", + "376,90\n", + "242,79\n", + "351,82\n", + "57,51\n", + "22,50\n", + "226,77\n", + "228,68\n", + "253,71\n", + "363,86\n", + "144,68\n", + "55,55\n", + "98,59\n", + "373,86\n", + "85,51\n", + "128,61\n", + "332,81\n", + "59,57\n", + "55,47\n", + "351,81\n", + "96,51\n", + "309,88\n", + "323,80\n", + "105,59\n", + "290,73\n", + "377,82\n", + "352,80\n", + "276,71\n", + "251,77\n", + "224,81\n", + "277,86\n", + "141,66\n", + "143,64\n", + "111,63\n", + "253,71\n", + "354,94\n", + "122,67\n", + "358,79\n", + "86,52\n", + "222,80\n", + "130,65\n", + "53,52\n", + "318,88\n", + "219,72\n", + "221,67\n", + "191,66\n", + "82,64\n", + "35,57\n", + 
"176,65\n", + "110,53\n", + "252,71\n", + "21,60\n", + "12,43\n", + "63,52\n", + "85,64\n", + "118,64\n", + "215,70\n", + "27,52\n", + "142,66\n", + "387,80\n", + "334,82\n", + "147,56\n", + "154,67\n", + "285,84\n", + "371,80\n", + "358,89\n", + "144,56\n", + "283,76\n", + "44,61\n", + "298,79\n", + "318,74\n", + "74,54\n", + "229,68\n", + "308,89\n", + "391,95\n", + "377,93\n", + "117,60\n", + "385,92\n", + "391,94\n", + "244,66\n", + "350,87\n", + "272,72\n", + "268,77\n", + "293,86\n", + "188,68\n", + "171,63\n", + "250,79\n", + "239,68\n", + "389,88\n", + "92,67\n", + "129,57\n", + "348,89\n", + "278,84\n", + "160,69\n", + "245,66\n", + "275,77\n", + "249,74\n", + "332,81\n", + "388,90\n", + "158,65\n", + "60,62\n", + "325,87\n", + "216,72\n", + "31,55\n", + "397,86\n", + "271,76\n", + "192,70\n", + "59,50\n", + "195,72\n", + "383,81\n", + "267,81\n", + "279,76\n", + "135,64\n", + "362,85\n", + "37,50\n", + "384,84\n", + "49,60\n", + "49,60\n", + "115,56\n", + "199,68\n", + "203,65\n", + "326,75\n", + "333,92\n", + "153,62\n", + "248,72\n", + "212,69\n", + "38,49\n", + "361,87\n", + "302,86\n", + "381,94\n", + "43,62\n", + "79,55\n", + "274,74\n", + "44,47\n", + "252,79\n", + "188,62\n", + "357,84\n", + "94,65\n", + "143,66\n", + "60,49\n", + "26,56\n", + "193,65\n", + "363,84\n", + "322,87\n", + "120,67\n", + "248,73\n", + "312,89\n", + "298,81\n", + "142,73\n", + "261,85\n", + "272,72\n", + "49,58\n", + "249,74\n", + "204,79\n", + "34,58\n", + "69,53\n", + "180,71\n", + "210,70\n", + "59,53\n", + "47,58\n", + "39,66\n", + "192,97\n", + "177,91\n", + "109,69\n", + "177,82\n", + "67,66\n", + "107,71\n", + "70,58\n", + "185,93\n", + "49,53\n", + "126,74\n", + "96,64\n", + "37,48\n", + "152,79\n", + "67,72\n", + "177,86\n", + "13,61\n", + "123,67\n", + "154,75\n", + "81,69\n", + "93,73\n", + "66,72\n", + "61,67\n", + "53,58\n", + "152,81\n", + "112,72\n", + "8,43\n", + "159,77\n", + "9,53\n", + "133,76\n", + "110,75\n", + "25,59\n", + "144,83\n", + "155,81\n", + 
"1,45\n", + "50,60\n", + "6,55\n", + "116,70\n", + "117,82\n", + "135,74\n", + "93,64\n", + "190,90\n", + "55,65\n", + "189,85\n", + "60,64\n", + "183,95\n", + "75,57\n", + "135,70\n", + "156,80\n", + "276,83\n", + "298,72\n", + "146,70\n", + "339,76\n", + "46,53\n", + "0,55\n", + "355,87\n", + "53,61\n", + "75,56\n", + "233,67\n", + "336,92\n", + "342,89\n", + "378,85\n", + "319,84\n", + "216,75\n", + "72,52\n", + "232,81\n", + "156,74\n", + "281,75\n", + "398,95\n", + "312,90\n", + "285,83\n", + "228,79\n", + "288,85\n", + "145,61\n", + "0,45\n", + "159,74\n", + "383,85\n", + "121,69\n", + "28,59\n", + "263,72\n", + "72,52\n", + "70,60\n", + "73,64\n", + "9,47\n", + "320,87\n", + "155,66\n", + "273,70\n", + "187,67\n", + "372,93\n", + "324,74\n", + "13,46\n", + "1,43\n", + "263,73\n", + "28,57\n", + "102,64\n", + "390,88\n", + "386,82\n", + "45,50\n", + "10,44\n", + "216,79\n", + "17,59\n", + "221,78\n", + "398,81\n", + "95,51\n", + "340,78\n", + "29,62\n", + "145,57\n", + "80,49\n", + "196,65\n", + "299,86\n", + "42,60\n", + "158,58\n", + "392,89\n", + "130,56\n", + "299,88\n", + "18,59\n", + "4,52\n", + "337,77\n", + "290,73\n", + "58,59\n", + "136,59\n", + "270,83\n", + "161,75\n", + "376,97\n", + "100,65\n", + "341,83\n", + "250,70\n", + "213,73\n", + "344,93\n", + "287,83\n", + "18,58\n", + "290,71\n", + "229,76\n", + "66,53\n", + "344,81\n", + "392,84\n", + "254,80\n", + "4,44\n", + "298,72\n", + "58,49\n", + "342,77\n", + "227,72\n", + "184,63\n", + "33,52\n", + "112,69\n", + "101,57\n", + "310,74\n", + "152,64\n", + "303,83\n", + "301,82\n", + "385,95\n", + "370,91\n", + "285,70\n", + "341,85\n", + "41,47\n", + "311,89\n", + "127,62\n", + "174,74\n", + "264,79\n", + "125,70\n", + "213,74\n", + "148,63\n", + "303,87\n", + "392,91\n", + "205,77\n", + "139,65\n", + "273,73\n", + "116,62\n", + "169,60\n", + "71,58\n", + "141,72\n", + "236,77\n", + "163,58\n", + "221,66\n", + "303,89\n", + "89,57\n", + "258,77\n", + "326,83\n", + "181,73\n", + "97,68\n", + 
"255,77\n", + "218,71\n", + "345,93\n", + "16,48\n", + "77,53\n", + "368,88\n", + "69,57\n", + "129,66\n", + "257,82\n", + "346,83\n", + "383,81\n", + "320,84\n", + "315,86\n", + "89,54\n", + "252,75\n", + "124,60\n", + "65,65\n", + "365,78\n", + "363,94\n", + "168,60\n", + "112,60\n", + "259,75\n", + "384,97\n", + "100,64\n", + "382,81\n", + "87,66\n", + "124,54\n", + "249,68\n", + "341,92\n", + "134,65\n", + "210,76\n", + "386,92\n", + "287,87\n", + "328,85\n", + "376,85\n", + "333,78\n", + "319,82\n", + "313,88\n", + "223,66\n", + "230,75\n", + "103,66\n", + "256,79\n", + "368,84\n", + "309,88\n", + "129,58\n", + "187,75\n", + "388,89\n", + "257,79\n", + "210,70\n", + "168,66\n", + "325,78\n", + "318,84\n", + "336,87\n", + "233,73\n", + "374,81\n", + "93,56\n", + "23,59\n", + "390,82\n", + "97,64\n", + "385,92\n", + "386,89\n", + "209,63\n", + "30,49\n", + "335,85\n", + "295,77\n", + "70,58\n", + "247,67\n", + "107,62\n", + "347,91\n", + "72,63\n", + "46,53\n", + "41,48\n", + "114,54\n", + "145,67\n", + "373,86\n", + "124,53\n", + "387,97\n", + "18,46\n", + "85,67\n", + "344,93\n", + "142,56\n", + "243,67\n", + "87,65\n", + "258,70\n", + "290,87\n", + "399,96\n", + "137,58\n", + "48,59\n", + "157,67\n", + "148,64\n", + "9,56\n", + "175,60\n", + "290,81\n", + "100,52\n", + "233,66\n", + "41,46\n", + "100,64\n", + "49,56\n", + "235,80\n", + "308,79\n", + "322,89\n", + "148,60\n", + "33,55\n", + "397,91\n", + "75,57\n", + "6,47\n", + "102,59\n", + "166,75\n", + "60,59\n", + "54,62\n", + "352,80\n", + "353,79\n", + "380,81\n", + "283,81\n", + "395,91\n", + "69,64\n", + "184,69\n", + "64,59\n", + "98,56\n", + "92,60\n", + "291,75\n", + "343,93\n", + "322,78\n", + "29,45\n", + "17,46\n", + "362,78\n", + "179,73\n", + "64,56\n", + "4,59\n", + "59,49\n", + "267,73\n", + "4,44\n", + "388,85\n", + "311,78\n", + "257,69\n", + "346,82\n", + "188,73\n", + "185,66\n", + "41,55\n", + "179,63\n", + "340,81\n", + "112,58\n", + "99,61\n", + "364,82\n", + "253,67\n", + "145,61\n", 
+ "195,69\n", + "214,73\n", + "346,81\n", + "356,86\n", + "215,78\n", + "394,81\n", + "277,77\n", + "199,75\n", + "35,62\n", + "62,64\n", + "161,65\n", + "138,68\n", + "37,49\n", + "6,55\n", + "169,67\n", + "10,58\n", + "30,56\n", + "190,61\n", + "36,53\n", + "221,74\n", + "134,66\n", + "297,73\n", + "36,49\n", + "8,48\n", + "58,56\n", + "45,63\n", + "148,83\n", + "90,74\n", + "187,89\n", + "111,70\n", + "33,60\n", + "75,62\n", + "71,73\n", + "79,65\n", + "87,59\n", + "36,64\n", + "139,85\n", + "3,58\n", + "32,50\n", + "58,63\n", + "72,65\n", + "50,53\n", + "39,63\n", + "7,47\n", + "97,71\n", + "120,77\n", + "78,58\n", + "22,59\n", + "9,52\n", + "60,64\n", + "132,76\n", + "93,68\n", + "102,72\n", + "72,68\n", + "151,83\n", + "46,54\n", + "56,63\n", + "85,69\n", + "100,69\n", + "186,96\n", + "11,55\n", + "150,81\n", + "144,85\n", + "82,71\n", + "93,72\n", + "71,57\n", + "59,70\n", + "118,79\n", + "126,67\n", + "117,73\n", + "164,90\n", + "142,73\n", + "102,64\n", + "359,77\n", + "240,80\n", + "42,59\n", + "186,75\n", + "170,61\n", + "327,89\n", + "382,86\n", + "216,70\n", + "249,81\n", + "139,61\n", + "258,68\n", + "153,62\n", + "230,70\n", + "104,59\n", + "207,76\n", + "212,72\n", + "354,81\n", + "139,57\n", + "43,60\n", + "24,53\n", + "0,57\n", + "146,57\n", + "150,56\n", + "192,78\n", + "383,96\n", + "127,67\n", + "52,61\n", + "20,56\n", + "47,51\n", + "39,58\n", + "75,62\n", + "106,53\n", + "31,55\n", + "112,57\n", + "214,68\n", + "330,84\n", + "130,59\n", + "282,82\n", + "167,63\n", + "279,76\n", + "136,57\n", + "35,53\n", + "373,81\n", + "69,51\n", + "179,66\n", + "361,89\n", + "88,55\n", + "327,79\n", + "398,83\n", + "237,76\n", + "175,68\n", + "112,68\n", + "372,96\n", + "229,70\n", + "112,57\n", + "331,85\n", + "157,73\n", + "36,52\n", + "64,54\n", + "237,78\n", + "343,86\n", + "141,56\n", + "394,85\n", + "151,58\n", + "193,62\n", + "289,70\n", + "285,70\n", + "359,90\n", + "395,89\n", + "38,52\n", + "339,75\n", + "41,51\n", + "378,82\n", + "313,76\n", + 
"269,77\n", + "372,88\n", + "34,51\n", + "243,67\n", + "310,74\n", + "62,52\n", + "113,70\n", + "113,70\n", + "394,84\n", + "297,72\n", + "303,88\n", + "182,70\n", + "34,54\n", + "24,46\n", + "343,90\n", + "141,67\n", + "129,59\n", + "67,57\n", + "312,88\n", + "276,79\n", + "35,62\n", + "324,80\n", + "375,82\n", + "189,76\n", + "195,66\n", + "44,50\n", + "151,66\n", + "383,88\n", + "93,51\n", + "283,73\n", + "386,81\n", + "334,80\n", + "181,61\n", + "365,82\n", + "398,92\n", + "316,76\n", + "222,76\n", + "267,78\n", + "363,82\n", + "252,73\n", + "228,77\n", + "300,72\n", + "166,70\n", + "249,78\n", + "289,76\n", + "337,89\n", + "200,75\n", + "76,66\n", + "359,79\n", + "118,53\n", + "325,77\n", + "76,57\n", + "359,94\n", + "30,60\n", + "85,62\n", + "167,74\n", + "221,66\n", + "377,96\n", + "392,92\n", + "288,84\n", + "45,58\n", + "242,76\n", + "183,74\n", + "227,70\n", + "358,92\n", + "186,72\n", + "21,57\n", + "73,49\n", + "292,77\n", + "33,61\n", + "71,57\n", + "375,92\n", + "267,86\n", + "279,83\n", + "265,75\n", + "226,78\n", + "36,48\n", + "199,73\n", + "386,81\n", + "49,63\n", + "14,53\n", + "314,76\n", + "141,56\n", + "145,59\n", + "176,67\n", + "96,62\n", + "377,82\n", + "139,68\n", + "67,50\n", + "190,77\n", + "272,69\n", + "329,90\n", + "85,55\n", + "214,67\n", + "197,73\n", + "101,61\n", + "140,58\n", + "332,87\n", + "87,53\n", + "400,95\n", + "254,71\n", + "153,69\n", + "167,73\n", + "162,70\n", + "39,61\n", + "295,84\n", + "12,52\n", + "262,73\n", + "351,80\n", + "253,73\n", + "381,97\n", + "2,47\n", + "55,62\n", + "45,54\n", + "54,48\n", + "92,64\n", + "183,72\n", + "226,81\n", + "265,85\n", + "313,86\n", + "383,89\n", + "215,80\n", + "31,49\n", + "178,60\n", + "238,68\n", + "173,68\n", + "'''" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Z_sEoINguWL0" + }, + "source": [ + "## Load the dataset into a DataFrame\n", + "\n", + "The following code cell loads the dataset into a DataFrame named `training_df`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "h90COJuVl9W7" + }, + "outputs": [], + "source": [ + "training_df = pd.read_csv(io.StringIO(dataset), on_bad_lines='warn')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f9pcW_Yjtoo8" + }, + "source": [ + "### Task 1: Examine basic statistics\n", + "\n", + "The following line of code generates basic statistics on the dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Qxq3xKzvunDK" + }, + "outputs": [], + "source": [ + "training_df.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "LAjQF1w0uu9D" + }, + "source": [ + "Do the basic statistics imply outliers?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jcOqsLKR3cpF" + }, + "outputs": [], + "source": [ + "# @title Task 1: Solution (run this code block to view) { display-mode: \"form\" }\n", + "\n", + "print(\"\"\"The basic statistics do not suggest a lot of outliers.\n", + "The standard deviations are substantially less than the\n", + "means. Furthermore, the quartile boundaries are approximately\n", + "evenly spaced.\"\"\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ak_TMAzGOIFq" + }, + "source": [ + "## Define plotting functions\n", + "\n", + "The following code cell uses [matplotlib](https://developers.google.com/machine-learning/glossary/#matplotlib) to create the following functions:\n", + "\n", + "* `plot_the_dataset`, which creates a scatter plot of random points in the dataset. Setting the `number_of_points_to_plot` to 1,500 causes the function to create a scatter plot of all 1,500 points in the dataset.\n", + "* `plot_a_contiguous_portion_of_dataset`, which creates a scatter plot of all the points in the dataset from start to end. 
For example, setting `start` to 0 and `end` to 99 will produce a scatter plot of the first 100 points in the dataset.\n", + "\n", + "You may optionally double-click the headline to see the matplotlib code." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QF0BFRXTOeR3" + }, + "outputs": [], + "source": [ + "#@title Define the plotting functions { display-mode: \"form\" }\n", + "\n", + "# The following code defines the plotting functions that can be used to\n", + "# visualize the data.\n", + "\n", + "def plot_the_dataset(feature, label, number_of_points_to_plot):\n", + " \"\"\"Plot N random points of the dataset.\"\"\"\n", + "\n", + " # Label the axes.\n", + " plt.xlabel(feature)\n", + " plt.ylabel(label)\n", + "\n", + " # Create a scatter plot from n random points of the dataset.\n", + " random_examples = training_df.sample(n=number_of_points_to_plot)\n", + " plt.scatter(random_examples[feature], random_examples[label])\n", + "\n", + " # Render the scatter plot.\n", + " plt.show()\n", + "\n", + "def plot_a_contiguous_portion_of_dataset(feature, label, start, end):\n", + " \"\"\"Plot the data points from start to end, inclusive.\"\"\"\n", + "\n", + " # Label the axes.\n", + " plt.xlabel(feature + \"Day\")\n", + " plt.ylabel(label)\n", + "\n", + " # Create a scatter plot of rows start through end, inclusive.\n", + " plt.scatter(training_df[feature][start:end + 1], training_df[label][start:end + 1])\n", + "\n", + " # Render the scatter plot.\n", + " plt.show()\n", + "\n", + "\n", + "print(\"Defined the following functions:\")\n", + "print(\" * plot_the_dataset\")\n", + "print(\" * plot_a_contiguous_portion_of_dataset\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "z_xVk62zzkp4" + }, + "source": [ + "## Task 2: Visualize the dataset\n", + "\n", + "Outliers might still be lurking in the dataset. Visualizing the dataset with a scatter plot might identify outliers hidden in basic statistics."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4hyoN8SIz2nr" + }, + "outputs": [], + "source": [ + "plot_the_dataset(\"calories\", \"test_score\", number_of_points_to_plot=50)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bBeSiVl80uLc" + }, + "source": [ + "Does the scatter plot of 50 random data points suggest outliers?\n", + "What happens if you increase the `number_of_points_to_plot`?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YtT2ltKF1e97" + }, + "outputs": [], + "source": [ + "# @title Task 2: Solution (run this code block to view) { display-mode: \"form\" }\n", + "\n", + "print(\"\"\"Visualizing 50 data points doesn't imply any outliers.\n", + "However, as you increase the number of random data points to plot, a\n", + "clump of outliers appears. Notice the points with high test scores but less\n", + "than 200 calories.\"\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H6lImWthHu3o" + }, + "source": [ + "## Task 3: Get statistics for each week\n", + "\n", + "Possibly, different experimenters encoded `calories` differently. For example, maybe the experiment involved a different encoder for week 0 than week 1? Run the following code cells to get statistics for each week.
Can you see\n", + "significant differences between weeks?\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UsxF1B7cKKv4" + }, + "outputs": [], + "source": [ + "# Get statistics on Week 0 (rows 0-349)\n", + "training_df[0:350].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VS-ATO_0QTgY" + }, + "outputs": [], + "source": [ + "# Get statistics on Week 1 (rows 350-699)\n", + "training_df[350:700].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Qrucyb1sQ9BC" + }, + "outputs": [], + "source": [ + "# Get statistics on Week 2 (rows 700-1049)\n", + "training_df[700:1050].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jX3ybe1IQ9dq" + }, + "outputs": [], + "source": [ + "# Get statistics on Week 3 (rows 1050-1399)\n", + "training_df[1050:1400].describe()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "FJuDEvxSTH86" + }, + "outputs": [], + "source": [ + "# @title Task 3: Solution (run this code block to view) { display-mode: \"form\" }\n", + "\n", + "print(\"\"\"The basic statistics for each week are pretty similar, so weekly\n", + "differences aren't a likely explanation for the outliers.\"\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IakiyFQOkFxU" + }, + "source": [ + "## Task 4: Visualize by day of week\n", + "\n", + "Weekly values didn't change, but perhaps different\n", + "experimenters were hired for different days of the week? Maybe the Monday\n", + "experimenter encoded values differently than the Tuesday experimenter?\n", + "\n", + "The following code cell generates seven scatter plots--one for each day of the first week.
Was each day of the week the same?\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uKgvWeDMkb8j" + }, + "outputs": [], + "source": [ + "for i in range(0,7):\n", + " start = i * 50\n", + " end = start + 49\n", + " print(\"\\nDay %d\" % i)\n", + " plot_a_contiguous_portion_of_dataset(\"calories\", \"test_score\", start, end)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JGcC7CPbkcr4" + }, + "outputs": [], + "source": [ + "# @title Task 4: Solution (run this code block to view) { display-mode: \"form\" }\n", + "\n", + "print(\"\"\"Wait a second--the calories value for Day 4 spans 0 to 200, while the\n", + "calories value for all the other Days spans 0 to 400. Something is wrong\n", + "with Day 4, at least on the first week.\"\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RlPvUoDDsLoT" + }, + "source": [ + "## Task 5: Use statistics to confirm your suspicions\n", + "\n", + "You suspect Day 4 (Thursday) is encoded differently than other days of the week.\n", + "Write some code in the following code cell to confirm your suspicions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "W-PTkE9msfwB" + }, + "outputs": [], + "source": [ + "# Write your code here.\n", + "# Note that training_df[\"calories\"][position] returns the value of the\n", + "# calories column at a specific position in the dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "crztbvxmmRRW" + }, + "outputs": [], + "source": [ + "# @title Task 5: Solution (expand this code block to view) { display-mode: \"form\" }\n", + "\n", + "# You could use a variety of metrics to fully compare Thursday to the other\n", + "# six days, but this answer simply focuses on the mean.\n", + "\n", + "running_total_of_thursday_calories = 0\n", + "running_total_of_non_thursday_calories = 0\n", + "count = 0\n", + "for week in range(0,4):\n", + " for day in range(0,7):\n", + " for subject in range(0,50):\n", + " position = (week * 350) + (day * 50) + subject\n", + " if (day == 4): # Thursday\n", + " running_total_of_thursday_calories += training_df['calories'][position]\n", + " else: # Any day except Thursday\n", + " count += 1\n", + " running_total_of_non_thursday_calories += training_df['calories'][position]\n", + "\n", + "mean_of_thursday_calories = running_total_of_thursday_calories / 200\n", + "mean_of_non_thursday_calories = running_total_of_non_thursday_calories / 1200\n", + "\n", + "print(\"The mean of Thursday calories is %.0f\" % (mean_of_thursday_calories))\n", + "print(\"The mean of calories on days other than Thursday is %.0f\" % (mean_of_non_thursday_calories))\n" + ] + } + ], + "metadata": { + "colab": { + "private_outputs": true, + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file diff --git a/ml/cc/exercises/numerical_data_stats.ipynb b/ml/cc/exercises/numerical_data_stats.ipynb new file mode 100644 index 0000000..1cb509c --- /dev/null +++ b/ml/cc/exercises/numerical_data_stats.ipynb @@ -0,0 +1,177 @@ +{ + "cells": [ + { + "cell_type": "code", + "source": [ + "#@title Copyright 2023 Google LLC. 
Double-click for license information.\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ], + "metadata": { + "cellView": "form", + "id": "HYn5jBq2_Onh" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "25T2QAXLJPso" + }, + "source": [ + "# Colabs\n", + "\n", + "Machine Learning Crash Course uses Colaboratories (Colabs) for all programming exercises. Colab is Google's implementation of [Jupyter Notebook](https://jupyter.org/). For more information about Colabs and how to use them, go to [Welcome to Colaboratory](https://research.google.com/colaboratory).\n", + "\n", + "# Numerical data: Math stats on a dataset\n", + "\n", + "This Colab programming exercise (first of two) is part of the Machine Learning Crash Course module [Working with numerical data](https://developers.google.com/machine-learning/crash-course/numerical-data)." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qiy_IL3AsWkA" + }, + "source": [ + "## What to expect\n", + "\n", + "In the section, [First steps with numerical data](https://developers.google.com/machine-learning/crash-course/numerical-data/first-steps), you learned how to visualize your data in plots or graphs, how to evaluate potential features and labels mathematically, and how to find [**outliers**](https://developers.google.com/machine-learning/glossary/#outliers) in the dataset.\n", + "\n", + "This exercise takes you through the process of finding columns that contain blatant outliers, which you can then decide to keep in or delete from the dataset." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MyGvT2U4QWmA", + "cellView": "form" + }, + "outputs": [], + "source": [ + "# @title Setup - Import relevant modules\n", + "\n", + "# The following code imports relevant modules that\n", + "# allow you to run the colab.\n", + "# - If you encounter technical issues running some of the code sections\n", + "# that follow, try running this section again.\n", + "\n", + "import pandas as pd\n", + "\n", + "# The following lines adjust the granularity of reporting.\n", + "pd.options.display.max_rows = 10\n", + "pd.options.display.float_format = \"{:.1f}\".format" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "n-qYeaU9QgIA", + "cellView": "form" + }, + "outputs": [], + "source": [ + "#@title Import the dataset\n", + "\n", + "# The following code imports the dataset that is used in the colab.\n", + "\n", + "training_df = pd.read_csv(filepath_or_buffer=\"https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9CfNPW4GRf09" + }, + "source": [ + "## Get basic statistics\n", + "\n", + "In the following code section, the DataFrame `describe` method returns basic statistics on all the columns in 
the dataset, such as:\n", + "\n", + "* `count` is the number of populated elements in this column. Ideally, every column contains the same value for `count`, but that's not always the case.\n", + "* `mean` is the traditional average of values in this column. We recommend comparing the `mean` to the median for each column. The **median** is the 50% row of the table.\n", + "* `std` is the standard deviation of the values in this column.\n", + "* `min`, `25%`, `50%`, `75%`, and `max` indicate the values at the 0th, 25th, 50th, 75th, and 100th percentiles." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "faMaLr_4QgzP" + }, + "outputs": [], + "source": [ + "# Get statistics on the dataset.\n", + "\n", + "# The following code returns basic statistics about the data in the dataframe.\n", + "\n", + "training_df.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vkkok1t-Rw1l" + }, + "source": [ + "### Task 1: Identify possible outliers\n", + "\n", + "Based on the preceding statistics, do you see any columns that might contain outliers?"
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pzt1ZVNhvSUM" + }, + "outputs": [], + "source": [ + "# @title Task 1: Solution (run this code block to view) { display-mode: \"form\" }\n", + "\n", + "print(\"\"\"The following columns might contain outliers:\n", + "\n", + " * total_rooms\n", + " * total_bedrooms\n", + " * population\n", + " * households\n", + " * possibly, median_income\n", + "\n", + "In all of those columns:\n", + "\n", + " * the standard deviation is almost as high as the mean\n", + " * the delta between 75% and max is much higher than the\n", + " delta between min and 25%.\"\"\")" + ] + } + ], + "metadata": { + "colab": { + "private_outputs": true, + "provenance": [], + "toc_visible": true + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "name": "python" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file