diff --git a/tables/automl/notebooks/census_income_prediction/README.md b/tables/automl/notebooks/census_income_prediction/README.md new file mode 100644 index 000000000000..4c5ed03ce284 --- /dev/null +++ b/tables/automl/notebooks/census_income_prediction/README.md @@ -0,0 +1,96 @@ +AutoML Tables enables your entire team to automatically build and deploy state-of-the-art machine learning models on structured data at massively increased speed and scale. + + +## Problem Description +The model uses a real dataset from the [Census Income Dataset](https://archive.ics.uci.edu/ml/datasets/Census+Income). + + +The goal is to predict whether a given individual has an income above or below 50k, given information such as the person's age, education level, marital status, and occupation. +This is framed as a binary classification problem: label each individual as having an income either above or below 50k. + + + + + + +## Dataset Details + + +The dataset consists of over 30k rows, where each row corresponds to a different person. For a given row, there are 14 features that the model conditions on to predict the person's income. A few of the features are named above, and the exhaustive list can be found both at the dataset link above and in the Colab. + + + + +## Solution Walkthrough +The solution has been developed using a [Google Colab Notebook](https://colab.research.google.com/notebooks/welcome.ipynb). + + + + +## Steps Involved + + +### 1. Set up +The first step in this process is to set up the project. Refer to the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) and take the following steps: +* Create a Google Cloud Platform (GCP) project. +* Enable billing. +* Enable the AutoML API. +* Enable the AutoML Tables API. +* Create a service account, grant required permissions, and download the service account private key. + + +### 2. Initialize and authenticate + + +The client library installation is entirely self-explanatory in the Colab. 
+ + +The authentication process is only slightly more complex: run the second code block, entitled "Authenticate using service account key", and then upload the service account key you created in the setup step. + + +To make sure your Colab is authenticated and has access to your project, replace `project_id` with your own project ID and run the subsequent code blocks. You should see lists of your datasets and of any models you previously created in AutoML Tables. + + +### 3. Import training data + + +This section has you create a dataset and import the data. You can either import the CSV from a Cloud Storage bucket, or upload the CSV into BigQuery and import it from there. + + + + +### 4. Update dataset: assign a label column and enable nullable columns + + +This section is important, as it is where you specify which column (meaning which feature) you will use as your label. This label feature will then be predicted using all other features in the row. + + +### 5. Creating a model + + +This section is where you train your model. You can specify how long you want your model to train for. + + +### 6. Make a prediction + + +This section gives you the ability to do a single online prediction. You can toggle exactly which values you want for the numeric features, and choose from the drop-down menus which values you want for the categorical features. + + +The model takes a while to deploy online, and the SDK currently does not provide a feedback mechanism for deployment, so you will need to wait until the model finishes deploying before running the online prediction. +When the deployment code ```response = client.deploy_model(model_name)``` finishes, you will be able to see this on the [UI](https://console.cloud.google.com/automl-tables). + + +To see when it finishes, click on the UI link above, navigate to the dataset you just uploaded, and go to the predict tab. 
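For reference, the `payload` passed to `prediction_client.predict` in the Colab is a plain dictionary with one typed value per feature column, in the same order as the columns of the training table. A minimal sketch (the specific values here are illustrative only):

```python
# Illustrative online-prediction payload for the census income model.
# The 14 entries must follow the column order of the training table;
# numeric features use 'number_value', categorical ones 'string_value'.
payload = {
    'row': {
        'values': [
            {'number_value': 34},               # age
            {'string_value': 'Private'},        # workclass
            {'number_value': 150000},           # fnlwgt
            {'string_value': 'Bachelors'},      # education
            {'number_value': 9},                # education_num
            {'string_value': 'Never-married'},  # marital_status
            {'string_value': 'Tech-support'},   # occupation
            {'string_value': 'Not-in-family'},  # relationship
            {'string_value': 'White'},          # race
            {'string_value': 'Female'},         # sex
            {'number_value': 0},                # capital_gain
            {'number_value': 0},                # capital_loss
            {'number_value': 40},               # hours_per_week
            {'string_value': 'United-States'},  # native_country
        ]
    }
}

# Sanity check: one value per feature column.
assert len(payload['row']['values']) == 14
```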
You should see "online prediction" text near the top; click on it, and it will take you to a view of your online prediction interface. You should see "model deployed" on the far right of the screen if the model is deployed, or a "deploying model" message if it is still deploying. + + +Once the model finishes deployment, go ahead and run the ```prediction_client.predict(model_name, payload)``` line. + + +Note: If the model has not finished deployment, the prediction will NOT work. + + +### 7. Batch Prediction + + +A validation CSV file with a few rows of data not used in training or testing is provided for you to run a batch prediction with. The CSV is linked in the text of the Colab as well as [here](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income_batch_prediction_input.csv). \ No newline at end of file diff --git a/tables/automl/notebooks/census_income_prediction/census_income_prediction.ipynb b/tables/automl/notebooks/census_income_prediction/census_income_prediction.ipynb new file mode 100644 index 000000000000..1e5ac840c7b9 --- /dev/null +++ b/tables/automl/notebooks/census_income_prediction/census_income_prediction.ipynb @@ -0,0 +1,932 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "m26YhtBMvVWA" + }, + "source": [ + "# Getting started with AutoML Tables\n", + "\n", + "To use this Colab notebook, copy it to your own Google Drive and open it with [Colaboratory](https://colab.research.google.com/) (or Colab). To run a cell, hold the Shift key and press the Enter key (or Return key). Colab automatically displays the return value of the last line in each cell. Refer to [this page](https://colab.research.google.com/notebooks/welcome.ipynb) for more information on Colab.\n", + "\n", + "You can run a Colab notebook on a hosted runtime in the Cloud. 
The hosted VM times out after 90 minutes of inactivity and you will lose all the data stored in the memory including your authentication data. If your session gets disconnected (for example, because you closed your laptop) for less than the 90 minute inactivity timeout limit, press 'RECONNECT' on the top right corner of your notebook and resume the session. After Colab timeout, you'll need to\n", + "\n", + "1. Re-run the initialization and authentication.\n", + "2. Continue from where you left off. You may need to copy-paste the value of some variables such as the `dataset_name` from the printed output of the previous cells.\n", + "\n", + "Alternatively you can connect your Colab notebook to a [local runtime](https://research.google.com/colaboratory/local-runtimes.html).\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "b--5FDDwCG9C" + }, + "source": [ + "## 1. Project set up\n", + "\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "AZs0ICgy4jkQ" + }, + "source": [ + "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to\n", + "* Create a Google Cloud Platform (GCP) project.\n", + "* Enable billing.\n", + "* Apply to whitelist your project.\n", + "* Enable AutoML API.\n", + "* Enable AutoML Tables API.\n", + "* Create a service account, grant required permissions, and download the service account private key.\n", + "\n", + "You also need to upload your data into Google Cloud Storage (GCS) or BigQuery. For example, to use GCS as your data source\n", + "* Create a GCS bucket.\n", + "* Upload the training and batch prediction files.\n", + "\n", + "\n", + "**Warning:** Private keys must be kept secret. If you expose your private key it is recommended to revoke it immediately from the Google Cloud Console." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "xZECt1oL429r" + }, + "source": [ + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "rstRPH9SyZj_" + }, + "source": [ + "## 2. Initialize and authenticate\n", + "This section runs initialization and authentication. It creates an authenticated session, which is required for running any of the following sections." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "BR0POq2UzE7e" + }, + "source": [ + "### Install the client library\n", + "Run the following cell." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "43aXKjDRt_qZ" + }, + "outputs": [], + "source": [ + "#@title Install AutoML Tables client library { vertical-output: true }\n", + "!pip install google-cloud-automl" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "eVFsPPEociwF" + }, + "source": [ + "### Authenticate using service account key\n", + "Run the following cell. Click on the 'Choose Files' button and select the service account private key file. If your service account key file or folder is hidden, you can reveal it on a Mac by pressing the Command + Shift + . key combination." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "u-kCqysAuaJk" + }, + "outputs": [], + "source": [ + "#@title Authenticate and create a client. 
{ vertical-output: true }\n", + "\n", + "from google.cloud import automl_v1beta1\n", + "# The files module provides the Colab upload widget used below.\n", + "from google.colab import files\n", + "\n", + "# Upload service account key\n", + "keyfile_upload = files.upload()\n", + "keyfile_name = list(keyfile_upload.keys())[0]\n", + "# Authenticate and create an AutoML client.\n", + "client = automl_v1beta1.AutoMlClient.from_service_account_file(keyfile_name)\n", + "# Authenticate and create a prediction service client.\n", + "prediction_client = automl_v1beta1.PredictionServiceClient.from_service_account_file(keyfile_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "s3F2xbEJdDvN" + }, + "source": [ + "### Test" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "0uX4aJYUiXh5" + }, + "source": [ + "Enter your GCP project ID." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "6R4h5HF1Dtds" + }, + "outputs": [], + "source": [ + "#@title GCP project ID and location\n", + "\n", + "project_id = 'my-project-trial5' #@param {type:'string'}\n", + "location = 'us-central1'\n", + "location_path = client.location_path(project_id, location)\n", + "location_path" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "rUlBcZ3OfWcJ" + }, + "source": [ + "To test whether your project setup and authentication steps were successful, run the following cell to list your datasets." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "cellView": "both", + "colab": {}, + "colab_type": "code", + "id": "sf32nKXIqYje" + }, + "outputs": [], + "source": [ + "#@title List datasets. 
{ vertical-output: true }\n", + "\n", + "list_datasets_response = client.list_datasets(location_path)\n", + "datasets = {\n", + " dataset.display_name: dataset.name for dataset in list_datasets_response}\n", + "datasets" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "t9uE8MvMkOPd" + }, + "source": [ + "You can also print the list of your models by running the following cell." + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "cellView": "both", + "colab": {}, + "colab_type": "code", + "id": "j4-bYRSWj7xk" + }, + "outputs": [], + "source": [ + "#@title List models. { vertical-output: true }\n", + "\n", + "list_models_response = client.list_models(location_path)\n", + "models = {model.display_name: model.name for model in list_models_response}\n", + "models" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "qozQWMnOu48y" + }, + "source": [ + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ODt86YuVDZzm" + }, + "source": [ + "## 3. Import training data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "XwjZc9Q62Fm5" + }, + "source": [ + "### Create dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "_JfZFGSceyE_" + }, + "source": [ + "Select a dataset display name and pass your table source information to create a new dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Z_JErW3cw-0J" + }, + "outputs": [], + "source": [ + "#@title Create dataset { vertical-output: true, output-height: 200 }\n", + "\n", + "dataset_display_name = 'test_deployment' #@param {type: 'string'}\n", + "\n", + "create_dataset_response = client.create_dataset(\n", + " location_path,\n", + " {'display_name': dataset_display_name, 'tables_dataset_metadata': {}})\n", + "dataset_name = create_dataset_response.name\n", + "create_dataset_response" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "35YZ9dy34VqJ" + }, + "source": [ + "### Import data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "3c0o15gVREAw" + }, + "source": [ + "You can import your data to AutoML Tables from GCS or BigQuery. For this tutorial, you can use the [census_income dataset](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income.csv) \n", + "as your training data. You can create a GCS bucket and upload the data into your bucket. The URI for your file is `gs://BUCKET_NAME/FOLDER_NAME1/FOLDER_NAME2/.../FILE_NAME`. Alternatively, you can create a BigQuery table and upload the data into the table. The URI for your table is `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", + "\n", + "Importing data may take a few minutes or hours depending on the size of your data. If your Colab times out, run the following command to retrieve your dataset. Replace `dataset_name` with its actual value obtained in the preceding cells.\n", + "\n", + " dataset = client.get_dataset(dataset_name)" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "UIWlq3NTYhOl" + }, + "outputs": [], + "source": [ + "#@title ... 
if data source is GCS { vertical-output: true }\n", + "\n", + "dataset_gcs_input_uris = ['gs://cloud-ml-data/automl-tables/notebooks/census_income.csv',] #@param\n", + "# Define input configuration.\n", + "input_config = {\n", + " 'gcs_source': {\n", + " 'input_uris': dataset_gcs_input_uris\n", + " }\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "bB_GdeqCJW5i" + }, + "outputs": [], + "source": [ + "#@title ... if data source is BigQuery { vertical-output: true }\n", + "\n", + "dataset_bq_input_uri = 'bq://my-project-trial5.census_income.income_census' #@param {type: 'string'}\n", + "# Define input configuration.\n", + "input_config = {\n", + " 'bigquery_source': {\n", + " 'input_uri': dataset_bq_input_uri\n", + " }\n", + "}" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "FNVYfpoXJsNB" + }, + "outputs": [], + "source": [ + " #@title Import data { vertical-output: true }\n", + "\n", + "import_data_response = client.import_data(dataset_name, input_config)\n", + "print('Dataset import operation: {}'.format(import_data_response.operation))\n", + "# Wait until import is done.\n", + "import_data_result = import_data_response.result()\n", + "import_data_result" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "QdxBI4s44ZRI" + }, + "source": [ + "### Review the specs" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "RC0PWKqH4jwr" + }, + "source": [ + "Run the following command to see table specs such as row count." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "v2Vzq_gwXxo-" + }, + "outputs": [], + "source": [ + "#@title Table schema { vertical-output: true }\n", + "\n", + "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# List table specs\n", + "list_table_specs_response = client.list_table_specs(dataset_name)\n", + "table_specs = [s for s in list_table_specs_response]\n", + "# List column specs\n", + "table_spec_name = table_specs[0].name\n", + "list_column_specs_response = client.list_column_specs(table_spec_name)\n", + "column_specs = {s.display_name: s for s in list_column_specs_response}\n", + "# Table schema pie chart.\n", + "type_counts = {}\n", + "for column_spec in column_specs.values():\n", + " type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)\n", + " type_counts[type_name] = type_counts.get(type_name, 0) + 1\n", + "\n", + "plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')\n", + "plt.axis('equal')\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "vcJP7xoq4yAJ" + }, + "source": [ + "Run the following command to see column specs such as the inferred schema." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "FNykW_YOYt6d" + }, + "source": [ + "___" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "kNRVJqVOL8h3" + }, + "source": [ + "## 4. Update dataset: assign a label column and enable nullable columns" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "-57gehId9PQ5" + }, + "source": [ + "AutoML Tables automatically detects your data column type. 
For example, for the [census_income dataset](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income.csv) it detects `income` to be categorical (as it is just either over or under 50k) and `age` to be numerical. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "iRqdQ7Xiq04x" + }, + "source": [ + "### Update a column: set to nullable" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "OCEUIPKegWrf" + }, + "outputs": [], + "source": [ + "#@title Update dataset { vertical-output: true }\n", + "\n", + "update_column_spec_dict = {\n", + " 'name': column_specs['income'].name,\n", + " 'data_type': {\n", + " 'type_code': 'CATEGORY',\n", + " 'nullable': False\n", + " }\n", + "}\n", + "update_column_response = client.update_column_spec(update_column_spec_dict)\n", + "update_column_response" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "GUqKi3tkqrgW" + }, + "source": [ + "**Tip:** You can use `'type_code': 'CATEGORY'` in the preceding `update_column_spec_dict` to convert the column data type from `FLOAT64` to `CATEGORY`." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "nDMH_chybe4w" + }, + "source": [ + "### Update dataset: assign a label" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "hVIruWg0u33t" + }, + "outputs": [], + "source": [ + "#@title Update dataset { vertical-output: true }\n", + "\n", + "label_column_name = 'income' #@param {type: 'string'}\n", + "label_column_spec = column_specs[label_column_name]\n", + "label_column_id = label_column_spec.name.rsplit('/', 1)[-1]\n", + "print('Label column ID: {}'.format(label_column_id))\n", + "# Define the values of the fields to be updated.\n", + "update_dataset_dict = {\n", + " 'name': dataset_name,\n", + " 'tables_dataset_metadata': {\n", + " 'target_column_spec_id': label_column_id\n", + " }\n", + "}\n", + "update_dataset_response = client.update_dataset(update_dataset_dict)\n", + "update_dataset_response" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "z23NITLrcxmi" + }, + "source": [ + "___" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "FcKgvj1-Tbgj" + }, + "source": [ + "## 5. Creating a model" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Pnlk8vdQlO_k" + }, + "source": [ + "### Train a model\n", + "Specify the duration of the training. For example, `'train_budget_milli_node_hours': 1000` runs the training for one hour. If your Colab times out, use `client.list_models(location_path)` to check whether your model has been created. Then use model name to continue to the next steps. Run the following command to retrieve your model. 
Replace `model_name` with its actual value.\n", + "\n", + " model = client.get_model(model_name)" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "11izNd6Fu37N" + }, + "outputs": [], + "source": [ + "#@title Create model { vertical-output: true }\n", + "\n", + "model_display_name = 'census_income_model' #@param {type:'string'}\n", + "\n", + "model_dict = {\n", + " 'display_name': model_display_name,\n", + " 'dataset_id': dataset_name.rsplit('/', 1)[-1],\n", + " 'tables_model_metadata': {'train_budget_milli_node_hours': 1000}\n", + "}\n", + "create_model_response = client.create_model(location_path, model_dict)\n", + "print('Create model operation: {}'.format(create_model_response.operation))\n", + "# Wait until model training is done.\n", + "create_model_result = create_model_response.result()\n", + "model_name = create_model_result.name\n", + "create_model_result" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "1wS1is9IY5nK" + }, + "source": [ + "___" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "LMYmHSiCE8om" + }, + "source": [ + "## 6. Make a prediction" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "G2WVbMFll96k" + }, + "source": [ + "There are two different prediction modes: online and batch. The following cells show you how to make an online prediction." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "ZAGi8Co-SU-b" + }, + "source": [ + "Run the following cell, and then choose the desired test values for your online prediction." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "yt-KXEEQu3-U" + }, + "outputs": [], + "source": [ + "#@title Make an online prediction: set the categorical variables{ vertical-output: true }\n", + "from ipywidgets import interact\n", + "import ipywidgets as widgets\n", + "\n", + "workclass_ids = ['Private', 'Self-emp-not-inc', 'Self-emp-inc', 'Federal-gov', 'Local-gov', 'State-gov', 'Without-pay', 'Never-worked']\n", + "education_ids = ['Bachelors', 'Some-college', '11th', 'HS-grad', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', '9th', '7th-8th', '12th', 'Masters', '1st-4th', '10th', 'Doctorate', '5th-6th', 'Preschool']\n", + "marital_status_ids = ['Married-civ-spouse', 'Divorced', 'Never-married', 'Separated', 'Widowed', 'Married-spouse-absent', 'Married-AF-spouse']\n", + "occupation_ids = ['Tech-support', 'Craft-repair', 'Other-service', 'Sales', 'Exec-managerial', 'Prof-specialty', 'Handlers-cleaners', 'Machine-op-inspct', 'Adm-clerical', 'Farming-fishing', 'Transport-moving', 'Priv-house-serv', 'Protective-serv', 'Armed-Forces']\n", + "relationship_ids = ['Wife', 'Own-child', 'Husband', 'Not-in-family', 'Other-relative', 'Unmarried']\n", + "race_ids = ['White', 'Asian-Pac-Islander', 'Amer-Indian-Eskimo', 'Other', 'Black']\n", + "sex_ids = ['Female', 'Male']\n", + "native_country_ids = ['United-States', 'Cambodia', 'England', 'Puerto-Rico', 'Canada', 'Germany', 'Outlying-US(Guam-USVI-etc)', 'India', 'Japan', 'Greece', 'South', 'China', 'Cuba', 'Iran', 'Honduras', 'Philippines', 'Italy', 'Poland', 'Jamaica', 'Vietnam', 'Mexico', 'Portugal', 'Ireland', 'France', 'Dominican-Republic', 'Laos', 'Ecuador', 'Taiwan', 'Haiti', 'Columbia', 'Hungary', 'Guatemala', 'Nicaragua', 'Scotland', 'Thailand', 'Yugoslavia', 'El-Salvador', 'Trinadad&Tobago', 'Peru', 'Hong', 'Holand-Netherlands']\n", + "workclass = widgets.Dropdown(options=workclass_ids, value=workclass_ids[0],\n", + " 
description='workclass:')\n", + "\n", + "education = widgets.Dropdown(options=education_ids, value=education_ids[0],\n", + " description='education:', width='500px')\n", + "\n", + "marital_status = widgets.Dropdown(options=marital_status_ids, value=marital_status_ids[0],\n", + " description='marital status:', width='500px')\n", + "\n", + "occupation = widgets.Dropdown(options=occupation_ids, value=occupation_ids[0],\n", + " description='occupation:', width='500px')\n", + "\n", + "relationship = widgets.Dropdown(options=relationship_ids, value=relationship_ids[0],\n", + " description='relationship:', width='500px')\n", + "\n", + "race = widgets.Dropdown(options=race_ids, value=race_ids[0],\n", + " description='race:', width='500px')\n", + "\n", + "sex = widgets.Dropdown(options=sex_ids, value=sex_ids[0],\n", + " description='sex:', width='500px')\n", + "\n", + "native_country = widgets.Dropdown(options=native_country_ids, value=native_country_ids[0],\n", + " description='native_country:', width='500px')\n", + "\n", + "display(workclass)\n", + "display(education)\n", + "display(marital_status)\n", + "display(occupation)\n", + "display(relationship)\n", + "display(race)\n", + "display(sex)\n", + "display(native_country)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "xGVGwgwXSZe_" + }, + "source": [ + "Adjust the slides on the right to the desired test values for your online prediction." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "bDzd5GYQSdpa" + }, + "outputs": [], + "source": [ + "#@title Make an online prediction: set the numeric variables{ vertical-output: true }\n", + "\n", + "age = 34 #@param {type:'slider', min:1, max:100, step:1}\n", + "capital_gain = 40000 #@param {type:'slider', min:0, max:100000, step:10000}\n", + "capital_loss = 3.8 #@param {type:'slider', min:0, max:4000, step:0.1}\n", + "fnlwgt = 150000 #@param {type:'slider', min:0, max:1000000, step:50000}\n", + "education_num = 9 #@param {type:'slider', min:1, max:16, step:1}\n", + "hours_per_week = 40 #@param {type:'slider', min:1, max:100, step:1}\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "n0lFAIkISf4K" + }, + "source": [ + "**IMPORTANT** : Deploy the model, then wait until the model FINISHES deployment.\n", + "Check the [UI](https://console.cloud.google.com/automl-tables?_ga=2.255483016.-1079099924.1550856636) and navigate to the predict tab of your model, and then to the online prediction portion, to see when it finishes online deployment before running the prediction cell." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "kRoHFbVnSk05" + }, + "outputs": [], + "source": [ + "response = client.deploy_model(model_name)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "0tymBrhLSnDX" + }, + "source": [ + "Run the prediction, only after the model finishes deployment" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "Kc4SKLLPSoKz" + }, + "outputs": [], + "source": [ + "payload = {\n", + " 'row': { \n", + " 'values': [\n", + " {'number_value': age},\n", + " {'string_value': workclass.value},\n", + " {'number_value': fnlwgt},\n", + " {'string_value': education.value},\n", + " {'number_value': education_num},\n", + " {'string_value': marital_status.value},\n", + " {'string_value': occupation.value},\n", + " {'string_value': relationship.value},\n", + " {'string_value': race.value},\n", + " {'string_value': sex.value},\n", + " {'number_value': capital_gain},\n", + " {'number_value': capital_loss},\n", + " {'number_value': hours_per_week},\n", + " {'string_value': native_country.value}\n", + " ]\n", + " }\n", + "}\n", + "prediction_client.predict(model_name, payload)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "O9CRdIfrS1nb" + }, + "source": [ + "Undeploy the model" + ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "DWa1idtOS0GE" + }, + "outputs": [], + "source": [ + "response2 = client.undeploy_model(model_name)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "TarOq84-GXch" + }, + "source": [ + "## 7. 
Batch prediction" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "Soy5OB8Wbp_R" + }, + "source": [ + "### Initialize prediction" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "colab_type": "text", + "id": "39bIGjIlau5a" + }, + "source": [ + "Your data source for batch prediction can be GCS or BigQuery. For this tutorial, you can use [census_income_batch_prediction_input.csv](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/census_income_batch_prediction_input.csv) as the input source. Create a GCS bucket and upload the file into your bucket. Some of the lines in the batch prediction input file are intentionally left missing some values; AutoML Tables logs these errors in the `errors.csv` file.\n", + "Also, use the UI to create the bucket into which your predictions will be written. The bucket's default name here is automl-tables-pred.\n", + "\n", + "**NOTE:** The client library has a bug. If the following cell returns a `TypeError: Could not convert Any to BatchPredictResult` error, ignore it. The batch prediction output file(s) will be uploaded to the GCS bucket that you set in the preceding cells." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 0, + "metadata": { + "colab": {}, + "colab_type": "code", + "id": "gkF3bH0qu4DU" + }, + "outputs": [], + "source": [ + "#@title Start batch prediction { vertical-output: true, output-height: 200 }\n", + "\n", + "batch_predict_gcs_input_uris = ['gs://cloud-ml-data/automl-tables/notebooks/census_income_batch_prediction_input.csv',] #@param\n", + "batch_predict_gcs_output_uri_prefix = 'gs://automl-tables-pred1' #@param {type:'string'}\n", + "#gs://automl-tables-pred\n", + "# Define input source.\n", + "batch_prediction_input_source = {\n", + " 'gcs_source': {\n", + " 'input_uris': batch_predict_gcs_input_uris\n", + " }\n", + "}\n", + "# Define output target.\n", + "batch_prediction_output_target = {\n", + " 'gcs_destination': {\n", + " 'output_uri_prefix': batch_predict_gcs_output_uri_prefix\n", + " }\n", + "}\n", + "batch_predict_response = prediction_client.batch_predict(\n", + " model_name, batch_prediction_input_source, batch_prediction_output_target)\n", + "print('Batch prediction operation: {}'.format(batch_predict_response.operation))\n", + "# Wait until batch prediction is done.\n", + "batch_predict_result = batch_predict_response.result()\n", + "batch_predict_response.metadata" + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "census_income_prediction.ipynb", + "provenance": [], + "version": "0.3.2" + }, + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.2" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} \ No newline at end of file diff --git a/tables/automl/notebooks/energy_price_forecasting/README.md b/tables/automl/notebooks/energy_price_forecasting/README.md new file mode 100644 
index 000000000000..f9612854db2f --- /dev/null +++ b/tables/automl/notebooks/energy_price_forecasting/README.md @@ -0,0 +1,112 @@ +---------------------------------------- + +Copyright 2018 Google LLC + +Licensed under the Apache License, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. +You may obtain a copy of the License at + +[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0) + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and limitations under the License. + +---------------------------------------- + +# 1. Introduction + +This guide provides a high-level overview of an energy price forecasting solution, reviewing the significance of the solution and the audiences and use cases it applies to. In this section, we outline the business case, the problem, the solution, and the results. In section 2, we provide the code setup instructions. + +Solution description: A model that forecasts hourly energy prices for the next 5 days. + +Significance: This is a good complement to standard demand forecasting models, which typically predict N periods into the future. This model produces a rolling forecast, which is vital for operational decisions. It also takes historical trends, seasonal patterns, and external factors (like weather) into consideration to make more accurate forecasts. + + +## 1.1 Solution scenario + +### Challenge + +Many companies use forecasting models to predict prices, demand, sales, and so on. Many of these forecasting problems share characteristics that can be leveraged to produce accurate predictions, such as historical trends, seasonal patterns, and external factors. 
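Forecasting problems with these shared characteristics are usually handled by turning trends and seasonality into explicit model features. As a minimal, illustrative sketch (the column names and toy data here are hypothetical, not from the competition dataset), lag and calendar features for an hourly series might be built like this:

```python
import pandas as pd

# Toy hourly price series standing in for real market data.
idx = pd.date_range("2018-01-01", periods=72, freq="H")
df = pd.DataFrame({"price": range(72)}, index=idx)

# Historical trend: the price 24 and 48 hours earlier.
df["lag_24h"] = df["price"].shift(24)
df["lag_48h"] = df["price"].shift(48)

# Seasonal pattern: hour of day and day of week.
df["hour"] = df.index.hour
df["dayofweek"] = df.index.dayofweek

# Rows without a full 48-hour history cannot be used for training.
df = df.dropna()
print(len(df))  # 72 hours - 48 hours of warm-up = 24 usable rows
```

External factors such as weather forecasts would then be joined onto the same table, keyed by timestamp.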
+ +For example, consider an energy company that needs to accurately forecast the country’s hourly energy prices for the next 5 days (120 predictions) for optimal energy trading. + +At forecast time, the company has access to historical energy prices as well as weather forecasts for the time period in question. + +In this particular scenario, an energy company hosted a competition ([http://complatt.smartwatt.net/](http://complatt.smartwatt.net/)) in which developers used the datasets to build more accurate prediction models. + +### Solution + +We solved the energy pricing challenge by preparing a training dataset that encodes historical price trends, seasonal price patterns, and weather forecasts in a single table. We then used that table to train a deep neural network that makes accurate hourly predictions for the next 5 days. + +## 1.2 Similar applications + +The solution we created for the competition can be applied to other forecasting problems as well. + +It applies to any demand forecasting model that predicts N periods into the future and takes advantage of seasonal patterns, historical trends, and external datasets to produce accurate forecasts. + +Here are some additional demand forecasting examples: + +* Sales forecasting + +* Product or service usage forecasting + +* Traffic forecasting + + +# 2. Setting up the solution in a Google Cloud Platform project + +## 2.1 Create GCP project and download raw data + +Learn how to create a GCP project and prepare it for running the solution by following these steps: + +1. Create a project in GCP ([article](https://cloud.google.com/resource-manager/docs/creating-managing-projects) on how to create and manage GCP projects). + +2. Download the raw data for this problem: + +>[MarketPricePT](http://complatt.smartwatt.net/assets/files/historicalRealData/RealMarketPriceDataPT.csv) - Historical hourly energy prices. 
+>![alt text](https://storage.googleapis.com/images_public/price_schema.png) +>![alt text](https://storage.googleapis.com/images_public/price_data.png) + +>[historical_weather](http://complatt.smartwatt.net/assets/files/weatherHistoricalData/WeatherHistoricalData.zip) - Historical hourly weather forecasts. +>![alt text](https://storage.googleapis.com/images_public/weather_schema.png) +>![alt text](https://storage.googleapis.com/images_public/loc_portugal.png) +>![alt text](https://storage.googleapis.com/images_public/weather_data.png) + +*Disclaimer: The data for both tables comes from [http://complatt.smartwatt.net/](http://complatt.smartwatt.net/). This website hosts a closed competition meant to solve the energy price forecasting problem. The data was not collected or vetted by Google LLC, so we cannot guarantee its accuracy or quality.* + + +## 2.2 Execute script for data preparation + +Prepare the data used by the forecasting model by following these instructions: + +1. Clone the solution code from [https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloudml-energy-price-forecasting](https://github.com/GoogleCloudPlatform/professional-services/tree/master/examples/cloudml-energy-price-forecasting). In the solution code, navigate to the "data_preparation" folder. + +2. Run the `data_preparation.data_prep` script to generate training, validation, and testing data, as well as the constant files needed for normalization. + +3. Export the training, validation, and testing tables as CSVs (into the Google Cloud Storage bucket gs://energyforecast/data/csv). + +4. Read the "README.md" file for more information. + +5. Understand which parameters can be passed to the script (to override defaults). 
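Step 2 above mentions constant files needed for normalization; these are typically per-column statistics computed on the training split only and then reused on the validation and test splits. The following is an illustrative sketch of that idea, not the actual `data_preparation.data_prep` code:

```python
import pandas as pd

# Toy stand-in for the exported training table.
train = pd.DataFrame({"price": [50.0, 55.0, 60.0, 45.0]})

# Normalization constants come from the TRAINING split only,
# so validation and test data cannot leak into them.
mean = train["price"].mean()
std = train["price"].std()

def normalize(df: pd.DataFrame, mean: float, std: float) -> pd.DataFrame:
    """Apply z-score normalization using precomputed constants."""
    out = df.copy()
    out["price"] = (out["price"] - mean) / std
    return out

# The same constants are reused for a later split.
valid = pd.DataFrame({"price": [52.0, 58.0]})
print(normalize(valid, mean, std)["price"].round(3).tolist())  # [-0.077, 0.852]
```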
+ +Training data schema: +![alt text](https://storage.googleapis.com/images_public/training_schema.png) + +## 2.3 Execute notebook in this folder + +Train the forecasting model in AutoML Tables by running all cells in the notebook in this folder. + +## 2.4 AutoML Tables Results + +The following results are from our solution to this problem: + +* MAE (Mean Absolute Error) = 0.0416 +* RMSE (Root Mean Squared Error) = 0.0524 + +![alt text](https://storage.googleapis.com/images_public/automl_test.png) + +Feature importance: +![alt text](https://storage.googleapis.com/images_public/feature_importance.png) + diff --git a/tables/automl/notebooks/energy_price_forecasting/energy_price_forecasting.ipynb b/tables/automl/notebooks/energy_price_forecasting/energy_price_forecasting.ipynb new file mode 100644 index 000000000000..ac78b272c4b1 --- /dev/null +++ b/tables/automl/notebooks/energy_price_forecasting/energy_price_forecasting.ipynb @@ -0,0 +1,1204 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Energy_Price_Forecasting.ipynb", + "version": "0.3.2", + "provenance": [], + "collapsed_sections": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "cells": [ + { + "metadata": { + "id": "KOAz-lD1P7Kx", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "----------------------------------------\n", + "\n", + "Copyright 2018 Google LLC \n", + "\n", + "Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "you may not use this file except in compliance with the License.\n", + "You may obtain a copy of the License at\n", + "\n", + "[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)\n", + "\n", + "Unless required by applicable law or agreed to in writing, software\n", + "distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "See the 
License for the specific language governing permissions and limitations under the License.\n", + "\n", + "----------------------------------------" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "m26YhtBMvVWA" + }, + "cell_type": "markdown", + "source": [ + "# Energy Forecasting with AutoML Tables\n", + "\n", + "To use this Colab notebook, copy it to your own Google Drive and open it with [Colaboratory](https://colab.research.google.com/) (or Colab). To run a cell, hold the Shift key and press the Enter key (or Return key). Colab automatically displays the return value of the last line in each cell. Refer to [this page](https://colab.sandbox.google.com/notebooks/welcome.ipynb) for more information on Colab.\n", + "\n", + "You can run a Colab notebook on a hosted runtime in the Cloud. The hosted VM times out after 90 minutes of inactivity and you will lose all the data stored in memory, including your authentication data. If your session gets disconnected (for example, because you closed your laptop) for less than the 90-minute inactivity timeout limit, press 'RECONNECT' in the top right corner of your notebook to resume the session. After a Colab timeout, you'll need to:\n", + "\n", + "1. Re-run the initialization and authentication.\n", + "2. Continue from where you left off. You may need to copy-paste the values of some variables, such as the `dataset_name`, from the printed output of the previous cells.\n", + "\n", + "Alternatively, you can connect your Colab notebook to a [local runtime](https://research.google.com/colaboratory/local-runtimes.html)." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "b--5FDDwCG9C" + }, + "cell_type": "markdown", + "source": [ + "## 1. 
Project set up\n", + "\n", + "\n", + "\n" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "AZs0ICgy4jkQ" + }, + "cell_type": "markdown", + "source": [ + "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to\n", + "* Create a Google Cloud Platform (GCP) project.\n", + "* Enable billing.\n", + "* Apply to whitelist your project.\n", + "* Enable the AutoML API.\n", + "* Enable the AutoML Tables API.\n", + "* Create a service account, grant required permissions, and download the service account private key.\n", + "\n", + "You also need to upload your data into Google Cloud Storage (GCS) or BigQuery. For example, to use GCS as your data source:\n", + "* Create a GCS bucket.\n", + "* Upload the training and batch prediction files.\n", + "\n", + "\n", + "**Warning:** Private keys must be kept secret. If you expose your private key, it is recommended to revoke it immediately from the Google Cloud Console." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "xZECt1oL429r" + }, + "cell_type": "markdown", + "source": [ + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "rstRPH9SyZj_" + }, + "cell_type": "markdown", + "source": [ + "## 2. Initialize and authenticate\n", + "This section runs initialization and authentication. It creates an authenticated session which is required for running any of the following sections." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "BR0POq2UzE7e" + }, + "cell_type": "markdown", + "source": [ + "### Install the client library\n", + "Run the following cell. Click on the 'Choose Files' button and select the client library compressed file. The file is uploaded to your Colab and installed using `pip`." 
+ ] + }, + { + "metadata": { + "id": "43aXKjDRt_qZ", + "colab_type": "code", + "colab": { + "resources": { + "http://localhost:8080/nbextensions/google.colab/files.js": { + "data": "Ly8gQ29weXJpZ2h0IDIwMTcgR29vZ2xlIExMQwovLwovLyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKLy8geW91IG1heSBub3QgdXNlIHRoaXMgZmlsZSBleGNlcHQgaW4gY29tcGxpYW5jZSB3aXRoIHRoZSBMaWNlbnNlLgovLyBZb3UgbWF5IG9idGFpbiBhIGNvcHkgb2YgdGhlIExpY2Vuc2UgYXQKLy8KLy8gICAgICBodHRwOi8vd3d3LmFwYWNoZS5vcmcvbGljZW5zZXMvTElDRU5TRS0yLjAKLy8KLy8gVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQovLyBkaXN0cmlidXRlZCB1bmRlciB0aGUgTGljZW5zZSBpcyBkaXN0cmlidXRlZCBvbiBhbiAiQVMgSVMiIEJBU0lTLAovLyBXSVRIT1VUIFdBUlJBTlRJRVMgT1IgQ09ORElUSU9OUyBPRiBBTlkgS0lORCwgZWl0aGVyIGV4cHJlc3Mgb3IgaW1wbGllZC4KLy8gU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAovLyBsaW1pdGF0aW9ucyB1bmRlciB0aGUgTGljZW5zZS4KCi8qKgogKiBAZmlsZW92ZXJ2aWV3IEhlbHBlcnMgZm9yIGdvb2dsZS5jb2xhYiBQeXRob24gbW9kdWxlLgogKi8KKGZ1bmN0aW9uKHNjb3BlKSB7CmZ1bmN0aW9uIHNwYW4odGV4dCwgc3R5bGVBdHRyaWJ1dGVzID0ge30pIHsKICBjb25zdCBlbGVtZW50ID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnc3BhbicpOwogIGVsZW1lbnQudGV4dENvbnRlbnQgPSB0ZXh0OwogIGZvciAoY29uc3Qga2V5IG9mIE9iamVjdC5rZXlzKHN0eWxlQXR0cmlidXRlcykpIHsKICAgIGVsZW1lbnQuc3R5bGVba2V5XSA9IHN0eWxlQXR0cmlidXRlc1trZXldOwogIH0KICByZXR1cm4gZWxlbWVudDsKfQoKLy8gTWF4IG51bWJlciBvZiBieXRlcyB3aGljaCB3aWxsIGJlIHVwbG9hZGVkIGF0IGEgdGltZS4KY29uc3QgTUFYX1BBWUxPQURfU0laRSA9IDEwMCAqIDEwMjQ7Ci8vIE1heCBhbW91bnQgb2YgdGltZSB0byBibG9jayB3YWl0aW5nIGZvciB0aGUgdXNlci4KY29uc3QgRklMRV9DSEFOR0VfVElNRU9VVF9NUyA9IDMwICogMTAwMDsKCmZ1bmN0aW9uIF91cGxvYWRGaWxlcyhpbnB1dElkLCBvdXRwdXRJZCkgewogIGNvbnN0IHN0ZXBzID0gdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKTsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIC8vIENhY2hlIHN0ZXBzIG9uIHRoZSBvdXRwdXRFbGVtZW50IHRvIG1ha2UgaXQgYXZhaWxhYmxlIGZvciB0aGUgbmV4dCBjYWxsCiAgLy8gdG8gdXBsb2FkRmlsZXNDb2
50aW51ZSBmcm9tIFB5dGhvbi4KICBvdXRwdXRFbGVtZW50LnN0ZXBzID0gc3RlcHM7CgogIHJldHVybiBfdXBsb2FkRmlsZXNDb250aW51ZShvdXRwdXRJZCk7Cn0KCi8vIFRoaXMgaXMgcm91Z2hseSBhbiBhc3luYyBnZW5lcmF0b3IgKG5vdCBzdXBwb3J0ZWQgaW4gdGhlIGJyb3dzZXIgeWV0KSwKLy8gd2hlcmUgdGhlcmUgYXJlIG11bHRpcGxlIGFzeW5jaHJvbm91cyBzdGVwcyBhbmQgdGhlIFB5dGhvbiBzaWRlIGlzIGdvaW5nCi8vIHRvIHBvbGwgZm9yIGNvbXBsZXRpb24gb2YgZWFjaCBzdGVwLgovLyBUaGlzIHVzZXMgYSBQcm9taXNlIHRvIGJsb2NrIHRoZSBweXRob24gc2lkZSBvbiBjb21wbGV0aW9uIG9mIGVhY2ggc3RlcCwKLy8gdGhlbiBwYXNzZXMgdGhlIHJlc3VsdCBvZiB0aGUgcHJldmlvdXMgc3RlcCBhcyB0aGUgaW5wdXQgdG8gdGhlIG5leHQgc3RlcC4KZnVuY3Rpb24gX3VwbG9hZEZpbGVzQ29udGludWUob3V0cHV0SWQpIHsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIGNvbnN0IHN0ZXBzID0gb3V0cHV0RWxlbWVudC5zdGVwczsKCiAgY29uc3QgbmV4dCA9IHN0ZXBzLm5leHQob3V0cHV0RWxlbWVudC5sYXN0UHJvbWlzZVZhbHVlKTsKICByZXR1cm4gUHJvbWlzZS5yZXNvbHZlKG5leHQudmFsdWUucHJvbWlzZSkudGhlbigodmFsdWUpID0+IHsKICAgIC8vIENhY2hlIHRoZSBsYXN0IHByb21pc2UgdmFsdWUgdG8gbWFrZSBpdCBhdmFpbGFibGUgdG8gdGhlIG5leHQKICAgIC8vIHN0ZXAgb2YgdGhlIGdlbmVyYXRvci4KICAgIG91dHB1dEVsZW1lbnQubGFzdFByb21pc2VWYWx1ZSA9IHZhbHVlOwogICAgcmV0dXJuIG5leHQudmFsdWUucmVzcG9uc2U7CiAgfSk7Cn0KCi8qKgogKiBHZW5lcmF0b3IgZnVuY3Rpb24gd2hpY2ggaXMgY2FsbGVkIGJldHdlZW4gZWFjaCBhc3luYyBzdGVwIG9mIHRoZSB1cGxvYWQKICogcHJvY2Vzcy4KICogQHBhcmFtIHtzdHJpbmd9IGlucHV0SWQgRWxlbWVudCBJRCBvZiB0aGUgaW5wdXQgZmlsZSBwaWNrZXIgZWxlbWVudC4KICogQHBhcmFtIHtzdHJpbmd9IG91dHB1dElkIEVsZW1lbnQgSUQgb2YgdGhlIG91dHB1dCBkaXNwbGF5LgogKiBAcmV0dXJuIHshSXRlcmFibGU8IU9iamVjdD59IEl0ZXJhYmxlIG9mIG5leHQgc3RlcHMuCiAqLwpmdW5jdGlvbiogdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKSB7CiAgY29uc3QgaW5wdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQoaW5wdXRJZCk7CiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gZmFsc2U7CgogIGNvbnN0IG91dHB1dEVsZW1lbnQgPSBkb2N1bWVudC5nZXRFbGVtZW50QnlJZChvdXRwdXRJZCk7CiAgb3V0cHV0RWxlbWVudC5pbm5lckhUTUwgPSAnJzsKCiAgY29uc3QgcGlja2VkUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBpbnB1dEVsZW1lbnQuYWRkRXZlbnRMaXN0ZW5lcignY2hhbmdlJywgKGUpID
0+IHsKICAgICAgcmVzb2x2ZShlLnRhcmdldC5maWxlcyk7CiAgICB9KTsKICB9KTsKCiAgY29uc3QgY2FuY2VsID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnYnV0dG9uJyk7CiAgaW5wdXRFbGVtZW50LnBhcmVudEVsZW1lbnQuYXBwZW5kQ2hpbGQoY2FuY2VsKTsKICBjYW5jZWwudGV4dENvbnRlbnQgPSAnQ2FuY2VsIHVwbG9hZCc7CiAgY29uc3QgY2FuY2VsUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBjYW5jZWwub25jbGljayA9ICgpID0+IHsKICAgICAgcmVzb2x2ZShudWxsKTsKICAgIH07CiAgfSk7CgogIC8vIENhbmNlbCB1cGxvYWQgaWYgdXNlciBoYXNuJ3QgcGlja2VkIGFueXRoaW5nIGluIHRpbWVvdXQuCiAgY29uc3QgdGltZW91dFByb21pc2UgPSBuZXcgUHJvbWlzZSgocmVzb2x2ZSkgPT4gewogICAgc2V0VGltZW91dCgoKSA9PiB7CiAgICAgIHJlc29sdmUobnVsbCk7CiAgICB9LCBGSUxFX0NIQU5HRV9USU1FT1VUX01TKTsKICB9KTsKCiAgLy8gV2FpdCBmb3IgdGhlIHVzZXIgdG8gcGljayB0aGUgZmlsZXMuCiAgY29uc3QgZmlsZXMgPSB5aWVsZCB7CiAgICBwcm9taXNlOiBQcm9taXNlLnJhY2UoW3BpY2tlZFByb21pc2UsIHRpbWVvdXRQcm9taXNlLCBjYW5jZWxQcm9taXNlXSksCiAgICByZXNwb25zZTogewogICAgICBhY3Rpb246ICdzdGFydGluZycsCiAgICB9CiAgfTsKCiAgaWYgKCFmaWxlcykgewogICAgcmV0dXJuIHsKICAgICAgcmVzcG9uc2U6IHsKICAgICAgICBhY3Rpb246ICdjb21wbGV0ZScsCiAgICAgIH0KICAgIH07CiAgfQoKICBjYW5jZWwucmVtb3ZlKCk7CgogIC8vIERpc2FibGUgdGhlIGlucHV0IGVsZW1lbnQgc2luY2UgZnVydGhlciBwaWNrcyBhcmUgbm90IGFsbG93ZWQuCiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gdHJ1ZTsKCiAgZm9yIChjb25zdCBmaWxlIG9mIGZpbGVzKSB7CiAgICBjb25zdCBsaSA9IGRvY3VtZW50LmNyZWF0ZUVsZW1lbnQoJ2xpJyk7CiAgICBsaS5hcHBlbmQoc3BhbihmaWxlLm5hbWUsIHtmb250V2VpZ2h0OiAnYm9sZCd9KSk7CiAgICBsaS5hcHBlbmQoc3BhbigKICAgICAgICBgKCR7ZmlsZS50eXBlIHx8ICduL2EnfSkgLSAke2ZpbGUuc2l6ZX0gYnl0ZXMsIGAgKwogICAgICAgIGBsYXN0IG1vZGlmaWVkOiAkewogICAgICAgICAgICBmaWxlLmxhc3RNb2RpZmllZERhdGUgPyBmaWxlLmxhc3RNb2RpZmllZERhdGUudG9Mb2NhbGVEYXRlU3RyaW5nKCkgOgogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAnbi9hJ30gLSBgKSk7CiAgICBjb25zdCBwZXJjZW50ID0gc3BhbignMCUgZG9uZScpOwogICAgbGkuYXBwZW5kQ2hpbGQocGVyY2VudCk7CgogICAgb3V0cHV0RWxlbWVudC5hcHBlbmRDaGlsZChsaSk7CgogICAgY29uc3QgZmlsZURhdGFQcm9taXNlID0gbmV3IFByb21pc2UoKHJlc29sdmUpID0+IHsKICAgICAgY29uc3QgcmVhZGVyID0gbmV3IEZpbGVSZWFkZXIoKTsKICAgICAgcmVhZGVyLm9ubG9hZCA9IC
hlKSA9PiB7CiAgICAgICAgcmVzb2x2ZShlLnRhcmdldC5yZXN1bHQpOwogICAgICB9OwogICAgICByZWFkZXIucmVhZEFzQXJyYXlCdWZmZXIoZmlsZSk7CiAgICB9KTsKICAgIC8vIFdhaXQgZm9yIHRoZSBkYXRhIHRvIGJlIHJlYWR5LgogICAgbGV0IGZpbGVEYXRhID0geWllbGQgewogICAgICBwcm9taXNlOiBmaWxlRGF0YVByb21pc2UsCiAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgYWN0aW9uOiAnY29udGludWUnLAogICAgICB9CiAgICB9OwoKICAgIC8vIFVzZSBhIGNodW5rZWQgc2VuZGluZyB0byBhdm9pZCBtZXNzYWdlIHNpemUgbGltaXRzLiBTZWUgYi82MjExNTY2MC4KICAgIGxldCBwb3NpdGlvbiA9IDA7CiAgICB3aGlsZSAocG9zaXRpb24gPCBmaWxlRGF0YS5ieXRlTGVuZ3RoKSB7CiAgICAgIGNvbnN0IGxlbmd0aCA9IE1hdGgubWluKGZpbGVEYXRhLmJ5dGVMZW5ndGggLSBwb3NpdGlvbiwgTUFYX1BBWUxPQURfU0laRSk7CiAgICAgIGNvbnN0IGNodW5rID0gbmV3IFVpbnQ4QXJyYXkoZmlsZURhdGEsIHBvc2l0aW9uLCBsZW5ndGgpOwogICAgICBwb3NpdGlvbiArPSBsZW5ndGg7CgogICAgICBjb25zdCBiYXNlNjQgPSBidG9hKFN0cmluZy5mcm9tQ2hhckNvZGUuYXBwbHkobnVsbCwgY2h1bmspKTsKICAgICAgeWllbGQgewogICAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgICBhY3Rpb246ICdhcHBlbmQnLAogICAgICAgICAgZmlsZTogZmlsZS5uYW1lLAogICAgICAgICAgZGF0YTogYmFzZTY0LAogICAgICAgIH0sCiAgICAgIH07CiAgICAgIHBlcmNlbnQudGV4dENvbnRlbnQgPQogICAgICAgICAgYCR7TWF0aC5yb3VuZCgocG9zaXRpb24gLyBmaWxlRGF0YS5ieXRlTGVuZ3RoKSAqIDEwMCl9JSBkb25lYDsKICAgIH0KICB9CgogIC8vIEFsbCBkb25lLgogIHlpZWxkIHsKICAgIHJlc3BvbnNlOiB7CiAgICAgIGFjdGlvbjogJ2NvbXBsZXRlJywKICAgIH0KICB9Owp9CgpzY29wZS5nb29nbGUgPSBzY29wZS5nb29nbGUgfHwge307CnNjb3BlLmdvb2dsZS5jb2xhYiA9IHNjb3BlLmdvb2dsZS5jb2xhYiB8fCB7fTsKc2NvcGUuZ29vZ2xlLmNvbGFiLl9maWxlcyA9IHsKICBfdXBsb2FkRmlsZXMsCiAgX3VwbG9hZEZpbGVzQ29udGludWUsCn07Cn0pKHNlbGYpOwo=", + "ok": true, + "headers": [ + [ + "content-type", + "application/javascript" + ] + ], + "status": 200, + "status_text": "" + } + }, + "base_uri": "https://localhost:8080/", + "height": 602 + }, + "outputId": "4d3628f9-e5be-4145-f550-8eaffca97d37" + }, + "cell_type": "code", + "source": [ + "#@title Install AutoML Tables client library { vertical-output: true }\n", + "\n", + "from __future__ import absolute_import\n", + "from __future__ import division\n", + "from __future__ import 
print_function\n", + "\n", + "from google.colab import files\n", + "import tarfile\n", + "\n", + "# Upload the client library\n", + "compressed_file_upload = files.upload()\n", + "compressed_file_name = list(compressed_file_upload.keys())[0]\n", + "# Decompress the client library\n", + "with tarfile.open(compressed_file_name) as tar:\n", + " tar.extractall(path='.')\n", + "# Install the client library\n", + "!pip install ./python" + ], + "execution_count": 1, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " Upload widget is only available when the cell has been executed in the\n", + " current browser session. Please rerun this cell to enable.\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + } + }, + { + "output_type": "stream", + "text": [ + "Saving automl-beta-python-20190314 (1).tar.gz to automl-beta-python-20190314 (1).tar.gz\n", + "Processing ./python\n", + "Requirement already satisfied: google-api-core[grpc]<2.0.0dev,>=1.6.0 in /usr/local/lib/python3.6/dist-packages (from google-cloud-automl==0.1.2) (1.8.1)\n", + "Requirement already satisfied: six>=1.10.0 in /usr/local/lib/python3.6/dist-packages (from google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (1.11.0)\n", + "Requirement already satisfied: protobuf>=3.4.0 in /usr/local/lib/python3.6/dist-packages (from google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (3.7.0)\n", + "Requirement already satisfied: requests<3.0.0dev,>=2.18.0 in /usr/local/lib/python3.6/dist-packages (from google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (2.18.4)\n", + "Requirement already satisfied: pytz in /usr/local/lib/python3.6/dist-packages (from google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (2018.9)\n", + "Requirement already satisfied: setuptools>=34.0.0 in /usr/local/lib/python3.6/dist-packages (from 
google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (40.8.0)\n", + "Requirement already satisfied: google-auth<2.0dev,>=0.4.0 in /usr/local/lib/python3.6/dist-packages (from google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (1.4.2)\n", + "Requirement already satisfied: googleapis-common-protos!=1.5.4,<2.0dev,>=1.5.3 in /usr/local/lib/python3.6/dist-packages (from google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (1.5.8)\n", + "Requirement already satisfied: grpcio>=1.8.2; extra == \"grpc\" in /usr/local/lib/python3.6/dist-packages (from google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (1.15.0)\n", + "Requirement already satisfied: idna<2.7,>=2.5 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0dev,>=2.18.0->google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (2.6)\n", + "Requirement already satisfied: urllib3<1.23,>=1.21.1 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0dev,>=2.18.0->google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (1.22)\n", + "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0dev,>=2.18.0->google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (3.0.4)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.6/dist-packages (from requests<3.0.0dev,>=2.18.0->google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (2019.3.9)\n", + "Requirement already satisfied: cachetools>=2.0.0 in /usr/local/lib/python3.6/dist-packages (from google-auth<2.0dev,>=0.4.0->google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (3.1.0)\n", + "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.6/dist-packages (from google-auth<2.0dev,>=0.4.0->google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (0.2.4)\n", + "Requirement already satisfied: rsa>=3.1.4 in 
/usr/local/lib/python3.6/dist-packages (from google-auth<2.0dev,>=0.4.0->google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (4.0)\n", + "Requirement already satisfied: pyasn1<0.5.0,>=0.4.1 in /usr/local/lib/python3.6/dist-packages (from pyasn1-modules>=0.2.1->google-auth<2.0dev,>=0.4.0->google-api-core[grpc]<2.0.0dev,>=1.6.0->google-cloud-automl==0.1.2) (0.4.5)\n", + "Building wheels for collected packages: google-cloud-automl\n", + " Building wheel for google-cloud-automl (setup.py) ... \u001b[?25ldone\n", + "\u001b[?25h Stored in directory: /tmp/pip-ephem-wheel-cache-xklgs304/wheels/70/a0/a6/6112668f018c42dcdddf1e16bc95f8fcc25dd950fe20ab8b80\n", + "Successfully built google-cloud-automl\n", + "Installing collected packages: google-cloud-automl\n", + "Successfully installed google-cloud-automl-0.1.2\n" + ], + "name": "stdout" + }, + { + "output_type": "display_data", + "data": { + "application/vnd.colab-display-data+json": { + "pip_warning": { + "packages": [ + "google" + ] + } + } + }, + "metadata": { + "tags": [] + } + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "eVFsPPEociwF" + }, + "cell_type": "markdown", + "source": [ + "### Authenticate using service account key\n", + "Run the following cell. Click on the 'Choose Files' button and select the service account private key file. If your Service Account key file or folder is hidden, you can reveal it on a Mac by pressing the Command + Shift + . key combination." 
+ ] + }, + { + "metadata": { + "id": "u-kCqysAuaJk", + "colab_type": "code", + "colab": { + "resources": { + "http://localhost:8080/nbextensions/google.colab/files.js": { + "data": "Ly8gQ29weXJpZ2h0IDIwMTcgR29vZ2xlIExMQwovLwovLyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKLy8geW91IG1heSBub3QgdXNlIHRoaXMgZmlsZSBleGNlcHQgaW4gY29tcGxpYW5jZSB3aXRoIHRoZSBMaWNlbnNlLgovLyBZb3UgbWF5IG9idGFpbiBhIGNvcHkgb2YgdGhlIExpY2Vuc2UgYXQKLy8KLy8gICAgICBodHRwOi8vd3d3LmFwYWNoZS5vcmcvbGljZW5zZXMvTElDRU5TRS0yLjAKLy8KLy8gVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQovLyBkaXN0cmlidXRlZCB1bmRlciB0aGUgTGljZW5zZSBpcyBkaXN0cmlidXRlZCBvbiBhbiAiQVMgSVMiIEJBU0lTLAovLyBXSVRIT1VUIFdBUlJBTlRJRVMgT1IgQ09ORElUSU9OUyBPRiBBTlkgS0lORCwgZWl0aGVyIGV4cHJlc3Mgb3IgaW1wbGllZC4KLy8gU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAovLyBsaW1pdGF0aW9ucyB1bmRlciB0aGUgTGljZW5zZS4KCi8qKgogKiBAZmlsZW92ZXJ2aWV3IEhlbHBlcnMgZm9yIGdvb2dsZS5jb2xhYiBQeXRob24gbW9kdWxlLgogKi8KKGZ1bmN0aW9uKHNjb3BlKSB7CmZ1bmN0aW9uIHNwYW4odGV4dCwgc3R5bGVBdHRyaWJ1dGVzID0ge30pIHsKICBjb25zdCBlbGVtZW50ID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnc3BhbicpOwogIGVsZW1lbnQudGV4dENvbnRlbnQgPSB0ZXh0OwogIGZvciAoY29uc3Qga2V5IG9mIE9iamVjdC5rZXlzKHN0eWxlQXR0cmlidXRlcykpIHsKICAgIGVsZW1lbnQuc3R5bGVba2V5XSA9IHN0eWxlQXR0cmlidXRlc1trZXldOwogIH0KICByZXR1cm4gZWxlbWVudDsKfQoKLy8gTWF4IG51bWJlciBvZiBieXRlcyB3aGljaCB3aWxsIGJlIHVwbG9hZGVkIGF0IGEgdGltZS4KY29uc3QgTUFYX1BBWUxPQURfU0laRSA9IDEwMCAqIDEwMjQ7Ci8vIE1heCBhbW91bnQgb2YgdGltZSB0byBibG9jayB3YWl0aW5nIGZvciB0aGUgdXNlci4KY29uc3QgRklMRV9DSEFOR0VfVElNRU9VVF9NUyA9IDMwICogMTAwMDsKCmZ1bmN0aW9uIF91cGxvYWRGaWxlcyhpbnB1dElkLCBvdXRwdXRJZCkgewogIGNvbnN0IHN0ZXBzID0gdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKTsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIC8vIENhY2hlIHN0ZXBzIG9uIHRoZSBvdXRwdXRFbGVtZW50IHRvIG1ha2UgaXQgYXZhaWxhYmxlIGZvciB0aGUgbmV4dCBjYWxsCiAgLy8gdG8gdXBsb2FkRmlsZXNDb2
50aW51ZSBmcm9tIFB5dGhvbi4KICBvdXRwdXRFbGVtZW50LnN0ZXBzID0gc3RlcHM7CgogIHJldHVybiBfdXBsb2FkRmlsZXNDb250aW51ZShvdXRwdXRJZCk7Cn0KCi8vIFRoaXMgaXMgcm91Z2hseSBhbiBhc3luYyBnZW5lcmF0b3IgKG5vdCBzdXBwb3J0ZWQgaW4gdGhlIGJyb3dzZXIgeWV0KSwKLy8gd2hlcmUgdGhlcmUgYXJlIG11bHRpcGxlIGFzeW5jaHJvbm91cyBzdGVwcyBhbmQgdGhlIFB5dGhvbiBzaWRlIGlzIGdvaW5nCi8vIHRvIHBvbGwgZm9yIGNvbXBsZXRpb24gb2YgZWFjaCBzdGVwLgovLyBUaGlzIHVzZXMgYSBQcm9taXNlIHRvIGJsb2NrIHRoZSBweXRob24gc2lkZSBvbiBjb21wbGV0aW9uIG9mIGVhY2ggc3RlcCwKLy8gdGhlbiBwYXNzZXMgdGhlIHJlc3VsdCBvZiB0aGUgcHJldmlvdXMgc3RlcCBhcyB0aGUgaW5wdXQgdG8gdGhlIG5leHQgc3RlcC4KZnVuY3Rpb24gX3VwbG9hZEZpbGVzQ29udGludWUob3V0cHV0SWQpIHsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIGNvbnN0IHN0ZXBzID0gb3V0cHV0RWxlbWVudC5zdGVwczsKCiAgY29uc3QgbmV4dCA9IHN0ZXBzLm5leHQob3V0cHV0RWxlbWVudC5sYXN0UHJvbWlzZVZhbHVlKTsKICByZXR1cm4gUHJvbWlzZS5yZXNvbHZlKG5leHQudmFsdWUucHJvbWlzZSkudGhlbigodmFsdWUpID0+IHsKICAgIC8vIENhY2hlIHRoZSBsYXN0IHByb21pc2UgdmFsdWUgdG8gbWFrZSBpdCBhdmFpbGFibGUgdG8gdGhlIG5leHQKICAgIC8vIHN0ZXAgb2YgdGhlIGdlbmVyYXRvci4KICAgIG91dHB1dEVsZW1lbnQubGFzdFByb21pc2VWYWx1ZSA9IHZhbHVlOwogICAgcmV0dXJuIG5leHQudmFsdWUucmVzcG9uc2U7CiAgfSk7Cn0KCi8qKgogKiBHZW5lcmF0b3IgZnVuY3Rpb24gd2hpY2ggaXMgY2FsbGVkIGJldHdlZW4gZWFjaCBhc3luYyBzdGVwIG9mIHRoZSB1cGxvYWQKICogcHJvY2Vzcy4KICogQHBhcmFtIHtzdHJpbmd9IGlucHV0SWQgRWxlbWVudCBJRCBvZiB0aGUgaW5wdXQgZmlsZSBwaWNrZXIgZWxlbWVudC4KICogQHBhcmFtIHtzdHJpbmd9IG91dHB1dElkIEVsZW1lbnQgSUQgb2YgdGhlIG91dHB1dCBkaXNwbGF5LgogKiBAcmV0dXJuIHshSXRlcmFibGU8IU9iamVjdD59IEl0ZXJhYmxlIG9mIG5leHQgc3RlcHMuCiAqLwpmdW5jdGlvbiogdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKSB7CiAgY29uc3QgaW5wdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQoaW5wdXRJZCk7CiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gZmFsc2U7CgogIGNvbnN0IG91dHB1dEVsZW1lbnQgPSBkb2N1bWVudC5nZXRFbGVtZW50QnlJZChvdXRwdXRJZCk7CiAgb3V0cHV0RWxlbWVudC5pbm5lckhUTUwgPSAnJzsKCiAgY29uc3QgcGlja2VkUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBpbnB1dEVsZW1lbnQuYWRkRXZlbnRMaXN0ZW5lcignY2hhbmdlJywgKGUpID
0+IHsKICAgICAgcmVzb2x2ZShlLnRhcmdldC5maWxlcyk7CiAgICB9KTsKICB9KTsKCiAgY29uc3QgY2FuY2VsID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnYnV0dG9uJyk7CiAgaW5wdXRFbGVtZW50LnBhcmVudEVsZW1lbnQuYXBwZW5kQ2hpbGQoY2FuY2VsKTsKICBjYW5jZWwudGV4dENvbnRlbnQgPSAnQ2FuY2VsIHVwbG9hZCc7CiAgY29uc3QgY2FuY2VsUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBjYW5jZWwub25jbGljayA9ICgpID0+IHsKICAgICAgcmVzb2x2ZShudWxsKTsKICAgIH07CiAgfSk7CgogIC8vIENhbmNlbCB1cGxvYWQgaWYgdXNlciBoYXNuJ3QgcGlja2VkIGFueXRoaW5nIGluIHRpbWVvdXQuCiAgY29uc3QgdGltZW91dFByb21pc2UgPSBuZXcgUHJvbWlzZSgocmVzb2x2ZSkgPT4gewogICAgc2V0VGltZW91dCgoKSA9PiB7CiAgICAgIHJlc29sdmUobnVsbCk7CiAgICB9LCBGSUxFX0NIQU5HRV9USU1FT1VUX01TKTsKICB9KTsKCiAgLy8gV2FpdCBmb3IgdGhlIHVzZXIgdG8gcGljayB0aGUgZmlsZXMuCiAgY29uc3QgZmlsZXMgPSB5aWVsZCB7CiAgICBwcm9taXNlOiBQcm9taXNlLnJhY2UoW3BpY2tlZFByb21pc2UsIHRpbWVvdXRQcm9taXNlLCBjYW5jZWxQcm9taXNlXSksCiAgICByZXNwb25zZTogewogICAgICBhY3Rpb246ICdzdGFydGluZycsCiAgICB9CiAgfTsKCiAgaWYgKCFmaWxlcykgewogICAgcmV0dXJuIHsKICAgICAgcmVzcG9uc2U6IHsKICAgICAgICBhY3Rpb246ICdjb21wbGV0ZScsCiAgICAgIH0KICAgIH07CiAgfQoKICBjYW5jZWwucmVtb3ZlKCk7CgogIC8vIERpc2FibGUgdGhlIGlucHV0IGVsZW1lbnQgc2luY2UgZnVydGhlciBwaWNrcyBhcmUgbm90IGFsbG93ZWQuCiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gdHJ1ZTsKCiAgZm9yIChjb25zdCBmaWxlIG9mIGZpbGVzKSB7CiAgICBjb25zdCBsaSA9IGRvY3VtZW50LmNyZWF0ZUVsZW1lbnQoJ2xpJyk7CiAgICBsaS5hcHBlbmQoc3BhbihmaWxlLm5hbWUsIHtmb250V2VpZ2h0OiAnYm9sZCd9KSk7CiAgICBsaS5hcHBlbmQoc3BhbigKICAgICAgICBgKCR7ZmlsZS50eXBlIHx8ICduL2EnfSkgLSAke2ZpbGUuc2l6ZX0gYnl0ZXMsIGAgKwogICAgICAgIGBsYXN0IG1vZGlmaWVkOiAkewogICAgICAgICAgICBmaWxlLmxhc3RNb2RpZmllZERhdGUgPyBmaWxlLmxhc3RNb2RpZmllZERhdGUudG9Mb2NhbGVEYXRlU3RyaW5nKCkgOgogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAnbi9hJ30gLSBgKSk7CiAgICBjb25zdCBwZXJjZW50ID0gc3BhbignMCUgZG9uZScpOwogICAgbGkuYXBwZW5kQ2hpbGQocGVyY2VudCk7CgogICAgb3V0cHV0RWxlbWVudC5hcHBlbmRDaGlsZChsaSk7CgogICAgY29uc3QgZmlsZURhdGFQcm9taXNlID0gbmV3IFByb21pc2UoKHJlc29sdmUpID0+IHsKICAgICAgY29uc3QgcmVhZGVyID0gbmV3IEZpbGVSZWFkZXIoKTsKICAgICAgcmVhZGVyLm9ubG9hZCA9IC
hlKSA9PiB7CiAgICAgICAgcmVzb2x2ZShlLnRhcmdldC5yZXN1bHQpOwogICAgICB9OwogICAgICByZWFkZXIucmVhZEFzQXJyYXlCdWZmZXIoZmlsZSk7CiAgICB9KTsKICAgIC8vIFdhaXQgZm9yIHRoZSBkYXRhIHRvIGJlIHJlYWR5LgogICAgbGV0IGZpbGVEYXRhID0geWllbGQgewogICAgICBwcm9taXNlOiBmaWxlRGF0YVByb21pc2UsCiAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgYWN0aW9uOiAnY29udGludWUnLAogICAgICB9CiAgICB9OwoKICAgIC8vIFVzZSBhIGNodW5rZWQgc2VuZGluZyB0byBhdm9pZCBtZXNzYWdlIHNpemUgbGltaXRzLiBTZWUgYi82MjExNTY2MC4KICAgIGxldCBwb3NpdGlvbiA9IDA7CiAgICB3aGlsZSAocG9zaXRpb24gPCBmaWxlRGF0YS5ieXRlTGVuZ3RoKSB7CiAgICAgIGNvbnN0IGxlbmd0aCA9IE1hdGgubWluKGZpbGVEYXRhLmJ5dGVMZW5ndGggLSBwb3NpdGlvbiwgTUFYX1BBWUxPQURfU0laRSk7CiAgICAgIGNvbnN0IGNodW5rID0gbmV3IFVpbnQ4QXJyYXkoZmlsZURhdGEsIHBvc2l0aW9uLCBsZW5ndGgpOwogICAgICBwb3NpdGlvbiArPSBsZW5ndGg7CgogICAgICBjb25zdCBiYXNlNjQgPSBidG9hKFN0cmluZy5mcm9tQ2hhckNvZGUuYXBwbHkobnVsbCwgY2h1bmspKTsKICAgICAgeWllbGQgewogICAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgICBhY3Rpb246ICdhcHBlbmQnLAogICAgICAgICAgZmlsZTogZmlsZS5uYW1lLAogICAgICAgICAgZGF0YTogYmFzZTY0LAogICAgICAgIH0sCiAgICAgIH07CiAgICAgIHBlcmNlbnQudGV4dENvbnRlbnQgPQogICAgICAgICAgYCR7TWF0aC5yb3VuZCgocG9zaXRpb24gLyBmaWxlRGF0YS5ieXRlTGVuZ3RoKSAqIDEwMCl9JSBkb25lYDsKICAgIH0KICB9CgogIC8vIEFsbCBkb25lLgogIHlpZWxkIHsKICAgIHJlc3BvbnNlOiB7CiAgICAgIGFjdGlvbjogJ2NvbXBsZXRlJywKICAgIH0KICB9Owp9CgpzY29wZS5nb29nbGUgPSBzY29wZS5nb29nbGUgfHwge307CnNjb3BlLmdvb2dsZS5jb2xhYiA9IHNjb3BlLmdvb2dsZS5jb2xhYiB8fCB7fTsKc2NvcGUuZ29vZ2xlLmNvbGFiLl9maWxlcyA9IHsKICBfdXBsb2FkRmlsZXMsCiAgX3VwbG9hZEZpbGVzQ29udGludWUsCn07Cn0pKHNlbGYpOwo=", + "ok": true, + "headers": [ + [ + "content-type", + "application/javascript" + ] + ], + "status": 200, + "status_text": "" + } + }, + "base_uri": "https://localhost:8080/", + "height": 71 + }, + "outputId": "06154a63-f410-435f-b565-cd1599243b88" + }, + "cell_type": "code", + "source": [ + "#@title Authenticate using service account key and create a client. 
{ vertical-output: true }\n", + "\n", + "from google.cloud import automl_v1beta1\n", + "\n", + "# Upload service account key\n", + "keyfile_upload = files.upload()\n", + "keyfile_name = list(keyfile_upload.keys())[0]\n", + "# Authenticate and create an AutoML client.\n", + "client = automl_v1beta1.AutoMlClient.from_service_account_file(keyfile_name)\n", + "# Authenticate and create a prediction service client.\n", + "prediction_client = automl_v1beta1.PredictionServiceClient.from_service_account_file(keyfile_name)" + ], + "execution_count": 2, + "outputs": [ + { + "output_type": "display_data", + "data": { + "text/html": [ + "\n", + " \n", + " \n", + " Upload widget is only available when the cell has been executed in the\n", + " current browser session. Please rerun this cell to enable.\n", + " \n", + " " + ], + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + } + }, + { + "output_type": "stream", + "text": [ + "Saving energy-forecasting.json to energy-forecasting.json\n" + ], + "name": "stdout" + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "s3F2xbEJdDvN" + }, + "cell_type": "markdown", + "source": [ + "### Set Project and Location" + ] + }, + { + "metadata": { + "id": "0uX4aJYUiXh5", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Enter your GCP project ID." 
+ ] + }, + { + "metadata": { + "colab_type": "code", + "id": "6R4h5HF1Dtds", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "1e049b34-4683-4755-ab08-aec08de2bc66" + }, + "cell_type": "code", + "source": [ + "#@title GCP project ID and location\n", + "\n", + "project_id = 'energy-forecasting' #@param {type:'string'}\n", + "location = 'us-central1' #@param {type:'string'}\n", + "location_path = client.location_path(project_id, location)\n", + "location_path" + ], + "execution_count": 3, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "'projects/energy-forecasting/locations/us-central1'" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 3 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "qozQWMnOu48y" + }, + "cell_type": "markdown", + "source": [ + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "ODt86YuVDZzm" + }, + "cell_type": "markdown", + "source": [ + "## 3. Import training data" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "XwjZc9Q62Fm5" + }, + "cell_type": "markdown", + "source": [ + "### Create dataset" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "_JfZFGSceyE_" + }, + "cell_type": "markdown", + "source": [ + "Select a dataset display name and pass your table source information to create a new dataset." 
+ ] + }, + { + "metadata": { + "id": "Z_JErW3cw-0J", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 224 + }, + "outputId": "7fe366df-73ae-4ab1-ceaa-fd6ced4ccdd9" + }, + "cell_type": "code", + "source": [ + "#@title Create dataset { vertical-output: true, output-height: 200 }\n", + "\n", + "dataset_display_name = 'energy_forecasting_solution' #@param {type: 'string'}\n", + "\n", + "create_dataset_response = client.create_dataset(\n", + " location_path,\n", + " {'display_name': dataset_display_name, 'tables_dataset_metadata': {}})\n", + "dataset_name = create_dataset_response.name\n", + "create_dataset_response" + ], + "execution_count": 14, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "name: \"projects/595920091534/locations/us-central1/datasets/TBL1714094647237672960\"\n", + "display_name: \"energy_forecasting_solution\"\n", + "create_time {\n", + " seconds: 1553639618\n", + " nanos: 347402000\n", + "}\n", + "etag: \"AB3BwFrebKY3sN1sMcQWwhizE_rWgZl2_9My3WhNx5HxmYWJvfwg4C-wkYpkvhY3Mkvz\"\n", + "tables_dataset_metadata {\n", + " stats_update_time {\n", + " }\n", + "}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 14 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "35YZ9dy34VqJ" + }, + "cell_type": "markdown", + "source": [ + "### Import data" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "3c0o15gVREAw" + }, + "cell_type": "markdown", + "source": [ + "You can import your data to AutoML Tables from GCS or BigQuery. For this tutorial, you can use the [iris dataset](https://storage.cloud.google.com/rostam-193618-tutorial/automl-tables-v1beta1/iris.csv) as your training data. You can create a GCS bucket and upload the data into your bucket. The URI for your file is `gs://BUCKET_NAME/FOLDER_NAME1/FOLDER_NAME2/.../FILE_NAME`. Alternatively you can create a BigQuery table and upload the data into the table. 
The URI for your table is `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", + "\n", + "Importing data may take a few minutes or hours depending on the size of your data. If your Colab times out, run the following command to retrieve your dataset. Replace `dataset_name` with its actual value obtained in the preceding cells.\n", + "\n", + " dataset = client.get_dataset(dataset_name)" + ] + }, + { + "metadata": { + "id": "bB_GdeqCJW5i", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Datasource in BigQuery { vertical-output: true }\n", + "\n", + "dataset_bq_input_uri = 'bq://energy-forecasting.Energy.automldata' #@param {type: 'string'}\n", + "# Define input configuration.\n", + "input_config = {\n", + " 'bigquery_source': {\n", + " 'input_uri': dataset_bq_input_uri\n", + " }\n", + "}" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "FNVYfpoXJsNB", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 139 + }, + "outputId": "0ecc8d11-5bf1-4c2e-f688-b6d9be934e3c" + }, + "cell_type": "code", + "source": [ + " #@title Import data { vertical-output: true }\n", + "\n", + "import_data_response = client.import_data(dataset_name, input_config)\n", + "print('Dataset import operation: {}'.format(import_data_response.operation))\n", + "# Wait until import is done.\n", + "import_data_result = import_data_response.result()\n", + "import_data_result" + ], + "execution_count": 16, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Dataset import operation: name: \"projects/595920091534/locations/us-central1/operations/TBL5340820557317275648\"\n", + "metadata {\n", + " type_url: \"type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata\"\n", + " value: \"\\032\\014\\010\\305\\321\\352\\344\\005\\020\\300\\307\\214\\365\\001\\\"\\014\\010\\305\\321\\352\\344\\005\\020\\300\\307\\214\\365\\001z\\000\"\n", + "}\n", + "\n" + ], + "name": "stdout" + }, + { + 
"output_type": "execute_result", + "data": { + "text/plain": [ + "" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 16 + } + ] + }, + { + "metadata": { + "id": "QdxBI4s44ZRI", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Review the specs" + ] + }, + { + "metadata": { + "id": "RC0PWKqH4jwr", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Run the following command to see table specs such as row count." + ] + }, + { + "metadata": { + "id": "v2Vzq_gwXxo-", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 3247 + }, + "outputId": "c89cd7b1-4344-46d9-c4a3-1b012b5b720d" + }, + "cell_type": "code", + "source": [ + "#@title Table schema { vertical-output: true }\n", + "\n", + "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n", + "\n", + "# List table specs\n", + "list_table_specs_response = client.list_table_specs(dataset_name)\n", + "table_specs = [s for s in list_table_specs_response]\n", + "# List column specs\n", + "table_spec_name = table_specs[0].name\n", + "list_column_specs_response = client.list_column_specs(table_spec_name)\n", + "column_specs = {s.display_name: s for s in list_column_specs_response}\n", + "[(x, data_types.TypeCode.Name(\n", + " column_specs[x].data_type.type_code)) for x in column_specs.keys()]" + ], + "execution_count": 17, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[('price', 'FLOAT64'),\n", + " ('date_utc', 'TIMESTAMP'),\n", + " ('day', 'CATEGORY'),\n", + " ('hour', 'FLOAT64'),\n", + " ('prev_week_min', 'FLOAT64'),\n", + " ('prev_week_25th', 'FLOAT64'),\n", + " ('prev_week_50th', 'FLOAT64'),\n", + " ('prev_week_75th', 'FLOAT64'),\n", + " ('prev_week_max', 'FLOAT64'),\n", + " ('point1_temperature', 'FLOAT64'),\n", + " ('point1_wind_speed_100m', 'FLOAT64'),\n", + " ('point1_wind_direction_100m', 'FLOAT64'),\n", + " ('point1_air_density', 'FLOAT64'),\n", + " 
('point1_precipitation', 'FLOAT64'),\n", + " ('point1_wind_gust', 'FLOAT64'),\n", + " ('point1_radiation', 'FLOAT64'),\n", + " ('point1_wind_speed', 'FLOAT64'),\n", + " ('point1_wind_direction', 'FLOAT64'),\n", + " ('point1_pressure', 'FLOAT64'),\n", + " ('point2_temperature', 'FLOAT64'),\n", + " ('point2_wind_speed_100m', 'FLOAT64'),\n", + " ('point2_wind_direction_100m', 'FLOAT64'),\n", + " ('point2_air_density', 'FLOAT64'),\n", + " ('point2_precipitation', 'FLOAT64'),\n", + " ('point2_wind_gust', 'FLOAT64'),\n", + " ('point2_radiation', 'FLOAT64'),\n", + " ('point2_wind_speed', 'FLOAT64'),\n", + " ('point2_wind_direction', 'FLOAT64'),\n", + " ('point2_pressure', 'FLOAT64'),\n", + " ('point3_temperature', 'FLOAT64'),\n", + " ('point3_wind_speed_100m', 'FLOAT64'),\n", + " ('point3_wind_direction_100m', 'FLOAT64'),\n", + " ('point3_air_density', 'FLOAT64'),\n", + " ('point3_precipitation', 'FLOAT64'),\n", + " ('point3_wind_gust', 'FLOAT64'),\n", + " ('point3_radiation', 'FLOAT64'),\n", + " ('point3_wind_speed', 'FLOAT64'),\n", + " ('point3_wind_direction', 'FLOAT64'),\n", + " ('point3_pressure', 'FLOAT64'),\n", + " ('point4_temperature', 'FLOAT64'),\n", + " ('point4_wind_speed_100m', 'FLOAT64'),\n", + " ('point4_wind_direction_100m', 'FLOAT64'),\n", + " ('point4_air_density', 'FLOAT64'),\n", + " ('point4_precipitation', 'FLOAT64'),\n", + " ('point4_wind_gust', 'FLOAT64'),\n", + " ('point4_radiation', 'FLOAT64'),\n", + " ('point4_wind_speed', 'FLOAT64'),\n", + " ('point4_wind_direction', 'FLOAT64'),\n", + " ('point4_pressure', 'FLOAT64'),\n", + " ('point5_temperature', 'FLOAT64'),\n", + " ('point5_wind_speed_100m', 'FLOAT64'),\n", + " ('point5_wind_direction_100m', 'FLOAT64'),\n", + " ('point5_air_density', 'FLOAT64'),\n", + " ('point5_precipitation', 'FLOAT64'),\n", + " ('point5_wind_gust', 'FLOAT64'),\n", + " ('point5_radiation', 'FLOAT64'),\n", + " ('point5_wind_speed', 'FLOAT64'),\n", + " ('point5_wind_direction', 'FLOAT64'),\n", + " ('point5_pressure', 
'FLOAT64'),\n", + " ('point6_temperature', 'FLOAT64'),\n", + " ('point6_wind_speed_100m', 'FLOAT64'),\n", + " ('point6_wind_direction_100m', 'FLOAT64'),\n", + " ('point6_air_density', 'FLOAT64'),\n", + " ('point6_precipitation', 'FLOAT64'),\n", + " ('point6_wind_gust', 'FLOAT64'),\n", + " ('point6_radiation', 'FLOAT64'),\n", + " ('point6_wind_speed', 'FLOAT64'),\n", + " ('point6_wind_direction', 'FLOAT64'),\n", + " ('point6_pressure', 'FLOAT64'),\n", + " ('point7_temperature', 'FLOAT64'),\n", + " ('point7_wind_speed_100m', 'FLOAT64'),\n", + " ('point7_wind_direction_100m', 'FLOAT64'),\n", + " ('point7_air_density', 'FLOAT64'),\n", + " ('point7_precipitation', 'FLOAT64'),\n", + " ('point7_wind_gust', 'FLOAT64'),\n", + " ('point7_radiation', 'FLOAT64'),\n", + " ('point7_wind_speed', 'FLOAT64'),\n", + " ('point7_wind_direction', 'FLOAT64'),\n", + " ('point7_pressure', 'FLOAT64'),\n", + " ('point8_temperature', 'FLOAT64'),\n", + " ('point8_wind_speed_100m', 'FLOAT64'),\n", + " ('point8_wind_direction_100m', 'FLOAT64'),\n", + " ('point8_air_density', 'FLOAT64'),\n", + " ('point8_precipitation', 'FLOAT64'),\n", + " ('point8_wind_gust', 'FLOAT64'),\n", + " ('point8_radiation', 'FLOAT64'),\n", + " ('point8_wind_speed', 'FLOAT64'),\n", + " ('point8_wind_direction', 'FLOAT64'),\n", + " ('point8_pressure', 'FLOAT64'),\n", + " ('point9_temperature', 'FLOAT64'),\n", + " ('point9_wind_speed_100m', 'FLOAT64'),\n", + " ('point9_wind_direction_100m', 'FLOAT64'),\n", + " ('point9_air_density', 'FLOAT64'),\n", + " ('point9_precipitation', 'FLOAT64'),\n", + " ('point9_wind_gust', 'FLOAT64'),\n", + " ('point9_radiation', 'FLOAT64'),\n", + " ('point9_wind_speed', 'FLOAT64'),\n", + " ('point9_wind_direction', 'FLOAT64'),\n", + " ('point9_pressure', 'FLOAT64'),\n", + " ('point10_temperature', 'FLOAT64'),\n", + " ('point10_wind_speed_100m', 'FLOAT64'),\n", + " ('point10_wind_direction_100m', 'FLOAT64'),\n", + " ('point10_air_density', 'FLOAT64'),\n", + " ('point10_precipitation', 
'FLOAT64'),\n", + " ('point10_wind_gust', 'FLOAT64'),\n", + " ('point10_radiation', 'FLOAT64'),\n", + " ('point10_wind_speed', 'FLOAT64'),\n", + " ('point10_wind_direction', 'FLOAT64'),\n", + " ('point10_pressure', 'FLOAT64'),\n", + " ('point11_temperature', 'FLOAT64'),\n", + " ('point11_wind_speed_100m', 'FLOAT64'),\n", + " ('point11_wind_direction_100m', 'FLOAT64'),\n", + " ('point11_air_density', 'FLOAT64'),\n", + " ('point11_precipitation', 'FLOAT64'),\n", + " ('point11_wind_gust', 'FLOAT64'),\n", + " ('point11_radiation', 'FLOAT64'),\n", + " ('point11_wind_speed', 'FLOAT64'),\n", + " ('point11_wind_direction', 'FLOAT64'),\n", + " ('point11_pressure', 'FLOAT64'),\n", + " ('point12_temperature', 'FLOAT64'),\n", + " ('point12_wind_speed_100m', 'FLOAT64'),\n", + " ('point12_wind_direction_100m', 'FLOAT64'),\n", + " ('point12_air_density', 'FLOAT64'),\n", + " ('point12_precipitation', 'FLOAT64'),\n", + " ('point12_wind_gust', 'FLOAT64'),\n", + " ('point12_radiation', 'FLOAT64'),\n", + " ('point12_wind_speed', 'FLOAT64'),\n", + " ('point12_wind_direction', 'FLOAT64'),\n", + " ('point12_pressure', 'FLOAT64'),\n", + " ('point13_temperature', 'FLOAT64'),\n", + " ('point13_wind_speed_100m', 'FLOAT64'),\n", + " ('point13_wind_direction_100m', 'FLOAT64'),\n", + " ('point13_air_density', 'FLOAT64'),\n", + " ('point13_precipitation', 'FLOAT64'),\n", + " ('point13_wind_gust', 'FLOAT64'),\n", + " ('point13_radiation', 'FLOAT64'),\n", + " ('point13_wind_speed', 'FLOAT64'),\n", + " ('point13_wind_direction', 'FLOAT64'),\n", + " ('point13_pressure', 'FLOAT64'),\n", + " ('point14_temperature', 'FLOAT64'),\n", + " ('point14_wind_speed_100m', 'FLOAT64'),\n", + " ('point14_wind_direction_100m', 'FLOAT64'),\n", + " ('point14_air_density', 'FLOAT64'),\n", + " ('point14_precipitation', 'FLOAT64'),\n", + " ('point14_wind_gust', 'FLOAT64'),\n", + " ('point14_radiation', 'FLOAT64'),\n", + " ('point14_wind_speed', 'FLOAT64'),\n", + " ('point14_wind_direction', 'FLOAT64'),\n", + " 
('point14_pressure', 'FLOAT64'),\n", + " ('point15_temperature', 'FLOAT64'),\n", + " ('point15_wind_speed_100m', 'FLOAT64'),\n", + " ('point15_wind_direction_100m', 'FLOAT64'),\n", + " ('point15_air_density', 'FLOAT64'),\n", + " ('point15_precipitation', 'FLOAT64'),\n", + " ('point15_wind_gust', 'FLOAT64'),\n", + " ('point15_radiation', 'FLOAT64'),\n", + " ('point15_wind_speed', 'FLOAT64'),\n", + " ('point15_wind_direction', 'FLOAT64'),\n", + " ('point15_pressure', 'FLOAT64'),\n", + " ('point16_temperature', 'FLOAT64'),\n", + " ('point16_wind_speed_100m', 'FLOAT64'),\n", + " ('point16_wind_direction_100m', 'FLOAT64'),\n", + " ('point16_air_density', 'FLOAT64'),\n", + " ('point16_precipitation', 'FLOAT64'),\n", + " ('point16_wind_gust', 'FLOAT64'),\n", + " ('point16_radiation', 'FLOAT64'),\n", + " ('point16_wind_speed', 'FLOAT64'),\n", + " ('point16_wind_direction', 'FLOAT64'),\n", + " ('point16_pressure', 'FLOAT64'),\n", + " ('point17_temperature', 'FLOAT64'),\n", + " ('point17_wind_speed_100m', 'FLOAT64'),\n", + " ('point17_wind_direction_100m', 'FLOAT64'),\n", + " ('point17_air_density', 'FLOAT64'),\n", + " ('point17_precipitation', 'FLOAT64'),\n", + " ('point17_wind_gust', 'FLOAT64'),\n", + " ('point17_radiation', 'FLOAT64'),\n", + " ('point17_wind_speed', 'FLOAT64'),\n", + " ('point17_wind_direction', 'FLOAT64'),\n", + " ('point17_pressure', 'FLOAT64'),\n", + " ('point18_temperature', 'FLOAT64'),\n", + " ('point18_wind_speed_100m', 'FLOAT64'),\n", + " ('point18_wind_direction_100m', 'FLOAT64'),\n", + " ('point18_air_density', 'FLOAT64'),\n", + " ('point18_precipitation', 'FLOAT64'),\n", + " ('point18_wind_gust', 'FLOAT64'),\n", + " ('point18_radiation', 'FLOAT64'),\n", + " ('point18_wind_speed', 'FLOAT64'),\n", + " ('point18_wind_direction', 'FLOAT64'),\n", + " ('point18_pressure', 'FLOAT64'),\n", + " ('split', 'CATEGORY')]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 17 + } + ] + }, + { + "metadata": { + "id": "vcJP7xoq4yAJ", + 
"colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Run the following command to see column specs such as the inferred schema." + ] + }, + { + "metadata": { + "id": "FNykW_YOYt6d", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "___" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "kNRVJqVOL8h3" + }, + "cell_type": "markdown", + "source": [ + "## 4. Update dataset: assign a label column and enable nullable columns" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "-57gehId9PQ5" + }, + "cell_type": "markdown", + "source": [ + "AutoML Tables automatically detects your data column type. For example, for the [Iris dataset](https://storage.cloud.google.com/rostam-193618-tutorial/automl-tables-v1beta1/iris.csv) it detects `species` to be categorical and `petal_length`, `petal_width`, `sepal_length`, and `sepal_width` to be numerical. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema."
+ ] + }, + { + "metadata": { + "id": "iRqdQ7Xiq04x", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Update a column: set as categorical" + ] + }, + { + "metadata": { + "id": "OCEUIPKegWrf", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "44370b2c-f3dc-46bc-cefd-8a6f29f9cabe" + }, + "cell_type": "code", + "source": [ + "#@title Update dataset { vertical-output: true }\n", + "\n", + "column_to_category = 'hour' #@param {type: 'string'}\n", + "\n", + "update_column_spec_dict = {\n", + " \"name\": column_specs[column_to_category].name,\n", + " \"data_type\": {\n", + " \"type_code\": \"CATEGORY\"\n", + " }\n", + "}\n", + "update_column_response = client.update_column_spec(update_column_spec_dict)\n", + "update_column_response.display_name , update_column_response.data_type \n" + ], + "execution_count": 18, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "('hour', type_code: CATEGORY)" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 18 + } + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "nDMH_chybe4w" + }, + "cell_type": "markdown", + "source": [ + "### Update dataset: assign a label and split column" + ] + }, + { + "metadata": { + "id": "hVIruWg0u33t", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 360 + }, + "outputId": "eeb5f733-16ec-4191-ea59-c2fab30c8442" + }, + "cell_type": "code", + "source": [ + "#@title Update dataset { vertical-output: true }\n", + "\n", + "label_column_name = 'price' #@param {type: 'string'}\n", + "label_column_spec = column_specs[label_column_name]\n", + "label_column_id = label_column_spec.name.rsplit('/', 1)[-1]\n", + "print('Label column ID: {}'.format(label_column_id))\n", + "\n", + "split_column_name = 'split' #@param {type: 'string'}\n", + "split_column_spec = column_specs[split_column_name]\n", + "split_column_id = 
split_column_spec.name.rsplit('/', 1)[-1]\n", + "print('Split column ID: {}'.format(split_column_id))\n", + "# Define the values of the fields to be updated.\n", + "update_dataset_dict = {\n", + " 'name': dataset_name,\n", + " 'tables_dataset_metadata': {\n", + " 'target_column_spec_id': label_column_id,\n", + " 'ml_use_column_spec_id': split_column_id,\n", + " }\n", + "}\n", + "update_dataset_response = client.update_dataset(update_dataset_dict)\n", + "update_dataset_response" + ], + "execution_count": 19, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Label column ID: 4897277566672437248\n", + "Split column ID: 7923696516265410560\n" + ], + "name": "stdout" + }, + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "name: \"projects/595920091534/locations/us-central1/datasets/TBL1714094647237672960\"\n", + "display_name: \"energy_forecasting_solution\"\n", + "create_time {\n", + " seconds: 1553639618\n", + " nanos: 347402000\n", + "}\n", + "etag: \"AB3BwFr63Q-dtuKoAhrA3aEzOPOegFk2vjx1bcfkr_dgsZd_KM5G98s7ADiCuq5XcKw=\"\n", + "example_count: 6552\n", + "tables_dataset_metadata {\n", + " primary_table_spec_id: \"7971723184166666240\"\n", + " target_column_spec_id: \"4897277566672437248\"\n", + " ml_use_column_spec_id: \"7923696516265410560\"\n", + " stats_update_time {\n", + " seconds: 1553639685\n", + " nanos: 199000000\n", + " }\n", + "}" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 19 + } + ] + }, + { + "metadata": { + "id": "z23NITLrcxmi", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "___" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "FcKgvj1-Tbgj" + }, + "cell_type": "markdown", + "source": [ + "## 5. Creating a model" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "Pnlk8vdQlO_k" + }, + "cell_type": "markdown", + "source": [ + "### Train a model\n", + "Specify the duration of the training. 
For example, `'train_budget_milli_node_hours': 1000` runs the training for one hour. If your Colab times out, use `client.list_models(location_path)` to check whether your model has been created. Then use model name to continue to the next steps. Run the following command to retrieve your model. Replace `model_name` with its actual value.\n", + "\n", + " model = client.get_model(model_name)" + ] + }, + { + "metadata": { + "id": "11izNd6Fu37N", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 139 + }, + "outputId": "1bca25aa-eb19-4b27-a3fa-7ef137aaf4e2" + }, + "cell_type": "code", + "source": [ + "#@title Create model { vertical-output: true }\n", + "\n", + "\n", + "\n", + "model_display_name = 'energy_model' #@param {type:'string'}\n", + "model_train_hours = 12 #@param {type:'integer'}\n", + "model_optimization_objective = 'MINIMIZE_MAE' #@param {type:'string'}\n", + "column_to_ignore = 'date_utc' #@param {type:'string'}\n", + "\n", + "# Create list of features to use\n", + "feat_list = list(column_specs.keys())\n", + "feat_list.remove(label_column_name)\n", + "feat_list.remove(split_column_name)\n", + "feat_list.remove(column_to_ignore)\n", + "\n", + "model_dict = {\n", + " 'display_name': model_display_name,\n", + " 'dataset_id': dataset_name.rsplit('/', 1)[-1],\n", + " 'tables_model_metadata': {\n", + " 'train_budget_milli_node_hours':model_train_hours * 1000,\n", + " 'optimization_objective': model_optimization_objective,\n", + " 'target_column_spec': column_specs[label_column_name],\n", + " 'input_feature_column_specs': [\n", + " column_specs[x] for x in feat_list]}\n", + " }\n", + " \n", + "create_model_response = client.create_model(location_path, model_dict)\n", + "print('Dataset import operation: {}'.format(create_model_response.operation))\n", + "# Wait until model training is done.\n", + "create_model_result = create_model_response.result()\n", + "model_name = create_model_result.name\n", + "create_model_result" + 
], + "execution_count": 0, + "outputs": [ + { + "output_type": "stream", + "text": [ + "Dataset import operation: name: \"projects/595920091534/locations/us-central1/operations/TBL1734281680723640320\"\n", + "metadata {\n", + " type_url: \"type.googleapis.com/google.cloud.automl.v1beta1.OperationMetadata\"\n", + " value: \"\\032\\014\\010\\236\\324\\352\\344\\005\\020\\350\\353\\256\\245\\002\\\"\\014\\010\\236\\324\\352\\344\\005\\020\\350\\353\\256\\245\\002R\\000\"\n", + "}\n", + "\n" + ], + "name": "stdout" + } + ] + }, + { + "metadata": { + "id": "puVew1GgPfQa", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + }, + "outputId": "42b9296c-d231-4787-f7fb-4aa1a6ff9bd9" + }, + "cell_type": "code", + "source": [ + "#@title Model Metrics {vertical-output: true }\n", + "\n", + "metrics= [x for x in client.list_model_evaluations(model_name)][-1]\n", + "metrics.regression_evaluation_metrics" + ], + "execution_count": 9, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "root_mean_squared_error: 0.0524103119969368\n", + "mean_absolute_error: 0.04162062332034111\n", + "mean_absolute_percentage_error: 8.693264961242676\n", + "r_squared: 0.5450255274772644" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 9 + } + ] + }, + { + "metadata": { + "id": "YQnfEwyrSt2T", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "![alt text](https://storage.googleapis.com/images_public/automl_test.png)" + ] + }, + { + "metadata": { + "id": "Vyc8ckbpRMHp", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 272 + }, + "outputId": "931d4921-2144-4092-dab6-165c1b1c2a88" + }, + "cell_type": "code", + "source": [ + "#@title Feature Importance {vertical-output: true }\n", + "\n", + "model = client.get_model(model_name)\n", + "feat_list = [(x.feature_importance, x.column_display_name) for x in 
model.tables_model_metadata.tables_model_column_info]\n", + "feat_list.sort(reverse=True)\n", + "feat_list[:15]" + ], + "execution_count": 12, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "[(0.037560366094112396, 'day'),\n", + " (0.02844194509088993, 'hour'),\n", + " (0.021969439461827278, 'point6_pressure'),\n", + " (0.017835542559623718, 'point10_pressure'),\n", + " (0.01292468048632145, 'point3_pressure'),\n", + " (0.011175047606229782, 'point15_wind_speed_100m'),\n", + " (0.011016246862709522, 'point11_radiation'),\n", + " (0.010922599583864212, 'point6_wind_gust'),\n", + " (0.010761065408587456, 'prev_week_25th'),\n", + " (0.010502426885068417, 'point4_pressure'),\n", + " (0.010326736606657505, 'point4_wind_speed'),\n", + " (0.009977834299206734, 'point2_pressure'),\n", + " (0.009956032037734985, 'point8_wind_gust'),\n", + " (0.009465079754590988, 'prev_week_min'),\n", + " (0.009312096051871777, 'point7_wind_speed_100m')]" + ] + }, + "metadata": { + "tags": [] + }, + "execution_count": 12 + } + ] + }, + { + "metadata": { + "id": "__2gDQ5I5gcj", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "![alt text](https://storage.googleapis.com/images_public/feature_importance.png)\n", + "![alt text](https://storage.googleapis.com/images_public/loc_portugal.png)\n", + "![alt text](https://storage.googleapis.com/images_public/weather_schema.png)\n", + "![alt text](https://storage.googleapis.com/images_public/training_schema.png)" + ] + }, + { + "metadata": { + "id": "1wS1is9IY5nK", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "___" + ] + } + ] +} \ No newline at end of file diff --git a/tables/automl/notebooks/energy_price_forecasting/energy_price_forecasting:energy_price_forecasting.ipynb b/tables/automl/notebooks/energy_price_forecasting/energy_price_forecasting:energy_price_forecasting.ipynb new file mode 100644 index 000000000000..1eff7b0b3ca5 --- /dev/null +++ 
b/tables/automl/notebooks/energy_price_forecasting/energy_price_forecasting:energy_price_forecasting.ipynb @@ -0,0 +1,727 @@ + +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "Energy_Price_Forecasting.ipynb", + "version": "0.3.2", + "provenance": [], + "collapsed_sections": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "cells": [ + { + "metadata": { + "id": "KOAz-lD1P7Kx", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "----------------------------------------\n", + "\n", + "Copyright 2018 Google LLC \n", + "\n", + "Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "you may not use this file except in compliance with the License.\n", + "You may obtain a copy of the License at\n", + "\n", + "[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)\n", + "\n", + "Unless required by applicable law or agreed to in writing, software\n", + "distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "See the License for the specific language governing permissions and limitations under the License.\n", + "\n", + "----------------------------------------" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "m26YhtBMvVWA" + }, + "cell_type": "markdown", + "source": [ + "# Energy Forecasting with AutoML Tables\n", + "\n", + "To use this Colab notebook, copy it to your own Google Drive and open it with [Colaboratory](https://colab.research.google.com/) (or Colab). To run a cell hold the Shift key and press the Enter key (or Return key). Colab automatically displays the return value of the last line in each cell. Refer to [this page](https://colab.research.google.com/notebooks/welcome.ipynb) for more information on Colab.\n", + "\n", + "You can run a Colab notebook on a hosted runtime in the Cloud. 
The hosted VM times out after 90 minutes of inactivity and you will lose all the data stored in the memory, including your authentication data. If your session gets disconnected (for example, because you closed your laptop) for less than the 90-minute inactivity timeout limit, press 'RECONNECT' on the top right corner of your notebook and resume the session. After a Colab timeout, you'll need to\n", + "\n", + "1. Re-run the initialization and authentication.\n", + "2. Continue from where you left off. You may need to copy-paste the value of some variables such as the `dataset_name` from the printed output of the previous cells.\n", + "\n", + "Alternatively, you can connect your Colab notebook to a [local runtime](https://research.google.com/colaboratory/local-runtimes.html)." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "b--5FDDwCG9C" + }, + "cell_type": "markdown", + "source": [ + "## 1. Project set up\n", + "\n", + "\n", + "\n" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "AZs0ICgy4jkQ" + }, + "cell_type": "markdown", + "source": [ + "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to:\n", + "* Create a Google Cloud Platform (GCP) project.\n", + "* Enable billing.\n", + "* Apply to whitelist your project.\n", + "* Enable the AutoML API.\n", + "* Enable the AutoML Tables API.\n", + "* Create a service account, grant required permissions, and download the service account private key.\n", + "\n", + "You also need to upload your data into Google Cloud Storage (GCS) or BigQuery. For example, to use GCS as your data source:\n", + "* Create a GCS bucket.\n", + "* Upload the training and batch prediction files.\n", + "\n", + "\n", + "**Warning:** Private keys must be kept secret. If you expose your private key, it is recommended that you revoke it immediately from the Google Cloud Console."
+ ] + }, + { + "metadata": { + "colab_type": "text", + "id": "xZECt1oL429r" + }, + "cell_type": "markdown", + "source": [ + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "rstRPH9SyZj_" + }, + "cell_type": "markdown", + "source": [ + "## 2. Initialize and authenticate\n", + "This section runs initialization and authentication. It creates an authenticated session which is required for running any of the following sections." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "BR0POq2UzE7e" + }, + "cell_type": "markdown", + "source": [ + "### Install the client library\n", + "Run the following cell to install the client library using `pip`." + ] + }, + { + "metadata": { + "id": "43aXKjDRt_qZ", + "colab_type": "code", + "colab": { + "resources": { + "http://localhost:8080/nbextensions/google.colab/files.js": { + "data": "Ly8gQ29weXJpZ2h0IDIwMTcgR29vZ2xlIExMQwovLwovLyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKLy8geW91IG1heSBub3QgdXNlIHRoaXMgZmlsZSBleGNlcHQgaW4gY29tcGxpYW5jZSB3aXRoIHRoZSBMaWNlbnNlLgovLyBZb3UgbWF5IG9idGFpbiBhIGNvcHkgb2YgdGhlIExpY2Vuc2UgYXQKLy8KLy8gICAgICBodHRwOi8vd3d3LmFwYWNoZS5vcmcvbGljZW5zZXMvTElDRU5TRS0yLjAKLy8KLy8gVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQovLyBkaXN0cmlidXRlZCB1bmRlciB0aGUgTGljZW5zZSBpcyBkaXN0cmlidXRlZCBvbiBhbiAiQVMgSVMiIEJBU0lTLAovLyBXSVRIT1VUIFdBUlJBTlRJRVMgT1IgQ09ORElUSU9OUyBPRiBBTlkgS0lORCwgZWl0aGVyIGV4cHJlc3Mgb3IgaW1wbGllZC4KLy8gU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAovLyBsaW1pdGF0aW9ucyB1bmRlciB0aGUgTGljZW5zZS4KCi8qKgogKiBAZmlsZW92ZXJ2aWV3IEhlbHBlcnMgZm9yIGdvb2dsZS5jb2xhYiBQeXRob24gbW9kdWxlLgogKi8KKGZ1bmN0aW9uKHNjb3BlKSB7CmZ1bmN0aW9uIHNwYW4odGV4dCwgc3R5bGVBdHRyaWJ1dGVzID0ge30pIHsKICBjb25zdCBlbGVtZW50ID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnc3BhbicpOwogIGVsZW1lbnQudGV4dENvbnRlbnQgPSB0ZXh0OwogIGZvciAoY29uc3Qga2V5IG9mIE9iamVjdC5rZXlzKHN0eWxlQXR0
cmlidXRlcykpIHsKICAgIGVsZW1lbnQuc3R5bGVba2V5XSA9IHN0eWxlQXR0cmlidXRlc1trZXldOwogIH0KICByZXR1cm4gZWxlbWVudDsKfQoKLy8gTWF4IG51bWJlciBvZiBieXRlcyB3aGljaCB3aWxsIGJlIHVwbG9hZGVkIGF0IGEgdGltZS4KY29uc3QgTUFYX1BBWUxPQURfU0laRSA9IDEwMCAqIDEwMjQ7Ci8vIE1heCBhbW91bnQgb2YgdGltZSB0byBibG9jayB3YWl0aW5nIGZvciB0aGUgdXNlci4KY29uc3QgRklMRV9DSEFOR0VfVElNRU9VVF9NUyA9IDMwICogMTAwMDsKCmZ1bmN0aW9uIF91cGxvYWRGaWxlcyhpbnB1dElkLCBvdXRwdXRJZCkgewogIGNvbnN0IHN0ZXBzID0gdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKTsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIC8vIENhY2hlIHN0ZXBzIG9uIHRoZSBvdXRwdXRFbGVtZW50IHRvIG1ha2UgaXQgYXZhaWxhYmxlIGZvciB0aGUgbmV4dCBjYWxsCiAgLy8gdG8gdXBsb2FkRmlsZXNDb250aW51ZSBmcm9tIFB5dGhvbi4KICBvdXRwdXRFbGVtZW50LnN0ZXBzID0gc3RlcHM7CgogIHJldHVybiBfdXBsb2FkRmlsZXNDb250aW51ZShvdXRwdXRJZCk7Cn0KCi8vIFRoaXMgaXMgcm91Z2hseSBhbiBhc3luYyBnZW5lcmF0b3IgKG5vdCBzdXBwb3J0ZWQgaW4gdGhlIGJyb3dzZXIgeWV0KSwKLy8gd2hlcmUgdGhlcmUgYXJlIG11bHRpcGxlIGFzeW5jaHJvbm91cyBzdGVwcyBhbmQgdGhlIFB5dGhvbiBzaWRlIGlzIGdvaW5nCi8vIHRvIHBvbGwgZm9yIGNvbXBsZXRpb24gb2YgZWFjaCBzdGVwLgovLyBUaGlzIHVzZXMgYSBQcm9taXNlIHRvIGJsb2NrIHRoZSBweXRob24gc2lkZSBvbiBjb21wbGV0aW9uIG9mIGVhY2ggc3RlcCwKLy8gdGhlbiBwYXNzZXMgdGhlIHJlc3VsdCBvZiB0aGUgcHJldmlvdXMgc3RlcCBhcyB0aGUgaW5wdXQgdG8gdGhlIG5leHQgc3RlcC4KZnVuY3Rpb24gX3VwbG9hZEZpbGVzQ29udGludWUob3V0cHV0SWQpIHsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIGNvbnN0IHN0ZXBzID0gb3V0cHV0RWxlbWVudC5zdGVwczsKCiAgY29uc3QgbmV4dCA9IHN0ZXBzLm5leHQob3V0cHV0RWxlbWVudC5sYXN0UHJvbWlzZVZhbHVlKTsKICByZXR1cm4gUHJvbWlzZS5yZXNvbHZlKG5leHQudmFsdWUucHJvbWlzZSkudGhlbigodmFsdWUpID0+IHsKICAgIC8vIENhY2hlIHRoZSBsYXN0IHByb21pc2UgdmFsdWUgdG8gbWFrZSBpdCBhdmFpbGFibGUgdG8gdGhlIG5leHQKICAgIC8vIHN0ZXAgb2YgdGhlIGdlbmVyYXRvci4KICAgIG91dHB1dEVsZW1lbnQubGFzdFByb21pc2VWYWx1ZSA9IHZhbHVlOwogICAgcmV0dXJuIG5leHQudmFsdWUucmVzcG9uc2U7CiAgfSk7Cn0KCi8qKgogKiBHZW5lcmF0b3IgZnVuY3Rpb24gd2hpY2ggaXMgY2FsbGVkIGJldHdlZW4gZWFjaCBhc3luYyBzdGVwIG9mIHRoZSB1cGxvYWQKICogcHJvY2Vz
cy4KICogQHBhcmFtIHtzdHJpbmd9IGlucHV0SWQgRWxlbWVudCBJRCBvZiB0aGUgaW5wdXQgZmlsZSBwaWNrZXIgZWxlbWVudC4KICogQHBhcmFtIHtzdHJpbmd9IG91dHB1dElkIEVsZW1lbnQgSUQgb2YgdGhlIG91dHB1dCBkaXNwbGF5LgogKiBAcmV0dXJuIHshSXRlcmFibGU8IU9iamVjdD59IEl0ZXJhYmxlIG9mIG5leHQgc3RlcHMuCiAqLwpmdW5jdGlvbiogdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKSB7CiAgY29uc3QgaW5wdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQoaW5wdXRJZCk7CiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gZmFsc2U7CgogIGNvbnN0IG91dHB1dEVsZW1lbnQgPSBkb2N1bWVudC5nZXRFbGVtZW50QnlJZChvdXRwdXRJZCk7CiAgb3V0cHV0RWxlbWVudC5pbm5lckhUTUwgPSAnJzsKCiAgY29uc3QgcGlja2VkUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBpbnB1dEVsZW1lbnQuYWRkRXZlbnRMaXN0ZW5lcignY2hhbmdlJywgKGUpID0+IHsKICAgICAgcmVzb2x2ZShlLnRhcmdldC5maWxlcyk7CiAgICB9KTsKICB9KTsKCiAgY29uc3QgY2FuY2VsID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnYnV0dG9uJyk7CiAgaW5wdXRFbGVtZW50LnBhcmVudEVsZW1lbnQuYXBwZW5kQ2hpbGQoY2FuY2VsKTsKICBjYW5jZWwudGV4dENvbnRlbnQgPSAnQ2FuY2VsIHVwbG9hZCc7CiAgY29uc3QgY2FuY2VsUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBjYW5jZWwub25jbGljayA9ICgpID0+IHsKICAgICAgcmVzb2x2ZShudWxsKTsKICAgIH07CiAgfSk7CgogIC8vIENhbmNlbCB1cGxvYWQgaWYgdXNlciBoYXNuJ3QgcGlja2VkIGFueXRoaW5nIGluIHRpbWVvdXQuCiAgY29uc3QgdGltZW91dFByb21pc2UgPSBuZXcgUHJvbWlzZSgocmVzb2x2ZSkgPT4gewogICAgc2V0VGltZW91dCgoKSA9PiB7CiAgICAgIHJlc29sdmUobnVsbCk7CiAgICB9LCBGSUxFX0NIQU5HRV9USU1FT1VUX01TKTsKICB9KTsKCiAgLy8gV2FpdCBmb3IgdGhlIHVzZXIgdG8gcGljayB0aGUgZmlsZXMuCiAgY29uc3QgZmlsZXMgPSB5aWVsZCB7CiAgICBwcm9taXNlOiBQcm9taXNlLnJhY2UoW3BpY2tlZFByb21pc2UsIHRpbWVvdXRQcm9taXNlLCBjYW5jZWxQcm9taXNlXSksCiAgICByZXNwb25zZTogewogICAgICBhY3Rpb246ICdzdGFydGluZycsCiAgICB9CiAgfTsKCiAgaWYgKCFmaWxlcykgewogICAgcmV0dXJuIHsKICAgICAgcmVzcG9uc2U6IHsKICAgICAgICBhY3Rpb246ICdjb21wbGV0ZScsCiAgICAgIH0KICAgIH07CiAgfQoKICBjYW5jZWwucmVtb3ZlKCk7CgogIC8vIERpc2FibGUgdGhlIGlucHV0IGVsZW1lbnQgc2luY2UgZnVydGhlciBwaWNrcyBhcmUgbm90IGFsbG93ZWQuCiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gdHJ1ZTsKCiAgZm9yIChjb25zdCBmaWxlIG9mIGZpbGVzKSB7CiAgICBjb25zdCBsaSA9IGRvY3VtZW50LmNyZWF0
ZUVsZW1lbnQoJ2xpJyk7CiAgICBsaS5hcHBlbmQoc3BhbihmaWxlLm5hbWUsIHtmb250V2VpZ2h0OiAnYm9sZCd9KSk7CiAgICBsaS5hcHBlbmQoc3BhbigKICAgICAgICBgKCR7ZmlsZS50eXBlIHx8ICduL2EnfSkgLSAke2ZpbGUuc2l6ZX0gYnl0ZXMsIGAgKwogICAgICAgIGBsYXN0IG1vZGlmaWVkOiAkewogICAgICAgICAgICBmaWxlLmxhc3RNb2RpZmllZERhdGUgPyBmaWxlLmxhc3RNb2RpZmllZERhdGUudG9Mb2NhbGVEYXRlU3RyaW5nKCkgOgogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAnbi9hJ30gLSBgKSk7CiAgICBjb25zdCBwZXJjZW50ID0gc3BhbignMCUgZG9uZScpOwogICAgbGkuYXBwZW5kQ2hpbGQocGVyY2VudCk7CgogICAgb3V0cHV0RWxlbWVudC5hcHBlbmRDaGlsZChsaSk7CgogICAgY29uc3QgZmlsZURhdGFQcm9taXNlID0gbmV3IFByb21pc2UoKHJlc29sdmUpID0+IHsKICAgICAgY29uc3QgcmVhZGVyID0gbmV3IEZpbGVSZWFkZXIoKTsKICAgICAgcmVhZGVyLm9ubG9hZCA9IChlKSA9PiB7CiAgICAgICAgcmVzb2x2ZShlLnRhcmdldC5yZXN1bHQpOwogICAgICB9OwogICAgICByZWFkZXIucmVhZEFzQXJyYXlCdWZmZXIoZmlsZSk7CiAgICB9KTsKICAgIC8vIFdhaXQgZm9yIHRoZSBkYXRhIHRvIGJlIHJlYWR5LgogICAgbGV0IGZpbGVEYXRhID0geWllbGQgewogICAgICBwcm9taXNlOiBmaWxlRGF0YVByb21pc2UsCiAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgYWN0aW9uOiAnY29udGludWUnLAogICAgICB9CiAgICB9OwoKICAgIC8vIFVzZSBhIGNodW5rZWQgc2VuZGluZyB0byBhdm9pZCBtZXNzYWdlIHNpemUgbGltaXRzLiBTZWUgYi82MjExNTY2MC4KICAgIGxldCBwb3NpdGlvbiA9IDA7CiAgICB3aGlsZSAocG9zaXRpb24gPCBmaWxlRGF0YS5ieXRlTGVuZ3RoKSB7CiAgICAgIGNvbnN0IGxlbmd0aCA9IE1hdGgubWluKGZpbGVEYXRhLmJ5dGVMZW5ndGggLSBwb3NpdGlvbiwgTUFYX1BBWUxPQURfU0laRSk7CiAgICAgIGNvbnN0IGNodW5rID0gbmV3IFVpbnQ4QXJyYXkoZmlsZURhdGEsIHBvc2l0aW9uLCBsZW5ndGgpOwogICAgICBwb3NpdGlvbiArPSBsZW5ndGg7CgogICAgICBjb25zdCBiYXNlNjQgPSBidG9hKFN0cmluZy5mcm9tQ2hhckNvZGUuYXBwbHkobnVsbCwgY2h1bmspKTsKICAgICAgeWllbGQgewogICAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgICBhY3Rpb246ICdhcHBlbmQnLAogICAgICAgICAgZmlsZTogZmlsZS5uYW1lLAogICAgICAgICAgZGF0YTogYmFzZTY0LAogICAgICAgIH0sCiAgICAgIH07CiAgICAgIHBlcmNlbnQudGV4dENvbnRlbnQgPQogICAgICAgICAgYCR7TWF0aC5yb3VuZCgocG9zaXRpb24gLyBmaWxlRGF0YS5ieXRlTGVuZ3RoKSAqIDEwMCl9JSBkb25lYDsKICAgIH0KICB9CgogIC8vIEFsbCBkb25lLgogIHlpZWxkIHsKICAgIHJlc3BvbnNlOiB7CiAgICAgIGFjdGlvbjogJ2NvbXBsZXRlJywKICAgIH0KICB9Owp9CgpzY29wZS5nb29n
bGUgPSBzY29wZS5nb29nbGUgfHwge307CnNjb3BlLmdvb2dsZS5jb2xhYiA9IHNjb3BlLmdvb2dsZS5jb2xhYiB8fCB7fTsKc2NvcGUuZ29vZ2xlLmNvbGFiLl9maWxlcyA9IHsKICBfdXBsb2FkRmlsZXMsCiAgX3VwbG9hZEZpbGVzQ29udGludWUsCn07Cn0pKHNlbGYpOwo=", + "ok": true, + "headers": [ + [ + "content-type", + "application/javascript" + ] + ], + "status": 200, + "status_text": "" + } + }, + "base_uri": "https://localhost:8080/", + "height": 602 + }, + "outputId": "4d3628f9-e5be-4145-f550-8eaffca97d37" + }, + "cell_type": "code", + "source": [ + "#@title Install AutoML Tables client library { vertical-output: true }\n", + "\n", + "!pip install google-cloud-automl" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "eVFsPPEociwF" + }, + "cell_type": "markdown", + "source": [ + "### Authenticate using service account key\n", + "Run the following cell. Click on the 'Choose Files' button and select the service account private key file. If your Service Account key file or folder is hidden, you can reveal it in a Mac by pressing the Command + Shift + . combo." 
+ ] + }, + { + "metadata": { + "id": "u-kCqysAuaJk", + "colab_type": "code", + "colab": { + "resources": { + "http://localhost:8080/nbextensions/google.colab/files.js": { + "data": "Ly8gQ29weXJpZ2h0IDIwMTcgR29vZ2xlIExMQwovLwovLyBMaWNlbnNlZCB1bmRlciB0aGUgQXBhY2hlIExpY2Vuc2UsIFZlcnNpb24gMi4wICh0aGUgIkxpY2Vuc2UiKTsKLy8geW91IG1heSBub3QgdXNlIHRoaXMgZmlsZSBleGNlcHQgaW4gY29tcGxpYW5jZSB3aXRoIHRoZSBMaWNlbnNlLgovLyBZb3UgbWF5IG9idGFpbiBhIGNvcHkgb2YgdGhlIExpY2Vuc2UgYXQKLy8KLy8gICAgICBodHRwOi8vd3d3LmFwYWNoZS5vcmcvbGljZW5zZXMvTElDRU5TRS0yLjAKLy8KLy8gVW5sZXNzIHJlcXVpcmVkIGJ5IGFwcGxpY2FibGUgbGF3IG9yIGFncmVlZCB0byBpbiB3cml0aW5nLCBzb2Z0d2FyZQovLyBkaXN0cmlidXRlZCB1bmRlciB0aGUgTGljZW5zZSBpcyBkaXN0cmlidXRlZCBvbiBhbiAiQVMgSVMiIEJBU0lTLAovLyBXSVRIT1VUIFdBUlJBTlRJRVMgT1IgQ09ORElUSU9OUyBPRiBBTlkgS0lORCwgZWl0aGVyIGV4cHJlc3Mgb3IgaW1wbGllZC4KLy8gU2VlIHRoZSBMaWNlbnNlIGZvciB0aGUgc3BlY2lmaWMgbGFuZ3VhZ2UgZ292ZXJuaW5nIHBlcm1pc3Npb25zIGFuZAovLyBsaW1pdGF0aW9ucyB1bmRlciB0aGUgTGljZW5zZS4KCi8qKgogKiBAZmlsZW92ZXJ2aWV3IEhlbHBlcnMgZm9yIGdvb2dsZS5jb2xhYiBQeXRob24gbW9kdWxlLgogKi8KKGZ1bmN0aW9uKHNjb3BlKSB7CmZ1bmN0aW9uIHNwYW4odGV4dCwgc3R5bGVBdHRyaWJ1dGVzID0ge30pIHsKICBjb25zdCBlbGVtZW50ID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnc3BhbicpOwogIGVsZW1lbnQudGV4dENvbnRlbnQgPSB0ZXh0OwogIGZvciAoY29uc3Qga2V5IG9mIE9iamVjdC5rZXlzKHN0eWxlQXR0cmlidXRlcykpIHsKICAgIGVsZW1lbnQuc3R5bGVba2V5XSA9IHN0eWxlQXR0cmlidXRlc1trZXldOwogIH0KICByZXR1cm4gZWxlbWVudDsKfQoKLy8gTWF4IG51bWJlciBvZiBieXRlcyB3aGljaCB3aWxsIGJlIHVwbG9hZGVkIGF0IGEgdGltZS4KY29uc3QgTUFYX1BBWUxPQURfU0laRSA9IDEwMCAqIDEwMjQ7Ci8vIE1heCBhbW91bnQgb2YgdGltZSB0byBibG9jayB3YWl0aW5nIGZvciB0aGUgdXNlci4KY29uc3QgRklMRV9DSEFOR0VfVElNRU9VVF9NUyA9IDMwICogMTAwMDsKCmZ1bmN0aW9uIF91cGxvYWRGaWxlcyhpbnB1dElkLCBvdXRwdXRJZCkgewogIGNvbnN0IHN0ZXBzID0gdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKTsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIC8vIENhY2hlIHN0ZXBzIG9uIHRoZSBvdXRwdXRFbGVtZW50IHRvIG1ha2UgaXQgYXZhaWxhYmxlIGZvciB0aGUgbmV4dCBjYWxsCiAgLy8gdG8gdXBsb2FkRmlsZXNDb2
50aW51ZSBmcm9tIFB5dGhvbi4KICBvdXRwdXRFbGVtZW50LnN0ZXBzID0gc3RlcHM7CgogIHJldHVybiBfdXBsb2FkRmlsZXNDb250aW51ZShvdXRwdXRJZCk7Cn0KCi8vIFRoaXMgaXMgcm91Z2hseSBhbiBhc3luYyBnZW5lcmF0b3IgKG5vdCBzdXBwb3J0ZWQgaW4gdGhlIGJyb3dzZXIgeWV0KSwKLy8gd2hlcmUgdGhlcmUgYXJlIG11bHRpcGxlIGFzeW5jaHJvbm91cyBzdGVwcyBhbmQgdGhlIFB5dGhvbiBzaWRlIGlzIGdvaW5nCi8vIHRvIHBvbGwgZm9yIGNvbXBsZXRpb24gb2YgZWFjaCBzdGVwLgovLyBUaGlzIHVzZXMgYSBQcm9taXNlIHRvIGJsb2NrIHRoZSBweXRob24gc2lkZSBvbiBjb21wbGV0aW9uIG9mIGVhY2ggc3RlcCwKLy8gdGhlbiBwYXNzZXMgdGhlIHJlc3VsdCBvZiB0aGUgcHJldmlvdXMgc3RlcCBhcyB0aGUgaW5wdXQgdG8gdGhlIG5leHQgc3RlcC4KZnVuY3Rpb24gX3VwbG9hZEZpbGVzQ29udGludWUob3V0cHV0SWQpIHsKICBjb25zdCBvdXRwdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQob3V0cHV0SWQpOwogIGNvbnN0IHN0ZXBzID0gb3V0cHV0RWxlbWVudC5zdGVwczsKCiAgY29uc3QgbmV4dCA9IHN0ZXBzLm5leHQob3V0cHV0RWxlbWVudC5sYXN0UHJvbWlzZVZhbHVlKTsKICByZXR1cm4gUHJvbWlzZS5yZXNvbHZlKG5leHQudmFsdWUucHJvbWlzZSkudGhlbigodmFsdWUpID0+IHsKICAgIC8vIENhY2hlIHRoZSBsYXN0IHByb21pc2UgdmFsdWUgdG8gbWFrZSBpdCBhdmFpbGFibGUgdG8gdGhlIG5leHQKICAgIC8vIHN0ZXAgb2YgdGhlIGdlbmVyYXRvci4KICAgIG91dHB1dEVsZW1lbnQubGFzdFByb21pc2VWYWx1ZSA9IHZhbHVlOwogICAgcmV0dXJuIG5leHQudmFsdWUucmVzcG9uc2U7CiAgfSk7Cn0KCi8qKgogKiBHZW5lcmF0b3IgZnVuY3Rpb24gd2hpY2ggaXMgY2FsbGVkIGJldHdlZW4gZWFjaCBhc3luYyBzdGVwIG9mIHRoZSB1cGxvYWQKICogcHJvY2Vzcy4KICogQHBhcmFtIHtzdHJpbmd9IGlucHV0SWQgRWxlbWVudCBJRCBvZiB0aGUgaW5wdXQgZmlsZSBwaWNrZXIgZWxlbWVudC4KICogQHBhcmFtIHtzdHJpbmd9IG91dHB1dElkIEVsZW1lbnQgSUQgb2YgdGhlIG91dHB1dCBkaXNwbGF5LgogKiBAcmV0dXJuIHshSXRlcmFibGU8IU9iamVjdD59IEl0ZXJhYmxlIG9mIG5leHQgc3RlcHMuCiAqLwpmdW5jdGlvbiogdXBsb2FkRmlsZXNTdGVwKGlucHV0SWQsIG91dHB1dElkKSB7CiAgY29uc3QgaW5wdXRFbGVtZW50ID0gZG9jdW1lbnQuZ2V0RWxlbWVudEJ5SWQoaW5wdXRJZCk7CiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gZmFsc2U7CgogIGNvbnN0IG91dHB1dEVsZW1lbnQgPSBkb2N1bWVudC5nZXRFbGVtZW50QnlJZChvdXRwdXRJZCk7CiAgb3V0cHV0RWxlbWVudC5pbm5lckhUTUwgPSAnJzsKCiAgY29uc3QgcGlja2VkUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBpbnB1dEVsZW1lbnQuYWRkRXZlbnRMaXN0ZW5lcignY2hhbmdlJywgKGUpID
0+IHsKICAgICAgcmVzb2x2ZShlLnRhcmdldC5maWxlcyk7CiAgICB9KTsKICB9KTsKCiAgY29uc3QgY2FuY2VsID0gZG9jdW1lbnQuY3JlYXRlRWxlbWVudCgnYnV0dG9uJyk7CiAgaW5wdXRFbGVtZW50LnBhcmVudEVsZW1lbnQuYXBwZW5kQ2hpbGQoY2FuY2VsKTsKICBjYW5jZWwudGV4dENvbnRlbnQgPSAnQ2FuY2VsIHVwbG9hZCc7CiAgY29uc3QgY2FuY2VsUHJvbWlzZSA9IG5ldyBQcm9taXNlKChyZXNvbHZlKSA9PiB7CiAgICBjYW5jZWwub25jbGljayA9ICgpID0+IHsKICAgICAgcmVzb2x2ZShudWxsKTsKICAgIH07CiAgfSk7CgogIC8vIENhbmNlbCB1cGxvYWQgaWYgdXNlciBoYXNuJ3QgcGlja2VkIGFueXRoaW5nIGluIHRpbWVvdXQuCiAgY29uc3QgdGltZW91dFByb21pc2UgPSBuZXcgUHJvbWlzZSgocmVzb2x2ZSkgPT4gewogICAgc2V0VGltZW91dCgoKSA9PiB7CiAgICAgIHJlc29sdmUobnVsbCk7CiAgICB9LCBGSUxFX0NIQU5HRV9USU1FT1VUX01TKTsKICB9KTsKCiAgLy8gV2FpdCBmb3IgdGhlIHVzZXIgdG8gcGljayB0aGUgZmlsZXMuCiAgY29uc3QgZmlsZXMgPSB5aWVsZCB7CiAgICBwcm9taXNlOiBQcm9taXNlLnJhY2UoW3BpY2tlZFByb21pc2UsIHRpbWVvdXRQcm9taXNlLCBjYW5jZWxQcm9taXNlXSksCiAgICByZXNwb25zZTogewogICAgICBhY3Rpb246ICdzdGFydGluZycsCiAgICB9CiAgfTsKCiAgaWYgKCFmaWxlcykgewogICAgcmV0dXJuIHsKICAgICAgcmVzcG9uc2U6IHsKICAgICAgICBhY3Rpb246ICdjb21wbGV0ZScsCiAgICAgIH0KICAgIH07CiAgfQoKICBjYW5jZWwucmVtb3ZlKCk7CgogIC8vIERpc2FibGUgdGhlIGlucHV0IGVsZW1lbnQgc2luY2UgZnVydGhlciBwaWNrcyBhcmUgbm90IGFsbG93ZWQuCiAgaW5wdXRFbGVtZW50LmRpc2FibGVkID0gdHJ1ZTsKCiAgZm9yIChjb25zdCBmaWxlIG9mIGZpbGVzKSB7CiAgICBjb25zdCBsaSA9IGRvY3VtZW50LmNyZWF0ZUVsZW1lbnQoJ2xpJyk7CiAgICBsaS5hcHBlbmQoc3BhbihmaWxlLm5hbWUsIHtmb250V2VpZ2h0OiAnYm9sZCd9KSk7CiAgICBsaS5hcHBlbmQoc3BhbigKICAgICAgICBgKCR7ZmlsZS50eXBlIHx8ICduL2EnfSkgLSAke2ZpbGUuc2l6ZX0gYnl0ZXMsIGAgKwogICAgICAgIGBsYXN0IG1vZGlmaWVkOiAkewogICAgICAgICAgICBmaWxlLmxhc3RNb2RpZmllZERhdGUgPyBmaWxlLmxhc3RNb2RpZmllZERhdGUudG9Mb2NhbGVEYXRlU3RyaW5nKCkgOgogICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAgICAnbi9hJ30gLSBgKSk7CiAgICBjb25zdCBwZXJjZW50ID0gc3BhbignMCUgZG9uZScpOwogICAgbGkuYXBwZW5kQ2hpbGQocGVyY2VudCk7CgogICAgb3V0cHV0RWxlbWVudC5hcHBlbmRDaGlsZChsaSk7CgogICAgY29uc3QgZmlsZURhdGFQcm9taXNlID0gbmV3IFByb21pc2UoKHJlc29sdmUpID0+IHsKICAgICAgY29uc3QgcmVhZGVyID0gbmV3IEZpbGVSZWFkZXIoKTsKICAgICAgcmVhZGVyLm9ubG9hZCA9IC
hlKSA9PiB7CiAgICAgICAgcmVzb2x2ZShlLnRhcmdldC5yZXN1bHQpOwogICAgICB9OwogICAgICByZWFkZXIucmVhZEFzQXJyYXlCdWZmZXIoZmlsZSk7CiAgICB9KTsKICAgIC8vIFdhaXQgZm9yIHRoZSBkYXRhIHRvIGJlIHJlYWR5LgogICAgbGV0IGZpbGVEYXRhID0geWllbGQgewogICAgICBwcm9taXNlOiBmaWxlRGF0YVByb21pc2UsCiAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgYWN0aW9uOiAnY29udGludWUnLAogICAgICB9CiAgICB9OwoKICAgIC8vIFVzZSBhIGNodW5rZWQgc2VuZGluZyB0byBhdm9pZCBtZXNzYWdlIHNpemUgbGltaXRzLiBTZWUgYi82MjExNTY2MC4KICAgIGxldCBwb3NpdGlvbiA9IDA7CiAgICB3aGlsZSAocG9zaXRpb24gPCBmaWxlRGF0YS5ieXRlTGVuZ3RoKSB7CiAgICAgIGNvbnN0IGxlbmd0aCA9IE1hdGgubWluKGZpbGVEYXRhLmJ5dGVMZW5ndGggLSBwb3NpdGlvbiwgTUFYX1BBWUxPQURfU0laRSk7CiAgICAgIGNvbnN0IGNodW5rID0gbmV3IFVpbnQ4QXJyYXkoZmlsZURhdGEsIHBvc2l0aW9uLCBsZW5ndGgpOwogICAgICBwb3NpdGlvbiArPSBsZW5ndGg7CgogICAgICBjb25zdCBiYXNlNjQgPSBidG9hKFN0cmluZy5mcm9tQ2hhckNvZGUuYXBwbHkobnVsbCwgY2h1bmspKTsKICAgICAgeWllbGQgewogICAgICAgIHJlc3BvbnNlOiB7CiAgICAgICAgICBhY3Rpb246ICdhcHBlbmQnLAogICAgICAgICAgZmlsZTogZmlsZS5uYW1lLAogICAgICAgICAgZGF0YTogYmFzZTY0LAogICAgICAgIH0sCiAgICAgIH07CiAgICAgIHBlcmNlbnQudGV4dENvbnRlbnQgPQogICAgICAgICAgYCR7TWF0aC5yb3VuZCgocG9zaXRpb24gLyBmaWxlRGF0YS5ieXRlTGVuZ3RoKSAqIDEwMCl9JSBkb25lYDsKICAgIH0KICB9CgogIC8vIEFsbCBkb25lLgogIHlpZWxkIHsKICAgIHJlc3BvbnNlOiB7CiAgICAgIGFjdGlvbjogJ2NvbXBsZXRlJywKICAgIH0KICB9Owp9CgpzY29wZS5nb29nbGUgPSBzY29wZS5nb29nbGUgfHwge307CnNjb3BlLmdvb2dsZS5jb2xhYiA9IHNjb3BlLmdvb2dsZS5jb2xhYiB8fCB7fTsKc2NvcGUuZ29vZ2xlLmNvbGFiLl9maWxlcyA9IHsKICBfdXBsb2FkRmlsZXMsCiAgX3VwbG9hZEZpbGVzQ29udGludWUsCn07Cn0pKHNlbGYpOwo=", + "ok": true, + "headers": [ + [ + "content-type", + "application/javascript" + ] + ], + "status": 200, + "status_text": "" + } + }, + "base_uri": "https://localhost:8080/", + "height": 71 + }, + "outputId": "06154a63-f410-435f-b565-cd1599243b88" + }, + "cell_type": "code", + "source": [ + "#@title Authenticate using service account key and create a client. 
{ vertical-output: true }\n", + "\n", + "from google.cloud import automl_v1beta1\n", + "# google.colab provides the files helper used below to upload the key file.\n", + "from google.colab import files\n", + "\n", + "# Upload service account key\n", + "keyfile_upload = files.upload()\n", + "keyfile_name = list(keyfile_upload.keys())[0]\n", + "# Authenticate and create an AutoML client.\n", + "client = automl_v1beta1.AutoMlClient.from_service_account_file(keyfile_name)\n", + "# Authenticate and create a prediction service client.\n", + "prediction_client = automl_v1beta1.PredictionServiceClient.from_service_account_file(keyfile_name)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "s3F2xbEJdDvN" + }, + "cell_type": "markdown", + "source": [ + "### Set Project and Location" + ] + }, + { + "metadata": { + "id": "0uX4aJYUiXh5", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Enter your GCP project ID." + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "6R4h5HF1Dtds", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "1e049b34-4683-4755-ab08-aec08de2bc66" + }, + "cell_type": "code", + "source": [ + "#@title GCP project ID and location\n", + "\n", + "project_id = 'energy-forecasting' #@param {type:'string'}\n", + "location = 'us-central1' #@param {type:'string'}\n", + "location_path = client.location_path(project_id, location)\n", + "location_path" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "qozQWMnOu48y" + }, + "cell_type": "markdown", + "source": [ + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "ODt86YuVDZzm" + }, + "cell_type": "markdown", + "source": [ + "## 3. 
Import training data" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "XwjZc9Q62Fm5" + }, + "cell_type": "markdown", + "source": [ + "### Create dataset" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "_JfZFGSceyE_" + }, + "cell_type": "markdown", + "source": [ + "Select a dataset display name and pass your table source information to create a new dataset." + ] + }, + { + "metadata": { + "id": "Z_JErW3cw-0J", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 224 + }, + "outputId": "7fe366df-73ae-4ab1-ceaa-fd6ced4ccdd9" + }, + "cell_type": "code", + "source": [ + "#@title Create dataset { vertical-output: true, output-height: 200 }\n", + "\n", + "dataset_display_name = 'energy_forecasting_solution' #@param {type: 'string'}\n", + "\n", + "create_dataset_response = client.create_dataset(\n", + " location_path,\n", + " {'display_name': dataset_display_name, 'tables_dataset_metadata': {}})\n", + "dataset_name = create_dataset_response.name\n", + "create_dataset_response" + ], + "execution_count":0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "35YZ9dy34VqJ" + }, + "cell_type": "markdown", + "source": [ + "### Import data" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "3c0o15gVREAw" + }, + "cell_type": "markdown", + "source": [ + "You can import your data to AutoML Tables from GCS or BigQuery. For this tutorial, you can use the [iris dataset](https://storage.cloud.google.com/rostam-193618-tutorial/automl-tables-v1beta1/iris.csv) as your training data. You can create a GCS bucket and upload the data into your bucket. The URI for your file is `gs://BUCKET_NAME/FOLDER_NAME1/FOLDER_NAME2/.../FILE_NAME`. Alternatively you can create a BigQuery table and upload the data into the table. The URI for your table is `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", + "\n", + "Importing data may take a few minutes or hours depending on the size of your data. 
If your Colab times out, run the following command to retrieve your dataset. Replace `dataset_name` with its actual value obtained in the preceding cells.\n", + "\n", + " dataset = client.get_dataset(dataset_name)" + ] + }, + { + "metadata": { + "id": "bB_GdeqCJW5i", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Datasource in BigQuery { vertical-output: true }\n", + "\n", + "dataset_bq_input_uri = 'bq://energy-forecasting.Energy.automldata' #@param {type: 'string'}\n", + "# Define input configuration.\n", + "input_config = {\n", + " 'bigquery_source': {\n", + " 'input_uri': dataset_bq_input_uri\n", + " }\n", + "}" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "FNVYfpoXJsNB", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 139 + }, + "outputId": "0ecc8d11-5bf1-4c2e-f688-b6d9be934e3c" + }, + "cell_type": "code", + "source": [ + " #@title Import data { vertical-output: true }\n", + "\n", + "import_data_response = client.import_data(dataset_name, input_config)\n", + "print('Dataset import operation: {}'.format(import_data_response.operation))\n", + "# Wait until import is done.\n", + "import_data_result = import_data_response.result()\n", + "import_data_result" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "QdxBI4s44ZRI", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Review the specs" + ] + }, + { + "metadata": { + "id": "RC0PWKqH4jwr", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Run the following command to see table specs such as row count." 
+ ] + }, + { + "metadata": { + "id": "v2Vzq_gwXxo-", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 3247 + }, + "outputId": "c89cd7b1-4344-46d9-c4a3-1b012b5b720d" + }, + "cell_type": "code", + "source": [ + "#@title Table schema { vertical-output: true }\n", + "\n", + "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n", + "\n", + "# List table specs\n", + "list_table_specs_response = client.list_table_specs(dataset_name)\n", + "table_specs = [s for s in list_table_specs_response]\n", + "# List column specs\n", + "table_spec_name = table_specs[0].name\n", + "list_column_specs_response = client.list_column_specs(table_spec_name)\n", + "column_specs = {s.display_name: s for s in list_column_specs_response}\n", + "[(x, data_types.TypeCode.Name(\n", + " column_specs[x].data_type.type_code)) for x in column_specs.keys()]" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "vcJP7xoq4yAJ", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Run the following command to see column specs such inferred schema." + ] + }, + { + "metadata": { + "id": "FNykW_YOYt6d", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "___" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "kNRVJqVOL8h3" + }, + "cell_type": "markdown", + "source": [ + "## 4. Update dataset: assign a label column and enable nullable columns" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "-57gehId9PQ5" + }, + "cell_type": "markdown", + "source": [ + "AutoML Tables automatically detects your data column type. For example, for the [Iris dataset](https://storage.cloud.google.com/rostam-193618-tutorial/automl-tables-v1beta1/iris.csv) it detects `species` to be categorical and `petal_length`, `petal_width`, `sepal_length`, and `sepal_width` to be numerical. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. 
If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema." + ] + }, + { + "metadata": { + "id": "iRqdQ7Xiq04x", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Update a column: set as categorical" + ] + }, + { + "metadata": { + "id": "OCEUIPKegWrf", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + }, + "outputId": "44370b2c-f3dc-46bc-cefd-8a6f29f9cabe" + }, + "cell_type": "code", + "source": [ + "#@title Update dataset { vertical-output: true }\n", + "\n", + "column_to_category = 'hour' #@param {type: 'string'}\n", + "\n", + "update_column_spec_dict = {\n", + " \"name\": column_specs[column_to_category].name,\n", + " \"data_type\": {\n", + " \"type_code\": \"CATEGORY\"\n", + " }\n", + "}\n", + "update_column_response = client.update_column_spec(update_column_spec_dict)\n", + "update_column_response.display_name , update_column_response.data_type \n" + ], + "execution_count":0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "nDMH_chybe4w" + }, + "cell_type": "markdown", + "source": [ + "### Update dataset: assign a label and split column" + ] + }, + { + "metadata": { + "id": "hVIruWg0u33t", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 360 + }, + "outputId": "eeb5f733-16ec-4191-ea59-c2fab30c8442" + }, + "cell_type": "code", + "source": [ + "#@title Update dataset { vertical-output: true }\n", + "\n", + "label_column_name = 'price' #@param {type: 'string'}\n", + "label_column_spec = column_specs[label_column_name]\n", + "label_column_id = label_column_spec.name.rsplit('/', 1)[-1]\n", + "print('Label column ID: {}'.format(label_column_id))\n", + "\n", + "split_column_name = 'split' #@param {type: 'string'}\n", + "split_column_spec = column_specs[split_column_name]\n", + "split_column_id = split_column_spec.name.rsplit('/', 
1)[-1]\n", + "print('Split column ID: {}'.format(split_column_id))\n", + "# Define the values of the fields to be updated.\n", + "update_dataset_dict = {\n", + " 'name': dataset_name,\n", + " 'tables_dataset_metadata': {\n", + " 'target_column_spec_id': label_column_id,\n", + " 'ml_use_column_spec_id': split_column_id,\n", + " }\n", + "}\n", + "update_dataset_response = client.update_dataset(update_dataset_dict)\n", + "update_dataset_response" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "z23NITLrcxmi", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "___" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "FcKgvj1-Tbgj" + }, + "cell_type": "markdown", + "source": [ + "## 5. Creating a model" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "Pnlk8vdQlO_k" + }, + "cell_type": "markdown", + "source": [ + "### Train a model\n", + "Specify the duration of the training. For example, `'train_budget_milli_node_hours': 1000` runs the training for one hour. If your Colab times out, use `client.list_models(location_path)` to check whether your model has been created. Then use model name to continue to the next steps. Run the following command to retrieve your model. 
Replace `model_name` with its actual value.\n", + "\n", + " model = client.get_model(model_name)" + ] + }, + { + "metadata": { + "id": "11izNd6Fu37N", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 139 + }, + "outputId": "1bca25aa-eb19-4b27-a3fa-7ef137aaf4e2" + }, + "cell_type": "code", + "source": [ + "#@title Create model { vertical-output: true }\n", + "\n", + "\n", + "\n", + "model_display_name = 'energy_model' #@param {type:'string'}\n", + "model_train_hours = 12 #@param {type:'integer'}\n", + "model_optimization_objective = 'MINIMIZE_MAE' #@param {type:'string'}\n", + "column_to_ignore = 'date_utc' #@param {type:'string'}\n", + "\n", + "# Create list of features to use\n", + "feat_list = list(column_specs.keys())\n", + "feat_list.remove(label_column_name)\n", + "feat_list.remove(split_column_name)\n", + "feat_list.remove(column_to_ignore)\n", + "\n", + "model_dict = {\n", + " 'display_name': model_display_name,\n", + " 'dataset_id': dataset_name.rsplit('/', 1)[-1],\n", + " 'tables_model_metadata': {\n", + " 'train_budget_milli_node_hours':model_train_hours * 1000,\n", + " 'optimization_objective': model_optimization_objective,\n", + " 'target_column_spec': column_specs[label_column_name],\n", + " 'input_feature_column_specs': [\n", + " column_specs[x] for x in feat_list]}\n", + " }\n", + " \n", + "create_model_response = client.create_model(location_path, model_dict)\n", + "print('Dataset import operation: {}'.format(create_model_response.operation))\n", + "# Wait until model training is done.\n", + "create_model_result = create_model_response.result()\n", + "model_name = create_model_result.name\n", + "create_model_result" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "puVew1GgPfQa", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 85 + }, + "outputId": "42b9296c-d231-4787-f7fb-4aa1a6ff9bd9" + }, + "cell_type": "code", + "source": [ + "#@title 
Model Metrics {vertical-output: true }\n", + "\n", + "metrics= [x for x in client.list_model_evaluations(model_name)][-1]\n", + "metrics.regression_evaluation_metrics" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "YQnfEwyrSt2T", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "![alt text](https://storage.googleapis.com/images_public/automl_test.png)" + ] + }, + { + "metadata": { + "id": "Vyc8ckbpRMHp", + "colab_type": "code", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 272 + }, + "outputId": "931d4921-2144-4092-dab6-165c1b1c2a88" + }, + "cell_type": "code", + "source": [ + "#@title Feature Importance {vertical-output: true }\n", + "\n", + "model = client.get_model(model_name)\n", + "feat_list = [(x.feature_importance, x.column_display_name) for x in model.tables_model_metadata.tables_model_column_info]\n", + "feat_list.sort(reverse=True)\n", + "feat_list[:15]" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "__2gDQ5I5gcj", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "![alt text](https://storage.googleapis.com/images_public/feature_importance.png)\n", + "![alt text](https://storage.googleapis.com/images_public/loc_portugal.png)\n", + "![alt text](https://storage.googleapis.com/images_public/weather_schema.png)\n", + "![alt text](https://storage.googleapis.com/images_public/training_schema.png)" + ] + }, + { + "metadata": { + "id": "1wS1is9IY5nK", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "___" + ] + } + ] +} \ No newline at end of file diff --git a/tables/automl/notebooks/purchase_prediction/README.md b/tables/automl/notebooks/purchase_prediction/README.md new file mode 100644 index 000000000000..b2dd7d271a5e --- /dev/null +++ b/tables/automl/notebooks/purchase_prediction/README.md @@ -0,0 +1,129 @@ +Copyright 2018 Google LLC + +Licensed under the Apache License, Version 2.0 (the "License");you may not 
use this file except in compliance with the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + + +Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + + +# Purchase Prediction using AutoML Tables +One of the most common use cases in marketing is to predict the likelihood of conversion. Conversion could be defined by the marketer as taking a certain action like making a purchase, signing up for a free trial, or subscribing to a newsletter. Knowing the likelihood that a marketing lead or prospect will ‘convert’ can enable the marketer to target the lead with the right marketing campaign. This could take the form of remarketing, targeted email campaigns, online offers, or other treatments. + +Here we demonstrate how you can use BigQuery and AutoML Tables to build a supervised binary classification model for purchase prediction. + +## Problem Description +The model uses a real dataset from the [Google Merchandise Store](https://www.googlemerchandisestore.com) consisting of Google Analytics web sessions. + +The goal is to predict the likelihood that a visitor to the online Google Merchandise Store makes a purchase during a given Google Analytics session. The user's past web interactions on the store website, along with information like browser details and geography, are used to make this prediction. + +This is framed as a binary classification model that labels a user during a session as either true (makes a purchase) or false (does not make a purchase). + +Dataset Details +The dataset consists of a set of tables corresponding to Google Analytics sessions tracked on the [Google Merchandise Store](https://www.googlemerchandisestore.com/). 
Each table is a single day of GA sessions. More details around the schema can be found [here](https://support.google.com/analytics/answer/3437719?hl=en&ref_topic=3416089). + +You can access the data on BigQuery [here](https://bigquery.cloud.google.com/dataset/bigquery-public-data:google_analytics_sample). + +## Solution Walkthrough +The solution has been developed using [Google Colab Notebook](https://colab.research.google.com/notebooks/welcome.ipynb). Here is the thought process and the specific steps that went into building the “Purchase Prediction with AutoML Tables” colab. The colab is broken into 7 parts; this write-up mirrors that structure. + +Before we dive in, a few housekeeping notes about setting up the colab. + + +Steps Involved + +### 1. Set up +The first step in this process is to set up the project. We referred to the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) and took the following steps: +* Create a Google Cloud Platform (GCP) project +* Enable billing +* Enable the AutoML API +* Enable the AutoML Tables API + +There are a few options for hosting the colab: the default hosted runtime, a local runtime, or a runtime hosted on a Virtual Machine (VM). + +##### Default Hosted Runtime: + +The hosted runtime is the simplest to use. It accesses a default VM already configured to host the colab notebook. Simply navigate to the upper right-hand corner, click on the connect drop-down box, and choose the option to “connect to hosted runtime”. + +##### Local Runtime: +The local runtime takes a bit more work. It involves installing Jupyter on your local machine, likely the desktop from which you access the colab. After installing Jupyter, you can connect to the local runtime, and the colab notebook will run off of your local machine. Detailed instructions can be found [here](https://research.google.com/colaboratory/local-runtimes.html). 
+ +##### VM hosted Runtime: +Finally, the runtime hosted on the VM requires the most set up, but gives you more control over the machine choice, allowing you to access machines with more memory and processing power. The instructions are similar to the steps taken for the local runtime, with one main distinction: the VM hosted runtime runs the colab notebook off of the VM, so you will need to set up everything on the VM rather than on your local machine. + +To achieve this, create a Compute Engine VM instance. Then make sure that the firewall is open to allow you to ssh into the VM. + +The firewall rules can be found in the VPC Network tab on the Cloud Console. Navigate into the firewall rules, and add a rule that allows ingress from your local IP address on tcp:22. To find your IP address, run the following command in your terminal: + +```curl -4 ifconfig.co``` + +Once your firewall rules are created, you should be able to ssh into your VM instance. To ssh, run the following command: + +```gcloud compute ssh --zone YOUR_ZONE YOUR_INSTANCE_NAME -- -L 8888:localhost:8888``` + +This will allow your local terminal to ssh into the VM instance you created, while simultaneously forwarding port 8888 between your local machine and the VM. Once in the VM, you can install Jupyter and open up a notebook as described in the instructions [here](https://research.google.com/colaboratory/local-runtimes.html), specifically steps 2 and 3. + +We recommend hosting using the VM for two main reasons: +1. The VM can be provisioned to be much more powerful than either your local machine or the default runtime allocated by the colab notebook. +2. The colab is currently configured to run on either your local machine or a VM. It requires you to install the AutoML client library and upload a service account key to the machine from which you are hosting the colab. 
These two actions can be done on the default hosted runtime, but would require a different set of instructions not detailed in this specific colab. To see them, refer to the AutoML Tables sample colab found in the tutorials section of the [documentation](https://cloud.google.com/automl-tables/docs/), specifically step 2. + + +### 2. Initialize and authenticate +The client library installation is entirely self-explanatory in the colab. + +The authentication process is only slightly more complex: run the second code block entitled "Authenticate using service account key and create a client" and then upload the service account key you created in the set up step. + We would also recommend setting a global environment variable (the path below is a placeholder for your own key file): + +```export GOOGLE_APPLICATION_CREDENTIALS=path/to/your/service_account_key.json``` + +Be sure to export it whenever you boot up a new session. + + +### 3. Data Cleaning and Transformation +This step was by far the most involved. It includes a few sections that create an AutoML Tables dataset, pull the Google Merchandise Store data from BigQuery, transform the data, and save it multiple times to CSV files in Google Cloud Storage. + +This dataset is viewable in the AutoML Tables UI. It will eventually hold the training data after that training data is cleaned and transformed. + +Only around 1% of the rows in this dataset have a positive label value of True, i.e. sessions in which a transaction was made. This is a class imbalance problem. There are several ways to handle class imbalance. We chose to oversample the positive class by random over-sampling, which artificially increases the number of sessions with a True transaction label. + +There were also many columns with either all missing or all constant values. These columns would not add any signal to our model, so we dropped them. + +There were also columns with NaN rather than 0 values. For instance, rather than having a count of 0, a column might have a null value. 
So we added code to change some of these null values to 0, specifically in our target column, in which null values were not allowed by AutoML Tables. However, AutoML Tables can handle null values for the features. + +### 4. Feature Engineering + +The dataset had rich information on customer location and behavior; however, we thought it could be improved by performing feature engineering. Moreover, there was a concern about data leakage. The decision to do feature engineering, therefore, had two contributing motivations: to remove data leakage without too much loss of useful data, and to improve the signal in our data. + + +#### 4.1 Weekdays + +The date seemed like a useful piece of information to include, as it could capture seasonal effects. Unfortunately, we only had one year of data, so seasonality on an annual scale would be difficult (read: impossible) to incorporate. Fortunately, we could try to detect seasonal effects on a micro scale, with perhaps equally informative results. We ended up creating a new column of weekdays out of dates, to denote which day of the week the session fell on. This new feature turned out to have some useful predictive power when added as a variable into our model. + +#### 4.2 Data Leakage +The marginal gain from adding a weekday feature was overshadowed by the concern of data leakage in our training data. In the initial naive models we trained, we got outstanding results, so outstanding that we knew something must be going on. As it turned out, quite a few features functioned as proxies for the feature we were trying to predict: meaning some of the features we conditioned on to build the model had an almost 1:1 correlation with the target feature. Intuitively, this made sense. + +One feature that exhibited this behavior was the number of page views a customer made during a session. By conditioning on page views in a session, we could very reliably predict which customer sessions a purchase would be made in. 
At first this seems like the golden ticket: we can reliably predict whether or not a purchase is made! The catch: the full page view information can only be collected at the end of the session, by which point we would also know whether or not a transaction was made. Seen from this perspective, collecting page views at the same time as collecting the transaction information would make it pointless to predict the transaction information using the page views information, as we would already have both. One solution was to drop page views as a feature entirely. This would safely stop the data leakage, but we would lose some critically useful information. Another solution (the one we ended up going with) was to track the page view information of all previous sessions for a given customer, and use it to inform the current session. This way, we could use the page view information, but only the information that we would have before the session even began. So we created a new column called `previous_views`, and populated it with the total count of page views made by the customer across all previous sessions. We then deleted the page views feature to stop the data leakage. + +Our rationale for this change can be boiled down to the concise heuristic: only use the information that is available to us on the first click of the session. Applying this reasoning, we performed similar data engineering on other features which we found to be proxies for the label feature. We also refined our objective in the process: for a visit to the Google Merchandise Store, what is the probability that a customer will make a purchase, and can we calculate this probability the moment the customer arrives? By clarifying the question, we both made the result more powerful and useful, and eliminated the data leakage that threatened to make the predictive power trivial. + + +### 5. 
Train-Validation-Test Split + +To create the datasets for training, testing and validation, we first had to consider what kind of data we were dealing with. The data tracks all customer sessions with the Google Merchandise Store over a year. AutoML Tables does its own training and testing, and delivers a quite nice UI to view the results in. For the training and testing dataset, then, we simply used the oversampled, balanced dataset created by the transformations described above. But we first partitioned the dataset to include the first 9 months in one table and the last 3 in another. This allowed us to train and test with an entirely different dataset than the one we used to validate. + +Moreover, we held off on oversampling for the validation dataset, so as not to bias the data that we would ultimately use to judge the success of our model. + +The decision to divide the sessions along time was made to avoid the model training on future data to predict past data. (This can also be avoided by including a datetime variable in the dataset and toggling a button in the UI.) + +### 6. Update dataset: assign a label column and enable nullable columns + +This section is fairly self-explanatory in the colab. Simply update the target column to not nullable, and update the assigned label to `totalTransactionRevenue`. + +### 7. Creating a Model, Make a Prediction + +These parts are mostly self-explanatory. +Note that we trained on the first 9 months of data and validated using the last 3. + +### 8. Evaluate your Prediction +In this section, we take our validation data prediction results and plot the precision-recall curve and the ROC curve for both the true and false classes. 
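The partition-then-oversample flow from section 5 can be sketched in a few lines of pandas. The column names (`date`, `totalTransactionRevenue`) mirror the colab, but the tiny DataFrame below is made up for illustration:

```python
import pandas as pd

# Made-up sessions: YYYYMMDD integer dates spanning the year, boolean labels.
df = pd.DataFrame({
    "date": [20160801, 20161115, 20170210, 20170430, 20170501, 20170715] * 50,
    "totalTransactionRevenue": [True, False, False, False, True, False] * 50,
})

# 1. Partition by time first, so validation stays strictly "future" data.
cutoff = 20170500
train = df[df["date"] <= cutoff]
valid = df[df["date"] > cutoff]  # deliberately left un-oversampled

# 2. Random over-sampling of the minority (True) class, training split only:
# draw True rows with replacement until both classes have equal counts.
true_rows = train[train["totalTransactionRevenue"]]
false_rows = train[~train["totalTransactionRevenue"]]
balanced = pd.concat([
    false_rows,
    true_rows.sample(len(false_rows), replace=True, random_state=0),
])

print(balanced["totalTransactionRevenue"].value_counts())
```

Oversampling only the training partition is what keeps the validation metrics honest: the class ratio the model is judged on stays the ratio seen in production.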
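The evaluation in section 8 can likewise be sketched with synthetic data. The real colab reads the batch-prediction `result.csv` and uses scikit-learn's `roc_curve`; here a numpy-only AUC (probability that a random positive outscores a random negative) keeps the sketch self-contained, with made-up labels and scores standing in for the model output:

```python
import numpy as np

# Synthetic stand-ins for the batch-prediction output: boolean labels
# (did the session convert?) and a noisy score for the True class.
rng = np.random.default_rng(0)
y = rng.random(1000) < 0.3
scores = np.where(y, 1.0, 0.0) + rng.normal(0.0, 0.4, 1000)

def roc_auc(labels, s):
    """AUC via pairwise comparison: P(random positive score > random negative)."""
    pos, neg = s[labels], s[~labels]
    return float((pos[:, None] > neg[None, :]).mean())

auc_true = roc_auc(y, scores)
# The colab's "false class" curve uses inverted labels and complementary
# scores; by symmetry its AUC matches the true-class AUC.
auc_false = roc_auc(~y, -scores)
print(f"AUC (true class): {auc_true:.3f}, AUC (false class): {auc_false:.3f}")
```

With real predictions you would substitute the label and score columns from `result.csv` for `y` and `scores`.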
\ No newline at end of file diff --git a/tables/automl/notebooks/purchase_prediction/purchase_prediction.ipynb b/tables/automl/notebooks/purchase_prediction/purchase_prediction.ipynb new file mode 100644 index 000000000000..5bb9bbdee2e4 --- /dev/null +++ b/tables/automl/notebooks/purchase_prediction/purchase_prediction.ipynb @@ -0,0 +1,909 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "colab_C4M.ipynb", + "version": "0.3.2", + "provenance": [], + "collapsed_sections": [] + }, + "kernelspec": { + "name": "python3", + "display_name": "Python 3" + }, + "accelerator": "GPU" + }, + "cells": [ + { + "metadata": { + "id": "OFJAWue1ss3C", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "#Purchase Prediction with AutoML Tables\n", + "\n", + "To use this Colab notebook, copy it to your own Google Drive and open it with [Colaboratory](https://colab.research.google.com/) (or Colab). To run a cell hold the Shift key and press the Enter key (or Return key). Colab automatically displays the return value of the last line in each cell. Refer to [this page](https://colab.sandbox.google.com/notebooks/welcome.ipynb) for more information on Colab.\n", + "\n", + "You can run a Colab notebook on a hosted runtime in the Cloud. The hosted VM times out after 90 minutes of inactivity and you will lose all the data stored in the memory including your authentication data. If your session gets disconnected (for example, because you closed your laptop) for less than the 90 minute inactivity timeout limit, press 'RECONNECT' on the top right corner of your notebook and resume the session. After Colab timeout, you'll need to\n", + "\n", + "1. Re-run the initialization and authentication.\n", + "2. Continue from where you left off. 
You may need to copy-paste the value of some variables such as the `dataset_name` from the printed output of the previous cells.\n", + "\n", + "Alternatively you can connect your Colab notebook to a [local runtime](https://research.google.com/colaboratory/local-runtimes.html). \n", + "It is recommended to run this notebook using a VM, as the computational complexity is high enough that the hosted runtime becomes inconveniently slow. The local runtime link above also contains instructions for running the notebook on a VM. When using a VM, be sure to use a TensorFlow VM image, as it comes with the Colab libraries. A standard VM will not work with the Colab libraries required for this colab.\n" + ] + }, + { + "metadata": { + "id": "dMoTkf3BVD39", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "#1. Project Set Up\n", + "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to\n", + "* Create a Google Cloud Platform (GCP) project.\n", + "* Enable billing.\n", + "* Apply to whitelist your project.\n", + "* Enable AutoML API.\n", + "* Enable AutoML Tables API.\n", + "* Create a service account, grant required permissions, and download the service account private key.\n", + "\n", + "You also need to upload your data into Google Cloud Storage (GCS) or BigQuery. For example, to use GCS as your data source:\n", + "* Create a GCS bucket.\n", + "* Upload the training and batch prediction files.\n", + "\n", + "\n", + "**Warning:** Private keys must be kept secret. If you expose your private key it is recommended to revoke it immediately from the Google Cloud Console.\n", + "Extra steps, other than permission setting:\n", + "1. Download both the client library and the service account key.\n", + "2. Zip up the client library and upload it to the VM.\n", + "3. 
Upload the service account key to the VM.\n", + "\n" + ] + }, + { + "metadata": { + "id": "KAg-2-BQ4un6", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "# 2. Initialize and authenticate\n", + " This section runs initialization and authentication. It creates an authenticated session which is required for running any of the following sections." + ] + }, + { + "metadata": { + "id": "hid7SmtS4yE_", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Install the client library\n", + "Run the following cell to install the client library using pip." + ] + }, + { + "metadata": { + "id": "yXZlxqICsMg2", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Install AutoML Tables client library { vertical-output: true }\n", + "!pip install google-cloud-automl\n", + "\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "gGuRq4DI47hj", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Authenticate using service account key\n", + "Create a service account key, and download it onto either your local machine or VM. Write in the path to the service account key. If your Service Account key file or folder is hidden, you can reveal it on a Mac by pressing the Command + Shift + . combo.\n" + ] + }, + { + "metadata": { + "id": "m3j1Kl4osNaJ", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Authenticate using service account key and create a client. 
\n", + "from google.cloud import automl_v1beta1\n", + "import os \n", + "path = \"my-project-trial5-e542e03e96c7.json\" #@param {type:'string'}\n", + "os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = path\n", + "\n", + "# Authenticate and create an AutoML client.\n", + "client = automl_v1beta1.AutoMlClient.from_service_account_file(path)\n", + "# Authenticate and create a prediction service client.\n", + "prediction_client = automl_v1beta1.PredictionServiceClient.from_service_account_file(path)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "9zuplbargStJ", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Enter your GCP project ID.\n" + ] + }, + { + "metadata": { + "id": "KIdmobtSsPj8", + "colab_type": "code", + "outputId": "14c234ca-5070-4301-a48c-c69d16ae4c31", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 34 + } + }, + "cell_type": "code", + "source": [ + "#@title GCP project ID and location\n", + "\n", + "project_id = 'my-project-trial5' #@param {type:'string'}\n", + "location = 'us-central1'\n", + "location_path = client.location_path(project_id, location)\n", + "location_path\n", + "\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "e1fYDBjDgYEB", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "# 3. 
Import, clean, transform and perform feature engineering on the training data" + ] + }, + { + "metadata": { + "id": "dYoCTvaAgZK2", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Create dataset in AutoML Tables\n" + ] + }, + { + "metadata": { + "id": "uPRPqyw2gebp", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Select a dataset display name and pass your table source information to create a new dataset.\n" + ] + }, + { + "metadata": { + "id": "Iu3KNlcwsRhN", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Create dataset { vertical-output: true, output-height: 200 }\n", + "\n", + "dataset_display_name = 'colab_trial11' #@param {type: 'string'}\n", + "\n", + "create_dataset_response = client.create_dataset(\n", + " location_path,\n", + " {'display_name': dataset_display_name, 'tables_dataset_metadata': {}})\n", + "dataset_name = create_dataset_response.name\n", + "create_dataset_response\n", + "\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "iTT5N97D0YPo", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Specify a bucket to store the training data in. Create the bucket in GCS first if it does not already exist." + ] + }, + { + "metadata": { + "id": "RQuGIbyGgud9", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Create bucket to store data in { vertical-output: true, output-height: 200 }\n", + "\n", + "bucket_name = 'trial_for_c4m' #@param {type: 'string'}\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "IQJuy1-PpF3b", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Import Dependencies\n" + ] + }, + { + "metadata": { + "id": "zzCeDmnnQRNy", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "!sudo pip install google-cloud-bigquery google-cloud-storage pandas pandas-gbq gcsfs oauth2client\n", + "\n", + "import datetime\n", + "import pandas as 
pd\n", + "\n", + "import gcsfs\n", + "from google.cloud import bigquery\n", + "from google.cloud import storage" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "UR5n1crIpQuX", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Transformation and Feature Engineering Functions\n" + ] + }, + { + "metadata": { + "id": "RODZJaq4o9b5", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "def balanceTable(table):\n", + "\t#class count\n", + " count_class_false, count_class_true = table.totalTransactionRevenue.value_counts()\n", + "\n", + "\t#divide by class\n", + " table_class_false = table[table[\"totalTransactionRevenue\"] == False]\n", + " table_class_true = table[table[\"totalTransactionRevenue\"] == True]\n", + "\n", + "\t#random over-sampling\n", + " table_class_true_over = table_class_true.sample(count_class_false, replace = True)\n", + " table_test_over = pd.concat([table_class_false, table_class_true_over])\n", + " return table_test_over\n", + "\n", + "\n", + "def partitionTable(table, dt=20170500):\n", + " #the automl tables model could be training on future data and implicitly learning about past data in the testing\n", + " #dataset, this would cause data leakage. 
To prevent this, we are training only with the first 9 months of data (table1)\n", + " #and doing validation with the last three months of data (table2).\n", + " table1 = table[table[\"date\"] <= dt]\n", + " table2 = table[table[\"date\"] > dt]\n", + " return table1, table2\n", + "\n", + "def N_updatePrevCount(table, new_column, old_column):\n", + " table = table.fillna(0)\n", + " table[new_column] = 1\n", + " #sort so the cumulative sum runs in chronological order per visitor\n", + " table = table.sort_values(by=['fullVisitorId','date'])\n", + " table[new_column] = table.groupby(['fullVisitorId'])[old_column].apply(lambda x: x.cumsum())\n", + " table.drop([old_column], axis = 1, inplace = True)\n", + " return table\n", + "\n", + "\n", + "def N_updateDate(table):\n", + " table['weekday'] = 1\n", + " table['date'] = pd.to_datetime(table['date'].astype(str), format = '%Y%m%d')\n", + " table['weekday'] = table['date'].dt.dayofweek\n", + " return table\n", + "\n", + "\n", + "def change_transaction_values(table):\n", + " table['totalTransactionRevenue'] = table['totalTransactionRevenue'].fillna(0)\n", + " table['totalTransactionRevenue'] = table['totalTransactionRevenue'].apply(lambda x: x!=0)\n", + " return table\n", + "\n", + "def saveTable(table, csv_file_name, bucket_name):\n", + " table.to_csv(csv_file_name, index = False)\n", + " storage_client = storage.Client()\n", + " bucket = storage_client.get_bucket(bucket_name)\n", + " blob = bucket.blob(csv_file_name)\n", + " blob.upload_from_filename(filename = csv_file_name)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "2eGAIUmRqjqX", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "###Import training data" + ] + }, + { + "metadata": { + "id": "XTmXPMUsTgEs", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "You also have the option of just downloading the file, FULL.csv, [here](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/trial_for_c4m/FULL.csv), instead of running the code below. 
Just be sure to move the file into the google cloud storage bucket you specified above." + ] + }, + { + "metadata": { + "id": "Bl9-DSjIqj7c", + "colab_type": "code", + "cellView": "both", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Input name of file to save data to { vertical-output: true, output-height: 200 }\n", + "sqll = '''\n", + "SELECT\n", + "date, device, geoNetwork, totals, trafficSource, fullVisitorId \n", + "FROM `bigquery-public-data.google_analytics_sample.ga_sessions_*`\n", + "WHERE\n", + "_TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 366 DAY))\n", + "AND\n", + "FORMAT_DATE('%Y%m%d',DATE_SUB('2017-08-01', INTERVAL 1 DAY))\n", + "'''\n", + "df = pd.read_gbq(sqll, project_id = project_id, dialect='standard')\n", + "print(df.iloc[:3])\n", + "path_to_data_pre_transformation = \"FULL.csv\" #@param {type: 'string'}\n", + "saveTable(df, path_to_data_pre_transformation, bucket_name)\n", + "\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "V5WK71tiq-2b", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "###Unnest the Data" + ] + }, + { + "metadata": { + "id": "RFpgLfeNqUBk", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#some transformations on the basic dataset\n", + "#@title Input the name of file to hold the unnested data to { vertical-output: true, output-height: 200 }\n", + "unnested_file_name = \"FULL_unnested.csv\" #@param {type: 'string'}\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "2dyJlNAVqXUn", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "You also have the option of just downloading the file, FULL_unnested.csv, [here](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/trial_for_c4m/FULL_unnested.csv), instead of running the code below. Just be sure to move the file into the google cloud storage bucket you specified above." 
+ ] + }, + { + "metadata": { + "id": "tLPHeF2Y2l5l", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "\n", + "table = pd.read_csv(\"gs://\"+bucket_name+\"/\"+path_to_data_pre_transformation)\n", + "\n", + "column_names = ['device', 'geoNetwork','totals', 'trafficSource']\n", + "\n", + "for name in column_names:\n", + " print(name)\n", + " table[name] = table[name].apply(lambda i: dict(eval(i)))\n", + " temp = table[name].apply(pd.Series)\n", + " table = pd.concat([table, temp], axis=1).drop(name, axis=1)\n", + "\n", + "#drop the nested adwordsClickInfo column left over from unnesting\n", + "table.drop(['adwordsClickInfo'], axis = 1, inplace = True)\n", + "saveTable(table, unnested_file_name, bucket_name)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "9_WC-AJLsdqo", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "###Run the Transformations" + ] + }, + { + "metadata": { + "id": "YWQ4462vnpOg", + "colab_type": "code", + "outputId": "5ca7e95a-e0f2-48c2-9b59-8f043d233bd2", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 272 + } + }, + "cell_type": "code", + "source": [ + "table = pd.read_csv(\"gs://\"+bucket_name+\"/\"+unnested_file_name)\n", + "\n", + "consts = ['transactionRevenue', 'transactions', 'adContent', 'browserSize', 'campaignCode', \n", + "'cityId', 'flashVersion', 'javaEnabled', 'language', 'latitude', 'longitude', 'mobileDeviceBranding', \n", + "'mobileDeviceInfo', 'mobileDeviceMarketingName','mobileDeviceModel','mobileInputSelector', 'networkLocation', \n", + "'operatingSystemVersion', 'screenColors', 'screenResolution', 'screenviews', 'sessionQualityDim', 'timeOnScreen',\n", + "'visits', 'uniqueScreenviews', 'browserVersion','referralPath','fullVisitorId', 'date']\n", + "\n", + "table = N_updatePrevCount(table, 'previous_views', 'pageviews')\n", + "table = N_updatePrevCount(table, 'previous_hits', 'hits')\n", + "table = N_updatePrevCount(table, 'previous_timeOnSite', 'timeOnSite')\n", + "table = 
N_updatePrevCount(table, 'previous_Bounces', 'bounces')\n", + "\n", + "table = change_transaction_values(table)\n", + "\n", + "table1, table2 = partitionTable(table)\n", + "table1 = N_updateDate(table1)\n", + "table2 = N_updateDate(table2)\n", + "#validation_unnested_FULL.csv = the last 3 months of data\n", + "\n", + "#drop the constant and leaky columns from both partitions\n", + "table1.drop(consts, axis = 1, inplace = True)\n", + "table2.drop(consts, axis = 1, inplace = True)\n", + "\n", + "saveTable(table2,'validation_unnested_FULL.csv', bucket_name)\n", + "\n", + "table1 = balanceTable(table1)\n", + "\n", + "#training_unnested_FULL.csv = the first 9 months of data\n", + "saveTable(table1, 'training_unnested_balanced_FULL.csv', bucket_name)\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "LqmARBnRHWh8", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title ... take the data source from GCS { vertical-output: true } \n", + "\n", + "dataset_gcs_input_uris = ['gs://trial_for_c4m/training_unnested_balanced_FULL.csv',] #@param\n", + "# Define input configuration.\n", + "input_config = {\n", + " 'gcs_source': {\n", + " 'input_uris': dataset_gcs_input_uris\n", + " }\n", + "}" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "SfXjtAwDsYlV", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + " #@title Import data { vertical-output: true }\n", + "\n", + "import_data_response = client.import_data(dataset_name, input_config)\n", + "print('Dataset import operation: {}'.format(import_data_response.operation))\n", + "# Wait until import is done.\n", + "import_data_result = import_data_response.result()\n", + "import_data_result" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "W3SiSLS4tml9", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "# 4. 
Update dataset: assign a label column and enable nullable columns" + ] + }, + { + "metadata": { + "id": "jVo8Z8PGtpB7", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "AutoML Tables automatically detects your data column type. Depending on the type of your label column, AutoML Tables chooses to run a classification or regression model. If your label column contains only numerical values, but they represent categories, change your label column type to categorical by updating your schema." + ] + }, + { + "metadata": { + "id": "dMdOoFsXxyxj", + "colab_type": "code", + "outputId": "e6fab957-2316-48c0-be66-1bff9dc5c23c", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 255 + } + }, + "cell_type": "code", + "source": [ + "#@title Table schema { vertical-output: true }\n", + "\n", + "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# List table specs\n", + "list_table_specs_response = client.list_table_specs(dataset_name)\n", + "table_specs = [s for s in list_table_specs_response]\n", + "# List column specs\n", + "table_spec_name = table_specs[0].name\n", + "list_column_specs_response = client.list_column_specs(table_spec_name)\n", + "column_specs = {s.display_name: s for s in list_column_specs_response}\n", + "# Table schema pie chart.\n", + "type_counts = {}\n", + "for column_spec in column_specs.values():\n", + " type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)\n", + " type_counts[type_name] = type_counts.get(type_name, 0) + 1\n", + "\n", + "plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')\n", + "plt.axis('equal')\n", + "plt.show()\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "AfT4upKysamH", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Update a column: set to not nullable { vertical-output: true }\n", + "\n", + 
"update_column_spec_dict = {\n", + " 'name': column_specs['totalTransactionRevenue'].name,\n", + " 'data_type': {\n", + " 'type_code': 'CATEGORY',\n", + " 'nullable': False\n", + " }\n", + "}\n", + "update_column_response = client.update_column_spec(update_column_spec_dict)\n", + "update_column_response" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "3O9cFko3t3ai", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "**Tip:** You can use `'type_code': 'CATEGORY'` in the preceding `update_column_spec_dict` to convert the column data type from `FLOAT64` to `CATEGORY`.\n", + "\n", + "\n" + ] + }, + { + "metadata": { + "id": "rR2RaPP7t6y8", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Update dataset: assign a label" + ] + }, + { + "metadata": { + "id": "aTt2mIzbsduV", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Update dataset { vertical-output: true }\n", + "\n", + "label_column_name = 'totalTransactionRevenue' #@param {type: 'string'}\n", + "label_column_spec = column_specs[label_column_name]\n", + "label_column_id = label_column_spec.name.rsplit('/', 1)[-1]\n", + "print('Label column ID: {}'.format(label_column_id))\n", + "# Define the values of the fields to be updated.\n", + "update_dataset_dict = {\n", + " 'name': dataset_name,\n", + " 'tables_dataset_metadata': {\n", + " 'target_column_spec_id': label_column_id\n", + " }\n", + "}\n", + "update_dataset_response = client.update_dataset(update_dataset_dict)\n", + "update_dataset_response" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "xajewSavt9K1", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "#5. Creating a model" + ] + }, + { + "metadata": { + "id": "dA-FE6iWt-A_", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Train a model\n", + "Training the model may take one hour or more. 
The following cell keeps running until the training is done. If your Colab times out, use `client.list_models(location_path)` to check whether your model has been created. Then use the model name to continue to the next steps. Run the following command to retrieve your model. Replace `model_name` with its actual value.\n", + "\n", + " model = client.get_model(model_name)\n", + " " + ] + }, + { + "metadata": { + "id": "Kp0gGkp8H3zj", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Create model { vertical-output: true }\n", + "#this will create a model that can be accessed through the AutoML Tables colab\n", + "model_display_name = 'trial_10' #@param {type:'string'}\n", + "\n", + "model_dict = {\n", + " 'display_name': model_display_name,\n", + " 'dataset_id': dataset_name.rsplit('/', 1)[-1],\n", + " 'tables_model_metadata': {'train_budget_milli_node_hours': 1000}\n", + "}\n", + "create_model_response = client.create_model(location_path, model_dict)\n", + "print('Create model operation: {}'.format(create_model_response.operation))\n", + "# Wait until model training is done.\n", + "create_model_result = create_model_response.result()\n", + "model_name = create_model_result.name\n", + "print(model_name)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "tCIk1e4UuDxZ", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "# 6. Make a prediction" + ] + }, + { + "metadata": { + "id": "H7Fi5f9zuG5f", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "There are two different prediction modes: online and batch. The following cell shows you how to make a batch prediction." 
+ ] + }, + { + "metadata": { + "id": "AZ_CPff77m4e", + "colab_type": "code", + "cellView": "both", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Start batch prediction { vertical-output: true, output-height: 200 }\n", + "print(client.list_models(location_path))\n", + "\n", + "batch_predict_gcs_input_uris = ['gs://trial_for_c4m/validation_unnested_FULL.csv',] #@param\n", + "batch_predict_gcs_output_uri_prefix = 'gs://trial_for_c4m' #@param {type:'string'}\n", + "# Define input source.\n", + "batch_prediction_input_source = {\n", + " 'gcs_source': {\n", + " 'input_uris': batch_predict_gcs_input_uris\n", + " }\n", + "}\n", + "# Define output target.\n", + "batch_prediction_output_target = {\n", + " 'gcs_destination': {\n", + " 'output_uri_prefix': batch_predict_gcs_output_uri_prefix\n", + " }\n", + "}\n", + "batch_predict_response = prediction_client.batch_predict(\n", + " model_name, batch_prediction_input_source, batch_prediction_output_target)\n", + "print('Batch prediction operation: {}'.format(batch_predict_response.operation))\n", + "# Wait until batch prediction is done.\n", + "batch_predict_result = batch_predict_response.result()\n", + "batch_predict_response.metadata\n", + "\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "utGPmXI-uKNr", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "#7. 
Evaluate your prediction" + ] + }, + { + "metadata": { + "id": "GsOdhJeauTC3", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "The following cell creates a Precision Recall Curve and a ROC curve for both the true and false classifications.\n", + "Fill in the batch_predict_results_location with the location of the result.csv file created in the previous \"Make a prediction\" step.\n" + ] + }, + { + "metadata": { + "id": "orejkh0CH4mu", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "\n", + "import numpy as np\n", + "import pandas as pd\n", + "from sklearn import metrics\n", + "import matplotlib.pyplot as plt\n", + "\n", + "def invert(x):\n", + "  return 1-x\n", + "\n", + "def switch_label(x):\n", + "  return(not x)\n", + "batch_predict_results_location = 'gs://trial_for_c4m/prediction-trial_10-2019-03-23T00:22:56.802Z' #@param {type:'string'}\n", + "\n", + "table = pd.read_csv(batch_predict_results_location +'/result.csv')\n", + "y = table[\"totalTransactionRevenue\"]\n", + "scores = table[\"totalTransactionRevenue_1.0_score\"]\n", + "scores_invert = table['totalTransactionRevenue_0.0_score']\n", + "\n", + "#code for ROC curve, for true values\n", + "fpr, tpr, thresholds = metrics.roc_curve(y, scores)\n", + "roc_auc = metrics.auc(fpr, tpr)\n", + "\n", + "plt.figure()\n", + "lw = 2\n", + "plt.plot(fpr, tpr, color='darkorange',\n", + "         lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)\n", + "plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n", + "plt.xlim([0.0, 1.0])\n", + "plt.ylim([0.0, 1.05])\n", + "plt.xlabel('False Positive Rate')\n", + "plt.ylabel('True Positive Rate')\n", + "plt.title('Receiver operating characteristic for True')\n", + "plt.legend(loc=\"lower right\")\n", + "plt.show()\n", + "\n", + "\n", + "#code for ROC curve, for false values\n", + "plt.figure()\n", + "lw = 2\n", + "label_invert = y.apply(switch_label)\n", + "fpr, tpr, thresholds = metrics.roc_curve(label_invert, scores_invert)\n", + "# Recompute the AUC for the inverted labels before plotting.\n", + "roc_auc = metrics.auc(fpr, tpr)\n", + "plt.plot(fpr,
tpr, color='darkorange',\n", + " lw=lw, label='ROC curve (area = %0.2f)' % roc_auc)\n", + "plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n", + "plt.xlim([0.0, 1.0])\n", + "plt.ylim([0.0, 1.05])\n", + "plt.xlabel('False Positive Rate')\n", + "plt.ylabel('True Positive Rate')\n", + "plt.title('Receiver operating characteristic for False')\n", + "plt.legend(loc=\"lower right\")\n", + "plt.show()\n", + "\n", + "\n", + "#code for PR curve, for true values\n", + "\n", + "precision, recall, thresholds = metrics.precision_recall_curve(y, scores)\n", + "\n", + "\n", + "plt.figure()\n", + "lw = 2\n", + "plt.plot( recall, precision, color='darkorange',\n", + " lw=lw, label='Precision recall curve for True')\n", + "plt.xlim([0.0, 1.0])\n", + "plt.ylim([0.0, 1.05])\n", + "plt.xlabel('Recall')\n", + "plt.ylabel('Precision')\n", + "plt.title('Precision Recall Curve for True')\n", + "plt.legend(loc=\"lower right\")\n", + "plt.show()\n", + "\n", + "#code for PR curve, for false values\n", + "\n", + "precision, recall, thresholds = metrics.precision_recall_curve(label_invert, scores_invert)\n", + "print(precision.shape)\n", + "print(recall.shape)\n", + "\n", + "plt.figure()\n", + "lw = 2\n", + "plt.plot( recall, precision, color='darkorange',\n", + " label='Precision recall curve for False')\n", + "plt.xlim([0.0, 1.1])\n", + "plt.ylim([0.0, 1.1])\n", + "plt.xlabel('Recall')\n", + "plt.ylabel('Precision')\n", + "plt.title('Precision Recall Curve for False')\n", + "plt.legend(loc=\"lower right\")\n", + "plt.show()\n", + "\n" + ], + "execution_count": 0, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/tables/automl/notebooks/result_slicing/README.md b/tables/automl/notebooks/result_slicing/README.md new file mode 100644 index 000000000000..e96ef39708d3 --- /dev/null +++ b/tables/automl/notebooks/result_slicing/README.md @@ -0,0 +1,55 @@ +Copyright 2019 Google LLC + +Licensed under the Apache License, Version 2.0 (the "License");you may not use this 
file except in compliance with the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +### Summary: Use open source tools to slice and analyze a classification model built in AutoML Tables + + +# Result Slicing with a model built in AutoML Tables + + +AutoML Tables enables you to build machine learning models based on tables of your own data and host them on Google Cloud for scalability. This solution demonstrates how you can use open source tools to analyze a classification model's output by slicing the results to understand performance discrepancies. This should serve as an introduction to a couple of tools that make in-depth model analysis simpler for AutoML Tables users. + +Our exercise will + +1. Preprocess the output data +2. Examine the dataset in the What-If Tool +3. Use TFMA to slice the data for analysis + + +## Problem Description + +Top-level metrics don't always tell the whole story of how a model is performing. Sometimes, specific characteristics of the data may make certain subclasses of the dataset harder to predict accurately. This notebook will give some examples of how to use open source tools to slice data results from an AutoML Tables classification model, and discover potential performance discrepancies. + + +## Data Preprocessing + +### Prerequisite + +To perform this exercise, you need to have a GCP (Google Cloud Platform) account. If you don't have a GCP account, see [Create a GCP project](https://cloud.google.com/resource-manager/docs/creating-managing-projects). 
If you'd like to try analyzing your own model, you also need to have already built a model in AutoML Tables and exported its results to BigQuery. + +### Data + +The data we use in this exercise is the public [Default of Credit Card Clients](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) dataset. This dataset was collected to help compare different methods of predicting credit card default. Using this colab to analyze your own dataset may require a little adaptation, but should be possible. The data was already used in AutoML Tables to train a binary classifier which attempts to predict whether or not the customer will default in the following month. + +If you'd like to try using your own data in this notebook, you'll need to [train an AutoML Tables model](https://cloud.google.com/automl-tables/docs/beginners-guide) and export the results to BigQuery using the link on the Evaluate tab. Once the BigQuery table has finished exporting, you can copy the Table ID from the GCP console into the notebook's "table_name" parameter to import it. There are several other parameters you'll need to update, such as sampling rates and field names. + +### Format for Analysis + +Many of the tools we use to analyze models and data expect to find their inputs in the [tensorflow.Example](https://www.tensorflow.org/tutorials/load_data/tf_records) format. In the Colab, we'll show code to preprocess our data into tf.Examples, and also extract the predicted class from our classifier, which is binary. + + +## What-If Tool + +The [What-If Tool](https://pair-code.github.io/what-if-tool/) is a powerful visual interface to explore data, models, and predictions. Because we're reading our results from BigQuery, we aren't able to use the features of the What-If Tool that query the model directly. But we can still use many of its other features to explore our data distribution in depth.
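The class-extraction step described under "Format for Analysis" above boils down to picking the (value, score) pair with the highest confidence from the exported prediction columns. The following is a minimal, self-contained sketch of that step; the helper name and the sample scores are illustrative:

```python
# Illustrative sketch: pick the top predicted class from (value, score) pairs,
# as done when preprocessing AutoML Tables results into tf.Examples.
def extract_top_class(prediction_tuples):
    """Return the (class, score) pair with the highest confidence score."""
    best_class, best_score = None, 0.0
    for value, score in prediction_tuples:
        if score > best_score:
            best_class, best_score = value, score
    return best_class, best_score

# Example with made-up binary classifier scores:
pairs = [(0, 0.31), (1, 0.69)]
print(extract_top_class(pairs))  # -> (1, 0.69)
```

The predicted class and its score are then attached to each tf.Example alongside the original features.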
+ +## Tensorflow Model Analysis + +This section of the tutorial will use [TFMA](https://github.com/tensorflow/model-analysis) model agnostic analysis capabilities. + +TFMA generates sliced metrics graphs and confusion matrices. We can use these to dig deeper into the question of how well this model performs on different classes of inputs, using the given dataset as a motivating example. + diff --git a/tables/automl/notebooks/result_slicing/slicing_eval_results.ipynb b/tables/automl/notebooks/result_slicing/slicing_eval_results.ipynb new file mode 100644 index 000000000000..bac645db406c --- /dev/null +++ b/tables/automl/notebooks/result_slicing/slicing_eval_results.ipynb @@ -0,0 +1,373 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "slicing_eval_results.ipynb", + "version": "0.3.2", + "provenance": [ + { + "file_id": "1goi268plF-1AJ77xjdMwIpapBr1ssb-q", + "timestamp": 1551899111384 + }, + { + "file_id": "/piper/depot/google3/cloud/ml/autoflow/colab/slicing_eval_results.ipynb?workspaceId=simonewu:autoflow-1::citc", + "timestamp": 1547767618990 + }, + { + "file_id": "1fjkKgZq5iMevPnfiIpSHSiSiw5XimZ1C", + "timestamp": 1547596565571 + } + ], + "collapsed_sections": [], + "last_runtime": { + "build_target": "//learning/fairness/colabs:ml_fairness_notebook", + "kind": "shared" + } + }, + "kernelspec": { + "display_name": "Python 2", + "name": "python2" + } + }, + "cells": [ + { + "metadata": { + "colab_type": "text", + "id": "jt_Hqb95fRz8" + }, + "cell_type": "markdown", + "source": [ + "# Slicing AutoML Tables Evaluation Results with BigQuery\n", + "\n", + "This colab assumes that you've created a dataset with AutoML Tables, and used that dataset to train a classification model. Once the model is done training, you also need to export the results table by using the following instructions. 
You'll see more detailed setup instructions below.\n", + "\n", + "This colab will walk you through the process of using BigQuery to visualize data slices, showing you one simple way to evaluate your model for bias.\n", + "\n", + "## Setup\n", + "\n", + "To use this Colab, copy it to your own Google Drive or open it in the Playground mode. Follow the instructions in the [AutoML Tables Product docs](https://cloud.google.com/automl-tables/docs/) to create a GCP project, enable the API, create and download a service account private key, and set up the required permissions. You'll also need to use the AutoML Tables frontend or service to create a model and export its evaluation results to BigQuery. You should find a link on the Evaluate tab to view your evaluation results in BigQuery once you've finished training your model. Then navigate to BigQuery in your GCP console and you'll see your new results table in the list of tables to which your project has access. \n", + "\n", + "For demo purposes, we'll be using the [Default of Credit Card Clients](https://archive.ics.uci.edu/ml/datasets/default+of+credit+card+clients) dataset for analysis. This dataset was collected to help compare different methods of predicting credit card default. Using this colab to analyze your own dataset may require a little adaptation.\n", + "\n", + "The code below samples the data if you want it to; alternatively, you can set sample_count to be as large as or larger than your dataset to use the whole thing for analysis.
\n", + "\n", + "Note also that although the data we use in this demo is public, you'll need to enter your own Google Cloud project ID in the parameter below to authenticate to it.\n", + "\n" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "m2oL8tO-f9rK", + "colab": {} + }, + "cell_type": "code", + "source": [ + "from __future__ import absolute_import\n", + "from __future__ import division\n", + "from __future__ import print_function\n", + "\n", + "from google.colab import auth\n", + "import numpy as np\n", + "import os\n", + "import pandas as pd\n", + "import sys\n", + "sys.path.append('./python')\n", + "from sklearn.metrics import confusion_matrix\n", + "from sklearn.metrics import accuracy_score, roc_curve, roc_auc_score\n", + "from sklearn.metrics import precision_recall_curve\n", + "# For facets\n", + "from IPython.core.display import display, HTML\n", + "import base64\n", + "!pip install --upgrade tf-nightly witwidget\n", + "import witwidget.notebook.visualization as visualization\n", + "!pip install apache-beam\n", + "!pip install --upgrade tensorflow_model_analysis\n", + "!pip install --upgrade tensorflow\n", + "\n", + "import tensorflow as tf\n", + "import tensorflow_model_analysis as tfma\n", + "print('TFMA version: {}'.format(tfma.version.VERSION_STRING))\n", + "\n", + "# https://cloud.google.com/resource-manager/docs/creating-managing-projects\n", + "project_id = '[YOUR PROJECT ID HERE]' #@param {type:\"string\"}\n", + "table_name = 'bigquery-public-data:ml_datasets.credit_card_default' #@param {type:\"string\"}\n", + "os.environ[\"GOOGLE_CLOUD_PROJECT\"]=project_id\n", + "sample_count = 3000 #@param\n", + "row_count = pd.io.gbq.read_gbq('''\n", + " SELECT \n", + " COUNT(*) as total\n", + " FROM [%s]''' % (table_name), project_id=project_id, verbose=False).total[0]\n", + "df = pd.io.gbq.read_gbq('''\n", + " SELECT\n", + " *\n", + " FROM\n", + " [%s]\n", + " WHERE RAND() < %d/%d\n", + "''' % (table_name, sample_count, row_count), 
project_id=project_id, verbose=False)\n", + "print('Full dataset has %d rows' % row_count)\n", + "df.describe()" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "608Fe8PRtj5q" + }, + "cell_type": "markdown", + "source": [ + "##Data Preprocessing\n", + "\n", + "Many of the tools we use to analyze models and data expect to find their inputs in the [tensorflow.Example](https://www.tensorflow.org/tutorials/load_data/tf_records) format. Here, we'll preprocess our data into tf.Examples, and also extract the predicted class from our classifier, which is binary." + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "lqZeO9aGtn2s", + "colab": {} + }, + "cell_type": "code", + "source": [ + "unique_id_field = 'ID' #@param\n", + "prediction_field_score = 'predicted_default_payment_next_month_tables_score' #@param\n", + "prediction_field_value = 'predicted_default_payment_next_month_tables_value' #@param\n", + "\n", + "\n", + "def extract_top_class(prediction_tuples):\n", + " # values from Tables show up as a CSV of individual json (prediction, confidence) objects.\n", + " best_score = 0\n", + " best_class = u''\n", + " for val, sco in prediction_tuples:\n", + " if sco > best_score:\n", + " best_score = sco\n", + " best_class = val\n", + " return (best_class, best_score)\n", + "\n", + "def df_to_examples(df, columns=None):\n", + " examples = []\n", + " if columns == None:\n", + " columns = df.columns.values.tolist()\n", + " for id in df[unique_id_field].unique():\n", + " example = tf.train.Example()\n", + " prediction_tuples = zip(df.loc[df[unique_id_field] == id][prediction_field_value], df.loc[df[unique_id_field] == id][prediction_field_score])\n", + " row = df.loc[df[unique_id_field] == id].iloc[0]\n", + " for col in columns:\n", + " if col == prediction_field_score or col == prediction_field_value:\n", + " # Deal with prediction fields separately\n", + " continue\n", + " elif df[col].dtype is 
np.dtype(np.int64):\n", + " example.features.feature[col].int64_list.value.append(int(row[col]))\n", + " elif df[col].dtype is np.dtype(np.float64):\n", + " example.features.feature[col].float_list.value.append(row[col])\n", + " elif row[col] is None:\n", + " continue\n", + " elif row[col] == row[col]:\n", + " example.features.feature[col].bytes_list.value.append(row[col].encode('utf-8'))\n", + " cla, sco = extract_top_class(prediction_tuples)\n", + " example.features.feature['predicted_class'].int64_list.value.append(cla)\n", + " example.features.feature['predicted_class_score'].float_list.value.append(sco)\n", + " examples.append(example)\n", + " return examples\n", + "\n", + "# Fix up some types so analysis is consistent. This code is specific to the dataset.\n", + "df = df.astype({\"PAY_5\": float, \"PAY_6\": float})\n", + "\n", + "# Converts a dataframe column into a column of 0's and 1's based on the provided test.\n", + "def make_label_column_numeric(df, label_column, test):\n", + " df[label_column] = np.where(test(df[label_column]), 1, 0)\n", + " \n", + "# Convert label types to numeric. This code is specific to the dataset.\n", + "make_label_column_numeric(df, 'predicted_default_payment_next_month_tables_value', lambda val: val == '1')\n", + "make_label_column_numeric(df, 'default_payment_next_month', lambda val: val == '1')\n", + "\n", + "examples = df_to_examples(df)\n", + "print(\"Preprocessing complete!\")" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "XwnOX_orVZEs" + }, + "cell_type": "markdown", + "source": [ + "## What-If Tool\n", + "\n", + "First, we'll explore the data and predictions using the [What-If Tool](https://pair-code.github.io/what-if-tool/). The What-If tool is a powerful visual interface to explore data, models, and predictions. Because we're reading our results from BigQuery, we aren't able to use the features of the What-If Tool that query the model directly. 
But we can still learn a lot about this dataset from the exploration that the What-If tool enables.\n", + "\n", + "Imagine that you're curious to discover whether there's a discrepancy in the predictive power of your model depending on the marital status of the person whose credit history is being analyzed. You can use the What-If Tool to see at a glance the relative sizes of the data samples for each class. In this dataset, the marital statuses are encoded as 1 = married; 2 = single; 3 = divorced; 0 = other. You can see using the What-If Tool that there are very few samples for classes other than married or single, which might indicate that performance could be compromised. If this lack of representation concerns you, you could consider collecting more data for underrepresented classes, downsampling overrepresented classes, or upweighting underrepresented data types as you train, depending on your use case and data availability.\n" + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "tjWxGOBkVXQ6", + "colab": {} + }, + "cell_type": "code", + "source": [ + "WitWidget = visualization.WitWidget\n", + "WitConfigBuilder = visualization.WitConfigBuilder\n", + "\n", + "num_datapoints = 2965 #@param {type: \"number\"}\n", + "tool_height_in_px = 700 #@param {type: \"number\"}\n", + "\n", + "# Setup the tool with the test examples and the trained classifier\n", + "config_builder = WitConfigBuilder(examples[:num_datapoints])\n", + "# Need to call this so we have inference_address and model_name initialized\n", + "config_builder = config_builder.set_estimator_and_feature_spec('', '')\n", + "config_builder = config_builder.set_compare_estimator_and_feature_spec('', '')\n", + "wv = WitWidget(config_builder, height=tool_height_in_px)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "YHydLAY991Du" + }, + "cell_type": "markdown", + "source": [ + "## Tensorflow Model Analysis\n", + "\n", + "Then, let's examine
some sliced metrics. This section of the tutorial will use [TFMA](https://github.com/tensorflow/model-analysis) model agnostic analysis capabilities. \n", + "\n", + "TFMA generates sliced metrics graphs and confusion matrices. We can use these to dig deeper into the question of how well this model performs on different classes of marital status. The model was built to optimize for AUC ROC metric, and it does fairly well for all of the classes, though there is a small performance gap for the \"divorced\" category. But when we look at the AUC-PR metric slices, we can see that the \"divorced\" and \"other\" classes are very poorly served by the model compared to the more common classes. AUC-PR is the metric that measures how well the tradeoff between precision and recall is being made in the model's predictions. If we're concerned about this gap, we could consider retraining to use AUC-PR as the optimization metric and see whether that model does a better job making equitable predictions. " + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "ZfU11b0797le", + "colab": {} + }, + "cell_type": "code", + "source": [ + "import apache_beam as beam\n", + "import tempfile\n", + "\n", + "from collections import OrderedDict\n", + "from google.protobuf import text_format\n", + "from tensorflow_model_analysis import post_export_metrics\n", + "from tensorflow_model_analysis import types\n", + "from tensorflow_model_analysis.api import model_eval_lib\n", + "from tensorflow_model_analysis.evaluators import aggregate\n", + "from tensorflow_model_analysis.extractors import slice_key_extractor\n", + "from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_evaluate_graph\n", + "from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_extractor\n", + "from tensorflow_model_analysis.model_agnostic_eval import model_agnostic_predict\n", + "from tensorflow_model_analysis.proto import metrics_for_slice_pb2\n", + "from tensorflow_model_analysis.slicer 
import slicer\n", + "from tensorflow_model_analysis.view.widget_view import render_slicing_metrics\n", + "\n", + "# To set up model agnostic extraction, need to specify features and labels of\n", + "# interest in a feature map.\n", + "feature_map = OrderedDict();\n", + "\n", + "for i, column in enumerate(df.columns):\n", + " type = df.dtypes[i]\n", + " if column == prediction_field_score or column == prediction_field_value:\n", + " continue\n", + " elif (type == np.dtype(np.float64)):\n", + " feature_map[column] = tf.FixedLenFeature([], tf.float32)\n", + " elif (type == np.dtype(np.object)):\n", + " feature_map[column] = tf.FixedLenFeature([], tf.string)\n", + " elif (type == np.dtype(np.int64)):\n", + " feature_map[column] = tf.FixedLenFeature([], tf.int64)\n", + " elif (type == np.dtype(np.bool)):\n", + " feature_map[column] = tf.FixedLenFeature([], tf.bool)\n", + " elif (type == np.dtype(np.datetime64)):\n", + " feature_map[column] = tf.FixedLenFeature([], tf.timestamp)\n", + "\n", + "feature_map['predicted_class'] = tf.FixedLenFeature([], tf.int64)\n", + "feature_map['predicted_class_score'] = tf.FixedLenFeature([], tf.float32)\n", + "\n", + "serialized_examples = [e.SerializeToString() for e in examples]\n", + "\n", + "BASE_DIR = tempfile.gettempdir()\n", + "OUTPUT_DIR = os.path.join(BASE_DIR, 'output')\n", + "\n", + "slice_column = 'MARRIAGE' #@param\n", + "predicted_labels = 'predicted_class' #@param\n", + "actual_labels = 'default_payment_next_month' #@param\n", + "predicted_class_score = 'predicted_class_score' #@param\n", + "\n", + "with beam.Pipeline() as pipeline:\n", + " model_agnostic_config = model_agnostic_predict.ModelAgnosticConfig(\n", + " label_keys=[actual_labels],\n", + " prediction_keys=[predicted_labels],\n", + " feature_spec=feature_map)\n", + " \n", + " extractors = [\n", + " model_agnostic_extractor.ModelAgnosticExtractor(\n", + " model_agnostic_config=model_agnostic_config,\n", + " desired_batch_size=3),\n", + " 
slice_key_extractor.SliceKeyExtractor([\n", + " slicer.SingleSliceSpec(columns=[slice_column])\n", + " ])\n", + " ]\n", + "\n", + " auc_roc_callback = post_export_metrics.auc(\n", + " labels_key=actual_labels,\n", + " target_prediction_keys=[predicted_labels])\n", + " \n", + " auc_pr_callback = post_export_metrics.auc(\n", + " curve='PR',\n", + " labels_key=actual_labels,\n", + " target_prediction_keys=[predicted_labels])\n", + " \n", + " confusion_matrix_callback = post_export_metrics.confusion_matrix_at_thresholds(\n", + " labels_key=actual_labels,\n", + " target_prediction_keys=[predicted_labels],\n", + " example_weight_key=predicted_class_score,\n", + " thresholds=[0.0, 0.5, 0.8, 1.0])\n", + "\n", + " # Create our model agnostic aggregator.\n", + " eval_shared_model = types.EvalSharedModel(\n", + " construct_fn=model_agnostic_evaluate_graph.make_construct_fn(\n", + " add_metrics_callbacks=[confusion_matrix_callback,\n", + " auc_roc_callback,\n", + " auc_pr_callback,\n", + " post_export_metrics.example_count()],\n", + " fpl_feed_config=model_agnostic_extractor\n", + " .ModelAgnosticGetFPLFeedConfig(model_agnostic_config)))\n", + "\n", + " # Run Model Agnostic Eval.\n", + " _ = (\n", + " pipeline\n", + " | beam.Create(serialized_examples)\n", + " | 'ExtractEvaluateAndWriteResults' >>\n", + " model_eval_lib.ExtractEvaluateAndWriteResults(\n", + " eval_shared_model=eval_shared_model,\n", + " output_path=OUTPUT_DIR,\n", + " extractors=extractors))\n", + " \n", + "\n", + "eval_result = tfma.load_eval_result(output_path=OUTPUT_DIR)\n", + "render_slicing_metrics(eval_result, slicing_column = slice_column)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "code", + "id": "mOotC2D5Onqu", + "colab": {} + }, + "cell_type": "code", + "source": [ + "" + ], + "execution_count": 0, + "outputs": [] + } + ] +} \ No newline at end of file diff --git a/tables/automl/notebooks/retail_product_stockout_prediction/README.md 
b/tables/automl/notebooks/retail_product_stockout_prediction/README.md new file mode 100644 index 000000000000..ede346a126ba --- /dev/null +++ b/tables/automl/notebooks/retail_product_stockout_prediction/README.md @@ -0,0 +1,376 @@ +---------------------------------------- +Copyright 2018 Google LLC + +Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at + +http://www.apache.org/licenses/LICENSE-2.0 + +Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. + +---------------------------------------- + +# Retail Product Stockouts Prediction using AutoML Tables + +AutoML Tables enables you to build machine learning models based on tables of your own data and host them on Google Cloud for scalability. This solution demonstrates how you can use AutoML Tables to solve a product stockouts problem in the retail industry. This problem is solved using a binary classification approach, which predicts whether a particular product at a certain store will be out-of-stock or not in the next four weeks. Once the solution is built, you can plug it into your production system and proactively predict stock-outs for your business. + + +Our exercise will + +1. [Walk through the problem of stock-out from a business standpoint](#business-problem) +2. [Explain the challenges in solving this problem with machine learning](#the-machine-learning-solution) +3. [Demonstrate data preparation for machine learning](#data-preparation) +4. [Step-by-step guide to building the model on AutoML Tables UI](#building-the-model-on-automl-tables-ui) +5.
[Step-by-step guide to executing the model through a Python script that can be integrated with your production system](#building-the-model-using-automl-tables-python-client-library) +6. [Performance of the model built using AutoML Tables](#evaluation-results-and-business-impact) + + +## Business Problem + +### Problem statement + +A stockout, or out-of-stock (OOS) event, is an event that causes inventory to be exhausted. While out-of-stocks can occur along the entire supply chain, the most visible kind are retail out-of-stocks in the fast-moving consumer goods industry (e.g., sweets, diapers, fruits). Stockouts are the opposite of overstocks, where too much inventory is retained. + +### Impact + +According to a study by researchers Thomas Gruen and Daniel Corsten, the global average level of out-of-stocks within the retail fast-moving consumer goods sector across developed economies was 8.3% in 2002. This means that shoppers would have a 42% chance of fulfilling a ten-item shopping list without encountering a stockout. Despite the initiatives designed to improve the collaboration of retailers and their suppliers, such as Efficient Consumer Response (ECR), and despite the increasing use of new technologies such as radio-frequency identification (RFID) and point-of-sale data analytics, this situation has improved little over the past decades. + +The biggest impacts are +1. Customer dissatisfaction +2. Loss of revenue + +### Machine Learning Solution + +Using machine learning to predict stock-outs can help store operations prevent out-of-stocks proactively. + +## The Machine Learning Solution + +There are three big challenges any retailer would face as they try to solve this problem with machine learning: + +1. Data silos: Sales data, supply-chain data, inventory data, etc. may all be in silos. Such disjoint datasets could be a challenge to work with as a machine learning model tries to derive insights from all these data points. +2.
Missing Features: Features such as vendor location, weather conditions, etc. could add a lot of value to a machine learning algorithm to learn from. But such features are not always available, so when building machine learning solutions we treat feature collection as an iterative approach to improving the machine learning model. +3. Imbalanced dataset: Datasets for classification problems such as retail stock-out are traditionally very imbalanced, with fewer cases for stock-out. Designing machine learning solutions by hand for such problems would be a time-consuming effort when your team should be focusing on collecting features. + +Hence, we recommend using AutoML Tables. With AutoML Tables you only need to work on acquiring all data and features, and AutoML Tables does the rest. This makes solving the problem of stock-out with machine learning close to a one-click process. + + +## Data Preparation + +### Prerequisite + +To perform this exercise, you need to have a GCP (Google Cloud Platform) account. If you don't have a GCP account, see [Create a GCP project](https://cloud.google.com/resource-manager/docs/creating-managing-projects). + +### Data + +In this solution, you will use two datasets: Training/Evaluation data and Batch Prediction inputs. To access the datasets in BigQuery, you need the following information. + +Training/Evaluation dataset: + +`Project ID: product-stockout` \ +`Dataset ID: product_stockout` \ +`Table ID: stockout` + +Batch Prediction inputs: + +`Project ID: product-stockout` \ +`Dataset ID: product_stockout` \ +`Table ID: batch_prediction_inputs` + +### Data Schema
| Field name | Datatype | Type | Description |
|---|---|---|---|
| Item_Number | STRING | Identifier | Product/item identifier |
| Category | STRING | Identifier | Several items can belong to one category |
| Vendor_Number | STRING | Identifier | Product vendor identifier |
| Store_Number | STRING | Identifier | Store identifier |
| Item_Description | STRING | Text feature | Item description |
| Category_Name | STRING | Text feature | Category name |
| Vendor_Name | STRING | Text feature | Vendor name |
| Store_Name | STRING | Text feature | Store name |
| Address | STRING | Text feature | Address |
| City | STRING | Categorical feature | City |
| Zip_Code | STRING | Categorical feature | Zip code |
| Store_Location | STRING | Categorical feature | Store location |
| County_Number | STRING | Categorical feature | County number |
| County | STRING | Categorical feature | County name |
| Weekly Sales Quantity | INTEGER | Time series data | 52 columns, one for the weekly sales quantity of each of weeks 1 through 52 |
| Weekly Sales Dollars | INTEGER | Time series data | 52 columns, one for the weekly sales dollars of each of weeks 1 through 52 |
| Inventory | FLOAT | Numeric feature | Inventory stocked by the retailer, based on past sales and the seasonality of the product, to meet demand for future sales |
| Stockout | INTEGER | Label | 1 = stock-out, 0 = no stock-out |

We say we see a stock-out when the inventory in stock does not meet the demand for the next four weeks of sales. Defining the label this way gives the retailer an early warning sign, with enough lead time to re-stock and replenish inventory.
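The four-week labeling rule above can be sketched as a small function. This is an illustrative reconstruction, not code from the solution; the dataset already ships with the `Stockout` column precomputed.

```python
def stockout_label(inventory, next_four_weeks_sales):
    """Return 1 (stock-out) if projected four-week demand exceeds the
    inventory on hand, else 0 (no stock-out).

    Illustrative reconstruction of the labeling rule described above.
    """
    return 1 if sum(next_four_weeks_sales) > inventory else 0

# Four weeks of projected sales totalling 105 units:
print(stockout_label(100.0, [30, 25, 20, 30]))  # demand 105 > 100 -> 1
print(stockout_label(150.0, [30, 25, 20, 30]))  # demand 105 <= 150 -> 0
```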
To use AutoML Tables with BigQuery, you do not need to download this dataset. However, if you would like to use AutoML Tables with GCS, you may want to download this dataset and upload it into your GCP project's storage bucket.

Instructions to download the dataset:

Sample dataset: download this dataset, which contains the sales data.

1. [Link to training data](https://console.cloud.google.com/bigquery?folder=&organizationId=&project=product-stockout&p=product-stockout&d=product_stockout&t=stockout&page=table): \
Dataset URI:
2. [Link to data for batch predictions](https://console.cloud.google.com/bigquery?folder=&organizationId=&project=product-stockout&p=product-stockout&d=product_stockout&t=batch_prediction_inputs&page=table): \
Dataset URI:

Upload this dataset to GCS or BigQuery (optional).

You can select either [GCS](https://cloud.google.com/storage/) or [BigQuery](https://cloud.google.com/bigquery/) as the location of your choice to store the data for this challenge.

1. Storing data on GCS: [Creating storage buckets, uploading data to storage buckets](https://cloud.google.com/storage/docs/creating-buckets)
2. Storing data on BigQuery: [Create and load data to BigQuery](https://cloud.google.com/bigquery/docs/quickstarts/quickstart-web-ui) (optional)


## Building the model on AutoML Tables UI

1. Enable [AutoML Tables](https://cloud.google.com/automl-tables/docs/quickstart#before_you_begin) on GCP.

2. Visit the [AutoML Tables UI](https://console.cloud.google.com/automl-tables) to begin the process of creating your dataset and training your model. \
>![alt text](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/automl_stockout_img/Image%201%202019-03-13%20at%201.02.53%20PM.png)

3. 
Import the dataset you prepared (or downloaded) in the last section. \
Click <+New Dataset> → enter a dataset name → click Create Dataset \
>![alt text](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/automl_stockout_img/Image%202%202019-03-13%20at%201.05.17%20PM.png)

4. You can import data from BigQuery or a GCS bucket. \
 a. For BigQuery, enter your GCP project ID, dataset ID and table ID. \
 After specifying the dataset, click Import Dataset. \
>![alt text](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/automl_stockout_img/Image%203%202019-03-13%20at%201.08.44%20PM.png) \
 b. For GCS, enter the GCS object location by clicking BROWSE. \
 After specifying the dataset, click Import Dataset. \
>![alt text](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/automl_stockout_img/Image%204%202019-03-13%20at%201.09.56%20PM.png) \
 Depending on the size of the dataset, this import can take some time.

5. Once the import is complete, you can set the schema of the imported dataset based on your business understanding of the data. \
 a. Select the label, i.e. Stockout. \
 b. Select the variable type for all features. \
 c. Click Continue. \
>![alt text](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/automl_stockout_img/Image%206%202019-03-13%20at%201.20.57%20PM.png)

6. The imported dataset is then analyzed. \
This helps you examine the size of your dataset, dig into missing values if any, and review per-column correlation, mean and standard deviation. If the data quality looks good to you, move on to the next tab, i.e. Train. \
>![alt text](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/automl_stockout_img/Image%20new%201%202019-03-25%20at%2012.43.13%20AM.png)

7. Train \
 a. Select a model name \
 b. Select the training budget \
 c. Select all features you would like to use for training \
 d. Select the optimization objective. 
Such as: area under the ROC curve, Log Loss, or area under the PR curve \
 (as our data is imbalanced, we use the PR curve as our optimization metric) \
 e. Click TRAIN \
 f. Training the model can take some time \
![alt text](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/automl_stockout_img/Image%208%202019-03-13%20at%201.34.08%20PM.png)

![alt text](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/automl_stockout_img/Image%20new%202%202019-03-25%20at%2012.44.18%20AM.png)

8. Once the model is trained, you can click on the Evaluate tab. \
This tab gives you statistics for model evaluation. \
 For example, our model shows: \
 Area Under Precision Recall Curve: 0.645 \
 Area Under ROC Curve: 0.893 \
 Accuracy: 92.5% \
 Log Loss: 0.217 \
Selecting the threshold lets you set a desired precision and recall for your predictions. \
>![alt text](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/automl_stockout_img/Image%20new%203%202019-03-25%20at%2012.49.40%20AM.png)

9. Using the model you created, let's use batch prediction to predict stock-outs. \
 a. Batch prediction inputs can come from BigQuery or your GCS bucket. \
 b. Select the GCS bucket to store the results of your batch prediction. \
 c. Click Send Batch Predictions. \
>![alt text](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/automl_stockout_img/Image%2012%202019-03-13%20at%201.56.43%20PM.png)

>![alt text](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/automl_stockout_img/Image%2013%202019-03-13%20at%201.59.18%20PM.png)


## Building the model using AutoML Tables Python Client Library

In this notebook, you will learn how to build the same model you built on the AutoML Tables UI, this time using the Python client library. 
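The threshold slider mentioned in step 8 converts the model's predicted stock-out score into a binary decision, trading precision against recall. The computation behind that trade-off can be sketched in plain Python; the scores and labels below are illustrative, not output from the actual model.

```python
def precision_recall_at(threshold, scores, labels):
    """Precision and recall when predicting stock-out for score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Illustrative predicted scores and their true labels (1 = stock-out).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for t in (0.25, 0.5, 0.75):
    p, r = precision_recall_at(t, scores, labels)
    print('threshold={}: precision={:.2f}, recall={:.2f}'.format(t, p, r))
```

Raising the threshold flags fewer items, which typically raises precision and lowers recall; the UI slider performs the same trade-off over the evaluation set.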
## Evaluation results and business impact

>![alt text](https://storage.cloud.google.com/cloud-ml-data/automl-tables/notebooks/automl_stockout_img/Image%20new%203%202019-03-25%20at%2012.49.40%20AM.png)

The evaluation results tell us that the model we built delivers:

1. 92.5% accuracy: about 92.5% of the time, the stock-out/no-stock-out prediction is correct.
2. 78.2% precision: of the stock-outs the model identifies, 78.2% are expected to actually be stock-outs.
3. 44.1% recall: of all actual stock-outs, 44.1% should be identified by this model.
4. 1.5% false positive rate: only 1.5% of in-stock items are expected to be incorrectly flagged as stock-outs.

Thus, with such a machine learning model, your business can expect time savings and revenue gains from predicting stock-outs.

Note: You can always improve this model iteratively by adding business-relevant features. \ No newline at end of file diff --git a/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 1 2019-03-13 at 1.02.53 PM.png b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 1 2019-03-13 at 1.02.53 PM.png new file mode 100644 index 000000000000..94f11b28bb60 Binary files /dev/null and b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 1 2019-03-13 at 1.02.53 PM.png differ diff --git a/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 12 2019-03-13 at 1.56.43 PM.png b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 12 2019-03-13 at 1.56.43 PM.png new file mode 100644 index 000000000000..f60f3aa5d54f Binary files /dev/null and b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 12 2019-03-13 at 1.56.43 PM.png differ diff --git 
a/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 13 2019-03-13 at 1.59.18 PM.png b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 13 2019-03-13 at 1.59.18 PM.png new file mode 100644 index 000000000000..f80bdfb85555 Binary files /dev/null and b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 13 2019-03-13 at 1.59.18 PM.png differ diff --git a/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 2 2019-03-13 at 1.05.17 PM.png b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 2 2019-03-13 at 1.05.17 PM.png new file mode 100644 index 000000000000..daeb7d9661e2 Binary files /dev/null and b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 2 2019-03-13 at 1.05.17 PM.png differ diff --git a/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 3 2019-03-13 at 1.08.44 PM.png b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 3 2019-03-13 at 1.08.44 PM.png new file mode 100644 index 000000000000..2cc3f366c13f Binary files /dev/null and b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 3 2019-03-13 at 1.08.44 PM.png differ diff --git a/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 4 2019-03-13 at 1.09.56 PM.png b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 4 2019-03-13 at 1.09.56 PM.png new file mode 100644 index 000000000000..66b1fe57c8a0 Binary files /dev/null and b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 4 2019-03-13 at 1.09.56 PM.png differ diff --git 
a/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 5 2019-03-13 at 1.10.11 PM.png b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 5 2019-03-13 at 1.10.11 PM.png new file mode 100644 index 000000000000..0d27ed38bfb7 Binary files /dev/null and b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 5 2019-03-13 at 1.10.11 PM.png differ diff --git a/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 6 2019-03-13 at 1.20.57 PM.png b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 6 2019-03-13 at 1.20.57 PM.png new file mode 100644 index 000000000000..02ccd865bc83 Binary files /dev/null and b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 6 2019-03-13 at 1.20.57 PM.png differ diff --git a/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 8 2019-03-13 at 1.34.08 PM.png b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 8 2019-03-13 at 1.34.08 PM.png new file mode 100644 index 000000000000..d0e7ddb85af6 Binary files /dev/null and b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image 8 2019-03-13 at 1.34.08 PM.png differ diff --git a/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image new 1 2019-03-25 at 12.43.13 AM.png b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image new 1 2019-03-25 at 12.43.13 AM.png new file mode 100644 index 000000000000..e57b543d0de7 Binary files /dev/null and b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image new 1 2019-03-25 at 12.43.13 AM.png differ diff --git 
a/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image new 2 2019-03-25 at 12.44.18 AM.png b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image new 2 2019-03-25 at 12.44.18 AM.png new file mode 100644 index 000000000000..20667b2ef4a0 Binary files /dev/null and b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image new 2 2019-03-25 at 12.44.18 AM.png differ diff --git a/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image new 3 2019-03-25 at 12.49.40 AM.png b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image new 3 2019-03-25 at 12.49.40 AM.png new file mode 100644 index 000000000000..776d8d42ae0d Binary files /dev/null and b/tables/automl/notebooks/retail_product_stockout_prediction/resources/automl_stockout_img/Image new 3 2019-03-25 at 12.49.40 AM.png differ diff --git a/tables/automl/notebooks/retail_product_stockout_prediction/retail_product_stockout_prediction.ipynb b/tables/automl/notebooks/retail_product_stockout_prediction/retail_product_stockout_prediction.ipynb new file mode 100644 index 000000000000..5916735f5d2a --- /dev/null +++ b/tables/automl/notebooks/retail_product_stockout_prediction/retail_product_stockout_prediction.ipynb @@ -0,0 +1,998 @@ +{ + "nbformat": 4, + "nbformat_minor": 0, + "metadata": { + "colab": { + "name": "retail_product_stockout_prediction.ipynb", + "version": "0.3.2", + "provenance": [], + "collapsed_sections": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + } + }, + "cells": [ + { + "metadata": { + "id": "9V5sA5glWemD", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Copyright 2018 Google LLC \n", + "\n", + "Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "you may not use this file except in compliance with the License.\n", + "You may obtain 
a copy of the License at\n", + "\n", + "http://www.apache.org/licenses/LICENSE-2.0\n", + "\n", + "Unless required by applicable law or agreed to in writing, software\n", + "distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "See the License for the specific language governing permissions and limitations under the License." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "m26YhtBMvVWA" + }, + "cell_type": "markdown", + "source": [ + "# Retail Product Stockouts Prediction using AutoML Tables\n", + "\n", + "AutoML Tables enables you to build machine learning models based on tables of your own data and host them on Google Cloud for scalability. This solution demonstrates how you can use AutoML Tables to solve a product stockouts problem in the retail industry. This problem is solved using a binary classification approach, which predicts whether a particular product at a certain store will be out-of-stock or not in the next four weeks. Once the solution is built, you can plug this in with your production system and proactively predict stock-outs for your business. \n", + "\n", + "To use this Colab notebook, copy it to your own Google Drive and open it with [Colaboratory](https://colab.research.google.com/) (or Colab). To run a cell hold the Shift key and press the Enter key (or Return key). Colab automatically displays the return value of the last line in each cell. Refer to [this page](https://colab.research.google.com/notebooks/welcome.ipynb) for more information on Colab.\n", + "\n", + "You can run a Colab notebook on a hosted runtime in the Cloud. The hosted VM times out after 90 minutes of inactivity and you will lose all the data stored in the memory including your authentication data. 
If your session gets disconnected (for example, because you closed your laptop) for less than the 90 minute inactivity timeout limit, press 'RECONNECT' on the top right corner of your notebook and resume the session. After a Colab timeout, you'll need to\n", + "\n", + "1. Re-run the initialization and authentication.\n", + "2. Continue from where you left off. You may need to copy-paste the value of some variables such as the `dataset_name` from the printed output of the previous cells.\n", + "\n", + "Alternatively, you can connect your Colab notebook to a [local runtime](https://research.google.com/colaboratory/local-runtimes.html)." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "b--5FDDwCG9C" + }, + "cell_type": "markdown", + "source": [ + "## 1. Project set up\n", + "\n", + "\n", + "\n" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "AZs0ICgy4jkQ" + }, + "cell_type": "markdown", + "source": [ + "Follow the [AutoML Tables documentation](https://cloud.google.com/automl-tables/docs/) to\n", + "* Create a Google Cloud Platform (GCP) project.\n", + "* Enable billing.\n", + "* Apply to whitelist your project.\n", + "* Enable the AutoML API.\n", + "* Enable the AutoML Tables API.\n", + "* Create a service account, grant required permissions, and download the service account private key.\n", + "\n", + "You also need to upload your data into Google Cloud Storage (GCS) or BigQuery. For example, to use GCS as your data source:\n", + "* Create a GCS bucket.\n", + "* Upload the training and batch prediction files.\n", + "\n", + "\n", + "**Warning:** Private keys must be kept secret. If you expose your private key, it is recommended to revoke it immediately from the Google Cloud Console." 
+ ] + }, + { + "metadata": { + "colab_type": "text", + "id": "xZECt1oL429r" + }, + "cell_type": "markdown", + "source": [ + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "rstRPH9SyZj_" + }, + "cell_type": "markdown", + "source": [ + "## 2. Initialize and authenticate\n", + "This section runs initialization and authentication. It creates an authenticated session which is required for running any of the following sections." + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "BR0POq2UzE7e" + }, + "cell_type": "markdown", + "source": [ + "### Install the client library in Colab\n", + "Run the following cell to install the client library using `pip`.\n", + "\n", + "See the [documentation](https://cloud.google.com/automl-tables/docs/client-libraries) for the Google Cloud AutoML client library for Python. \n" + ] + }, + { + "metadata": { + "id": "43aXKjDRt_qZ", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Install AutoML Tables client library { vertical-output: true }\n", + "\n", + "!pip install google-cloud-automl" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "eVFsPPEociwF" + }, + "cell_type": "markdown", + "source": [ + "### Authenticate using service account key\n", + "Run the following cell. Click on the __Choose Files__ button and select the service account private key file. If your Service Account Key file or folder is hidden, you can reveal it in a Mac by pressing the __Command + Shift + .__ combo.\n", + "\n" + ] + }, + { + "metadata": { + "id": "u-kCqysAuaJk", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Authenticate using service account key and create a client. 
{ vertical-output: true }\n", + "\n", + "from google.cloud import automl_v1beta1\n", + "from google.colab import files\n", + "\n", + "# Upload service account key\n", + "keyfile_upload = files.upload()\n", + "keyfile_name = list(keyfile_upload.keys())[0]\n", + "# Authenticate and create an AutoML client.\n", + "client = automl_v1beta1.AutoMlClient.from_service_account_file(keyfile_name)\n", + "# Authenticate and create a prediction service client.\n", + "prediction_client = automl_v1beta1.PredictionServiceClient.from_service_account_file(keyfile_name)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "s3F2xbEJdDvN" + }, + "cell_type": "markdown", + "source": [ + "### Test" + ] + }, + { + "metadata": { + "id": "0uX4aJYUiXh5", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Enter your GCP project ID." + ] + }, + { + "metadata": { + "colab_type": "code", + "id": "6R4h5HF1Dtds", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title GCP project ID and location\n", + "\n", + "project_id = '' #@param {type:'string'}\n", + "location = 'us-central1'\n", + "location_path = client.location_path(project_id, location)\n", + "location_path" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "rUlBcZ3OfWcJ" + }, + "cell_type": "markdown", + "source": [ + "To test whether your project setup and authentication steps were successful, run the following cell to list your datasets in this project.\n", + "\n", + "If no dataset has previously been imported into AutoML Tables, you should expect an empty result." + ] + }, + { + "metadata": { + "cellView": "both", + "colab_type": "code", + "id": "sf32nKXIqYje", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title List datasets. { vertical-output: true }\n", + "\n", + "list_datasets_response = client.list_datasets(location_path)\n", + "datasets = {dataset.display_name: dataset.name for dataset in list_datasets_response}\n", + "datasets" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "t9uE8MvMkOPd" + }, + "cell_type": "markdown", + "source": [ + "You can also print the list of your models by running the following cell.\n", + "\n", + "If no model has previously been trained using AutoML Tables, you should expect an empty result." + ] + }, + { + "metadata": { + "cellView": "both", + "colab_type": "code", + "id": "j4-bYRSWj7xk", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title List models. { vertical-output: true }\n", + "\n", + "list_models_response = client.list_models(location_path)\n", + "models = {model.display_name: model.name for model in list_models_response}\n", + "models" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "qozQWMnOu48y" + }, + "cell_type": "markdown", + "source": [ + "\n", + "\n", + "---\n", + "\n" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "ODt86YuVDZzm" + }, + "cell_type": "markdown", + "source": [ + "## 3. Import training data" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "XwjZc9Q62Fm5" + }, + "cell_type": "markdown", + "source": [ + "### Create dataset" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "_JfZFGSceyE_" + }, + "cell_type": "markdown", + "source": [ + "Select a dataset display name and pass your table source information to create a new dataset." 
+ ] + }, + { + "metadata": { + "id": "Z_JErW3cw-0J", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Create dataset { vertical-output: true, output-height: 200 }\n", + "\n", + "dataset_display_name = 'stockout_data' #@param {type: 'string'}\n", + "\n", + "dataset_dict = {\n", + " 'display_name': dataset_display_name, \n", + " 'tables_dataset_metadata': {}\n", + "}\n", + "\n", + "create_dataset_response = client.create_dataset(\n", + " location_path,\n", + " dataset_dict\n", + ")\n", + "create_dataset_response" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "RLRgvqzUdxfL", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + " #@title Get dataset name { vertical-output: true }\n", + "\n", + "dataset_name = create_dataset_response.name\n", + "dataset_name" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "35YZ9dy34VqJ" + }, + "cell_type": "markdown", + "source": [ + "### Import data" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "3c0o15gVREAw" + }, + "cell_type": "markdown", + "source": [ + "You can import your data to AutoML Tables from GCS or BigQuery. For this solution, you will import data from a BigQuery table. The URI for your table is in the format of `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", + "\n", + "The BigQuery table used for demonstration purposes can be accessed as `bq://product-stockout.product_stockout.stockout`. \n", + "\n", + "See the table schema and dataset description in the README. " + ] + }, + { + "metadata": { + "id": "bB_GdeqCJW5i", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title ... 
if data source is BigQuery { vertical-output: true }\n", + "\n", + "dataset_bq_input_uri = 'bq://product-stockout.product_stockout.stockout' #@param {type: 'string'}\n", + "# Define input configuration.\n", + "input_config = {\n", + " 'bigquery_source': {\n", + " 'input_uri': dataset_bq_input_uri\n", + " }\n", + "}" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "FNVYfpoXJsNB", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + " #@title Import data { vertical-output: true }\n", + "\n", + "import_data_response = client.import_data(dataset_name, \n", + " input_config)\n", + "print('Dataset import operation: {}'.format(import_data_response.operation))" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "1O7tJ8IlefRC", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + " #@title Check if importing the data is complete { vertical-output: true }\n", + "\n", + "# If this returns `False`, you can check back again later.\n", + "# Continue with the rest only if this cell returns `True`.\n", + "import_data_response.done()" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "_WLvyGIDe9ah", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Importing this stockout dataset takes about 10 minutes. \n", + "\n", + "If you re-visit this Colab, uncomment the following cell and run the command to retrieve your dataset. Replace `YOUR_DATASET_NAME` with its actual value obtained in the preceding cells.\n", + "\n", + "`YOUR_DATASET_NAME` is a string in the format of `'projects/<project-id>/locations/<location>/datasets/<dataset-id>'`." 
+ ] + }, + { + "metadata": { + "id": "P6NkRMyJfAGm", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "# dataset_name = '' #@param {type: 'string'}\n", + "# dataset = client.get_dataset(dataset_name) " + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "QdxBI4s44ZRI", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Review the specs" + ] + }, + { + "metadata": { + "id": "RC0PWKqH4jwr", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "Run the following command to see table specs such as row count." + ] + }, + { + "metadata": { + "id": "v2Vzq_gwXxo-", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Table schema { vertical-output: true }\n", + "\n", + "import google.cloud.automl_v1beta1.proto.data_types_pb2 as data_types\n", + "import matplotlib.pyplot as plt\n", + "\n", + "# List table specs\n", + "list_table_specs_response = client.list_table_specs(dataset_name)\n", + "table_specs = [s for s in list_table_specs_response]\n", + "# List column specs\n", + "table_spec_name = table_specs[0].name\n", + "list_column_specs_response = client.list_column_specs(table_spec_name)\n", + "column_specs = {s.display_name: s for s in list_column_specs_response}\n", + "# Table schema pie chart.\n", + "type_counts = {}\n", + "for column_spec in column_specs.values():\n", + " type_name = data_types.TypeCode.Name(column_spec.data_type.type_code)\n", + " type_counts[type_name] = type_counts.get(type_name, 0) + 1\n", + "\n", + "plt.pie(x=type_counts.values(), labels=type_counts.keys(), autopct='%1.1f%%')\n", + "plt.axis('equal')\n", + "plt.show()\n" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "Lqjq4X43v3ON", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "In the pie chart above, you see this dataset contains three variable types: `FLOAT64` (treated as `Numeric`), `CATEGORY` 
(treated as `Categorical`) and `STRING` (treated as `Text`). " + ] + }, + { + "metadata": { + "id": "FNykW_YOYt6d", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "___" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "kNRVJqVOL8h3" + }, + "cell_type": "markdown", + "source": [ + "## 4. Update dataset: assign a label column and enable nullable columns" + ] + }, + { + "metadata": { + "id": "VsOPwxN9fOIl", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Get column specs" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "-57gehId9PQ5" + }, + "cell_type": "markdown", + "source": [ + "AutoML Tables automatically detects your data column types. \n", + "\n", + "There are a total of 120 columns in this stockout dataset.\n", + "\n", + "Run the following command to check the column data types that were automatically detected. If a column contains only numerical values but they represent categories, change that column's data type to categorical by updating your schema.\n", + "\n", + "In addition, AutoML Tables detects `Stockout` to be categorical and accordingly chooses to run a classification model. 
" + ] + }, + { + "metadata": { + "id": "Pyku3AHEfSp4", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title List table specs { vertical-output: true }\n", + "\n", + "list_table_specs_response = client.list_table_specs(dataset_name)\n", + "table_specs = [s for s in list_table_specs_response]\n", + "table_specs" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "jso_JBI9fgy6", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Check column data type { vertical-output: true }\n", + "\n", + "# Get column specs.\n", + "table_spec_name = table_specs[0].name\n", + "list_column_specs_response = client.list_column_specs(table_spec_name)\n", + "column_specs = {s.display_name: s for s in list_column_specs_response}\n", + "\n", + "# Print column data types.\n", + "for column in column_specs:\n", + " print(column, '-', column_specs[column].data_type)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "iRqdQ7Xiq04x", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "### Update columns: make categorical\n", + "\n", + "From the column data type, you noticed `Item_Number`, `Category`, `Vendor_Number`, `Store_Number`, `Zip_Code` and `County_Number` have been autodetected as `FLOAT64` (Numerical) instead of `CATEGORY` (Categorical). 
\n", + "\n", + "In this solution, the columns `Item_Number`, `Category`, `Vendor_Number` and `Store_Number` are not nullable, but `Zip_Code` and `County_Number` can take null values.\n", + "\n", + "To change the data type, you can update the schema by updating the column spec.\n", + "\n", + "`update_column_response = client.update_column_spec(update_column_spec_dict)`" + ] + }, + { + "metadata": { + "id": "gAPg_ymDf4kL", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "def create_update_column_spec_dict(column_name, type_code, nullable):\n", + " \"\"\"\n", + " Create `update_column_spec_dict` with a given column name and target `type_code`.\n", + " Inputs:\n", + " column_name: string. Represents the column name.\n", + " type_code: string. Represents the variable type. See details: \\\n", + " https://cloud.google.com/automl-tables/docs/reference/rest/v1beta1/projects.locations.datasets.tableSpecs.columnSpecs#typecode\n", + " nullable: boolean. If true, this DataType can also be null.\n", + " Return:\n", + " update_column_spec_dict: dictionary. Encodes the target column specs.\n", + " \"\"\"\n", + " update_column_spec_dict = {\n", + " 'name': column_specs[column_name].name,\n", + " 'data_type': {\n", + " 'type_code': type_code,\n", + " 'nullable': nullable\n", + " }\n", + " }\n", + " return update_column_spec_dict" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "_xePITEYf5po", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "# Update dataset\n", + "categorical_column_names = ['Item_Number',\n", + " 'Category',\n", + " 'Vendor_Number',\n", + " 'Store_Number',\n", + " 'Zip_Code',\n", + " 'County_Number']\n", + "is_nullable = [False, \n", + " False,\n", + " False,\n", + " False,\n", + " True,\n", + " True]\n", + "\n", + "for i in range(len(categorical_column_names)):\n", + " column_name = categorical_column_names[i]\n", + " nullable = is_nullable[i]\n", + " update_column_spec_dict = create_update_column_spec_dict(column_name, 'CATEGORY', nullable)\n", + " update_column_response = client.update_column_spec(update_column_spec_dict)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "colab_type": "text", + "id": "nDMH_chybe4w" + }, + "cell_type": "markdown", + "source": [ + "### Update dataset: assign a label\n", + "\n", + "Select the label column and update the dataset." 
+ ] + }, + { + "metadata": { + "id": "hVIruWg0u33t", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Update dataset { vertical-output: true }\n", + "\n", + "label_column_name = 'Stockout' #@param {type: 'string'}\n", + "label_column_spec = column_specs[label_column_name]\n", + "label_column_id = label_column_spec.name.rsplit('/', 1)[-1]\n", + "print('Label column ID: {}'.format(label_column_id))\n", + "# Define the values of the fields to be updated.\n", + "update_dataset_dict = {\n", + " 'name': dataset_name,\n", + " 'tables_dataset_metadata': {\n", + " 'target_column_spec_id': label_column_id\n", + " }\n", + "}\n", + "\n", + "update_dataset_response = client.update_dataset(update_dataset_dict)\n", + "update_dataset_response" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "z23NITLrcxmi", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "___" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "FcKgvj1-Tbgj" + }, + "cell_type": "markdown", + "source": [ + "## 5. Creating a model" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "Pnlk8vdQlO_k" + }, + "cell_type": "markdown", + "source": [ + "### Train a model\n", + "Training the model may take one hour or more. To obtain the results with less training time or budget, you can set [`train_budget_milli_node_hours`](https://cloud.google.com/automl-tables/docs/reference/rest/v1beta1/projects.locations.models), which is the train budget of creating this model, expressed in milli node hours i.e. 1,000 value in this field means 1 node hour. \n", + "\n", + "For demonstration purpose, the following command sets the budget as 1 node hour. You can increate that number up to a maximum of 72 hours ('train_budget_milli_node_hours': 72000) for the best model performance. \n", + "\n", + "You can also select the objective to optimize your model training by setting `optimization_objective`. 
This solution optimizes the model by maximizing the Area Under the Precision-Recall (PR) Curve. \n" + ] + }, + { + "metadata": { + "id": "11izNd6Fu37N", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Create model { vertical-output: true }\n", + "\n", + "feature_list = list(column_specs.keys())\n", + "feature_list.remove('Stockout')\n", + "\n", + "model_display_name = 'stockout_model' #@param {type:'string'}\n", + "dataset_id = dataset_name.rsplit('/', 1)[-1]\n", + "\n", + "model_dict = {\n", + " 'display_name': model_display_name,\n", + " 'dataset_id': dataset_id, \n", + " 'tables_model_metadata': {\n", + " 'target_column_spec': column_specs['Stockout'],\n", + " 'input_feature_column_specs': [column_specs[f] for f in feature_list],\n", + " 'optimization_objective': 'MAXIMIZE_AU_PRC',\n", + " 'train_budget_milli_node_hours': 1000\n", + " }, \n", + "}\n", + "\n", + "create_model_response = client.create_model(location_path, model_dict)\n", + "print('Create model operation: {}'.format(create_model_response.operation))" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "wCQdx9VyhKY5", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Check if model training is complete { vertical-output: true }\n", + "# If this returns `False`, you can check back again later.\n", + "# Continue with the rest only if this cell returns `True`.\n", + "create_model_response.done()" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "bPiR8zMwhQYO", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Retrieve the model name { vertical-output: true }\n", + "create_model_result = create_model_response.result()\n", + "model_name = create_model_result.name\n", + "model_name" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "neYjToB36q9E", + "colab_type": "text" + }, + "cell_type": "markdown", 
+ "source": [ + "If your Colab times out, use `client.list_models(location_path)` to check whether your model has been created. \n", + "\n", + "Then uncomment the following cell and run the command to retrieve your model. Replace `YOUR_MODEL_NAME` with its actual value obtained in the preceding cell.\n", + "\n", + "`YOUR_MODEL_NAME` is a string in the format of `'projects//locations//models/'`" + ] + }, + { + "metadata": { + "id": "QptCwUIK7yhU", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "# model_name = '' #@param {type: 'string'}\n", + "# model = client.get_model(model_name)" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "1wS1is9IY5nK", + "colab_type": "text" + }, + "cell_type": "markdown", + "source": [ + "___" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "TarOq84-GXch" + }, + "cell_type": "markdown", + "source": [ + "## 6. Batch prediction" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "Soy5OB8Wbp_R" + }, + "cell_type": "markdown", + "source": [ + "### Initialize prediction" + ] + }, + { + "metadata": { + "colab_type": "text", + "id": "39bIGjIlau5a" + }, + "cell_type": "markdown", + "source": [ + "Your data source for batch prediction can be GCS or BigQuery. For this solution, you will use a BigQuery Table as the input source. The URI for your table is in the format of `bq://PROJECT_ID.DATASET_ID.TABLE_ID`.\n", + "\n", + "To write out the predictions, you need to specify a GCS bucket `gs://BUCKET_NAME`.\n", + "\n", + "The AutoML Tables logs the errors in the `errors.csv` file.\n", + "\n", + "**NOTE:** The client library has a bug. If the following cell returns a `TypeError: Could not convert Any to BatchPredictResult` error, ignore it. The batch prediction output file(s) will be updated to the GCS bucket that you set in the preceding cells." 
+ ] + }, + { + "metadata": { + "id": "gkF3bH0qu4DU", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Start batch prediction { vertical-output: true, output-height: 200 }\n", + "\n", + "batch_predict_bq_input_uri = 'bq://product-stockout.product_stockout.batch_prediction_inputs'\n", + "batch_predict_gcs_output_uri_prefix = 'gs://' #@param {type:'string'}\n", + "\n", + "# Define input source.\n", + "batch_prediction_input_source = {\n", + " 'bigquery_source': {\n", + " 'input_uri': batch_predict_bq_input_uri\n", + " }\n", + "}\n", + "# Define output target.\n", + "batch_prediction_output_target = {\n", + " 'gcs_destination': {\n", + " 'output_uri_prefix': batch_predict_gcs_output_uri_prefix\n", + " }\n", + "}\n", + "batch_predict_response = prediction_client.batch_predict(model_name, \n", + " batch_prediction_input_source, \n", + " batch_prediction_output_target)\n", + "print('Batch prediction operation: {}'.format(batch_predict_response.operation))" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "AVJhh_k0PfxD", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Check if batch prediction is complete { vertical-output: true }\n", + "\n", + "# If this returns `False`, you can check back again later.\n", + "# Continue with the rest only if this cell returns `True`.\n", + "batch_predict_response.done()" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "8nr5q2M8W2VX", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Retrieve batch prediction metadata { vertical-output: true }\n", + "\n", + "batch_predict_response.metadata" + ], + "execution_count": 0, + "outputs": [] + }, + { + "metadata": { + "id": "kgwbJwS2iLpc", + "colab_type": "code", + "colab": {} + }, + "cell_type": "code", + "source": [ + "#@title Check prediction results { vertical-output: true }\n", + "\n", + "gcs_output_directory = 
batch_predict_response.metadata.batch_predict_details.output_info.gcs_output_directory\n", + "result_file = gcs_output_directory + '/result.csv'\n", + "print('Batch prediction results are stored as: {}'.format(result_file))" + ], + "execution_count": 0, + "outputs": [] + } + ] +} \ No newline at end of file