From f957b278b39e0a472a3153e9e1906c2d5f2ac2e5 Mon Sep 17 00:00:00 2001 From: Henry Solberg Date: Tue, 14 Nov 2023 18:52:15 -0800 Subject: [PATCH] docs: add an example notebook about line graphs (#197) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Thank you for opening a Pull Request! Before submitting your PR, there are a few things you can do to make sure it goes smoothly: - [ ] Make sure to open an issue as a [bug/issue](https://togithub.com/googleapis/python-bigquery-dataframes/issues/new/choose) before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea - [ ] Ensure the tests and linter pass - [ ] Code coverage does not decrease (if any source code was changed) - [ ] Appropriate docs were updated (if necessary) Fixes # 🦕 --- .../bq_dataframes_covid_line_graphs.ipynb | 598 ++++++++++++++++++ noxfile.py | 1 + 2 files changed, 599 insertions(+) create mode 100644 notebooks/visualization/bq_dataframes_covid_line_graphs.ipynb diff --git a/notebooks/visualization/bq_dataframes_covid_line_graphs.ipynb b/notebooks/visualization/bq_dataframes_covid_line_graphs.ipynb new file mode 100644 index 0000000000..8b18cc8967 --- /dev/null +++ b/notebooks/visualization/bq_dataframes_covid_line_graphs.ipynb @@ -0,0 +1,598 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9GIt_orUtNvA" + }, + "outputs": [], + "source": [ + "# Copyright 2023 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h7AT6h2ItNvD" + }, + "source": [ + "## Use BigQuery DataFrames to visualize COVID-19 data\n", + "\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \n", + " \"Colab Run in Colab\n", + " \n", + " \n", + " \n", + " \"GitHub\n", + " View on GitHub\n", + " \n", + "" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "n-MFJQxLtNvE" + }, + "source": [ + "## Overview\n", + "\n", + "The goal of this notebook is to demonstrate creating line graphs from a ~20 million-row BigQuery dataset using BigQuery DataFrames. We will first create a plain line graph using matplotlib, then we will downsample and download our data to create a graph with a line of best fit using seaborn.\n", + "\n", + "If you're like me, during 2020 (and/or later years) you often found yourself looking at charts like [these](https://health.google.com/covid-19/open-data/explorer/statistics) visualizing COVID-19 cases over time. For our first graph, we're going to recreate one of those charts by filtering, summing, and then graphing COVID-19 data from the United States. BigQuery DataFrames' default integration with matplotlib will get us a satisfying result for this first graph.\n", + "\n", + "For our second graph, though, we want to use a scatterplot with a line of best fit, something that matplotlib will not do for us automatically. So, we'll demonstrate how to downsample our data and use seaborn to make our plot. Our second graph will be of symptom-related search trends against new cases of COVID-19, so we'll see if searches for things like \"cough\" and \"fever\" are more common in the places and times where more new cases of COVID-19 occur." + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "ffqBzbNztNvF" + }, + "source": [ + "### Dataset\n", + "\n", + "This notebook uses the [BigQuery COVID-19 Open Data](https://console.cloud.google.com/marketplace/product/bigquery-public-datasets/covid19-open-data). In this dataset, each row represents a new observation of the COVID-19 situation in a particular time and place. We will use the \"new_confirmed\" column, which contains the number of new COVID-19 cases at each observation, along with the \"search_trends_cough\", \"search_trends_fever\", and \"search_trends_bruise\" columns, which are [Google Trends](https://trends.google.com/trends/) data for searches related to cough, fever, and bruises. In the first section of the notebook, we will also use the \"country_code\" and \"date\" columns to compile one data point per day for a particular country." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Nf__tMR-tNvF" + }, + "source": [ + "### Costs\n", + "\n", + "This tutorial uses billable components of Google Cloud:\n", + "\n", + "* BigQuery (compute)\n", + "\n", + "Learn about [BigQuery compute pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing_models),\n", + "and use the [Pricing Calculator](https://cloud.google.com/products/calculator/)\n", + "to generate a cost estimate based on your projected usage." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7_rsbkCktNvG" + }, + "source": [ + "## Before you begin\n", + "\n", + "### Set up your Google Cloud project\n", + "\n", + "**The following steps are required, regardless of your notebook environment.**\n", + "\n", + "1. [Select or create a Google Cloud project](https://console.cloud.google.com/cloud-resource-manager). When you first create an account, you get a $300 free credit towards your compute/storage costs.\n", + "\n", + "2. [Make sure that billing is enabled for your project](https://cloud.google.com/billing/docs/how-to/modify-project).\n", + "\n", + "3. 
[Enable the BigQuery API](https://console.cloud.google.com/flows/enableapi?apiid=bigquery.googleapis.com).\n", + "\n", + "4. If you are running this notebook locally, you need to install the [Cloud SDK](https://cloud.google.com/sdk)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XZKC6iMFxmMG" + }, + "source": [ + "#### Set your project ID\n", + "\n", + "**If you don't know your project ID**, try the following:\n", + "* Run `gcloud config list`.\n", + "* Run `gcloud projects list`.\n", + "* See the support page: [Locate the project ID](https://support.google.com/googleapi/answer/7014113)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4aooKMmnxrWF" + }, + "outputs": [], + "source": [ + "PROJECT_ID = \"\" # @param {type:\"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pv5A8Tm-yC1U" + }, + "source": [ + "#### Set the region\n", + "\n", + "You can also change the `REGION` variable used by BigQuery. Learn more about [BigQuery regions](https://cloud.google.com/bigquery/docs/locations#supported_locations)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bk03Rt_HyGx-" + }, + "outputs": [], + "source": [ + "REGION = \"US\" # @param {type: \"string\"}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B9RWxD1btNvK" + }, + "source": [ + "Now we are ready to use BigQuery DataFrames!" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wJ0gXezj2w1t" + }, + "source": [ + "## Visualization #1: Cases over time in the US" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "xckgWno6ouHY" + }, + "source": [ + "### Set up project and filter data" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "-uiY0hh4tNvK" + }, + "source": [ + "First, let's do project setup. We use options to tell BigQuery DataFrames what project and what region to use for our cloud computing." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "R7STCS8xB5d2" + }, + "outputs": [], + "source": [ + "import bigframes.pandas as bf\n", + "\n", + "bf.options.bigquery.project = PROJECT_ID\n", + "bf.options.bigquery.location = REGION" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v6FGschEowht" + }, + "source": [ + "Next, we read the data from a publicly available BigQuery dataset. This will take ~1 minute." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "zDSwoBo1CU3G" + }, + "outputs": [], + "source": [ + "all_data = bf.read_gbq(\"bigquery-public-data.covid19_open_data.covid19_open_data\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9qV2y3iHp13y" + }, + "source": [ + "Using pandas syntax, we will select from our all_data input dataframe only those rows where the country_code is US. This is called row filtering." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UjMT_qhjf8Fu" + }, + "outputs": [], + "source": [ + "usa_data = all_data[all_data[\"country_code\"] == \"US\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IYCUayWkwq8c" + }, + "source": [ + "We're only concerned with the date and the total number of confirmed cases for now, so select just those two columns as well." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "IaoUf57ZwrJ8" + }, + "outputs": [], + "source": [ + "usa_data = usa_data[[\"date\", \"new_confirmed\"]]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "94oqNRnDvGkr" + }, + "source": [ + "### Sum data" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "TNCQWZW83U0b" + }, + "source": [ + "`usa_data.groupby(\"date\")` will give us a groupby object that lets us perform operations on groups of rows with the same date. We call sum on that object to get the sum for each day. This process might be familiar to pandas users." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tYDoaKgJChiq" + }, + "outputs": [], + "source": [ + "# numeric_only = True because we don't want to sum dates\n", + "new_cases_usa = usa_data.groupby(\"date\").sum(numeric_only = True)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3jcwFPgK5BLh" + }, + "source": [ + "### Line graph" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8GvJAgnH5Nzi" + }, + "source": [ + "BigQuery DataFrames implements some of the interface required by matplotlib. This means we can pass our DataFrame right into `pyplot.plot` and, using the default settings, matplotlib will draw a simple line graph for us." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gFbCgfFC2gHw" + }, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt\n", + "\n", + "# matplotlib will draw a line graph by default\n", + "plt.plot(new_cases_usa)\n", + "# Rotate the labels on the x axis so that they don't overlap\n", + "plt.xticks(rotation=45)\n", + "# label the y axis for clarity\n", + "plt.ylabel(\"New Cases\")\n", + "\n", + "# Show the plot\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sM5-HFDx70RG" + }, + "source": [ + "## Visualization #2: Symptom-related searches compared to new cases" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "se1b6Vf4XB9_" + }, + "source": [ + "### Filter data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wl2o-NYMoygb" + }, + "source": [ + "We're curious if searches for symptoms like \"cough\" and \"fever\" went up in the same times and places that new COVID-19 cases occurred, compared to non-symptoms like \"bruise.\" Let's plot searches vs. new cases to see if it looks like there's a correlation." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "olfnCzyg8jYi" + }, + "source": [ + "First, we select the new cases column and the search trends we're interested in." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LqqHzjty8jk0" + }, + "outputs": [], + "source": [ + "symptom_data = all_data[[\"new_confirmed\", \"search_trends_cough\", \"search_trends_fever\", \"search_trends_bruise\"]]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b3DlJX-k9SPk" + }, + "source": [ + "Not all rows have data for all of these columns, so let's select only the rows that do." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "g4MeM8Oe9Q6X" + }, + "outputs": [], + "source": [ + "symptom_data = symptom_data.dropna()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IlXt__om9QYI" + }, + "source": [ + "We want to use a line of best fit to make the correlation stand out. 
Matplotlib does not include a feature for lines of best fit, but seaborn, which is built on matplotlib, does.\n", + "\n", + "BigQuery DataFrames does not currently integrate with seaborn by default. So we will demonstrate how to downsample and download a DataFrame, and use seaborn on the downloaded data." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MmfgKMaEXNbL" + }, + "source": [ + "### Downsample and download" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wIuG1JRTPAk9" + }, + "source": [ + "BigQuery DataFrames options let us set up the sampling functionality we need. Calls to `to_pandas()` usually download all the data available in our BigQuery table and store it locally as a pandas DataFrame. `bf.options.sampling.enable_downsampling = True` will make future calls to `to_pandas` use downsampling to download only part of the data, and `bf.options.sampling.max_download_size` allows us to set the amount of data to download." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "x95ZgBkyDMP4" + }, + "outputs": [], + "source": [ + "bf.options.sampling.enable_downsampling = True # enable downsampling\n", + "bf.options.sampling.max_download_size = 5 # download only 5 MB of data" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "C6sCXkrQPJC_" + }, + "source": [ + "Download the data and note the message letting us know that downsampling is being used." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "V0OK02D7PJSL" + }, + "outputs": [], + "source": [ + "local_symptom_data = symptom_data.to_pandas(sampling_method=\"uniform\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T9Hub_EAXWvY" + }, + "source": [ + "### Graph with lines of best fit" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": { + "id": "hoQ9TPgUPJnN" + }, + "source": [ + "We will now use seaborn to make the plots with the lines of best fit for cough, fever, and bruise. Note that since we're working with a local pandas dataframe, you could use any other Python library or technique you're familiar with, but we'll stick to seaborn for this notebook.\n", + "\n", + "Seaborn will take a few seconds to calculate the lines. Since cough and fever are symptoms of COVID-19, but bruising isn't, we expect the slope of the line of best fit to be positive in the first two graphs, but not the third, indicating that there is a correlation between new COVID-19 cases and cough- and fever-related searches." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EG7qM3R18bOb" + }, + "outputs": [], + "source": [ + "import seaborn as sns\n", + "\n", + "# first, convert to a data type that is suitable for seaborn\n", + "local_symptom_data[\"new_confirmed\"] = \\\n", + " local_symptom_data[\"new_confirmed\"].astype(float)\n", + "local_symptom_data[\"search_trends_cough\"] = \\\n", + " local_symptom_data[\"search_trends_cough\"].astype(float)\n", + "\n", + "# draw the graph. 
This might take ~30 seconds.\n", + "sns.regplot(x=\"new_confirmed\", y=\"search_trends_cough\", data=local_symptom_data)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5nVy61rEGaM4" + }, + "outputs": [], + "source": [ + "# similarly, for fever\n", + "\n", + "local_symptom_data[\"search_trends_fever\"] = \\\n", + " local_symptom_data[\"search_trends_fever\"].astype(float)\n", + "sns.regplot(x=\"new_confirmed\", y=\"search_trends_fever\", data=local_symptom_data)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-S1A9E3WGaYH" + }, + "outputs": [], + "source": [ + "# similarly, for bruise\n", + "local_symptom_data[\"search_trends_bruise\"] = \\\n", + " local_symptom_data[\"search_trends_bruise\"].astype(float)\n", + "sns.regplot(\n", + " x=\"new_confirmed\",\n", + " y=\"search_trends_bruise\",\n", + " data=local_symptom_data\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Hd2A8707Uhz2" + }, + "source": [ + "We see that the slope of the line is positive in the graphs for cough and fever, but flat for bruise. That means that in places with increasing new cases of COVID-19, we saw increasing searches for cough and fever, but we didn't see increasing searches for unrelated symptoms like bruises. Interesting!" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Recap" + ] + }, + { + "attachments": {}, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We used matplotlib to draw a line graph of COVID-19 cases over time in the USA. Then, we used downsampling to download only a portion of the available data, and used seaborn locally to plot lines of best fit to observe correlation between COVID-19 cases and searches for related vs. unrelated symptoms.\n", + "\n", + "Thank you for using BigQuery DataFrames!" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.16" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/noxfile.py b/noxfile.py index 3dd23ba04f..da61232fc7 100644 --- a/noxfile.py +++ b/noxfile.py @@ -615,6 +615,7 @@ def notebook(session): "notebooks/vertex_sdk/sdk2_bigframes_pytorch.ipynb", "notebooks/vertex_sdk/sdk2_bigframes_sklearn.ipynb", "notebooks/vertex_sdk/sdk2_bigframes_tensorflow.ipynb", + "notebooks/visualization/bq_dataframes_covid_line_graphs.ipynb", # The experimental notebooks imagine features that don't yet # exist or only exist as temporary prototypes. "notebooks/experimental/longer_ml_demo.ipynb",