diff --git a/notebooks/experimental/semantic_operators.ipynb b/notebooks/experimental/semantic_operators.ipynb index 8a2f083419..f9c7f67358 100644 --- a/notebooks/experimental/semantic_operators.ipynb +++ b/notebooks/experimental/semantic_operators.ipynb @@ -1,3032 +1,3208 @@ { - "cells": [ - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [], - "source": [ - "# Copyright 2024 Google LLC\n", - "#\n", - "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", - "# you may not use this file except in compliance with the License.\n", - "# You may obtain a copy of the License at\n", - "#\n", - "# https://www.apache.org/licenses/LICENSE-2.0\n", - "#\n", - "# Unless required by applicable law or agreed to in writing, software\n", - "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", - "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", - "# See the License for the specific language governing permissions and\n", - "# limitations under the License." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# BigQuery DataFrames AI (semantic) Operator Demo" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The BigQuery DataFrames team implements AI operators inspired by the \"Lotus\" paper: https://arxiv.org/pdf/2407.11418.\n", - "\n", - "This notebook gives you a hands-on preview of AI operator APIs powered by LLM. You can open this notebook on Google Colab [here](https://colab.research.google.com/github/googleapis/python-bigquery-dataframes/blob/main/notebooks/experimental/semantic_operators.ipynb). \n", - "\n", - "The notebook has two sections. The first section introduces the API syntax with examples, with the aim to get you familiar with how AI operators work. The second section applies AI operators on a large real-world dataset. You will also find some performance statistics there." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Preparation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "First, import the BigQuery DataFrames modules." - ] - }, - { - "cell_type": "code", - "execution_count": 2, - "metadata": {}, - "outputs": [], - "source": [ - "import bigframes\n", - "import bigframes.pandas as bpd" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Make sure the BigFrames version is at least `1.23.0`" - ] - }, - { - "cell_type": "code", - "execution_count": 3, - "metadata": {}, - "outputs": [], - "source": [ - "from packaging.version import Version\n", - "\n", - "assert Version(bigframes.__version__) >= Version(\"1.23.0\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Turn on the semantic operator experiment. You will see a warning sign saying that these operators are still under experiments. If you don't turn on the experiment before using the operators, you will get `NotImplemenetedError`s." - ] - }, - { - "cell_type": "code", - "execution_count": 4, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/_config/experiment_options.py:34: UserWarning: Semantic operators are still under experiments, and are subject to change in the future.\n", - " warnings.warn(\n" - ] - } - ], - "source": [ - "bigframes.options.experiments.semantic_operators = True" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Optional: turn off the display of progress bar so that only the operation results will be printed out" - ] - }, - { - "cell_type": "code", - "execution_count": 5, - "metadata": {}, - "outputs": [], - "source": [ - "bpd.options.display.progress_bar = None" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Create LLM instances. 
They will be passed in as parameters for each semantic operator." - ] - }, - { - "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/pandas/__init__.py:258: DefaultLocationWarning: No explicit location is set, so using location US for the session.\n", - " return global_session.get_global_session()\n" - ] - } - ], - "source": [ - "from bigframes.ml import llm\n", - "gemini_model = llm.GeminiTextGenerator(model_name=\"gemini-1.5-flash-001\")\n", - "text_embedding_model = llm.TextEmbeddingGenerator(model_name=\"text-embedding-005\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "*Note*: semantic operators could be expensive over a large set of data. As a result, our team added this option `bigframes.options.compute.sem_ops_confirmation_threshold` at `version 1.31.0` so that the BigQuery Dataframe will ask for your confirmation if the amount of data to be processed is too large. If the amount of rows exceeds your threshold, you will see a prompt for your keyboard input -- 'y' to proceed and 'n' to abort. If you abort the operation, no LLM processing will be done.\n", - "\n", - "The default threshold is 0, which means the operators will always ask for confirmations. You are free to adjust the value as needed. You can also set the threshold to `None` to disable this feature." 
- ] - }, - { - "cell_type": "code", - "execution_count": 7, - "metadata": {}, - "outputs": [], - "source": [ - "if Version(bigframes.__version__) >= Version(\"1.31.0\"):\n", - " bigframes.options.compute.semantic_ops_confirmation_threshold = 100" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "If you would like your operations to fail automatically when the data is too large, set `bigframes.options.compute.semantic_ops_threshold_autofail` to `True`:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "# if Version(bigframes.__version__) >= Version(\"1.31.0\"):\n", - "# bigframes.options.compute.semantic_ops_threshold_autofail = True" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# The API" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You will learn about each semantic operator by trying some examples." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Semantic Filtering" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Semantic filtering allows you to filter your dataframe based on the instruction (i.e. prompt) you provided. \n", - "\n", - "First, create a dataframe:" - ] - }, - { - "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
   country     city
0      USA  Seattle
1  Germany   Berlin
2    Japan    Kyoto
\n", - "

3 rows × 2 columns

\n", - "
[3 rows x 2 columns in total]" + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UYeZd_I8iouP" + }, + "outputs": [], + "source": [ + "# Copyright 2024 Google LLC\n", + "#\n", + "# Licensed under the Apache License, Version 2.0 (the \"License\");\n", + "# you may not use this file except in compliance with the License.\n", + "# You may obtain a copy of the License at\n", + "#\n", + "# https://www.apache.org/licenses/LICENSE-2.0\n", + "#\n", + "# Unless required by applicable law or agreed to in writing, software\n", + "# distributed under the License is distributed on an \"AS IS\" BASIS,\n", + "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n", + "# See the License for the specific language governing permissions and\n", + "# limitations under the License." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rWJnGj2ViouP" + }, + "source": [ + "# BigFrames AI (semantic) Operator Tutorial\n", + "\n", + "\n", + "\n", + " \n", + " \n", + "
\n", + " \n", + " \"Colab Run in Colab\n", + " \n", + " \n", + " \n", + " \"GitHub\n", + " View on GitHub\n", + " \n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mgOrr256iouQ" + }, + "source": [ + "This notebook provides a hands-on preview of AI operator APIs powered by the Gemini model.\n", + "\n", + "The notebook is divided into two sections. The first section introduces the API syntax with examples, aiming to familiarize you with how AI operators work. The second section applies AI operators to a large real-world dataset and presents performance statistics.\n", + "\n", + "This work is inspired by [this paper](https://arxiv.org/pdf/2407.11418) and powered by BigQuery ML and Vertex AI." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2ymVbJV2iouQ" + }, + "source": [ + "# Preparation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vvVzFzo3iouQ" + }, + "source": [ + "First, import the BigFrames modules.\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Jb9glT2ziouQ" + }, + "outputs": [], + "source": [ + "import bigframes\n", + "import bigframes.pandas as bpd" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xQiCWj7OiouQ" + }, + "source": [ + "Make sure the BigFrames version is at least `1.23.0`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LTPpI8IpiouQ" + }, + "outputs": [], + "source": [ + "from packaging.version import Version\n", + "\n", + "assert Version(bigframes.__version__) >= Version(\"1.23.0\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "agxLmtlbiouR" + }, + "source": [ + "Turn on the semantic operator experiment. You will see a warning sign saying that these operators are still under experiments. If you don't turn on the experiment before using the operators, you will get `NotImplemenetedError`s." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "1wXqdDr8iouR" + }, + "outputs": [], + "source": [ + "bigframes.options.experiments.semantic_operators = True" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Specify your GCP project and location." ], - "text/plain": [ - " country city\n", - "0 USA Seattle\n", - "1 Germany Berlin\n", - "2 Japan Kyoto\n", - "\n", - "[3 rows x 2 columns]" - ] - }, - "execution_count": 8, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df = bpd.DataFrame({'country': ['USA', 'Germany', 'Japan'], 'city': ['Seattle', 'Berlin', 'Kyoto']})\n", - "df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, filter this dataframe by keeping only the rows where the value in `city` column is the capital of the value in `country` column. The column references could be \"escaped\" by using a pair of braces in your instruction. In this example, your instruction should be like this:\n", - "```\n", - "The {city} is the capital of the {country}.\n", - "```\n", - "\n", - "Note that this is not a Python f-string, so you shouldn't prefix your instruction with an `f`." - ] - }, - { - "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
   country    city
1  Germany  Berlin
\n", - "

1 rows × 2 columns

\n", - "
[1 rows x 2 columns in total]" + "metadata": { + "id": "W8TPUvnsqxhv" + } + }, + { + "cell_type": "code", + "source": [ + "bpd.options.bigquery.project = 'YOUR_PROJECT_ID'\n", + "bpd.options.bigquery.location = 'US'" ], - "text/plain": [ - " country city\n", - "1 Germany Berlin\n", - "\n", - "[1 rows x 2 columns]" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df.semantics.filter(\"The {city} is the capital of the {country}\", model=gemini_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The filter operator extracts the information from the referenced column to enrich your instruction with context. The instruction is then sent for the designated model for evaluation. For filtering operations, the LLM is asked to return only `True` and `False` for each row, and the operator removes the rows accordingly." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Semantic Mapping" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Semantic mapping allows to you to combine values from multiple columns into a single output based your instruction. \n", - "\n", - "Here is an example:" - ] - }, - { - "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
  ingredient_1  ingredient_2
0          Bun    Beef Patty
1     Soy Bean       Bittern
2      Sausage    Long Bread
\n", - "

3 rows × 2 columns

\n", - "
[3 rows x 2 columns in total]" + "metadata": { + "id": "vCkraKOeqJFl" + }, + "execution_count": null, + "outputs": [] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "n95MFlS0iouR" + }, + "source": [ + "**Optional**: turn off the display of progress bar so that only the operation results will be printed out" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5r6ahx7MiouR" + }, + "outputs": [], + "source": [ + "# bpd.options.display.progress_bar = None" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "93iYvp7niouR" + }, + "source": [ + "Create LLM instances. They will be passed in as parameters for each semantic operator.\n", + "\n", + "This tutorial uses the \"gemini-1.5-flash-001\" model for text generation and \"text-embedding-005\" for embedding. While these are recommended, you can choose [other Vertex AI LLM models](https://cloud.google.com/vertex-ai/generative-ai/docs/learn/models) based on your needs and availability. Ensure you have [sufficient quota](https://cloud.google.com/vertex-ai/generative-ai/docs/quotas) for your chosen models and adjust it if necessary." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tHkymaLNiouR" + }, + "outputs": [], + "source": [ + "from bigframes.ml import llm\n", + "gemini_model = llm.GeminiTextGenerator(model_name=\"gemini-1.5-flash-001\")\n", + "text_embedding_model = llm.TextEmbeddingGenerator(model_name=\"text-embedding-005\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mbFDcvnPiouR" + }, + "source": [ + "**Note**: semantic operators could be expensive over a large set of data. As a result, our team added this option `bigframes.options.compute.sem_ops_confirmation_threshold` at `version 1.31.0` so that the BigFrames will ask for your confirmation if the amount of data to be processed is too large. 
If the amount of rows exceeds your threshold, you will see a prompt for your keyboard input -- 'y' to proceed and 'n' to abort. If you abort the operation, no LLM processing will be done.\n", + "\n", + "The default threshold is 0, which means the operators will always ask for confirmations. You are free to adjust the value as needed. You can also set the threshold to `None` to disable this feature." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "F4dZm4b7iouR" + }, + "outputs": [], + "source": [ + "if Version(bigframes.__version__) >= Version(\"1.31.0\"):\n", + " bigframes.options.compute.semantic_ops_confirmation_threshold = 1000" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_dEA3G9RiouR" + }, + "source": [ + "If you would like your operations to fail automatically when the data is too large, set `bigframes.options.compute.semantic_ops_threshold_autofail` to `True`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BoUK-cpbiouS" + }, + "outputs": [], + "source": [ + "# if Version(bigframes.__version__) >= Version(\"1.31.0\"):\n", + "# bigframes.options.compute.semantic_ops_threshold_autofail = True" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hQft3o3OiouS" + }, + "source": [ + "# API Samples" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dt5Kl-QGiouS" + }, + "source": [ + "You will learn about each semantic operator by trying some examples." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "J7XAT459iouS" + }, + "source": [ + "## Semantic Filtering" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9d5HUIvliouS" + }, + "source": [ + "Semantic filtering allows you to filter your dataframe based on the instruction (i.e. 
prompt) you provided.\n", + "\n", + "First, create a dataframe:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 190 + }, + "id": "NDpCRGd_iouS", + "outputId": "5048c935-06d3-4ef1-ad87-72e14a30b1b7" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " country city\n", + "0 USA Seattle\n", + "1 Germany Berlin\n", + "2 Japan Kyoto\n", + "\n", + "[3 rows x 2 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   country     city
0      USA  Seattle
1  Germany   Berlin
2    Japan    Kyoto
\n", + "

3 rows × 2 columns

\n", + "
[3 rows x 2 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 10 + } ], - "text/plain": [ - " ingredient_1 ingredient_2\n", - "0 Bun Beef Patty\n", - "1 Soy Bean Bittern\n", - "2 Sausage Long Bread\n", - "\n", - "[3 rows x 2 columns]" - ] - }, - "execution_count": 10, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df = bpd.DataFrame({\n", - " \"ingredient_1\": [\"Bun\", \"Soy Bean\", \"Sausage\"], \n", - " \"ingredient_2\": [\"Beef Patty\", \"Bittern\", \"Long Bread\"]\n", - " })\n", - "df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, you ask LLM what kind of food can be made from the two ingredients in each row. The column reference syntax in your instruction stays the same. In addition, you need to specify the column name by setting the `output_column` parameter to hold the mapping results." - ] - }, - { - "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
  ingredient_1  ingredient_2    food
0          Bun    Beef Patty  Burger
1     Soy Bean       Bittern    Tofu
2      Sausage    Long Bread  Hotdog
\n", - "

3 rows × 3 columns

\n", - "
[3 rows x 3 columns in total]" + "source": [ + "df = bpd.DataFrame({'country': ['USA', 'Germany', 'Japan'], 'city': ['Seattle', 'Berlin', 'Kyoto']})\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6AXmT7sniouS" + }, + "source": [ + "Now, filter this dataframe by keeping only the rows where the value in `city` column is the capital of the value in `country` column. The column references could be \"escaped\" by using a pair of braces in your instruction. In this example, your instruction should be like this:\n", + "```\n", + "The {city} is the capital of the {country}.\n", + "```\n", + "\n", + "Note that this is not a Python f-string, so you shouldn't prefix your instruction with an `f`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 127 + }, + "id": "ipW3Z_l4iouS", + "outputId": "ad447459-225a-419c-d4c8-fedac4a9ed0f" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " country city\n", + "1 Germany Berlin\n", + "\n", + "[1 rows x 2 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
   country    city
1  Germany  Berlin
\n", + "

1 rows × 2 columns

\n", + "
[1 rows x 2 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 11 + } ], - "text/plain": [ - " ingredient_1 ingredient_2 food\n", - "0 Bun Beef Patty Burger \n", - "\n", - "1 Soy Bean Bittern Tofu \n", - "\n", - "2 Sausage Long Bread Hotdog \n", - "\n", - "\n", - "[3 rows x 3 columns]" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df.semantics.map(\"What is the food made from {ingredient_1} and {ingredient_2}? One word only.\", output_column=\"food\", model=gemini_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Semantic Joining" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Semantic joining can join two dataframes based on the instruction you provided. \n", - "\n", - "First, you prepare two dataframes:" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [], - "source": [ - "cities = bpd.DataFrame({'city': ['Seattle', 'Ottawa', 'Berlin', 'Shanghai', 'New Delhi']})\n", - "continents = bpd.DataFrame({'continent': ['North America', 'Africa', 'Asia']})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You want to join the `cities` with `continents` to form a new dataframe such that, in each row the city from the `cities` data frame is in the continent from the `continents` dataframe. You could re-use the aforementioned column reference syntax:" - ] - }, - { - "cell_type": "code", - "execution_count": 13, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
        city      continent
0    Seattle  North America
1     Ottawa  North America
2   Shanghai           Asia
3  New Delhi           Asia
\n", - "

4 rows × 2 columns

\n", - "
[4 rows x 2 columns in total]" + "source": [ + "df.semantics.filter(\"The {city} is the capital of the {country}\", model=gemini_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "swKvgfm1iouS" + }, + "source": [ + "The filter operator extracts the information from the referenced column to enrich your instruction with context. The instruction is then sent for the designated model for evaluation. For filtering operations, the LLM is asked to return only `True` and `False` for each row, and the operator removes the rows accordingly." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r_2AAGGoiouS" + }, + "source": [ + "## Semantic Mapping" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vT6skC57iouS" + }, + "source": [ + "Semantic mapping allows to you to combine values from multiple columns into a single output based your instruction.\n", + "\n", + "Here is an example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 190 + }, + "id": "BQ7xeUK3iouS", + "outputId": "33dcb742-77ed-4bea-8dbc-1cf775102a25" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " ingredient_1 ingredient_2\n", + "0 Bun Beef Patty\n", + "1 Soy Bean Bittern\n", + "2 Sausage Long Bread\n", + "\n", + "[3 rows x 2 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
  ingredient_1  ingredient_2
0          Bun    Beef Patty
1     Soy Bean       Bittern
2      Sausage    Long Bread
\n", + "

3 rows × 2 columns

\n", + "
[3 rows x 2 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 12 + } ], - "text/plain": [ - " city continent\n", - "0 Seattle North America\n", - "1 Ottawa North America\n", - "2 Shanghai Asia\n", - "3 New Delhi Asia\n", - "\n", - "[4 rows x 2 columns]" - ] - }, - "execution_count": 13, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "cities.semantics.join(continents, \"{city} is in {continent}\", model=gemini_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "!! **Important:** Semantic join can trigger probihitively expensitve operations! This operation first cross joins two dataframes, then invokes semantic filter on each row. That means if you have two dataframes of sizes `M` and `N`, the total amount of queries sent to the LLM is on the scale of `M * N`. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Self Joins" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This self-join example is for demonstrating a special case: what happens when the joining columns exist in both data frames? It turns out that you need to provide extra information in your column references: by attaching \"left.\" and \"right.\" prefixes to your column names. \n", - "\n", - "Create an example data frame:" - ] - }, - { - "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [], - "source": [ - "animals = bpd.DataFrame({'animal': ['cow', 'cat', 'spider', 'elephant']})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You want to compare the weights of these animals, and output all the pairs where the animal on the left is heavier than the animal on the right. 
In this case, you use `left.animal` and `right.animal` to differentiate the data sources:" - ] - }, - { - "cell_type": "code", - "execution_count": 15, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
  animal_left  animal_right
0         cow           cat
1         cow        spider
2         cat        spider
3    elephant           cow
4    elephant           cat
5    elephant        spider
\n", - "

6 rows × 2 columns

\n", - "
[6 rows x 2 columns in total]" + "source": [ + "df = bpd.DataFrame({\n", + " \"ingredient_1\": [\"Bun\", \"Soy Bean\", \"Sausage\"],\n", + " \"ingredient_2\": [\"Beef Patty\", \"Bittern\", \"Long Bread\"]\n", + " })\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VFObP2aFiouS" + }, + "source": [ + "Now, you ask LLM what kind of food can be made from the two ingredients in each row. The column reference syntax in your instruction stays the same. In addition, you need to specify the column name by setting the `output_column` parameter to hold the mapping results." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 190 + }, + "id": "PpL24AQFiouS", + "outputId": "e7aff038-bf4b-4833-def8-fe2648e8885b" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " ingredient_1 ingredient_2 food\n", + "0 Bun Beef Patty Burger \n", + "\n", + "1 Soy Bean Bittern Tofu \n", + "\n", + "2 Sausage Long Bread Hotdog \n", + "\n", + "\n", + "[3 rows x 3 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
  ingredient_1  ingredient_2    food
0          Bun    Beef Patty  Burger
1     Soy Bean       Bittern    Tofu
2      Sausage    Long Bread  Hotdog
\n", + "

3 rows × 3 columns

\n", + "
[3 rows x 3 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 13 + } ], - "text/plain": [ - " animal_left animal_right\n", - "0 cow cat\n", - "1 cow spider\n", - "2 cat spider\n", - "3 elephant cow\n", - "4 elephant cat\n", - "5 elephant spider\n", - "\n", - "[6 rows x 2 columns]" - ] - }, - "execution_count": 15, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "animals.semantics.join(animals, \"{left.animal} generally weighs heavier than {right.animal}\", model=gemini_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Semantic Aggregation" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Semantic aggregation merges all the values in a column into one. At this moment you can only aggregate a single column in each oeprator call.\n", - "\n", - "Here is an example:" - ] - }, - { - "cell_type": "code", - "execution_count": 16, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
                       Movies
0                     Titanic
1     The Wolf of Wall Street
2  Killers of the Flower Moon
3                The Revenant
4                   Inception
5              Shuttle Island
6            The Great Gatsby
\n", - "

7 rows × 1 columns

\n", - "
[7 rows x 1 columns in total]" + "source": [ + "df.semantics.map(\"What is the food made from {ingredient_1} and {ingredient_2}? One word only.\", output_column=\"food\", model=gemini_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "70WTZZfdiouS" + }, + "source": [ + "## Semantic Joining" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u93uieRaiouS" + }, + "source": [ + "Semantic joining can join two dataframes based on the instruction you provided.\n", + "\n", + "First, you prepare two dataframes:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dffIGEUEiouS" + }, + "outputs": [], + "source": [ + "cities = bpd.DataFrame({'city': ['Seattle', 'Ottawa', 'Berlin', 'Shanghai', 'New Delhi']})\n", + "continents = bpd.DataFrame({'continent': ['North America', 'Africa', 'Asia']})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Hz0X-0RtiouS" + }, + "source": [ + "You want to join the `cities` with `continents` to form a new dataframe such that, in each row the city from the `cities` data frame is in the continent from the `continents` dataframe. You could re-use the aforementioned column reference syntax:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 221 + }, + "id": "WPIOHEwCiouT", + "outputId": "976586c3-b5db-4088-a46a-44dfbf822ecb" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " city continent\n", + "0 Seattle North America\n", + "1 Ottawa North America\n", + "2 Shanghai Asia\n", + "3 New Delhi Asia\n", + "\n", + "[4 rows x 2 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
citycontinent
0SeattleNorth America
1OttawaNorth America
2ShanghaiAsia
3New DelhiAsia
\n", + "

4 rows × 2 columns

\n", + "
[4 rows x 2 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 15 + } ], - "text/plain": [ - " Movies\n", - "0 Titanic\n", - "1 The Wolf of Wall Street\n", - "2 Killers of the Flower Moon\n", - "3 The Revenant\n", - "4 Inception\n", - "5 Shuttle Island\n", - "6 The Great Gatsby\n", - "\n", - "[7 rows x 1 columns]" - ] - }, - "execution_count": 16, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df = bpd.DataFrame({\n", - " \"Movies\": [\n", - " \"Titanic\",\n", - " \"The Wolf of Wall Street\",\n", - " \"Killers of the Flower Moon\",\n", - " \"The Revenant\",\n", - " \"Inception\",\n", - " \"Shuttle Island\",\n", - " \"The Great Gatsby\",\n", - " ],\n", - "})\n", - "df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You ask LLM to find the oldest movie:" - ] - }, - { - "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "text/plain": [ - "0 Titanic \n", - "\n", - "Name: Movies, dtype: string" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "agg_df = df.semantics.agg(\"Find the oldest movie from {Movies}. Reply with only the movie title\", model=gemini_model)\n", - "agg_df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Instead of going through each row one by one, this operator first batches rows to get many aggregation results. It then repeatly batches those results for aggregation, until there is only one value left. You could set the batch size with `max_agg_rows` parameter, which defaults to 10." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Semantic Top K" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Semantic Top K selects the top K values based on your instruction. Here is an example:" - ] - }, - { - "cell_type": "code", - "execution_count": 18, - "metadata": {}, - "outputs": [], - "source": [ - "df = bpd.DataFrame({\"Animals\": [\"Corgi\", \"Orange Cat\", \"Parrot\", \"Tarantula\"]})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You want to find the top two most popular pets:" - ] - }, - { - "cell_type": "code", - "execution_count": 19, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
Animals
0Corgi
1Orange Cat
\n", - "

2 rows × 1 columns

\n", - "
[2 rows x 1 columns in total]" + "source": [ + "cities.semantics.join(continents, \"{city} is in {continent}\", model=gemini_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4Qc97GMWiouT" + }, + "source": [ + "!! **Important:** Semantic join can trigger prohibitively expensive operations! This operation first cross joins two dataframes, then invokes semantic filter on each row. That means if you have two dataframes of sizes `M` and `N`, the total number of queries sent to the LLM is on the scale of `M * N`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MUEJXT1IiouT" + }, + "source": [ + "### Self Joins" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QvX-nCogiouT" + }, + "source": [ + "This self-join example is for demonstrating a special case: what happens when the joining columns exist in both data frames? It turns out that you need to provide extra information in your column references: by attaching \"left.\" and \"right.\" prefixes to your column names.\n", + "\n", + "Create an example data frame:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OIGz5sqxiouW" + }, + "outputs": [], + "source": [ + "animals = bpd.DataFrame({'animal': ['cow', 'cat', 'spider', 'elephant']})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VmJbuWNniouX" + }, + "source": [ + "You want to compare the weights of these animals, and output all the pairs where the animal on the left is heavier than the animal on the right. 
In this case, you use `left.animal` and `right.animal` to differentiate the data sources:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 284 + }, + "id": "UHfggdhBiouX", + "outputId": "a439e3aa-1382-4244-951f-127dc8da0fe3" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " animal_left animal_right\n", + "0 cow cat\n", + "1 cow spider\n", + "2 cat spider\n", + "3 elephant cow\n", + "4 elephant cat\n", + "5 elephant spider\n", + "\n", + "[6 rows x 2 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
animal_leftanimal_right
0cowcat
1cowspider
2catspider
3elephantcow
4elephantcat
5elephantspider
\n", + "

6 rows × 2 columns

\n", + "
[6 rows x 2 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 17 + } ], - "text/plain": [ - " Animals\n", - "0 Corgi\n", - "1 Orange Cat\n", - "\n", - "[2 rows x 1 columns]" - ] - }, - "execution_count": 19, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df.semantics.top_k(\"{Animals} are more popular as pets\", model=gemini_model, k=2)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Under the hood, the semantic top K operator performs pair-wise comparisons with LLM. The top K results are returned in the order of their indices instead of their ranks." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Semantic Search" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Semantic search searches the most similar values to your query within a single column. Here is an example:" - ] - }, - { - "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
creatures
0salmon
1sea urchin
2baboons
3frog
4chimpanzee
\n", - "

5 rows × 1 columns

\n", - "
[5 rows x 1 columns in total]" + "source": [ + "animals.semantics.join(animals, \"{left.animal} generally weighs heavier than {right.animal}\", model=gemini_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KONR7ywqiouX" + }, + "source": [ + "## Semantic Aggregation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "I8iNRogoiouX" + }, + "source": [ + "Semantic aggregation merges all the values in a column into one. At this moment you can only aggregate a single column in each operator call.\n", + "\n", + "Here is an example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 315 + }, + "id": "9tsem17aiouX", + "outputId": "1db5fa6e-b59d-41f5-9c13-db2c9ed0415b" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Movies\n", + "0 Titanic\n", + "1 The Wolf of Wall Street\n", + "2 Killers of the Flower Moon\n", + "3 The Revenant\n", + "4 Inception\n", + "5 Shuttle Island\n", + "6 The Great Gatsby\n", + "\n", + "[7 rows x 1 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Movies
0Titanic
1The Wolf of Wall Street
2Killers of the Flower Moon
3The Revenant
4Inception
5Shuttle Island
6The Great Gatsby
\n", + "

7 rows × 1 columns

\n", + "
[7 rows x 1 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 18 + } ], - "text/plain": [ - " creatures\n", - "0 salmon\n", - "1 sea urchin\n", - "2 baboons\n", - "3 frog\n", - "4 chimpanzee\n", - "\n", - "[5 rows x 1 columns]" - ] - }, - "execution_count": 20, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df = bpd.DataFrame({\"creatures\": [\"salmon\", \"sea urchin\", \"baboons\", \"frog\", \"chimpanzee\"]})\n", - "df" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You want to get the top 2 creatures that are most similar to \"monkey\":" - ] - }, - { - "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n", - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n", - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
creaturessimilarity score
2baboons0.708434
4chimpanzee0.635844
\n", - "

2 rows × 2 columns

\n", - "
[2 rows x 2 columns in total]" + "source": [ + "df = bpd.DataFrame({\n", + " \"Movies\": [\n", + " \"Titanic\",\n", + " \"The Wolf of Wall Street\",\n", + " \"Killers of the Flower Moon\",\n", + " \"The Revenant\",\n", + " \"Inception\",\n", + " \"Shuttle Island\",\n", + " \"The Great Gatsby\",\n", + " ],\n", + "})\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uA9XpV0aiouX" + }, + "source": [ + "You ask LLM to find the oldest movie:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "KzYoX3mRiouX", + "outputId": "1ac50d7b-dfa7-4c16-8daf-aeb03b6df7a5" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0 Titanic \n", + "\n", + "Name: Movies, dtype: string" + ] + }, + "metadata": {}, + "execution_count": 19 + } ], - "text/plain": [ - " creatures similarity score\n", - "2 baboons 0.708434\n", - "4 chimpanzee 0.635844\n", - "\n", - "[2 rows x 2 columns]" - ] - }, - "execution_count": 21, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df.semantics.search(\"creatures\", query=\"monkey\", top_k = 2, model = text_embedding_model, score_column='similarity score')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Notice that you are using a text embedding model this time. This model generates embedding vectors for both your query as well as the values in the search space. The operator then uses BigQuery's built-in VECTOR_SEARCH function to find the nearest neighbors of your query.\n", - "\n", - "In addition, `score_column` is an optional parameter for storing the distances between the results and your query. If not set, the score column won't be attached to the result." 
- ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Semantic Similarity Join" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "When you want to perform multiple similarity queries in the same value space, you could use similarity join to simplify your call. For example:" - ] - }, - { - "cell_type": "code", - "execution_count": 22, - "metadata": {}, - "outputs": [], - "source": [ - "df1 = bpd.DataFrame({'animal': ['monkey', 'spider', 'salmon', 'giraffe', 'sparrow']})\n", - "df2 = bpd.DataFrame({'animal': ['scorpion', 'baboon', 'owl', 'elephant', 'tuna']})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this example, you want to pick the most related animal from `df2` for each value in `df1`." - ] - }, - { - "cell_type": "code", - "execution_count": 23, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n", - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
animalanimal_1distance
0monkeybaboon0.620521
1spiderscorpion0.728024
2salmontuna0.782141
3giraffeelephant0.7135
4sparrowowl0.810864
\n", - "

5 rows × 3 columns

\n", - "
[5 rows x 3 columns in total]" + "source": [ + "agg_df = df.semantics.agg(\"Find the oldest movie from {Movies}. Reply with only the movie title\", model=gemini_model)\n", + "agg_df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "drvn75qJiouX" + }, + "source": [ + "Instead of going through each row one by one, this operator first batches rows to get many aggregation results. It then repeatedly batches those results for aggregation, until there is only one value left. You could set the batch size with the `max_agg_rows` parameter, which defaults to 10." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kU7BsyTyiouX" + }, + "source": [ + "## Semantic Top K" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s9QePXEoiouX" + }, + "source": [ + "Semantic Top K selects the top K values based on your instruction. Here is an example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bMQqtyZ2iouX" + }, + "outputs": [], + "source": [ + "df = bpd.DataFrame({\"Animals\": [\"Corgi\", \"Orange Cat\", \"Parrot\", \"Tarantula\"]})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KiljGBSCiouX" + }, + "source": [ + "You want to find the top two most popular pets:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 159 + }, + "id": "OZv5WUGIiouX", + "outputId": "ae1cee27-cc31-455e-c4ac-c0a9a5cf4ca5" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Animals\n", + "0 Corgi\n", + "1 Orange Cat\n", + "\n", + "[2 rows x 1 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
Animals
0Corgi
1Orange Cat
\n", + "

2 rows × 1 columns

\n", + "
[2 rows x 1 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 21 + } ], - "text/plain": [ - " animal animal_1 distance\n", - "0 monkey baboon 0.620521\n", - "1 spider scorpion 0.728024\n", - "2 salmon tuna 0.782141\n", - "3 giraffe elephant 0.7135\n", - "4 sparrow owl 0.810864\n", - "\n", - "[5 rows x 3 columns]" - ] - }, - "execution_count": 23, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df1.semantics.sim_join(df2, left_on='animal', right_on='animal', top_k=1, model=text_embedding_model, score_column='distance')" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "!! **Important** Like semantic join, this operator can also be very expensive. To guard against unexpected processing of large dataset, use the `bigframes.options.compute.sem_ops_confirmation_threshold` option to specify a threshold. " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Semantic Cluster" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Semantic Cluster group similar values together. For example:" - ] - }, - { - "cell_type": "code", - "execution_count": 24, - "metadata": {}, - "outputs": [], - "source": [ - "df = bpd.DataFrame({'Product': ['Smartphone', 'Laptop', 'Coffee Maker', 'T-shirt', 'Jeans']})" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You want to cluster these products into 3 groups:" - ] - }, - { - "cell_type": "code", - "execution_count": 25, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. 
This behavior may change in future versions.\n", - " warnings.warn(\n", - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
ProductCluster ID
0Smartphone3
1Laptop3
2Coffee Maker1
3T-shirt2
4Jeans2
\n", - "

5 rows × 2 columns

\n", - "
[5 rows x 2 columns in total]" + "source": [ + "df.semantics.top_k(\"{Animals} are more popular as pets\", model=gemini_model, k=2)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dC8fyu3aiouX" + }, + "source": [ + "Under the hood, the semantic top K operator performs pair-wise comparisons with LLM. The top K results are returned in the order of their indices instead of their ranks." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sIszJ0zPiouX" + }, + "source": [ + "## Semantic Search" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e4ojHRKAiouX" + }, + "source": [ + "Semantic search searches the most similar values to your query within a single column. Here is an example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 253 + }, + "id": "gnQSIZ5SiouX", + "outputId": "dd6e1ecb-1bad-4a7c-8065-e56c697d0863" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " creatures\n", + "0 salmon\n", + "1 sea urchin\n", + "2 baboons\n", + "3 frog\n", + "4 chimpanzee\n", + "\n", + "[5 rows x 1 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
creatures
0salmon
1sea urchin
2baboons
3frog
4chimpanzee
\n", + "

5 rows × 1 columns

\n", + "
[5 rows x 1 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 22 + } ], - "text/plain": [ - " Product Cluster ID\n", - "0 Smartphone 3\n", - "1 Laptop 3\n", - "2 Coffee Maker 1\n", - "3 T-shirt 2\n", - "4 Jeans 2\n", - "\n", - "[5 rows x 2 columns]" - ] - }, - "execution_count": 25, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "df.semantics.cluster_by(column='Product', output_column='Cluster ID', model=text_embedding_model, n_clusters=3)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This operator uses the the embedding model to generate vectors for each value, and then the KMeans algorithm for clustering." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Performance Analyses" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this section, you will use BigQuery's public data of hacker news to perform some heavy work. We recommend you to check the code without executing them in order to save your time and money. The execution results are attached after each cell for your reference.\n", - "\n", - "First, load 3K rows from the table:" - ] - }, - { - "cell_type": "code", - "execution_count": 26, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
titletextbyscoretimestamptype
0<NA>Well, most people aren&#x27;t alcoholics, so I...slipframe<NA>2021-06-26 02:37:56+00:00comment
1<NA>No, you don&#x27;t really <i>need</i> a smartp...vetinari<NA>2023-04-19 15:56:34+00:00comment
2<NA>It&#x27;s for the late Paul Allen RIP. Should&...lsr_ssri<NA>2018-10-16 01:07:55+00:00comment
3<NA>Yup they are dangerous. Be careful Donald Trump.Sven7<NA>2015-08-10 16:05:54+00:00comment
4<NA>Sure, it&#x27;s totally reasonable. Just point...nicoburns<NA>2020-10-05 11:20:51+00:00comment
5<NA>I wonder how long before special forces start ...autisticcurio<NA>2020-09-01 15:38:50+00:00comment
6The Impending NY Tech Apocalypse: Here's What ...<NA>gaoprea32011-09-27 22:43:27+00:00story
7<NA>Where would you relocate to? I'm assuming that...pavel_lishin<NA>2011-09-16 19:02:01+00:00comment
8Eureca beta is live. A place for your business...<NA>ricardos12012-10-15 13:09:32+00:00story
9<NA>It doesn’t work on Safari, and WebKit based br...archiewood<NA>2023-04-21 16:45:13+00:00comment
10<NA>I guess I don’t see the relevance. Vegans eat ...stevula<NA>2023-01-19 20:05:54+00:00comment
11<NA>I remember watching the American news media go...fareesh<NA>2019-06-17 19:49:17+00:00comment
12<NA>This article is incorrectly using the current ...stale2002<NA>2018-03-18 18:57:21+00:00comment
13<NA>In the firm I made my internship, we have to u...iserlohnmage<NA>2019-10-22 10:41:01+00:00comment
14<NA>The main reason it requires unsafe is for memo...comex<NA>2017-05-05 20:45:37+00:00comment
15Discord vs. IRC Rough Notes<NA>todsacerdoti482024-07-12 18:39:52+00:00story
16<NA>you have to auth again when you use apple pay.empath75<NA>2017-09-12 18:58:20+00:00comment
17<NA>It goes consumer grade, automotive, military, ...moftz<NA>2021-04-13 01:24:03+00:00comment
18<NA>I don&#x27;t have a link handy but the differe...KennyBlanken<NA>2022-05-13 16:08:38+00:00comment
19<NA>&gt; I don&#x27;t think the use case you menti...colanderman<NA>2017-09-28 05:16:06+00:00comment
20<NA>I think you need to watch it again, because yo...vladimirralev<NA>2018-12-07 11:25:52+00:00comment
21Oh dear: new Yahoo anti-spoofing measures brea...<NA>joshreads12014-04-08 13:29:50+00:00story
22How Much Warmer Was Your City in 2016?<NA>smb0612017-02-16 23:26:34+00:00story
23<NA>Except that they clearly never tried to incent...aenis<NA>2022-01-31 17:08:57+00:00comment
24Working Best at Coffee Shops<NA>GiraffeNecktie2492011-04-19 14:25:17+00:00story
\n", - "

25 rows × 6 columns

\n", - "
[3000 rows x 6 columns in total]" + "source": [ + "df = bpd.DataFrame({\"creatures\": [\"salmon\", \"sea urchin\", \"baboons\", \"frog\", \"chimpanzee\"]})\n", + "df" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5apfIaZMiouX" + }, + "source": [ + "You want to get the top 2 creatures that are most similar to \"monkey\":" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 159 + }, + "id": "CkAuFgPYiouY", + "outputId": "723c7604-f53c-43d7-c754-4c91ec198dff" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " creatures similarity score\n", + "2 baboons 0.708434\n", + "4 chimpanzee 0.635844\n", + "\n", + "[2 rows x 2 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
creaturessimilarity score
2baboons0.708434
4chimpanzee0.635844
\n", + "

2 rows × 2 columns

\n", + "
[2 rows x 2 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 23 + } ], - "text/plain": [ - " title \\\n", - "0 \n", - "1 \n", - "2 \n", - "3 \n", - "4 \n", - "5 \n", - "6 The Impending NY Tech Apocalypse: Here's What ... \n", - "7 \n", - "8 Eureca beta is live. A place for your business... \n", - "9 \n", - "10 \n", - "11 \n", - "12 \n", - "13 \n", - "14 \n", - "15 Discord vs. IRC Rough Notes \n", - "16 \n", - "17 \n", - "18 \n", - "19 \n", - "20 \n", - "21 Oh dear: new Yahoo anti-spoofing measures brea... \n", - "22 How Much Warmer Was Your City in 2016? \n", - "23 \n", - "24 Working Best at Coffee Shops \n", - "\n", - " text by score \\\n", - "0 Well, most people aren't alcoholics, so I... slipframe \n", - "1 No, you don't really need a smartp... vetinari \n", - "2 It's for the late Paul Allen RIP. Should&... lsr_ssri \n", - "3 Yup they are dangerous. Be careful Donald Trump. Sven7 \n", - "4 Sure, it's totally reasonable. Just point... nicoburns \n", - "5 I wonder how long before special forces start ... autisticcurio \n", - "6 gaoprea 3 \n", - "7 Where would you relocate to? I'm assuming that... pavel_lishin \n", - "8 ricardos 1 \n", - "9 It doesn’t work on Safari, and WebKit based br... archiewood \n", - "10 I guess I don’t see the relevance. Vegans eat ... stevula \n", - "11 I remember watching the American news media go... fareesh \n", - "12 This article is incorrectly using the current ... stale2002 \n", - "13 In the firm I made my internship, we have to u... iserlohnmage \n", - "14 The main reason it requires unsafe is for memo... comex \n", - "15 todsacerdoti 48 \n", - "16 you have to auth again when you use apple pay. empath75 \n", - "17 It goes consumer grade, automotive, military, ... moftz \n", - "18 I don't have a link handy but the differe... KennyBlanken \n", - "19 > I don't think the use case you menti... colanderman \n", - "20 I think you need to watch it again, because yo... 
vladimirralev \n", - "21 joshreads 1 \n", - "22 smb06 1 \n", - "23 Except that they clearly never tried to incent... aenis \n", - "24 GiraffeNecktie 249 \n", - "\n", - " timestamp type \n", - "0 2021-06-26 02:37:56+00:00 comment \n", - "1 2023-04-19 15:56:34+00:00 comment \n", - "2 2018-10-16 01:07:55+00:00 comment \n", - "3 2015-08-10 16:05:54+00:00 comment \n", - "4 2020-10-05 11:20:51+00:00 comment \n", - "5 2020-09-01 15:38:50+00:00 comment \n", - "6 2011-09-27 22:43:27+00:00 story \n", - "7 2011-09-16 19:02:01+00:00 comment \n", - "8 2012-10-15 13:09:32+00:00 story \n", - "9 2023-04-21 16:45:13+00:00 comment \n", - "10 2023-01-19 20:05:54+00:00 comment \n", - "11 2019-06-17 19:49:17+00:00 comment \n", - "12 2018-03-18 18:57:21+00:00 comment \n", - "13 2019-10-22 10:41:01+00:00 comment \n", - "14 2017-05-05 20:45:37+00:00 comment \n", - "15 2024-07-12 18:39:52+00:00 story \n", - "16 2017-09-12 18:58:20+00:00 comment \n", - "17 2021-04-13 01:24:03+00:00 comment \n", - "18 2022-05-13 16:08:38+00:00 comment \n", - "19 2017-09-28 05:16:06+00:00 comment \n", - "20 2018-12-07 11:25:52+00:00 comment \n", - "21 2014-04-08 13:29:50+00:00 story \n", - "22 2017-02-16 23:26:34+00:00 story \n", - "23 2022-01-31 17:08:57+00:00 comment \n", - "24 2011-04-19 14:25:17+00:00 story \n", - "...\n", - "\n", - "[3000 rows x 6 columns]" - ] - }, - "execution_count": 26, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "hacker_news = bpd.read_gbq(\"bigquery-public-data.hacker_news.full\")[['title', 'text', 'by', 'score', 'timestamp', 'type']].head(3000)\n", - "hacker_news" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Then, keep only the rows that have text content:" - ] - }, - { - "cell_type": "code", - "execution_count": 27, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "2556" - ] - }, - "execution_count": 27, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - 
"hacker_news_with_texts = hacker_news[hacker_news['text'].isnull() == False]\n", - "len(hacker_news_with_texts)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can get an idea of the input token length by calculating the average string length." - ] - }, - { - "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "390.29068857589976" - ] - }, - "execution_count": 28, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "hacker_news_with_texts['text'].str.len().mean()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "[Optional] You can raise the confirmation threshold for a smoother experience." - ] - }, - { - "cell_type": "code", - "execution_count": 29, - "metadata": {}, - "outputs": [], - "source": [ - "if Version(bigframes.__version__) >= Version(\"1.31.0\"):\n", - " bigframes.options.compute.semantic_ops_confirmation_threshold = 5000" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now it's LLM's turn. You want to keep only the rows whose texts are talking about iPhone. This will take several minutes to finish." - ] - }, - { - "cell_type": "code", - "execution_count": 30, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "This operation will process about 2556 rows. Proceed? [Y/n]\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
titletextbyscoretimestamptype
9<NA>It doesn’t work on Safari, and WebKit based br...archiewood<NA>2023-04-21 16:45:13+00:00comment
420<NA>Well last time I got angry down votes for sayi...drieddust<NA>2021-01-11 19:27:27+00:00comment
814<NA>New iPhone should be announced on September. L...meerita<NA>2019-07-30 20:54:42+00:00comment
1515<NA>Why would this take a week? i(phone)OS was ori...TheOtherHobbes<NA>2021-06-08 09:25:24+00:00comment
1562<NA>&gt;or because Apple drama brings many clicks?...weberer<NA>2022-09-05 13:16:02+00:00comment
\n", - "

5 rows × 6 columns

\n", - "
[5 rows x 6 columns in total]" + "source": [ + "df.semantics.search(\"creatures\", query=\"monkey\", top_k = 2, model = text_embedding_model, score_column='similarity score')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GDZeVzFTiouY" + }, + "source": [ + "Note that you are using a text embedding model this time. This model generates embedding vectors for both your query and the values in the search space. The operator then uses BigQuery's built-in VECTOR_SEARCH function to find the nearest neighbors of your query.\n", + "\n", + "In addition, `score_column` is an optional parameter for storing the distances between the results and your query. If not set, the score column won't be attached to the result." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EXNutIXqiouY" + }, + "source": [ + "## Semantic Similarity Join" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BhWrhQMjiouY" + }, + "source": [ + "When you want to perform multiple similarity queries in the same value space, you can use a similarity join to simplify your call. For example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cUc7-8O6iouY" + }, + "outputs": [], + "source": [ + "df1 = bpd.DataFrame({'animal': ['monkey', 'spider', 'salmon', 'giraffe', 'sparrow']})\n", + "df2 = bpd.DataFrame({'animal': ['scorpion', 'baboon', 'owl', 'elephant', 'tuna']})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k96WerOviouY" + }, + "source": [ + "In this example, you want to pick the most related animal from `df2` for each value in `df1`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 253 + }, + "id": "wPV5EkfpiouY", + "outputId": "4be1211d-0353-4b94-8c27-ebd568e8e104" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " animal animal_1 distance\n", + "0 monkey baboon 0.620521\n", + "1 spider scorpion 0.728024\n", + "2 salmon tuna 0.782141\n", + "3 giraffe elephant 0.7135\n", + "4 sparrow owl 0.810864\n", + "\n", + "[5 rows x 3 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
animalanimal_1distance
0monkeybaboon0.620521
1spiderscorpion0.728024
2salmontuna0.782141
3giraffeelephant0.7135
4sparrowowl0.810864
\n", + "

5 rows × 3 columns

\n", + "
[5 rows x 3 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 25 + } ], - "text/plain": [ - " title text by \\\n", - "9 It doesn’t work on Safari, and WebKit based br... archiewood \n", - "420 Well last time I got angry down votes for sayi... drieddust \n", - "814 New iPhone should be announced on September. L... meerita \n", - "1515 Why would this take a week? i(phone)OS was ori... TheOtherHobbes \n", - "1562 >or because Apple drama brings many clicks?... weberer \n", - "\n", - " score timestamp type \n", - "9 2023-04-21 16:45:13+00:00 comment \n", - "420 2021-01-11 19:27:27+00:00 comment \n", - "814 2019-07-30 20:54:42+00:00 comment \n", - "1515 2021-06-08 09:25:24+00:00 comment \n", - "1562 2022-09-05 13:16:02+00:00 comment \n", - "\n", - "[5 rows x 6 columns]" - ] - }, - "execution_count": 30, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "iphone_comments=hacker_news_with_texts.semantics.filter(\"The {text} is mainly focused on iPhone\", gemini_model)\n", - "iphone_comments" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The performance of the semantic operators depends on the length of your input as well as your quota. Here are our benchmarks for running the previous operation over data of different sizes.\n", - "\n", - "* 800 Rows -> 1m 21.3s\n", - "* 2550 Rows -> 5m 9s\n", - "* 8500 Rows -> 16m 34.4s\n", - "\n", - "These numbers can give you a general idea of how fast the operators run." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Now, use LLM to summarize the sentiments towards iPhone:" - ] - }, - { - "cell_type": "code", - "execution_count": 31, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. 
This behavior may change in future versions.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
titletextbyscoretimestamptypesentiment
9<NA>It doesn’t work on Safari, and WebKit based br...archiewood<NA>2023-04-21 16:45:13+00:00commentFrustrated, but hopeful.
420<NA>Well last time I got angry down votes for sayi...drieddust<NA>2021-01-11 19:27:27+00:00commentFrustrated and angry.
814<NA>New iPhone should be announced on September. L...meerita<NA>2019-07-30 20:54:42+00:00commentExcited anticipation.
1515<NA>Why would this take a week? i(phone)OS was ori...TheOtherHobbes<NA>2021-06-08 09:25:24+00:00commentFrustrated, critical, obvious.
1562<NA>&gt;or because Apple drama brings many clicks?...weberer<NA>2022-09-05 13:16:02+00:00commentNegative, clickbait, Apple.
\n", - "

5 rows × 7 columns

\n", - "
[5 rows x 7 columns in total]" + "source": [ + "df1.semantics.sim_join(df2, left_on='animal', right_on='animal', top_k=1, model=text_embedding_model, score_column='distance')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GplzD7v0iouY" + }, + "source": [ + "!! **Important** Like semantic join, this operator can also be very expensive. To guard against unexpected processing of large datasets, use the `bigframes.options.compute.semantic_ops_confirmation_threshold` option to specify a threshold." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uG6FyMH_iouY" + }, + "source": [ + "## Semantic Cluster" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uIh3ViNciouY" + }, + "source": [ + "Semantic Cluster groups similar values together. For example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jyQ_aT9qiouY" + }, + "outputs": [], + "source": [ + "df = bpd.DataFrame({'Product': ['Smartphone', 'Laptop', 'Coffee Maker', 'T-shirt', 'Jeans']})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K3IMIFrtiouY" + }, + "source": [ + "You want to cluster these products into 3 groups:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 253 + }, + "id": "0Tc0DqXJiouY", + "outputId": "1c8b6e28-713c-4666-e623-3b2c42c50b30" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " Product Cluster ID\n", + "0 Smartphone 1\n", + "1 Laptop 1\n", + "2 Coffee Maker 1\n", + "3 T-shirt 1\n", + "4 Jeans 1\n", + "\n", + "[5 rows x 2 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
ProductCluster ID
0Smartphone1
1Laptop1
2Coffee Maker1
3T-shirt1
4Jeans1
\n", + "

5 rows × 2 columns

\n", + "
[5 rows x 2 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 27 + } ], - "text/plain": [ - " title text by \\\n", - "9 It doesn’t work on Safari, and WebKit based br... archiewood \n", - "420 Well last time I got angry down votes for sayi... drieddust \n", - "814 New iPhone should be announced on September. L... meerita \n", - "1515 Why would this take a week? i(phone)OS was ori... TheOtherHobbes \n", - "1562 >or because Apple drama brings many clicks?... weberer \n", - "\n", - " score timestamp type \\\n", - "9 2023-04-21 16:45:13+00:00 comment \n", - "420 2021-01-11 19:27:27+00:00 comment \n", - "814 2019-07-30 20:54:42+00:00 comment \n", - "1515 2021-06-08 09:25:24+00:00 comment \n", - "1562 2022-09-05 13:16:02+00:00 comment \n", - "\n", - " sentiment \n", - "9 Frustrated, but hopeful. \n", - " \n", - "420 Frustrated and angry. \n", - " \n", - "814 Excited anticipation. \n", - " \n", - "1515 Frustrated, critical, obvious. \n", - " \n", - "1562 Negative, clickbait, Apple. \n", - " \n", - "\n", - "[5 rows x 7 columns]" - ] - }, - "execution_count": 31, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "iphone_comments.semantics.map(\"Summarize the sentiment of the {text}. Your answer should have at most 3 words\", output_column=\"sentiment\", model=gemini_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here is another example: count the number of rows whose authors have animals in their names." - ] - }, - { - "cell_type": "code", - "execution_count": 32, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/venv/lib/python3.11/site-packages/IPython/core/interactiveshell.py:3577: UserWarning: Reading cached table from 2025-01-03 01:18:29.080474+00:00 to avoid incompatibilies with previous reads of this table. 
To read the latest version, set `use_cache=False` or close the current session with Session.close() or bigframes.pandas.close_session().\n", - " exec(code_obj, self.user_global_ns, self.user_ns)\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
titletextbyscoretimestamptype
0<NA>Well, most people aren&#x27;t alcoholics, so I...slipframe<NA>2021-06-26 02:37:56+00:00comment
1<NA>No, you don&#x27;t really <i>need</i> a smartp...vetinari<NA>2023-04-19 15:56:34+00:00comment
2<NA>It&#x27;s for the late Paul Allen RIP. Should&...lsr_ssri<NA>2018-10-16 01:07:55+00:00comment
3<NA>Yup they are dangerous. Be careful Donald Trump.Sven7<NA>2015-08-10 16:05:54+00:00comment
4<NA>Sure, it&#x27;s totally reasonable. Just point...nicoburns<NA>2020-10-05 11:20:51+00:00comment
5<NA>I wonder how long before special forces start ...autisticcurio<NA>2020-09-01 15:38:50+00:00comment
6The Impending NY Tech Apocalypse: Here's What ...<NA>gaoprea32011-09-27 22:43:27+00:00story
7<NA>Where would you relocate to? I'm assuming that...pavel_lishin<NA>2011-09-16 19:02:01+00:00comment
8Eureca beta is live. A place for your business...<NA>ricardos12012-10-15 13:09:32+00:00story
9<NA>It doesn’t work on Safari, and WebKit based br...archiewood<NA>2023-04-21 16:45:13+00:00comment
10<NA>I guess I don’t see the relevance. Vegans eat ...stevula<NA>2023-01-19 20:05:54+00:00comment
11<NA>I remember watching the American news media go...fareesh<NA>2019-06-17 19:49:17+00:00comment
12<NA>This article is incorrectly using the current ...stale2002<NA>2018-03-18 18:57:21+00:00comment
13<NA>In the firm I made my internship, we have to u...iserlohnmage<NA>2019-10-22 10:41:01+00:00comment
14<NA>The main reason it requires unsafe is for memo...comex<NA>2017-05-05 20:45:37+00:00comment
15Discord vs. IRC Rough Notes<NA>todsacerdoti482024-07-12 18:39:52+00:00story
16<NA>you have to auth again when you use apple pay.empath75<NA>2017-09-12 18:58:20+00:00comment
17<NA>It goes consumer grade, automotive, military, ...moftz<NA>2021-04-13 01:24:03+00:00comment
18<NA>I don&#x27;t have a link handy but the differe...KennyBlanken<NA>2022-05-13 16:08:38+00:00comment
19<NA>&gt; I don&#x27;t think the use case you menti...colanderman<NA>2017-09-28 05:16:06+00:00comment
20<NA>I think you need to watch it again, because yo...vladimirralev<NA>2018-12-07 11:25:52+00:00comment
21Oh dear: new Yahoo anti-spoofing measures brea...<NA>joshreads12014-04-08 13:29:50+00:00story
22How Much Warmer Was Your City in 2016?<NA>smb0612017-02-16 23:26:34+00:00story
23<NA>Except that they clearly never tried to incent...aenis<NA>2022-01-31 17:08:57+00:00comment
24Working Best at Coffee Shops<NA>GiraffeNecktie2492011-04-19 14:25:17+00:00story
\n", - "

25 rows × 6 columns

\n", - "
[3000 rows x 6 columns in total]" + "source": [ + "df.semantics.cluster_by(column='Product', output_column='Cluster ID', model=text_embedding_model, n_clusters=3)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zWIzYX3niouY" + }, + "source": [ + "This operator uses the embedding model to generate a vector for each value, and then applies the KMeans algorithm for clustering." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hgj8GoQhiouY" + }, + "source": [ + "# Performance Analyses" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EZomL0BciouY" + }, + "source": [ + "In this section, you will use BigQuery's public Hacker News dataset to perform some heavy work. We recommend that you read the code without executing it, to save time and money. The execution results are attached after each cell for your reference.\n", + "\n", + "First, load 3k rows from the table:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 880 + }, + "id": "wRR0SrcSiouY", + "outputId": "3b25f3a3-09c7-4396-9107-4aa4cdb4b963" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " title \\\n", + "0 \n", + "1 \n", + "2 \n", + "3 \n", + "4 \n", + "5 \n", + "6 The Impending NY Tech Apocalypse: Here's What ... \n", + "7 \n", + "8 Eureca beta is live. A place for your business... \n", + "9 \n", + "10 \n", + "11 \n", + "12 \n", + "13 \n", + "14 \n", + "15 Discord vs. IRC Rough Notes \n", + "16 \n", + "17 \n", + "18 \n", + "19 \n", + "20 \n", + "21 Oh dear: new Yahoo anti-spoofing measures brea... \n", + "22 How Much Warmer Was Your City in 2016? \n", + "23 \n", + "24 Working Best at Coffee Shops \n", + "\n", + " text by score \\\n", + "0 Well, most people aren't alcoholics, so I... slipframe \n", + "1 No, you don't really need a smartp... vetinari \n", + "2 It's for the late Paul Allen RIP. Should&... 
lsr_ssri \n", + "3 Yup they are dangerous. Be careful Donald Trump. Sven7 \n", + "4 Sure, it's totally reasonable. Just point... nicoburns \n", + "5 I wonder how long before special forces start ... autisticcurio \n", + "6 gaoprea 3 \n", + "7 Where would you relocate to? I'm assuming that... pavel_lishin \n", + "8 ricardos 1 \n", + "9 It doesn’t work on Safari, and WebKit based br... archiewood \n", + "10 I guess I don’t see the relevance. Vegans eat ... stevula \n", + "11 I remember watching the American news media go... fareesh \n", + "12 This article is incorrectly using the current ... stale2002 \n", + "13 In the firm I made my internship, we have to u... iserlohnmage \n", + "14 The main reason it requires unsafe is for memo... comex \n", + "15 todsacerdoti 48 \n", + "16 you have to auth again when you use apple pay. empath75 \n", + "17 It goes consumer grade, automotive, military, ... moftz \n", + "18 I don't have a link handy but the differe... KennyBlanken \n", + "19 > I don't think the use case you menti... colanderman \n", + "20 I think you need to watch it again, because yo... vladimirralev \n", + "21 joshreads 1 \n", + "22 smb06 1 \n", + "23 Except that they clearly never tried to incent... 
aenis \n", + "24 GiraffeNecktie 249 \n", + "\n", + " timestamp type \n", + "0 2021-06-26 02:37:56+00:00 comment \n", + "1 2023-04-19 15:56:34+00:00 comment \n", + "2 2018-10-16 01:07:55+00:00 comment \n", + "3 2015-08-10 16:05:54+00:00 comment \n", + "4 2020-10-05 11:20:51+00:00 comment \n", + "5 2020-09-01 15:38:50+00:00 comment \n", + "6 2011-09-27 22:43:27+00:00 story \n", + "7 2011-09-16 19:02:01+00:00 comment \n", + "8 2012-10-15 13:09:32+00:00 story \n", + "9 2023-04-21 16:45:13+00:00 comment \n", + "10 2023-01-19 20:05:54+00:00 comment \n", + "11 2019-06-17 19:49:17+00:00 comment \n", + "12 2018-03-18 18:57:21+00:00 comment \n", + "13 2019-10-22 10:41:01+00:00 comment \n", + "14 2017-05-05 20:45:37+00:00 comment \n", + "15 2024-07-12 18:39:52+00:00 story \n", + "16 2017-09-12 18:58:20+00:00 comment \n", + "17 2021-04-13 01:24:03+00:00 comment \n", + "18 2022-05-13 16:08:38+00:00 comment \n", + "19 2017-09-28 05:16:06+00:00 comment \n", + "20 2018-12-07 11:25:52+00:00 comment \n", + "21 2014-04-08 13:29:50+00:00 story \n", + "22 2017-02-16 23:26:34+00:00 story \n", + "23 2022-01-31 17:08:57+00:00 comment \n", + "24 2011-04-19 14:25:17+00:00 story \n", + "...\n", + "\n", + "[3000 rows x 6 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
titletextbyscoretimestamptype
0<NA>Well, most people aren&#x27;t alcoholics, so I...slipframe<NA>2021-06-26 02:37:56+00:00comment
1<NA>No, you don&#x27;t really <i>need</i> a smartp...vetinari<NA>2023-04-19 15:56:34+00:00comment
2<NA>It&#x27;s for the late Paul Allen RIP. Should&...lsr_ssri<NA>2018-10-16 01:07:55+00:00comment
3<NA>Yup they are dangerous. Be careful Donald Trump.Sven7<NA>2015-08-10 16:05:54+00:00comment
4<NA>Sure, it&#x27;s totally reasonable. Just point...nicoburns<NA>2020-10-05 11:20:51+00:00comment
5<NA>I wonder how long before special forces start ...autisticcurio<NA>2020-09-01 15:38:50+00:00comment
6The Impending NY Tech Apocalypse: Here's What ...<NA>gaoprea32011-09-27 22:43:27+00:00story
7<NA>Where would you relocate to? I'm assuming that...pavel_lishin<NA>2011-09-16 19:02:01+00:00comment
8Eureca beta is live. A place for your business...<NA>ricardos12012-10-15 13:09:32+00:00story
9<NA>It doesn’t work on Safari, and WebKit based br...archiewood<NA>2023-04-21 16:45:13+00:00comment
10<NA>I guess I don’t see the relevance. Vegans eat ...stevula<NA>2023-01-19 20:05:54+00:00comment
11<NA>I remember watching the American news media go...fareesh<NA>2019-06-17 19:49:17+00:00comment
12<NA>This article is incorrectly using the current ...stale2002<NA>2018-03-18 18:57:21+00:00comment
13<NA>In the firm I made my internship, we have to u...iserlohnmage<NA>2019-10-22 10:41:01+00:00comment
14<NA>The main reason it requires unsafe is for memo...comex<NA>2017-05-05 20:45:37+00:00comment
15Discord vs. IRC Rough Notes<NA>todsacerdoti482024-07-12 18:39:52+00:00story
16<NA>you have to auth again when you use apple pay.empath75<NA>2017-09-12 18:58:20+00:00comment
17<NA>It goes consumer grade, automotive, military, ...moftz<NA>2021-04-13 01:24:03+00:00comment
18<NA>I don&#x27;t have a link handy but the differe...KennyBlanken<NA>2022-05-13 16:08:38+00:00comment
19<NA>&gt; I don&#x27;t think the use case you menti...colanderman<NA>2017-09-28 05:16:06+00:00comment
20<NA>I think you need to watch it again, because yo...vladimirralev<NA>2018-12-07 11:25:52+00:00comment
21Oh dear: new Yahoo anti-spoofing measures brea...<NA>joshreads12014-04-08 13:29:50+00:00story
22How Much Warmer Was Your City in 2016?<NA>smb0612017-02-16 23:26:34+00:00story
23<NA>Except that they clearly never tried to incent...aenis<NA>2022-01-31 17:08:57+00:00comment
24Working Best at Coffee Shops<NA>GiraffeNecktie2492011-04-19 14:25:17+00:00story
\n", + "

25 rows × 6 columns

\n", + "
[3000 rows x 6 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 9 + } ], - "text/plain": [ - " title \\\n", - "0 \n", - "1 \n", - "2 \n", - "3 \n", - "4 \n", - "5 \n", - "6 The Impending NY Tech Apocalypse: Here's What ... \n", - "7 \n", - "8 Eureca beta is live. A place for your business... \n", - "9 \n", - "10 \n", - "11 \n", - "12 \n", - "13 \n", - "14 \n", - "15 Discord vs. IRC Rough Notes \n", - "16 \n", - "17 \n", - "18 \n", - "19 \n", - "20 \n", - "21 Oh dear: new Yahoo anti-spoofing measures brea... \n", - "22 How Much Warmer Was Your City in 2016? \n", - "23 \n", - "24 Working Best at Coffee Shops \n", - "\n", - " text by score \\\n", - "0 Well, most people aren't alcoholics, so I... slipframe \n", - "1 No, you don't really need a smartp... vetinari \n", - "2 It's for the late Paul Allen RIP. Should&... lsr_ssri \n", - "3 Yup they are dangerous. Be careful Donald Trump. Sven7 \n", - "4 Sure, it's totally reasonable. Just point... nicoburns \n", - "5 I wonder how long before special forces start ... autisticcurio \n", - "6 gaoprea 3 \n", - "7 Where would you relocate to? I'm assuming that... pavel_lishin \n", - "8 ricardos 1 \n", - "9 It doesn’t work on Safari, and WebKit based br... archiewood \n", - "10 I guess I don’t see the relevance. Vegans eat ... stevula \n", - "11 I remember watching the American news media go... fareesh \n", - "12 This article is incorrectly using the current ... stale2002 \n", - "13 In the firm I made my internship, we have to u... iserlohnmage \n", - "14 The main reason it requires unsafe is for memo... comex \n", - "15 todsacerdoti 48 \n", - "16 you have to auth again when you use apple pay. empath75 \n", - "17 It goes consumer grade, automotive, military, ... moftz \n", - "18 I don't have a link handy but the differe... KennyBlanken \n", - "19 > I don't think the use case you menti... colanderman \n", - "20 I think you need to watch it again, because yo... 
vladimirralev \n", - "21 joshreads 1 \n", - "22 smb06 1 \n", - "23 Except that they clearly never tried to incent... aenis \n", - "24 GiraffeNecktie 249 \n", - "\n", - " timestamp type \n", - "0 2021-06-26 02:37:56+00:00 comment \n", - "1 2023-04-19 15:56:34+00:00 comment \n", - "2 2018-10-16 01:07:55+00:00 comment \n", - "3 2015-08-10 16:05:54+00:00 comment \n", - "4 2020-10-05 11:20:51+00:00 comment \n", - "5 2020-09-01 15:38:50+00:00 comment \n", - "6 2011-09-27 22:43:27+00:00 story \n", - "7 2011-09-16 19:02:01+00:00 comment \n", - "8 2012-10-15 13:09:32+00:00 story \n", - "9 2023-04-21 16:45:13+00:00 comment \n", - "10 2023-01-19 20:05:54+00:00 comment \n", - "11 2019-06-17 19:49:17+00:00 comment \n", - "12 2018-03-18 18:57:21+00:00 comment \n", - "13 2019-10-22 10:41:01+00:00 comment \n", - "14 2017-05-05 20:45:37+00:00 comment \n", - "15 2024-07-12 18:39:52+00:00 story \n", - "16 2017-09-12 18:58:20+00:00 comment \n", - "17 2021-04-13 01:24:03+00:00 comment \n", - "18 2022-05-13 16:08:38+00:00 comment \n", - "19 2017-09-28 05:16:06+00:00 comment \n", - "20 2018-12-07 11:25:52+00:00 comment \n", - "21 2014-04-08 13:29:50+00:00 story \n", - "22 2017-02-16 23:26:34+00:00 story \n", - "23 2022-01-31 17:08:57+00:00 comment \n", - "24 2011-04-19 14:25:17+00:00 story \n", - "...\n", - "\n", - "[3000 rows x 6 columns]" - ] - }, - "execution_count": 32, - "metadata": {}, - "output_type": "execute_result" - } - ], - "source": [ - "hacker_news = bpd.read_gbq(\"bigquery-public-data.hacker_news.full\")[['title', 'text', 'by', 'score', 'timestamp', 'type']].head(3000)\n", - "hacker_news" - ] - }, - { - "cell_type": "code", - "execution_count": 33, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "This operation will process about 3000 rows. Proceed? 
[Y/n]\n" - ] - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "/usr/local/google/home/sycai/src/python-bigquery-dataframes/bigframes/core/__init__.py:109: PreviewWarning: Interpreting JSON column(s) as StringDtype and pyarrow.large_string. This behavior may change in future versions.\n", - " warnings.warn(\n" - ] - }, - { - "data": { - "text/html": [ - "
\n", - "\n", - "\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " 
\n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - " \n", - "
titletextbyscoretimestamptype
24Working Best at Coffee Shops<NA>GiraffeNecktie2492011-04-19 14:25:17+00:00story
98<NA>i resisted switching to chrome for months beca...catshirt<NA>2011-04-06 08:02:24+00:00comment
137FDA reverses marketing ban on Juul e-cigarettes<NA>anigbrowl22024-06-06 16:42:40+00:00story
188<NA>I think it&#x27;s more than hazing. It may be ...bayesianhorse<NA>2015-06-18 16:42:53+00:00comment
209<NA>I like the idea of moving that arrow the way h...rattray<NA>2015-06-08 02:15:30+00:00comment
228<NA>I don&#x27;t understand why a beginner would s...wolco<NA>2019-02-03 14:35:43+00:00comment
290<NA>I leaerned more with one minute of this than a...agumonkey<NA>2016-07-16 06:19:39+00:00comment
303<NA>I've suggested a <i>rationale</i> for the tabo...mechanical_fish<NA>2008-12-17 04:42:02+00:00comment
312<NA>Do you have any reference for this?<p>I&#x27;m...banashark<NA>2023-11-13 19:57:00+00:00comment
322<NA>Default search scope is an option in the Finde...kitsunesoba<NA>2017-08-13 17:15:19+00:00comment
391<NA>Orthogonality and biology aren&#x27;t friends.agumonkey<NA>2016-04-24 16:33:41+00:00comment
396<NA>I chose some random physics book that was good...prawn<NA>2011-03-27 22:29:51+00:00comment
424<NA>Seeing this get huge on Twitter. It&#x27;s the...shenanigoat<NA>2016-01-09 03:04:22+00:00comment
428<NA>Looking through the comments there are a numbe...moomin<NA>2024-10-01 14:37:04+00:00comment
429<NA>Legacy media is a tough business. GBTC is payi...arcticbull<NA>2021-04-16 16:30:33+00:00comment
436<NA>Same thing if you sell unsafe food, yet we hav...jabradoodle<NA>2023-08-03 20:47:52+00:00comment
438<NA>There was briefly a thing called HSCSD (&quot;...LeoPanthera<NA>2019-02-11 19:49:29+00:00comment
446<NA>&gt; This article is a bit comical to read and...lapcat<NA>2023-01-02 16:00:49+00:00comment
453<NA>Large positions are most likely sold off in sm...meowkit<NA>2021-01-27 23:22:48+00:00comment
507<NA>A US-based VPN (or really any VPN) is only goi...RandomBacon<NA>2019-04-05 00:58:58+00:00comment
543<NA><a href=\"https:&#x2F;&#x2F;codeberg.org&#x2F;A...ElectronBadger<NA>2023-12-13 08:13:15+00:00comment
565<NA>It’s much harder for people without hands to w...Aeolun<NA>2024-05-03 11:58:13+00:00comment
612<NA>So by using ADMIN_SL0T instead was it just set...minitoar<NA>2021-03-05 16:07:56+00:00comment
660<NA>Outstanding!cafard<NA>2022-06-09 09:51:54+00:00comment
673<NA>On the other hand, something can be said for &...babby<NA>2013-08-12 00:31:02+00:00comment
\n", - "

25 rows × 6 columns

\n", - "
[123 rows x 6 columns in total]" + "source": [ + "hacker_news = bpd.read_gbq(\"bigquery-public-data.hacker_news.full\")[['title', 'text', 'by', 'score', 'timestamp', 'type']].head(3000)\n", + "hacker_news" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3e94DPOdiouY" + }, + "source": [ + "Then, keep only the rows that have text content:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "mQl8hc1biouY", + "outputId": "2b4ffa85-9d95-4a20-9040-0420c67da2d4" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "2556" + ] + }, + "metadata": {}, + "execution_count": 10 + } ], - "text/plain": [ - " title \\\n", - "24 Working Best at Coffee Shops \n", - "98 \n", - "137 FDA reverses marketing ban on Juul e-cigarettes \n", - "188 \n", - "209 \n", - "228 \n", - "290 \n", - "303 \n", - "312 \n", - "322 \n", - "391 \n", - "396 \n", - "424 \n", - "428 \n", - "429 \n", - "436 \n", - "438 \n", - "446 \n", - "453 \n", - "507 \n", - "543 \n", - "565 \n", - "612 \n", - "660 \n", - "673 \n", - "\n", - " text by \\\n", - "24 GiraffeNecktie \n", - "98 i resisted switching to chrome for months beca... catshirt \n", - "137 anigbrowl \n", - "188 I think it's more than hazing. It may be ... bayesianhorse \n", - "209 I like the idea of moving that arrow the way h... rattray \n", - "228 I don't understand why a beginner would s... wolco \n", - "290 I leaerned more with one minute of this than a... agumonkey \n", - "303 I've suggested a rationale for the tabo... mechanical_fish \n", - "312 Do you have any reference for this?

I'm... banashark \n", - "322 Default search scope is an option in the Finde... kitsunesoba \n", - "391 Orthogonality and biology aren't friends. agumonkey \n", - "396 I chose some random physics book that was good... prawn \n", - "424 Seeing this get huge on Twitter. It's the... shenanigoat \n", - "428 Looking through the comments there are a numbe... moomin \n", - "429 Legacy media is a tough business. GBTC is payi... arcticbull \n", - "436 Same thing if you sell unsafe food, yet we hav... jabradoodle \n", - "438 There was briefly a thing called HSCSD ("... LeoPanthera \n", - "446 > This article is a bit comical to read and... lapcat \n", - "453 Large positions are most likely sold off in sm... meowkit \n", - "507 A US-based VPN (or really any VPN) is only goi... RandomBacon \n", - "543 2011-04-06 08:02:24+00:00 comment \n", - "137 2 2024-06-06 16:42:40+00:00 story \n", - "188 2015-06-18 16:42:53+00:00 comment \n", - "209 2015-06-08 02:15:30+00:00 comment \n", - "228 2019-02-03 14:35:43+00:00 comment \n", - "290 2016-07-16 06:19:39+00:00 comment \n", - "303 2008-12-17 04:42:02+00:00 comment \n", - "312 2023-11-13 19:57:00+00:00 comment \n", - "322 2017-08-13 17:15:19+00:00 comment \n", - "391 2016-04-24 16:33:41+00:00 comment \n", - "396 2011-03-27 22:29:51+00:00 comment \n", - "424 2016-01-09 03:04:22+00:00 comment \n", - "428 2024-10-01 14:37:04+00:00 comment \n", - "429 2021-04-16 16:30:33+00:00 comment \n", - "436 2023-08-03 20:47:52+00:00 comment \n", - "438 2019-02-11 19:49:29+00:00 comment \n", - "446 2023-01-02 16:00:49+00:00 comment \n", - "453 2021-01-27 23:22:48+00:00 comment \n", - "507 2019-04-05 00:58:58+00:00 comment \n", - "543 2023-12-13 08:13:15+00:00 comment \n", - "565 2024-05-03 11:58:13+00:00 comment \n", - "612 2021-03-05 16:07:56+00:00 comment \n", - "660 2022-06-09 09:51:54+00:00 comment \n", - "673 2013-08-12 00:31:02+00:00 comment \n", - "...\n", - "\n", - "[123 rows x 6 columns]" - ] - }, - "execution_count": 33, - "metadata": {}, - 
"output_type": "execute_result" + "source": [ + "hacker_news_with_texts = hacker_news[hacker_news['text'].isnull() == False]\n", + "len(hacker_news_with_texts)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JWalDtLDiouZ" + }, + "source": [ + "You can get an idea of the input token length by calculating the average string length." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "PZeg4LCUiouZ", + "outputId": "05b67cac-6b3d-42ef-d6d6-b578a9734f4c" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "390.05125195618155" + ] + }, + "metadata": {}, + "execution_count": 11 + } + ], + "source": [ + "hacker_news_with_texts['text'].str.len().mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2IXqskHHiouZ" + }, + "source": [ + "**Optional**: You can raise the confirmation threshold for a smoother experience." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EpjXQ4FViouZ" + }, + "outputs": [], + "source": [ + "if Version(bigframes.__version__) >= Version(\"1.31.0\"):\n", + " bigframes.options.compute.semantic_ops_confirmation_threshold = 5000" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SYFB-X1RiouZ" + }, + "source": [ + "Now it's LLM's turn. You want to keep only the rows whose texts are talking about iPhone. This will take several minutes to finish." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 253 + }, + "id": "rditQlmoiouZ", + "outputId": "2b44dcbf-2ef5-4119-ca05-9b082db9c0c1" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " title text by \\\n", + "9 It doesn’t work on Safari, and WebKit based br... archiewood \n", + "420 Well last time I got angry down votes for sayi... 
drieddust \n", + "815 New iPhone should be announced on September. L... meerita \n", + "1516 Why would this take a week? i(phone)OS was ori... TheOtherHobbes \n", + "1563 >or because Apple drama brings many clicks?... weberer \n", + "\n", + " score timestamp type \n", + "9 2023-04-21 16:45:13+00:00 comment \n", + "420 2021-01-11 19:27:27+00:00 comment \n", + "815 2019-07-30 20:54:42+00:00 comment \n", + "1516 2021-06-08 09:25:24+00:00 comment \n", + "1563 2022-09-05 13:16:02+00:00 comment \n", + "\n", + "[5 rows x 6 columns]" + ], + "text/html": [ + "

\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
titletextbyscoretimestamptype
9<NA>It doesn’t work on Safari, and WebKit based br...archiewood<NA>2023-04-21 16:45:13+00:00comment
420<NA>Well last time I got angry down votes for sayi...drieddust<NA>2021-01-11 19:27:27+00:00comment
815<NA>New iPhone should be announced on September. L...meerita<NA>2019-07-30 20:54:42+00:00comment
1516<NA>Why would this take a week? i(phone)OS was ori...TheOtherHobbes<NA>2021-06-08 09:25:24+00:00comment
1563<NA>&gt;or because Apple drama brings many clicks?...weberer<NA>2022-09-05 13:16:02+00:00comment
\n", + "

5 rows × 6 columns

\n", + "
[5 rows x 6 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 13 + } + ], + "source": [ + "iphone_comments = hacker_news_with_texts.semantics.filter(\"The {text} is mainly focused on iPhone\", gemini_model)\n", + "iphone_comments" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yl24sJFIiouZ" + }, + "source": [ + "The performance of the semantic operators depends on the length of your input as well as your quota. Here are our benchmarks for running the previous operation over data of different sizes. Here are the estimates supposing your quota is [the default 200 requests per minute](https://cloud.google.com/vertex-ai/generative-ai/docs/quotas):\n", + "\n", + "* 800 Rows -> ~4m\n", + "* 2550 Rows -> ~13m\n", + "* 8500 Rows -> ~40m\n", + "\n", + "These numbers can give you a general idea of how fast the operators run." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eo4nfISuiouZ" + }, + "source": [ + "Now, use LLM to summarize the sentiments towards iPhone:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 253 + }, + "id": "IlKBrNxUiouZ", + "outputId": "818d01e4-1cdf-42a2-9e02-61c4736a8905" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " title text by \\\n", + "9 It doesn’t work on Safari, and WebKit based br... archiewood \n", + "420 Well last time I got angry down votes for sayi... drieddust \n", + "815 New iPhone should be announced on September. L... meerita \n", + "1516 Why would this take a week? i(phone)OS was ori... TheOtherHobbes \n", + "1563 >or because Apple drama brings many clicks?... 
weberer \n", + "\n", + " score timestamp type \\\n", + "9 2023-04-21 16:45:13+00:00 comment \n", + "420 2021-01-11 19:27:27+00:00 comment \n", + "815 2019-07-30 20:54:42+00:00 comment \n", + "1516 2021-06-08 09:25:24+00:00 comment \n", + "1563 2022-09-05 13:16:02+00:00 comment \n", + "\n", + " sentiment \n", + "9 Frustrated, but hopeful. \n", + " \n", + "420 Frustrated and angry. \n", + " \n", + "815 Excited anticipation. \n", + " \n", + "1516 Frustrated, critical, obvious. \n", + " \n", + "1563 Negative, clickbait, Apple. \n", + " \n", + "\n", + "[5 rows x 7 columns]" + ], + "text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
titletextbyscoretimestamptypesentiment
9<NA>It doesn’t work on Safari, and WebKit based br...archiewood<NA>2023-04-21 16:45:13+00:00commentFrustrated, but hopeful.
420<NA>Well last time I got angry down votes for sayi...drieddust<NA>2021-01-11 19:27:27+00:00commentFrustrated and angry.
815<NA>New iPhone should be announced on September. L...meerita<NA>2019-07-30 20:54:42+00:00commentExcited anticipation.
1516<NA>Why would this take a week? i(phone)OS was ori...TheOtherHobbes<NA>2021-06-08 09:25:24+00:00commentFrustrated, critical, obvious.
1563<NA>&gt;or because Apple drama brings many clicks?...weberer<NA>2022-09-05 13:16:02+00:00commentNegative, clickbait, Apple.
\n", + "

5 rows × 7 columns

\n", + "
[5 rows x 7 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 14 + } + ], + "source": [ + "iphone_comments.semantics.map(\"Summarize the sentiment of the {text}. Your answer should have at most 3 words\", output_column=\"sentiment\", model=gemini_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y7_16T2xiouZ" + }, + "source": [ + "Here is another example: count the number of rows whose authors have animals in their names." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 880 + }, + "id": "CbGwc_uXiouZ", + "outputId": "138acca0-7fb9-495a-e797-0d42495d65e6" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " title \\\n", + "0 \n", + "1 \n", + "2 \n", + "3 \n", + "4 \n", + "5 \n", + "6 The Impending NY Tech Apocalypse: Here's What ... \n", + "7 \n", + "8 Eureca beta is live. A place for your business... \n", + "9 \n", + "10 \n", + "11 \n", + "12 \n", + "13 \n", + "14 \n", + "15 Discord vs. IRC Rough Notes \n", + "16 \n", + "17 \n", + "18 \n", + "19 \n", + "20 \n", + "21 Oh dear: new Yahoo anti-spoofing measures brea... \n", + "22 How Much Warmer Was Your City in 2016? \n", + "23 \n", + "24 Working Best at Coffee Shops \n", + "\n", + " text by score \\\n", + "0 Well, most people aren't alcoholics, so I... slipframe \n", + "1 No, you don't really need a smartp... vetinari \n", + "2 It's for the late Paul Allen RIP. Should&... lsr_ssri \n", + "3 Yup they are dangerous. Be careful Donald Trump. Sven7 \n", + "4 Sure, it's totally reasonable. Just point... nicoburns \n", + "5 I wonder how long before special forces start ... autisticcurio \n", + "6 gaoprea 3 \n", + "7 Where would you relocate to? I'm assuming that... pavel_lishin \n", + "8 ricardos 1 \n", + "9 It doesn’t work on Safari, and WebKit based br... archiewood \n", + "10 I guess I don’t see the relevance. Vegans eat ... 
stevula \n", + "11 I remember watching the American news media go... fareesh \n", + "12 This article is incorrectly using the current ... stale2002 \n", + "13 In the firm I made my internship, we have to u... iserlohnmage \n", + "14 The main reason it requires unsafe is for memo... comex \n", + "15 todsacerdoti 48 \n", + "16 you have to auth again when you use apple pay. empath75 \n", + "17 It goes consumer grade, automotive, military, ... moftz \n", + "18 I don't have a link handy but the differe... KennyBlanken \n", + "19 > I don't think the use case you menti... colanderman \n", + "20 I think you need to watch it again, because yo... vladimirralev \n", + "21 joshreads 1 \n", + "22 smb06 1 \n", + "23 Except that they clearly never tried to incent... aenis \n", + "24 GiraffeNecktie 249 \n", + "\n", + " timestamp type \n", + "0 2021-06-26 02:37:56+00:00 comment \n", + "1 2023-04-19 15:56:34+00:00 comment \n", + "2 2018-10-16 01:07:55+00:00 comment \n", + "3 2015-08-10 16:05:54+00:00 comment \n", + "4 2020-10-05 11:20:51+00:00 comment \n", + "5 2020-09-01 15:38:50+00:00 comment \n", + "6 2011-09-27 22:43:27+00:00 story \n", + "7 2011-09-16 19:02:01+00:00 comment \n", + "8 2012-10-15 13:09:32+00:00 story \n", + "9 2023-04-21 16:45:13+00:00 comment \n", + "10 2023-01-19 20:05:54+00:00 comment \n", + "11 2019-06-17 19:49:17+00:00 comment \n", + "12 2018-03-18 18:57:21+00:00 comment \n", + "13 2019-10-22 10:41:01+00:00 comment \n", + "14 2017-05-05 20:45:37+00:00 comment \n", + "15 2024-07-12 18:39:52+00:00 story \n", + "16 2017-09-12 18:58:20+00:00 comment \n", + "17 2021-04-13 01:24:03+00:00 comment \n", + "18 2022-05-13 16:08:38+00:00 comment \n", + "19 2017-09-28 05:16:06+00:00 comment \n", + "20 2018-12-07 11:25:52+00:00 comment \n", + "21 2014-04-08 13:29:50+00:00 story \n", + "22 2017-02-16 23:26:34+00:00 story \n", + "23 2022-01-31 17:08:57+00:00 comment \n", + "24 2011-04-19 14:25:17+00:00 story \n", + "...\n", + "\n", + "[3000 rows x 6 columns]" + ], + 
"text/html": [ + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
titletextbyscoretimestamptype
0<NA>Well, most people aren&#x27;t alcoholics, so I...slipframe<NA>2021-06-26 02:37:56+00:00comment
1<NA>No, you don&#x27;t really <i>need</i> a smartp...vetinari<NA>2023-04-19 15:56:34+00:00comment
2<NA>It&#x27;s for the late Paul Allen RIP. Should&...lsr_ssri<NA>2018-10-16 01:07:55+00:00comment
3<NA>Yup they are dangerous. Be careful Donald Trump.Sven7<NA>2015-08-10 16:05:54+00:00comment
4<NA>Sure, it&#x27;s totally reasonable. Just point...nicoburns<NA>2020-10-05 11:20:51+00:00comment
5<NA>I wonder how long before special forces start ...autisticcurio<NA>2020-09-01 15:38:50+00:00comment
6The Impending NY Tech Apocalypse: Here's What ...<NA>gaoprea32011-09-27 22:43:27+00:00story
7<NA>Where would you relocate to? I'm assuming that...pavel_lishin<NA>2011-09-16 19:02:01+00:00comment
8Eureca beta is live. A place for your business...<NA>ricardos12012-10-15 13:09:32+00:00story
9<NA>It doesn’t work on Safari, and WebKit based br...archiewood<NA>2023-04-21 16:45:13+00:00comment
10<NA>I guess I don’t see the relevance. Vegans eat ...stevula<NA>2023-01-19 20:05:54+00:00comment
11<NA>I remember watching the American news media go...fareesh<NA>2019-06-17 19:49:17+00:00comment
12<NA>This article is incorrectly using the current ...stale2002<NA>2018-03-18 18:57:21+00:00comment
13<NA>In the firm I made my internship, we have to u...iserlohnmage<NA>2019-10-22 10:41:01+00:00comment
14<NA>The main reason it requires unsafe is for memo...comex<NA>2017-05-05 20:45:37+00:00comment
15Discord vs. IRC Rough Notes<NA>todsacerdoti482024-07-12 18:39:52+00:00story
16<NA>you have to auth again when you use apple pay.empath75<NA>2017-09-12 18:58:20+00:00comment
17<NA>It goes consumer grade, automotive, military, ...moftz<NA>2021-04-13 01:24:03+00:00comment
18<NA>I don&#x27;t have a link handy but the differe...KennyBlanken<NA>2022-05-13 16:08:38+00:00comment
19<NA>&gt; I don&#x27;t think the use case you menti...colanderman<NA>2017-09-28 05:16:06+00:00comment
20<NA>I think you need to watch it again, because yo...vladimirralev<NA>2018-12-07 11:25:52+00:00comment
21Oh dear: new Yahoo anti-spoofing measures brea...<NA>joshreads12014-04-08 13:29:50+00:00story
22How Much Warmer Was Your City in 2016?<NA>smb0612017-02-16 23:26:34+00:00story
23<NA>Except that they clearly never tried to incent...aenis<NA>2022-01-31 17:08:57+00:00comment
24Working Best at Coffee Shops<NA>GiraffeNecktie2492011-04-19 14:25:17+00:00story
\n", + "

25 rows × 6 columns

\n", + "
[3000 rows x 6 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 7 + } + ], + "source": [ + "hacker_news = bpd.read_gbq(\"bigquery-public-data.hacker_news.full\")[['title', 'text', 'by', 'score', 'timestamp', 'type']].head(3000)\n", + "hacker_news" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 880 + }, + "id": "9dzU8SNziouZ", + "outputId": "da8815c1-c411-4afc-d1ca-5e44c75b5b48" + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " title \\\n", + "24 Working Best at Coffee Shops \n", + "98 \n", + "137 FDA reverses marketing ban on Juul e-cigarettes \n", + "188 \n", + "209 \n", + "228 \n", + "290 \n", + "303 \n", + "312 \n", + "322 \n", + "391 \n", + "396 \n", + "424 \n", + "428 \n", + "429 \n", + "436 \n", + "438 \n", + "446 \n", + "453 \n", + "507 \n", + "543 \n", + "565 \n", + "612 \n", + "660 \n", + "673 \n", + "\n", + " text by \\\n", + "24 GiraffeNecktie \n", + "98 i resisted switching to chrome for months beca... catshirt \n", + "137 anigbrowl \n", + "188 I think it's more than hazing. It may be ... bayesianhorse \n", + "209 I like the idea of moving that arrow the way h... rattray \n", + "228 I don't understand why a beginner would s... wolco \n", + "290 I leaerned more with one minute of this than a... agumonkey \n", + "303 I've suggested a rationale for the tabo... mechanical_fish \n", + "312 Do you have any reference for this?

I'm... banashark \n", + "322 Default search scope is an option in the Finde... kitsunesoba \n", + "391 Orthogonality and biology aren't friends. agumonkey \n", + "396 I chose some random physics book that was good... prawn \n", + "424 Seeing this get huge on Twitter. It's the... shenanigoat \n", + "428 Looking through the comments there are a numbe... moomin \n", + "429 Legacy media is a tough business. GBTC is payi... arcticbull \n", + "436 Same thing if you sell unsafe food, yet we hav... jabradoodle \n", + "438 There was briefly a thing called HSCSD ("... LeoPanthera \n", + "446 > This article is a bit comical to read and... lapcat \n", + "453 Large positions are most likely sold off in sm... meowkit \n", + "507 A US-based VPN (or really any VPN) is only goi... RandomBacon \n", + "543 2011-04-06 08:02:24+00:00 comment \n", + "137 2 2024-06-06 16:42:40+00:00 story \n", + "188 2015-06-18 16:42:53+00:00 comment \n", + "209 2015-06-08 02:15:30+00:00 comment \n", + "228 2019-02-03 14:35:43+00:00 comment \n", + "290 2016-07-16 06:19:39+00:00 comment \n", + "303 2008-12-17 04:42:02+00:00 comment \n", + "312 2023-11-13 19:57:00+00:00 comment \n", + "322 2017-08-13 17:15:19+00:00 comment \n", + "391 2016-04-24 16:33:41+00:00 comment \n", + "396 2011-03-27 22:29:51+00:00 comment \n", + "424 2016-01-09 03:04:22+00:00 comment \n", + "428 2024-10-01 14:37:04+00:00 comment \n", + "429 2021-04-16 16:30:33+00:00 comment \n", + "436 2023-08-03 20:47:52+00:00 comment \n", + "438 2019-02-11 19:49:29+00:00 comment \n", + "446 2023-01-02 16:00:49+00:00 comment \n", + "453 2021-01-27 23:22:48+00:00 comment \n", + "507 2019-04-05 00:58:58+00:00 comment \n", + "543 2023-12-13 08:13:15+00:00 comment \n", + "565 2024-05-03 11:58:13+00:00 comment \n", + "612 2021-03-05 16:07:56+00:00 comment \n", + "660 2022-06-09 09:51:54+00:00 comment \n", + "673 2013-08-12 00:31:02+00:00 comment \n", + "...\n", + "\n", + "[123 rows x 6 columns]" + ], + "text/html": [ + "

\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " 
\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
titletextbyscoretimestamptype
24Working Best at Coffee Shops<NA>GiraffeNecktie2492011-04-19 14:25:17+00:00story
98<NA>i resisted switching to chrome for months beca...catshirt<NA>2011-04-06 08:02:24+00:00comment
137FDA reverses marketing ban on Juul e-cigarettes<NA>anigbrowl22024-06-06 16:42:40+00:00story
188<NA>I think it&#x27;s more than hazing. It may be ...bayesianhorse<NA>2015-06-18 16:42:53+00:00comment
209<NA>I like the idea of moving that arrow the way h...rattray<NA>2015-06-08 02:15:30+00:00comment
228<NA>I don&#x27;t understand why a beginner would s...wolco<NA>2019-02-03 14:35:43+00:00comment
290<NA>I leaerned more with one minute of this than a...agumonkey<NA>2016-07-16 06:19:39+00:00comment
303<NA>I've suggested a <i>rationale</i> for the tabo...mechanical_fish<NA>2008-12-17 04:42:02+00:00comment
312<NA>Do you have any reference for this?<p>I&#x27;m...banashark<NA>2023-11-13 19:57:00+00:00comment
322<NA>Default search scope is an option in the Finde...kitsunesoba<NA>2017-08-13 17:15:19+00:00comment
391<NA>Orthogonality and biology aren&#x27;t friends.agumonkey<NA>2016-04-24 16:33:41+00:00comment
396<NA>I chose some random physics book that was good...prawn<NA>2011-03-27 22:29:51+00:00comment
424<NA>Seeing this get huge on Twitter. It&#x27;s the...shenanigoat<NA>2016-01-09 03:04:22+00:00comment
428<NA>Looking through the comments there are a numbe...moomin<NA>2024-10-01 14:37:04+00:00comment
429<NA>Legacy media is a tough business. GBTC is payi...arcticbull<NA>2021-04-16 16:30:33+00:00comment
436<NA>Same thing if you sell unsafe food, yet we hav...jabradoodle<NA>2023-08-03 20:47:52+00:00comment
438<NA>There was briefly a thing called HSCSD (&quot;...LeoPanthera<NA>2019-02-11 19:49:29+00:00comment
446<NA>&gt; This article is a bit comical to read and...lapcat<NA>2023-01-02 16:00:49+00:00comment
453<NA>Large positions are most likely sold off in sm...meowkit<NA>2021-01-27 23:22:48+00:00comment
507<NA>A US-based VPN (or really any VPN) is only goi...RandomBacon<NA>2019-04-05 00:58:58+00:00comment
543<NA><a href=\"https:&#x2F;&#x2F;codeberg.org&#x2F;A...ElectronBadger<NA>2023-12-13 08:13:15+00:00comment
565<NA>It’s much harder for people without hands to w...Aeolun<NA>2024-05-03 11:58:13+00:00comment
612<NA>So by using ADMIN_SL0T instead was it just set...minitoar<NA>2021-03-05 16:07:56+00:00comment
660<NA>Outstanding!cafard<NA>2022-06-09 09:51:54+00:00comment
673<NA>On the other hand, something can be said for &...babby<NA>2013-08-12 00:31:02+00:00comment
\n", + "

25 rows × 6 columns

\n", + "
[123 rows x 6 columns in total]" + ] + }, + "metadata": {}, + "execution_count": 13 + } + ], + "source": [ + "hacker_news.semantics.filter(\"{by} contains animal name\", model=gemini_model)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3bpkaspoiouZ" + }, + "source": [ + "Here are the runtime numbers with 500 requests per minute [raised quota](https://cloud.google.com/vertex-ai/generative-ai/docs/quotas):\n", + "* 3000 rows -> ~6m\n", + "* 10000 rows -> ~26m" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.9" + }, + "colab": { + "provenance": [], + "include_colab_link": true } - ], - "source": [ - "hacker_news.semantics.filter(\"{by} contains animal name\", model=gemini_model)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here are our performance numbers:\n", - "* 3000 rows -> 6m 9.2s\n", - "* 10000 rows -> 26m 42.4s" - ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "venv", - "language": "python", - "name": "python3" }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.9" - } - }, - "nbformat": 4, - "nbformat_minor": 2 + "nbformat": 4, + "nbformat_minor": 0 }