From c45a1663a72b076b8a6e736b4116efe4f4e9ac1a Mon Sep 17 00:00:00 2001
From: BrianShen <96436972+brianshen3@users.noreply.github.com>
Date: Tue, 5 Nov 2024 09:21:57 -0500
Subject: [PATCH] chore: Upload Embeddings Doc (#719)

* Added code example to NLS upload
* created uploading embeddings doc
* Fixed URL link issues and updated titles
* Created instructions for upload embeddings
* Updated uploaded embeddings to show embedding name
---
 .../images/upload-embeddings-enable.gif       |  3 +
 .../set-up-natural-language-search.md         | 11 ++-
 .../advanced-usage/upload-embeddings.md       | 67 +++++++++++++++++++
 mkdocs.yml                                    |  1 +
 4 files changed, 76 insertions(+), 6 deletions(-)
 create mode 100644 docs/assets/images/upload-embeddings-enable.gif
 create mode 100644 docs/dataset/advanced-usage/upload-embeddings.md

diff --git a/docs/assets/images/upload-embeddings-enable.gif b/docs/assets/images/upload-embeddings-enable.gif
new file mode 100644
index 000000000..a100ed7f3
--- /dev/null
+++ b/docs/assets/images/upload-embeddings-enable.gif
@@ -0,0 +1,3 @@
+version https://git-lfs.github.com/spec/v1
+oid sha256:cbe6bf2ca5d8af853199132634939d76d082cf10b0dabd30b2f778664be58734
+size 9617273
diff --git a/docs/automations/set-up-natural-language-search.md b/docs/automations/set-up-natural-language-search.md
index f79c5a2b6..d50e8e9d1 100644
--- a/docs/automations/set-up-natural-language-search.md
+++ b/docs/automations/set-up-natural-language-search.md
@@ -37,12 +37,11 @@ In this document, we will go over main components of the below
 and steps you need to take to tailor it for your application.
 
 !!! Example
-    The [`kolena`](https://github.com/kolenaIO/kolena) repository contains a runnable
-    [example](https://github.com/kolenaIO/kolena/tree/trunk/examples/dataset/search_embeddings) for
-    embeddings extraction and
-    upload. This builds off the data uploaded in the
+    The [`kolena`](https://github.com/kolenaIO/kolena) repository includes a
+    [code example](https://github.com/kolenaIO/kolena/tree/trunk/examples/dataset/search_embeddings) for
+    extracting and uploading embeddings. It builds on data from the
     [semantic_segmentation](https://github.com/kolenaIO/kolena/tree/trunk/examples/dataset/semantic_segmentation)
-    example dataset, and is best run after this data has been uploaded to your Kolena environment.
+    example dataset, so ensure the dataset is uploaded to your Kolena environment before running the code example.
 
 Uploading embeddings to Kolena can be done in four simple steps:
 
@@ -56,7 +55,7 @@ Uploading embeddings to Kolena can be done in four simple steps:
 The package can be installed via `pip` or `uv` and requires use of your kolena token
 which can be created on the [:kolena-developer-16: Developer](https://app.kolena.com/redirect/developer) page.
 
-We first [retrieve and set](../installing-kolena.md#initialization) our `KOLENA_TOKEN` environment variable.
+We first [retrieve and set](../installing-kolena.md) our `KOLENA_TOKEN` environment variable.
 This is used by the uploader for authentication against your Kolena instance.
 
 ```shell
diff --git a/docs/dataset/advanced-usage/upload-embeddings.md b/docs/dataset/advanced-usage/upload-embeddings.md
new file mode 100644
index 000000000..c446a6f84
--- /dev/null
+++ b/docs/dataset/advanced-usage/upload-embeddings.md
@@ -0,0 +1,67 @@
+---
+icon: kolena/classification-16
+---
+
+# :kolena-classification-16: Uploading Custom Embeddings
+
+This guide explains how to upload your own embeddings to Kolena using the Kolena SDK.
+Please ensure you have the SDK installed.
+[Instructions for installing the SDK are available here.](https://docs.kolena.com/installing-kolena/)
+
+## Step 1: Import the Embedding Upload Function
+
+To upload embeddings, use the `upload_dataset_embeddings` function from Kolena. You can import
+it with the following code:
+```python
+from kolena._experimental.search import upload_dataset_embeddings
+```
+
+## Step 2: Prepare the Required DataFrame
+
+The DataFrame you upload should have:
+
+- Unique Identifier Columns: This is typically the `locator` field, which serves as a unique identifier for each entry.
+  It can also be a combination of several ID fields, such as `locator` and `person_id`.
+- Embedding Column: Each embedding must have the same size across all rows.
+
+### Example code
+
+Here’s an example where we download the `instance-seg` dataset from Kolena,
+then add a placeholder embedding (a zero-filled array):
+```python
+import numpy as np
+from kolena.dataset import download_dataset
+
+dataset = "instance-seg"
+df_embedding = download_dataset(dataset)[["locator"]].copy()  # keep only the ID column, as a DataFrame
+df_embedding["embedding"] = [np.zeros((1, 512))] * len(df_embedding)  # one placeholder array per row
+```
+!!! Note
+    Replace the placeholder embeddings with embeddings generated from your own embedding model.
+
+## Step 3: Upload the DataFrame using the Kolena SDK
+
+With the DataFrame prepared, use the `upload_dataset_embeddings` function to upload it to Kolena.
+
+```python
+upload_dataset_embeddings(dataset_name="instance-seg", key="my-embedding-model", df_embedding=df_embedding)
+```
+
+The `dataset_name` parameter specifies the target dataset where the embeddings will be uploaded.
+The `key` parameter is a unique identifier for the embeddings being uploaded, allowing multiple embeddings
+ to be associated with the same dataset. Finally, `df_embedding` is the DataFrame object
+ prepared in Step 2 that contains the data you want to upload.
+
+## Step 4: Verify Your Embeddings in Kolena Studio
+
+To confirm that the embeddings uploaded successfully:
+
+- Open Kolena Studio.
+- In the top right corner, click on "Off" beside the embeddings toggle to enable the embeddings view.
+- Select the embedding key passed to the upload function. In the example it was `my-embedding-model`.
+- Choose from the visualization options: UMAP, t-SNE, or PCA.
+
+![Enabling Embeddings on Studio](../../assets/images/upload-embeddings-enable.gif)
+
+If you have trouble creating embeddings, refer to our example code for
+[generating image embeddings and uploading to Kolena](https://github.com/kolenaIO/kolena/tree/trunk/examples/dataset/search_embeddings).
diff --git a/mkdocs.yml b/mkdocs.yml
index 7ac1a967d..df6a9fa6e 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -25,6 +25,7 @@ nav:
  - Automatically Extract Image Properties: automations/extract-image-metadata.md
  - Automatically Extract Bounding Box Properties: automations/extract-bounding-box-metadata.md
  - Setting Up Natural Language Search: automations/set-up-natural-language-search.md
+ - Uploading Custom Embeddings: dataset/advanced-usage/upload-embeddings.md
  - Object Detection with Kolena: dataset/object-detection.md
  - LLM Powered Data Processing: dataset/advanced-usage/llm-prompt-extraction.md
  - Custom Queries and Fields: dataset/advanced-usage/custom-queries.md