chore: Upload Embeddings Doc (#719)
* Added code example to NLS upload

* created uploading embeddings doc

* Fixed URL link issues and updated titles

* Created instructions for upload embeddings

* Updated uploaded embeddings to show embedding name
brianshen3 authored Nov 5, 2024
1 parent 0772ba8 commit c45a166
Showing 4 changed files with 76 additions and 6 deletions.
3 changes: 3 additions & 0 deletions docs/assets/images/upload-embeddings-enable.gif
11 changes: 5 additions & 6 deletions docs/automations/set-up-natural-language-search.md
@@ -37,12 +37,11 @@ In this document, we will go over main components of the below
and steps you need to take to tailor it for your application.

!!! Example
-    The [`kolena`](https://github.com/kolenaIO/kolena) repository contains a runnable
-    [example](https://github.com/kolenaIO/kolena/tree/trunk/examples/dataset/search_embeddings) for
-    embeddings extraction and
-    upload. This builds off the data uploaded in the
+    The [`kolena`](https://github.com/kolenaIO/kolena) repository includes a
+    [code example](https://github.com/kolenaIO/kolena/tree/trunk/examples/dataset/search_embeddings) for
+    extracting and uploading embeddings. It builds on data from the
     [semantic_segmentation](https://github.com/kolenaIO/kolena/tree/trunk/examples/dataset/semantic_segmentation)
-    example dataset, and is best run after this data has been uploaded to your Kolena environment.
+    example dataset, so ensure the dataset is uploaded to your Kolena environment before running the code example.

Uploading embeddings to Kolena can be done in four simple steps:

@@ -56,7 +55,7 @@ Uploading embeddings to Kolena can be done in four simple steps:
The package can be installed via `pip` or `uv` and requires your Kolena token, which can be created
on the [:kolena-developer-16: Developer](https://app.kolena.com/redirect/developer) page.

-We first [retrieve and set](../installing-kolena.md#initialization) our `KOLENA_TOKEN` environment variable.
+We first [retrieve and set](../installing-kolena.md) our `KOLENA_TOKEN` environment variable.
This is used by the uploader for authentication against your Kolena instance.

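As a minimal sketch, the token can be exported in your shell session before running the uploader; the value shown here is a placeholder, not a real token:

```shell
# Set the Kolena API token for the current shell session.
# Replace the placeholder with the token from the Developer page.
export KOLENA_TOKEN="<your-kolena-token>"
```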
67 changes: 67 additions & 0 deletions docs/dataset/advanced-usage/upload-embeddings.md
@@ -0,0 +1,67 @@
---
icon: kolena/classification-16
---

# :kolena-classification-16: Uploading Custom Embeddings

This guide explains how to upload your own embeddings to Kolena using the Kolena SDK.
Before starting, ensure you have the SDK installed;
[installation instructions are available here.](https://docs.kolena.com/installing-kolena/)

## Step 1: Import the Embedding Upload Function

To upload embeddings, use the `upload_dataset_embeddings` function from Kolena. You can import
it with the following code:
```python
from kolena._experimental.search import upload_dataset_embeddings
```

## Step 2: Prepare the Required DataFrame

The DataFrame you upload should have:

- Unique identifier column(s): typically the `locator` field, which serves as a unique identifier for each entry.
  Multiple ID fields can be combined, such as `locator` + `person_id`.
- Embedding column: each embedding must be the same size across all rows.
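When a single column does not uniquely identify a row, the identifier can span multiple columns. Here is a minimal sketch of that case; the locators, `person_id` values, and the 512-dimension size are all hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical rows where `locator` alone is not unique: one image contains
# several people, so `locator` + `person_id` together identify a row.
df_embedding = pd.DataFrame(
    {
        "locator": ["s3://bucket/img1.png", "s3://bucket/img1.png", "s3://bucket/img2.png"],
        "person_id": [0, 1, 0],
    }
)

# Every embedding must have the same size; 512 is an arbitrary choice here.
df_embedding["embedding"] = [np.zeros(512) for _ in range(len(df_embedding))]

# Sanity checks before uploading: identifiers are unique, sizes are consistent.
assert not df_embedding.duplicated(subset=["locator", "person_id"]).any()
assert df_embedding["embedding"].map(len).nunique() == 1
```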

### Example code

Here’s an example where we download the `instance-seg` dataset from Kolena,
then add a placeholder embedding (a zero-filled array):
```python
import numpy as np

from kolena.dataset import download_dataset

dataset = "instance-seg"
df = download_dataset(dataset)
df_embedding = df[["locator"]].copy()  # keep only the unique identifier column
df_embedding["embedding"] = [np.zeros((1, 512))] * len(df_embedding)  # placeholder embeddings
```
!!! Note
Replace the placeholder embeddings with embeddings generated from your own embedding model.
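One way to do that replacement is to map an embedding function over the identifier column. The `embed_image` function below is a stand-in for a real model call (for example, an image encoder); it derives a deterministic pseudo-embedding from the locator so the sketch runs without any model dependency:

```python
import zlib

import numpy as np
import pandas as pd

def embed_image(locator: str) -> np.ndarray:
    # Stand-in for a real model call. A CRC32 of the locator seeds the RNG,
    # so the same locator always yields the same pseudo-embedding.
    rng = np.random.default_rng(zlib.crc32(locator.encode()))
    vec = rng.standard_normal(512)
    return vec / np.linalg.norm(vec)  # L2-normalize, common for similarity search

df_embedding = pd.DataFrame({"locator": ["s3://bucket/img1.png", "s3://bucket/img2.png"]})
df_embedding["embedding"] = df_embedding["locator"].map(embed_image)
```

In practice, `embed_image` would wrap your own model's inference call; everything else stays the same.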

## Step 3: Upload the DataFrame using Kolena SDK

With the DataFrame prepared, use the `upload_dataset_embeddings` function to upload it to Kolena.

```python
upload_dataset_embeddings(dataset_name="instance-seg", key="my-embedding-model", df_embedding=df_embedding)
```

The `dataset_name` parameter specifies the target dataset where the embeddings will be uploaded.
The `key` parameter is a unique identifier for the embeddings being uploaded, allowing multiple sets of embeddings
to be associated with the same dataset. Finally, `df_embedding` is the DataFrame object
prepared in Step 2 that contains the data you want to upload.
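Because `key` distinguishes embedding sets, you can upload embeddings from several models against the same dataset. A sketch of that pattern follows; the locators and key names are hypothetical, and the upload calls are shown as comments since they require a live Kolena instance:

```python
import numpy as np
import pandas as pd

locators = ["s3://bucket/img1.png", "s3://bucket/img2.png"]

# Hypothetical embeddings from two different models for the same dataset.
df_model_a = pd.DataFrame({"locator": locators})
df_model_a["embedding"] = [np.zeros(512)] * len(df_model_a)

df_model_b = pd.DataFrame({"locator": locators})
df_model_b["embedding"] = [np.ones(512)] * len(df_model_b)

# Each upload uses a distinct `key`, so both sets coexist on the dataset
# and can be selected independently in Studio:
# upload_dataset_embeddings(dataset_name="instance-seg", key="model-a", df_embedding=df_model_a)
# upload_dataset_embeddings(dataset_name="instance-seg", key="model-b", df_embedding=df_model_b)
```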

## Step 4: Verify Your Embeddings in Kolena Studio

To confirm the embeddings uploaded successfully:

- Open Kolena Studio.
- In the top right corner, click on "Off" beside the embeddings toggle to enable embeddings view.
- Select the embedding `key` used in the upload function; in the example above, `my-embedding-model`.
- Choose from the visualization options: UMAP, t-SNE, or PCA.

![Enabling Embeddings on Studio](../../assets/images/upload-embeddings-enable.gif)

If you have trouble creating embeddings, refer to our example code for
[generating image embeddings and uploading to Kolena](https://github.com/kolenaIO/kolena/tree/trunk/examples/dataset/search_embeddings).
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -25,6 +25,7 @@ nav:
- Automatically Extract Image Properties: automations/extract-image-metadata.md
- Automatically Extract Bounding Box Properties: automations/extract-bounding-box-metadata.md
- Setting Up Natural Language Search: automations/set-up-natural-language-search.md
+- Uploading Custom Embeddings: dataset/advanced-usage/upload-embeddings.md
- Object Detection with Kolena: dataset/object-detection.md
- LLM Powered Data Processing: dataset/advanced-usage/llm-prompt-extraction.md
- Custom Queries and Fields: dataset/advanced-usage/custom-queries.md
