feat: initialize the vector search document structure (#17983)
* feat: initialize the vector search document structure

* fix: merge toc

* fix

* feat: add langchain + llamaindex integration guide

* feat: add jinaai embedding integration guide

* vector search: refine wording (#1)

* vector search: refine wording

* Discard changes to tidb-cloud/create-tidb-cluster-serverless.md

* remove "cluster with vector search enabled"

* Update tidb-cloud/vector-search-overview.md

* Apply suggestions from code review

Co-authored-by: Mini256 <minianter@foxmail.com>

---------

Co-authored-by: Mini256 <minianter@foxmail.com>

* feat: add peewee + sqlalchemy integration guide

* feat: add django integration quickstart

* add supported distance functions

* fix: add faqs

---------

Co-authored-by: Aolin <aolinz@outlook.com>
Mini256 and Oreoxmt authored Jun 25, 2024
1 parent d70a5e7 commit bdb4b58
Showing 13 changed files with 2,313 additions and 22 deletions.
45 changes: 23 additions & 22 deletions TOC-tidb-cloud.md
@@ -240,6 +240,29 @@
- [Connect AWS DMS to TiDB Cloud clusters](/tidb-cloud/tidb-cloud-connect-aws-dms.md)
- Explore Data
- [Chat2Query (Beta)](/tidb-cloud/explore-data-with-chat2query.md)
- Vector Search (Beta)
- [Overview](/tidb-cloud/vector-search-overview.md)
- Get Started
- [Get Started with SQL](/tidb-cloud/vector-search-get-started-via-sql.md)
- [Get Started with Python Client](/tidb-cloud/vector-search-get-started-via-python-client.md)
- Integrations
- [Overview](/tidb-cloud/vector-search-integration-overview.md)
- AI Frameworks
- [LlamaIndex](/tidb-cloud/vector-search-integrate-with-llamaindex.md)
- [Langchain](/tidb-cloud/vector-search-integrate-with-langchain.md)
- Embedding Models / Services
- [JinaAI](/tidb-cloud/vector-search-integrate-with-jinaai-embedding.md)
- ORM Libraries
- [SQLAlchemy](/tidb-cloud/vector-search-integrate-with-sqlalchemy.md)
- [Peewee](/tidb-cloud/vector-search-integrate-with-peewee.md)
- [Django ORM](/tidb-cloud/vector-search-integrate-with-django-orm.md)
- Reference
- [Vector Data Types](/tidb-cloud/vector-search-data-types.md)
- [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md)
- [Vector Index](/tidb-cloud/vector-search-index.md)
- [Limitations](/tidb-cloud/vector-search-limitations.md)
- [FAQs](/tidb-cloud/vector-search-faqs.md)
- [Changelogs](/tidb-cloud/vector-search-changelogs.md)
- Data Service (Beta)
- [Overview](/tidb-cloud/data-service-overview.md)
- [Get Started](/tidb-cloud/data-service-get-started.md)
@@ -253,28 +276,6 @@
- [Use OpenAPI Specification with Next.js](/tidb-cloud/data-service-oas-with-nextjs.md)
- [Data App Configuration Files](/tidb-cloud/data-service-app-config-files.md)
- [Response and Status Code](/tidb-cloud/data-service-response-and-status-code.md)
- Vector Search
- [Overview](/tidb-cloud/vector-search-overview.md)
- [Quick Start](/tidb-cloud/vector-search-quick-start.md)
- Tutorials
- [Build Semantic Search for Texts](/tidb-cloud/vector-search-tutorial-semantic-search.md)
- [Build AI Chatbot for Knowledge Base](/tidb-cloud/vector-search-tutorial-chatbot.md)
- AI Integrations
- [LangChain](/tidb-cloud/vector-search-integration-langchain.md)
- [LlamaIndex](/tidb-cloud/vector-search-integration-llamaindex.md)
- [Dify](/tidb-cloud/vector-search-integration-dify.md)
- Programming Languages Integrations
- [Python](/tidb-cloud/vector-search-integration-python.md)
- [Go](/tidb-cloud/vector-search-integration-golang.md)
- [Node.js](/tidb-cloud/vector-search-integration-nodejs.md)
- [Java](/tidb-cloud/vector-search-integration-java.md)
- Reference
- [Vector Data Types](/tidb-cloud/vector-search-data-types.md)
- [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md)
- [Vector Index](/tidb-cloud/vector-search-index.md)
- [Limitations](/tidb-cloud/vector-search-limitations.md)
- [FAQs](/tidb-cloud/vector-search-faqs.md)
- [Changelogs](/tidb-cloud/vector-search-changelogs.md)
- Stream Data
- [Changefeed Overview](/tidb-cloud/changefeed-overview.md)
- [To MySQL Sink](/tidb-cloud/changefeed-sink-to-mysql.md)
Binary file added media/vector-search/embedding-search.png
41 changes: 41 additions & 0 deletions tidb-cloud/vector-search-faqs.md
@@ -0,0 +1,41 @@
---
title: Vector Search FAQs
summary: Learn about the FAQs related to TiDB Vector Search.
---

# Vector Search FAQs

This document lists the most frequently asked questions about TiDB Vector Search.

## General FAQs

### What is TiDB Vector Search?

TiDB Vector Search allows you to power generative AI applications, or to implement semantic search or similarity search for text, images, video, audio, or any other type of data. Rather than searching on the data itself, vector search allows you to search on the meaning of the data.

### What are the key use cases?

You can use embedding models from providers such as OpenAI and Hugging Face to create vector embeddings and store them in TiDB. You can then use TiDB Vector Search for retrieval-augmented generation (RAG), semantic search, recommendation engines, dynamic personalization, and other use cases.

### Does Vector Search work with articles, images or media files?

Yes. TiDB Vector Search can query any kind of data that can be turned into a vector embedding. You can store both vector embeddings and the data in the same TiDB cluster or even the same table without the need to set up other vector search engines.

### What AI integrations does TiDB Vector Search support?

TiDB Vector Search is integrated with [LangChain](/tidb-cloud/vector-search-integrate-with-langchain.md) and [LlamaIndex](/tidb-cloud/vector-search-integrate-with-llamaindex.md).

### Which vector embeddings does TiDB Vector Search support?

TiDB supports vector embeddings that stay within the 16000-dimension limit.

### How can I speed up the Vector Search?

You can create an index on the vector column to speed up vector search queries. See Build AI Apps with TiDB Vector Search for more details.

### How do I get support for Vector Search or about general usage of TiDB Serverless?

We value your feedback and are always here to help. You can get support through either of the following channels:

- Discord: https://discord.gg/zcqexutz2R
- Support Portal: https://tidb.support.pingcap.com/
201 changes: 201 additions & 0 deletions tidb-cloud/vector-search-get-started-via-python-client.md
@@ -0,0 +1,201 @@
---
title: Get Started with Vector Search Using the Python Client
summary: Learn how to quickly get started with the TiDB vector search feature in TiDB Cloud using a Python client and perform semantic searches.
---

# Get Started with Vector Search Using the Python Client

This tutorial demonstrates how to get started with the [vector search](/tidb-cloud/vector-search-overview.md) feature in TiDB Cloud using a Python client. You will learn how to use the Python client [`tidb-vector`](https://github.com/pingcap/tidb-vector-python) to:

- Set up your environment.
- Connect to your TiDB cluster.
- Create a vector table.
- Store vector embeddings.
- Perform vector search queries.

> **Note**
>
> The vector search feature is currently in beta and only available for [TiDB Serverless](/tidb-cloud/select-cluster-tier.md#tidb-serverless) clusters.

## Prerequisites

To complete this tutorial, you need:

- [Python 3.8 or higher](https://www.python.org/downloads/) installed.
- [Git](https://git-scm.com/downloads) installed.
- A TiDB Serverless cluster. Follow [creating a TiDB Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one.

## Get started

This section demonstrates how to get started with the vector search feature using the Python client [`tidb-vector`](https://github.com/pingcap/tidb-vector-python).

To run the demo directly, check out the sample code in the [pingcap/tidb-vector-python](https://github.com/pingcap/tidb-vector-python/blob/main/examples/python-client-quickstart) repository.

### Step 1. Create a new Python project

In your preferred directory, create a new Python project and a file named `example.py`.

```shell
mkdir python-client-quickstart
cd python-client-quickstart
touch example.py
```

### Step 2. Install required dependencies

In your project directory, run the following command to install the necessary packages:

```shell
pip install sqlalchemy pymysql sentence-transformers tidb-vector python-dotenv
```

- `tidb-vector`: the Python client for interacting with the vector search feature in TiDB Cloud, which is based on [SQLAlchemy](https://www.sqlalchemy.org).
- [`sentence-transformers`](https://sbert.net): a Python library that provides pre-trained models for generating [vector embeddings](/tidb-cloud/vector-search-overview.md#vector-embedding) from text.
- `python-dotenv`: a Python library that loads environment variables, such as the connection string used later in this tutorial, from a `.env` file.

### Step 3. Configure the connection string to the TiDB cluster

1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page.

2. Click **Connect** in the upper-right corner. A connection dialog is displayed.

3. Ensure the configurations in the connection dialog match your operating environment.

    - **Endpoint Type** is set to `Public`.
    - **Branch** is set to `main`.
    - **Connect With** is set to `SQLAlchemy`.
    - **Operating System** matches your environment.

    > **Tip:**
    >
    > If your program is running in Windows Subsystem for Linux (WSL), switch to the corresponding Linux distribution.

4. Click the **PyMySQL** tab and copy the connection string.

    > **Tip:**
    >
    > If you have not set a password yet, click **Generate Password** to generate a random password.

5. In the root directory of your Python project, create a `.env` file and paste the connection string into it.

    The following is an example for macOS:

    ```dotenv
    TIDB_DATABASE_URL="mysql+pymysql://<prefix>.root:<password>@gateway01.<region>.prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
    ```
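
A mistyped connection string is a common source of connection errors, so it can help to sanity-check it before you continue. The following is a minimal sketch that uses only the Python standard library. The helper name `check_tidb_url` and the sample values are hypothetical, and the checks simply mirror the macOS example above rather than any official validation rule.

```python
from urllib.parse import urlsplit, parse_qs


def check_tidb_url(url):
    """Return a list of problems found in a SQLAlchemy-style TiDB URL."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    problems = []
    # The tutorial connects through SQLAlchemy with the PyMySQL driver.
    if parts.scheme != "mysql+pymysql":
        problems.append(f"unexpected driver: {parts.scheme!r}")
    if not parts.username or not parts.password:
        problems.append("missing user name or password")
    if parts.port is None:
        problems.append("missing port")
    # The macOS example passes a CA bundle through the ssl_ca parameter.
    if "ssl_ca" not in query:
        problems.append("missing ssl_ca parameter")
    return problems


# Hypothetical values for illustration only.
example_url = (
    "mysql+pymysql://abc123.root:secret@gateway01.us-east-1.prod.aws.tidbcloud.com"
    ":4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
)
print(check_tidb_url(example_url))
```

An empty list means that none of these basic checks failed. It does not prove that the credentials are valid or that the cluster is reachable.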

### Step 4. Initialize the embedding model

An [embedding model](/tidb-cloud/vector-search-overview.md#embedding-model) transforms data into [vector embeddings](/tidb-cloud/vector-search-overview.md#vector-embedding). This example uses the pre-trained model [**msmarco-MiniLM-L12-cos-v5**](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L12-cos-v5) for text embedding. This lightweight model, provided by the `sentence-transformers` library, transforms text data into 384-dimensional vector embeddings.

To set up the model, copy the following code into the `example.py` file. This code initializes a `SentenceTransformer` instance and defines a `text_to_embedding()` function for later use.

```python
from sentence_transformers import SentenceTransformer

print("Downloading and loading the embedding model...")
embed_model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L12-cos-v5", trust_remote_code=True)
embed_model_dims = embed_model.get_sentence_embedding_dimension()

def text_to_embedding(text):
    """Generates vector embeddings for the given text."""
    embedding = embed_model.encode(text)
    return embedding.tolist()
```

### Step 5. Connect to the TiDB cluster

Use the `TiDBVectorClient` class to connect to your TiDB cluster and create a table `embedded_documents` with a vector column to serve as the vector store.

> **Note**
>
> Ensure the dimension of your vector column matches the dimension of the vectors produced by your embedding model. For example, the **msmarco-MiniLM-L12-cos-v5** model generates vectors with 384 dimensions.

```python
import os
from tidb_vector.integrations import TiDBVectorClient
from dotenv import load_dotenv

# Load the connection string from the .env file
load_dotenv()

vector_store = TiDBVectorClient(
    # The table which will store the vector data.
    table_name='embedded_documents',
    # The connection string to the TiDB cluster.
    connection_string=os.environ.get('TIDB_DATABASE_URL'),
    # The dimension of the vector generated by the embedding model.
    vector_dimension=embed_model_dims,
    # Determine whether to recreate the table if it already exists.
    drop_existing_table=True,
)
```
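
Because the vector column has a fixed dimension, a mismatched embedding is easier to diagnose if you catch it before it reaches the database. The following is a minimal sketch of such a guard. The helper `check_dimensions` is hypothetical and not part of `tidb-vector`, and the toy 3-dimensional vectors stand in for real 384-dimensional embeddings.

```python
def check_dimensions(embeddings, expected_dims):
    """Raise ValueError if any embedding's length differs from expected_dims."""
    for index, embedding in enumerate(embeddings):
        if len(embedding) != expected_dims:
            raise ValueError(
                f"embedding {index} has {len(embedding)} dimensions, "
                f"expected {expected_dims}"
            )


# Toy 3-dimensional vectors; both pass the check, so nothing is raised.
check_dimensions([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]], expected_dims=3)
```

In `example.py`, you could call this with `embed_model_dims` as `expected_dims` before inserting embeddings into the vector store.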

### Step 6. Embed text data and store the vectors

In this step, you will prepare sample documents containing single words, such as "dog", "fish", and "tree". The following code uses the `text_to_embedding()` function to transform these text documents into vector embeddings, and then inserts them into the vector store.

```python
documents = [
    {
        "id": "f8e7dee2-63b6-42f1-8b60-2d46710c1971",
        "text": "dog",
        "embedding": text_to_embedding("dog"),
        "metadata": {"category": "animal"},
    },
    {
        "id": "8dde1fbc-2522-4ca2-aedf-5dcb2966d1c6",
        "text": "fish",
        "embedding": text_to_embedding("fish"),
        "metadata": {"category": "animal"},
    },
    {
        "id": "e4991349-d00b-485c-a481-f61695f2b5ae",
        "text": "tree",
        "embedding": text_to_embedding("tree"),
        "metadata": {"category": "plant"},
    },
]

vector_store.insert(
    ids=[doc["id"] for doc in documents],
    texts=[doc["text"] for doc in documents],
    embeddings=[doc["embedding"] for doc in documents],
    metadatas=[doc["metadata"] for doc in documents],
)
```
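
The sample above hard-codes the document IDs. For larger data sets, you might prefer to generate the records programmatically. The following is a minimal sketch under that assumption. The helper `build_documents` and the placeholder embedder `fake_embed` are hypothetical and exist only so that the snippet runs without downloading a model.

```python
import uuid


def build_documents(items, embed):
    """Turn (text, metadata) pairs into records with an ID and an embedding."""
    return [
        {
            "id": str(uuid.uuid4()),  # Random UUID, like the IDs in the sample.
            "text": text,
            "embedding": embed(text),
            "metadata": metadata,
        }
        for text, metadata in items
    ]


def fake_embed(text):
    """Placeholder embedder, NOT a real model: returns a toy 3-dimensional vector."""
    return [float(len(text)), 0.0, 0.0]


docs = build_documents(
    [("dog", {"category": "animal"}), ("tree", {"category": "plant"})],
    embed=fake_embed,
)
print([doc["text"] for doc in docs])
```

In `example.py`, you would pass `text_to_embedding` as the `embed` argument instead of the placeholder, and then split the records into the four lists that `vector_store.insert()` expects.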

### Step 7. Perform a vector search query

In this step, you will search for "a swimming animal", which doesn't directly match any words in existing documents.

The following code uses the `text_to_embedding()` function again to convert the query text into a vector embedding, and then queries with the embedding to find the top three closest matches.

```python
def print_result(query, result):
    print(f"Search result (\"{query}\"):")
    for r in result:
        print(f"- text: \"{r.document}\", distance: {r.distance}")

query = "a swimming animal"
query_embedding = text_to_embedding(query)
search_result = vector_store.query(query_embedding, k=3)
print_result(query, search_result)
```

Run the `example.py` file and the output is as follows:

```plain
Search result ("a swimming animal"):
- text: "fish", distance: 0.4586619425596351
- text: "dog", distance: 0.6521646263795423
- text: "tree", distance: 0.7980725077476978
```

From the output, the swimming animal is most likely a fish, or a dog with a gift for swimming.

This demonstration shows how vector search can efficiently locate the most relevant documents, with search results organized by the proximity of the vectors: the smaller the distance, the more relevant the document.
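
To build intuition for these distance values, the following sketch computes a distance by hand and ranks candidates by it. It assumes cosine distance, which the model name suggests but this tutorial does not state explicitly. The function `cosine_distance` and the 2-dimensional vectors are made up for illustration and are not real model output.

```python
import math


def cosine_distance(a, b):
    """Cosine distance: 1 minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Made-up 2-dimensional vectors, not real embeddings.
query = [1.0, 0.0]
candidates = {
    "same direction": [2.0, 0.0],
    "diagonal": [1.0, 1.0],
    "orthogonal": [0.0, 1.0],
}

# Sort candidate names from the smallest distance to the largest.
ranked = sorted(candidates, key=lambda name: cosine_distance(query, candidates[name]))
print(ranked)
```

Under this metric, a vector that points in the same direction as the query has a distance of 0 regardless of its length, and an orthogonal vector has a distance of 1, so the candidates are ranked in that order.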

## See also

- [Vector Data Types](/tidb-cloud/vector-search-data-types.md)
- [Vector Index](/tidb-cloud/vector-search-index.md)