feat: initialize the vector search document structure (#17983)

* feat: initialize the vector search document structure
* fix: merge toc
* fix
* feat: add langchain + llamaindex integration guide
* feat: add jinaai embedding integration guide
* vector search: refine wording (#1)
* vector search: refine wording
* Discard changes to tidb-cloud/create-tidb-cluster-serverless.md
* remove "cluster with vector search enabled"
* Update tidb-cloud/vector-search-overview.md
* Apply suggestions from code review

Co-authored-by: Mini256 <minianter@foxmail.com>

---------

Co-authored-by: Mini256 <minianter@foxmail.com>

* feat: add peewee + sqlalchemy integration guide
* feat: add django integration quickstart
* add supported distance functions
* fix: add faqs

---------

Co-authored-by: Aolin <aolinz@outlook.com>
pingcap · Jun 25, 2024 · bdb4b58
1 parent d70a5e7
Showing 13 changed files with 2,313 additions and 22 deletions.
diff --git a/TOC-tidb-cloud.md b/TOC-tidb-cloud.md
@@ -240,6 +240,29 @@
     - [Connect AWS DMS to TiDB Cloud clusters](/tidb-cloud/tidb-cloud-connect-aws-dms.md)
 - Explore Data
   - [Chat2Query (Beta)](/tidb-cloud/explore-data-with-chat2query.md)
+- Vector Search (Beta)
+  - [Overview](/tidb-cloud/vector-search-overview.md)
+  - Get Started
+    - [Get Started with SQL](/tidb-cloud/vector-search-get-started-via-sql.md)
+    - [Get Started with Python Client](/tidb-cloud/vector-search-get-started-via-python-client.md)
+  - Integrations
+    - [Overview](/tidb-cloud/vector-search-integration-overview.md)
+    - AI Frameworks
+      - [LlamaIndex](/tidb-cloud/vector-search-integrate-with-llamaindex.md)
+      - [LangChain](/tidb-cloud/vector-search-integrate-with-langchain.md)
+    - Embedding Models / Services
+      - [JinaAI](/tidb-cloud/vector-search-integrate-with-jinaai-embedding.md)
+    - ORM Libraries
+      - [SQLAlchemy](/tidb-cloud/vector-search-integrate-with-sqlalchemy.md)
+      - [Peewee](/tidb-cloud/vector-search-integrate-with-peewee.md)
+      - [Django ORM](/tidb-cloud/vector-search-integrate-with-django-orm.md)
+  - Reference
+    - [Vector Data Types](/tidb-cloud/vector-search-data-types.md)
+    - [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md)
+    - [Vector Index](/tidb-cloud/vector-search-index.md)
+  - [Limitations](/tidb-cloud/vector-search-limitations.md)
+  - [FAQs](/tidb-cloud/vector-search-faqs.md)
+  - [Changelogs](/tidb-cloud/vector-search-changelogs.md)
 - Data Service (Beta)
   - [Overview](/tidb-cloud/data-service-overview.md)
   - [Get Started](/tidb-cloud/data-service-get-started.md)
@@ -253,28 +276,6 @@
   - [Use OpenAPI Specification with Next.js](/tidb-cloud/data-service-oas-with-nextjs.md)
   - [Data App Configuration Files](/tidb-cloud/data-service-app-config-files.md)
   - [Response and Status Code](/tidb-cloud/data-service-response-and-status-code.md)
-- Vector Search
-  - [Overview](/tidb-cloud/vector-search-overview.md)
-  - [Quick Start](/tidb-cloud/vector-search-quick-start.md)
-  - Tutorials
-    - [Build Semantic Search for Texts](/tidb-cloud/vector-search-tutorial-semantic-search.md)
-    - [Build AI Chatbot for Knowledge Base](/tidb-cloud/vector-search-tutorial-chatbot.md)
-  - AI Integrations
-    - [LangChain](/tidb-cloud/vector-search-integration-langchain.md)
-    - [LlamaIndex](/tidb-cloud/vector-search-integration-llamaindex.md)
-    - [Dify](/tidb-cloud/vector-search-integration-dify.md)
-  - Programming Languages Integrations
-    - [Python](/tidb-cloud/vector-search-integration-python.md)
-    - [Go](/tidb-cloud/vector-search-integration-golang.md)
-    - [Node.js](/tidb-cloud/vector-search-integration-nodejs.md)
-    - [Java](/tidb-cloud/vector-search-integration-java.md)
-  - Reference
-    - [Vector Data Types](/tidb-cloud/vector-search-data-types.md)
-    - [Vector Functions and Operators](/tidb-cloud/vector-search-functions-and-operators.md)
-    - [Vector Index](/tidb-cloud/vector-search-index.md)
-  - [Limitations](/tidb-cloud/vector-search-limitations.md)
-  - [FAQs](/tidb-cloud/vector-search-faqs.md)
-  - [Changelogs](/tidb-cloud/vector-search-changelogs.md)
 - Stream Data
   - [Changefeed Overview](/tidb-cloud/changefeed-overview.md)
   - [To MySQL Sink](/tidb-cloud/changefeed-sink-to-mysql.md)

diff --git a/media/vector-search/embedding-search.png b/media/vector-search/embedding-search.png
diff --git a/tidb-cloud/vector-search-faqs.md b/tidb-cloud/vector-search-faqs.md
@@ -0,0 +1,41 @@
+---
+title: Vector Search FAQs
+summary: Learn about the FAQs related to TiDB Vector Search.
+---
+
+# Vector Search FAQs
+
+This document lists the most frequently asked questions about TiDB Vector Search.
+
+## General FAQs
+
+### What is TiDB Vector Search?
+
+TiDB Vector Search lets you power generative AI applications and implement semantic search or similarity search over text, images, video, audio, or any other type of data. Rather than searching on the data itself, vector search lets you search on the meaning of the data.
+
+### What are the key use cases?
+
+You can use machine learning models from providers such as OpenAI and Hugging Face to create vector embeddings and store them in TiDB. Then you can use TiDB Vector Search for retrieval-augmented generation (RAG), semantic search, recommendation engines, dynamic personalization, and other use cases.
+
+### Does Vector Search work with articles, images or media files?
+
+Yes. TiDB Vector Search can query any kind of data that can be turned into a vector embedding. You can store both vector embeddings and the data in the same TiDB cluster or even the same table without the need to set up other vector search engines.
+
+### What AI integrations does TiDB Vector Search support?
+
+TiDB Vector Search is integrated with [LangChain](/tidb-cloud/vector-search-integrate-with-langchain.md) and [LlamaIndex](/tidb-cloud/vector-search-integrate-with-llamaindex.md).
+
+### Which vector embeddings does TiDB Vector Search support?
+
+TiDB supports vector embeddings under the 16000-dimension limit.
+
+### How can I speed up the Vector Search?
+
+You can create an index on the vector column to speed up vector search queries. See [Vector Index](/tidb-cloud/vector-search-index.md) for more details.
+
+### How do I get support for Vector Search or about general usage of TiDB Serverless?
+
+We value your feedback and are always here to help. You can get support in either of the following ways:
+
+- Discord: https://discord.gg/zcqexutz2R
+- Support Portal: https://tidb.support.pingcap.com/
diff --git a/tidb-cloud/vector-search-get-started-via-python-client.md b/tidb-cloud/vector-search-get-started-via-python-client.md
@@ -0,0 +1,201 @@
+---
+title: Get Started with Vector Search Using the Python Client
+summary: Learn how to quickly get started with the TiDB vector search feature in TiDB Cloud using a Python client and perform semantic searches.
+---
+
+# Get Started with Vector Search Using the Python Client
+
+This tutorial demonstrates how to get started with the [vector search](/tidb-cloud/vector-search-overview.md) feature in TiDB Cloud using a Python client. You will learn how to use the Python client [`tidb-vector`](https://github.com/pingcap/tidb-vector-python) to:
+
+- Set up your environment.
+- Connect to your TiDB cluster.
+- Create a vector table.
+- Store vector embeddings.
+- Perform vector search queries.
+
+> **Note**
+>
+> The vector search feature is currently in beta and only available for [TiDB Serverless](/tidb-cloud/select-cluster-tier.md#tidb-serverless) clusters.
+
+## Prerequisites
+
+To complete this tutorial, you need:
+
+- [Python 3.8 or higher](https://www.python.org/downloads/) installed.
+- [Git](https://git-scm.com/downloads) installed.
+- A TiDB Serverless cluster. Follow [creating a TiDB Serverless cluster](/tidb-cloud/create-tidb-cluster-serverless.md) to create your own TiDB Cloud cluster if you don't have one.
+
+## Get started
+
+This section demonstrates how to get started with the vector search feature using the Python client [`tidb-vector`](https://github.com/pingcap/tidb-vector-python).
+
+To run the demo directly, check out the sample code in the [pingcap/tidb-vector-python](https://github.com/pingcap/tidb-vector-python/blob/main/examples/python-client-quickstart) repository.
+
+### Step 1. Create a new Python project
+
+In your preferred directory, create a new Python project and a file named `example.py`.
+
+```shell
+mkdir python-client-quickstart
+cd python-client-quickstart
+touch example.py
+```
+
+### Step 2. Install required dependencies
+
+In your project directory, run the following command to install the necessary packages:
+
+```shell
+pip install sqlalchemy pymysql sentence-transformers tidb-vector python-dotenv
+```
+
+- `tidb-vector`: the Python client for interacting with the vector search feature in TiDB Cloud, which is based on [SQLAlchemy](https://www.sqlalchemy.org).
+- [`sentence-transformers`](https://sbert.net): a Python library that provides pre-trained models for generating [vector embeddings](/tidb-cloud/vector-search-overview.md#vector-embedding) from text.
+
+### Step 3. Configure the connection string to the TiDB cluster
+
+1. Navigate to the [**Clusters**](https://tidbcloud.com/console/clusters) page, and then click the name of your target cluster to go to its overview page.
+
+2. Click **Connect** in the upper-right corner. A connection dialog is displayed.
+
+3. Ensure the configurations in the connection dialog match your operating environment.
+
+    - **Endpoint Type** is set to `Public`.
+    - **Branch** is set to `main`.
+    - **Connect With** is set to `SQLAlchemy`.
+    - **Operating System** matches your environment.
+
+    > **Tip:**
+    >
+    > If your program is running in Windows Subsystem for Linux (WSL), switch to the corresponding Linux distribution.
+
+4. Click the **PyMySQL** tab and copy the connection string.
+
+    > **Tip:**
+    > 
+    > If you have not set a password yet, click **Generate Password** to generate a random password.
+
+5. In the root directory of your Python project, create a `.env` file and paste the connection string into it.
+
+    The following is an example for macOS:
+
+    ```dotenv
+    TIDB_DATABASE_URL="mysql+pymysql://<prefix>.root:<password>@gateway01.<region>.prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
+    ```
+
+### Step 4. Initialize the embedding model
+
+An [embedding model](/tidb-cloud/vector-search-overview.md#embedding-model) transforms data into [vector embeddings](/tidb-cloud/vector-search-overview.md#vector-embedding). This example uses the pre-trained model [**msmarco-MiniLM-L12-cos-v5**](https://huggingface.co/sentence-transformers/msmarco-MiniLM-L12-cos-v5) for text embedding. This lightweight model, provided by the `sentence-transformers` library, transforms text data into 384-dimensional vector embeddings.
+
+To set up the model, copy the following code into the `example.py` file. This code initializes a `SentenceTransformer` instance and defines a `text_to_embedding()` function for later use.
+
+```python
+from sentence_transformers import SentenceTransformer
+
+print("Downloading and loading the embedding model...")
+embed_model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L12-cos-v5", trust_remote_code=True)
+embed_model_dims = embed_model.get_sentence_embedding_dimension()
+
+def text_to_embedding(text):
+    """Generates vector embeddings for the given text."""
+    embedding = embed_model.encode(text)
+    return embedding.tolist()
+```
+
+### Step 5. Connect to the TiDB cluster
+
+Use the `TiDBVectorClient` class to connect to your TiDB cluster and create a table `embedded_documents` with a vector column to serve as the vector store.
+
+> **Note**
+> 
+> Ensure the dimension of your vector column matches the dimension of the vectors produced by your embedding model. For example, the **msmarco-MiniLM-L12-cos-v5** model generates vectors with 384 dimensions.
+
+```python
+import os
+from tidb_vector.integrations import TiDBVectorClient
+from dotenv import load_dotenv
+
+# Load the connection string from the .env file
+load_dotenv()
+
+vector_store = TiDBVectorClient(
+    # The table which will store the vector data.
+    table_name='embedded_documents',
+    # The connection string to the TiDB cluster.
+    connection_string=os.environ.get('TIDB_DATABASE_URL'),
+    # The dimension of the vector generated by the embedding model.
+    vector_dimension=embed_model_dims,
+    # Determine whether to recreate the table if it already exists.
+    drop_existing_table=True,
+)
+```
+
+### Step 6. Embed text data and store the vectors
+
+In this step, you will prepare sample documents containing single words, such as "dog", "fish", and "tree". The following code uses the `text_to_embedding()` function to transform these text documents into vector embeddings, and then inserts them into the vector store.
+
+```python
+documents = [
+    {
+        "id": "f8e7dee2-63b6-42f1-8b60-2d46710c1971",
+        "text": "dog",
+        "embedding": text_to_embedding("dog"),
+        "metadata": {"category": "animal"},
+    },
+    {
+        "id": "8dde1fbc-2522-4ca2-aedf-5dcb2966d1c6",
+        "text": "fish",
+        "embedding": text_to_embedding("fish"),
+        "metadata": {"category": "animal"},
+    },
+    {
+        "id": "e4991349-d00b-485c-a481-f61695f2b5ae",
+        "text": "tree",
+        "embedding": text_to_embedding("tree"),
+        "metadata": {"category": "plant"},
+    },
+]
+
+vector_store.insert(
+    ids=[doc["id"] for doc in documents],
+    texts=[doc["text"] for doc in documents],
+    embeddings=[doc["embedding"] for doc in documents],
+    metadatas=[doc["metadata"] for doc in documents],
+)
+```
+
+### Step 7. Perform a vector search query
+
+In this step, you will search for "a swimming animal", which doesn't directly match any words in existing documents. 
+
+The following code uses the `text_to_embedding()` function again to convert the query text into a vector embedding, and then queries with the embedding to find the top three closest matches.
+
+```python
+def print_result(query, result):
+    print(f"Search result (\"{query}\"):")
+    for r in result:
+        print(f"- text: \"{r.document}\", distance: {r.distance}")
+
+query = "a swimming animal"
+query_embedding = text_to_embedding(query)
+search_result = vector_store.query(query_embedding, k=3)
+print_result(query, search_result)
+```
+
+Run the `example.py` file and the output is as follows:
+
+```plain
+Search result ("a swimming animal"):
+- text: "fish", distance: 0.4586619425596351
+- text: "dog", distance: 0.6521646263795423
+- text: "tree", distance: 0.7980725077476978
+```
+
+From the output, the swimming animal is most likely a fish, or a dog with a gift for swimming.
+
+This demonstration shows how vector search can efficiently locate the most relevant documents, with search results organized by the proximity of the vectors: the smaller the distance, the more relevant the document.
+
+## See also
+
+- [Vector Data Types](/tidb-cloud/vector-search-data-types.md)
+- [Vector Index](/tidb-cloud/vector-search-index.md)
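
The Python quickstart added in this commit closes by noting that results are ordered by vector proximity: the smaller the distance, the more relevant the document. As a rough illustration of that ordering, here is a minimal pure-Python sketch. It is not part of the commit, and it rests on some assumptions: that the reported distances are cosine distances (the diff does not name the metric), that tiny 3-dimensional toy vectors can stand in for the 384-dimensional embeddings, and that the helper names `cosine_distance` and `rank_by_distance` are hypothetical rather than part of `tidb-vector`.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def rank_by_distance(query, docs, k=3):
    """Return the k (text, distance) pairs closest to the query, nearest first."""
    scored = [(text, cosine_distance(query, vec)) for text, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy vectors invented for illustration only; real embeddings come from the model.
toy_docs = {
    "dog": [0.9, 0.3, 0.0],
    "fish": [0.6, 0.8, 0.0],
    "tree": [0.0, 0.1, 1.0],
}
toy_query = [0.5, 0.9, 0.1]  # stands in for the embedding of "a swimming animal"

for text, distance in rank_by_distance(toy_query, toy_docs):
    print(f'- text: "{text}", distance: {distance:.4f}')
```

With these toy values the ranking comes out as fish, dog, tree, mirroring the order in the quickstart output, although the actual distances there depend on the real embedding model.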