Get Started with TiDB + AI via Python
This tutorial demonstrates how to develop a simple AI application that provides semantic search features. Unlike traditional keyword search, semantic search intelligently understands the meaning behind your query. For example, if you have documents titled "dog", "fish", and "tree", and you search for "a swimming animal", the application would identify "fish" as the most relevant result.
Throughout this tutorial, you will develop this AI application using TiDB Vector Search, Python, TiDB Vector SDK for Python, and AI models.
Prerequisites
To complete this tutorial, you need:
- Python 3.8 or higher installed.
- Git installed.
- A TiDB Cloud Serverless cluster. If you don't have one, follow Creating a TiDB Cloud Serverless cluster to create your own.
Get started
To run the demo directly, check out the sample code in the pingcap/tidb-vector-python repository.
Step 1. Create a new Python project
In your preferred directory, create a new Python project and a file named `example.py`:

```bash
mkdir python-client-quickstart
cd python-client-quickstart
touch example.py
```
Step 2. Install required dependencies
In your project directory, run the following command to install the required packages:
```bash
pip install sqlalchemy pymysql sentence-transformers tidb-vector python-dotenv
```

- `tidb-vector`: the Python client for interacting with Vector Search in TiDB Cloud.
- `sentence-transformers`: a Python library that provides pre-trained models for generating vector embeddings from text.
Step 3. Configure the connection string to the TiDB cluster
Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.
Click Connect in the upper-right corner. A connection dialog is displayed.
Ensure the configurations in the connection dialog match your operating environment.
- Connection Type is set to `Public`.
- Branch is set to `main`.
- Connect With is set to `SQLAlchemy`.
- Operating System matches your environment.
Click the PyMySQL tab and copy the connection string.
In the root directory of your Python project, create a `.env` file and paste the connection string into it. The following is an example for macOS:

```
TIDB_DATABASE_URL="mysql+pymysql://<prefix>.root:<password>@gateway01.<region>.prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
```
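Optionally, before moving on, you can verify that the connection string loads and works. The following is a minimal sketch (not part of the tutorial code) that reads the `.env` file and opens a test connection using SQLAlchemy, which you installed in Step 2:

```python
import os

from dotenv import load_dotenv
from sqlalchemy import create_engine, text

# Load TIDB_DATABASE_URL from the .env file into the environment.
load_dotenv()

# Open a connection and run a trivial query to confirm the cluster is reachable.
engine = create_engine(os.environ["TIDB_DATABASE_URL"])
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())  # prints 1 on success
```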
Step 4. Initialize the embedding model
An embedding model transforms data into vector embeddings. This example uses the pre-trained model msmarco-MiniLM-L12-cos-v5 for text embedding. This lightweight model, provided by the `sentence-transformers` library, transforms text data into 384-dimensional vector embeddings.

To set up the model, copy the following code into the `example.py` file. This code initializes a `SentenceTransformer` instance and defines a `text_to_embedding()` function for later use.
```python
from sentence_transformers import SentenceTransformer

print("Downloading and loading the embedding model...")
embed_model = SentenceTransformer("sentence-transformers/msmarco-MiniLM-L12-cos-v5", trust_remote_code=True)
embed_model_dims = embed_model.get_sentence_embedding_dimension()

def text_to_embedding(text):
    """Generates vector embeddings for the given text."""
    embedding = embed_model.encode(text)
    return embedding.tolist()
```
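Optionally, you can verify the setup with a quick sanity check (a sketch; these lines are not needed by later steps) that embeds a sample word and confirms the 384-dimensional output:

```python
# Optional sanity check: the embedding should be a list of 384 floats.
sample_embedding = text_to_embedding("dog")
print(f"Model dimensions: {embed_model_dims}")       # 384
print(f"Embedding length: {len(sample_embedding)}")  # 384
print(f"First 3 values: {sample_embedding[:3]}")
```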
Step 5. Connect to the TiDB cluster
Use the `TiDBVectorClient` class to connect to your TiDB cluster and create a table named `embedded_documents` with a vector column to serve as the vector store.
```python
import os

from tidb_vector.integrations import TiDBVectorClient
from dotenv import load_dotenv

# Load the connection string from the .env file.
load_dotenv()

vector_store = TiDBVectorClient(
    # The table that stores the vector data.
    table_name='embedded_documents',
    # The connection string to the TiDB cluster.
    connection_string=os.environ.get('TIDB_DATABASE_URL'),
    # The dimension of the vector generated by the embedding model.
    vector_dimension=embed_model_dims,
    # Determine whether to recreate the table if it already exists.
    drop_existing_table=True,
)
```
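One note on `drop_existing_table=True`: it recreates the table on every run, discarding previously stored vectors. If you rerun the script and want to keep existing data, the same call with the flag flipped (a sketch, otherwise identical to the code above) reuses the table:

```python
# Reuse the existing table instead of recreating it on every run.
vector_store = TiDBVectorClient(
    table_name='embedded_documents',
    connection_string=os.environ.get('TIDB_DATABASE_URL'),
    vector_dimension=embed_model_dims,
    drop_existing_table=False,  # keep previously inserted rows
)
```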
Step 6. Embed text data and store the vectors
In this step, you will prepare sample documents containing single words, such as "dog", "fish", and "tree". The following code uses the `text_to_embedding()` function to transform these text documents into vector embeddings, and then inserts them into the vector store.
```python
documents = [
    {
        "id": "f8e7dee2-63b6-42f1-8b60-2d46710c1971",
        "text": "dog",
        "embedding": text_to_embedding("dog"),
        "metadata": {"category": "animal"},
    },
    {
        "id": "8dde1fbc-2522-4ca2-aedf-5dcb2966d1c6",
        "text": "fish",
        "embedding": text_to_embedding("fish"),
        "metadata": {"category": "animal"},
    },
    {
        "id": "e4991349-d00b-485c-a481-f61695f2b5ae",
        "text": "tree",
        "embedding": text_to_embedding("tree"),
        "metadata": {"category": "plant"},
    },
]

vector_store.insert(
    ids=[doc["id"] for doc in documents],
    texts=[doc["text"] for doc in documents],
    embeddings=[doc["embedding"] for doc in documents],
    metadatas=[doc["metadata"] for doc in documents],
)
```
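The IDs above are hard-coded UUID strings so that the example is reproducible. In a real application you would generate IDs, for example with Python's standard `uuid` module; the following sketch inserts one additional (hypothetical) document using the same `insert()` call:

```python
from uuid import uuid4

# Generate a fresh ID for each new document instead of hard-coding it.
new_text = "cat"
vector_store.insert(
    ids=[str(uuid4())],
    texts=[new_text],
    embeddings=[text_to_embedding(new_text)],
    metadatas=[{"category": "animal"}],
)
```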
Step 7. Perform semantic search
In this step, you will search for "a swimming animal", which doesn't directly match any words in the existing documents.

The following code uses the `text_to_embedding()` function again to convert the query text into a vector embedding, and then queries the vector store with the embedding to find the top three closest matches.
```python
def print_result(query, result):
    print(f"Search result (\"{query}\"):")
    for r in result:
        print(f"- text: \"{r.document}\", distance: {r.distance}")

query = "a swimming animal"
query_embedding = text_to_embedding(query)
search_result = vector_store.query(query_embedding, k=3)
print_result(query, search_result)
```
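The `tidb-vector` client can also filter results by the metadata stored in Step 6. The exact filter syntax may vary by SDK version, so treat the following as a sketch and confirm it against the tidb-vector documentation:

```python
# Sketch: restrict the search to documents whose metadata matches
# {"category": "animal"}; verify the filter syntax against the SDK docs.
filtered_result = vector_store.query(
    query_embedding,
    k=3,
    filter={"category": "animal"},
)
print_result(query, filtered_result)
```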
Run the `example.py` file, and the output is as follows:

```
Search result ("a swimming animal"):
- text: "fish", distance: 0.4562914811223072
- text: "dog", distance: 0.6469335836410557
- text: "tree", distance: 0.798545178640937
```
From the output, the swimming animal is most likely a fish, or a dog with a gift for swimming.
This demonstration shows how vector search can efficiently locate the most relevant documents, with results ordered by vector distance: the smaller the distance, the more relevant the document.
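The reported distances are vector distances between the query embedding and each document embedding. Assuming the cosine metric (a natural fit for a `cos-v5` model; confirm which metric your vector store uses), the distance equals 1 minus cosine similarity, which you can recompute yourself. A minimal sketch:

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1 - dot / (norm_a * norm_b)

# Should roughly match the "fish" distance reported by the vector store.
print(cosine_distance(text_to_embedding("a swimming animal"), text_to_embedding("fish")))
```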