Integrate TiDB Vector Search with Jina AI Embeddings API
This tutorial shows how to use Jina AI to generate embeddings for text data, store the embeddings in TiDB vector storage, and search for similar texts based on those embeddings.
Prerequisites
To complete this tutorial, you need:
- Python 3.8 or higher installed.
- Git installed.
- A TiDB Cloud Serverless cluster. If you don't have one, follow the guide on creating a TiDB Cloud Serverless cluster to create your own.
Run the sample app
To learn how to integrate TiDB Vector Search with Jina AI embeddings, follow the steps below.
Step 1. Clone the repository
Clone the tidb-vector-python repository to your local machine:
git clone https://github.com/pingcap/tidb-vector-python.git
Step 2. Create a virtual environment
Create a virtual environment for your project:
cd tidb-vector-python/examples/jina-ai-embeddings-demo
python3 -m venv .venv
source .venv/bin/activate
Step 3. Install required dependencies
Install the required dependencies for the demo project:
pip install -r requirements.txt
Step 4. Configure the environment variables
Get the Jina AI API key from the Jina AI Embeddings API page. Then, obtain the cluster connection string and configure environment variables as follows:
Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.
Click Connect in the upper-right corner. A connection dialog is displayed.
Ensure the configurations in the connection dialog match your operating environment.
- Connection Type is set to Public
- Branch is set to main
- Connect With is set to SQLAlchemy
- Operating System matches your environment.
Switch to the PyMySQL tab and click the Copy icon to copy the connection string.
Set the Jina AI API key and the TiDB connection string as environment variables in your terminal, or create a .env file with the following environment variables:
JINAAI_API_KEY="****"
TIDB_DATABASE_URL="{tidb_connection_string}"
The following is an example connection string for macOS:
TIDB_DATABASE_URL="mysql+pymysql://<prefix>.root:<password>@gateway01.<region>.prod.aws.tidbcloud.com:4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
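Before running the demo, it can help to confirm that the connection string has the expected shape. The following is a minimal sketch that uses only the Python standard library and does not connect to the cluster. The helper name describe_connection_url and the sample user, password, and region values are hypothetical placeholders, not part of the demo project.

```python
from urllib.parse import urlsplit, parse_qs


def describe_connection_url(url: str) -> dict:
    """Split a SQLAlchemy-style URL into its parts without connecting."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    return {
        'driver': parts.scheme,                  # for example, mysql+pymysql
        'user': parts.username,
        'host': parts.hostname,
        'port': parts.port,
        'database': parts.path.lstrip('/'),
        'ssl_ca': query.get('ssl_ca', [None])[0],
        'verifies_cert': query.get('ssl_verify_cert', ['false'])[0] == 'true',
    }


# Hypothetical values that follow the shape of the example above.
example_url = (
    'mysql+pymysql://prefix.root:secret@gateway01.us-east-1.prod.aws.tidbcloud.com'
    ':4000/test?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true'
)
info = describe_connection_url(example_url)
print(info)
```

If the password contains characters such as @ or /, it most likely needs to be percent-encoded before it is placed in the URL.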
Step 5. Run the demo
python jina-ai-embeddings-demo.py
Example output:
- Inserting Data to TiDB...
- Inserting: Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.
- Inserting: TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.
- List All Documents and Their Distances to the Query:
- distance: 0.3585317326132522
content: Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.
- distance: 0.10858102967720984
content: TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.
- The Most Relevant Document and Its Distance to the Query:
- distance: 0.10858102967720984
content: TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.
Sample code snippets
Get embeddings from Jina AI
Define a generate_embeddings helper function to call the Jina AI embeddings API:
import os
import requests
import dotenv

dotenv.load_dotenv()

JINAAI_API_KEY = os.getenv('JINAAI_API_KEY')

def generate_embeddings(text: str):
    JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'
    JINAAI_HEADERS = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {JINAAI_API_KEY}'
    }
    JINAAI_REQUEST_DATA = {
        'input': [text],
        'model': 'jina-embeddings-v2-base-en'  # with dimension 768.
    }
    response = requests.post(JINAAI_API_URL, headers=JINAAI_HEADERS, json=JINAAI_REQUEST_DATA)
    return response.json()['data'][0]['embedding']
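Because the request body already passes the input as a list, several texts could plausibly be embedded in one call. The sketch below separates building the request from parsing the response so that both parts can be exercised without network access. The function names build_embedding_request and parse_embeddings are hypothetical, and the handling of an index field on each response item is an assumption about the response format, with a fallback to list order.

```python
from typing import Any, Dict, List, Tuple

JINAAI_API_URL = 'https://api.jina.ai/v1/embeddings'


def build_embedding_request(texts: List[str], api_key: str) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Return the headers and JSON payload for a batch of texts."""
    headers = {
        'Content-Type': 'application/json',
        'Authorization': f'Bearer {api_key}',
    }
    payload = {
        'input': list(texts),
        'model': 'jina-embeddings-v2-base-en',
    }
    return headers, payload


def parse_embeddings(response_json: Dict[str, Any]) -> List[List[float]]:
    """Extract one embedding per input text from a decoded response body."""
    items = response_json['data']
    # Assumption: an item may carry an 'index' field with its input position.
    # If the field is absent, the item's position in the list is used instead.
    ordered = sorted(enumerate(items), key=lambda pair: pair[1].get('index', pair[0]))
    return [item['embedding'] for _, item in ordered]


headers, payload = build_embedding_request(['first text', 'second text'], 'my-api-key')

# A hand-written stand-in for a decoded response, used instead of a real call.
fake_response = {
    'data': [
        {'index': 1, 'embedding': [0.3, 0.4]},
        {'index': 0, 'embedding': [0.1, 0.2]},
    ]
}
embeddings = parse_embeddings(fake_response)
print(embeddings)
```

In real use, the headers and payload would be sent with requests.post(JINAAI_API_URL, headers=headers, json=payload), and calling raise_for_status() on the response before parsing surfaces HTTP errors early.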
Connect to the TiDB cluster
Connect to the TiDB cluster through SQLAlchemy:
import os
import dotenv
from sqlalchemy import create_engine
from sqlalchemy.orm import Session, declarative_base
from tidb_vector.sqlalchemy import VectorType

dotenv.load_dotenv()

TIDB_DATABASE_URL = os.getenv('TIDB_DATABASE_URL')
assert TIDB_DATABASE_URL is not None
engine = create_engine(url=TIDB_DATABASE_URL, pool_recycle=300)
Define the vector table schema
Create a table named jinaai_tidb_demo_documents with a content column for storing texts and a vector column named content_vec for storing embeddings:
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base
from tidb_vector.sqlalchemy import VectorType

Base = declarative_base()

class Document(Base):
    __tablename__ = "jinaai_tidb_demo_documents"

    id = Column(Integer, primary_key=True)
    content = Column(String(255), nullable=False)
    content_vec = Column(
        # DIMENSIONS is determined by the embedding model,
        # for Jina AI's jina-embeddings-v2-base-en model it's 768.
        VectorType(dim=768),
        comment="hnsw(distance=cosine)"
    )

# Create the table if it does not exist yet.
# This uses the engine created in the previous step.
Base.metadata.create_all(engine)
Create embeddings with Jina AI and store in TiDB
Use the Jina AI Embeddings API to generate embeddings for each piece of text and store the embeddings in TiDB:
TEXTS = [
    'Jina AI offers best-in-class embeddings, reranker and prompt optimizer, enabling advanced multimodal AI.',
    'TiDB is an open-source MySQL-compatible database that supports Hybrid Transactional and Analytical Processing (HTAP) workloads.',
]
data = []

for text in TEXTS:
    # Generate embeddings for the texts via Jina AI API.
    embedding = generate_embeddings(text)
    data.append({
        'text': text,
        'embedding': embedding
    })

with Session(engine) as session:
    print('- Inserting Data to TiDB...')
    for item in data:
        print(f' - Inserting: {item["text"]}')
        session.add(Document(
            content=item['text'],
            content_vec=item['embedding']
        ))
    session.commit()
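Because the vector column is declared with a fixed dimension of 768, an embedding of a different length would not fit it. As a sketch, a small check before the insert can catch a model or configuration mismatch early. The helper name check_embedding is hypothetical and is not part of the demo project.

```python
import math
from typing import Sequence

# Dimension of the content_vec column defined for jina-embeddings-v2-base-en.
EXPECTED_DIM = 768


def check_embedding(embedding: Sequence[float], dim: int = EXPECTED_DIM) -> None:
    """Raise ValueError if the embedding does not match the column definition."""
    if len(embedding) != dim:
        raise ValueError(f'expected {dim} dimensions, got {len(embedding)}')
    if not all(math.isfinite(value) for value in embedding):
        raise ValueError('embedding contains NaN or infinite values')


# A vector of the right length passes silently.
check_embedding([0.0] * 768)

# A vector of the wrong length is rejected.
try:
    check_embedding([0.0] * 512)
    rejected = False
except ValueError:
    rejected = True
print(rejected)
```

Calling check_embedding(embedding) inside the loop that builds the data list, right after generate_embeddings(text), would be one place to apply it.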
Perform semantic search with Jina AI embeddings in TiDB
Generate the embedding for the query text via the Jina AI embeddings API, and then search for the most relevant document based on the cosine distance between the query embedding and each embedding in the vector table. A smaller distance means a more relevant document:
query = 'What is TiDB?'
# Generate the embedding for the query via Jina AI API.
query_embedding = generate_embeddings(query)

with Session(engine) as session:
    print('- The Most Relevant Document and Its Distance to the Query:')
    doc, distance = session.query(
        Document,
        Document.content_vec.cosine_distance(query_embedding).label('distance')
    ).order_by(
        'distance'
    ).limit(1).first()
    print(f' - distance: {distance}\n'
          f' content: {doc.content}')
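To make the ranking criterion concrete, the following sketch computes cosine distance in plain Python, defined here as 1 minus the cosine similarity, and ranks two toy documents against a query. The two-dimensional vectors and document labels are made up for illustration; real embeddings from jina-embeddings-v2-base-en have 768 dimensions, and in the demo the database computes the distance.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity for two non-zero vectors of equal length."""
    if len(a) != len(b):
        raise ValueError('vectors must have the same length')
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError('cosine distance is undefined for a zero vector')
    return 1.0 - dot / (norm_a * norm_b)


# Toy stand-ins for stored embeddings.
toy_docs = {
    'about databases': [1.0, 0.0],
    'about embeddings': [0.0, 1.0],
}
toy_query = [0.9, 0.1]

# Sort labels by ascending distance, so the most relevant document comes first.
ranked = sorted(toy_docs, key=lambda name: cosine_distance(toy_docs[name], toy_query))
print(ranked)
```

Vectors that point in the same direction have a distance of 0, and orthogonal vectors have a distance of 1, which is why the demo orders by distance in ascending order and takes the first row.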