Integrate Vector Search with LlamaIndex
This tutorial demonstrates how to integrate the vector search feature of TiDB with LlamaIndex.
Prerequisites
To complete this tutorial, you need:
- Python 3.8 or higher installed.
- Jupyter Notebook installed.
- Git installed.
- A TiDB cluster.
If you don't have a TiDB cluster, you can create one as follows:
- Follow Deploy a local test TiDB cluster or Deploy a production TiDB cluster to create a local cluster.
- Follow Creating a TiDB Cloud Serverless cluster to create your own TiDB Cloud cluster.
Get started
This section provides step-by-step instructions for integrating TiDB Vector Search with LlamaIndex to perform semantic searches.
Step 1. Create a new Jupyter Notebook file
In the root directory of your project, create a new Jupyter Notebook file named integrate_with_llamaindex.ipynb:
touch integrate_with_llamaindex.ipynb
Step 2. Install required dependencies
In your project directory, run the following command to install the required packages:
pip install llama-index-vector-stores-tidbvector
pip install llama-index
Open the integrate_with_llamaindex.ipynb file in Jupyter Notebook and add the following code to import the required packages:
import textwrap
from llama_index.core import SimpleDirectoryReader, StorageContext
from llama_index.core import VectorStoreIndex
from llama_index.vector_stores.tidbvector import TiDBVectorStore
Step 3. Configure environment variables
Configure the environment variables depending on the TiDB deployment option you've selected.
- TiDB Cloud Serverless
- TiDB Self-Managed
For a TiDB Cloud Serverless cluster, take the following steps to obtain the cluster connection string and configure environment variables:
Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.
Click Connect in the upper-right corner. A connection dialog is displayed.
Ensure the configurations in the connection dialog match your operating environment.
- Connection Type is set to Public.
- Branch is set to main.
- Connect With is set to SQLAlchemy.
- Operating System matches your environment.
Click the PyMySQL tab and copy the connection string.
Configure environment variables.
This document uses OpenAI as the embedding model provider. In this step, you need to provide the connection string obtained in the previous step and your OpenAI API key.
To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key:
# Use getpass to securely prompt for environment variables in your terminal.
import getpass
import os
# Copy your connection string from the TiDB Cloud console.
# Connection string format: "mysql+pymysql://<USER>:<PASSWORD>@<HOST>:4000/<DB>?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
tidb_connection_string = getpass.getpass("TiDB Connection String:")
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
For a TiDB Self-Managed cluster, configure the environment variables as follows. This document uses OpenAI as the embedding model provider. In this step, you need to provide the connection string of your TiDB cluster and your OpenAI API key.
To configure the environment variables, run the following code. You will be prompted to enter your connection string and OpenAI API key:
# Use getpass to securely prompt for environment variables in your terminal.
import getpass
import os
# Connection string format: "mysql+pymysql://<USER>:<PASSWORD>@<HOST>:4000/<DB>?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
tidb_connection_string = getpass.getpass("TiDB Connection String:")
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")
Taking macOS as an example, the cluster connection string is as follows:
TIDB_DATABASE_URL="mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE_NAME>"
# For example: TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test"
You need to modify the parameters in the connection string according to your TiDB cluster. If you are running TiDB on your local machine, <HOST> is 127.0.0.1 by default. The initial <PASSWORD> is empty, so if you are starting the cluster for the first time, you can omit this field.
The following are descriptions for each parameter:
- <USERNAME>: The username to connect to the TiDB cluster.
- <PASSWORD>: The password to connect to the TiDB cluster.
- <HOST>: The host of the TiDB cluster.
- <PORT>: The port of the TiDB cluster.
- <DATABASE_NAME>: The name of the database you want to connect to.
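If you prefer to assemble the connection string from its parts rather than typing it by hand, the following minimal sketch shows one way to do it. The helper name build_tidb_url is hypothetical and not part of TiDB or LlamaIndex. It percent-encodes the credentials, which matters when a password contains characters such as "@" or ":" that would otherwise be read as URL delimiters.

```python
from urllib.parse import quote_plus, urlsplit


def build_tidb_url(username, password, host, port, database):
    """Assemble a SQLAlchemy-style URL for the PyMySQL driver (hypothetical helper)."""
    # Percent-encode credentials so that characters such as "@" or ":"
    # cannot be mistaken for URL delimiters.
    credentials = quote_plus(username)
    if password:
        # An empty password is omitted entirely, as on a fresh local cluster.
        credentials += ":" + quote_plus(password)
    return f"mysql+pymysql://{credentials}@{host}:{port}/{database}"


# A local cluster with the default root user and an empty password.
local_url = build_tidb_url("root", "", "127.0.0.1", 4000, "test")
print(local_url)

# A password with reserved characters still parses back to the right host.
parts = urlsplit(build_tidb_url("app", "p@ss:word", "127.0.0.1", 4000, "test"))
print(parts.hostname, parts.port, parts.path)
```

The resulting string can be entered at the TiDB Connection String prompt. Any query parameters your cluster requires, such as the ssl_ca options shown in the format comment, still need to be appended.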
Step 4. Load the sample document
Step 4.1 Download the sample document
In your project directory, create a directory named data/paul_graham/ and download the sample document paul_graham_essay.txt from the run-llama/llama_index GitHub repository.
!mkdir -p 'data/paul_graham/'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/docs/examples/data/paul_graham/paul_graham_essay.txt' -O 'data/paul_graham/paul_graham_essay.txt'
Step 4.2 Load the document
Load the sample document from data/paul_graham/paul_graham_essay.txt using the SimpleDirectoryReader class.
documents = SimpleDirectoryReader("./data/paul_graham").load_data()
print("Document ID:", documents[0].doc_id)
# Tag every loaded document so that it can be matched by metadata filters later.
for document in documents:
    document.metadata = {"book": "paul_graham"}
Step 5. Embed and store document vectors
Step 5.1 Initialize the TiDB vector store
The following code creates a table named paul_graham_test in TiDB, which is optimized for vector search.
tidbvec = TiDBVectorStore(
    connection_string=tidb_connection_string,  # The variable set in Step 3.
    table_name="paul_graham_test",
    distance_strategy="cosine",
    vector_dimension=1536,
    drop_existing_table=False,
)
Upon successful execution, you can directly view and access the paul_graham_test table in your TiDB database.
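The distance_strategy="cosine" setting means that stored vectors are compared by cosine distance, that is, one minus the cosine of the angle between them. The following minimal sketch computes that quantity in plain Python on toy two-dimensional vectors. It illustrates the metric only and is not how TiDB evaluates it internally.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite direction
```

Because the metric depends only on direction, vectors of different lengths that point the same way have a distance of zero. Smaller values mean closer matches.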
Step 5.2 Generate and store embeddings
The following code parses the documents, generates embeddings, and stores them in the TiDB vector store.
storage_context = StorageContext.from_defaults(vector_store=tidbvec)
index = VectorStoreIndex.from_documents(
    documents, storage_context=storage_context, show_progress=True
)
The expected output is as follows:
Parsing nodes: 100%|██████████| 1/1 [00:00<00:00, 8.76it/s]
Generating embeddings: 100%|██████████| 21/21 [00:02<00:00, 8.22it/s]
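The progress bars show one document parsed but 21 embeddings generated, because the document is split into smaller chunks before each chunk is embedded. The following simplified sketch shows the idea with overlapping windows of words. The function name and the sizes are hypothetical, and LlamaIndex's own splitter is token- and sentence-aware, which this sketch does not attempt to reproduce.

```python
def split_into_chunks(words, chunk_size, overlap):
    """Split a list of words into overlapping windows (illustrative only)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(words):
            break
    return chunks


words = [f"w{i}" for i in range(10)]
chunks = split_into_chunks(words, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Each chunk shares its first word with the end of the previous one, so text near a boundary appears in both chunks.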
Step 6. Perform a vector search
The following code creates a query engine based on the TiDB vector store and performs a semantic similarity search.
query_engine = index.as_query_engine()
response = query_engine.query("What did the author do?")
print(textwrap.fill(str(response), 100))
The expected output is as follows:
The author worked on writing, programming, building microcomputers, giving talks at conferences,
publishing essays online, developing spam filters, painting, hosting dinner parties, and purchasing
a building for office use.
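Behind the query engine, the question is embedded, the stored chunks closest to it are retrieved, and those chunks are passed to the language model to write the answer. The following minimal sketch covers only the retrieval part, using made-up two-dimensional embeddings. The row layout and the top_k helper are assumptions for illustration and do not reflect the actual table schema.

```python
import heapq
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


def top_k(query_vector, rows, k):
    """Return the k rows whose embeddings are closest to the query vector."""
    return heapq.nsmallest(
        k, rows, key=lambda row: cosine_distance(query_vector, row["embedding"])
    )


# Toy rows standing in for stored chunks and their embeddings.
rows = [
    {"text": "chunk about painting", "embedding": [0.9, 0.1]},
    {"text": "chunk about programming", "embedding": [0.1, 0.9]},
    {"text": "chunk about startups", "embedding": [0.6, 0.6]},
]
nearest = top_k([0.0, 1.0], rows, k=2)
print([row["text"] for row in nearest])
```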
Step 7. Search with metadata filters
To refine your searches, you can use metadata filters to retrieve specific nearest-neighbor results that match the applied filters.
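As a rough mental model, a metadata filter discards non-matching chunks so that only the remaining ones are ranked by similarity. The following sketch imitates the == and != operators used in this step on in-memory rows. The apply_filter helper and the row layout are hypothetical, and the way the real store treats rows that lack the key is not modeled here.

```python
import operator

# Map the operator strings used in this tutorial to Python comparisons.
OPERATORS = {"==": operator.eq, "!=": operator.ne}


def apply_filter(rows, key, value, op):
    """Keep the rows whose metadata[key] satisfies the comparison."""
    compare = OPERATORS[op]
    return [row for row in rows if compare(row["metadata"].get(key), value)]


# Every chunk carries the metadata assigned when the document was loaded.
rows = [
    {"text": "chunk 1", "metadata": {"book": "paul_graham"}},
    {"text": "chunk 2", "metadata": {"book": "paul_graham"}},
]
print(len(apply_filter(rows, "book", "paul_graham", "==")))
print(len(apply_filter(rows, "book", "paul_graham", "!=")))
```

Because Step 4.2 tags every document with book set to "paul_graham", a != filter on that value leaves nothing to rank, while an == filter keeps every chunk.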
Query with book != "paul_graham" filter
The following example excludes results where the book metadata field is "paul_graham":
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)
query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="book", value="paul_graham", operator="!="),
        ]
    ),
    similarity_top_k=2,
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
The expected output is as follows:
Empty Response
Query with book == "paul_graham" filter
The following example filters results to include only documents where the book metadata field is "paul_graham":
from llama_index.core.vector_stores.types import (
    MetadataFilter,
    MetadataFilters,
)
query_engine = index.as_query_engine(
    filters=MetadataFilters(
        filters=[
            MetadataFilter(key="book", value="paul_graham", operator="=="),
        ]
    ),
    similarity_top_k=2,
)
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
The expected output is as follows:
The author learned programming on an IBM 1401 using an early version of Fortran in 9th grade, then
later transitioned to working with microcomputers like the TRS-80 and Apple II. Additionally, the
author studied philosophy in college but found it unfulfilling, leading to a switch to studying AI.
Later on, the author attended art school in both the US and Italy, where they observed a lack of
substantial teaching in the painting department.
Step 8. Delete documents
Delete the first document from the index:
tidbvec.delete(documents[0].doc_id)
Check whether the document has been deleted:
query_engine = index.as_query_engine()
response = query_engine.query("What did the author learn?")
print(textwrap.fill(str(response), 100))
The expected output is as follows:
Empty Response
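The empty response follows from the fact that only one document was loaded in Step 4, so every stored chunk originates from documents[0]. The following sketch mimics deletion by source document ID on in-memory rows. The row layout and the ref_doc_id field name are illustrative assumptions, not the actual table schema.

```python
def delete_by_doc_id(rows, ref_doc_id):
    """Return only the rows that do not originate from the given document."""
    return [row for row in rows if row["ref_doc_id"] != ref_doc_id]


# Two chunks that were both produced from the same source document.
rows = [
    {"ref_doc_id": "doc-1", "text": "chunk 1"},
    {"ref_doc_id": "doc-1", "text": "chunk 2"},
]
remaining = delete_by_doc_id(rows, "doc-1")
print(len(remaining))
```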