Hybrid Search
With full-text search, you retrieve documents by exact keywords. With vector search, you retrieve documents by semantic similarity. Hybrid search combines the two methods to improve retrieval quality and handle more scenarios, and it is commonly used in AI applications.
A general workflow of hybrid search in TiDB is as follows:
- Use TiDB for full-text search and vector search.
- Use a reranker to combine the results from both searches.
This tutorial demonstrates how to use hybrid search in TiDB with the pytidb Python SDK, which provides built-in support for embedding and reranking. Using pytidb is completely optional: you can run the searches directly in SQL and use your own reranking model.
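To make the "combine the results" step of the workflow concrete, here is a minimal sketch of one common way to merge two ranked lists: reciprocal rank fusion (RRF). It is an illustration only, not a description of what pytidb or any particular reranking model does internally. The function name `reciprocal_rank_fusion`, the constant `k=60`, and the sample document IDs are all assumptions made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one list.

    Each document scores 1 / (k + rank) in every list it appears in
    (rank starts at 1), and the scores are summed across lists.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results: document IDs ordered best-first by each search method.
fulltext_hits = [3, 1, 2]
vector_hits = [1, 4, 3]

fused = reciprocal_rank_fusion([fulltext_hits, vector_hits])
print(fused)
```

A document that ranks well in both lists (ID 1 here) rises to the top of the fused list. A model-based reranker, like the one used later in this tutorial, instead scores each candidate against the query text.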
Prerequisites
Hybrid search relies on both full-text search and vector search. Full-text search is still in the early stages, and we are continuously rolling it out to more customers. Currently, full-text search is only available for the following product option and regions:
- TiDB Cloud Serverless: Frankfurt (eu-central-1) and Singapore (ap-southeast-1)
To complete this tutorial, make sure you have a TiDB Cloud Serverless cluster in a supported region. If you don't have one, follow Creating a TiDB Cloud Serverless cluster to create it.
Get started
Step 1. Install the pytidb Python SDK
pip install "pytidb[models]"
# (Alternative) If you don't want to use built-in embedding functions and rerankers:
# pip install pytidb
# (Optional) To convert query results to pandas DataFrame:
# pip install pandas
Step 2. Connect to TiDB
from pytidb import TiDBClient

db = TiDBClient.connect(
    host="HOST_HERE",
    port=4000,
    username="USERNAME_HERE",
    password="PASSWORD_HERE",
    database="DATABASE_HERE",
)
You can get these connection parameters from the TiDB Cloud console:
Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.
Click Connect in the upper-right corner. A connection dialog is displayed, with connection parameters listed.
For example, if the connection parameters are displayed as follows:
HOST: gateway01.us-east-1.prod.shared.aws.tidbcloud.com
PORT: 4000
USERNAME: 4EfqPF23YKBxaQb.root
PASSWORD: abcd1234
DATABASE: test
CA: /etc/ssl/cert.pem
The corresponding Python code to connect to the TiDB Cloud Serverless cluster would be as follows:
db = TiDBClient.connect(
    host="gateway01.us-east-1.prod.shared.aws.tidbcloud.com",
    port=4000,
    username="4EfqPF23YKBxaQb.root",
    password="abcd1234",
    database="test",
)
Note that the preceding example is for demonstration purposes only. You need to fill in the parameters with your own values and keep them secure.
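One way to keep connection parameters out of your source code is to read them from environment variables. The sketch below assumes hypothetical variable names (`TIDB_HOST`, `TIDB_PORT`, `TIDB_USERNAME`, `TIDB_PASSWORD`, `TIDB_DATABASE`) and a hypothetical helper `load_tidb_config`; neither is part of pytidb. The helper only builds the keyword arguments shown in the connection example above.

```python
import os


def load_tidb_config(env=os.environ):
    """Build keyword arguments for TiDBClient.connect from environment variables.

    The TIDB_* variable names are hypothetical; use whatever names fit your setup.
    """
    required = ["TIDB_HOST", "TIDB_USERNAME", "TIDB_PASSWORD", "TIDB_DATABASE"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {
        "host": env["TIDB_HOST"],
        "port": int(env.get("TIDB_PORT", "4000")),  # default port used in this tutorial
        "username": env["TIDB_USERNAME"],
        "password": env["TIDB_PASSWORD"],
        "database": env["TIDB_DATABASE"],
    }


# Usage (requires pytidb to be installed and the variables to be set):
# from pytidb import TiDBClient
# db = TiDBClient.connect(**load_tidb_config())
```

Failing early with a list of the missing variables tends to be easier to debug than a connection error caused by an empty host or password.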
Step 3. Create a table
As an example, create a table named chunks with the following columns:
- id (int): the ID of the chunk.
- text (text): the text content of the chunk.
- text_vec (vector): the vector representation of the text, automatically generated by the embedding model in pytidb.
- user_id (int): the ID of the user who created the chunk.
from pytidb.schema import TableModel, Field
from pytidb.embeddings import EmbeddingFunction

text_embed = EmbeddingFunction("openai/text-embedding-3-small")

class Chunk(TableModel, table=True):
    __tablename__ = "chunks"

    id: int = Field(primary_key=True)
    text: str = Field()
    text_vec: list[float] = text_embed.VectorField(
        source_field="text"
    )  # 👈 Define the vector field.
    user_id: int = Field()

table = db.create_table(schema=Chunk)
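To illustrate what a vector column such as text_vec makes possible, the following sketch ranks a few rows by cosine similarity to a query vector. It is a toy under stated assumptions: the three-dimensional vectors are hand-written stand-ins for real embeddings, the row labels are invented, and cosine similarity is just one common measure, not necessarily the metric your TiDB vector search is configured to use.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hand-written toy vectors standing in for real embeddings (hypothetical values).
rows = {
    "cat": [0.9, 0.1, 0.0],
    "kitten": [0.8, 0.3, 0.1],
    "invoice": [0.0, 0.2, 0.9],
}
query_vec = [1.0, 0.2, 0.0]  # pretend embedding of a query about cats

# Rank rows by similarity to the query, most similar first.
ranked = sorted(
    rows,
    key=lambda name: cosine_similarity(query_vec, rows[name]),
    reverse=True,
)
print(ranked)
```

Rows whose vectors point in a similar direction to the query rank first even when they share no keywords with it, which is the gap that full-text search alone leaves open.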
Step 4. Insert data
table.bulk_insert(
    [
        Chunk(id=2, text="bar", user_id=2),  # 👈 The text field will be embedded to a
        Chunk(id=3, text="baz", user_id=3),  #    vector and stored in the "text_vec" field
        Chunk(id=4, text="qux", user_id=4),  #    automatically.
    ]
)
Step 5. Perform a hybrid search
In this example, use the jina-reranker model to rerank the search results.
from pytidb.rerankers import Reranker

jinaai = Reranker(model_name="jina_ai/jina-reranker-m0")

df = (
    table.search("<query>", search_type="hybrid")
    .rerank(jinaai, "text")  # 👈 Rerank the query result using the jinaai model.
    .limit(2)
    .to_pandas()
)
For a complete example, see pytidb hybrid search demo.
Feedback & help
Full-text search is still in the early stages and has limited availability. If you would like to try full-text search in a region where it is not yet available, or if you have feedback or need help, feel free to reach out to us: