Hybrid Search
By using full-text search, you can retrieve documents based on exact keywords. By using vector search, you can retrieve documents based on semantic similarity. Can we combine these two search methods to improve the retrieval quality and handle more scenarios? Yes, this approach is known as hybrid search and is commonly used in AI applications.
A general workflow of hybrid search in TiDB is as follows:
- Use TiDB for full-text search and vector search.
- Use a reranker to combine the results from both searches.
This tutorial demonstrates how to use hybrid search in TiDB with the pytidb Python SDK, which provides built-in support for embedding and reranking. Using pytidb is completely optional: you can run the searches directly in SQL and use your own reranking model.
Prerequisites
Full-text search is still in the early stages, and we are continuously rolling it out to more customers. Currently, full-text search is only available on TiDB Cloud Starter and TiDB Cloud Essential in the following regions:
- AWS: Frankfurt (eu-central-1) and Singapore (ap-southeast-1)
To complete this tutorial, make sure you have a TiDB Cloud Starter cluster in a supported region. If you don't have one, follow Creating a TiDB Cloud Starter cluster to create it.
Get started
Step 1. Install the pytidb Python SDK
pip install "pytidb[models]"
# (Alternative) If you don't want to use built-in embedding functions and rerankers:
# pip install pytidb
# (Optional) To convert query results to pandas DataFrame:
# pip install pandas
Step 2. Connect to TiDB
from pytidb import TiDBClient
db = TiDBClient.connect(
    host="HOST_HERE",
    port=4000,
    username="USERNAME_HERE",
    password="PASSWORD_HERE",
    database="DATABASE_HERE",
)
You can get these connection parameters from the TiDB Cloud console as follows:
Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.
Click Connect in the upper-right corner. A connection dialog is displayed, with connection parameters listed.
For example, if the connection parameters are displayed as follows:
HOST: gateway01.us-east-1.prod.shared.aws.tidbcloud.com
PORT: 4000
USERNAME: 4EfqPF23YKBxaQb.root
PASSWORD: abcd1234
DATABASE: test
CA: /etc/ssl/cert.pem
The corresponding Python code to connect to the TiDB Cloud Starter cluster would be as follows:
db = TiDBClient.connect(
    host="gateway01.us-east-1.prod.shared.aws.tidbcloud.com",
    port=4000,
    username="4EfqPF23YKBxaQb.root",
    password="abcd1234",
    database="test",
)
Note that the preceding example is for demonstration purposes only. You need to fill in the parameters with your own values and keep them secure.
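One way to keep the connection parameters out of your source code is to read them from environment variables. The following is a minimal sketch, not part of pytidb: the helper `load_tidb_params` and the variable names `TIDB_HOST`, `TIDB_PORT`, `TIDB_USERNAME`, `TIDB_PASSWORD`, and `TIDB_DATABASE` are assumptions chosen for illustration. Only the keyword arguments of `TiDBClient.connect()` come from the example above.

```python
import os


def load_tidb_params(env=None):
    """Build keyword arguments for TiDBClient.connect() from environment variables.

    The TIDB_* variable names are hypothetical; use whatever names fit your setup.
    """
    env = os.environ if env is None else env
    required = ["TIDB_HOST", "TIDB_USERNAME", "TIDB_PASSWORD", "TIDB_DATABASE"]
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["TIDB_HOST"],
        # Fall back to 4000, the port used in the example above.
        "port": int(env.get("TIDB_PORT", "4000")),
        "username": env["TIDB_USERNAME"],
        "password": env["TIDB_PASSWORD"],
        "database": env["TIDB_DATABASE"],
    }


# Usage (requires pytidb and a reachable cluster):
# from pytidb import TiDBClient
# db = TiDBClient.connect(**load_tidb_params())
```

With this approach, the password never appears in the script or in version control.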
Step 3. Create a table
As an example, create a table named chunks with the following columns:
- id (int): the ID of the chunk.
- text (text): the text content of the chunk.
- text_vec (vector): the vector representation of the text, automatically generated by the embedding model in pytidb.
- user_id (int): the ID of the user who created the chunk.
from pytidb.schema import TableModel, Field
from pytidb.embeddings import EmbeddingFunction
text_embed = EmbeddingFunction("openai/text-embedding-3-small")
class Chunk(TableModel, table=True):
    __tablename__ = "chunks"
    id: int = Field(primary_key=True)
    text: str = Field()
    text_vec: list[float] = text_embed.VectorField(
        source_field="text"
    )  # 👈 Define the vector field.
    user_id: int = Field()
table = db.create_table(schema=Chunk)
Step 4. Insert data
table.bulk_insert(
    [
        Chunk(id=2, text="bar", user_id=2),  # 👈 The text field will be embedded to a
        Chunk(id=3, text="baz", user_id=3),  # vector and stored in the "text_vec" field
        Chunk(id=4, text="qux", user_id=4),  # automatically.
    ]
)
Step 5. Perform a hybrid search
In this example, use the jina-reranker model to rerank the search results.
from pytidb.rerankers import Reranker
jinaai = Reranker(model_name="jina_ai/jina-reranker-m0")
df = (
    table.search("<query>", search_type="hybrid")
    .rerank(jinaai, "text")  # 👈 Rerank the query result using the jinaai model.
    .limit(2)
    .to_pandas()
)
For a complete example, see pytidb hybrid search demo.
Fusion methods
Fusion methods combine results from vector (semantic) and full-text (keyword) searches into a single, unified ranking. This ensures that the final results leverage both semantic relevance and keyword matching.
pytidb supports two fusion methods:
- rrf: Reciprocal Rank Fusion (default)
- weighted: Weighted Score Fusion
You can select the fusion method that best fits your use case to optimize hybrid search results.
Reciprocal Rank Fusion (RRF)
Reciprocal Rank Fusion (RRF) is an algorithm that scores each document by its rank position in multiple result sets rather than by its raw scores. This lets it merge results from searches whose scores are not directly comparable.
For more details, see the RRF paper.
Enable reciprocal rank fusion by specifying the method parameter as "rrf" in the .fusion() method.
results = (
    table.search(
        "AI database", search_type="hybrid"
    )
    .fusion(method="rrf")
    .limit(3)
    .to_list()
)
Parameters:
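k: A smoothing constant (default: 60) added to each document's rank in the denominator. It prevents division by zero and controls how much top-ranked documents dominate: a larger k flattens the differences between ranks.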
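To make the rank-based scoring concrete, the following standalone sketch implements the standard RRF formula, in which each document receives the sum of 1 / (k + rank) over every result list it appears in, with ranks starting at 1. It is an illustration of the algorithm only and is not pytidb's internal implementation. The function name `rrf_fuse` and the sample document IDs are made up for this example.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of document IDs with reciprocal rank fusion."""
    scores = defaultdict(float)
    for ranking in rankings:
        # Ranks start at 1, so the best hit in a list contributes 1 / (k + 1).
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical result lists, best match first.
vector_hits = ["doc_a", "doc_b", "doc_c"]
fulltext_hits = ["doc_b", "doc_d", "doc_a"]

fused = rrf_fuse([vector_hits, fulltext_hits], k=60)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.6f}")
```

In this example, doc_b ranks first because it appears near the top of both lists, even though it is not the top vector hit. Documents found by only one search (doc_c and doc_d) fall to the bottom.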
Weighted Score Fusion
Weighted Score Fusion combines vector search and full-text search scores using a weighted sum:
final_score = vs_weight * vector_score + fts_weight * fulltext_score
Enable weighted score fusion by specifying the method parameter as "weighted" in the .fusion() method.
For example, to give more weight to vector search, set the vs_weight parameter to 0.7 and the fts_weight parameter to 0.3:
results = (
    table.search(
        "AI database", search_type="hybrid"
    )
    .fusion(method="weighted", vs_weight=0.7, fts_weight=0.3)
    .limit(3)
    .to_list()
)
Parameters:
- vs_weight: The weight of the vector search score.
- fts_weight: The weight of the full-text search score.
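The weighted sum above can be sketched in plain Python as follows. This is an illustration of the formula only, not pytidb's implementation. In particular, the min-max normalization step is an assumption added here so that vector and full-text scores share a common 0 to 1 scale; this document does not state how pytidb scales the scores. The function names and sample scores are hypothetical.

```python
def min_max_normalize(scores):
    """Rescale a {doc_id: score} mapping to the range [0, 1] (an assumed step)."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so treat every document as a full match.
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - lo) / (hi - lo) for doc_id, s in scores.items()}


def weighted_fuse(vector_scores, fulltext_scores, vs_weight, fts_weight):
    """Compute vs_weight * vector_score + fts_weight * fulltext_score per document."""
    vs = min_max_normalize(vector_scores)
    fts = min_max_normalize(fulltext_scores)
    fused = {
        # A document missing from one result set contributes 0 for that search.
        doc_id: vs_weight * vs.get(doc_id, 0.0) + fts_weight * fts.get(doc_id, 0.0)
        for doc_id in set(vs) | set(fts)
    }
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical raw scores: similarity for vector search, relevance for full-text.
vector_scores = {"doc_a": 0.9, "doc_b": 0.8, "doc_c": 0.5}
fulltext_scores = {"doc_b": 12.0, "doc_d": 7.0, "doc_a": 2.0}

fused = weighted_fuse(vector_scores, fulltext_scores, vs_weight=0.7, fts_weight=0.3)
for doc_id, score in fused:
    print(f"{doc_id}: {score:.3f}")
```

Unlike RRF, this method depends on the magnitude of the scores, so the choice of weights (and of any normalization) directly shifts the final order.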
Rerank method
Hybrid search also supports reranking with dedicated reranker models.
Use the rerank() method to specify a reranker that sorts search results by relevance between the query and the documents.
Example: Using Jina AI Reranker to rerank the hybrid search results
reranker = Reranker(
    # Use the `jina-reranker-m0` model
    model_name="jina_ai/jina-reranker-m0",
    api_key="{your-jinaai-api-key}"
)
results = (
    table.search(
        "AI database", search_type="hybrid"
    )
    .fusion(method="rrf", k=60)
    .rerank(reranker, "text")
    .limit(3)
    .to_list()
)
To check other reranker models, see Reranking.
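Conceptually, a reranker takes the query and the candidate texts, assigns each pair a relevance score, and reorders the candidates by that score. The toy sketch below shows only this flow. The word-overlap scorer `overlap_score` is a deliberately simple stand-in for a real reranker model such as the Jina AI model above, and none of these function names belong to pytidb.

```python
def overlap_score(query, text):
    """Toy relevance score: the fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query, candidates, score_fn=overlap_score, limit=None):
    """Reorder candidate texts by relevance to the query, most relevant first."""
    ranked = sorted(candidates, key=lambda text: score_fn(query, text), reverse=True)
    return ranked if limit is None else ranked[:limit]


# Hypothetical candidates, for example the fused output of a hybrid search.
candidates = [
    "TiDB is a distributed SQL database",
    "An AI database stores vectors",
    "Cats are mammals",
]

top = rerank("AI database", candidates, limit=2)
print(top)
```

A real reranker model replaces the scoring function with a learned measure of relevance, but the surrounding steps (score each candidate, sort, and truncate) stay the same.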
Feedback & help
Full-text search is still in the early stages with limited accessibility. If you would like to try full-text search in a region that is not yet available, or if you have feedback or need help, feel free to reach out to us:
- Ask the community on Discord or Slack.
- Submit a support ticket for TiDB Cloud.