Hybrid Search

Full-text search retrieves documents that match exact keywords, while vector search retrieves documents by semantic similarity. Combining the two improves retrieval quality and covers more scenarios. This approach is known as hybrid search and is commonly used in AI applications.

A general workflow of hybrid search in TiDB is as follows:

Use TiDB for full-text search and vector search.
Use a reranker to combine the results from both searches.

This tutorial demonstrates how to use hybrid search in TiDB with the pytidb Python SDK, which provides built-in support for embedding and reranking. Using pytidb is optional: you can run the searches with SQL directly and apply your own reranking model.

Prerequisites

Full-text search is still in the early stages, and we are continuously rolling it out to more customers. Currently, full-text search is only available on TiDB Cloud Starter and TiDB Cloud Essential in the following regions:

AWS: Frankfurt (eu-central-1) and Singapore (ap-southeast-1)

To complete this tutorial, make sure you have a TiDB Cloud Starter cluster in a supported region. If you don't have one, follow Creating a TiDB Cloud Starter cluster to create it.

Get started

Step 1. Install the pytidb Python SDK

pip install "pytidb[models]"

# (Alternative) If you don't want to use built-in embedding functions and rerankers:
# pip install pytidb

# (Optional) To convert query results to pandas DataFrame:
# pip install pandas

Step 2. Connect to TiDB

from pytidb import TiDBClient

db = TiDBClient.connect(
    host="HOST_HERE",
    port=4000,
    username="USERNAME_HERE",
    password="PASSWORD_HERE",
    database="DATABASE_HERE",
)

You can get these connection parameters from the TiDB Cloud console as follows:

Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.
Click Connect in the upper-right corner. A connection dialog is displayed, with connection parameters listed.
For example, if the connection parameters are displayed as follows:
```
HOST:     gateway01.us-east-1.prod.shared.aws.tidbcloud.com
PORT:     4000
USERNAME: 4EfqPF23YKBxaQb.root
PASSWORD: abcd1234
DATABASE: test
CA:       /etc/ssl/cert.pem
```
The corresponding Python code to connect to the TiDB Cloud Starter cluster would be as follows:
```
db = TiDBClient.connect(
    host="gateway01.us-east-1.prod.shared.aws.tidbcloud.com",
    port=4000,
    username="4EfqPF23YKBxaQb.root",
    password="abcd1234",
    database="test",
)
```
Note that the preceding example is for demonstration purposes only. You need to fill in the parameters with your own values and keep them secure.
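
One way to keep the credentials out of source code is to read them from environment variables. The following is a minimal sketch, not part of pytidb: the helper name load_connection_params and the TIDB_* variable names are hypothetical, and the default port of 4000 is taken from the example above.

```python
import os


def load_connection_params(env=os.environ):
    """Build keyword arguments for TiDBClient.connect() from environment variables.

    The TIDB_* variable names are assumptions made for this sketch.
    """
    return {
        "host": env["TIDB_HOST"],
        "port": int(env.get("TIDB_PORT", "4000")),  # 4000 as in the example above
        "username": env["TIDB_USERNAME"],
        "password": env["TIDB_PASSWORD"],
        "database": env["TIDB_DATABASE"],
    }


# A plain dict stands in for os.environ so the sketch runs without a real cluster.
example_env = {
    "TIDB_HOST": "HOST_HERE",
    "TIDB_USERNAME": "USERNAME_HERE",
    "TIDB_PASSWORD": "PASSWORD_HERE",
    "TIDB_DATABASE": "DATABASE_HERE",
}
params = load_connection_params(example_env)
print(params["host"], params["port"])
```

With real environment variables set, the result can be passed on as TiDBClient.connect(**load_connection_params()).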

Step 3. Create a table

As an example, create a table named chunks with the following columns:

id (int): the ID of the chunk.
text (text): the text content of the chunk.
text_vec (vector): the vector representation of the text, automatically generated by the embedding model in pytidb.
user_id (int): the ID of the user who created the chunk.

from pytidb.schema import TableModel, Field
from pytidb.embeddings import EmbeddingFunction

text_embed = EmbeddingFunction("openai/text-embedding-3-small")

class Chunk(TableModel, table=True):
    __tablename__ = "chunks"

    id: int = Field(primary_key=True)
    text: str = Field()
    text_vec: list[float] = text_embed.VectorField(
        source_field="text"
    )  # 👈 Define the vector field.
    user_id: int = Field()

table = db.create_table(schema=Chunk)

Step 4. Insert data

table.bulk_insert(
    [
        Chunk(id=2, text="bar", user_id=2),   # 👈 The text field will be embedded to a
        Chunk(id=3, text="baz", user_id=3),   # vector and stored in the "text_vec" field
        Chunk(id=4, text="qux", user_id=4),   # automatically.
    ]
)

Step 5. Perform a hybrid search

In this example, the jina-reranker-m0 model from Jina AI reranks the search results.

from pytidb.rerankers import Reranker

jinaai = Reranker(model_name="jina_ai/jina-reranker-m0")

df = (
    table.search("<query>", search_type="hybrid")
    .rerank(jinaai, "text")  # 👈 Rerank the query result using the jinaai model.
    .limit(2)
    .to_pandas()
)

For a complete example, see pytidb hybrid search demo.

Fusion methods

Fusion methods combine results from vector (semantic) and full-text (keyword) searches into a single, unified ranking. This ensures that the final results leverage both semantic relevance and keyword matching.

pytidb supports two fusion methods:

rrf: Reciprocal Rank Fusion (default)
weighted: Weighted Score Fusion

You can select the fusion method that best fits your use case to optimize hybrid search results.

Reciprocal Rank Fusion (RRF)

Reciprocal Rank Fusion (RRF) scores each document by its rank in each result set rather than by its raw score. A document at rank r in a result set contributes 1 / (k + r), and the contributions are summed across all result sets.

For more details, see the RRF paper.

Enable reciprocal rank fusion by specifying the method parameter as "rrf" in the .fusion() method.

results = (
    table.search(
        "AI database", search_type="hybrid"
    )
    .fusion(method="rrf")
    .limit(3)
    .to_list()
)

Parameters:

k: A smoothing constant (default: 60) added to each rank in the denominator. A larger k reduces the advantage of top-ranked documents.
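
To make the role of k concrete, the following is a minimal pure-Python sketch of the standard RRF formula. It illustrates the algorithm only and is not pytidb's internal implementation; the function name reciprocal_rank_fusion and the sample document IDs are made up for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    where rank starts at 1 for the top result.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first; break ties by document ID.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


vector_hits = [3, 1, 2]    # document IDs ordered by vector similarity
fulltext_hits = [1, 4, 3]  # document IDs ordered by keyword relevance

fused = reciprocal_rank_fusion([vector_hits, fulltext_hits])
print([doc_id for doc_id, _ in fused])  # prints [1, 3, 4, 2]
```

Document 1 ranks first because it appears near the top of both lists, even though it is not the top vector hit.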

Weighted Score Fusion

Weighted Score Fusion combines vector search and full-text search scores using a weighted sum:

final_score = vs_weight * vector_score + fts_weight * fulltext_score

Enable weighted score fusion by specifying the method parameter as "weighted" in the .fusion() method.

For example, to give more weight to vector search, set the vs_weight parameter to 0.7 and the fts_weight parameter to 0.3:

results = (
    table.search(
        "AI database", search_type="hybrid"
    )
    .fusion(method="weighted", vs_weight=0.7, fts_weight=0.3)
    .limit(3)
    .to_list()
)

Parameters:

vs_weight: The weight of the vector search score.
fts_weight: The weight of the full-text search score.
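
The formula above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not pytidb's implementation: it assumes both score sets are already on a comparable scale (for example, 0 to 1) and treats a document missing from one result set as scoring 0 there. The function name and sample scores are hypothetical.

```python
def weighted_score_fusion(vector_scores, fulltext_scores, *, vs_weight, fts_weight):
    """Combine two {doc_id: score} mappings with a weighted sum.

    Assumption: a document absent from one mapping scores 0.0 in that search.
    """
    doc_ids = set(vector_scores) | set(fulltext_scores)
    fused = {
        doc_id: vs_weight * vector_scores.get(doc_id, 0.0)
        + fts_weight * fulltext_scores.get(doc_id, 0.0)
        for doc_id in doc_ids
    }
    # Sort by final score, highest first; break ties by document ID.
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


vector_scores = {1: 0.9, 2: 0.4}    # made-up similarity scores
fulltext_scores = {2: 1.0, 3: 0.5}  # made-up keyword scores

ranked = weighted_score_fusion(
    vector_scores, fulltext_scores, vs_weight=0.7, fts_weight=0.3
)
print([doc_id for doc_id, _ in ranked])  # prints [1, 2, 3]
```

Unlike RRF, this method depends on the magnitude of the raw scores, so the weights only behave as intended when the two score ranges are comparable.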

Rerank method

Hybrid search also supports reranking with dedicated reranker models.

Use the rerank() method to specify a reranker that sorts search results by relevance between the query and the documents.

Example: Using Jina AI Reranker to rerank the hybrid search results

from pytidb.rerankers import Reranker

reranker = Reranker(
    # Use the `jina-reranker-m0` model
    model_name="jina_ai/jina-reranker-m0",
    api_key="{your-jinaai-api-key}"
)

results = (
    table.search(
        "AI database", search_type="hybrid"
    )
    .fusion(method="rrf", k=60)
    .rerank(reranker, "text")
    .limit(3)
    .to_list()
)

To check other reranker models, see Reranking.
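
Conceptually, a reranking step scores each candidate against the query and reorders the candidates by that score. The following toy sketch shows that shape only. The token_overlap scorer is a deliberately simple stand-in for a real reranker model and says nothing about how jina-reranker-m0 works; all names here are hypothetical.

```python
def token_overlap(query, document):
    """Toy relevance score: the fraction of query tokens found in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(document.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0


def rerank(query, documents, score_fn, top_n):
    """Score every candidate against the query and keep the top_n best."""
    scored = [(score_fn(query, doc), doc) for doc in documents]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable sort keeps input order on ties
    return [doc for _, doc in scored[:top_n]]


candidates = [
    "TiDB is a distributed SQL database",
    "An AI database stores vectors",
    "Cooking with garlic",
]
top = rerank("AI database", candidates, token_overlap, top_n=2)
print(top)
```

A real reranker replaces token_overlap with a model that reads the query and the document together, which is why it is usually applied only to the small candidate set produced by the fusion step.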

Feedback & help

Full-text search is still in the early stages and has limited availability. If you would like to try full-text search in a region where it is not yet available, or if you have feedback or need help, feel free to reach out to us:

Ask the community on Discord or Slack.
Submit a support ticket for TiDB Cloud.