Hybrid Search

By using full-text search, you can retrieve documents based on exact keywords. By using vector search, you can retrieve documents based on semantic similarity. Can we combine these two search methods to improve the retrieval quality and handle more scenarios? Yes, this approach is known as hybrid search and is commonly used in AI applications.

A general workflow of hybrid search in TiDB is as follows:

  1. Use TiDB for full-text search and vector search.
  2. Use a reranker to combine the results from both searches.
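
To make step 2 concrete, the following is a minimal sketch of merging two ranked result lists. It uses reciprocal rank fusion (RRF), a simple score-fusion technique that is swapped in here purely for illustration. It is not the reranker model used later in this tutorial, and the document IDs and the function name are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in,
    so documents ranked highly by both searches rise to the top.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result IDs, best match first.
fulltext_hits = [3, 1, 2]  # from full-text search
vector_hits = [2, 3, 4]    # from vector search

fused = reciprocal_rank_fusion([fulltext_hits, vector_hits])
print(fused)
```

Documents 3 and 2 appear in both lists, so they outrank documents 1 and 4, which each appear in only one. A model-based reranker replaces this rank arithmetic with a learned relevance score.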

This tutorial demonstrates how to use hybrid search in TiDB with the pytidb Python SDK, which provides built-in support for embedding and reranking. Using pytidb is completely optional: you can run the searches directly in SQL and use any reranking model you like.

Prerequisites

Hybrid search relies on both full-text search and vector search. Full-text search is still in the early stages, and we are continuously rolling it out to more customers. Currently, full-text search is available only for the following product option and regions:

  • TiDB Cloud Serverless: Frankfurt (eu-central-1) and Singapore (ap-southeast-1)

To complete this tutorial, make sure you have a TiDB Cloud Serverless cluster in a supported region. If you don't have one, follow Creating a TiDB Cloud Serverless cluster to create it.

Get started

Step 1. Install the pytidb Python SDK

    pip install "pytidb[models]"

    # (Alternative) If you don't want to use built-in embedding functions and rerankers:
    # pip install pytidb

    # (Optional) To convert query results to pandas DataFrame:
    # pip install pandas

Step 2. Connect to TiDB

    from pytidb import TiDBClient

    db = TiDBClient.connect(
        host="HOST_HERE",
        port=4000,
        username="USERNAME_HERE",
        password="PASSWORD_HERE",
        database="DATABASE_HERE",
    )

You can get these connection parameters from the TiDB Cloud console:

  1. Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.

  2. Click Connect in the upper-right corner. A connection dialog is displayed, with connection parameters listed.

    For example, if the connection parameters are displayed as follows:

        HOST:     gateway01.us-east-1.prod.shared.aws.tidbcloud.com
        PORT:     4000
        USERNAME: 4EfqPF23YKBxaQb.root
        PASSWORD: abcd1234
        DATABASE: test
        CA:       /etc/ssl/cert.pem

    The corresponding Python code to connect to the TiDB Cloud Serverless cluster would be as follows:

        db = TiDBClient.connect(
            host="gateway01.us-east-1.prod.shared.aws.tidbcloud.com",
            port=4000,
            username="4EfqPF23YKBxaQb.root",
            password="abcd1234",
            database="test",
        )

    Note that the preceding example is for demonstration purposes only. You need to fill in the parameters with your own values and keep them secure.
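
One way to keep credentials out of source code is to read them from environment variables. The following is a minimal sketch, not part of pytidb: the helper name and the TIDB_* variable names are assumptions made for illustration. Only the keyword arguments match the TiDBClient.connect() call shown above.

```python
import os


def tidb_params_from_env(env=os.environ):
    """Build TiDBClient.connect() keyword arguments from environment variables.

    The TIDB_* variable names are hypothetical; use whatever names fit
    your deployment.
    """
    required = ("TIDB_HOST", "TIDB_USERNAME", "TIDB_PASSWORD", "TIDB_DATABASE")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {
        "host": env["TIDB_HOST"],
        "port": int(env.get("TIDB_PORT", "4000")),  # 4000 is the port shown above
        "username": env["TIDB_USERNAME"],
        "password": env["TIDB_PASSWORD"],
        "database": env["TIDB_DATABASE"],
    }


# Usage (requires pytidb and the variables above to be set):
# from pytidb import TiDBClient
# db = TiDBClient.connect(**tidb_params_from_env())
```

Failing fast on a missing variable surfaces a misconfigured environment before any connection attempt is made.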

Step 3. Create a table

As an example, create a table named chunks with the following columns:

  • id (int): the ID of the chunk.
  • text (text): the text content of the chunk.
  • text_vec (vector): the vector representation of the text, automatically generated by the embedding model in pytidb.
  • user_id (int): the ID of the user who created the chunk.

    from pytidb.schema import TableModel, Field
    from pytidb.embeddings import EmbeddingFunction

    text_embed = EmbeddingFunction("openai/text-embedding-3-small")

    class Chunk(TableModel, table=True):
        __tablename__ = "chunks"

        id: int = Field(primary_key=True)
        text: str = Field()
        text_vec: list[float] = text_embed.VectorField(
            source_field="text"
        )  # 👈 Define the vector field.
        user_id: int = Field()

    table = db.create_table(schema=Chunk)

Step 4. Insert data

    table.bulk_insert(
        [
            Chunk(id=2, text="bar", user_id=2),  # 👈 The text field will be embedded to a
            Chunk(id=3, text="baz", user_id=3),  # vector and stored in the "text_vec" field
            Chunk(id=4, text="qux", user_id=4),  # automatically.
        ]
    )

Step 5. Perform a hybrid search

In this example, use the jina-reranker model to rerank the search results.

    from pytidb.rerankers import Reranker

    jinaai = Reranker(model_name="jina_ai/jina-reranker-m0")

    df = (
        table.search("<query>", search_type="hybrid")
        .rerank(jinaai, "text")  # 👈 Rerank the query result using the jinaai model.
        .limit(2)
        .to_pandas()
    )
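
To illustrate what the rerank-then-limit chain does conceptually, here is a small pure-Python sketch. It is not how pytidb or the Jina model works internally: the word-overlap scorer is a toy stand-in for a real reranking model, and the function names and sample rows are hypothetical.

```python
def overlap_score(query, text):
    """Toy relevance score: fraction of query words that appear in the text."""
    query_words = set(query.lower().split())
    text_words = set(text.lower().split())
    if not query_words:
        return 0.0
    return len(query_words & text_words) / len(query_words)


def rerank(query, candidates, field, limit):
    """Re-score each candidate row on the given field and keep the top rows."""
    scored = [(overlap_score(query, row[field]), row) for row in candidates]
    # Python's sort is stable, so rows with equal scores keep their order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [row for _, row in scored[:limit]]


# Hypothetical candidate rows, as if returned by the two searches.
candidates = [
    {"id": 1, "text": "vector search finds similar meaning"},
    {"id": 2, "text": "hybrid search combines full-text and vector search"},
    {"id": 3, "text": "full-text search matches exact keywords"},
]

top = rerank("hybrid vector search", candidates, field="text", limit=2)
print([row["id"] for row in top])
```

The second argument of .rerank() in the tutorial code plays the same role as field here: it names the column whose content is scored against the query.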

For a complete example, see pytidb hybrid search demo.

Feedback & help

Full-text search is still in the early stages with limited accessibility. If you would like to try full-text search in a region where it is not yet available, or if you have feedback or need help, feel free to reach out to us.
