Get Started with TiDB + AI via Python



This document demonstrates how to get started with Vector Search in TiDB using the Python SDK. Follow along to build your first AI application with TiDB.

By following this document, you will learn how to:

  • Connect to TiDB using the TiDB Python SDK.
  • Generate text embeddings with popular embedding models.
  • Store vectors in TiDB tables.
  • Perform semantic search using vector similarity.

Prerequisites

  • Go to tidbcloud.com to create a TiDB Cloud Starter cluster for free, or use tiup playground to deploy a TiDB Self-Managed cluster for local testing.

Installation

pytidb is the official Python SDK for TiDB, designed to help developers build AI applications efficiently.

To install the Python SDK, run the following command:

pip install pytidb

To use the built-in embedding function, install pytidb with the models extension instead:

pip install "pytidb[models]"
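To confirm that the installation succeeded, you can ask Python which version it sees. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and is not part of pytidb.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string of a package, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the version if pytidb is installed in this environment, otherwise None.
print("pytidb version:", installed_version("pytidb"))
```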

Connect to database

    You can get the connection parameters from the TiDB Cloud console:

    1. Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.
    2. Click Connect in the upper-right corner. A connection dialog is displayed, with connection parameters listed.

    For example, if the connection parameters are displayed as follows:

    HOST: gateway01.us-east-1.prod.shared.aws.tidbcloud.com
    PORT: 4000
    USERNAME: 4EfqPF23YKBxaQb.root
    PASSWORD: abcd1234
    DATABASE: test
    CA: /etc/ssl/cert.pem

    The corresponding Python code to connect to the TiDB Cloud Starter cluster would be as follows:

    from pytidb import TiDBClient

    client = TiDBClient.connect(
        host="gateway01.us-east-1.prod.shared.aws.tidbcloud.com",
        port=4000,
        username="4EfqPF23YKBxaQb.root",
        password="abcd1234",
        database="test",
    )

    Here is a basic example for connecting to a self-managed TiDB cluster:

    from pytidb import TiDBClient

    client = TiDBClient.connect(
        host="localhost",
        port=4000,
        username="root",
        password="",
        database="test",
        ensure_db=True,
    )

    Once connected, you can use the client object to operate tables, query data, and more.
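Rather than hard-coding credentials in source files, you may prefer to read them from the environment. The following is a minimal sketch under stated assumptions: the TIDB_* variable names and the tidb_connect_kwargs helper are hypothetical, and the defaults mirror the self-managed example in this section.

```python
import os
from typing import Mapping


def tidb_connect_kwargs(env: Mapping[str, str] = os.environ) -> dict:
    """Build keyword arguments for TiDBClient.connect() from environment variables.

    The TIDB_* names are hypothetical; choose names that fit your deployment.
    """
    return {
        "host": env.get("TIDB_HOST", "localhost"),
        "port": int(env.get("TIDB_PORT", "4000")),  # connect() takes an integer port
        "username": env.get("TIDB_USERNAME", "root"),
        "password": env.get("TIDB_PASSWORD", ""),
        "database": env.get("TIDB_DATABASE", "test"),
    }


# Usage (requires pytidb and a reachable cluster):
#   from pytidb import TiDBClient
#   client = TiDBClient.connect(**tidb_connect_kwargs())
```

Passing a mapping explicitly, as in tidb_connect_kwargs({"TIDB_HOST": "..."}), makes the helper easy to exercise without touching the real environment.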

    Create an embedding function

    When working with embedding models, you can use an embedding function to automatically vectorize your data at both insertion and query time. pytidb natively supports popular embedding providers such as OpenAI, Jina AI, Hugging Face, Sentence Transformers, and others.

      Choose one of the following providers.

      To use OpenAI, go to the OpenAI platform to create your API key for embedding.

      from pytidb.embeddings import EmbeddingFunction

      text_embed = EmbeddingFunction(
          model_name="openai/text-embedding-3-small",
          api_key="<your-openai-api-key>",
      )

      To use Jina AI instead, go to Jina AI to create your API key for embedding.

      from pytidb.embeddings import EmbeddingFunction

      text_embed = EmbeddingFunction(
          model_name="jina/jina-embeddings-v3",
          api_key="<your-jina-api-key>",
      )
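Conceptually, an embedding function maps text to a fixed-length vector, and the same function is applied to stored rows and to queries so that the results are comparable. The following toy stand-in illustrates that idea only. It is not a pytidb API and not a real embedding model: toy_embed is a hypothetical helper that hashes words into buckets, so it captures word overlap rather than meaning.

```python
import hashlib
import math


def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Map text to a unit-length vector by hashing each word into one of `dims` buckets."""
    vec = [0.0] * dims
    for word in text.lower().split():
        digest = hashlib.sha256(word.encode("utf-8")).digest()
        vec[digest[0] % dims] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    # Empty text yields the zero vector, which cannot be normalized.
    return [x / norm for x in vec] if norm else vec


# Insertion stage: each stored text gets a vector.
stored = {text: toy_embed(text) for text in ["TiDB stores vectors", "Python SDK for TiDB"]}

# Query stage: the query goes through the same function.
query_vec = toy_embed("vectors in TiDB")
```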

      Create a table

      As an example, create a table named chunks with the following columns:

      • id (int): the ID of the chunk.
      • text (text): the text content of the chunk.
      • text_vec (vector): the vector embeddings of the text.
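      • user_id (int): the ID of the user who created the chunk.

      from pytidb.schema import TableModel, Field, VectorField

      class Chunk(TableModel):
          # Name the table explicitly so that it matches the "chunks" name used in this document.
          __tablename__ = "chunks"

          id: int | None = Field(default=None, primary_key=True)
          text: str = Field()
          # The vector is generated from the `text` field by the embedding function.
          text_vec: list[float] = text_embed.VectorField(source_field="text")
          user_id: int = Field()

      # Create the table, replacing any existing table with the same name.
      table = client.create_table(schema=Chunk, if_exists="overwrite")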

      Once created, you can use the table object to insert data, search data, and more.

      Insert data

      Now let's add some sample data to our table.

      table.bulk_insert([
          # 👇 The text will be automatically embedded and populated into the `text_vec` field.
          Chunk(text="PyTiDB is a Python library for developers to connect to TiDB.", user_id=2),
          Chunk(text="LlamaIndex is a framework for building AI applications.", user_id=2),
          Chunk(text="OpenAI is a company and platform that provides AI models service and tools.", user_id=3),
      ])

      Search for nearest neighbors

      To search for nearest neighbors of a given query, you can use the table.search() method. This method performs a vector search by default.

      table.search(
          # 👇 Pass the query text directly; it is embedded into a query vector automatically.
          "A library for my artificial intelligence software"
      ).limit(3).to_list()

      In this example, vector search compares the query vector with the stored vectors in the text_vec field of the chunks table and returns the top 3 most semantically relevant results based on similarity scores.

      A smaller _distance means that the two vectors are more similar.

      [
          {
              'id': 2,
              'text': 'LlamaIndex is a framework for building AI applications.',
              'text_vec': [...],
              'user_id': 2,
              '_distance': 0.5719928358786761,
              '_score': 0.4280071641213239
          },
          {
              'id': 3,
              'text': 'OpenAI is a company and platform that provides AI models service and tools.',
              'text_vec': [...],
              'user_id': 3,
              '_distance': 0.603133726213383,
              '_score': 0.396866273786617
          },
          {
              'id': 1,
              'text': 'PyTiDB is a Python library for developers to connect to TiDB.',
              'text_vec': [...],
              'user_id': 2,
              '_distance': 0.6202191842385758,
              '_score': 0.3797808157614242
          }
      ]
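The ranking in this section can be sketched in plain Python. This is a brute-force illustration that assumes cosine distance; the document does not name the metric, although every _score in the sample output equals 1 minus its _distance, which is consistent with that reading. The cosine_distance and top_k helpers and the two-dimensional vectors are hypothetical and are not part of pytidb.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity. Assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query: list[float], rows: dict[int, list[float]], k: int) -> list[dict]:
    """Score every row against the query and return the k rows with the smallest distance."""
    scored = []
    for row_id, vec in rows.items():
        distance = cosine_distance(query, vec)
        scored.append({"id": row_id, "_distance": distance, "_score": 1.0 - distance})
    scored.sort(key=lambda item: item["_distance"])
    return scored[:k]


# Toy stored vectors keyed by row ID, and a query pointing the same way as row 1.
rows = {1: [1.0, 0.0], 2: [0.6, 0.8], 3: [0.0, 1.0]}
results = top_k([1.0, 0.0], rows, k=2)
```

Here row 1 ranks first because it points in the same direction as the query, and row 2 ranks second. A real deployment would leave the search to TiDB rather than scanning every row in Python.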

      Delete data

      To delete a specific row from the table, you can use the table.delete() method:

      table.delete({ "id": 1 })

      Drop table

      When you no longer need a table, you can drop it using the client.drop_table() method:

      client.drop_table("chunks")
