OpenAI-Compatible Embeddings

This tutorial demonstrates how to use OpenAI-compatible embedding services to generate text embeddings, store them in TiDB, and perform semantic search.

Note

Currently, Auto Embedding is only available on TiDB Cloud Starter clusters hosted on AWS.

OpenAI-compatible embedding services

Because the OpenAI Embedding API is widely used, many providers offer compatible APIs, such as:

The TiDB Python SDK pytidb provides the EmbeddingFunction class to integrate with OpenAI-compatible embedding services.

Usage example

This example shows how to create a vector table, insert documents, and perform similarity search using an OpenAI-compatible embedding model.

Step 1: Connect to the database

from pytidb import TiDBClient

tidb_client = TiDBClient.connect(
    host="{gateway-region}.prod.aws.tidbcloud.com",
    port=4000,
    username="{prefix}.root",
    password="{password}",
    database="{database}",
    ensure_db=True,
)

Step 2: Define the embedding function

To integrate with an OpenAI-compatible embedding service, initialize the EmbeddingFunction class and set the model_name parameter with the openai/ prefix.

from pytidb.embeddings import EmbeddingFunction

openai_like_embed = EmbeddingFunction(
    model_name="openai/{model_name}",
    api_base="{your-api-base}",
    api_key="{your-api-key}",
)

The parameters are:

model_name: Specifies the model to use. Use the format openai/{model_name}.
api_base: The base URL of your OpenAI-compatible embedding API service.
api_key: The API key used to authenticate with the embedding API service.

Example: Use Ollama with the nomic-embed-text model

openai_like_embed = EmbeddingFunction(
    model_name="openai/nomic-embed-text",
    api_base="http://localhost:11434/v1",
)

Example: Use vLLM with the intfloat/e5-mistral-7b-instruct model

openai_like_embed = EmbeddingFunction(
    model_name="openai/intfloat/e5-mistral-7b-instruct",
    api_base="http://localhost:8000/v1"
)

Step 3: Create a vector table

Create a table with a vector field that uses Ollama and the nomic-embed-text model.

from pytidb.schema import TableModel, Field
from pytidb.embeddings import EmbeddingFunction
from pytidb.datatype import TEXT

openai_like_embed = EmbeddingFunction(
    model_name="openai/nomic-embed-text",
    api_base="{your-api-base}",
)

class Document(TableModel):
    __tablename__ = "sample_documents"
    id: int = Field(primary_key=True)
    content: str = Field(sa_type=TEXT)
    embedding: list[float] = openai_like_embed.VectorField(source_field="content")

table = tidb_client.create_table(schema=Document, if_exists="overwrite")

Step 4: Insert data into the table

Use the table.insert() or table.bulk_insert() API to add data:

documents = [
    Document(id=1, content="Java: Object-oriented language for cross-platform development."),
    Document(id=2, content="Java coffee: Bold Indonesian beans with low acidity."),
    Document(id=3, content="Java island: Densely populated, home to Jakarta."),
    Document(id=4, content="Java's syntax is used in Android apps."),
    Document(id=5, content="Dark roast Java beans enhance espresso blends."),
]
table.bulk_insert(documents)

With Auto Embedding enabled, TiDB automatically generates vector values when you insert data.

Step 5: Search for similar documents

Use the table.search() API to perform vector search:

results = table.search("How to start learning Java programming?") \
    .limit(2) \
    .to_list()
print(results)

With Auto Embedding enabled, TiDB automatically generates embeddings for query text during vector search.

OpenAI-Compatible Embeddings