OpenAI Embeddings
This document describes how to use OpenAI embedding models with Auto Embedding in TiDB Cloud to perform semantic searches with text queries.
Available models
All OpenAI models are available for use with the openai/ prefix if you bring your own OpenAI API key (BYOK). For example:
text-embedding-3-small
- Name: openai/text-embedding-3-small
- Dimensions: 512-1536 (default: 1536)
- Distance metric: Cosine, L2
- Price: Charged by OpenAI
- Hosted by TiDB Cloud: ❌
- Bring Your Own Key: ✅
text-embedding-3-large
- Name: openai/text-embedding-3-large
- Dimensions: 256-3072 (default: 3072)
- Distance metric: Cosine, L2
- Price: Charged by OpenAI
- Hosted by TiDB Cloud: ❌
- Bring Your Own Key: ✅
For a full list of available models, see OpenAI Documentation.
Usage example
This example shows how to create a vector table, insert documents, and run similarity search using OpenAI embedding models.
You can integrate OpenAI embedding models with TiDB using either the Python SDK (pytidb) or native SQL functions; in both cases, Auto Embedding generates the embeddings for you.
Step 1: Connect to the database
Connect to your TiDB database using the TiDB Client in Python:

from pytidb import TiDBClient

tidb_client = TiDBClient.connect(
    host="{gateway-region}.prod.aws.tidbcloud.com",
    port=4000,
    username="{prefix}.root",
    password="{password}",
    database="{database}",
    ensure_db=True,
)
Alternatively, connect to the database using the MySQL CLI client:

mysql -h {gateway-region}.prod.aws.tidbcloud.com \
    -P 4000 \
    -u {prefix}.root \
    -p{password} \
    -D {database}
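If you prefer not to hard-code connection details, you can read them from environment variables instead. The following is a minimal sketch assuming you export TIDB_HOST, TIDB_PORT, TIDB_USERNAME, TIDB_PASSWORD, and TIDB_DATABASE (these variable names are illustrative, not required by TiDB):

import os

from pytidb import TiDBClient

# Read connection settings from the environment instead of hard-coding them.
# The variable names below are arbitrary choices for this sketch.
tidb_client = TiDBClient.connect(
    host=os.environ["TIDB_HOST"],
    port=int(os.environ.get("TIDB_PORT", "4000")),
    username=os.environ["TIDB_USERNAME"],
    password=os.environ["TIDB_PASSWORD"],
    database=os.environ["TIDB_DATABASE"],
    ensure_db=True,
)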
Step 2: Configure the API key
Create an API key in the OpenAI API Platform and bring your own key (BYOK) to use the embedding service.
Configure the API key for the OpenAI embedding provider using the TiDB Client:
tidb_client.configure_embedding_provider(
    provider="openai",
    api_key="{your-openai-api-key}",
)
Set the API key for the OpenAI embedding provider using SQL:
SET @@GLOBAL.TIDB_EXP_EMBED_OPENAI_API_KEY = "{your-openai-api-key}";
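As with the database credentials, you may prefer to keep the OpenAI key out of your source code. A minimal sketch that reads it from an environment variable (OPENAI_API_KEY is a conventional name, not something TiDB requires):

import os

# Pass the key from the environment instead of hard-coding it in the script.
tidb_client.configure_embedding_provider(
    provider="openai",
    api_key=os.environ["OPENAI_API_KEY"],
)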
Step 3: Create a vector table
Create a table with a vector field that uses the openai/text-embedding-3-small model to generate 1536-dimensional vectors:
from pytidb.schema import TableModel, Field
from pytidb.embeddings import EmbeddingFunction
from pytidb.datatype import TEXT

class Document(TableModel):
    __tablename__ = "sample_documents"

    id: int = Field(primary_key=True)
    content: str = Field(sa_type=TEXT)
    embedding: list[float] = EmbeddingFunction(
        model_name="openai/text-embedding-3-small"
    ).VectorField(source_field="content")

table = tidb_client.create_table(schema=Document, if_exists="overwrite")
Alternatively, create the table using SQL:

CREATE TABLE sample_documents (
    `id` INT PRIMARY KEY,
    `content` TEXT,
    `embedding` VECTOR(1536) GENERATED ALWAYS AS (EMBED_TEXT(
        "openai/text-embedding-3-small",
        `content`
    )) STORED
);
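To use the larger model instead, the same pattern applies. The sketch below swaps in openai/text-embedding-3-large, whose default output is 3072 dimensions (the class and table names here are illustrative):

from pytidb.schema import TableModel, Field
from pytidb.embeddings import EmbeddingFunction
from pytidb.datatype import TEXT

class LargeDocument(TableModel):
    __tablename__ = "sample_documents_large"

    id: int = Field(primary_key=True)
    content: str = Field(sa_type=TEXT)
    # Same pattern as above, but with the higher-dimensional model.
    embedding: list[float] = EmbeddingFunction(
        model_name="openai/text-embedding-3-large"
    ).VectorField(source_field="content")

large_table = tidb_client.create_table(schema=LargeDocument, if_exists="overwrite")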
Step 4: Insert data into the table
Use the table.insert() or table.bulk_insert() API to add data:
documents = [
    Document(id=1, content="Java: Object-oriented language for cross-platform development."),
    Document(id=2, content="Java coffee: Bold Indonesian beans with low acidity."),
    Document(id=3, content="Java island: Densely populated, home to Jakarta."),
    Document(id=4, content="Java's syntax is used in Android apps."),
    Document(id=5, content="Dark roast Java beans enhance espresso blends."),
]
table.bulk_insert(documents)
Insert data using the INSERT INTO statement:
INSERT INTO sample_documents (id, content)
VALUES
    (1, "Java: Object-oriented language for cross-platform development."),
    (2, "Java coffee: Bold Indonesian beans with low acidity."),
    (3, "Java island: Densely populated, home to Jakarta."),
    (4, "Java's syntax is used in Android apps."),
    (5, "Dark roast Java beans enhance espresso blends.");
Step 5: Search for similar documents
Use the table.search() API to perform vector search:
results = table.search("How to start learning Java programming?") \
    .limit(2) \
    .to_list()
print(results)
Use the VEC_EMBED_COSINE_DISTANCE function to perform vector search with cosine distance:
SELECT
    `id`,
    `content`,
    VEC_EMBED_COSINE_DISTANCE(embedding, "How to start learning Java programming?") AS _distance
FROM sample_documents
ORDER BY _distance ASC
LIMIT 2;
Result:
+------+----------------------------------------------------------------+
| id | content |
+------+----------------------------------------------------------------+
| 1 | Java: Object-oriented language for cross-platform development. |
| 4 | Java's syntax is used in Android apps. |
+------+----------------------------------------------------------------+
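Because the embeddings capture meaning rather than keywords, changing the query changes which sense of "Java" ranks highest. The following sketch runs a few queries against the same table for comparison (the query strings are illustrative):

# Compare how different query intents rank the same five documents.
queries = [
    "How to start learning Java programming?",
    "Which beans make a good espresso?",
    "What is the most populated island in Indonesia?",
]

for query in queries:
    results = table.search(query).limit(2).to_list()
    print(query)
    for result in results:
        print("  ", result)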
Use Azure OpenAI
To use OpenAI embedding models hosted on Azure, set your API key as usual and set the global variable TIDB_EXP_EMBED_OPENAI_API_BASE to the OpenAI-compatible endpoint of your Azure resource. For example:
SET @@GLOBAL.TIDB_EXP_EMBED_OPENAI_API_KEY = 'your-openai-api-key-here';
SET @@GLOBAL.TIDB_EXP_EMBED_OPENAI_API_BASE = 'https://<your-resource-name>.openai.azure.com/openai/v1';
CREATE TABLE sample (
    `id` INT,
    `content` TEXT,
    `embedding` VECTOR(3072) GENERATED ALWAYS AS (EMBED_TEXT(
        "openai/text-embedding-3-large",
        `content`
    )) STORED
);
INSERT INTO sample (`id`, `content`)
VALUES
    (1, "Java: Object-oriented language for cross-platform development."),
    (2, "Java coffee: Bold Indonesian beans with low acidity."),
    (3, "Java island: Densely populated, home to Jakarta."),
    (4, "Java's syntax is used in Android apps."),
    (5, "Dark roast Java beans enhance espresso blends.");
SELECT `id`, `content` FROM sample
ORDER BY
    VEC_EMBED_COSINE_DISTANCE(
        embedding,
        "How to start learning Java programming?"
    )
LIMIT 2;
Even if your resource URL appears as https://<your-resource-name>.cognitiveservices.azure.com/, you still need to use https://<your-resource-name>.openai.azure.com/openai/v1 as the API base to keep OpenAI-compatible request and response formats.
To switch from Azure OpenAI to OpenAI directly, set TIDB_EXP_EMBED_OPENAI_API_BASE to an empty string:
SET @@GLOBAL.TIDB_EXP_EMBED_OPENAI_API_BASE = '';
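If you prefer to manage this configuration from Python, the two SET statements can be run through any MySQL-compatible client. The sketch below builds the API base from the Azure resource name and assumes the client object exposes a raw-SQL execute method (tidb_client.execute() is used here as an assumption; adjust it to whatever client you use):

import os

# Illustrative environment variable names; any source of these values works.
azure_resource_name = os.environ["AZURE_OPENAI_RESOURCE_NAME"]
azure_api_key = os.environ["AZURE_OPENAI_API_KEY"]

# Always use the *.openai.azure.com/openai/v1 form, even if the Azure portal
# shows a *.cognitiveservices.azure.com URL (see the note above).
api_base = f"https://{azure_resource_name}.openai.azure.com/openai/v1"

# Assumption: tidb_client.execute() runs a raw SQL statement. String
# interpolation is used only to keep this sketch short.
tidb_client.execute(f"SET @@GLOBAL.TIDB_EXP_EMBED_OPENAI_API_KEY = '{azure_api_key}'")
tidb_client.execute(f"SET @@GLOBAL.TIDB_EXP_EMBED_OPENAI_API_BASE = '{api_base}'")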
Options
All OpenAI embedding options are supported via the additional_json_options parameter of the EMBED_TEXT() function.
Example: Use an alternative dimension for text-embedding-3-large
CREATE TABLE sample (
    `id` INT,
    `content` TEXT,
    `embedding` VECTOR(1024) GENERATED ALWAYS AS (EMBED_TEXT(
        "openai/text-embedding-3-large",
        `content`,
        '{"dimensions": 1024}'
    )) STORED
);
For all available options, see OpenAI Documentation.