Cohere Embeddings
This document describes how to use Cohere embedding models with Auto Embedding in TiDB Cloud to perform semantic searches with text queries.
Available models
TiDB Cloud provides the following Cohere embedding models natively. No API key is required.
Cohere Embed v3 model
- Name: tidbcloud_free/cohere/embed-english-v3
- Dimensions: 1024
- Distance metric: Cosine, L2
- Languages: English
- Maximum input text tokens: 512 (about 4 characters per token)
- Maximum input text characters: 2,048
- Price: Free
- Hosted by TiDB Cloud: ✅
- Bring Your Own Key: ✅

Cohere Multilingual Embed v3 model
- Name: tidbcloud_free/cohere/embed-multilingual-v3
- Dimensions: 1024
- Distance metric: Cosine, L2
- Languages: 100+ languages
- Maximum input text tokens: 512 (about 4 characters per token)
- Maximum input text characters: 2,048
- Price: Free
- Hosted by TiDB Cloud: ✅
- Bring Your Own Key: ✅

Alternatively, all Cohere models are available for use with the cohere/ prefix if you bring your own Cohere API key (BYOK). For example:
- cohere/embed-english-v3.0
- cohere/embed-multilingual-v3.0
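The hosted models above cap input at 512 tokens (about 4 characters per token) or 2,048 characters. If you prepare documents client-side, a pre-check can catch oversized text before it reaches EMBED_TEXT. This is a minimal sketch; the function names and constant are our own, not a TiDB or Cohere API:

```python
# Hypothetical client-side guard for the TiDB Cloud-hosted Cohere models.
# MAX_CHARS mirrors the documented 2,048-character input limit.
MAX_CHARS = 2048

def fits_input_limit(text: str, max_chars: int = MAX_CHARS) -> bool:
    """Return True if the text is within the documented character limit."""
    return len(text) <= max_chars

def clip_to_limit(text: str, max_chars: int = MAX_CHARS) -> str:
    """Drop text from the end until it fits (similar in spirit to truncate=END)."""
    return text if len(text) <= max_chars else text[:max_chars]

print(fits_input_limit("short document"))   # True
print(len(clip_to_limit("x" * 5000)))       # 2048
```

Server-side truncation is also available through the truncate option described later in this document; a client-side check simply lets you decide how to split or shorten text yourself.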
Cohere Embed v4 model
- Name: cohere/embed-v4.0
- Dimensions: 256, 512, 1024, 1536 (default)
- Distance metric: Cosine, L2
- Maximum input text tokens: 128,000
- Price: Charged by Cohere
- Hosted by TiDB Cloud: ❌
- Bring Your Own Key: ✅
For a full list of Cohere models, see Cohere Documentation.
SQL usage example (TiDB Cloud hosted)
The following example shows how to use a Cohere embedding model hosted by TiDB Cloud with Auto Embedding.
CREATE TABLE sample (
`id` INT,
`content` TEXT,
`embedding` VECTOR(1024) GENERATED ALWAYS AS (EMBED_TEXT(
"tidbcloud_free/cohere/embed-multilingual-v3",
`content`,
'{"input_type": "search_document", "input_type@search": "search_query"}'
)) STORED
);
Insert and query data:
INSERT INTO sample
(`id`, `content`)
VALUES
(1, "Java: Object-oriented language for cross-platform development."),
(2, "Java coffee: Bold Indonesian beans with low acidity."),
(3, "Java island: Densely populated, home to Jakarta."),
(4, "Java's syntax is used in Android apps."),
(5, "Dark roast Java beans enhance espresso blends.");
SELECT `id`, `content` FROM sample
ORDER BY
VEC_EMBED_COSINE_DISTANCE(
embedding,
"How to start learning Java programming?"
)
LIMIT 2;
Result:
+------+----------------------------------------------------------------+
| id | content |
+------+----------------------------------------------------------------+
| 1 | Java: Object-oriented language for cross-platform development. |
| 4 | Java's syntax is used in Android apps. |
+------+----------------------------------------------------------------+
Options (TiDB Cloud hosted)
Both the Embed v3 and Multilingual Embed v3 models support the following options, which you can specify via the additional_json_options parameter of the EMBED_TEXT() function.
- input_type (required): prepends special tokens that indicate the purpose of the embedding. Use the same input type consistently when generating embeddings for the same task; otherwise, the embeddings are mapped to different semantic spaces and become incompatible. The only exception is semantic search, where documents are embedded with search_document and queries are embedded with search_query.
    - search_document: generates embeddings for documents to store in a vector database.
    - search_query: generates embeddings for queries to search against stored embeddings in a vector database.
    - classification: generates embeddings to be used as input to a text classifier.
    - clustering: generates embeddings for clustering tasks.
- truncate (optional): controls how the API handles inputs longer than the maximum token length. You can specify one of the following values:
    - NONE (default): returns an error when the input exceeds the maximum input token length.
    - START: discards text from the beginning until the input fits.
    - END: discards text from the end until the input fits.
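The additional_json_options argument of EMBED_TEXT() is a JSON string. If you assemble it programmatically, serializing a dict keeps the quoting correct. A small sketch using the option keys listed above; the helper function name is our own:

```python
import json

def embed_options(input_type, truncate="NONE", search_input_type=None):
    """Build the additional_json_options JSON string for EMBED_TEXT().

    input_type applies when rows are inserted; the optional @search
    variant overrides it during vector search queries.
    """
    opts = {"input_type": input_type, "truncate": truncate}
    if search_input_type is not None:
        opts["input_type@search"] = search_input_type
    return json.dumps(opts)

print(embed_options("search_document", search_input_type="search_query"))
```

The resulting string can be pasted directly into the third argument of EMBED_TEXT() in a CREATE TABLE statement.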
Usage example (BYOK)
This example shows how to create a vector table, insert documents, and run similarity search using Bring Your Own Key (BYOK) Cohere models.
Step 1: Connect to the database
Connect using the TiDB Client for Python:
from pytidb import TiDBClient
tidb_client = TiDBClient.connect(
    host="{gateway-region}.prod.aws.tidbcloud.com",
    port=4000,
    username="{prefix}.root",
    password="{password}",
    database="{database}",
    ensure_db=True,
)
Alternatively, connect using the mysql CLI:
mysql -h {gateway-region}.prod.aws.tidbcloud.com \
    -P 4000 \
    -u {prefix}.root \
    -p{password} \
    -D {database}
Step 2: Configure the API key
Create your API key from the Cohere Dashboard and bring your own key (BYOK) to use the embedding service.
Configure the API key for the Cohere embedding provider using the TiDB Client:
tidb_client.configure_embedding_provider(
    provider="cohere",
    api_key="{your-cohere-api-key}",
)
Set the API key for the Cohere embedding provider using SQL:
SET @@GLOBAL.TIDB_EXP_EMBED_COHERE_API_KEY = "{your-cohere-api-key}";
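Avoid hard-coding the key in source files or SQL scripts. A minimal sketch that reads it from an environment variable instead; COHERE_API_KEY is our naming convention, not a variable that TiDB Cloud or Cohere reads automatically:

```python
import os

def get_cohere_api_key():
    """Read the Cohere API key from the environment instead of source code."""
    key = os.environ.get("COHERE_API_KEY")
    if not key:
        raise RuntimeError("Set the COHERE_API_KEY environment variable first")
    return key

# Usage with the client from Step 1 (commented out here):
# tidb_client.configure_embedding_provider(
#     provider="cohere",
#     api_key=get_cohere_api_key(),
# )

os.environ["COHERE_API_KEY"] = "demo-key"  # for demonstration only
print(get_cohere_api_key())
```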
Step 3: Create a vector table
Create a table with a vector field that uses the cohere/embed-v4.0 model to generate 1536-dimensional vectors (default dimension):
from pytidb.schema import TableModel, Field
from pytidb.embeddings import EmbeddingFunction
from pytidb.datatype import TEXT
class Document(TableModel):
    __tablename__ = "sample_documents"

    id: int = Field(primary_key=True)
    content: str = Field(sa_type=TEXT)
    embedding: list[float] = EmbeddingFunction(
        model_name="cohere/embed-v4.0"
    ).VectorField(source_field="content")

table = tidb_client.create_table(schema=Document, if_exists="overwrite")
CREATE TABLE sample_documents (
`id` INT PRIMARY KEY,
`content` TEXT,
`embedding` VECTOR(1536) GENERATED ALWAYS AS (EMBED_TEXT(
"cohere/embed-v4.0",
`content`
)) STORED
);
Step 4: Insert data into the table
Use the table.insert() or table.bulk_insert() API to add data:
documents = [
    Document(id=1, content="Python: High-level programming language for data science and web development."),
    Document(id=2, content="Python snake: Non-venomous constrictor found in tropical regions."),
    Document(id=3, content="Python framework: Django and Flask are popular web frameworks."),
    Document(id=4, content="Python libraries: NumPy and Pandas for data analysis."),
    Document(id=5, content="Python ecosystem: Rich collection of packages and tools."),
]
table.bulk_insert(documents)
Insert data using the INSERT INTO statement:
INSERT INTO sample_documents (id, content)
VALUES
(1, "Python: High-level programming language for data science and web development."),
(2, "Python snake: Non-venomous constrictor found in tropical regions."),
(3, "Python framework: Django and Flask are popular web frameworks."),
(4, "Python libraries: NumPy and Pandas for data analysis."),
(5, "Python ecosystem: Rich collection of packages and tools.");
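For larger corpora, inserting in batches keeps each round trip small. A sketch of the batching logic; the batch size is arbitrary and the helper name is our own:

```python
def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Usage with the table and documents from the steps above (commented out here):
# for batch in batched(documents, 100):
#     table.bulk_insert(batch)

print([len(b) for b in batched(list(range(250)), 100)])  # [100, 100, 50]
```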
Step 5: Search for similar documents
Use the table.search() API to perform vector search:
results = table.search("How to learn Python programming?") \
.limit(2) \
.to_list()
print(results)
Use the VEC_EMBED_COSINE_DISTANCE function to perform vector search based on cosine distance metric:
SELECT
`id`,
`content`,
VEC_EMBED_COSINE_DISTANCE(embedding, "How to learn Python programming?") AS _distance
FROM sample_documents
ORDER BY _distance ASC
LIMIT 2;
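VEC_EMBED_COSINE_DISTANCE embeds the query string and ranks rows by cosine distance, which is 1 minus cosine similarity: smaller values mean more similar vectors. As an illustration of the metric itself (not of the server-side function), a pure-Python sketch:

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - (a·b) / (|a| * |b|); smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

This is why the query orders by _distance ASC: the two rows whose embeddings point closest to the query embedding come first.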
Options (BYOK)
All Cohere embedding options are supported via the additional_json_options parameter of the EMBED_TEXT() function.
Example: Specify different input_type for search and insert operations
Use the @search suffix to indicate that the field takes effect only during vector search queries.
CREATE TABLE sample (
`id` INT,
`content` TEXT,
`embedding` VECTOR(1536) GENERATED ALWAYS AS (EMBED_TEXT(
"cohere/embed-v4.0",
`content`,
'{"input_type": "search_document", "input_type@search": "search_query"}'
)) STORED
);
Example: Use an alternative dimension
CREATE TABLE sample (
`id` INT,
`content` TEXT,
`embedding` VECTOR(512) GENERATED ALWAYS AS (EMBED_TEXT(
"cohere/embed-v4.0",
`content`,
'{"output_dimension": 512}'
)) STORED
);
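Smaller dimensions trade retrieval quality for storage and compute. Assuming 4 bytes per float32 element (an assumption about vector storage, stated only for rough sizing, and ignoring per-row overhead), the embedding column grows linearly with output_dimension:

```python
# Rough sizing sketch: embedding storage scales linearly with dimension.
# Assumes 4 bytes per float32 element; real storage adds per-row overhead.
BYTES_PER_FLOAT = 4

def embedding_bytes(dimension, rows=1):
    return dimension * BYTES_PER_FLOAT * rows

for dim in (256, 512, 1024, 1536):
    print(dim, embedding_bytes(dim, rows=1_000_000))
```

For example, at one million rows, 512-dimensional vectors need roughly a third of the raw space of the 1536-dimensional default.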
For all available options, see Cohere Documentation.