Cohere Embeddings
This document describes how to use Cohere embedding models with Auto Embedding in TiDB Cloud to perform semantic searches with text queries.
Available models
TiDB Cloud provides the following Cohere embedding models natively. No API key is required.
Cohere Embed v3 model
- Name: tidbcloud_free/cohere/embed-english-v3
- Dimensions: 1024
- Distance metric: Cosine, L2
- Languages: English
- Maximum input text tokens: 512 (about 4 characters per token)
- Maximum input text characters: 2,048
- Price: Free
- Hosted by TiDB Cloud: ✅
- Bring Your Own Key: ✅

Cohere Multilingual Embed v3 model
- Name: tidbcloud_free/cohere/embed-multilingual-v3
- Dimensions: 1024
- Distance metric: Cosine, L2
- Languages: 100+ languages
- Maximum input text tokens: 512 (about 4 characters per token)
- Maximum input text characters: 2,048
- Price: Free
- Hosted by TiDB Cloud: ✅
- Bring Your Own Key: ✅

Alternatively, all Cohere models are available for use with the cohere/ prefix if you bring your own Cohere API key (BYOK). For example:
- cohere/embed-english-v3.0
- cohere/embed-multilingual-v3.0
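The hosted models above cap input at 512 tokens (about 4 characters per token) or 2,048 characters. If you prepare documents client-side, a pre-check can catch oversized text before it reaches EMBED_TEXT. This is a minimal sketch; the function names and constant are our own, not a TiDB or Cohere API:

```python
# Hypothetical client-side guard for the TiDB Cloud-hosted Cohere models.
# MAX_CHARS mirrors the documented 2,048-character input limit.
MAX_CHARS = 2048

def fits_input_limit(text: str, max_chars: int = MAX_CHARS) -> bool:
    """Return True if the text is within the documented character limit."""
    return len(text) <= max_chars

def clip_to_limit(text: str, max_chars: int = MAX_CHARS) -> str:
    """Drop text from the end until it fits (similar in spirit to truncate=END)."""
    return text if len(text) <= max_chars else text[:max_chars]

print(fits_input_limit("short document"))   # True
print(len(clip_to_limit("x" * 5000)))       # 2048
```

Server-side truncation is also available through the truncate option described later in this document; a client-side check simply lets you decide how to split or shorten text yourself.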
Cohere Embed v4 model
- Name: cohere/embed-v4.0
- Dimensions: 256, 512, 1024, 1536 (default)
- Distance metric: Cosine, L2
- Maximum input text tokens: 128,000
- Price: Charged by Cohere
- Hosted by TiDB Cloud: ❌
- Bring Your Own Key: ✅
For a full list of Cohere models, see Cohere Documentation.
SQL usage example (TiDB Cloud hosted)
The following example shows how to use a Cohere embedding model hosted by TiDB Cloud with Auto Embedding.
CREATE TABLE sample (
`id` INT,
`content` TEXT,
`embedding` VECTOR(1024) GENERATED ALWAYS AS (EMBED_TEXT(
"tidbcloud_free/cohere/embed-multilingual-v3",
`content`,
'{"input_type": "search_document", "input_type@search": "search_query"}'
)) STORED
);
Insert and query data:
INSERT INTO sample
(`id`, `content`)
VALUES
(1, "Java: Object-oriented language for cross-platform development."),
(2, "Java coffee: Bold Indonesian beans with low acidity."),
(3, "Java island: Densely populated, home to Jakarta."),
(4, "Java's syntax is used in Android apps."),
(5, "Dark roast Java beans enhance espresso blends.");
SELECT `id`, `content` FROM sample
ORDER BY
VEC_EMBED_COSINE_DISTANCE(
embedding,
"How to start learning Java programming?"
)
LIMIT 2;
Result:
+------+----------------------------------------------------------------+
| id | content |
+------+----------------------------------------------------------------+
| 1 | Java: Object-oriented language for cross-platform development. |
| 4 | Java's syntax is used in Android apps. |
+------+----------------------------------------------------------------+
Options (TiDB Cloud hosted)
Both the Embed v3 and Multilingual Embed v3 models support the following options, which you can specify via the additional_json_options parameter of the EMBED_TEXT() function.
- input_type (required): prepends special tokens that indicate the purpose of the embedding. Use the same input type consistently when generating embeddings for the same task; otherwise, the embeddings are mapped to different semantic spaces and become incompatible. The only exception is semantic search, where documents are embedded with search_document and queries are embedded with search_query.
    - search_document: generates embeddings for documents to store in a vector database.
    - search_query: generates embeddings for queries to search against stored embeddings in a vector database.
    - classification: generates embeddings to be used as input to a text classifier.
    - clustering: generates embeddings for clustering tasks.
- truncate (optional): controls how the API handles inputs longer than the maximum token length. You can specify one of the following values:
    - NONE (default): returns an error when the input exceeds the maximum input token length.
    - START: discards text from the beginning until the input fits.
    - END: discards text from the end until the input fits.
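The additional_json_options argument of EMBED_TEXT() is a JSON string. If you assemble it programmatically, serializing a dict keeps the quoting correct. A small sketch using the option keys listed above; the helper function name is our own:

```python
import json

def embed_options(input_type, truncate="NONE", search_input_type=None):
    """Build the additional_json_options JSON string for EMBED_TEXT().

    input_type applies when rows are inserted; the optional @search
    variant overrides it during vector search queries.
    """
    opts = {"input_type": input_type, "truncate": truncate}
    if search_input_type is not None:
        opts["input_type@search"] = search_input_type
    return json.dumps(opts)

print(embed_options("search_document", search_input_type="search_query"))
```

The resulting string can be pasted directly into the third argument of EMBED_TEXT() in a CREATE TABLE statement.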
Usage example (BYOK)
This example shows how to create a vector table, insert documents, and run similarity search using Bring Your Own Key (BYOK) Cohere models.
Step 1: Connect to the database
Connect using the TiDB Client for Python:
from pytidb import TiDBClient
tidb_client = TiDBClient.connect(
    host="{gateway-region}.prod.aws.tidbcloud.com",
    port=4000,
    username="{prefix}.root",
    password="{password}",
    database="{database}",
    ensure_db=True,
)
Alternatively, connect using the mysql CLI:
mysql -h {gateway-region}.prod.aws.tidbcloud.com \
    -P 4000 \
    -u {prefix}.root \
    -p{password} \
    -D {database}
Step 2: Configure the API key
Create your API key from the Cohere Dashboard and bring your own key (BYOK) to use the embedding service.
Configure the API key for the Cohere embedding provider using the TiDB Client:
tidb_client.configure_embedding_provider(
    provider="cohere",
    api_key="{your-cohere-api-key}",
)
Set the API key for the Cohere embedding provider using SQL:
SET @@GLOBAL.TIDB_EXP_EMBED_COHERE_API_KEY = "{your-cohere-api-key}";
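Avoid hard-coding the key in source files or SQL scripts. A minimal sketch that reads it from an environment variable instead; COHERE_API_KEY is our naming convention, not a variable that TiDB Cloud or Cohere reads automatically:

```python
import os

def get_cohere_api_key():
    """Read the Cohere API key from the environment instead of source code."""
    key = os.environ.get("COHERE_API_KEY")
    if not key:
        raise RuntimeError("Set the COHERE_API_KEY environment variable first")
    return key

# Usage with the client from Step 1 (commented out here):
# tidb_client.configure_embedding_provider(
#     provider="cohere",
#     api_key=get_cohere_api_key(),
# )

os.environ["COHERE_API_KEY"] = "demo-key"  # for demonstration only
print(get_cohere_api_key())
```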
Step 3: Create a vector table
Create a table with a vector field that uses the cohere/embed-v4.0 model to generate 1536-dimensional vectors (default dimension):
from pytidb.schema import TableModel, Field
from pytidb.embeddings import EmbeddingFunction
from pytidb.datatype import TEXT
class Document(TableModel):
    __tablename__ = "sample_documents"

    id: int = Field(primary_key=True)
    content: str = Field(sa_type=TEXT)
    embedding: list[float] = EmbeddingFunction(
        model_name="cohere/embed-v4.0"
    ).VectorField(source_field="content")

table = tidb_client.create_table(schema=Document, if_exists="overwrite")
CREATE TABLE sample_documents (
`id` INT PRIMARY KEY,
`content` TEXT,
`embedding` VECTOR(1536) GENERATED ALWAYS AS (EMBED_TEXT(
"cohere/embed-v4.0",
`content`
)) STORED
);
Step 4: Insert data into the table
Use the table.insert() or table.bulk_insert() API to add data:
documents = [
    Document(id=1, content="Python: High-level programming language for data science and web development."),
    Document(id=2, content="Python snake: Non-venomous constrictor found in tropical regions."),
    Document(id=3, content="Python framework: Django and Flask are popular web frameworks."),
    Document(id=4, content="Python libraries: NumPy and Pandas for data analysis."),
    Document(id=5, content="Python ecosystem: Rich collection of packages and tools."),
]
table.bulk_insert(documents)
Insert data using the INSERT INTO statement:
INSERT INTO sample_documents (id, content)
VALUES
(1, "Python: High-level programming language for data science and web development."),
(2, "Python snake: Non-venomous constrictor found in tropical regions."),
(3, "Python framework: Django and Flask are popular web frameworks."),
(4, "Python libraries: NumPy and Pandas for data analysis."),
(5, "Python ecosystem: Rich collection of packages and tools.");
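For larger corpora, inserting in batches keeps each round trip small. A sketch of the batching logic; the batch size is arbitrary and the helper name is our own:

```python
def batched(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Usage with the table and documents from the steps above (commented out here):
# for batch in batched(documents, 100):
#     table.bulk_insert(batch)

print([len(b) for b in batched(list(range(250)), 100)])  # [100, 100, 50]
```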
Step 5: Search for similar documents
Use the table.search() API to perform vector search:
results = table.search("How to learn Python programming?") \
.limit(2) \
.to_list()
print(results)
Use the VEC_EMBED_COSINE_DISTANCE function to perform vector search based on cosine distance metric:
SELECT
`id`,
`content`,
VEC_EMBED_COSINE_DISTANCE(embedding, "How to learn Python programming?") AS _distance
FROM sample_documents
ORDER BY _distance ASC
LIMIT 2;
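VEC_EMBED_COSINE_DISTANCE embeds the query string and ranks rows by cosine distance, which is 1 minus cosine similarity: smaller values mean more similar vectors. As an illustration of the metric itself (not of the server-side function), a pure-Python sketch:

```python
import math

def cosine_distance(a, b):
    """Cosine distance = 1 - (a·b) / (|a| * |b|); smaller means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0 (same direction)
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0 (orthogonal)
```

This is why the query orders by _distance ASC: the two rows whose embeddings point closest to the query embedding come first.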
Options (BYOK)
All Cohere embedding options are supported via the additional_json_options parameter of the EMBED_TEXT() function.
Example: Specify different input_type for search and insert operations
Use the @search suffix to indicate that the field takes effect only during vector search queries.
CREATE TABLE sample (
`id` INT,
`content` TEXT,
`embedding` VECTOR(1536) GENERATED ALWAYS AS (EMBED_TEXT(
"cohere/embed-v4.0",
`content`,
'{"input_type": "search_document", "input_type@search": "search_query"}'
)) STORED
);
Example: Use an alternative dimension
CREATE TABLE sample (
`id` INT,
`content` TEXT,
`embedding` VECTOR(512) GENERATED ALWAYS AS (EMBED_TEXT(
"cohere/embed-v4.0",
`content`,
'{"output_dimension": 512}'
)) STORED
);
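Smaller dimensions trade retrieval quality for storage and compute. Assuming 4 bytes per float32 element (an assumption about vector storage, stated only for rough sizing, and ignoring per-row overhead), the embedding column grows linearly with output_dimension:

```python
# Rough sizing sketch: embedding storage scales linearly with dimension.
# Assumes 4 bytes per float32 element; real storage adds per-row overhead.
BYTES_PER_FLOAT = 4

def embedding_bytes(dimension, rows=1):
    return dimension * BYTES_PER_FLOAT * rows

for dim in (256, 512, 1024, 1536):
    print(dim, embedding_bytes(dim, rows=1_000_000))
```

For example, at one million rows, 512-dimensional vectors need roughly a third of the raw space of the 1536-dimensional default.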
For all available options, see Cohere Documentation.