Integrate Vector Search with LangChain

This tutorial demonstrates how to integrate the vector search feature of TiDB with LangChain.

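Note

The vector search feature is in beta and might change without prior notice. If you find a bug, you can report an issue on GitHub.

Note

The vector search feature is available on TiDB Self-Managed, TiDB Cloud Starter, TiDB Cloud Essential, and TiDB Cloud Dedicated. For TiDB Self-Managed and TiDB Cloud Dedicated, the TiDB version must be v8.4.0 or later (v8.5.0 or later is recommended).

Tip

You can view the complete sample code in Jupyter Notebook, or run the sample code directly in the Colab online environment.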

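Prerequisites

To complete this tutorial, you need:

Python 3.8 or higher installed.
Jupyter Notebook installed.
Git installed.
A TiDB cluster.

If you don't have a TiDB cluster, you can create one as follows:

(Recommended) Follow Creating a TiDB Cloud Starter cluster to create your own TiDB Cloud cluster.
Follow Deploy a local test TiDB cluster or Deploy a production TiDB cluster to create a local cluster of v8.4.0 or a later version.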

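Get started

This section provides step-by-step instructions for integrating TiDB vector search with LangChain to perform semantic searches.

Step 1. Create a new Jupyter Notebook file

In your preferred directory, create a new Jupyter Notebook file named integrate_with_langchain.ipynb: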

touch integrate_with_langchain.ipynb

Step 2. Install required dependencies

In your project directory, run the following commands to install the required packages:

!pip install langchain langchain-community
!pip install langchain-openai
!pip install pymysql
!pip install tidb-vector
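
Before moving on, it can help to confirm that the packages above are actually visible to the notebook kernel. The following is a minimal sketch that uses only the Python standard library. The helper name installed_versions is hypothetical, and the distribution names are simply those from the pip commands above.

```python
from importlib import metadata

# Distribution names taken from the pip commands in this step.
REQUIRED = ["langchain", "langchain-community", "langchain-openai", "pymysql", "tidb-vector"]


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(REQUIRED).items():
    print(f"{name}: {version or 'NOT INSTALLED'}")
```

If any package is reported as not installed, rerun the corresponding pip command in the same kernel before importing it.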

Open the integrate_with_langchain.ipynb file in Jupyter Notebook, and then add the following code to import the required packages:

from langchain_community.document_loaders import TextLoader
from langchain_community.vectorstores import TiDBVectorStore
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter

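Step 3. Configure your environment

Configure the environment variables depending on the TiDB deployment option you've selected.

For a TiDB Cloud Starter cluster, take the following steps to obtain the cluster connection string and configure environment variables:

Navigate to the Clusters page, and then click the name of your target cluster to go to its overview page.
Click Connect in the upper-right corner. A connection dialog is displayed.
Ensure the configurations in the connection dialog match your operating environment.
- Connection Type is set to Public.
- Branch is set to main.
- Connect With is set to SQLAlchemy.
- Operating System matches your environment.
Click the PyMySQL tab and copy the connection string.
Tip
If you have not set a password yet, click Generate Password to generate a random one.

Configure environment variables.

This document uses OpenAI as the embedding model provider. In this step, you need to provide the connection string obtained in the previous step and your OpenAI API key.

Run the following code to configure the environment variables. You will be prompted to enter your connection string and OpenAI API key: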

# Use getpass to securely prompt for environment variables in your terminal.
import getpass
import os

# Copy your connection string from the TiDB Cloud console.
# Connection string format: "mysql+pymysql://<USER>:<PASSWORD>@<HOST>:4000/<DB>?ssl_ca=/etc/ssl/cert.pem&ssl_verify_cert=true&ssl_verify_identity=true"
tidb_connection_string = getpass.getpass("TiDB Connection String:")
os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

For a TiDB Self-Managed cluster, run the same code as above to configure the environment variables, and enter a connection string in the following format when prompted. Taking macOS as an example, the cluster connection string is as follows:

TIDB_DATABASE_URL="mysql+pymysql://<USERNAME>:<PASSWORD>@<HOST>:<PORT>/<DATABASE_NAME>"
# For example: TIDB_DATABASE_URL="mysql+pymysql://root@127.0.0.1:4000/test"

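You need to modify the values of the connection parameters according to your TiDB cluster. If you are running TiDB on your local machine, <HOST> is 127.0.0.1 by default. The initial <PASSWORD> is empty, so you can omit this field when you start the cluster for the first time.

The parameters are described as follows:

<USERNAME>: The username to connect to the TiDB cluster.
<PASSWORD>: The password to connect to the TiDB cluster.
<HOST>: The host of the TiDB cluster.
<PORT>: The port of the TiDB cluster.
<DATABASE_NAME>: The name of the database you want to connect to.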
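
If your password contains characters such as @ or /, pasting it into the URL as-is can break parsing. The following minimal sketch assembles the connection string from its parts and percent-encodes the credentials with the standard library. The helper name build_tidb_url is hypothetical and not part of any TiDB or LangChain API.

```python
from urllib.parse import quote_plus


def build_tidb_url(username, password, host, port, database):
    """Assemble a mysql+pymysql URL, percent-encoding the user name and password."""
    credentials = quote_plus(username)
    if password:  # an empty password is simply omitted
        credentials += ":" + quote_plus(password)
    return f"mysql+pymysql://{credentials}@{host}:{port}/{database}"


print(build_tidb_url("root", "", "127.0.0.1", 4000, "test"))
# mysql+pymysql://root@127.0.0.1:4000/test
```

You can paste the resulting string at the connection string prompt, or assign it to tidb_connection_string directly.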

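Step 4. Load the sample document

Step 4.1 Download the sample document

In your project directory, create a directory named data/how_to/ and download the sample document state_of_the_union.txt from the langchain-ai/langchain GitHub repository.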

!mkdir -p 'data/how_to/'
!wget 'https://raw.githubusercontent.com/langchain-ai/langchain/master/docs/docs/how_to/state_of_the_union.txt' -O 'data/how_to/state_of_the_union.txt'

Step 4.2 Load and split the document

Load the sample document from data/how_to/state_of_the_union.txt and split it into chunks of approximately 1,000 characters each using a CharacterTextSplitter.

loader = TextLoader("data/how_to/state_of_the_union.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)
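
To build intuition for why the chunks are only approximately 1,000 characters long, here is a simplified sketch of separator-based splitting with no overlap. It is an illustration, not LangChain's actual implementation. The function name split_into_chunks and the blank-line separator are assumptions.

```python
def split_into_chunks(text, chunk_size=1000, separator="\n\n"):
    """Greedily merge separator-delimited pieces into chunks of at most chunk_size characters.

    A single piece longer than chunk_size is kept whole rather than cut.
    """
    chunks, current = [], ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size or not current:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "\n\n".join(["a" * 400, "b" * 400, "c" * 400, "d" * 100])
print([len(chunk) for chunk in split_into_chunks(sample)])
# [802, 502]
```

Because whole paragraphs are merged rather than cut, chunk lengths vary with where the paragraph boundaries fall.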

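Step 5. Embed and store document vectors

TiDB vector store supports two distance metrics for measuring the similarity between vectors: cosine distance (cosine) and Euclidean distance (l2). The default strategy is cosine distance.

The following code creates a table named embedded_documents in TiDB, which is optimized for vector search.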

embeddings = OpenAIEmbeddings()
vector_store = TiDBVectorStore.from_documents(
    documents=docs,
    embedding=embeddings,
    table_name="embedded_documents",
    connection_string=tidb_connection_string,
    distance_strategy="cosine",  # default, another option is "l2"
)

Upon successful execution, you can directly view and access the embedded_documents table in your TiDB database.
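
To make the two distance strategies concrete, the following sketch computes cosine distance and Euclidean (L2) distance for small hand-written vectors in pure Python. The vectors are made up for illustration. Real embeddings have many more dimensions, and TiDB computes these distances inside the database.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity: depends only on the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def l2_distance(a, b):
    """Euclidean distance: depends on both direction and magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


q = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(cosine_distance(q, same_direction))  # 0.0
print(l2_distance(q, same_direction))      # 1.0
print(cosine_distance(q, orthogonal))      # 1.0
```

The first two lines show the practical difference: vectors that point the same way have cosine distance 0 regardless of length, while their L2 distance still reflects the difference in magnitude.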

Step 6. Perform a vector search

This step demonstrates how to query “What did the president say about Ketanji Brown Jackson” from the document state_of_the_union.txt.

query = "What did the president say about Ketanji Brown Jackson"

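Option 1: Use `similarity_search_with_score()`

The similarity_search_with_score() method calculates the vector space distance between the documents and the query. This distance serves as a similarity score, determined by the chosen distance_strategy. The method returns the top k documents with the lowest scores. A lower score indicates a higher similarity between a document and the query.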

docs_with_score = vector_store.similarity_search_with_score(query, k=3)
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

Expected output

--------------------------------------------------------------------------------
Score:  0.18472413652518527
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.21757513022785557
A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.

We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.

We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.

We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.22676987253721725
And for our LGBTQ+ Americans, let’s finally get the bipartisan Equality Act to my desk. The onslaught of state laws targeting transgender Americans and their families is wrong.

As I said last year, especially to our younger transgender Americans, I will always have your back as your President, so you can be yourself and reach your God-given potential.

While it often appears that we never agree, that isn’t true. I signed 80 bipartisan bills into law last year. From preventing government shutdowns to protecting Asian-Americans from still-too-common hate crimes to reforming military justice.

And soon, we’ll strengthen the Violence Against Women Act that I first wrote three decades ago. It is important for us to show the nation that we can come together and do big things.

So tonight I’m offering a Unity Agenda for the Nation. Four big things we can do together.

First, beat the opioid epidemic.
--------------------------------------------------------------------------------

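Option 2: Use `similarity_search_with_relevance_scores()`

The similarity_search_with_relevance_scores() method returns the top k documents with the highest relevance scores. A higher score indicates a higher degree of similarity between a document and the query.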

docs_with_relevance_score = vector_store.similarity_search_with_relevance_scores(query, k=2)
for doc, score in docs_with_relevance_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

Expected output

--------------------------------------------------------------------------------
Score:  0.8152758634748147
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Score:  0.7824248697721444
A former top litigator in private practice. A former federal public defender. And from a family of public school educators and police officers. A consensus builder. Since she’s been nominated, she’s received a broad range of support—from the Fraternal Order of Police to former judges appointed by Democrats and Republicans.

And if we are to advance liberty and justice, we need to secure the Border and fix the immigration system.

We can do both. At our border, we’ve installed new technology like cutting-edge scanners to better detect drug smuggling.

We’ve set up joint patrols with Mexico and Guatemala to catch more human traffickers.

We’re putting in place dedicated immigration judges so families fleeing persecution and violence can have their cases heard faster.

We’re securing commitments and supporting partners in South and Central America to host more refugees and secure their own borders.
--------------------------------------------------------------------------------
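
The expected outputs of the two options are consistent with a simple relationship for the same chunk: relevance score = 1 - cosine distance. The following sketch checks this using the numbers printed above. The helper name relevance_from_cosine_distance is hypothetical, and the relationship is inferred from these outputs rather than taken from the API documentation, so treat it as an observation about the default cosine strategy only.

```python
def relevance_from_cosine_distance(distance):
    """Convert a cosine distance to a relevance score (assumed: 1 - distance)."""
    return 1.0 - distance


# Cosine distances printed by similarity_search_with_score() in Option 1.
distances = [0.18472413652518527, 0.21757513022785557]
# Relevance scores printed by similarity_search_with_relevance_scores() in Option 2.
relevance_scores = [0.8152758634748147, 0.7824248697721444]

for distance, relevance in zip(distances, relevance_scores):
    difference = abs(relevance_from_cosine_distance(distance) - relevance)
    print(f"distance={distance:.4f} relevance={relevance:.4f} match={difference < 1e-9}")
```

This explains why Option 1 ranks by the lowest score and Option 2 by the highest score while both return the chunks in the same order.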

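Use as a retriever

In LangChain, a retriever is an interface that retrieves documents in response to unstructured queries, providing more functionality than a vector store. The following code demonstrates how to use TiDB vector store as a retriever.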

retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.8},
)
docs_retrieved = retriever.invoke(query)
for doc in docs_retrieved:
    print("-" * 80)
    print(doc.page_content)
    print("-" * 80)

The expected output is as follows:

--------------------------------------------------------------------------------
Tonight. I call on the Senate to: Pass the Freedom to Vote Act. Pass the John Lewis Voting Rights Act. And while you’re at it, pass the Disclose Act so Americans can know who is funding our elections.

Tonight, I’d like to honor someone who has dedicated his life to serve this country: Justice Stephen Breyer—an Army veteran, Constitutional scholar, and retiring Justice of the United States Supreme Court. Justice Breyer, thank you for your service.

One of the most serious constitutional responsibilities a President has is nominating someone to serve on the United States Supreme Court.

And I did that 4 days ago, when I nominated Circuit Court of Appeals Judge Ketanji Brown Jackson. One of our nation’s top legal minds, who will continue Justice Breyer’s legacy of excellence.
--------------------------------------------------------------------------------
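
Only one chunk is returned even though k is 3, because the threshold removes candidates whose relevance score is too low. The following sketch reproduces that filtering on the relevance scores shown in Option 2. The function name filter_by_relevance and the placeholder chunk labels are hypothetical, and the use of a greater-than-or-equal comparison is an assumption about how the threshold is applied.

```python
def filter_by_relevance(scored_docs, k, score_threshold):
    """Keep at most k (doc, relevance) pairs whose relevance is >= score_threshold."""
    ranked = sorted(scored_docs, key=lambda pair: pair[1], reverse=True)
    return [(doc, score) for doc, score in ranked if score >= score_threshold][:k]


# Relevance scores from the Option 2 output; the labels stand in for the chunk text.
scored = [
    ("chunk A", 0.8152758634748147),
    ("chunk B", 0.7824248697721444),
]

print(filter_by_relevance(scored, k=3, score_threshold=0.8))
# [('chunk A', 0.8152758634748147)]
```

Lowering score_threshold to 0.78 in this sketch would let both chunks through, which is a quick way to reason about a threshold before querying the database.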

Remove the vector store

To remove an existing TiDB vector store, use the drop_vectorstore() method:

vector_store.drop_vectorstore()

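Search with metadata filters

To refine your searches, you can use metadata filters to retrieve only the nearest-neighbor results that match the applied filters.

Supported metadata types

Each document in the TiDB vector store can be paired with metadata, structured as key-value pairs within a JSON object. Keys are always strings, and values can be any of the following types:

String
Number: integer or floating point
Boolean: true or false

For example, the following is a valid metadata payload: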

{
  "page": 12,
  "book_title": "Siddhartha"
}
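
If you build metadata programmatically, a small check before calling add_texts() can catch unsupported values early. The following is a minimal sketch based only on the type list above. The function name is_valid_metadata is hypothetical and is not part of TiDBVectorStore.

```python
def is_valid_metadata(metadata):
    """Check that keys are strings and values are strings, numbers, or booleans."""
    if not isinstance(metadata, dict):
        return False
    return all(
        isinstance(key, str) and isinstance(value, (str, int, float, bool))
        for key, value in metadata.items()
    )


print(is_valid_metadata({"page": 12, "book_title": "Siddhartha"}))  # True
print(is_valid_metadata({"tags": ["classic", "novel"]}))            # False
```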

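Metadata filter syntax

The available filters include the following:

$or: Selects vectors that match any one of the specified conditions.
$and: Selects vectors that match all the specified conditions.
$eq: Equal to the specified value.
$ne: Not equal to the specified value.
$gt: Greater than the specified value.
$gte: Greater than or equal to the specified value.
$lt: Less than the specified value.
$lte: Less than or equal to the specified value.
$in: In the specified array of values.
$nin: Not in the specified array of values.

If the metadata of a document is as follows: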

{
  "page": 12,
  "book_title": "Siddhartha"
}

Each of the following metadata filters matches this document:

{ "page": 12 }

{ "page": { "$eq": 12 } }

{
  "page": {
    "$in": [11, 12, 13]
  }
}

{ "page": { "$nin": [13] } }

{ "page": { "$lt": 13 } }

{
  "$or": [{ "page": 11 }, { "page": 12 }],
  "$and": [{ "page": 12 }, { "book_title": "Siddhartha" }]
}

In a metadata filter, TiDB treats each key-value pair as a separate filter clause and combines these clauses using the AND logical operator.
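
The semantics described above can be sketched as a small in-memory evaluator. This is an illustration of the operator meanings and the top-level AND combination, not the code TiDB or LangChain runs. The function name matches is hypothetical, and treating a missing key as a non-match is an assumption.

```python
import operator

# Comparison operators from the list above, mapped to Python equivalents.
COMPARATORS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata, flt):
    """Return True if metadata satisfies every top-level entry of flt (AND semantics)."""
    for key, condition in flt.items():
        if key == "$and":
            ok = all(matches(metadata, sub) for sub in condition)
        elif key == "$or":
            ok = any(matches(metadata, sub) for sub in condition)
        elif isinstance(condition, dict):
            # {"field": {"$op": operand, ...}}: every operator must hold.
            ok = key in metadata and all(
                COMPARATORS[op](metadata[key], operand) for op, operand in condition.items()
            )
        else:
            # {"field": value} is shorthand for {"field": {"$eq": value}}.
            ok = key in metadata and metadata[key] == condition
        if not ok:
            return False
    return True


meta = {"page": 12, "book_title": "Siddhartha"}
print(matches(meta, {"page": {"$in": [11, 12, 13]}}))         # True
print(matches(meta, {"page": {"$gt": 20}}))                   # False
print(matches(meta, {"page": 12, "book_title": "Demian"}))    # False
```

The last call shows the AND combination: the page clause matches, but the book_title clause does not, so the filter as a whole rejects the document.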

Example

The following example adds two documents to TiDBVectorStore and adds a title field to each document as the metadata:

vector_store.add_texts(
    texts=[
        "TiDB Vector offers advanced, high-speed vector processing capabilities, enhancing AI workflows with efficient data handling and analytics support.",
        "TiDB Vector, starting as low as $10 per month for basic usage",
    ],
    metadatas=[
        {"title": "TiDB Vector functionality"},
        {"title": "TiDB Vector Pricing"},
    ],
)

The expected output is as follows:

[UUID('c782cb02-8eec-45be-a31f-fdb78914f0a7'),
 UUID('08dcd2ba-9f16-4f29-a9b7-18141f8edae3')]

Perform a similarity search with metadata filters:

docs_with_score = vector_store.similarity_search_with_score(
    "Introduction to TiDB Vector", filter={"title": "TiDB Vector functionality"}, k=4
)
for doc, score in docs_with_score:
    print("-" * 80)
    print("Score: ", score)
    print(doc.page_content)
    print("-" * 80)

The expected output is as follows:

--------------------------------------------------------------------------------
Score:  0.12761409169211535
TiDB Vector offers advanced, high-speed vector processing capabilities, enhancing AI workflows with efficient data handling and analytics support.
--------------------------------------------------------------------------------

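Advanced usage example: travel agent

This section demonstrates a travel agent use case that integrates vector search with LangChain. The goal is to create personalized travel reports for clients, helping them find airports with specific amenities, such as clean lounges and vegetarian options.

The process involves two main steps:

Perform a semantic search across airport reviews to identify airport codes that match the desired amenities.
Execute a SQL query to join these codes with route information, highlighting airlines and destinations that align with the user's preferences.

Prepare data

First, create a table to store airport route data:

# Create a table to store flight plan data.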
vector_store.tidb_vector_client.execute(
    """CREATE TABLE airplan_routes (
        id INT AUTO_INCREMENT PRIMARY KEY,
        airport_code VARCHAR(10),
        airline_code VARCHAR(10),
        destination_code VARCHAR(10),
        route_details TEXT,
        duration TIME,
        frequency INT,
        airplane_type VARCHAR(50),
        price DECIMAL(10, 2),
        layover TEXT
    );"""
)

# Insert some sample data into airplan_routes and the vector table.
vector_store.tidb_vector_client.execute(
    """INSERT INTO airplan_routes (
        airport_code,
        airline_code,
        destination_code,
        route_details,
        duration,
        frequency,
        airplane_type,
        price,
        layover
    ) VALUES
    ('JFK', 'DL', 'LAX', 'Non-stop from JFK to LAX.', '06:00:00', 5, 'Boeing 777', 299.99, 'None'),
    ('LAX', 'AA', 'ORD', 'Direct LAX to ORD route.', '04:00:00', 3, 'Airbus A320', 149.99, 'None'),
    ('EFGH', 'UA', 'SEA', 'Daily flights from SFO to SEA.', '02:30:00', 7, 'Boeing 737', 129.99, 'None');
    """
)
vector_store.add_texts(
    texts=[
        "Clean lounges and excellent vegetarian dining options. Highly recommended.",
        "Comfortable seating in lounge areas and diverse food selections, including vegetarian.",
        "Small airport with basic facilities.",
    ],
    metadatas=[
        {"airport_code": "JFK"},
        {"airport_code": "LAX"},
        {"airport_code": "EFGH"},
    ],
)

The expected output is as follows:

[UUID('6dab390f-acd9-4c7d-b252-616606fbc89b'),
 UUID('9e811801-0e6b-4893-8886-60f4fb67ce69'),
 UUID('f426747c-0f7b-4c62-97ed-3eeb7c8dd76e')]

Perform a semantic search

The following code searches for airports with clean facilities and vegetarian options:

retriever = vector_store.as_retriever(
    search_type="similarity_score_threshold",
    search_kwargs={"k": 3, "score_threshold": 0.85},
)
semantic_query = "Could you recommend a US airport with clean lounges and good vegetarian dining options?"
reviews = retriever.invoke(semantic_query)
for r in reviews:
    print("-" * 80)
    print(r.page_content)
    print(r.metadata)
    print("-" * 80)

The expected output is as follows:

--------------------------------------------------------------------------------
Clean lounges and excellent vegetarian dining options. Highly recommended.
{'airport_code': 'JFK'}
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
Comfortable seating in lounge areas and diverse food selections, including vegetarian.
{'airport_code': 'LAX'}
--------------------------------------------------------------------------------

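Retrieve detailed airport information

Extract airport codes from the search results and query the database for detailed route information:

# Extract airport codes from the metadata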
airport_codes = [review.metadata["airport_code"] for review in reviews]

# Execute the query to get airport details
search_query = "SELECT * FROM airplan_routes WHERE airport_code IN :codes"
params = {"codes": tuple(airport_codes)}

airport_details = vector_store.tidb_vector_client.execute(search_query, params)
airport_details.get("result")

The expected output is as follows:

[(1, 'JFK', 'DL', 'LAX', 'Non-stop from JFK to LAX.', datetime.timedelta(seconds=21600), 5, 'Boeing 777', Decimal('299.99'), 'None'),
 (2, 'LAX', 'AA', 'ORD', 'Direct LAX to ORD route.', datetime.timedelta(seconds=14400), 3, 'Airbus A320', Decimal('149.99'), 'None')]
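
The second step of this flow, looking up routes for the airport codes found by the semantic search, can be tried without a cluster. The following sketch uses the standard-library sqlite3 module purely as a stand-in for TiDB, with a reduced version of the airplan_routes table. The helper name routes_for is hypothetical. It also guards against an empty code list, which can happen when no review passes the score threshold. An empty IN list is not valid in MySQL-compatible SQL.

```python
import sqlite3

# In-memory SQLite database standing in for TiDB (an assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE airplan_routes (airport_code TEXT, airline_code TEXT, destination_code TEXT)"
)
conn.executemany(
    "INSERT INTO airplan_routes VALUES (?, ?, ?)",
    [("JFK", "DL", "LAX"), ("LAX", "AA", "ORD"), ("EFGH", "UA", "SEA")],
)


def routes_for(conn, airport_codes):
    """Return routes departing from any of the given airport codes."""
    if not airport_codes:
        return []  # nothing matched the semantic search, so skip the query
    placeholders = ", ".join("?" for _ in airport_codes)
    sql = (
        "SELECT * FROM airplan_routes "
        f"WHERE airport_code IN ({placeholders}) ORDER BY airport_code"
    )
    return conn.execute(sql, list(airport_codes)).fetchall()


print(routes_for(conn, ["JFK", "LAX"]))
# [('JFK', 'DL', 'LAX'), ('LAX', 'AA', 'ORD')]
```

Only the placeholders are interpolated into the SQL text. The airport codes themselves are always passed as bound parameters.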

Streamline the process

Alternatively, you can streamline the entire process using a single SQL query:

# Name of the vector table created in Step 5.
TABLE_NAME = "embedded_documents"
search_query = f"""
    SELECT
        VEC_Cosine_Distance(se.embedding, :query_vector) as distance,
        ar.*,
        se.document as airport_review
    FROM
        airplan_routes ar
    JOIN
        {TABLE_NAME} se ON ar.airport_code = JSON_UNQUOTE(JSON_EXTRACT(se.meta, '$.airport_code'))
    ORDER BY distance ASC
    LIMIT 5;
"""
query_vector = embeddings.embed_query(semantic_query)
params = {"query_vector": str(query_vector)}
airport_details = vector_store.tidb_vector_client.execute(search_query, params)
airport_details.get("result")

The expected output is as follows:

[(0.1219207353407008, 1, 'JFK', 'DL', 'LAX', 'Non-stop from JFK to LAX.', datetime.timedelta(seconds=21600), 5, 'Boeing 777', Decimal('299.99'), 'None', 'Clean lounges and excellent vegetarian dining options. Highly recommended.'),
 (0.14613754359804654, 2, 'LAX', 'AA', 'ORD', 'Direct LAX to ORD route.', datetime.timedelta(seconds=14400), 3, 'Airbus A320', Decimal('149.99'), 'None', 'Comfortable seating in lounge areas and diverse food selections, including vegetarian.'),
 (0.19840519342700513, 3, 'EFGH', 'UA', 'SEA', 'Daily flights from SFO to SEA.', datetime.timedelta(seconds=9000), 7, 'Boeing 737', Decimal('129.99'), 'None', 'Small airport with basic facilities.')]
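
Conceptually, the single query joins each route to the review stored for the same airport code and orders the joined rows by cosine distance to the query vector. The following pure-Python sketch mirrors that logic on toy data. The two-dimensional embeddings are made up for illustration, and the function name nearest_routes is hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine similarity of a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


routes = [
    {"airport_code": "EFGH", "destination_code": "SEA"},
    {"airport_code": "JFK", "destination_code": "LAX"},
    {"airport_code": "LAX", "destination_code": "ORD"},
]
# Toy embeddings; real ones come from the embedding model.
reviews = [
    {"meta": {"airport_code": "JFK"}, "embedding": [1.0, 0.0], "document": "Clean lounges"},
    {"meta": {"airport_code": "LAX"}, "embedding": [0.8, 0.6], "document": "Comfortable seating"},
    {"meta": {"airport_code": "EFGH"}, "embedding": [0.0, 1.0], "document": "Basic facilities"},
]


def nearest_routes(query_vector, routes, reviews, limit=5):
    """Join routes to reviews on airport_code and sort by distance, ascending."""
    rows = [
        (cosine_distance(review["embedding"], query_vector), route, review["document"])
        for route in routes
        for review in reviews
        if route["airport_code"] == review["meta"]["airport_code"]
    ]
    rows.sort(key=lambda row: row[0])
    return rows[:limit]


for distance, route, document in nearest_routes([1.0, 0.0], routes, reviews):
    print(round(distance, 2), route["airport_code"], document)
```

Unlike the retriever-based flow, this approach applies no score threshold, so every joined row is returned up to the limit, ranked from most to least similar.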

Clean up data

Finally, clean up the resources by dropping the created table:

vector_store.tidb_vector_client.execute("DROP TABLE airplan_routes")

The expected output is as follows:

{'success': True, 'result': 0, 'error': None}
