Implementing Semantic Caching for LLMs with RedisVL
RedisVL provides a powerful `SemanticCache` interface that combines Redis's built-in caching with vector search to store responses to previously answered questions. This reduces requests to the large language model (LLM) service and the tokens consumed, cutting cost, and it improves application throughput by shortening response times. This article walks through using RedisVL as a semantic cache, covering initialization, basic usage, distance-threshold tuning, TTL policies, performance testing, and tag- and filter-based access control.
1. Prerequisites
Before you begin, make sure that:
- `redisvl` is installed and the corresponding Python environment is active.
- A Redis instance is running (Redis Stack is recommended).
- An OpenAI API key is configured (or substitute another LLM service).
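A typical setup looks like the following (package and image names are the standard ones; pin versions as needed in your environment):

```shell
# Install the RedisVL and OpenAI client libraries
pip install redisvl openai

# Start a local Redis Stack instance (includes the vector search module)
docker run -d --name redis-stack -p 6379:6379 redis/redis-stack:latest
```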
Here is sample code that initializes the OpenAI client:
import os
import getpass
from openai import OpenAI
os.environ["TOKENIZERS_PARALLELISM"] = "False"
api_key = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")
client = OpenAI(api_key=api_key)
def ask_openai(question: str) -> str:
response = client.completions.create(
model="gpt-3.5-turbo-instruct",
prompt=question,
max_tokens=200
)
return response.choices[0].text.strip()
# Test
print(ask_openai("What is the capital of France?"))  # prints: The capital of France is Paris.
2. Initializing SemanticCache
When initialized, `SemanticCache` automatically creates an index in Redis for storing the cached content. Initialization code:
from redisvl.extensions.cache.llm import SemanticCache
from redisvl.utils.vectorize import HFTextVectorizer
llmcache = SemanticCache(
    name="llmcache",                                          # index name
    redis_url="redis://localhost:6379",                       # Redis connection URL
    distance_threshold=0.1,                                   # semantic similarity threshold
    vectorizer=HFTextVectorizer("redis/langcache-embed-v1"),  # embedding model
)
Inspect the index:
rvl index info -i llmcache
Output:
Index Information:
╭───────────────┬───────────────┬───────────────┬───────────────┬───────────────╮
│ Index Name │ Storage Type │ Prefixes │ Index Options │ Indexing │
├───────────────┼───────────────┼───────────────┼───────────────┼───────────────┤
│ llmcache │ HASH │ ['llmcache'] │ [] │ 0 │
╰───────────────┴───────────────┴───────────────┴───────────────┴───────────────╯
Index Fields:
╭───────────────┬───────────────┬─────────┬─────────────────────────────────────────────╮
│ Name          │ Attribute     │ Type    │ Options                                     │
├───────────────┼───────────────┼─────────┼─────────────────────────────────────────────┤
│ prompt        │ prompt        │ TEXT    │ WEIGHT 1                                    │
│ response      │ response      │ TEXT    │ WEIGHT 1                                    │
│ inserted_at   │ inserted_at   │ NUMERIC │                                             │
│ updated_at    │ updated_at    │ NUMERIC │                                             │
│ prompt_vector │ prompt_vector │ VECTOR  │ algorithm FLAT, data_type FLOAT32, dim 768, │
│               │               │         │ distance_metric COSINE                      │
╰───────────────┴───────────────┴─────────┴─────────────────────────────────────────────╯
3. Basic Cache Usage
The following shows how to store and retrieve responses with `SemanticCache`:
question = "What is the capital of France?"
# Check the cache (initially empty)
if response := llmcache.check(prompt=question):
print(response)
else:
print("Empty cache")  # prints: Empty cache
# Store the question, answer, and metadata
llmcache.store(
prompt=question,
response="Paris",
metadata={"city": "Paris", "country": "france"}
)
# Check the cache again
if response := llmcache.check(prompt=question, return_fields=["prompt", "response", "metadata"]):
print(response)
Output:
[{'prompt': 'What is the capital of France?', 'response': 'Paris', 'metadata': {'city': 'Paris', 'country': 'france'}, 'key': 'llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545'}]
Check a semantically similar query:
question = "What actually is the capital of France?"
print(llmcache.check(prompt=question)[0]['response'])  # prints: Paris
4. Tuning the Distance Threshold
The semantic similarity threshold can be adjusted on the fly to suit different embedding models or business needs:
llmcache.set_threshold(0.5)  # relax the threshold
question = "What is the capital city of the country in Europe that also has a city named Nice?"
print(llmcache.check(prompt=question)[0]['response'])  # prints: Paris
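Why does loosening the threshold admit more matches? With cosine distance, 0 means identical direction and the value grows as vectors diverge, so a larger threshold accepts less similar prompts. A small numeric illustration with toy 2-D vectors (not real embeddings):

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for identical direction, larger as vectors diverge
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

query = [1.0, 0.0]
paraphrase = [0.9, 0.3]  # nearly the same direction as the query
related = [0.6, 0.8]     # points noticeably elsewhere

d_para = cosine_distance(query, paraphrase)
d_rel = cosine_distance(query, related)
print(f"paraphrase: {d_para:.3f}, related: {d_rel:.3f}")  # paraphrase: 0.051, related: 0.400

# A tight threshold (0.1) matches only the paraphrase; a loose one (0.5) matches both.
for threshold in (0.1, 0.5):
    matches = sum(1 for d in (d_para, d_rel) if d <= threshold)
    print(f"threshold {threshold}: {matches} match(es)")
```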
Clear the cache:
llmcache.clear()
print(llmcache.check(prompt=question))  # prints: []
5. Using a TTL Policy
Redis supports TTL (Time To Live), which lets cache entries expire automatically. The following example sets a 5-second TTL:
import time

llmcache.set_ttl(5)  # set a 5-second TTL
llmcache.store("This is a TTL test", "This is a TTL test response")
time.sleep(6)
result = llmcache.check("This is a TTL test")
print(result)  # prints: []
# Reset TTL to None (entries persist indefinitely)
llmcache.set_ttl()
6. Performance Testing
Comparing response times with and without the cache shows the performance benefit of `SemanticCache`:
import time
import numpy as np
def answer_question(question: str) -> str:
results = llmcache.check(prompt=question)
if results:
return results[0]["response"]
else:
answer = ask_openai(question)
return answer
# Measure response time without the cache
start = time.time()
question = "What was the name of the first US President?"
answer = answer_question(question)
end = time.time()
print(f"Without caching, a call to OpenAI took {end-start} seconds.")
# Store in the cache
llmcache.store(prompt=question, response="George Washington")
# Measure average response time with the cache
times = []
for _ in range(10):
cached_start = time.time()
cached_answer = answer_question(question)
cached_end = time.time()
times.append(cached_end-cached_start)
avg_time_with_cache = np.mean(times)
print(f"Avg time with cache: {avg_time_with_cache}")
print(f"Time saved: {round(((end - start) - avg_time_with_cache) / (end - start) * 100, 2)}%")
Output:
Without caching, a call to OpenAI took 0.8826751708984375 seconds.
Avg time with cache: 0.0463670015335083
Time saved: 94.75%
Check index statistics:
rvl stats -i llmcache
7. Cache Access Control and Tag Filtering
In multi-user or complex-workflow scenarios, custom `filterable_fields` enable data isolation and precise queries:
private_cache = SemanticCache(
name="private_cache",
filterable_fields=[{"name": "user_id", "type": "tag"}]
)
# Store data for different users
private_cache.store(
prompt="What is the phone number linked to my account?",
response="The number on file is 123-555-0000",
filters={"user_id": "abc"}
)
private_cache.store(
prompt="What's the phone number linked in my account?",
response="The number on file is 123-555-1111",
filters={"user_id": "def"}
)
# Query with a tag filter
from redisvl.query.filter import Tag
user_id_filter = Tag("user_id") == "abc"
response = private_cache.check(
prompt="What is the phone number linked to my account?",
filter_expression=user_id_filter,
num_results=2
)
print(f"found {len(response)} entry \n{response[0]['response']}")
Output:
found 1 entry
The number on file is 123-555-0000
Clean up:
private_cache.delete()
8. Complex Filter Example
Multiple filterable fields and composite filter expressions are supported:
complex_cache = SemanticCache(
name="account_data",
filterable_fields=[
{"name": "user_id", "type": "tag"},
{"name": "account_type", "type": "tag"},
{"name": "account_balance", "type": "numeric"},
{"name": "transaction_amount", "type": "numeric"}
]
)
# Store several records
complex_cache.store(
prompt="what is my most recent checking account transaction under $100?",
response="Your most recent transaction was for $75",
filters={"user_id": "abc", "account_type": "checking", "transaction_amount": 75}
)
complex_cache.store(
prompt="what is my most recent savings account transaction?",
response="Your most recent deposit was for $300",
filters={"user_id": "abc", "account_type": "savings", "transaction_amount": 300}
)
complex_cache.store(
prompt="what is my most recent checking account transaction over $200?",
response="Your most recent transaction was for $350",
filters={"user_id": "abc", "account_type": "checking", "transaction_amount": 350}
)
complex_cache.store(
prompt="what is my checking account balance?",
response="Your current checking account is $1850",
filters={"user_id": "abc", "account_type": "checking"}
)
# Query with a composite filter
from redisvl.query.filter import Num
value_filter = Num("transaction_amount") > 100
account_filter = Tag("account_type") == "checking"
complex_filter = value_filter & account_filter
complex_cache.set_threshold(0.3)
response = complex_cache.check(
prompt="what is my most recent checking account transaction?",
filter_expression=complex_filter,
num_results=5
)
print(f'found {len(response)} entry')
print(response[0]["response"])
Output:
found 1 entry
Your most recent transaction was for $350
Clean up:
complex_cache.delete()
9. Summary
RedisVL's `SemanticCache` offers an efficient way to cache LLM responses: vector search provides semantic matching, which significantly cuts request latency and cost. With support for dynamic threshold tuning, TTL policies, tag filtering, and complex queries, it suits multi-user and complex-workflow scenarios. Whether for simple Q&A caching or applications that need access control, RedisVL provides a flexible, high-performance solution. For more details, see the official RedisVL documentation.
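Putting the pieces together, the typical cache-aside pattern around any LLM client looks like the sketch below. `DictCache` and the lambda are stand-ins for illustration; in practice you would pass the `llmcache` instance and `ask_openai` function from the sections above:

```python
def cached_answer(cache, llm_call, question: str) -> tuple[str, bool]:
    """Return (answer, was_cache_hit) using the cache-aside pattern."""
    hits = cache.check(prompt=question)
    if hits:
        return hits[0]["response"], True   # cache hit: skip the LLM entirely
    answer = llm_call(question)            # cache miss: call the LLM ...
    cache.store(prompt=question, response=answer)  # ... and remember the answer
    return answer, False

# Stub cache with the same check/store shape, for illustration only
class DictCache:
    def __init__(self):
        self._d = {}
    def check(self, prompt):
        return [{"response": self._d[prompt]}] if prompt in self._d else []
    def store(self, prompt, response):
        self._d[prompt] = response

cache = DictCache()
llm = lambda q: "George Washington"  # stand-in for ask_openai
print(cached_answer(cache, llm, "First US President?"))  # ('George Washington', False)
print(cached_answer(cache, llm, "First US President?"))  # ('George Washington', True)
```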