Implementing Semantic Caching for LLMs with RedisVL
RedisVL provides a powerful `SemanticCache` interface that combines Redis's built-in caching with vector search to store responses to previously answered questions. This reduces requests to the large language model (LLM) service and the tokens consumed, cutting cost, and it improves application throughput by shortening response times. This article walks through using RedisVL as a semantic cache, covering initialization, basic usage, distance-threshold tuning, TTL policies, performance testing, and tag- and filter-based access control.
1. Prerequisites
Before you begin, make sure that:
- `redisvl` is installed and the corresponding Python environment is active.
- A Redis instance is running (Redis Stack is recommended).
- An OpenAI API key is configured (or substitute another LLM service).
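A typical setup looks like the following (package and image names are the standard ones; pin versions as needed in your environment):

```shell
# Install the RedisVL and OpenAI client libraries
pip install redisvl openai

# Start a local Redis Stack instance (includes the vector search module)
docker run -d --name redis-stack -p 6379:6379 redis/redis-stack:latest
```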
Here is sample code that initializes the OpenAI client:
import os
import getpass
from openai import OpenAI
os.environ["TOKENIZERS_PARALLELISM"] = "False"
api_key = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")
client = OpenAI(api_key=api_key)
def ask_openai(question: str) -> str:
response = client.completions.create(
model="gpt-3.5-turbo-instruct",
prompt=question,
max_tokens=200
)
return response.choices[0].text.strip()
# Test
print(ask_openai("What is the capital of France?"))  # prints: The capital of France is Paris.
2. Initializing SemanticCache
When initialized, `SemanticCache` automatically creates an index in Redis for storing the cached content. Initialization code:
from redisvl.extensions.cache.llm import SemanticCache
from redisvl.utils.vectorize import HFTextVectorizer
llmcache = SemanticCache(
    name="llmcache",                                          # index name
    redis_url="redis://localhost:6379",                       # Redis connection URL
    distance_threshold=0.1,                                   # semantic similarity threshold
    vectorizer=HFTextVectorizer("redis/langcache-embed-v1"),  # embedding model
)
Inspect the index:
rvl index info -i llmcache
Output:
Index Information:
╭───────────────┬───────────────┬───────────────┬───────────────┬───────────────╮
│ Index Name │ Storage Type │ Prefixes │ Index Options │ Indexing │
├───────────────┼───────────────┼───────────────┼───────────────┼───────────────┤
│ llmcache │ HASH │ ['llmcache'] │ [] │ 0 │
╰───────────────┴───────────────┴───────────────┴───────────────┴───────────────╯
Index Fields:
╭───────────────┬───────────────┬─────────┬─────────────────────────────────────────────╮
│ Name          │ Attribute     │ Type    │ Options                                     │
├───────────────┼───────────────┼─────────┼─────────────────────────────────────────────┤
│ prompt        │ prompt        │ TEXT    │ WEIGHT 1                                    │
│ response      │ response      │ TEXT    │ WEIGHT 1                                    │
│ inserted_at   │ inserted_at   │ NUMERIC │                                             │
│ updated_at    │ updated_at    │ NUMERIC │                                             │
│ prompt_vector │ prompt_vector │ VECTOR  │ algorithm FLAT, data_type FLOAT32, dim 768, │
│               │               │         │ distance_metric COSINE                      │
╰───────────────┴───────────────┴─────────┴─────────────────────────────────────────────╯
3. Basic Cache Usage
The following shows how to store and retrieve responses with `SemanticCache`:
question = "What is the capital of France?"
# Check the cache (initially empty)
if response := llmcache.check(prompt=question):
print(response)
else:
print("Empty cache")  # prints: Empty cache
# Store the question, answer, and metadata
llmcache.store(
prompt=question,
response="Paris",
metadata={"city": "Paris", "country": "france"}
)
# Check the cache again
if response := llmcache.check(prompt=question, return_fields=["prompt", "response", "metadata"]):
print(response)
Output:
[{'prompt': 'What is the capital of France?', 'response': 'Paris', 'metadata': {'city': 'Paris', 'country': 'france'}, 'key': 'llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545'}]
Check a semantically similar query:
question = "What actually is the capital of France?"
print(llmcache.check(prompt=question)[0]['response'])  # prints: Paris
4. Tuning the Distance Threshold
The semantic similarity threshold can be adjusted on the fly to suit different embedding models or business needs:
llmcache.set_threshold(0.5)  # relax the threshold
question = "What is the capital city of the country in Europe that also has a city named Nice?"
print(llmcache.check(prompt=question)[0]['response'])  # prints: Paris
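Why does loosening the threshold admit more matches? With cosine distance, 0 means identical direction and the value grows as vectors diverge, so a larger threshold accepts less similar prompts. A small numeric illustration with toy 2-D vectors (not real embeddings):

```python
import math

def cosine_distance(a, b):
    # 1 - cosine similarity: 0 for identical direction, larger as vectors diverge
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

query = [1.0, 0.0]
paraphrase = [0.9, 0.3]  # nearly the same direction as the query
related = [0.6, 0.8]     # points noticeably elsewhere

d_para = cosine_distance(query, paraphrase)
d_rel = cosine_distance(query, related)
print(f"paraphrase: {d_para:.3f}, related: {d_rel:.3f}")  # paraphrase: 0.051, related: 0.400

# A tight threshold (0.1) matches only the paraphrase; a loose one (0.5) matches both.
for threshold in (0.1, 0.5):
    matches = sum(1 for d in (d_para, d_rel) if d <= threshold)
    print(f"threshold {threshold}: {matches} match(es)")
```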
Clear the cache:
llmcache.clear()
print(llmcache.check(prompt=question))  # prints: []
5. Using a TTL Policy
Redis supports TTL (Time To Live), which lets cache entries expire automatically. The following example sets a 5-second TTL:
import time

llmcache.set_ttl(5)  # set a 5-second TTL
llmcache.store("This is a TTL test", "This is a TTL test response")
time.sleep(6)
result = llmcache.check("This is a TTL test")
print(result)  # prints: []
# Reset TTL to None (entries persist indefinitely)
llmcache.set_ttl()
6. Performance Testing
Comparing response times with and without the cache shows the performance benefit of `SemanticCache`:
import time
import numpy as np
def answer_question(question: str) -> str:
results = llmcache.check(prompt=question)
if results:
return results[0]["response"]
else:
answer = ask_openai(question)
return answer
# Measure response time without the cache
start = time.time()
question = "What was the name of the first US President?"
answer = answer_question(question)
end = time.time()
print(f"Without caching, a call to OpenAI took {end-start} seconds.")
# Store in the cache
llmcache.store(prompt=question, response="George Washington")
# Measure average response time with the cache
times = []
for _ in range(10):
cached_start = time.time()
cached_answer = answer_question(question)
cached_end = time.time()
times.append(cached_end-cached_start)
avg_time_with_cache = np.mean(times)
print(f"Avg time with cache: {avg_time_with_cache}")
print(f"Time saved: {round(((end - start) - avg_time_with_cache) / (end - start) * 100, 2)}%")
Output:
Without caching, a call to OpenAI took 0.8826751708984375 seconds.
Avg time with cache: 0.0463670015335083
Time saved: 94.75%
Check index statistics:
rvl stats -i llmcache
7. Cache Access Control and Tag Filtering
In multi-user or complex-workflow scenarios, custom `filterable_fields` enable data isolation and precise queries:
private_cache = SemanticCache(
name="private_cache",
filterable_fields=[{"name": "user_id", "type": "tag"}]
)
# Store data for different users
private_cache.store(
prompt="What is the phone number linked to my account?",
response="The number on file is 123-555-0000",
filters={"user_id": "abc"}
)
private_cache.store(
prompt="What's the phone number linked in my account?",
response="The number on file is 123-555-1111",
filters={"user_id": "def"}
)
# Query with a tag filter
from redisvl.query.filter import Tag
user_id_filter = Tag("user_id") == "abc"
response = private_cache.check(
prompt="What is the phone number linked to my account?",
filter_expression=user_id_filter,
num_results=2
)
print(f"found {len(response)} entry \n{response[0]['response']}")
Output:
found 1 entry
The number on file is 123-555-0000
Clean up:
private_cache.delete()
8. Complex Filter Example
Multiple filterable fields and composite filter expressions are supported:
complex_cache = SemanticCache(
name="account_data",
filterable_fields=[
{"name": "user_id", "type": "tag"},
{"name": "account_type", "type": "tag"},
{"name": "account_balance", "type": "numeric"},
{"name": "transaction_amount", "type": "numeric"}
]
)
# Store several records
complex_cache.store(
prompt="what is my most recent checking account transaction under $100?",
response="Your most recent transaction was for $75",
filters={"user_id": "abc", "account_type": "checking", "transaction_amount": 75}
)
complex_cache.store(
prompt="what is my most recent savings account transaction?",
response="Your most recent deposit was for $300",
filters={"user_id": "abc", "account_type": "savings", "transaction_amount": 300}
)
complex_cache.store(
prompt="what is my most recent checking account transaction over $200?",
response="Your most recent transaction was for $350",
filters={"user_id": "abc", "account_type": "checking", "transaction_amount": 350}
)
complex_cache.store(
prompt="what is my checking account balance?",
response="Your current checking account is $1850",
filters={"user_id": "abc", "account_type": "checking"}
)
# Query with a composite filter
from redisvl.query.filter import Num
value_filter = Num("transaction_amount") > 100
account_filter = Tag("account_type") == "checking"
complex_filter = value_filter & account_filter
complex_cache.set_threshold(0.3)
response = complex_cache.check(
prompt="what is my most recent checking account transaction?",
filter_expression=complex_filter,
num_results=5
)
print(f'found {len(response)} entry')
print(response[0]["response"])
Output:
found 1 entry
Your most recent transaction was for $350
Clean up:
complex_cache.delete()
9. Summary
RedisVL's `SemanticCache` offers an efficient way to cache LLM responses: vector search provides semantic matching, which significantly cuts request latency and cost. With support for dynamic threshold tuning, TTL policies, tag filtering, and complex queries, it suits multi-user and complex-workflow scenarios. Whether for simple Q&A caching or applications that need access control, RedisVL provides a flexible, high-performance solution. For more details, see the official RedisVL documentation.
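Putting the pieces together, the typical cache-aside pattern around any LLM client looks like the sketch below. `DictCache` and the lambda are stand-ins for illustration; in practice you would pass the `llmcache` instance and `ask_openai` function from the sections above:

```python
def cached_answer(cache, llm_call, question: str) -> tuple[str, bool]:
    """Return (answer, was_cache_hit) using the cache-aside pattern."""
    hits = cache.check(prompt=question)
    if hits:
        return hits[0]["response"], True   # cache hit: skip the LLM entirely
    answer = llm_call(question)            # cache miss: call the LLM ...
    cache.store(prompt=question, response=answer)  # ... and remember the answer
    return answer, False

# Stub cache with the same check/store shape, for illustration only
class DictCache:
    def __init__(self):
        self._d = {}
    def check(self, prompt):
        return [{"response": self._d[prompt]}] if prompt in self._d else []
    def store(self, prompt, response):
        self._d[prompt] = response

cache = DictCache()
llm = lambda q: "George Washington"  # stand-in for ask_openai
print(cached_answer(cache, llm, "First US President?"))  # ('George Washington', False)
print(cached_answer(cache, llm, "First US President?"))  # ('George Washington', True)
```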