Semantic caching

Semantic caching with RedisVL

Note:
This document is a converted form of this Jupyter notebook.

Before beginning, make sure of the following:

  1. You have installed RedisVL and have that environment activated.
  2. You have a running Redis instance with the Redis Query Engine features enabled.

Semantic caching for LLMs

RedisVL provides a SemanticCache interface that uses Redis's built-in caching capabilities and vector search to store responses to previously answered questions. This reduces the number of requests and tokens sent to LLM services, lowering costs and improving application throughput by cutting the time it takes to generate responses.

This document will teach you how to use Redis as a semantic cache for your applications.

Start by importing OpenAI so you can use their API to respond to user prompts. You will also create a simple ask_openai helper method to assist.

import os
import getpass
import time

from openai import OpenAI

import numpy as np

os.environ["TOKENIZERS_PARALLELISM"] = "False"

api_key = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")

client = OpenAI(api_key=api_key)

def ask_openai(question: str) -> str:
    response = client.completions.create(
      model="gpt-3.5-turbo-instruct",
      prompt=question,
      max_tokens=200
    )
    return response.choices[0].text.strip()
# Test
print(ask_openai("What is the capital of France?"))

    The capital of France is Paris.

Initialize SemanticCache

On initialization, SemanticCache automatically creates an index in Redis for the semantic cache content.

from redisvl.extensions.llmcache import SemanticCache

llmcache = SemanticCache(
    name="llmcache",                     # underlying search index name
    prefix="llmcache",                   # redis key prefix for hash entries
    redis_url="redis://localhost:6379",  # redis connection url string
    distance_threshold=0.1               # semantic cache distance threshold
)
# look at the index specification created for the semantic cache lookup
$ rvl index info -i llmcache

    Index Information:
    ╭────────────┬──────────────┬──────────────┬───────────────┬──────────╮
    │ Index Name │ Storage Type │ Prefixes     │ Index Options │ Indexing │
    ├────────────┼──────────────┼──────────────┼───────────────┼──────────┤
    │ llmcache   │ HASH         │ ['llmcache'] │ []            │        0 │
    ╰────────────┴──────────────┴──────────────┴───────────────┴──────────╯
    Index Fields:
    ╭───────────────┬───────────────┬────────┬──────────────┬──────────────╮
    │ Name          │ Attribute     │ Type   │ Field Option │ Option Value │
    ├───────────────┼───────────────┼────────┼──────────────┼──────────────┤
    │ prompt        │ prompt        │ TEXT   │ WEIGHT       │            1 │
    │ response      │ response      │ TEXT   │ WEIGHT       │            1 │
    │ prompt_vector │ prompt_vector │ VECTOR │              │              │
    ╰───────────────┴───────────────┴────────┴──────────────┴──────────────╯

Basic cache usage

question = "What is the capital of France?"
# Check the semantic cache -- should be empty
if response := llmcache.check(prompt=question):
    print(response)
else:
    print("Empty cache")

    Empty cache

Your initial cache check should be empty because you have not yet stored anything in the cache. Below, store the question, the proper response, and any arbitrary metadata (as a Python dictionary object) in the cache.

# Cache the question, answer, and arbitrary metadata
llmcache.store(
    prompt=question,
    response="Paris",
    metadata={"city": "Paris", "country": "france"}
)
# Check the cache again
if response := llmcache.check(prompt=question, return_fields=["prompt", "response", "metadata"]):
    print(response)
else:
    print("Empty cache")

    [{'id': 'llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545', 'vector_distance': '9.53674316406e-07', 'prompt': 'What is the capital of France?', 'response': 'Paris', 'metadata': {'city': 'Paris', 'country': 'france'}}]
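
The metadata you stored comes back on each hit as a regular Python dictionary, so you can use it directly in your application logic. A minimal sketch, reusing the question and llmcache objects from above:

# Pull only the fields you need from the hit and read the stored metadata
hits = llmcache.check(prompt=question, return_fields=["response", "metadata"])
if hits:
    meta = hits[0]["metadata"]
    print(f"{hits[0]['response']} ({meta['city']}, {meta['country']})")
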
# Check for a semantically similar result
question = "What actually is the capital of France?"
llmcache.check(prompt=question)[0]['response']

    'Paris'

Customize the distance threshold

For most use cases, the right semantic similarity threshold is not a fixed quantity. Depending on the choice of embedding model, the properties of the input query, and the business use case, the threshold may need to change.

Fortunately, you can seamlessly adjust the threshold at any time, as shown below:

# Widen the semantic distance threshold
llmcache.set_threshold(0.3)
# Really try to trick it by asking around the point,
# but the query still slips just under the new threshold
question = "What is the capital city of the country in Europe that also has a city named Nice?"
llmcache.check(prompt=question)[0]['response']

    'Paris'
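
When tuning the threshold, it helps to compare the vector_distance returned with each hit (visible in the earlier cache output) against the value you set. A quick sketch, assuming hits include that field as shown above:

# Inspect how close the paraphrased prompt landed to the cached entry
hits = llmcache.check(prompt=question, return_fields=["response"])
if hits:
    print(f"distance={hits[0]['vector_distance']}, response={hits[0]['response']}")

Distances below the configured threshold (0.3 after the call above) are treated as cache hits.
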
# Invalidate the cache completely by clearing it out
llmcache.clear()

# should be empty now
llmcache.check(prompt=question)

    []

Use TTL

Redis uses an optional time-to-live (TTL) policy to expire individual keys at some point in the future. This lets you focus on your data flow and business logic without worrying about complex cleanup tasks.

A TTL policy set on the SemanticCache lets you retain cache entries for only a limited time. Set the TTL policy to 5 seconds.

llmcache.set_ttl(5) # 5 seconds
llmcache.store("This is a TTL test", "This is a TTL test response")

time.sleep(5)
# confirm that the cache has cleared by now on its own
result = llmcache.check("This is a TTL test")

print(result)

    []
# Reset the TTL to null (long lived data)
llmcache.set_ttl()

Simple performance testing

Next, you will measure the speedup you get from using SemanticCache. You will use the time module to measure how long it takes to generate responses with and without SemanticCache.

def answer_question(question: str) -> str:
    """Helper function to answer a simple question using OpenAI, checking the
    semantic cache for the answer first.

    Args:
        question (str): User input question.

    Returns:
        str: Response.
    """
    results = llmcache.check(prompt=question)
    if results:
        return results[0]["response"]
    else:
        answer = ask_openai(question)
        return answer
start = time.time()
# asking a question -- openai response time
question = "What was the name of the first US President?"
answer = answer_question(question)
end = time.time()

print(f"Without caching, a call to openAI to answer this simple question took {end-start} seconds.")

    Without caching, a call to openAI to answer this simple question took 0.5017588138580322 seconds.
llmcache.store(prompt=question, response="George Washington")
# Calculate the avg latency for caching over LLM usage
times = []

for _ in range(10):
    cached_start = time.time()
    cached_answer = answer_question(question)
    cached_end = time.time()
    times.append(cached_end-cached_start)

avg_time_with_cache = np.mean(times)
print(f"Avg time taken with LLM cache enabled: {avg_time_with_cache}")
print(f"Percentage of time saved: {round(((end - start) - avg_time_with_cache) / (end - start) * 100, 2)}%")

    Avg time taken with LLM cache enabled: 0.2560166358947754
    Percentage of time saved: 82.47%
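
Note that the answer_question helper above only reads from the cache; it never stores a fresh LLM answer back. In a real application you would typically write the response back on a miss so that later paraphrases of the same question are also served from the cache. A sketch of such a write-back variant (answer_question_with_store is a hypothetical name, not part of RedisVL):

def answer_question_with_store(question: str) -> str:
    """Like answer_question, but writes LLM answers back to the semantic cache."""
    hits = llmcache.check(prompt=question)
    if hits:
        return hits[0]["response"]  # served from the cache
    answer = ask_openai(question)   # cache miss: call the LLM
    llmcache.store(prompt=question, response=answer)  # warm the cache for next time
    return answer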


# check the stats of the index
$ rvl stats -i llmcache

    Statistics:
    ╭─────────────────────────────┬─────────────╮
    │ Stat Key                    │ Value       │
    ├─────────────────────────────┼─────────────┤
    │ num_docs                    │ 1           │
    │ num_terms                   │ 19          │
    │ max_doc_id                  │ 3           │
    │ num_records                 │ 23          │
    │ percent_indexed             │ 1           │
    │ hash_indexing_failures      │ 0           │
    │ number_of_uses              │ 19          │
    │ bytes_per_record_avg        │ 5.30435     │
    │ doc_table_size_mb           │ 0.000134468 │
    │ inverted_sz_mb              │ 0.000116348 │
    │ key_table_size_mb           │ 2.76566e-05 │
    │ offset_bits_per_record_avg  │ 8           │
    │ offset_vectors_sz_mb        │ 2.09808e-05 │
    │ offsets_per_term_avg        │ 0.956522    │
    │ records_per_doc_avg         │ 23          │
    │ sortable_values_size_mb     │ 0           │
    │ total_indexing_time         │ 1.211       │
    │ total_inverted_index_blocks │ 19          │
    │ vector_index_sz_mb          │ 3.0161      │
    ╰─────────────────────────────┴─────────────╯
# Clear the cache AND delete the underlying index
llmcache.delete()