Semantic caching

Semantic caching with RedisVL

Note:
This document is a converted form of this Jupyter notebook.

Before beginning, make sure of the following:

  1. You have installed RedisVL and have that environment activated.
  2. You have a running Redis instance with the Redis Query Engine features enabled.

Semantic caching for LLMs

RedisVL provides a SemanticCache interface that uses Redis's built-in caching capabilities and vector search to store responses to previously answered questions. This reduces the number of requests and tokens sent to LLM services, lowering costs and improving application throughput by cutting the time it takes to generate responses.

This document will teach you how to use Redis as a semantic cache for your applications.

Start by importing OpenAI so you can use their API to respond to user prompts. You will also create a simple ask_openai helper method to assist.

import os
import getpass
import time

from openai import OpenAI

import numpy as np

os.environ["TOKENIZERS_PARALLELISM"] = "False"

api_key = os.getenv("OPENAI_API_KEY") or getpass.getpass("Enter your OpenAI API key: ")

client = OpenAI(api_key=api_key)

def ask_openai(question: str) -> str:
    response = client.completions.create(
      model="gpt-3.5-turbo-instruct",
      prompt=question,
      max_tokens=200
    )
    return response.choices[0].text.strip()
# Test
print(ask_openai("What is the capital of France?"))

    The capital of France is Paris.

Initialize SemanticCache

On initialization, SemanticCache automatically creates an index in Redis for the semantic cache content.

from redisvl.extensions.llmcache import SemanticCache

llmcache = SemanticCache(
    name="llmcache",                     # underlying search index name
    prefix="llmcache",                   # redis key prefix for hash entries
    redis_url="redis://localhost:6379",  # redis connection url string
    distance_threshold=0.1               # semantic cache distance threshold
)
# look at the index specification created for the semantic cache lookup
$ rvl index info -i llmcache

    Index Information:
    ╭────────────┬──────────────┬──────────────┬───────────────┬──────────╮
    │ Index Name │ Storage Type │ Prefixes     │ Index Options │ Indexing │
    ├────────────┼──────────────┼──────────────┼───────────────┼──────────┤
    │ llmcache   │ HASH         │ ['llmcache'] │ []            │        0 │
    ╰────────────┴──────────────┴──────────────┴───────────────┴──────────╯
    Index Fields:
    ╭───────────────┬───────────────┬────────┬──────────────┬──────────────╮
    │ Name          │ Attribute     │ Type   │ Field Option │ Option Value │
    ├───────────────┼───────────────┼────────┼──────────────┼──────────────┤
    │ prompt        │ prompt        │ TEXT   │ WEIGHT       │            1 │
    │ response      │ response      │ TEXT   │ WEIGHT       │            1 │
    │ prompt_vector │ prompt_vector │ VECTOR │              │              │
    ╰───────────────┴───────────────┴────────┴──────────────┴──────────────╯

Basic cache usage

question = "What is the capital of France?"
# Check the semantic cache -- should be empty
if response := llmcache.check(prompt=question):
    print(response)
else:
    print("Empty cache")

    Empty cache

Your initial cache check should be empty because you have not yet stored anything in the cache. Below, store the question, the proper response, and any arbitrary metadata (as a Python dictionary object) in the cache.

# Cache the question, answer, and arbitrary metadata
llmcache.store(
    prompt=question,
    response="Paris",
    metadata={"city": "Paris", "country": "france"}
)
# Check the cache again
if response := llmcache.check(prompt=question, return_fields=["prompt", "response", "metadata"]):
    print(response)
else:
    print("Empty cache")

    [{'id': 'llmcache:115049a298532be2f181edb03f766770c0db84c22aff39003fec340deaec7545', 'vector_distance': '9.53674316406e-07', 'prompt': 'What is the capital of France?', 'response': 'Paris', 'metadata': {'city': 'Paris', 'country': 'france'}}]
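
The metadata you stored comes back on each hit as a regular Python dictionary, so you can use it directly in your application logic. A minimal sketch, reusing the question and llmcache objects from above:

# Pull only the fields you need from the hit and read the stored metadata
hits = llmcache.check(prompt=question, return_fields=["response", "metadata"])
if hits:
    meta = hits[0]["metadata"]
    print(f"{hits[0]['response']} ({meta['city']}, {meta['country']})")
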
# Check for a semantically similar result
question = "What actually is the capital of France?"
llmcache.check(prompt=question)[0]['response']

    'Paris'

Customize the distance threshold

For most use cases, the right semantic similarity threshold is not a fixed quantity. Depending on the choice of embedding model, the properties of the input query, and the business use case, the threshold may need to change.

Fortunately, you can seamlessly adjust the threshold at any time, as shown below:

# Widen the semantic distance threshold
llmcache.set_threshold(0.3)
# Really try to trick it by asking around the point,
# but the query still slips just under the new threshold
question = "What is the capital city of the country in Europe that also has a city named Nice?"
llmcache.check(prompt=question)[0]['response']

    'Paris'
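
When tuning the threshold, it helps to compare the vector_distance returned with each hit (visible in the earlier cache output) against the value you set. A quick sketch, assuming hits include that field as shown above:

# Inspect how close the paraphrased prompt landed to the cached entry
hits = llmcache.check(prompt=question, return_fields=["response"])
if hits:
    print(f"distance={hits[0]['vector_distance']}, response={hits[0]['response']}")

Distances below the configured threshold (0.3 after the call above) are treated as cache hits.
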
# Invalidate the cache completely by clearing it out
llmcache.clear()

# should be empty now
llmcache.check(prompt=question)

    []

Use TTL

Redis uses an optional time-to-live (TTL) policy to expire individual keys at some point in the future. This lets you focus on your data flow and business logic without worrying about complex cleanup tasks.

A TTL policy set on the SemanticCache lets you retain cache entries for only a limited time. Set the TTL policy to 5 seconds.

llmcache.set_ttl(5) # 5 seconds
llmcache.store("This is a TTL test", "This is a TTL test response")

time.sleep(5)
# confirm that the cache has cleared by now on its own
result = llmcache.check("This is a TTL test")

print(result)

    []
# Reset the TTL to null (long lived data)
llmcache.set_ttl()

Simple performance testing

Next, you will measure the speedup you get from using SemanticCache. You will use the time module to measure how long it takes to generate responses with and without SemanticCache.

def answer_question(question: str) -> str:
    """Helper function to answer a simple question using OpenAI, checking the
    semantic cache for the answer first.

    Args:
        question (str): User input question.

    Returns:
        str: Response.
    """
    results = llmcache.check(prompt=question)
    if results:
        return results[0]["response"]
    else:
        answer = ask_openai(question)
        return answer
start = time.time()
# asking a question -- openai response time
question = "What was the name of the first US President?"
answer = answer_question(question)
end = time.time()

print(f"Without caching, a call to openAI to answer this simple question took {end-start} seconds.")

    Without caching, a call to openAI to answer this simple question took 0.5017588138580322 seconds.
llmcache.store(prompt=question, response="George Washington")
# Calculate the avg latency for caching over LLM usage
times = []

for _ in range(10):
    cached_start = time.time()
    cached_answer = answer_question(question)
    cached_end = time.time()
    times.append(cached_end-cached_start)

avg_time_with_cache = np.mean(times)
print(f"Avg time taken with LLM cache enabled: {avg_time_with_cache}")
print(f"Percentage of time saved: {round(((end - start) - avg_time_with_cache) / (end - start) * 100, 2)}%")

    Avg time taken with LLM cache enabled: 0.2560166358947754
    Percentage of time saved: 82.47%
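
Note that the answer_question helper above only reads from the cache; it never stores a fresh LLM answer back. In a real application you would typically write the response back on a miss so that later paraphrases of the same question are also served from the cache. A sketch of such a write-back variant (answer_question_with_store is a hypothetical name, not part of RedisVL):

def answer_question_with_store(question: str) -> str:
    """Like answer_question, but writes LLM answers back to the semantic cache."""
    hits = llmcache.check(prompt=question)
    if hits:
        return hits[0]["response"]  # served from the cache
    answer = ask_openai(question)   # cache miss: call the LLM
    llmcache.store(prompt=question, response=answer)  # warm the cache for next time
    return answer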


# check the stats of the index
$ rvl stats -i llmcache

    Statistics:
    ╭─────────────────────────────┬─────────────╮
    │ Stat Key                    │ Value       │
    ├─────────────────────────────┼─────────────┤
    │ num_docs                    │ 1           │
    │ num_terms                   │ 19          │
    │ max_doc_id                  │ 3           │
    │ num_records                 │ 23          │
    │ percent_indexed             │ 1           │
    │ hash_indexing_failures      │ 0           │
    │ number_of_uses              │ 19          │
    │ bytes_per_record_avg        │ 5.30435     │
    │ doc_table_size_mb           │ 0.000134468 │
    │ inverted_sz_mb              │ 0.000116348 │
    │ key_table_size_mb           │ 2.76566e-05 │
    │ offset_bits_per_record_avg  │ 8           │
    │ offset_vectors_sz_mb        │ 2.09808e-05 │
    │ offsets_per_term_avg        │ 0.956522    │
    │ records_per_doc_avg         │ 23          │
    │ sortable_values_size_mb     │ 0           │
    │ total_indexing_time         │ 1.211       │
    │ total_inverted_index_blocks │ 19          │
    │ vector_index_sz_mb          │ 3.0161      │
    ╰─────────────────────────────┴─────────────╯
# Clear the cache AND delete the underlying index
llmcache.delete()