
Prompt Caching

For OpenAI, Anthropic and Deepseek, LiteLLM follows the OpenAI prompt caching usage object format:

"usage": {
"prompt_tokens": 2006,
"completion_tokens": 300,
"total_tokens": 2306,
"prompt_tokens_details": {
"cached_tokens": 1920
},
"completion_tokens_details": {
"reasoning_tokens": 0
}
# 仅限ANTHROPIC #
"cache_creation_input_tokens": 0
}
  • prompt_tokens: the non-cached prompt tokens (same as Anthropic; equivalent to Deepseek's prompt_cache_miss_tokens).
  • completion_tokens: the output tokens generated by the model.
  • total_tokens: sum of prompt_tokens + completion_tokens.
  • prompt_tokens_details: object containing cached_tokens.
    • cached_tokens: tokens that were a cache hit for this call.
  • completion_tokens_details: object containing reasoning_tokens.
  • ANTHROPIC ONLY: cache_creation_input_tokens is the number of tokens written to the cache. (Anthropic charges for this.) A short snippet for reading these fields follows this list.
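
For example, reading these fields off a LiteLLM response object. This is a minimal sketch that assumes response comes from one of the completion() calls shown below; cache_creation_input_tokens only appears on Anthropic responses, so it is read with a getattr fallback:

usage = response.usage

print("non-cached prompt tokens:", usage.prompt_tokens)
print("cached prompt tokens:", usage.prompt_tokens_details.cached_tokens)
print("output tokens:", usage.completion_tokens)

# ANTHROPIC ONLY - tokens written to the cache on this call (Anthropic bills these)
print("cache write tokens:", getattr(usage, "cache_creation_input_tokens", 0))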

Quick Start

Note: OpenAI caching only works for prompts containing 1024 tokens or more
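
If you are unsure whether a prompt clears that threshold, you can count tokens locally before relying on a cache hit. A minimal sketch, assuming litellm.token_counter is available in your version; the 1024-token comparison simply illustrates the note above:

from litellm import token_counter

messages = [
    {"role": "user", "content": "Here is the full text of a complex legal agreement" * 400},
]

num_tokens = token_counter(model="gpt-4o", messages=messages)
if num_tokens >= 1024:
    print(f"{num_tokens} prompt tokens - long enough for OpenAI prompt caching")
else:
    print(f"{num_tokens} prompt tokens - below the 1024-token caching minimum")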

from litellm import completion
import os

os.environ["OPENAI_API_KEY"] = ""

for _ in range(2):
    response = completion(
        model="gpt-4o",
        messages=[
            # System Message
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "Here is the full text of a complex legal agreement"
                        * 400,
                    }
                ],
            },
            # Marked for caching with the cache_control parameter, so that this checkpoint can read from the previous cache.
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
            {
                "role": "assistant",
                "content": "Certainly! the key terms and conditions are the following: the contract is 1 year long for $10/mo",
            },
            # The final turn is marked with cache-control, for continuing in followups.
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
        ],
        temperature=0.2,
        max_tokens=10,
    )

print("response=", response)
print("response.usage=", response.usage)

assert "prompt_tokens_details" in response.usage
assert response.usage.prompt_tokens_details.cached_tokens > 0
  1. Set up config.yaml
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: openai/gpt-4o
      api_key: os.environ/OPENAI_API_KEY
  2. Start the proxy
litellm --config /path/to/config.yaml
  3. Test it!
from openai import OpenAI
import os

client = OpenAI(
    api_key="LITELLM_PROXY_KEY", # sk-1234
    base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)

for _ in range(2):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            # System Message
            {
                "role": "system",
                "content": [
                    {
                        "type": "text",
                        "text": "Here is the full text of a complex legal agreement"
                        * 400,
                    }
                ],
            },
            # Marked for caching with the cache_control parameter, so that this checkpoint can read from the previous cache.
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
            {
                "role": "assistant",
                "content": "Certainly! the key terms and conditions are the following: the contract is 1 year long for $10/mo",
            },
            # The final turn is marked with cache-control, for continuing in followups.
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "What are the key terms and conditions in this agreement?",
                    }
                ],
            },
        ],
        temperature=0.2,
        max_tokens=10,
    )

print("response=", response)
print("response.usage=", response.usage)

assert "prompt_tokens_details" in response.usage
assert response.usage.prompt_tokens_details.cached_tokens > 0

Anthropic Example

Anthropic charges for cache writes.

Specify the content you want to cache with "cache_control": {"type": "ephemeral"}.

If you pass this in for any other LLM provider, it will be ignored.

from litellm import completion
import litellm
import os

litellm.set_verbose = True # 👈 SEE RAW REQUEST
os.environ["ANTHROPIC_API_KEY"] = ""

response = completion(
    model="anthropic/claude-3-5-sonnet-20240620",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "What are the key terms and conditions in this agreement?",
        },
    ]
)

print(response.usage)
  1. Set up config.yaml
model_list:
  - model_name: claude-3-5-sonnet-20240620
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
  2. Start the proxy
litellm --config /path/to/config.yaml
  3. Test it!
from openai import OpenAI
import os

client = OpenAI(
    api_key="LITELLM_PROXY_KEY", # sk-1234
    base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)

response = client.chat.completions.create(
    model="claude-3-5-sonnet-20240620",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "What are the key terms and conditions in this agreement?",
        },
    ]
)

print(response.usage)

Deepseek Example

Works the same as OpenAI.

from litellm import completion
import litellm
import os

os.environ["DEEPSEEK_API_KEY"] = ""

litellm.set_verbose = True # 👈 SEE RAW REQUEST

model_name = "deepseek/deepseek-chat"
messages_1 = [
    {
        "role": "system",
        "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`",
    },
    {
        "role": "user",
        "content": "In what year did Qin Shi Huang unify the six states?",
    },
    {"role": "assistant", "content": "Answer: 221 BC"},
    {"role": "user", "content": "Who was the founder of the Han Dynasty?"},
    {"role": "assistant", "content": "Answer: Liu Bang"},
    {"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
    {"role": "assistant", "content": "Answer: Li Zhu"},
    {
        "role": "user",
        "content": "Who was the founding emperor of the Ming Dynasty?",
    },
    {"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
    {
        "role": "user",
        "content": "Who was the founding emperor of the Qing Dynasty?",
    },
]

message_2 = [
    {
        "role": "system",
        "content": "You are a history expert. The user will provide a series of questions, and your answers should be concise and start with `Answer:`",
    },
    {
        "role": "user",
        "content": "In what year did Qin Shi Huang unify the six states?",
    },
    {"role": "assistant", "content": "Answer: 221 BC"},
    {"role": "user", "content": "Who was the founder of the Han Dynasty?"},
    {"role": "assistant", "content": "Answer: Liu Bang"},
    {"role": "user", "content": "Who was the last emperor of the Tang Dynasty?"},
    {"role": "assistant", "content": "Answer: Li Zhu"},
    {
        "role": "user",
        "content": "Who was the founding emperor of the Ming Dynasty?",
    },
    {"role": "assistant", "content": "Answer: Zhu Yuanzhang"},
    {"role": "user", "content": "When did the Shang Dynasty fall?"},
]

response_1 = litellm.completion(model=model_name, messages=messages_1)
response_2 = litellm.completion(model=model_name, messages=message_2)

# Add any assertions here to check the responses
print(response_2.usage)

Calculate Cost

The cost of cache-hit prompt tokens can differ from the cost of cache-miss prompt tokens. Use the completion_cost() function to calculate cost (it handles prompt caching cost calculation as well). See more helper functions

cost = completion_cost(completion_response=response, model=model)

Usage

from litellm import completion, completion_cost
import litellm
import os

litellm.set_verbose = True # 👈 SEE RAW REQUEST
os.environ["ANTHROPIC_API_KEY"] = ""
model = "anthropic/claude-3-5-sonnet-20240620"
response = completion(
    model=model,
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "You are an AI assistant tasked with analyzing legal documents.",
                },
                {
                    "type": "text",
                    "text": "Here is the full text of a complex legal agreement" * 400,
                    "cache_control": {"type": "ephemeral"},
                },
            ],
        },
        {
            "role": "user",
            "content": "What are the key terms and conditions in this agreement?",
        },
    ]
)

print(response.usage)

cost = completion_cost(completion_response=response, model=model)

formatted_string = f"${float(cost):.10f}"
print(formatted_string)

LiteLLM returns the calculated cost in the response headers - x-litellm-response-cost

from openai import OpenAI

client = OpenAI(
    api_key="LITELLM_PROXY_KEY", # sk-1234..
    base_url="LITELLM_PROXY_BASE" # http://0.0.0.0:4000
)
response = client.chat.completions.with_raw_response.create(
    messages=[{
        "role": "user",
        "content": "this is a test",
    }],
    model="gpt-3.5-turbo",
)
print(response.headers.get('x-litellm-response-cost'))

completion = response.parse() # get the object that `chat.completions.create()` would have returned
print(completion)

Check Model Support

Check if a model supports prompt caching with supports_prompt_caching()

from litellm.utils import supports_prompt_caching

supports_pc: bool = supports_prompt_caching(model="anthropic/claude-3-5-sonnet-20240620")

assert supports_pc
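
One way to use this check is to only attach a cache_control block when the configured model supports prompt caching. This is a sketch, not an official pattern; the helper name mark_cacheable is hypothetical:

from litellm.utils import supports_prompt_caching

def mark_cacheable(model: str, content_block: dict) -> dict:
    # Hypothetical helper: add "cache_control" only for models that support
    # prompt caching; other providers would ignore the field anyway.
    if supports_prompt_caching(model=model):
        return {**content_block, "cache_control": {"type": "ephemeral"}}
    return content_block

block = mark_cacheable(
    "anthropic/claude-3-5-sonnet-20240620",
    {"type": "text", "text": "Here is the full text of a complex legal agreement" * 400},
)
print(block.keys())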

Check if a model on the proxy supports prompt caching with the /model/info endpoint

  1. Set up config.yaml
model_list:
  - model_name: claude-3-5-sonnet-20240620
    litellm_params:
      model: anthropic/claude-3-5-sonnet-20240620
      api_key: os.environ/ANTHROPIC_API_KEY
  2. Start the proxy
litellm --config /path/to/config.yaml
  3. Test it!
curl -L -X GET 'http://0.0.0.0:4000/v1/model/info' \
-H 'Authorization: Bearer sk-1234'

Expected Response

{
  "data": [
    {
      "model_name": "claude-3-5-sonnet-20240620",
      "litellm_params": {
        "model": "anthropic/claude-3-5-sonnet-20240620"
      },
      "model_info": {
        "key": "claude-3-5-sonnet-20240620",
        ...
        "supports_prompt_caching": true # 👈 LOOK FOR THIS!
      }
    }
  ]
}

This checks against our maintained model info / cost map.
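
To read the same map locally instead of going through the proxy, something like the following should work (a sketch, assuming litellm.utils.get_model_info is available in your version and the model is present in the map):

from litellm.utils import get_model_info

info = get_model_info(model="anthropic/claude-3-5-sonnet-20240620")
print(info.get("supports_prompt_caching"))  # expect: True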
