Offline Inference Pipeline

In this tutorial, we walk through a series of examples that show how to use lmdeploy.pipeline.

You can find the detailed pipeline API in this guide.

Usage

An example using default parameters:

from lmdeploy import pipeline

pipe = pipeline('internlm/internlm2_5-7b-chat')
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)

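In this example, the pipeline by default reserves a predetermined percentage of GPU memory for the k/v cache. The ratio is set by the parameter TurbomindEngineConfig.cache_max_entry_count.

The strategy for setting the k/v cache ratio has changed as LMDeploy evolved. The change history is as follows:

v0.2.0 <= lmdeploy <= v0.2.1

TurbomindEngineConfig.cache_max_entry_count defaults to 0.5, meaning that 50% of the total GPU memory is allocated to the k/v cache. Deploying a 7B model on a GPU with less than 40G of memory may then trigger an out-of-memory (OOM) error. If you hit an OOM error, reduce the share of memory taken by the k/v cache as follows: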

from lmdeploy import pipeline, TurbomindEngineConfig

# decrease the ratio of the k/v cache occupation to 20%
backend_config = TurbomindEngineConfig(cache_max_entry_count=0.2)

pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)

lmdeploy > v0.2.1

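The k/v cache allocation strategy changed to reserving a proportion of the GPU's free memory. The default value of TurbomindEngineConfig.cache_max_entry_count was adjusted to 0.8. If an OOM error occurs, lower the ratio as in the method above to reduce the memory used by the k/v cache.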
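The difference between the two policies is easiest to see with a little arithmetic. The following is a simplified sketch, not LMDeploy's actual allocation code: the function name `kv_cache_budget_gib`, the policy labels, and the GPU figures (a 24 GiB card with roughly 14 GiB taken by model weights) are assumptions chosen for illustration.

```python
def kv_cache_budget_gib(total_gib, free_gib, ratio, policy):
    """Estimate the k/v cache reservation under the two policies.

    policy="total": ratio of total GPU memory (v0.2.0 <= lmdeploy <= v0.2.1)
    policy="free":  ratio of free GPU memory  (lmdeploy > v0.2.1)
    """
    if not 0 < ratio < 1:
        raise ValueError("ratio is expected to lie in (0, 1)")
    if policy == "total":
        return total_gib * ratio
    if policy == "free":
        return free_gib * ratio
    raise ValueError(f"unknown policy: {policy!r}")


# Illustrative numbers only: 24 GiB card, about 10 GiB left after loading weights.
total, free = 24.0, 10.0

old_budget = kv_cache_budget_gib(total, free, 0.5, "total")
new_budget = kv_cache_budget_gib(total, free, 0.8, "free")

print(f"old policy requests {old_budget:.1f} GiB, fits: {old_budget <= free}")
print(f"new policy requests {new_budget:.1f} GiB, fits: {new_budget <= free}")
```

Under these assumed numbers the older policy asks for more memory than is left after the weights are loaded, which is the OOM scenario described above, while the newer policy always stays within the free memory.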

An example showing how to set the tensor parallel size:

from lmdeploy import pipeline, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'])
print(response)

An example for setting sampling parameters:

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=backend_config)
response = pipe(['Hi, pls intro yourself', 'Shanghai is'],
                gen_config=gen_config)
print(response)
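To give a feel for what top_k, top_p, and temperature do, here is a generic sketch of how these three parameters are commonly combined when filtering a next-token distribution. It is written in plain Python for illustration and is not LMDeploy's sampling kernel; the function name `filter_distribution` and the order of operations (temperature, then top-k, then top-p) are assumptions.

```python
import math


def filter_distribution(logits, temperature=0.8, top_k=40, top_p=0.8):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    # Temperature: divide logits before softmax; lower values sharpen the distribution.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]

    # Top-k: keep the k highest-weight tokens, then normalise among them.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:top_k]
    k_mass = sum(weights[i] for i in ranked)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for i in ranked:
        kept.append(i)
        cumulative += weights[i] / k_mass
        if cumulative >= top_p:
            break

    # Renormalise so the surviving probabilities sum to 1.
    kept_mass = sum(weights[i] for i in kept)
    return {i: weights[i] / kept_mass for i in kept}


candidates = filter_distribution([2.0, 1.0, 0.0, -1.0],
                                 temperature=1.0, top_k=3, top_p=0.8)
print(candidates)
```

A token is then sampled from the surviving candidates. Lower top_k or top_p values shrink the candidate set, and a lower temperature concentrates probability on the most likely tokens.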

An example for OpenAI-format prompt input:

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
response = pipe(prompts,
                gen_config=gen_config)
print(response)
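When the prompts start out as plain strings, the nested message structure used above can be built with a small helper. This is a convenience sketch; `to_openai_prompts` is a hypothetical name and not part of lmdeploy.

```python
def to_openai_prompts(texts, role='user'):
    """Wrap each plain string in a single-message, OpenAI-style conversation."""
    return [[{'role': role, 'content': text}] for text in texts]


prompts = to_openai_prompts(['Hi, pls intro yourself', 'Shanghai is'])
print(prompts)
```

The result has one inner list per conversation, so it has the same shape as the `prompts` value passed to the pipeline in the example above.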

An example for streaming mode:

from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(tp=2)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
for item in pipe.stream_infer(prompts, gen_config=gen_config):
    print(item)
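With several prompts in one batch, streamed items for different prompts may arrive interleaved. The sketch below shows one way to reassemble the text per prompt. It assumes each streamed item exposes an `index` attribute (the prompt's position in the batch) and a `text` attribute holding the newly generated fragment; check both against your installed lmdeploy version. `StreamItem` is a hypothetical stand-in so the example runs without a model.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StreamItem:
    """Hypothetical stand-in for one streamed item."""
    index: int  # position of the prompt in the input batch (assumed attribute)
    text: str   # newly generated text fragment (assumed attribute)


def collect_stream(items):
    """Join streamed text fragments per prompt index, preserving arrival order."""
    pieces = defaultdict(list)
    for item in items:
        pieces[item.index].append(item.text)
    return {index: ''.join(parts) for index, parts in sorted(pieces.items())}


# Simulated interleaved stream for two prompts.
fake_stream = [
    StreamItem(0, 'Hello'),
    StreamItem(1, ' a'),
    StreamItem(0, ', I am'),
    StreamItem(1, ' city'),
]
print(collect_stream(fake_stream))
```

In real use, the iterable passed to `collect_stream` would be the generator returned by `pipe.stream_infer(...)`, provided the attribute assumptions hold.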

An example for computing logits and ppl:

from transformers import AutoTokenizer

from lmdeploy import pipeline

model_repoid_or_path = 'internlm/internlm2_5-7b-chat'
pipe = pipeline(model_repoid_or_path)
tokenizer = AutoTokenizer.from_pretrained(model_repoid_or_path, trust_remote_code=True)

# logits
messages = [
   {"role": "user", "content": "Hello, how are you?"},
]
input_ids = tokenizer.apply_chat_template(messages)
logits = pipe.get_logits(input_ids)

# ppl
ppl = pipe.get_ppl(input_ids)

Note

get_ppl returns the cross-entropy loss without applying the exponential operation afterwards.
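If a conventional perplexity value is needed, the exponential can be applied by hand. The sketch below assumes the returned loss is a mean per-token cross-entropy measured in nats (natural logarithm); the helper name `perplexity_from_loss` and the sample loss value are hypothetical.

```python
import math


def perplexity_from_loss(cross_entropy_loss):
    """Convert a mean per-token cross-entropy (in nats) into perplexity."""
    return math.exp(cross_entropy_loss)


# Illustrative value only: a loss of ln(4) corresponds to a perplexity of 4.
example_loss = math.log(4.0)
print(perplexity_from_loss(example_loss))
```

A loss of 0 maps to a perplexity of 1, which is the lower bound; larger losses grow the perplexity exponentially.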

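The following is an example using the PyTorch backend. Please install Triton first. The version specifier is quoted so that the shell does not treat `>` as an output redirection.

pip install "triton>=2.1.0"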

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

backend_config = PytorchEngineConfig(session_len=2048)
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('internlm/internlm2_5-7b-chat',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': 'Hi, pls intro yourself'
}], [{
    'role': 'user',
    'content': 'Shanghai is'
}]]
response = pipe(prompts, gen_config=gen_config)
print(response)

An example for LoRA:

from lmdeploy import pipeline, GenerationConfig, PytorchEngineConfig

backend_config = PytorchEngineConfig(session_len=2048,
                                     adapters=dict(lora_name_1='chenchi/lora-chatglm2-6b-guodegang'))
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
pipe = pipeline('THUDM/chatglm2-6b',
                backend_config=backend_config)
prompts = [[{
    'role': 'user',
    'content': '您猜怎么着'
}]]
response = pipe(prompts, gen_config=gen_config, adapter_name='lora_name_1')
print(response)

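FAQ

RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase.

If you get this error with tp>1 on the PyTorch backend, make sure the Python script contains the following: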
```
if __name__ == '__main__':
```
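In general, in a multi-threaded or multi-process context, you may need to make sure that initialization code runs only once. In that case, if __name__ == '__main__': ensures the initialization code runs only in the main program and is not repeated in every newly created process or thread.

To customize the chat template, please refer to chat_template.md.

If the LoRA weights have a corresponding chat template, you can first register the chat template with lmdeploy and then use the chat template name as the adapter name.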
