Getting Started with DeepSpeed for Inferencing Transformer-Based Models

DeepSpeed-Inference v2 is here, and it is called DeepSpeed-FastGen! For the best performance, the latest features, and the newest model support, please see our DeepSpeed-FastGen release blog!

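DeepSpeed-Inference introduces several features to efficiently serve Transformer-based PyTorch models. It supports model parallelism (MP) to fit large models that would otherwise not fit in GPU memory. Even for smaller models, MP can be used to reduce inference latency. To further reduce latency and cost, we introduce inference-customized kernels. Finally, we propose a novel approach to quantizing models, called MoQ, which both shrinks the model and reduces the cost of inference in production. For more details on the inference-related optimizations in DeepSpeed, please refer to our blog post.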

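DeepSpeed provides a seamless inference mode for compatible Transformer-based models trained using DeepSpeed, Megatron, and HuggingFace. This means no change is required on the modeling side, such as exporting the model or creating a different checkpoint from your trained checkpoints. To run inference on multiple GPUs for a compatible model, provide the model-parallelism degree together with either the checkpoint information or a model that has already been loaded from a checkpoint, and DeepSpeed will do the rest. It automatically partitions the model as necessary, injects compatible high-performance kernels into your model, and manages the inter-GPU communication. For the list of compatible models, please see here.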

Initializing for Inference

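To run inference with DeepSpeed, use the init_inference API to load the model. Here you can specify the MP degree, and if the model has not already been loaded with the appropriate checkpoint, you can also provide a checkpoint description through a JSON file or a checkpoint path.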

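To inject the high-performance kernels, set replace_with_kernel_inject to True for a compatible model. For models that DeepSpeed does not support, users can submit a PR that defines a new policy in the replace_policy class. The policy specifies the different parameters of a Transformer layer, such as the attention and feed-forward parts. The policy classes in DeepSpeed create a mapping between the parameters of the original user-supplied layer implementation and DeepSpeed's inference-optimized Transformer layer.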

# create the model
if args.pre_load_checkpoint:
    model = model_class.from_pretrained(args.model_name_or_path)
else:
    model = model_class()
...

import torch
import deepspeed

# Initialize the DeepSpeed-Inference engine
ds_engine = deepspeed.init_inference(model,
                                     tensor_parallel={"tp_size": 2},
                                     dtype=torch.half,
                                     checkpoint=None if args.pre_load_checkpoint else args.checkpoint_json,
                                     replace_with_kernel_inject=True)
model = ds_engine.module
output = model('Input String')
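The snippet above reads several attributes from an args object that it does not define. A minimal sketch of how such an object could be built with argparse is shown here; the flag names simply mirror the attributes used above, and build_parser and checkpoint_argument are hypothetical helpers introduced for illustration, not part of DeepSpeed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical CLI; flag names mirror the `args` attributes read by the snippet."""
    parser = argparse.ArgumentParser(description="DeepSpeed inference example")
    parser.add_argument("--model_name_or_path", type=str, default=None,
                        help="HuggingFace model name or local path")
    parser.add_argument("--checkpoint_json", type=str, default=None,
                        help="Path to a checkpoint description JSON file")
    parser.add_argument("--pre_load_checkpoint", action="store_true",
                        help="Load weights with from_pretrained before init_inference")
    # Assumption: the launcher may pass --local_rank to each process, so accept it.
    parser.add_argument("--local_rank", type=int, default=0)
    return parser


def checkpoint_argument(args: argparse.Namespace):
    """Mirror the snippet's logic: pass no checkpoint when weights are already loaded."""
    return None if args.pre_load_checkpoint else args.checkpoint_json


# Example: weights are not pre-loaded, so the JSON description is forwarded.
args = build_parser().parse_args(["--checkpoint_json", "checkpoint.json"])
print(checkpoint_argument(args))
```

In a real script you would call parse_args() without arguments so that the values come from the command line.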

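To run inference with model parallelism only, for models whose kernels we do not support, you can pass an injection policy that identifies the two specific linear layers in a Transformer encoder/decoder layer: 1) the attention output GeMM and 2) the layer output GeMM. We need these parts of the layer to add the required all-reduce communication between GPUs, which merges the partial results across model-parallel ranks. Below, we provide an example that shows how to use deepspeed-inference with a T5 model: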

# create the model
import os

import torch
from transformers import pipeline
from transformers.models.t5.modeling_t5 import T5Block

import deepspeed

# Rank and world size come from the environment; default to a single process
local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))

pipe = pipeline("text2text-generation", model="google/t5-v1_1-small", device=local_rank)
# Initialize the DeepSpeed-Inference engine
pipe.model = deepspeed.init_inference(
    pipe.model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.float,
    injection_policy={T5Block: ('SelfAttention.o', 'EncDecAttention.o', 'DenseReluDense.wo')}
)
output = pipe('Input String')
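To see why the output GeMMs are the layers that need an all-reduce, consider a toy version in plain Python. This is only an illustrative sketch, not DeepSpeed's implementation: it assumes the input dimension of a linear layer is split evenly across ranks, and matvec, split_columns, and all_reduce_sum are hypothetical helpers. Each rank computes a partial product from its slice, and only the element-wise sum across ranks equals the full result.

```python
def matvec(weight, x):
    """Compute weight @ x for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def split_columns(weight, x, world_size):
    """Give each rank a contiguous slice of the input dimension.

    Assumes len(x) is divisible by world_size.
    """
    step = len(x) // world_size
    shards = []
    for rank in range(world_size):
        lo, hi = rank * step, (rank + 1) * step
        shards.append(([row[lo:hi] for row in weight], x[lo:hi]))
    return shards


def all_reduce_sum(partials):
    """Stand-in for the all-reduce collective: element-wise sum across ranks."""
    return [sum(values) for values in zip(*partials)]


weight = [[1, 2, 3, 4],
          [5, 6, 7, 8]]
x = [1, 0, 2, 1]

# Each "rank" multiplies its weight slice by its slice of the input.
partials = [matvec(w_shard, x_shard) for w_shard, x_shard in split_columns(weight, x, 2)]
print(partials)                  # per-rank partial results: [[1, 5], [10, 22]]
print(all_reduce_sum(partials))  # merged result: [11, 27]
print(matvec(weight, x))         # single-device reference: [11, 27]
```

Neither partial result is correct on its own, which is why the injection policy has to tell DeepSpeed where these layers are.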

Loading Checkpoints

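For models trained with HuggingFace, the model checkpoint can be pre-loaded using the from_pretrained API, as shown above. For Megatron-LM models trained with model parallelism, we require a JSON config that lists all the model-parallel checkpoints. Below we show how to load a Megatron-LM checkpoint trained with MP=2.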

"checkpoint.json":
{
    "type": "Megatron",
    "version": 0.0,
    "checkpoints": [
        "mp_rank_00/model_optim_rng.pt",
        "mp_rank_01/model_optim_rng.pt"
    ]
}

For models trained with DeepSpeed, the checkpoint JSON file only needs to store the path to the model checkpoints.

"checkpoint.json":
{
    "type": "ds_model",
    "version": 0.0,
    "checkpoints": "path_to_checkpoints"
}
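Because the description file has to parse as JSON, one option is to generate it with Python's json module, which avoids syntax slips such as trailing commas. The sketch below builds both description formats shown above. The helper names (megatron_descriptor, deepspeed_descriptor, write_descriptor) are hypothetical, and the field values are copied from the examples rather than taken from a schema.

```python
import json


def megatron_descriptor(checkpoint_paths):
    """Megatron-style description: one checkpoint file per model-parallel rank."""
    return {"type": "Megatron", "version": 0.0, "checkpoints": list(checkpoint_paths)}


def deepspeed_descriptor(checkpoint_dir):
    """DeepSpeed-style description: a single path to the model checkpoints."""
    return {"type": "ds_model", "version": 0.0, "checkpoints": checkpoint_dir}


def write_descriptor(descriptor, path):
    """Serialize a description to disk as valid JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(descriptor, handle, indent=4)


# Render the MP=2 Megatron-LM example as text to check what would be written.
text = json.dumps(
    megatron_descriptor(["mp_rank_00/model_optim_rng.pt",
                         "mp_rank_01/model_optim_rng.pt"]),
    indent=4,
)
print(text)
```

Calling write_descriptor(deepspeed_descriptor("path_to_checkpoints"), "checkpoint.json") would produce the file for the DeepSpeed case.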

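DeepSpeed supports running inference with a different MP degree than the one used in training. For example, a model trained without any MP can be run with MP=2, and a model trained with MP=4 can be run for inference without any MP. DeepSpeed automatically merges or splits the checkpoints as necessary during initialization.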
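Conceptually, changing the MP degree amounts to merging the saved shards of each weight and splitting the result again. The toy sketch below illustrates this for a weight stored as a list of rows. It is not DeepSpeed's implementation: the split_rows, merge_rows, and reshard helpers are hypothetical, and splitting every weight along its rows is a simplifying assumption, since in a real model the split dimension depends on the layer.

```python
def split_rows(weight, parts):
    """Split a weight matrix (list of rows) into `parts` equal row blocks, one per rank."""
    if len(weight) % parts != 0:
        raise ValueError("row count must be divisible by the number of shards")
    step = len(weight) // parts
    return [weight[i * step:(i + 1) * step] for i in range(parts)]


def merge_rows(shards):
    """Concatenate per-rank row blocks back into the full weight matrix."""
    return [row for shard in shards for row in shard]


def reshard(shards, new_mp_size):
    """Convert shards saved at one MP degree into shards for another MP degree."""
    return split_rows(merge_rows(shards), new_mp_size)


full = [[1, 2], [3, 4], [5, 6], [7, 8]]

trained_mp4 = split_rows(full, 4)       # checkpoint saved with MP=4
inference_mp1 = reshard(trained_mp4, 1)  # run without MP: one merged shard
inference_mp2 = reshard([full], 2)       # trained without MP, run with MP=2
print(inference_mp1)
print(inference_mp2)
```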

Launching

Use the DeepSpeed launcher deepspeed to launch inference on multiple GPUs:

deepspeed --num_gpus 2 inference.py

End-to-End GPT NEO 2.7B Inference

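DeepSpeed inference can be used in conjunction with the HuggingFace pipeline. Below is the end-to-end client code that combines DeepSpeed inference with the HuggingFace pipeline to generate text using the GPT-NEO-2.7B model.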

# Filename: gpt-neo-2.7b-generation.py
import os
import deepspeed
import torch
from transformers import pipeline

local_rank = int(os.getenv('LOCAL_RANK', '0'))
world_size = int(os.getenv('WORLD_SIZE', '1'))
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-2.7B',
                     device=local_rank)



# Replace the pipeline's model with the DeepSpeed inference engine
generator.model = deepspeed.init_inference(generator.model,
                                           tensor_parallel={"tp_size": world_size},
                                           dtype=torch.float,
                                           replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, min_length=50)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
    print(string)

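The above script modifies the model in the HuggingFace text-generation pipeline to use DeepSpeed inference. Note that although the original model was trained without any model parallelism and the checkpoint is a single-GPU checkpoint, we can still run inference on multiple GPUs by using model-parallel tensor slicing across GPUs. To run the client, simply run: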

deepspeed --num_gpus 2 gpt-neo-2.7b-generation.py

Below is the output of the generated text. You can try other prompts to see how this model generates text.

[{
    'generated_text': 'DeepSpeed is a blog about the future. We will consider the future of work, the future of living, and the future of society. We will focus in particular on the evolution of living conditions for humans and animals in the Anthropocene and its repercussions'
}]

Datatypes and Quantized Models

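DeepSpeed inference supports fp32, fp16, and int8 parameters. The appropriate datatype can be set using dtype in init_inference, and DeepSpeed will choose the kernels optimized for that datatype. For quantized int8 models, if the model was quantized using DeepSpeed's quantization approach (MoQ), the setting with which the quantization was applied needs to be passed to init_inference. This setting includes the number of groups used for quantization and whether the MLP part of the transformer is quantized with extra grouping. For more information on these parameters, please visit our quantization tutorial.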

import torch
import deepspeed
model = deepspeed.init_inference(model,
                                 checkpoint='./checkpoint.json',
                                 dtype=torch.int8,
                                 quantization_setting=(quantize_groups,
                                                       mlp_extra_grouping)
                                )
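To give a feel for what the number of quantization groups controls, here is a toy sketch of group-wise symmetric int8 quantization in plain Python. It is not the MoQ implementation: quantize_groups_int8 and dequantize are hypothetical helpers, and the scheme (one scale per group of consecutive values, derived from the group's largest magnitude) is an assumption chosen for illustration. The point is that more groups let small-magnitude weights keep their own, finer scale.

```python
def quantize_groups_int8(values, num_groups):
    """Symmetric int8 quantization with one scale per group of consecutive values."""
    if len(values) % num_groups != 0:
        raise ValueError("number of values must be divisible by num_groups")
    size = len(values) // num_groups
    quantized, scales = [], []
    for g in range(num_groups):
        group = values[g * size:(g + 1) * size]
        max_abs = max(abs(v) for v in group)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer and clamp to the symmetric int8 range.
        quantized.extend(max(-127, min(127, round(v / scale))) for v in group)
    return quantized, scales


def dequantize(quantized, scales):
    """Map int8 codes back to floats using each group's scale."""
    size = len(quantized) // len(scales)
    return [q * scales[i // size] for i, q in enumerate(quantized)]


weights = [0.01, -0.02, 0.03, 0.04, 1.0, -2.0, 3.0, 4.0]

for groups in (1, 2):
    codes, scales = quantize_groups_int8(weights, groups)
    restored = dequantize(codes, scales)
    # Error on the four small weights, which suffer most from a shared scale.
    small_error = max(abs(a - b) for a, b in zip(weights[:4], restored[:4]))
    print(groups, small_error)
```

With a single group, the small weights share a scale with the large ones and lose most of their precision; with two groups, they are reconstructed much more closely.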

Congratulations! You have completed the DeepSpeed inference tutorial.