Inference

TorchRec provides easy-to-use APIs that transform an authored TorchRec model into an optimized model for distributed inference by swapping modules eagerly.

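The transformation converts TorchRec modules in the model, such as EmbeddingBagCollection, into quantized and sharded versions that can be compiled with torch.fx and TorchScript for inference in a C++ environment.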

The intended usage is to call quantize_inference_model on the model and then call shard_quant_model on the result.

torchrec.inference.modules.quantize_inference_model(model: Module, quantization_mapping: Optional[Dict[str, Type[Module]]] = None, per_table_weight_dtype: Optional[Dict[str, dtype]] = None, fp_weight_dtype: dtype = torch.int8, quantization_dtype: dtype = torch.int8, output_dtype: dtype = torch.float32) → Module

Quantizes the model by swapping each TorchRec training module for its quantized counterpart (e.g. EmbeddingBagCollection -> QuantEmbeddingBagCollection).

Parameters:

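model (torch.nn.Module) – the model to quantize
quantization_mapping (Optional[Dict[str, Type[torch.nn.Module]]]) – a mapping from original module type to quantized module type. If not provided, the default mapping is used: (EmbeddingBagCollection -> QuantEmbeddingBagCollection, EmbeddingCollection -> QuantEmbeddingCollection).
per_table_weight_dtype (Optional[Dict[str, torch.dtype]]) – a mapping from table name to weight dtype. If not provided, the default quantization dtype (int8) is used.
fp_weight_dtype (torch.dtype) – the quantization dtype for the feature processor weights, used when the model contains a FeatureProcessedEmbeddingBagCollection. Defaults to int8.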

Returns:

the quantized model

Return type:

torch.nn.Module
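To make the int8 default more concrete, the following is a minimal sketch of row-wise affine quantization, one common way to store embedding weights in 8 bits. It is written in plain Python for illustration only and is not TorchRec's implementation; the helper names quantize_row_int8 and dequantize_row_int8 are hypothetical.

```python
from typing import List, Tuple


def quantize_row_int8(row: List[float]) -> Tuple[List[int], float, float]:
    """Quantize one embedding row to 8-bit codes plus a per-row (scale, bias)."""
    lo, hi = min(row), max(row)
    # A constant row has zero range; fall back to scale 1.0 to avoid dividing by zero.
    scale = (hi - lo) / 255.0 if hi > lo else 1.0
    codes = [round((x - lo) / scale) for x in row]
    return codes, scale, lo


def dequantize_row_int8(codes: List[int], scale: float, bias: float) -> List[float]:
    """Recover approximate float values from the 8-bit codes."""
    return [c * scale + bias for c in codes]


row = [-1.0, -0.25, 0.0, 0.5, 1.0]
codes, scale, bias = quantize_row_int8(row)
restored = dequantize_row_int8(codes, scale, bias)
```

Each restored value differs from the original by at most half a quantization step (scale / 2), which is the accuracy cost paid for storing one byte per weight instead of four.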

Example:

ebc = EmbeddingBagCollection(tables=eb_configs, device=torch.device("meta"))

module = DLRMPredictModule(
    embedding_bag_collection=ebc,
    dense_in_features=self.model_config.dense_in_features,
    dense_arch_layer_sizes=self.model_config.dense_arch_layer_sizes,
    over_arch_layer_sizes=self.model_config.over_arch_layer_sizes,
    id_list_features_keys=self.model_config.id_list_features_keys,
    dense_device=device,
)

quant_model = quantize_inference_model(module)
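The module swap described above can be pictured as a walk over the module tree that replaces every child whose type appears in a mapping. The sketch below uses plain Python stand-ins rather than torch.nn.Module, and keys the mapping by class name as a simplification; Module, FloatTable, QuantTable, Model, and swap_modules are all hypothetical names, not TorchRec APIs.

```python
from typing import Dict, Type


class Module:
    """Hypothetical minimal base class (stand-in for torch.nn.Module)."""

    def children(self) -> Dict[str, "Module"]:
        # Snapshot of attributes that are themselves modules.
        return {k: v for k, v in vars(self).items() if isinstance(v, Module)}


class FloatTable(Module):
    """Hypothetical stand-in for a float training module."""

    def __init__(self, name: str) -> None:
        self.name = name


class QuantTable(Module):
    """Hypothetical quantized counterpart, built from the float module."""

    def __init__(self, source: FloatTable) -> None:
        self.name = source.name


class Model(Module):
    def __init__(self) -> None:
        self.sparse = FloatTable("t1")
        self.dense = Module()


def swap_modules(root: Module, mapping: Dict[str, Type[Module]]) -> Module:
    """Recursively replace each child whose class name is a key in `mapping`."""
    for attr, child in root.children().items():
        target = mapping.get(type(child).__name__)
        if target is not None:
            setattr(root, attr, target(child))
        else:
            swap_modules(child, mapping)
    return root


model = swap_modules(Model(), {"FloatTable": QuantTable})
```

After the call, model.sparse is a QuantTable carrying the same table name, while model.dense is left untouched because its type is not in the mapping.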

torchrec.inference.modules.shard_quant_model(model: Module, world_size: int = 1, compute_device: str = 'cuda', sharding_device: str = 'meta', sharders: Optional[List[ModuleSharder[Module]]] = None, device_memory_size: Optional[int] = None, constraints: Optional[Dict[str, ParameterConstraints]] = None, ddr_cap: Optional[int] = None) → Tuple[Module, ShardingPlan]

Shards a quantized TorchRec model. Sharding produces the model best suited for inference and is required for distributed inference.

Parameters:

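model (torch.nn.Module) – the quantized model to shard
world_size (int) – the number of devices to shard the model across. Defaults to 1.
compute_device (str) – the device to run the model on. Defaults to "cuda".
sharding_device (str) – the device to run sharding on. Defaults to "meta".
sharders (Optional[List[ModuleSharder[torch.nn.Module]]]) – the sharders used to shard the quantized model. Defaults to QuantEmbeddingBagCollectionSharder, QuantEmbeddingCollectionSharder, QuantFeatureProcessedEmbeddingBagCollectionSharder.
device_memory_size (Optional[int]) – the memory limit for the CUDA device. Defaults to None.
constraints (Optional[Dict[str, ParameterConstraints]]) – the constraints to use for sharding. Defaults to None, which applies the default constraints: QuantEmbeddingBagCollection is sharded table-wise.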

Returns:

the sharded model and the sharding plan

Return type:

Tuple[torch.nn.Module, ShardingPlan]

Example:

ebc = EmbeddingBagCollection(tables=eb_configs, device=torch.device("meta"))

module = DLRMPredictModule(
    embedding_bag_collection=ebc,
    dense_in_features=self.model_config.dense_in_features,
    dense_arch_layer_sizes=self.model_config.dense_arch_layer_sizes,
    over_arch_layer_sizes=self.model_config.over_arch_layer_sizes,
    id_list_features_keys=self.model_config.id_list_features_keys,
    dense_device=device,
)

quant_model = quantize_inference_model(module)
sharded_model, _ = shard_quant_model(quant_model)
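As a rough intuition for table-wise sharding, where each embedding table is placed whole on a single device, the sketch below assigns tables to ranks greedily by size. It is an illustrative assumption and does not reproduce how TorchRec computes its ShardingPlan; shard_table_wise and the table names and sizes are hypothetical.

```python
from typing import Dict, List


def shard_table_wise(table_sizes: Dict[str, int], world_size: int) -> Dict[str, int]:
    """Place each whole table on one rank: largest table first, least-loaded rank first."""
    load: List[int] = [0] * world_size
    placement: Dict[str, int] = {}
    # Sort by descending size, breaking ties by name so the result is deterministic.
    for name, size in sorted(table_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        rank = load.index(min(load))
        placement[name] = rank
        load[rank] += size
    return placement


# Hypothetical table sizes in arbitrary memory units.
sizes = {"t_user": 400, "t_item": 300, "t_ad": 200, "t_geo": 100}
plan = shard_table_wise(sizes, world_size=2)
```

With these sizes, t_user and t_geo land on rank 0 and t_item and t_ad on rank 1, so both ranks hold 500 units. No table is split across devices, which is the defining property of table-wise sharding.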