Model Config

This module defines the model_config format.

This format can be converted from a huggingface, nemo, or modelopt-quantized model. We build the tensorrt_llm engine from the context saved in this format.
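
The configs below are plain Python dataclasses that nest into one another. As a rough orientation, here is a minimal sketch of that nesting, assuming the classes can be imported from modelopt.torch.export.model_config (the exact module path and example values such as decoder_type and architecture may differ between releases); a real conversion from a huggingface/nemo checkpoint fills in many more fields, including quantization scales.

    import torch
    from modelopt.torch.export.model_config import (  # assumed import path
        AttentionConfig, DecoderLayerConfig, EmbeddingConfig, LayernormConfig,
        LinearConfig, MLPConfig, ModelConfig, QKVConfig,
    )

    hidden, heads, ffn, vocab = 64, 4, 128, 1000

    def fp16_linear(out_features, in_features):
        # Placeholder weights; a real converter copies the checkpoint tensors here.
        return LinearConfig(weight=torch.empty(out_features, in_features, dtype=torch.float16))

    layer = DecoderLayerConfig(
        decoder_type="llama",  # example value
        num_attention_heads=heads,
        num_kv_heads=heads,
        input_layernorm=LayernormConfig(weight=torch.ones(hidden, dtype=torch.float16)),
        attention=AttentionConfig(
            qkv=QKVConfig(q=fp16_linear(hidden, hidden),
                          k=fp16_linear(hidden, hidden),
                          v=fp16_linear(hidden, hidden)),
            dense=fp16_linear(hidden, hidden),
        ),
        post_layernorm=LayernormConfig(weight=torch.ones(hidden, dtype=torch.float16)),
        mlp=MLPConfig(fc=fp16_linear(ffn, hidden), gate=fp16_linear(ffn, hidden),
                      proj=fp16_linear(hidden, ffn), hidden_act="silu"),
    )

    model = ModelConfig(
        architecture="LlamaForCausalLM",  # example value
        dtype="float16",
        vocab_size=vocab,
        vocab_embedding=EmbeddingConfig(weight=torch.empty(vocab, hidden, dtype=torch.float16)),
        layers=[layer],
        ln_f=LayernormConfig(weight=torch.ones(hidden, dtype=torch.float16)),
        lm_head=fp16_linear(vocab, hidden),
    )

Derived properties such as ModelConfig.hidden_size or DecoderLayerConfig.ffn_hidden_size_local read their values off this nested structure, as documented below.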

AttentionConfig

The attention layer config.

ConvConfig

The convolution layer config.

DecoderLayerConfig

The decoder layer config.

EmbeddingConfig

The embedding layer config.

ExpertConfig

The expert config.

LayernormConfig

The layernorm layer config.

LinearActConfig

The linear + activation layer config.

LinearConfig

The linear layer config.

MLPConfig

The MLP layer config.

MOEConfig

The Mixture of Experts layer config.

MedusaHeadConfig

The decoder layer config.

ModelConfig

The full LLM model config that includes all the information needed to build the tensorrt_llm engine.

QKVConfig

The QKV layer config.

RecurrentConfig

The RecurrentBlock from recurrentgemma.

RelativeAttentionTableConfig

The relative attention table config.

RgLruConfig

The RG LRU from recurrentgemma.

class AttentionConfig

Bases: object

The attention layer config.

__init__(qkv=None, dense=None, kv_cache_scaling_factor=None, kv_cache_dtype=None, rotary_dim=-inf, clip_qkv=None, rel_attn_table=None, q_layernorm=None, k_layernorm=None)
Parameters:
Return type:

clip_qkv: float = None
dense: LinearConfig = None
k_layernorm: LayernormConfig = None
kv_cache_dtype: str | None = None
kv_cache_scaling_factor: Tensor = None
q_layernorm: LayernormConfig = None
qkv: QKVConfig | LinearConfig = None
rel_attn_table: RelativeAttentionTableConfig = None
rotary_dim: int = -inf
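
Note that qkv accepts either a QKVConfig holding separate Q/K/V projections or a single already-fused LinearConfig. A minimal sketch of both forms, under the same import-path assumption as above:

    import torch
    from modelopt.torch.export.model_config import AttentionConfig, LinearConfig, QKVConfig

    hidden = 64

    # Separate projections; the QKVConfig properties documented below merge them
    # into the fused layout TensorRT-LLM expects.
    attn_split = AttentionConfig(
        qkv=QKVConfig(
            q=LinearConfig(weight=torch.empty(hidden, hidden, dtype=torch.float16)),
            k=LinearConfig(weight=torch.empty(hidden, hidden, dtype=torch.float16)),
            v=LinearConfig(weight=torch.empty(hidden, hidden, dtype=torch.float16)),
        ),
        dense=LinearConfig(weight=torch.empty(hidden, hidden, dtype=torch.float16)),
    )

    # A projection that is already fused: a single LinearConfig with 3 * hidden output rows.
    attn_fused = AttentionConfig(
        qkv=LinearConfig(weight=torch.empty(3 * hidden, hidden, dtype=torch.float16)),
        dense=LinearConfig(weight=torch.empty(hidden, hidden, dtype=torch.float16)),
    )
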
class ConvConfig

Bases: object

The convolution layer config.

__init__(quantization=None, weight=None, bias=None)
Parameters:
  • quantization (str | None) –

  • weight (Tensor) –

  • bias (Tensor) –

Return type:

bias: Tensor = None
quantization: str | None = None
weight: Tensor = None
class DecoderLayerConfig

Bases: object

The decoder layer config.

__init__(quantization=None, decoder_type='', input_layernorm=None, mlp_layernorm=None, attention=None, recurrent=None, post_layernorm=None, pre_feedforward_layernorm=None, post_feedforward_layernorm=None, mlp=None, num_attention_heads=0, attention_head_size=None, num_kv_heads=0, max_position_embeddings=0, rotary_pct=1.0, use_alibi=False, new_decoder_architecture=False, parallel_attention=False, apply_residual_connection_post_layernorm=False, use_cache=True, chatglm_version='', rope_ratio=1.0, seq_length=0, qwen_type='', rotary_base=0, partial_rotary_factor=0, original_max_position_embeddings=0, longrope_scaling_short_factors=None, longrope_scaling_long_factors=None, mup_attn_multiplier=0, mup_embedding_multiplier=0, mup_use_scaling=0, mup_width_multiplier=0, blocksparse_block_size=0, blocksparse_homo_head_pattern=False, blocksparse_num_local_blocks=0, blocksparse_vertical_stride=0, dense_attention_every_n_layers=0, gegelu_limit=0, longrope_short_mscale=0, longrope_long_mscale=0, moe_num_experts=0, moe_top_k=0, moe_tp_mode=0, moe_renorm_mode=0, alibi_bias_max=0, residual_layernorm=None, residual_mlp=None, rnn_hidden_size=0, logits_soft_cap=0, emb_scale_by_sqrt_dim=False, layer_types=<factory>, attn_replacing_linear=None, mlp_replacing_linear=None, block_config=None, final_logit_softcapping=0, attn_logit_softcapping=0, query_pre_attn_scalar=0, clip_qkv=0, cross_attention=None, cross_attention_layernorm=None, self_attention=None, self_attention_layernorm=None, attention_layernorm=None, rel_attn_max_distance=0, rel_attn_num_buckets=0, rope_scaling=None, cross_attention_layers=None, vision_output_dim=0, gate_ffwd=None, gate_attn=None, sparse_mixer_epsilon=0)
Parameters:
  • quantization (str | None) –

  • decoder_type (str) –

  • input_layernorm (LayernormConfig) –

  • mlp_layernorm (LayernormConfig) –

  • attention (AttentionConfig) –

  • recurrent (RecurrentConfig) –

  • post_layernorm (LayernormConfig) –

  • pre_feedforward_layernorm (LayernormConfig) –

  • post_feedforward_layernorm (LayernormConfig) –

  • mlp (MLPConfig | MOEConfig) –

  • num_attention_heads (int) –

  • attention_head_size (int) –

  • num_kv_heads (int) –

  • max_position_embeddings (int) –

  • rotary_pct (float) –

  • use_alibi (bool) –

  • new_decoder_architecture (bool) –

  • parallel_attention (bool) –

  • apply_residual_connection_post_layernorm (bool) –

  • use_cache (bool) –

  • chatglm_version (str) –

  • rope_ratio (float) –

  • seq_length (int) –

  • qwen_type (str) –

  • rotary_base (int) –

  • partial_rotary_factor (float) –

  • original_max_position_embeddings (int) –

  • longrope_scaling_short_factors (List[float]) –

  • longrope_scaling_long_factors (List[float]) –

  • mup_attn_multiplier (float) –

  • mup_embedding_multiplier (float) –

  • mup_use_scaling (float) –

  • mup_width_multiplier (float) –

  • blocksparse_block_size (int) –

  • blocksparse_homo_head_pattern (bool) –

  • blocksparse_num_local_blocks (int) –

  • blocksparse_vertical_stride (int) –

  • dense_attention_every_n_layers (int) –

  • gegelu_limit (float) –

  • longrope_short_mscale (float) –

  • longrope_long_mscale (float) –

  • moe_num_experts (int) –

  • moe_top_k (int) –

  • moe_tp_mode (int) –

  • moe_renorm_mode (int) –

  • alibi_bias_max (int) –

  • residual_layernorm (LayernormConfig) –

  • residual_mlp (MLPConfig) –

  • rnn_hidden_size (int) –

  • logits_soft_cap (float) –

  • emb_scale_by_sqrt_dim (bool) –

  • layer_types (List[str]) –

  • attn_replacing_linear (LinearConfig) –

  • mlp_replacing_linear (LinearConfig) –

  • block_config (dict) –

  • final_logit_softcapping (float) –

  • attn_logit_softcapping (float) –

  • query_pre_attn_scalar (float) –

  • clip_qkv (int) –

  • cross_attention (AttentionConfig) –

  • cross_attention_layernorm (LayernormConfig) –

  • self_attention (AttentionConfig) –

  • self_attention_layernorm (LayernormConfig) –

  • attention_layernorm (LayernormConfig) –

  • rel_attn_max_distance (int) –

  • rel_attn_num_buckets (int) –

  • rope_scaling (dict) –

  • cross_attention_layers (dict) –

  • vision_output_dim (int) –

  • gate_ffwd (Tensor) –

  • gate_attn (Tensor) –

  • sparse_mixer_epsilon (float) –

Return type:

alibi_bias_max: int = 0
apply_residual_connection_post_layernorm: bool = False
attention: AttentionConfig = None
attention_head_size: int = None
attention_layernorm: LayernormConfig = None
attn_logit_softcapping: float = 0
attn_replacing_linear: LinearConfig = None
block_config: dict = None
blocksparse_block_size: int = 0
blocksparse_homo_head_pattern: bool = False
blocksparse_num_local_blocks: int = 0
blocksparse_vertical_stride: int = 0
chatglm_version: str = ''
clip_qkv: int = 0
cross_attention: AttentionConfig = None
cross_attention_layernorm: LayernormConfig = None
cross_attention_layers: dict = None
decoder_type: str = ''
dense_attention_every_n_layers: int = 0
emb_scale_by_sqrt_dim: bool = False
property ffn_hidden_size_local

Returns the ffn hidden size of the transformer model.

final_logit_softcapping: float = 0
gate_attn: Tensor = None
gate_ffwd: Tensor = None
gegelu_limit: float = 0
property hidden_size

Returns the hidden size of the transformer model.

input_layernorm: LayernormConfig = None
layer_types: List[str]
logits_soft_cap: float = 0
longrope_long_mscale: float = 0
longrope_scaling_long_factors: List[float] = None
longrope_scaling_short_factors: List[float] = None
longrope_short_mscale: float = 0
max_position_embeddings: int = 0
mlp: MLPConfig | MOEConfig = None
mlp_layernorm: LayernormConfig = None
mlp_replacing_linear: LinearConfig = None
moe_num_experts: int = 0
moe_renorm_mode: int = 0
moe_top_k: int = 0
moe_tp_mode: int = 0
mup_attn_multiplier: float = 0
mup_embedding_multiplier: float = 0
mup_use_scaling: float = 0
mup_width_multiplier: float = 0
new_decoder_architecture: bool = False
num_attention_heads: int = 0
num_kv_heads: int = 0
original_max_position_embeddings: int = 0
parallel_attention: bool = False
partial_rotary_factor: float = 0
post_feedforward_layernorm: LayernormConfig = None
post_layernorm: LayernormConfig = None
pre_feedforward_layernorm: LayernormConfig = None
quantization: str | None = None
query_pre_attn_scalar: float = 0
qwen_type: str = ''
recurrent: RecurrentConfig = None
rel_attn_max_distance: int = 0
rel_attn_num_buckets: int = 0
residual_layernorm: LayernormConfig = None
residual_mlp: MLPConfig = None
rnn_hidden_size: int = 0
rope_ratio: float = 1.0
rope_scaling: dict = None
rotary_base: int = 0
rotary_pct: float = 1.0
self_attention: AttentionConfig = None
self_attention_layernorm: LayernormConfig = None
seq_length: int = 0
sparse_mixer_epsilon: float = 0
use_alibi: bool = False
use_cache: bool = True
vision_output_dim: int = 0
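
Most of the scalar fields above mirror hyperparameters found in the source checkpoint's config. As a hedged illustration (not ModelOpt's actual converter), this is roughly how a Llama-style Hugging Face config could populate a few of them; the attention/mlp sub-configs and their weights would still have to be filled separately:

    from transformers import AutoConfig
    from modelopt.torch.export.model_config import DecoderLayerConfig  # assumed import path

    hf_cfg = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")  # any Llama-style model id (may need HF auth)

    layer = DecoderLayerConfig(
        decoder_type="llama",  # example value
        num_attention_heads=hf_cfg.num_attention_heads,
        num_kv_heads=getattr(hf_cfg, "num_key_value_heads", hf_cfg.num_attention_heads),
        max_position_embeddings=hf_cfg.max_position_embeddings,
        rotary_base=int(getattr(hf_cfg, "rope_theta", 10000)),
        rope_scaling=getattr(hf_cfg, "rope_scaling", None),
    )
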
class EmbeddingConfig

Bases: object

The embedding layer config.

__init__(weight=None, quantization=None)
Parameters:
  • weight (Tensor) –

  • quantization (str | None) –

Return type:

property hidden_size

Infers the hidden size from the embedding layer weight shape.

property local_vocab_size

Infers the vocab size from the embedding layer weight shape.

quantization: str | None = None
weight: Tensor = None
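
Both properties are derived from the weight tensor itself. A minimal sketch, assuming the weight keeps the usual (vocab, hidden) layout of torch.nn.Embedding:

    import torch
    from modelopt.torch.export.model_config import EmbeddingConfig  # assumed import path

    emb = EmbeddingConfig(weight=torch.empty(32000, 4096, dtype=torch.float16))
    print(emb.local_vocab_size)  # 32000, read from the first weight dimension
    print(emb.hidden_size)       # 4096, read from the second weight dimension
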
class ExpertConfig

Bases: object

The expert config.

__init__(fc=None, proj=None)
Parameters:
Return type:

fc: LinearConfig = None
proj: LinearConfig = None
class LayernormConfig

Bases: object

The layernorm layer config.

__init__(quantization=None, weight=None, bias=None, layernorm_type='', eps=1e-05)
Parameters:
  • quantization (str | None) –

  • weight (Tensor) –

  • bias (Tensor) –

  • layernorm_type (str) –

  • eps (float) –

Return type:

bias: Tensor = None
eps: float = 1e-05
layernorm_type: str = ''
quantization: str | None = None
weight: Tensor = None
class LinearActConfig

Bases: object

The linear + activation layer config.

__init__(linear=None, hidden_act='')
Parameters:
Return type:

hidden_act: str = ''
linear: LinearConfig = None
class LinearConfig

Bases: object

The linear layer config.

__init__(quantization=None, linear_type='column', weight=None, bias=None, activation_scaling_factor=None, weights_scaling_factor=None, weights_scaling_factor_2=None, prequant_scaling_factor=None, awq_block_size=0)
Parameters:
  • quantization (str | None) –

  • linear_type (str) –

  • weight (Tensor) –

  • bias (Tensor) –

  • activation_scaling_factor (Tensor) –

  • weights_scaling_factor (Tensor) –

  • weights_scaling_factor_2 (Tensor) –

  • prequant_scaling_factor (Tensor) –

  • awq_block_size (int) –

Return type:

activation_scaling_factor: Tensor = None
awq_block_size: int = 0
bias: Tensor = None
linear_type: str = 'column'
prequant_scaling_factor: Tensor = None
quantization: str | None = None
weight: Tensor = None
weights_scaling_factor: Tensor = None
weights_scaling_factor_2: Tensor = None
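
The scaling-factor fields only carry meaning when quantization is set. A hedged sketch of what an FP8 projection might look like; the quantization string "fp8" and the per-tensor scale shapes are assumptions, so use whatever constants and layouts your ModelOpt release actually exports:

    import torch
    from modelopt.torch.export.model_config import LinearConfig  # assumed import path

    proj = LinearConfig(
        quantization="fp8",     # assumed string value
        linear_type="column",
        weight=torch.empty(4096, 4096, dtype=torch.float8_e4m3fn),  # FP8 dtype needs a recent PyTorch
        activation_scaling_factor=torch.tensor([0.02], dtype=torch.float32),
        weights_scaling_factor=torch.tensor([0.01], dtype=torch.float32),
    )
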
class MLPConfig

Bases: object

The MLP layer config.

__init__(fc=None, gate=None, proj=None, hidden_act='', merged_fc1_gate=False)
Parameters:
Return type:

fc: LinearConfig = None
gate: LinearConfig = None
hidden_act: str = ''
merged_fc1_gate: bool = False
proj: LinearConfig = None
property quantization

Returns the merged gate and fc1 quantization.

class MOEConfig

Bases: object

The Mixture of Experts layer config.

__init__(router=None, experts=None, hidden_act='')
Parameters:
Return type:

experts: ExpertConfig = None
property fc

Returns the fc module from the experts.

hidden_act: str = ''
router: LinearConfig = None
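
A sketch of how the pieces compose; the stacked-expert weight layout (a leading expert dimension, gate fused into fc) and the hidden_act value are assumptions for illustration only:

    import torch
    from modelopt.torch.export.model_config import ExpertConfig, LinearConfig, MOEConfig  # assumed path

    hidden, ffn, num_experts = 4096, 14336, 8

    experts = ExpertConfig(
        fc=LinearConfig(weight=torch.empty(num_experts, 2 * ffn, hidden, dtype=torch.float16)),
        proj=LinearConfig(weight=torch.empty(num_experts, hidden, ffn, dtype=torch.float16)),
    )

    moe = MOEConfig(
        router=LinearConfig(weight=torch.empty(num_experts, hidden, dtype=torch.float16)),
        experts=experts,
        hidden_act="swiglu",  # example value
    )
    _ = moe.fc  # the fc property exposes the experts' fc module (see above)
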
class MedusaHeadConfig

Bases: object

The decoder layer config.

__init__(medusa_layers=None, lm_head=None)
Parameters:
Return type:

lm_head: LinearConfig = None
medusa_layers: List[LinearActConfig] = None
class ModelConfig

Bases: object

The full LLM model config that includes all the information needed to build the tensorrt_llm engine.

This class includes all the fields that tensorrt_llm supports, but not all of them are required. pipeline_parallel > 1 is only supported for TensorRT-LLM checkpoints.

__init__(architecture='', quantization=None, dtype='float16', vocab_size=0, rank=0, tensor_parallel=1, pipeline_parallel=1, vocab_embedding=None, position_embedding=None, block_embedding=None, ln_embed=None, layers=<factory>, ln_f=None, lm_head=None, share_embedding_table=False, medusa_heads=None, num_medusa_heads=0, num_medusa_layers=0, enc_dec='', encoder_hidden_size=0, encoder_num_heads=0, encoder_head_size=0)
Parameters:
  • architecture (str) –

  • quantization (str) –

  • dtype (str) –

  • vocab_size (int) –

  • rank (int) –

  • tensor_parallel (int) –

  • pipeline_parallel (int) –

  • vocab_embedding (EmbeddingConfig) –

  • position_embedding (EmbeddingConfig) –

  • block_embedding (EmbeddingConfig) –

  • ln_embed (LayernormConfig) –

  • layers (List[DecoderLayerConfig]) –

  • ln_f (LayernormConfig) –

  • lm_head (LinearConfig) –

  • share_embedding_table (bool) –

  • medusa_heads (List[MedusaHeadConfig]) –

  • num_medusa_heads (int) –

  • num_medusa_layers (int) –

  • enc_dec (str) –

  • encoder_hidden_size (int) –

  • encoder_num_heads (int) –

  • encoder_head_size (int) –

Return type:

architecture: str = ''
block_embedding: EmbeddingConfig = None
dtype: str = 'float16'
enc_dec: str = ''
encoder_head_size: int = 0
encoder_hidden_size: int = 0
encoder_num_heads: int = 0
property hidden_act

Returns the hidden_act of the model.

property hidden_size

Returns the hidden_size of the model.

layers: List[DecoderLayerConfig]
lm_head: LinearConfig = None
ln_embed: LayernormConfig = None
ln_f: LayernormConfig = None
property max_position_embeddings

Returns the max_position_embedding of the model.

medusa_heads: List[MedusaHeadConfig] = None
property num_attention_heads

Returns the num_attention_heads of the model.

property num_kv_heads

Returns the num_key_value_heads of the model.

num_medusa_heads: int = 0
num_medusa_layers: int = 0
pipeline_parallel: int = 1
position_embedding: EmbeddingConfig = None
quantization: str = None
rank: int = 0
share_embedding_table: bool = False
tensor_parallel: int = 1
vocab_embedding: EmbeddingConfig = None
vocab_size: int = 0
property vocab_size_padded

Returns the padded vocab_size of the model, rounded up to the tensor parallelism.
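
The padding exists so the embedding and lm_head rows split evenly across tensor-parallel ranks. A hedged sketch of the usual rounding rule (not the property's actual source):

    def pad_vocab_size(vocab_size: int, tensor_parallel: int) -> int:
        # Round up to the next multiple of tensor_parallel.
        return ((vocab_size + tensor_parallel - 1) // tensor_parallel) * tensor_parallel

    print(pad_vocab_size(32000, 8))  # 32000, already divisible
    print(pad_vocab_size(32001, 8))  # 32008
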

class QKVConfig

Bases: object

The QKV layer config.

__init__(q=None, k=None, v=None)
Parameters:
Return type:

property activation_scaling_factor

Returns the merged activation_scaling_factor across Q, K and V.

The max of the Q, K, V activation scaling factors is returned.

property awq_block_size

Returns the awq_block_size of this QKV layer.

property bias

The generated linear layer bias.

The Q, K, V biases are concatenated together to fit the TensorRT-LLM QKV linear layer.

k: LinearConfig = None
property prequant_scaling_factor

Returns the merged prequant_scaling_factor across Q, K and V.

The prequant scaling factors of Q, K and V should be the same, so just one of them is returned.

q: LinearConfig = None
property quantization

Returns the quantization format of this QKV layer.

v: LinearConfig = None
property weight

The generated linear layer weight.

The Q, K, V weights are concatenated together to fit the TensorRT-LLM QKV linear layer.

property weights_scaling_factor

Returns the merged weights_scaling_factor across Q, K and V.

If the quantization is FP8, the max of the Q, K, V weight scaling factors is returned. If the quantization is INT8_SQ, the concatenated value is returned.

property weights_scaling_factor_2

Returns the merged weights_scaling_factor_2 across Q, K and V.

weight_scaling_factor_2 is needed for W4A8 AWQ.
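
These properties exist so that three separate projections can be consumed as the single fused QKV layer TensorRT-LLM expects. A minimal sketch of the weight/bias merge, assuming the import path used earlier (the scaling-factor properties additionally depend on the quantization format, as described above):

    import torch
    from modelopt.torch.export.model_config import LinearConfig, QKVConfig  # assumed import path

    hidden = 64

    def head(out_features):
        return LinearConfig(weight=torch.empty(out_features, hidden, dtype=torch.float16),
                            bias=torch.zeros(out_features, dtype=torch.float16))

    qkv = QKVConfig(q=head(hidden), k=head(hidden), v=head(hidden))

    # Q, K and V are concatenated along the output dimension into one fused projection.
    print(qkv.weight.shape)  # expected (192, 64), i.e. (3 * hidden, hidden)
    print(qkv.bias.shape)    # expected (192,)
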

class RecurrentConfig

Bases: object

The RecurrentBlock from recurrentgemma.

__init__(linear_y=None, y_bias=None, linear_x=None, linear_out=None, conv1d=None, rg_lru=None)
Parameters:
Return type:

conv1d: ConvConfig = None
linear_out: LinearConfig = None
linear_x: LinearConfig = None
linear_y: LinearConfig = None
rg_lru: RgLruConfig = None
y_bias: Tensor = None
class RelativeAttentionTableConfig

Bases: object

The relative attention table config. Used for splitting purposes.

__init__(weight=None)
Parameters:

weight (Tensor) –

Return type:

weight: Tensor = None
class RgLruConfig

Bases: object

The RG LRU from recurrentgemma.

__init__(recurrent_param=None, input_gate=None, recurrent_gate=None)
Parameters:
Return type:

input_gate: LinearConfig = None
recurrent_gate: LinearConfig = None
recurrent_param: Tensor = None