Model Config
This module defines the model_config format.
This format can be converted from huggingface, nemo, or modelopt-quantized model checkpoints. The context saved in this format is then used to build the tensorrt_llm engine.
Classes

| Class | Description |
|---|---|
| AttentionConfig | The attention layer config. |
| ConvConfig | The convolution layer config. |
| DecoderLayerConfig | The decoder layer config. |
| EmbeddingConfig | The embedding layer config. |
| ExpertConfig | The expert config. |
| LayernormConfig | The layernorm layer config. |
| LinearActConfig | The linear + activation layer config. |
| LinearConfig | The linear layer config. |
| MLPConfig | The MLP layer config. |
| MOEConfig | The Mixture of Experts (MOE) layer config. |
| MedusaHeadConfig | The decoder layer config. |
| ModelConfig | The full LLM model config that includes all the information needed to build the tensorrt_llm engine. |
| QKVConfig | The QKV layer config. |
| RecurrentConfig | The RecurrentBlock from recurrentgemma. |
| RelativeAttentionTableConfig | The relative attention table config. |
| RgLruConfig | The RG LRU from recurrentgemma. |
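The classes above nest into a single per-rank model description: a ModelConfig holds the embeddings, a list of DecoderLayerConfig entries, and the lm_head, and each decoder layer in turn holds AttentionConfig, MLPConfig, and layernorm configs built from LinearConfig leaves. Below is a minimal, illustrative sketch of that nesting with placeholder tensors; the import path modelopt.torch.export.model_config, the decoder_type value, and the architecture string are assumptions rather than values stated on this page.

```python
import torch

# Assumed import path for the classes documented on this page.
from modelopt.torch.export.model_config import (
    AttentionConfig,
    DecoderLayerConfig,
    EmbeddingConfig,
    LinearConfig,
    MLPConfig,
    ModelConfig,
    QKVConfig,
)

hidden, ffn, vocab = 64, 256, 1000  # toy sizes

def linear(out_dim, in_dim):
    # Unquantized linear layer; "column" is the documented default linear_type.
    return LinearConfig(linear_type="column", weight=torch.randn(out_dim, in_dim))

layer = DecoderLayerConfig(
    decoder_type="llama",  # illustrative value
    num_attention_heads=4,
    num_kv_heads=4,
    max_position_embeddings=2048,
    attention=AttentionConfig(
        qkv=QKVConfig(
            q=linear(hidden, hidden),
            k=linear(hidden, hidden),
            v=linear(hidden, hidden),
        ),
        dense=linear(hidden, hidden),
    ),
    mlp=MLPConfig(
        fc=linear(ffn, hidden),
        gate=linear(ffn, hidden),
        proj=linear(hidden, ffn),
        hidden_act="silu",
    ),
)

model = ModelConfig(
    architecture="LlamaForCausalLM",  # illustrative value
    dtype="float16",
    vocab_size=vocab,
    vocab_embedding=EmbeddingConfig(weight=torch.randn(vocab, hidden)),
    layers=[layer],
    lm_head=linear(vocab, hidden),
)
```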
- class AttentionConfig
Bases: object
The attention layer config.
- __init__(qkv=None, dense=None, kv_cache_scaling_factor=None, kv_cache_dtype=None, rotary_dim=-inf, clip_qkv=None, rel_attn_table=None, q_layernorm=None, k_layernorm=None)
- Parameters:
qkv (QKVConfig | LinearConfig) –
dense (LinearConfig) –
kv_cache_scaling_factor (Tensor) –
kv_cache_dtype (str | None) –
rotary_dim (int) –
clip_qkv (float) –
rel_attn_table (RelativeAttentionTableConfig) –
q_layernorm (LayernormConfig) –
k_layernorm (LayernormConfig) –
- Return type:
None
- clip_qkv: float = None
- dense: LinearConfig = None
- k_layernorm: LayernormConfig = None
- kv_cache_dtype: str | None = None
- kv_cache_scaling_factor: Tensor = None
- q_layernorm: LayernormConfig = None
- qkv: QKVConfig | LinearConfig = None
- rel_attn_table: RelativeAttentionTableConfig = None
- rotary_dim: int = -inf
- class ConvConfig
Bases: object
The convolution layer config.
- __init__(quantization=None, weight=None, bias=None)
- Parameters:
quantization (str | None) –
weight (Tensor) –
bias (Tensor) –
- Return type:
None
- bias: Tensor = None
- quantization: str | None = None
- weight: Tensor = None
- class DecoderLayerConfig
Bases: object
The decoder layer config.
- __init__(quantization=None, decoder_type='', input_layernorm=None, mlp_layernorm=None, attention=None, recurrent=None, post_layernorm=None, pre_feedforward_layernorm=None, post_feedforward_layernorm=None, mlp=None, num_attention_heads=0, attention_head_size=None, num_kv_heads=0, max_position_embeddings=0, rotary_pct=1.0, use_alibi=False, new_decoder_architecture=False, parallel_attention=False, apply_residual_connection_post_layernorm=False, use_cache=True, chatglm_version='', rope_ratio=1.0, seq_length=0, qwen_type='', rotary_base=0, partial_rotary_factor=0, original_max_position_embeddings=0, longrope_scaling_short_factors=None, longrope_scaling_long_factors=None, mup_attn_multiplier=0, mup_embedding_multiplier=0, mup_use_scaling=0, mup_width_multiplier=0, blocksparse_block_size=0, blocksparse_homo_head_pattern=False, blocksparse_num_local_blocks=0, blocksparse_vertical_stride=0, dense_attention_every_n_layers=0, gegelu_limit=0, longrope_short_mscale=0, longrope_long_mscale=0, moe_num_experts=0, moe_top_k=0, moe_tp_mode=0, moe_renorm_mode=0, alibi_bias_max=0, residual_layernorm=None, residual_mlp=None, rnn_hidden_size=0, logits_soft_cap=0, emb_scale_by_sqrt_dim=False, layer_types=<factory>, attn_replacing_linear=None, mlp_replacing_linear=None, block_config=None, final_logit_softcapping=0, attn_logit_softcapping=0, query_pre_attn_scalar=0, clip_qkv=0, cross_attention=None, cross_attention_layernorm=None, self_attention=None, self_attention_layernorm=None, attention_layernorm=None, rel_attn_max_distance=0, rel_attn_num_buckets=0, rope_scaling=None, cross_attention_layers=None, vision_output_dim=0, gate_ffwd=None, gate_attn=None, sparse_mixer_epsilon=0)
- Parameters:
quantization (str | None) –
decoder_type (str) –
input_layernorm (LayernormConfig) –
mlp_layernorm (LayernormConfig) –
attention (AttentionConfig) –
recurrent (RecurrentConfig) –
post_layernorm (LayernormConfig) –
pre_feedforward_layernorm (LayernormConfig) –
post_feedforward_layernorm (LayernormConfig) –
mlp (MLPConfig) –
num_attention_heads (int) –
attention_head_size (int) –
num_kv_heads (int) –
max_position_embeddings (int) –
rotary_pct (float) –
use_alibi (bool) –
new_decoder_architecture (bool) –
parallel_attention (bool) –
apply_residual_connection_post_layernorm (bool) –
use_cache (bool) –
chatglm_version (str) –
rope_ratio (float) –
seq_length (int) –
qwen_type (str) –
rotary_base (int) –
partial_rotary_factor (float) –
original_max_position_embeddings (int) –
longrope_scaling_short_factors (List[float]) –
longrope_scaling_long_factors (List[float]) –
mup_attn_multiplier (float) –
mup_embedding_multiplier (float) –
mup_use_scaling (float) –
mup_width_multiplier (float) –
blocksparse_block_size (int) –
blocksparse_homo_head_pattern (bool) –
blocksparse_num_local_blocks (int) –
blocksparse_vertical_stride (int) –
dense_attention_every_n_layers (int) –
gegelu_limit (float) –
longrope_short_mscale (float) –
longrope_long_mscale (float) –
moe_num_experts (int) –
moe_top_k (int) –
moe_tp_mode (int) –
moe_renorm_mode (int) –
alibi_bias_max (int) –
residual_layernorm (LayernormConfig) –
residual_mlp (MLPConfig) –
rnn_hidden_size (int) –
logits_soft_cap (float) –
emb_scale_by_sqrt_dim (bool) –
layer_types (List[str]) –
attn_replacing_linear (LinearConfig) –
mlp_replacing_linear (LinearConfig) –
block_config (dict) –
final_logit_softcapping (float) –
attn_logit_softcapping (float) –
query_pre_attn_scalar (float) –
clip_qkv (int) –
cross_attention (AttentionConfig) –
cross_attention_layernorm (LayernormConfig) –
self_attention (AttentionConfig) –
self_attention_layernorm (LayernormConfig) –
attention_layernorm (LayernormConfig) –
rel_attn_max_distance (int) –
rel_attn_num_buckets (int) –
rope_scaling (dict) –
cross_attention_layers (dict) –
vision_output_dim (int) –
gate_ffwd (Tensor) –
gate_attn (Tensor) –
sparse_mixer_epsilon (float) –
- Return type:
None
- alibi_bias_max: int = 0
- apply_residual_connection_post_layernorm: bool = False
- attention: AttentionConfig = None
- attention_head_size: int = None
- attention_layernorm: LayernormConfig = None
- attn_logit_softcapping: float = 0
- attn_replacing_linear: LinearConfig = None
- block_config: dict = None
- blocksparse_block_size: int = 0
- blocksparse_homo_head_pattern: bool = False
- blocksparse_num_local_blocks: int = 0
- blocksparse_vertical_stride: int = 0
- chatglm_version: str = ''
- clip_qkv: int = 0
- cross_attention: AttentionConfig = None
- cross_attention_layernorm: LayernormConfig = None
- cross_attention_layers: dict = None
- decoder_type: str = ''
- dense_attention_every_n_layers: int = 0
- emb_scale_by_sqrt_dim: bool = False
Returns the ffn hidden size of the transformer model.
- final_logit_softcapping: float = 0
- gate_attn: Tensor = None
- gate_ffwd: Tensor = None
- gegelu_limit: float = 0
Returns the hidden size of the transformer model.
- input_layernorm: LayernormConfig = None
- layer_types: List[str]
- logits_soft_cap: float = 0
- longrope_long_mscale: float = 0
- longrope_scaling_long_factors: List[float] = None
- longrope_scaling_short_factors: List[float] = None
- longrope_short_mscale: float = 0
- max_position_embeddings: int = 0
- mlp: MLPConfig = None
- mlp_layernorm: LayernormConfig = None
- mlp_replacing_linear: LinearConfig = None
- moe_num_experts: int = 0
- moe_renorm_mode: int = 0
- moe_top_k: int = 0
- moe_tp_mode: int = 0
- mup_attn_multiplier: float = 0
- mup_embedding_multiplier: float = 0
- mup_use_scaling: float = 0
- mup_width_multiplier: float = 0
- new_decoder_architecture: bool = False
- num_attention_heads: int = 0
- num_kv_heads: int = 0
- original_max_position_embeddings: int = 0
- parallel_attention: bool = False
- partial_rotary_factor: float = 0
- post_feedforward_layernorm: LayernormConfig = None
- post_layernorm: LayernormConfig = None
- pre_feedforward_layernorm: LayernormConfig = None
- quantization: str | None = None
- query_pre_attn_scalar: float = 0
- qwen_type: str = ''
- recurrent: RecurrentConfig = None
- rel_attn_max_distance: int = 0
- rel_attn_num_buckets: int = 0
- residual_layernorm: LayernormConfig = None
- rope_ratio: float = 1.0
- rope_scaling: dict = None
- rotary_base: int = 0
- rotary_pct: float = 1.0
- self_attention: AttentionConfig = None
- self_attention_layernorm: LayernormConfig = None
- seq_length: int = 0
- sparse_mixer_epsilon: float = 0
- use_alibi: bool = False
- use_cache: bool = True
- vision_output_dim: int = 0
- class EmbeddingConfig
Bases: object
The embedding layer config.
- __init__(weight=None, quantization=None)
- Parameters:
weight (Tensor) –
quantization (str | None) –
- Return type:
None
Infers the hidden_size from the embedding layer weights shape.
- property local_vocab_size
Infers the vocab_size from the embedding layer weights shape.
- quantization: str | None = None
- weight: Tensor = None
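A small sketch of how the shape-derived property above behaves, assuming the weight is stored as [vocab_size, hidden_size] and local_vocab_size is read from the first dimension of the per-rank weight; the import path is likewise an assumption.

```python
import torch
from modelopt.torch.export.model_config import EmbeddingConfig  # assumed import path

# Per-rank embedding shard: rows are vocab entries, columns are hidden dimensions.
emb = EmbeddingConfig(weight=torch.zeros(32000, 4096))

# local_vocab_size is documented as inferred from the weight shape,
# i.e. presumably weight.shape[0] here.
print(emb.local_vocab_size)  # expected: 32000
```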
- class ExpertConfig
Bases: object
The expert config.
- __init__(fc=None, proj=None)
- Parameters:
fc (LinearConfig) –
proj (LinearConfig) –
- Return type:
None
- fc: LinearConfig = None
- proj: LinearConfig = None
- class LayernormConfig
Bases: object
The layernorm layer config.
- __init__(quantization=None, weight=None, bias=None, layernorm_type='', eps=1e-05)
- Parameters:
quantization (str | None) –
weight (Tensor) –
bias (Tensor) –
layernorm_type (str) –
eps (float) –
- Return type:
None
- bias: Tensor = None
- eps: float = 1e-05
- layernorm_type: str = ''
- quantization: str | None = None
- weight: Tensor = None
- class LinearActConfig
Bases: object
The linear + activation layer config.
- __init__(linear=None, hidden_act='')
- Parameters:
linear (LinearConfig) –
hidden_act (str) –
- Return type:
None
- linear: LinearConfig = None
- class LinearConfig
Bases: object
The linear layer config.
- __init__(quantization=None, linear_type='column', weight=None, bias=None, activation_scaling_factor=None, weights_scaling_factor=None, weights_scaling_factor_2=None, prequant_scaling_factor=None, awq_block_size=0)
- Parameters:
quantization (str | None) –
linear_type (str) –
weight (Tensor) –
bias (Tensor) –
activation_scaling_factor (Tensor) –
weights_scaling_factor (Tensor) –
weights_scaling_factor_2 (Tensor) –
prequant_scaling_factor (Tensor) –
awq_block_size (int) –
- Return type:
None
- activation_scaling_factor: Tensor = None
- awq_block_size: int = 0
- bias: Tensor = None
- linear_type: str = 'column'
- prequant_scaling_factor: Tensor = None
- quantization: str | None = None
- weight: Tensor = None
- weights_scaling_factor: Tensor = None
- weights_scaling_factor_2: Tensor = None
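For quantized checkpoints, the scaling-factor fields above travel with the weight. The sketch below shows what an FP8-calibrated projection might look like; the quantization string, scale shapes, and import path are assumptions, not values specified on this page.

```python
import torch
from modelopt.torch.export.model_config import LinearConfig  # assumed import path

# Hypothetical FP8-quantized projection with per-tensor calibration scales.
fc = LinearConfig(
    quantization="fp8",                              # assumed label for FP8 quantization
    linear_type="column",                            # documented default
    weight=torch.randn(4096, 4096, dtype=torch.float16),
    activation_scaling_factor=torch.tensor([0.02]),  # illustrative activation scale
    weights_scaling_factor=torch.tensor([0.01]),     # illustrative weight scale
)
```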
- class MLPConfig
Bases: object
The MLP layer config.
- __init__(fc=None, gate=None, proj=None, hidden_act='', merged_fc1_gate=False)
- Parameters:
fc (LinearConfig) –
gate (LinearConfig) –
proj (LinearConfig) –
hidden_act (str) –
merged_fc1_gate (bool) –
- Return type:
None
- fc: LinearConfig = None
- gate: LinearConfig = None
- merged_fc1_gate: bool = False
- proj: LinearConfig = None
- property quantization
Returns the merged gate and fc1 quantization.
- class MOEConfig
Bases: object
The Mixture of Experts (MOE) layer config.
- __init__(router=None, experts=None, hidden_act='')
- Parameters:
router (LinearConfig) –
experts (ExpertConfig) –
hidden_act (str) –
- Return type:
None
- experts: ExpertConfig = None
- property fc
Returns the fc module from the experts.
- router: LinearConfig = None
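A sketch of how the MoE pieces documented above fit together. Real checkpoints typically stack expert weights into a single tensor per projection, but the exact layout is not specified here, so the shapes below are purely illustrative and the import path is assumed.

```python
import torch
from modelopt.torch.export.model_config import ExpertConfig, LinearConfig, MOEConfig  # assumed import path

num_experts, hidden, ffn = 8, 4096, 14336  # toy sizes

moe = MOEConfig(
    router=LinearConfig(weight=torch.randn(num_experts, hidden)),         # routing logits, one row per expert
    experts=ExpertConfig(
        fc=LinearConfig(weight=torch.randn(num_experts, ffn, hidden)),    # stacked expert fc weights (illustrative layout)
        proj=LinearConfig(weight=torch.randn(num_experts, hidden, ffn)),  # stacked expert proj weights (illustrative layout)
    ),
    hidden_act="swiglu",  # illustrative value
)

moe.fc  # documented property: returns the fc module from the experts
```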
- class MedusaHeadConfig
Bases: object
The decoder layer config.
- __init__(medusa_layers=None, lm_head=None)
- Parameters:
medusa_layers (List[LinearActConfig]) –
lm_head (LinearConfig) –
- Return type:
None
- lm_head: LinearConfig = None
- medusa_layers: List[LinearActConfig] = None
- class ModelConfig
Bases: object
The full LLM model config that includes all the information needed to build the tensorrt_llm engine.
This class includes all the fields that tensorrt_llm supports, but not all of them are required. pipeline_parallel > 1 is only supported for TensorRT-LLM checkpoints.
- __init__(architecture='', quantization=None, dtype='float16', vocab_size=0, rank=0, tensor_parallel=1, pipeline_parallel=1, vocab_embedding=None, position_embedding=None, block_embedding=None, ln_embed=None, layers=<factory>, ln_f=None, lm_head=None, share_embedding_table=False, medusa_heads=None, num_medusa_heads=0, num_medusa_layers=0, enc_dec='', encoder_hidden_size=0, encoder_num_heads=0, encoder_head_size=0)
- Parameters:
architecture (str) –
quantization (str) –
dtype (str) –
vocab_size (int) –
rank (int) –
tensor_parallel (int) –
pipeline_parallel (int) –
vocab_embedding (EmbeddingConfig) –
position_embedding (EmbeddingConfig) –
block_embedding (EmbeddingConfig) –
ln_embed (LayernormConfig) –
layers (List[DecoderLayerConfig]) –
ln_f (LayernormConfig) –
lm_head (LinearConfig) –
share_embedding_table (bool) –
medusa_heads (List[MedusaHeadConfig]) –
num_medusa_heads (int) –
num_medusa_layers (int) –
enc_dec (str) –
encoder_hidden_size (int) –
encoder_num_heads (int) –
encoder_head_size (int) –
- Return type:
None
- architecture: str = ''
- block_embedding: EmbeddingConfig = None
- dtype: str = 'float16'
- enc_dec: str = ''
- encoder_head_size: int = 0
- encoder_num_heads: int = 0
Returns the hidden_act of the model.
Returns the hidden_size of the model.
- layers: List[DecoderLayerConfig]
- lm_head: LinearConfig = None
- ln_embed: LayernormConfig = None
- ln_f: LayernormConfig = None
- property max_position_embeddings
Returns the max_position_embedding of the model.
- medusa_heads: List[MedusaHeadConfig] = None
- property num_attention_heads
Returns the num_attention_heads of the model.
- property num_kv_heads
Returns the num_key_value_heads of the model.
- num_medusa_heads: int = 0
- num_medusa_layers: int = 0
- pipeline_parallel: int = 1
- position_embedding: EmbeddingConfig = None
- quantization: str = None
- rank: int = 0
- tensor_parallel: int = 1
- vocab_embedding: EmbeddingConfig = None
- vocab_size: int = 0
- property vocab_size_padded
Returns the padded vocab_size of the model, rounded up to a multiple of tensor_parallel.
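The padding performed by vocab_size_padded is plain round-up arithmetic. A sketch of the rule as inferred from the description above (the exact rounding the library applies is an assumption):

```python
import math

def vocab_size_padded(vocab_size: int, tensor_parallel: int) -> int:
    # Round the vocab size up to the next multiple of tensor_parallel so the
    # embedding / lm_head can be split evenly across tensor-parallel ranks.
    return int(math.ceil(vocab_size / tensor_parallel) * tensor_parallel)

vocab_size_padded(32003, tensor_parallel=8)  # -> 32008
```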
- class QKVConfig
Bases: object
The QKV layer config.
- __init__(q=None, k=None, v=None)
- Parameters:
q (LinearConfig) –
k (LinearConfig) –
v (LinearConfig) –
- Return type:
None
- property activation_scaling_factor
Returns the activation_scaling_factor merged across Q, K and V.
Returns the maximum of the Q, K, V activation scaling factors.
- property awq_block_size
Returns the awq_block_size of this QKV layer.
- property bias
The generated linear layer bias.
The Q, K, V biases are concatenated together to fit the TensorRT-LLM QKV linear layer.
- k: LinearConfig = None
- property prequant_scaling_factor
Returns the prequant_scaling_factor merged across Q, K and V.
The prequant scaling factors of Q, K, V should be identical, so one of them is returned.
- q: LinearConfig = None
- property quantization
Returns the quantization format of this QKV layer.
- v: LinearConfig = None
- property weight
The generated linear layer weight.
The Q, K, V weights are concatenated together to fit the TensorRT-LLM QKV linear layer.
- property weights_scaling_factor
Returns the weights_scaling_factor merged across Q, K and V.
If the quantization is FP8, the maximum of the Q, K, V weight scaling factors is returned. If the quantization is INT8_SQ, the concatenated value is returned.
- property weights_scaling_factor_2
Returns the weights_scaling_factor_2 merged across Q, K and V.
weight_scaling_factor_2 is needed for W4A8 AWQ.
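The weight and bias properties above describe a fused QKV tensor. Below is a hedged sketch of the concatenation they imply; the concatenation axis and the import path are assumptions, the documented point being only that Q, K, and V are joined to fit a single TensorRT-LLM QKV linear layer.

```python
import torch
from modelopt.torch.export.model_config import LinearConfig, QKVConfig  # assumed import path

hidden, kv_dim = 4096, 1024  # toy sizes with GQA-style smaller K/V projections

qkv = QKVConfig(
    q=LinearConfig(weight=torch.randn(hidden, hidden)),
    k=LinearConfig(weight=torch.randn(kv_dim, hidden)),
    v=LinearConfig(weight=torch.randn(kv_dim, hidden)),
)

# Roughly what the fused `weight` property represents: Q, K, V stacked along
# the output dimension so one linear layer produces all three projections.
fused = torch.cat([qkv.q.weight, qkv.k.weight, qkv.v.weight], dim=0)
fused.shape  # torch.Size([6144, 4096])
```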
- class RecurrentConfig
Bases: object
The RecurrentBlock from recurrentgemma.
- __init__(linear_y=None, y_bias=None, linear_x=None, linear_out=None, conv1d=None, rg_lru=None)
- Parameters:
linear_y (LinearConfig) –
y_bias (Tensor) –
linear_x (LinearConfig) –
linear_out (LinearConfig) –
conv1d (ConvConfig) –
rg_lru (RgLruConfig) –
- Return type:
None
- conv1d: ConvConfig = None
- linear_out: LinearConfig = None
- linear_x: LinearConfig = None
- linear_y: LinearConfig = None
- rg_lru: RgLruConfig = None
- y_bias: Tensor = None
- class RelativeAttentionTableConfig
Bases: object
The relative attention table config. Used for splitting.
- __init__(weight=None)
- Parameters:
weight (Tensor) –
- Return type:
None
- weight: Tensor = None
- class RgLruConfig
Bases: object
The RG LRU from recurrentgemma.
- __init__(recurrent_param=None, input_gate=None, recurrent_gate=None)
- Parameters:
recurrent_param (Tensor) –
input_gate (LinearConfig) –
recurrent_gate (LinearConfig) –
- Return type:
None
- input_gate: LinearConfig = None
- recurrent_gate: LinearConfig = None
- recurrent_param: Tensor = None