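Common Customizations

Quantization

TensorRT-LLM can automatically quantize Hugging Face models when you set the appropriate flags on the LLM instance. For example, the following code triggers Int4 AWQ quantization of the model. Refer to the complete list of supported flags and acceptable values.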

from tensorrt_llm.llmapi import LLM, QuantConfig, QuantAlgo

quant_config = QuantConfig(quant_algo=QuantAlgo.W4A16_AWQ)

llm = LLM(<model-dir>, quant_config=quant_config)
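To make the W4A16 name more concrete: weights are stored as 4-bit integers while activations stay in 16-bit floating point. The block below is a minimal pure-Python sketch of group-wise 4-bit round-to-nearest weight quantization. It is not the AWQ algorithm itself (AWQ additionally rescales salient channels using activation statistics) and it is not TensorRT-LLM code. The function names and the group size of 4 are hypothetical, chosen only for illustration.

```python
from typing import List, Tuple


def quantize_group(weights: List[float]) -> Tuple[List[int], float, int]:
    """Asymmetric 4-bit quantization of one weight group.

    Returns (codes, scale, zero_point); every code lies in [0, 15].
    """
    lo, hi = min(weights), max(weights)
    # Fall back to a scale of 1.0 when all weights in the group are equal.
    scale = (hi - lo) / 15 or 1.0
    zero_point = round(-lo / scale)
    codes = [min(15, max(0, round(w / scale) + zero_point)) for w in weights]
    return codes, scale, zero_point


def dequantize_group(codes: List[int], scale: float, zero_point: int) -> List[float]:
    """Map 4-bit codes back to approximate floating-point weights."""
    return [(c - zero_point) * scale for c in codes]


def quantize_weights(weights: List[float], group_size: int = 4):
    """Quantize a flat weight list group by group (hypothetical group size)."""
    return [
        quantize_group(weights[i:i + group_size])
        for i in range(0, len(weights), group_size)
    ]


if __name__ == "__main__":
    example = [0.12, -0.53, 0.98, 0.05, -1.20, 0.33, 0.71, -0.08]
    for codes, scale, zero_point in quantize_weights(example):
        print(codes, dequantize_group(codes, scale, zero_point))
```

Each group keeps its own scale and zero point, so the reconstruction error of any weight stays within about half a quantization step of that group.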

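Sampling

SamplingParams lets you customize the sampling strategy that controls the responses the LLM generates, for example beam search, temperature, and other settings.

For example, to enable beam search with a beam width of 4, set sampling_params as follows: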

from tensorrt_llm.llmapi import LLM, SamplingParams, BuildConfig

build_config = BuildConfig()
build_config.max_beam_width = 4

llm = LLM(<llama_model_path>, build_config=build_config)
# Let the LLM object generate text with the default sampling strategy, or
# you can create a SamplingParams object as well with several fields set manually
sampling_params = SamplingParams(beam_width=4) # current limitation: beam_width should be equal to max_beam_width

for output in llm.generate(<prompt>, sampling_params=sampling_params):
    print(output)
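For readers unfamiliar with what the beam width controls, the following is a toy beam search in pure Python. It is only a sketch of the general technique, not how TensorRT-LLM implements it. `toy_model` is a hypothetical stand-in that returns a fixed next-token distribution over a three-token vocabulary.

```python
import math
from typing import Callable, Dict, List, Tuple

Beam = Tuple[Tuple[int, ...], float]  # (token sequence, cumulative log-probability)


def beam_search(next_token_probs: Callable[[Tuple[int, ...]], Dict[int, float]],
                beam_width: int,
                max_new_tokens: int) -> List[Beam]:
    """Keep the beam_width highest-scoring sequences after every step."""
    beams: List[Beam] = [((), 0.0)]
    for _ in range(max_new_tokens):
        candidates: List[Beam] = []
        for tokens, score in beams:
            # Extend every surviving sequence by every possible next token.
            for token, prob in next_token_probs(tokens).items():
                candidates.append((tokens + (token,), score + math.log(prob)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


def toy_model(tokens: Tuple[int, ...]) -> Dict[int, float]:
    """Hypothetical next-token distribution that depends on the last token only."""
    if tokens and tokens[-1] == 0:
        return {0: 0.1, 1: 0.6, 2: 0.3}
    return {0: 0.5, 1: 0.3, 2: 0.2}


if __name__ == "__main__":
    for tokens, score in beam_search(toy_model, beam_width=4, max_new_tokens=3):
        print(tokens, round(math.exp(score), 4))
```

A larger beam width explores more candidate sequences per step at a proportionally higher compute and memory cost, which is why the engine has to be built with a matching max_beam_width.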

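SamplingParams manages its fields and dispatches them to the corresponding C++ classes.

Refer to the class documentation for more details.

Build Configuration

Apart from the arguments mentioned above, you can also customize the build configuration with the BuildConfig class (passed as build_config) and with other arguments borrowed from the trtllm-build CLI. These build configuration options give you flexibility when building engines for the target hardware and use cases. Refer to the following example: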

from tensorrt_llm.llmapi import LLM, BuildConfig

llm = LLM(<model-path>,
          build_config=BuildConfig(
            max_num_tokens=4096,
            max_batch_size=128,
            max_beam_width=4))

For more details, refer to the BuildConfig documentation.
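As a rough way to reason about the three limits in the example above, the sketch below checks a batch of prompts against them. It is a deliberately simplified reading, not TensorRT-LLM behavior: `BuildLimits` and `exceeded_limits` are hypothetical names, the helper assumes all prompts must fit in a single scheduling step, and it applies the stated limitation that beam_width must equal max_beam_width. The real runtime can schedule requests across several iterations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BuildLimits:
    """Hypothetical stand-in for the three BuildConfig fields shown above."""
    max_num_tokens: int = 4096
    max_batch_size: int = 128
    max_beam_width: int = 4


def exceeded_limits(limits: BuildLimits,
                    prompt_lengths: List[int],
                    beam_width: int) -> List[str]:
    """Return the names of the limits a batch would exceed (empty if none)."""
    problems = []
    if len(prompt_lengths) > limits.max_batch_size:
        problems.append("max_batch_size")
    if sum(prompt_lengths) > limits.max_num_tokens:
        problems.append("max_num_tokens")
    # Assumption taken from the sampling example: the widths must match exactly.
    if beam_width != limits.max_beam_width:
        problems.append("max_beam_width")
    return problems


if __name__ == "__main__":
    limits = BuildLimits()
    print(exceeded_limits(limits, [1000, 2000], beam_width=4))
    print(exceeded_limits(limits, [3000, 2000], beam_width=4))
```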

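Runtime Customization

Similar to build_config, you can also customize the runtime configuration with runtime_config, peft_cache_config, or other arguments borrowed from the Executor API. These runtime configuration options provide additional flexibility for KV cache management, GPU memory allocation, and related settings. Refer to the following example: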

from tensorrt_llm.llmapi import LLM, KvCacheConfig

llm = LLM(<llama_model_path>,
          kv_cache_config=KvCacheConfig(
            free_gpu_memory_fraction=0.8))
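To get a feel for what a fraction such as 0.8 means in practice, the following back-of-the-envelope sketch estimates how many tokens a KV cache budget can hold. It uses the standard per-token KV cache size (one key and one value vector per layer and per KV head). The model dimensions and the 40 GiB of free memory are hypothetical, and the sketch assumes the fraction applies to the GPU memory that is free when the cache is allocated. Actual capacity also depends on details such as the cache block size and the cache data type.

```python
def kv_cache_token_capacity(free_gpu_bytes: int,
                            free_gpu_memory_fraction: float,
                            num_layers: int,
                            num_kv_heads: int,
                            head_dim: int,
                            bytes_per_element: int = 2) -> int:
    """Estimate how many tokens fit into the KV cache budget."""
    budget = int(free_gpu_bytes * free_gpu_memory_fraction)
    # The factor 2 accounts for storing both a key and a value per token.
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
    return budget // bytes_per_token


if __name__ == "__main__":
    GIB = 1024 ** 3
    # Hypothetical model: 32 layers, 32 KV heads, head dimension 128, FP16 cache.
    capacity = kv_cache_token_capacity(40 * GIB, 0.8, 32, 32, 128)
    print(f"approximate KV cache capacity: {capacity} tokens")
```

Lowering the fraction leaves more memory for activations and other allocations, while raising it allows more concurrent tokens to stay cached.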

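Tokenizer Customization

By default, the LLM API uses AutoTokenizer from transformers. You can override it by passing your own tokenizer when creating the LLM object. Refer to the following example: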

llm = LLM(<llama_model_path>, tokenizer=<my_faster_one>)

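The LLM() workflow then uses your tokenizer.

You can also pass token IDs directly instead of text, as in the following code. Because the tokenizer is not used, the code produces token IDs without text.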

llm = LLM(<llama_model_path>)

# A batch containing one prompt, given as a list of token IDs
for output in llm.generate([[32, 12]]):
    ...

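Disable Tokenizer

For performance reasons, you can disable the tokenizer by passing skip_tokenizer_init=True when creating the LLM. In this case, LLM.generate and LLM.generate_async expect prompt token IDs as input. Refer to the following example: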

llm = LLM(<llama_model_path>, skip_tokenizer_init=True)
# Prompts must be token IDs because no tokenizer is loaded
for output in llm.generate([[32, 12]]):
    print(output)

You will get output similar to the following:

RequestOutput(request_id=1, prompt=None, prompt_token_ids=[1, 15043, 29892, 590, 1024, 338], outputs=[CompletionOutput(index=0, text='', token_ids=[518, 10858, 4408, 29962, 322, 306, 626, 263, 518, 10858, 20627, 29962, 472, 518, 10858, 6938, 1822, 306, 626, 5007, 304, 4653, 590, 4066, 297, 278, 518, 11947, 18527, 29962, 2602, 472], cumulative_logprob=None, logprobs=[])], finished=True)

Note that the text field in CompletionOutput is empty because the tokenizer is disabled.
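When the tokenizer is disabled, turning token_ids back into text is left to your own code. The sketch below shows one way this could look. `FakeRequestOutput` and `FakeCompletionOutput` are hypothetical stand-ins that mirror only the fields visible in the printed result above, and the three-entry vocabulary is a toy, not a real tokenizer.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class FakeCompletionOutput:
    """Stand-in with the fields shown in the printed CompletionOutput."""
    index: int
    text: str
    token_ids: List[int]


@dataclass
class FakeRequestOutput:
    """Stand-in with the fields shown in the printed RequestOutput."""
    request_id: int
    prompt: Optional[str]
    prompt_token_ids: List[int]
    outputs: List[FakeCompletionOutput] = field(default_factory=list)


def decode_outputs(result: FakeRequestOutput, id_to_piece: Dict[int, str]) -> List[str]:
    """Decode every completion with an external vocabulary lookup."""
    return [
        "".join(id_to_piece.get(token_id, "<unk>") for token_id in completion.token_ids)
        for completion in result.outputs
    ]


if __name__ == "__main__":
    toy_vocab = {1: "Hello", 2: ",", 3: " world"}  # hypothetical toy vocabulary
    result = FakeRequestOutput(
        request_id=1,
        prompt=None,
        prompt_token_ids=[32, 12],
        outputs=[FakeCompletionOutput(index=0, text="", token_ids=[1, 2, 3])],
    )
    print(decode_outputs(result, toy_vocab))
```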

Generation

Asyncio-Based Generation

With the LLM API, you can also perform asynchronous generation with the generate_async method. Refer to the following example:

import asyncio

llm = LLM(model=<llama_model_path>)

async def main():
    # async for is only valid inside a coroutine
    async for output in llm.generate_async(<prompt>, streaming=True):
        print(output)

asyncio.run(main())

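When the streaming flag is set to True, the generate_async method returns a generator that yields each token as soon as it is available. Otherwise, it returns a generator that waits for the final result and yields only that.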
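The difference between the two modes can be reproduced without a model. In the sketch below, `fake_generate_async` is a hypothetical stand-in for a generation call, written with plain asyncio. It assumes that each streamed item carries the cumulative text so far, which this section does not confirm for the real API.

```python
import asyncio
from typing import AsyncIterator, List


async def fake_generate_async(tokens: List[str], streaming: bool) -> AsyncIterator[str]:
    """Yield cumulative text after every token, or only once at the end."""
    text = ""
    for token in tokens:
        await asyncio.sleep(0)  # pretend one decoding step happened
        text += token
        if streaming:
            yield text
    if not streaming:
        yield text


async def collect(streaming: bool) -> List[str]:
    """Gather everything the generator yields into a list."""
    return [chunk async for chunk in fake_generate_async(["Hel", "lo", "!"], streaming)]


if __name__ == "__main__":
    print(asyncio.run(collect(streaming=True)))
    print(asyncio.run(collect(streaming=False)))
```

With streaming enabled the consumer sees one item per decoding step. Without it, the loop body runs exactly once, after generation has finished.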

Future-Style Generation

The result of the generate_async method is a Future-like object. It does not block the thread unless .result() is called.

# This will not block the main thread
generation = llm.generate_async(<prompt>)
# Do something else here
# call .result() to explicitly block the main thread and wait for the result when needed
output = generation.result()

The .result() method works like the result method of a Python Future, so you can specify a timeout for waiting on the result.

output = generation.result(timeout=10)
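For reference, this is how the timeout behaves on a standard-library Future. The sketch uses concurrent.futures only. `slow_generation` is a hypothetical stand-in for a long-running request, and whether the LLM API raises the same exception type on timeout is an assumption, not something this section states.

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeoutError
from typing import Optional, Tuple


def slow_generation(prompt: str, delay_s: float) -> str:
    """Hypothetical stand-in for a request that takes a while to finish."""
    time.sleep(delay_s)
    return prompt.upper()


def result_or_none(future: Future, timeout_s: float) -> Optional[str]:
    """Wait up to timeout_s seconds; return None if the result is not ready."""
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeoutError:
        return None


def demo() -> Tuple[Optional[str], Optional[str]]:
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_generation, "hi", 0.2)
        too_early = result_or_none(future, timeout_s=0.01)  # gives up quickly
        finished = result_or_none(future, timeout_s=5.0)    # waits long enough
    return too_early, finished


if __name__ == "__main__":
    print(demo())
```

A timeout only stops the waiting. The underlying work keeps running, so a later call to result() can still return the finished output.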

There is also an async version, which uses .aresult().

generation = llm.generate_async(<prompt>)
output = await generation.aresult()