Inference ¶
LLaMA-Factory supports multiple inference methods.
You can run llamafactory-cli chat inference_config.yaml or llamafactory-cli webchat inference_config.yaml to perform inference and chat with the model. For the original model, the configuration file only needs to specify model_name_or_path and template; for a fine-tuned model, also specify adapter_name_or_path and finetuning_type.
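For example, either of the following commands starts a conversation using the examples/inference/llama3.yaml configuration described in the next section (chat runs in the terminal, webchat opens a web UI):
llamafactory-cli chat examples/inference/llama3.yaml
llamafactory-cli webchat examples/inference/llama3.yaml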
If you want to feed a large dataset to the model and save the inference results, you can start the vLLM inference engine to perform fast batch inference over the dataset. Alternatively, you can deploy an API service and perform batch inference through API calls.
By default, model inference uses the HuggingFace engine. You can also specify infer_backend: vllm to use the vLLM inference engine for faster inference.
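For example, a minimal configuration that switches to the vLLM backend only differs in the infer_backend field (the model and template mirror the examples below):
### example: switching to the vLLM backend
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
template: llama3
infer_backend: vllm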
Remarks
Whichever inference method you use, model_name_or_path must point to an existing model and match template.
Original Model Inference Configuration¶
For inference with the original model, inference_config.yaml only needs to specify the original model's model_name_or_path and template.
### examples/inference/llama3.yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
template: llama3
infer_backend: huggingface #choices: [huggingface, vllm]
Fine-tuned Model Inference Configuration¶
For fine-tuned model inference, in addition to the original model and template, you also need to specify the adapter path adapter_name_or_path and the fine-tuning type finetuning_type.
### examples/inference/llama3_lora_sft.yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft
template: llama3
finetuning_type: lora
infer_backend: huggingface #choices: [huggingface, vllm]
Multimodal Models¶
For multimodal models, you can run the following command for inference.
llamafactory-cli webchat examples/inference/llava1_5.yaml
The configuration example of examples/inference/llava1_5.yaml is as follows:
model_name_or_path: llava-hf/llava-1.5-7b-hf
template: vicuna
infer_backend: huggingface #choices: [huggingface, vllm]
Batch Inference¶
Dataset ¶
You can start the vLLM inference engine and perform batch inference over a dataset with the following command:
python scripts/vllm_infer.py --model_name_or_path path_to_merged_model --dataset alpaca_en_demo
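If you prefer to call vLLM directly instead of the helper script, a minimal offline batch-generation sketch looks like the following; the prompts and sampling settings are illustrative, and the model path follows the command above:
# vllm_batch_sketch.py (illustrative)
from vllm import LLM, SamplingParams

# Placeholder prompts; in practice, build these from your dataset (e.g. alpaca_en_demo).
prompts = ["Who are you?", "Explain LoRA in one sentence."]

sampling_params = SamplingParams(temperature=0.7, max_tokens=256)
llm = LLM(model="path_to_merged_model")  # same merged model as in the command above
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.prompt, output.outputs[0].text)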
API¶
To perform batch inference through the API, you only need to specify the model, the adapter (optional), the template, and the fine-tuning method in the configuration file.
Here is an example of a configuration file:
### examples/inference/llama3_lora_sft.yaml
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft
template: llama3
finetuning_type: lora
The following shows how to start and call the API service. Run API_PORT=8000 CUDA_VISIBLE_DEVICES=0 llamafactory-cli api examples/inference/llama3_lora_sft.yaml to start the API service, then use the example program below to call it:
# api_call_example.py
from openai import OpenAI

# Point the OpenAI client at the local LLaMA-Factory API service.
client = OpenAI(api_key="0", base_url="http://0.0.0.0:8000/v1")

# Send a single chat request and print the model's reply.
messages = [{"role": "user", "content": "Who are you?"}]
result = client.chat.completions.create(messages=messages, model="meta-llama/Meta-Llama-3-8B-Instruct")
print(result.choices[0].message)
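To batch-process a dataset through the API, you can loop over your prompts and collect the responses. A minimal sketch, where the prompts are placeholders for your own data:
# batch_api_call_example.py (illustrative)
from openai import OpenAI

client = OpenAI(api_key="0", base_url="http://0.0.0.0:8000/v1")

# Replace with prompts loaded from your dataset.
prompts = ["Who are you?", "What can you do?"]

results = []
for prompt in prompts:
    messages = [{"role": "user", "content": prompt}]
    result = client.chat.completions.create(
        messages=messages, model="meta-llama/Meta-Llama-3-8B-Instruct"
    )
    results.append(result.choices[0].message.content)

print(results)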