Evaluation

General Competency Assessment

After training completes, you can evaluate the model's performance by running llamafactory-cli eval examples/train_lora/llama3_lora_eval.yaml.

The example configuration file examples/train_lora/llama3_lora_eval.yaml is as follows:

### examples/train_lora/llama3_lora_eval.yaml
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft # optional

### method
finetuning_type: lora

### dataset
task: mmlu_test # mmlu_test, ceval_validation, cmmlu_test
template: fewshot
lang: en
n_shot: 5

### output
save_dir: saves/llama3-8b/lora/eval

### eval
batch_size: 4
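The template: fewshot and n_shot: 5 settings control how each benchmark question is presented to the model: several solved examples are prepended before the target question. A minimal sketch of how such a prompt can be assembled (the build_fewshot_prompt helper and the example questions are hypothetical illustrations, not LLaMA-Factory's actual template code):

```python
# Hypothetical sketch of n-shot prompt assembly. The questions below are
# made up; real support examples come from MMLU/C-Eval/CMMLU datasets.

def build_fewshot_prompt(examples, question, n_shot=5):
    """Prepend up to n_shot solved examples before the target question."""
    parts = []
    for q, choices, answer in examples[:n_shot]:
        block = q + "\n"
        for label, choice in zip("ABCD", choices):
            block += f"{label}. {choice}\n"
        block += f"Answer: {answer}"
        parts.append(block)
    # The target question is formatted the same way, but the answer is
    # left blank for the model to complete.
    q, choices = question
    block = q + "\n"
    for label, choice in zip("ABCD", choices):
        block += f"{label}. {choice}\n"
    block += "Answer:"
    parts.append(block)
    return "\n\n".join(parts)

examples = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("Which planet is closest to the Sun?",
     ["Venus", "Earth", "Mercury", "Mars"], "C"),
]
prompt = build_fewshot_prompt(
    examples, ("What is 3 * 3?", ["6", "9", "12", "8"]), n_shot=2)
print(prompt)
```

The model's completion after the final "Answer:" is then compared against the gold choice label to score the task.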

NLG Evaluation

In addition, you can compute the model's BLEU and ROUGE scores to evaluate its generation quality by running llamafactory-cli train examples/extras/nlg_eval/llama3_lora_predict.yaml.

The example configuration file examples/extras/nlg_eval/llama3_lora_predict.yaml is as follows:

### examples/extras/nlg_eval/llama3_lora_predict.yaml
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft

### method
stage: sft
do_predict: true
finetuning_type: lora

### dataset
eval_dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 2048
max_samples: 50
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/lora/predict
overwrite_output_dir: true

### eval
per_device_eval_batch_size: 1
predict_with_generate: true
ddp_timeout: 180000000

Similarly, you can specify the model and dataset directly on the command line, e.g. python scripts/vllm_infer.py --model_name_or_path path_to_merged_model --dataset alpaca_en_demo, to use the vLLM inference framework for faster inference.
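The BLEU score reported by the predict stage measures n-gram overlap between the generated text and the reference answer. A rough, self-contained sketch of a sentence-level BLEU-style score (simplified to unigram/bigram precision with a brevity penalty; this is an illustration, not the exact implementation LLaMA-Factory uses):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. Illustrative only."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * geo_mean

score = simple_bleu("the cat sat on the mat", "the cat sat on the mat")
print(score)  # identical strings score 1.0
```

ROUGE works analogously but is recall-oriented: it measures how much of the reference is covered by the candidate.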

Evaluation-related parameters

EvalArguments

| Parameter Name | Type | Description |
| --- | --- | --- |
| task | str | The name of the evaluation task. Options: mmlu_test, ceval_validation, cmmlu_test. |
| task_dir | str | The folder containing the evaluation datasets. Default: evaluation. |
| batch_size | int | The batch size used per GPU. Default: 4. |
| seed | int | The random seed for the data loader. Default: 42. |
| lang | str | The language used for evaluation. Options: en, zh. Default: en. |
| n_shot | int | The number of few-shot examples. Default: 5. |
| save_dir | str | The path where evaluation results are saved. Default: None. If the path already exists, an error is raised. |
| download_mode | str | The download mode for the evaluation dataset. Default: DownloadMode.REUSE_DATASET_IF_EXISTS, which reuses the dataset if it already exists and downloads it otherwise. |
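The parameters above can be pictured as a small dataclass. The sketch below is an illustrative stand-in that mirrors the documented defaults, not LLaMA-Factory's actual EvalArguments definition (in particular, the default value shown for task is an assumption):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in for the parameter table above;
# NOT the actual EvalArguments class from LLaMA-Factory.
@dataclass
class EvalArgumentsSketch:
    task: str = "mmlu_test"          # mmlu_test, ceval_validation, cmmlu_test (default assumed)
    task_dir: str = "evaluation"     # folder containing the evaluation datasets
    batch_size: int = 4              # per-GPU batch size
    seed: int = 42                   # data loader random seed
    lang: str = "en"                 # en or zh
    n_shot: int = 5                  # number of few-shot examples
    save_dir: Optional[str] = None   # errors out if the path already exists
    download_mode: str = "reuse_dataset_if_exists"

# Override only what differs from the defaults.
args = EvalArgumentsSketch(task="ceval_validation", lang="zh")
print(args.batch_size, args.lang)  # 4 zh
```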