Evaluation

General Competency Assessment

After training completes, you can evaluate the model's performance by running llamafactory-cli eval examples/train_lora/llama3_lora_eval.yaml.

The example configuration file examples/train_lora/llama3_lora_eval.yaml is as follows:

### examples/train_lora/llama3_lora_eval.yaml
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft # optional

### method
finetuning_type: lora

### dataset
task: mmlu_test # mmlu_test, ceval_validation, cmmlu_test
template: fewshot
lang: en
n_shot: 5

### output
save_dir: saves/llama3-8b/lora/eval

### eval
batch_size: 4
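The template: fewshot and n_shot: 5 settings control how each benchmark question is presented to the model: several solved examples are prepended before the target question. A minimal sketch of how such a prompt can be assembled (the build_fewshot_prompt helper and the example questions are hypothetical illustrations, not LLaMA-Factory's actual template code):

```python
# Hypothetical sketch of n-shot prompt assembly. The questions below are
# made up; real support examples come from MMLU/C-Eval/CMMLU datasets.

def build_fewshot_prompt(examples, question, n_shot=5):
    """Prepend up to n_shot solved examples before the target question."""
    parts = []
    for q, choices, answer in examples[:n_shot]:
        block = q + "\n"
        for label, choice in zip("ABCD", choices):
            block += f"{label}. {choice}\n"
        block += f"Answer: {answer}"
        parts.append(block)
    # The target question is formatted the same way, but the answer is
    # left blank for the model to complete.
    q, choices = question
    block = q + "\n"
    for label, choice in zip("ABCD", choices):
        block += f"{label}. {choice}\n"
    block += "Answer:"
    parts.append(block)
    return "\n\n".join(parts)

examples = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("Which planet is closest to the Sun?",
     ["Venus", "Earth", "Mercury", "Mars"], "C"),
]
prompt = build_fewshot_prompt(
    examples, ("What is 3 * 3?", ["6", "9", "12", "8"]), n_shot=2)
print(prompt)
```

The model's completion after the final "Answer:" is then compared against the gold choice label to score the task.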

NLG Evaluation

In addition, you can compute the model's BLEU and ROUGE scores to evaluate its generation quality by running llamafactory-cli train examples/extras/nlg_eval/llama3_lora_predict.yaml.

The example configuration file examples/extras/nlg_eval/llama3_lora_predict.yaml is as follows:

### examples/extras/nlg_eval/llama3_lora_predict.yaml
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft

### method
stage: sft
do_predict: true
finetuning_type: lora

### dataset
eval_dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 2048
max_samples: 50
overwrite_cache: true
preprocessing_num_workers: 16

### output
output_dir: saves/llama3-8b/lora/predict
overwrite_output_dir: true

### eval
per_device_eval_batch_size: 1
predict_with_generate: true
ddp_timeout: 180000000

Similarly, you can specify the model and dataset directly on the command line, e.g. python scripts/vllm_infer.py --model_name_or_path path_to_merged_model --dataset alpaca_en_demo, to use the vLLM inference framework for faster inference.
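The BLEU score reported by the predict stage measures n-gram overlap between the generated text and the reference answer. A rough, self-contained sketch of a sentence-level BLEU-style score (simplified to unigram/bigram precision with a brevity penalty; this is an illustration, not the exact implementation LLaMA-Factory uses):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=2):
    """Simplified BLEU: geometric mean of clipped n-gram precisions
    (n = 1..max_n) times a brevity penalty. Illustrative only."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_ngrams = ngrams(cand, n)
        ref_ngrams = ngrams(ref, n)
        # Clipped overlap: each n-gram counts at most as often as in the reference.
        overlap = sum((cand_ngrams & ref_ngrams).values())
        total = max(sum(cand_ngrams.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * geo_mean

score = simple_bleu("the cat sat on the mat", "the cat sat on the mat")
print(score)  # identical strings score 1.0
```

ROUGE works analogously but is recall-oriented: it measures how much of the reference is covered by the candidate.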

Evaluation-related parameters

EvalArguments

| Parameter Name | Type | Description |
| --- | --- | --- |
| task | str | The name of the evaluation task. Options: mmlu_test, ceval_validation, cmmlu_test. |
| task_dir | str | The folder containing the evaluation datasets. Default: evaluation. |
| batch_size | int | The batch size used per GPU. Default: 4. |
| seed | int | The random seed for the data loader. Default: 42. |
| lang | str | The language used for evaluation. Options: en, zh. Default: en. |
| n_shot | int | The number of few-shot examples. Default: 5. |
| save_dir | str | The path where evaluation results are saved. Default: None. If the path already exists, an error is raised. |
| download_mode | str | The download mode for the evaluation dataset. Default: DownloadMode.REUSE_DATASET_IF_EXISTS, which reuses the dataset if it already exists and downloads it otherwise. |
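The parameters above can be pictured as a small dataclass. The sketch below is an illustrative stand-in that mirrors the documented defaults, not LLaMA-Factory's actual EvalArguments definition (in particular, the default value shown for task is an assumption):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative stand-in for the parameter table above;
# NOT the actual EvalArguments class from LLaMA-Factory.
@dataclass
class EvalArgumentsSketch:
    task: str = "mmlu_test"          # mmlu_test, ceval_validation, cmmlu_test (default assumed)
    task_dir: str = "evaluation"     # folder containing the evaluation datasets
    batch_size: int = 4              # per-GPU batch size
    seed: int = 42                   # data loader random seed
    lang: str = "en"                 # en or zh
    n_shot: int = 5                  # number of few-shot examples
    save_dir: Optional[str] = None   # errors out if the path already exists
    download_mode: str = "reuse_dataset_if_exists"

# Override only what differs from the defaults.
args = EvalArgumentsSketch(task="ceval_validation", lang="zh")
print(args.batch_size, args.lang)  # 4 zh
```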