Evaluation
General Competency Assessment
After training is complete, you can evaluate the model's general capabilities by running llamafactory-cli eval examples/train_lora/llama3_lora_eval.yaml.
An example configuration file, examples/train_lora/llama3_lora_eval.yaml, is shown below:
### examples/train_lora/llama3_lora_eval.yaml
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft # optional
### method
finetuning_type: lora
### dataset
task: mmlu_test # mmlu_test, ceval_validation, cmmlu_test
template: fewshot
lang: en
n_shot: 5
### output
save_dir: saves/llama3-8b/lora/eval
### eval
batch_size: 4
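If you want to run the same evaluation on all three supported benchmarks, one option is to generate a YAML file per task and invoke the CLI in a loop. The sketch below reuses the settings and command shown above; the task names come from the parameter table later on this page, while the temporary file names and the save_dir layout are illustrative choices, not fixed conventions.

```python
import subprocess

import yaml  # assumes PyYAML is available

# Base settings copied from examples/train_lora/llama3_lora_eval.yaml above.
base_config = {
    "model_name_or_path": "meta-llama/Meta-Llama-3-8B-Instruct",
    "adapter_name_or_path": "saves/llama3-8b/lora/sft",  # optional
    "finetuning_type": "lora",
    "template": "fewshot",
    "lang": "en",  # for ceval_validation / cmmlu_test you may want "zh"
    "n_shot": 5,
    "batch_size": 4,
}

# Run each supported benchmark with its own config file and output directory.
for task in ("mmlu_test", "ceval_validation", "cmmlu_test"):
    config = dict(base_config, task=task, save_dir=f"saves/llama3-8b/lora/eval_{task}")
    config_path = f"eval_{task}.yaml"  # illustrative temporary file name
    with open(config_path, "w") as f:
        yaml.safe_dump(config, f)
    # Same command as in the text, one run per task.
    subprocess.run(["llamafactory-cli", "eval", config_path], check=True)
```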
NLG Evaluation
In addition, you can obtain BLEU and ROUGE scores to assess the model's generation quality by running llamafactory-cli train examples/extras/nlg_eval/llama3_lora_predict.yaml.
An example configuration file, examples/extras/nlg_eval/llama3_lora_predict.yaml, is shown below:
### examples/extras/nlg_eval/llama3_lora_predict.yaml
### model
model_name_or_path: meta-llama/Meta-Llama-3-8B-Instruct
adapter_name_or_path: saves/llama3-8b/lora/sft
### method
stage: sft
do_predict: true
finetuning_type: lora
### dataset
eval_dataset: identity,alpaca_en_demo
template: llama3
cutoff_len: 2048
max_samples: 50
overwrite_cache: true
preprocessing_num_workers: 16
### output
output_dir: saves/llama3-8b/lora/predict
overwrite_output_dir: true
### eval
per_device_eval_batch_size: 1
predict_with_generate: true
ddp_timeout: 180000000
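The prediction run reports BLEU and ROUGE scores itself; if you want to recompute or inspect them, the sketch below shows one way to score a predictions file. It assumes the run wrote a JSONL file (here called generated_predictions.jsonl) under output_dir with predict and label fields, and it uses nltk and rouge-score rather than the project's own metric code, so treat it purely as an illustration.

```python
import json

from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

# Assumed location and schema of the prediction file under output_dir above;
# adjust the path and field names to match what your run actually produced.
pred_file = "saves/llama3-8b/lora/predict/generated_predictions.jsonl"

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
smoothing = SmoothingFunction().method3

bleu_scores, rouge_l_scores = [], []
with open(pred_file, encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        reference, prediction = example["label"], example["predict"]
        # Sentence-level BLEU with smoothing (whitespace tokenization for simplicity).
        bleu_scores.append(
            sentence_bleu([reference.split()], prediction.split(), smoothing_function=smoothing)
        )
        # ROUGE-L F1 between the reference and the prediction.
        rouge_l_scores.append(scorer.score(reference, prediction)["rougeL"].fmeasure)

print(f"BLEU:    {sum(bleu_scores) / len(bleu_scores):.4f}")
print(f"ROUGE-L: {sum(rouge_l_scores) / len(rouge_l_scores):.4f}")
```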
Alternatively, you can specify the model and dataset in the command python scripts/vllm_infer.py --model_name_or_path path_to_merged_model --dataset alpaca_en_demo to use the vLLM inference framework for faster inference.
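Note that path_to_merged_model points at a standalone model, so if you fine-tuned with LoRA you first need to merge the adapter into the base weights. A minimal sketch of that two-step flow follows; the export config path and the merged-model directory are taken from the repository's merge-LoRA example and are assumptions here, not part of this page.

```python
import subprocess

# Step 1 (assumption): merge the LoRA adapter into the base model using the
# repository's example export config, producing a standalone merged model.
subprocess.run(
    ["llamafactory-cli", "export", "examples/merge_lora/llama3_lora_sft.yaml"],
    check=True,
)

# Step 2: run batch inference with the vLLM backend, using the command from
# the text; the model path must match export_dir in the config used above.
subprocess.run(
    [
        "python", "scripts/vllm_infer.py",
        "--model_name_or_path", "models/llama3_lora_sft",  # illustrative merged-model path
        "--dataset", "alpaca_en_demo",
    ],
    check=True,
)
```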
Evaluation Parameters
| Parameter Name | Type | Description |
|---|---|---|
| task | str | The name of the evaluation task. Available values are mmlu_test, ceval_validation, cmmlu_test. |
| task_dir | str | The folder path containing the evaluation dataset. The default value is evaluation. |
| batch_size | int | The batch size used per GPU. The default value is 4. |
| seed | int | The random seed for the data loader. The default value is 42. |
| lang | str | The language used for evaluation. Available values are en and zh. The default value is en. |
| n_shot | int | The number of few-shot examples. The default value is 5. |
| save_dir | str | The path to save the evaluation results. The default value is None. |
| download_mode | str | The download mode for the evaluation dataset. The default value is DownloadMode.REUSE_DATASET_IF_EXISTS, which reuses the dataset if it already exists and downloads it otherwise. |