Tonic Validate
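What is Tonic Validate
Tonic Validate is a tool for people developing retrieval augmented generation (RAG) systems to evaluate the performance of their system. You can use Tonic Validate for one-off spot checks of your LlamaIndex setup's performance, or you can run it inside an existing CI/CD system such as GitHub Actions. Tonic Validate has two parts: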
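- An open-source SDK
- A web UI
You can use the SDK without the web UI if you prefer. The SDK includes all of the tools needed to evaluate your RAG system. The web UI adds a layer on top of the SDK for visualizing your results, which gives you a better sense of your system's performance than viewing raw numbers alone.
If you want to use the web UI, you can sign up for a free account here.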
How to use Tonic Validate
Setting Up Tonic Validate
You can install Tonic Validate with the following command:

pip install tonic-validate

To use Tonic Validate, you need to provide an OpenAI key, because the score calculations use an LLM on the backend. Set the OPENAI_API_KEY environment variable to your OpenAI API key:
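
import os

os.environ["OPENAI_API_KEY"] = "put-your-openai-api-key-here"

If you are uploading your results to the UI, also set the Tonic Validate API key you received when setting up your web UI account. If you have not set up an account on the web UI yet, you can do so here. Once you have the API key, set it through the TONIC_VALIDATE_API_KEY environment variable: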
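
import os

os.environ["TONIC_VALIDATE_API_KEY"] = "put-your-validate-api-key-here"

In this example, we have a question with a reference answer that does not match the LLM's answer. There are two retrieved context chunks, which contain the correct answer.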

question = "What makes Sam Altman a good founder?"
reference_answer = "He is smart and has a great force of will."
llm_answer = "He is a good founder because he is smart."
retrieved_context_list = [
    "Sam Altman is a good founder. He is very smart.",
    "What makes Sam Altman such a good founder is his great force of will.",
]

The answer similarity score is a score between 0 and 5 that measures how well the LLM answer matches the reference answer. In this case they do not match perfectly, so the answer similarity score is not a perfect 5.
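
from llama_index.evaluation.tonic_validate import AnswerSimilarityEvaluator

# Top-level await works in a notebook; in a script, wrap the call in an
# async function and run it with asyncio.run().
answer_similarity_evaluator = AnswerSimilarityEvaluator()
score = await answer_similarity_evaluator.aevaluate(
    question,
    llm_answer,
    retrieved_context_list,
    reference_response=reference_answer,
)
print(score)
# >> EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=4.0, pairwise_source=None, invalid_result=False, invalid_reason=None)

The answer consistency score is between 0.0 and 1.0, and measures whether the answer contains information that does not appear in the retrieved context. In this case, the answer does appear in the retrieved context, so the score is 1.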

from llama_index.evaluation.tonic_validate import AnswerConsistencyEvaluator

answer_consistency_evaluator = AnswerConsistencyEvaluator()
score = await answer_consistency_evaluator.aevaluate(
    question, llm_answer, retrieved_context_list
)
print(score)
# >> EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=1.0, pairwise_source=None, invalid_result=False, invalid_reason=None)

Augmentation accuracy measures the percentage of the retrieved context that appears in the answer. In this case, one of the two retrieved contexts appears in the answer, so the score is 0.5.

from llama_index.evaluation.tonic_validate import AugmentationAccuracyEvaluator

augmentation_accuracy_evaluator = AugmentationAccuracyEvaluator()
score = await augmentation_accuracy_evaluator.aevaluate(
    question, llm_answer, retrieved_context_list
)
print(score)
# >> EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=0.5, pairwise_source=None, invalid_result=False, invalid_reason=None)

Augmentation precision measures whether the relevant retrieved context makes it into the answer. Both of the retrieved contexts are relevant, but only one makes it into the answer, so the score is 0.5.

from llama_index.evaluation.tonic_validate import AugmentationPrecisionEvaluator

augmentation_precision_evaluator = AugmentationPrecisionEvaluator()
score = await augmentation_precision_evaluator.aevaluate(
    question, llm_answer, retrieved_context_list
)
print(score)
# >> EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=0.5, pairwise_source=None, invalid_result=False, invalid_reason=None)

Retrieval precision measures the percentage of the retrieved context that is relevant to the question. In this case, both of the retrieved contexts are relevant to answering the question, so the score is 1.0.

from llama_index.evaluation.tonic_validate import RetrievalPrecisionEvaluator

retrieval_precision_evaluator = RetrievalPrecisionEvaluator()
score = await retrieval_precision_evaluator.aevaluate(
    question, llm_answer, retrieved_context_list
)
print(score)
# >> EvaluationResult(query='What makes Sam Altman a good founder?', contexts=['Sam Altman is a good founder. He is very smart.', 'What makes Sam Altman such a good founder is his great force of will.'], response='He is a good founder because he is smart.', passing=None, feedback=None, score=1.0, pairwise_source=None, invalid_result=False, invalid_reason=None)

The TonicValidateEvaluator can calculate all of Tonic Validate's metrics at once.

from llama_index.evaluation.tonic_validate import TonicValidateEvaluator

tonic_validate_evaluator = TonicValidateEvaluator()
scores = await tonic_validate_evaluator.aevaluate(
    question,
    llm_answer,
    retrieved_context_list,
    reference_response=reference_answer,
)
print(scores.score_dict)
# >> {
#     'answer_consistency': 1.0,
#     'answer_similarity': 4.0,
#     'augmentation_accuracy': 0.5,
#     'augmentation_precision': 0.5,
#     'retrieval_precision': 1.0
# }

You can also use the TonicValidateEvaluator to evaluate multiple queries and responses at once. It returns a tonic_validate Run object that can be logged to the Tonic Validate UI.
To do this, put the questions, LLM answers, retrieved context lists, and reference answers into lists and call evaluate_run (the example uses its async counterpart, aevaluate_run).

questions = ["What is the capital of France?", "What is the capital of Spain?"]
reference_answers = ["Paris", "Madrid"]
llm_answers = ["Paris", "Madrid"]
retrieved_context_lists = [
    [
        "Paris is the capital and most populous city of France.",
        "Paris, France's capital, is a major European city and a global center for art, fashion, gastronomy and culture.",
    ],
    [
        "Madrid is the capital and largest city of Spain.",
        "Madrid, Spain's central capital, is a city of elegant boulevards and expansive, manicured parks such as the Buen Retiro.",
    ],
]

tonic_validate_evaluator = TonicValidateEvaluator()

# Pass the lists directly, one entry per question, in the order
# questions, answers, contexts, reference answers.
scores = await tonic_validate_evaluator.aevaluate_run(
    questions, llm_answers, retrieved_context_lists, reference_answers
)
print(scores.run_data[0].scores)
# >> {
#     'answer_consistency': 1.0,
#     'answer_similarity': 3.0,
#     'augmentation_accuracy': 0.5,
#     'augmentation_precision': 0.5,
#     'retrieval_precision': 1.0
# }

If you want to upload your scores to the UI, you can use the Tonic Validate API. Before doing so, make sure TONIC_VALIDATE_API_KEY is set as described in the Setting Up Tonic Validate section. You also need to have created a project in the Tonic Validate UI and copied its project ID. Once the API key and project are set up, you can initialize the Validate API and upload the results.

from tonic_validate import ValidateApi

validate_api = ValidateApi()
project_id = "your-project-id"
validate_api.upload_run(project_id, scores)

Now you can view your results in the Tonic Validate UI!
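
Both scoring and uploading depend on environment variables being present, so a small pre-flight check can catch a missing key before any API call is made. The sketch below is only an illustration: the helper name missing_keys is hypothetical and is not part of Tonic Validate or LlamaIndex; only the two variable names come from the setup section above.

```python
import os
from typing import Iterable, List, Mapping

# Variable names taken from the setup section of this page.
REQUIRED_KEYS = ("OPENAI_API_KEY", "TONIC_VALIDATE_API_KEY")


def missing_keys(
    required: Iterable[str] = REQUIRED_KEYS,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Return the required variable names that are unset or blank.

    Hypothetical helper for illustration; not a Tonic Validate API.
    """
    return [name for name in required if not env.get(name, "").strip()]


missing = missing_keys()
if missing:
    print("Set these before scoring or uploading: " + ", ".join(missing))
else:
    print("Both keys are set.")
```

If you only score locally and never upload, checking OPENAI_API_KEY alone should be enough, for example missing_keys(required=["OPENAI_API_KEY"]).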

Here we will show you how to use Tonic Validate end to end with Llama Index. First, let's use the Llama Index CLI to download a dataset for Llama Index to run on.

llamaindex-cli download-llamadataset EvaluatingLlmSurveyPaperDataset --download-dir ./data

Now, we can create a Python file called llama.py and put the following code in it.

from llama_index.core import SimpleDirectoryReader
from llama_index.core import VectorStoreIndex

# Load the downloaded source files and build an in-memory vector index.
documents = SimpleDirectoryReader(input_dir="./data/source_files").load_data()
index = VectorStoreIndex.from_documents(documents=documents)
query_engine = index.as_query_engine()

This code essentially just loads the dataset files and then initializes Llama Index.
The Llama Index CLI also downloads a list of questions and answers that you can use to test on their example dataset. If you want to use these questions and answers, you can use the following code.

from llama_index.core.llama_dataset import LabelledRagDataset

rag_dataset = LabelledRagDataset.from_json("./data/rag_dataset.json")

# We are only going to do 10 questions as running through the full data set takes too long
questions = [item.query for item in rag_dataset.examples][:10]
reference_answers = [item.reference_answer for item in rag_dataset.examples][:10]

Now we can query Llama Index for its responses.

llm_answers = []
retrieved_context_lists = []
for question in questions:
    response = query_engine.query(question)
    # Keep the text of every source node as the retrieved context.
    context_list = [x.text for x in response.source_nodes]
    retrieved_context_lists.append(context_list)
    llm_answers.append(response.response)

Now, to score the responses, we can do the following:
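
from tonic_validate.metrics import AnswerSimilarityMetric
from llama_index.evaluation.tonic_validate import TonicValidateEvaluator

tonic_validate_evaluator = TonicValidateEvaluator(
    metrics=[AnswerSimilarityMetric()], model_evaluator="gpt-4-1106-preview"
)

# Same argument order as in the multi-question example:
# questions, answers, contexts, reference answers.
scores = tonic_validate_evaluator.evaluate_run(
    questions, llm_answers, retrieved_context_lists, reference_answers
)
print(scores.overall_scores)

If you want to upload your scores to the UI, you can use the Tonic Validate API. Before doing so, make sure TONIC_VALIDATE_API_KEY is set as described in the Setting Up Tonic Validate section. You also need to have created a project in the Tonic Validate UI and copied its project ID. Once the API key and project are set up, you can initialize the Validate API and upload the results.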
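
from tonic_validate import ValidateApi

validate_api = ValidateApi()
project_id = "your-project-id"
# Upload the run object returned by evaluate_run above.
validate_api.upload_run(project_id, scores)

In addition to the documentation here, you can visit Tonic Validate's Github page for more documentation on how to interact with our API to upload results.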
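
Because the overall scores are plain numbers, they can also drive a pass/fail check in a CI job such as the GitHub Actions setup mentioned at the top of this page. The following is a minimal sketch, assuming the overall scores are available as a dictionary mapping metric names to floats. The helper failing_metrics and the threshold values are hypothetical and are not part of Tonic Validate.

```python
from typing import Dict, List


def failing_metrics(
    overall_scores: Dict[str, float], minimums: Dict[str, float]
) -> List[str]:
    """Return sorted names of metrics that are missing or below their minimum.

    Hypothetical helper for illustration; not a Tonic Validate API.
    """
    failures = []
    for name, minimum in minimums.items():
        score = overall_scores.get(name)
        if score is None or score < minimum:
            failures.append(name)
    return sorted(failures)


# Made-up scores shaped like the dictionaries printed earlier on this page.
example_scores = {"answer_similarity": 3.0, "retrieval_precision": 1.0}
# Assumed thresholds; choose values that suit your own project.
example_minimums = {"answer_similarity": 3.5, "retrieval_precision": 0.8}

failed = failing_metrics(example_scores, example_minimums)
print(failed)  # ['answer_similarity']
```

In a CI script, you could pass scores.overall_scores as the first argument and exit with a non-zero status when the returned list is non-empty, so the job fails whenever a metric regresses.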