Usage Pattern (Response Evaluation)
Using BaseEvaluator

All of the evaluation modules in LlamaIndex implement the BaseEvaluator class, with two main methods:
1. The evaluate method takes in query, contexts, response, and additional keyword arguments.
```python
def evaluate(
    self,
    query: Optional[str] = None,
    contexts: Optional[Sequence[str]] = None,
    response: Optional[str] = None,
    **kwargs: Any,
) -> EvaluationResult:
```

2. The evaluate_response method provides an alternative interface that takes in a llamaindex Response object (which contains the response string and source nodes) instead of separate contexts and response.
```python
def evaluate_response(
    self,
    query: Optional[str] = None,
    response: Optional[Response] = None,
    **kwargs: Any,
) -> EvaluationResult:
```

It's functionally the same as evaluate, just simpler to use when working with llamaindex objects.
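The two entry points are therefore interchangeable. A minimal sketch, assuming evaluator is any BaseEvaluator and response is a Response object returned by a query engine (as in the examples below):

```python
# Unpack the Response object into the pieces that evaluate expects.
response_str = response.response
contexts = [node.get_content() for node in response.source_nodes]

# These two calls are equivalent.
result_a = evaluator.evaluate(query=query, response=response_str, contexts=contexts)
result_b = evaluator.evaluate_response(query=query, response=response)
```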
Using EvaluationResult

Each evaluator, when executed, outputs an EvaluationResult:
```python
eval_result = evaluator.evaluate(query=..., contexts=..., response=...)
eval_result.passing  # binary pass/fail
eval_result.score  # numerical score
eval_result.feedback  # string feedback
```

Different evaluators may populate a subset of the result fields.
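Since not every field is guaranteed to be populated, a small defensive helper can be handy when mixing evaluators. A sketch (the helper name is illustrative, not part of the library):

```python
from llama_index.core.evaluation import EvaluationResult


def summarize_result(result: EvaluationResult) -> str:
    # Only report the fields the evaluator actually populated.
    parts = []
    if result.passing is not None:
        parts.append(f"passing={result.passing}")
    if result.score is not None:
        parts.append(f"score={result.score}")
    if result.feedback:
        parts.append(f"feedback={result.feedback}")
    return ", ".join(parts)
```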
Evaluating Response Faithfulness (i.e. Hallucination)

The FaithfulnessEvaluator evaluates whether the answer is faithful to the retrieved contexts (in other words, whether the response is hallucinated).
```python
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import FaithfulnessEvaluator

# create llm
llm = OpenAI(model="gpt-4", temperature=0.0)

# build index
...

# define evaluator
evaluator = FaithfulnessEvaluator(llm=llm)

# query index
query_engine = vector_index.as_query_engine()
response = query_engine.query(
    "What battles took place in New York City in the American Revolution?"
)
eval_result = evaluator.evaluate_response(response=response)
print(str(eval_result.passing))
```
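Each evaluator also exposes async variants. A sketch using aevaluate_response from inside an async function (mirroring the sync call above):

```python
# Await the async variant when running inside an event loop.
eval_result = await evaluator.aevaluate_response(response=response)
print(str(eval_result.passing))
```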
You can also choose to evaluate each source context individually:
```python
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import FaithfulnessEvaluator

# create llm
llm = OpenAI(model="gpt-4", temperature=0.0)

# build index
...

# define evaluator
evaluator = FaithfulnessEvaluator(llm=llm)

# query index
query_engine = vector_index.as_query_engine()
response = query_engine.query(
    "What battles took place in New York City in the American Revolution?"
)
response_str = response.response
for source_node in response.source_nodes:
    eval_result = evaluator.evaluate(
        response=response_str, contexts=[source_node.get_content()]
    )
    print(str(eval_result.passing))
```

You'll get back a list of results, one for each source node in response.source_nodes.
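If you want a single number rather than per-node printouts, one option is to aggregate the per-source results into a pass rate. A sketch (the aggregation is our own, not a library feature):

```python
# Fraction of source nodes the response is judged faithful to.
results = [
    evaluator.evaluate(response=response_str, contexts=[node.get_content()])
    for node in response.source_nodes
]
pass_rate = sum(1 for r in results if r.passing) / len(results)
print(f"faithful to {pass_rate:.0%} of source nodes")
```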
Evaluating Query + Response Relevancy

The RelevancyEvaluator evaluates whether the retrieved context and the answer are relevant and consistent for the given query.

Note that this evaluator requires the query to be passed in, in addition to the Response object.
```python
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import RelevancyEvaluator

# create llm
llm = OpenAI(model="gpt-4", temperature=0.0)

# build index
...

# define evaluator
evaluator = RelevancyEvaluator(llm=llm)

# query index
query_engine = vector_index.as_query_engine()
query = "What battles took place in New York City in the American Revolution?"
response = query_engine.query(query)
eval_result = evaluator.evaluate_response(query=query, response=response)
print(str(eval_result))
```
Similarly, you can also evaluate against a specific source node.
```python
from llama_index.core import VectorStoreIndex
from llama_index.llms.openai import OpenAI
from llama_index.core.evaluation import RelevancyEvaluator

# create llm
llm = OpenAI(model="gpt-4", temperature=0.0)

# build index
...

# define evaluator
evaluator = RelevancyEvaluator(llm=llm)

# query index
query_engine = vector_index.as_query_engine()
query = "What battles took place in New York City in the American Revolution?"
response = query_engine.query(query)
response_str = response.response
for source_node in response.source_nodes:
    eval_result = evaluator.evaluate(
        query=query,
        response=response_str,
        contexts=[source_node.get_content()],
    )
    print(str(eval_result.passing))
```
Question Generation

LlamaIndex can also generate questions to answer using your data. Used in combination with the evaluators above, you can create a fully automated evaluation pipeline over your data.
```python
from llama_index.core import SimpleDirectoryReader
from llama_index.llms.openai import OpenAI
from llama_index.core.llama_dataset.generator import RagDatasetGenerator

# create llm
llm = OpenAI(model="gpt-4", temperature=0.0)

# build documents
documents = SimpleDirectoryReader("./data").load_data()

# define generator, generate questions
dataset_generator = RagDatasetGenerator.from_documents(
    documents=documents,
    llm=llm,
    num_questions_per_chunk=10,  # set the number of questions per node
)

rag_dataset = dataset_generator.generate_questions_from_nodes()
questions = [e.query for e in rag_dataset.examples]
```
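Generated datasets can be persisted and reloaded, which helps keep evaluation runs reproducible. A minimal sketch, assuming the dataset's JSON helpers (save_json / from_json on LabelledRagDataset):

```python
from llama_index.core.llama_dataset import LabelledRagDataset

# Persist the generated questions for later, reproducible evaluation runs.
rag_dataset.save_json("rag_dataset.json")

# Reload the same dataset in a later session.
rag_dataset = LabelledRagDataset.from_json("rag_dataset.json")
```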
Batch Evaluation

We also provide a batch evaluation runner for running a set of evaluators across many questions.

```python
from llama_index.core.evaluation import BatchEvalRunner

runner = BatchEvalRunner(
    {"faithfulness": faithfulness_evaluator, "relevancy": relevancy_evaluator},
    workers=8,
)

eval_results = await runner.aevaluate_queries(
    vector_index.as_query_engine(), queries=questions
)
```
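The returned eval_results maps each evaluator name to one EvaluationResult per query. A sketch of turning that into per-metric pass rates (the aggregation is our own, not a library feature):

```python
# eval_results: evaluator name -> list of EvaluationResult, one per query.
for name, results in eval_results.items():
    pass_rate = sum(1 for r in results if r.passing) / len(results)
    print(f"{name}: {pass_rate:.0%} passing across {len(results)} queries")
```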
Integrations

We also integrate with community evaluation tools.

DeepEval

DeepEval offers 6 evaluators (including 3 RAG evaluators, for both retriever and generator evaluation) powered by its proprietary evaluation metrics. To get started, install deepeval:
```bash
pip install -U deepeval
```

You can then import and use evaluators from deepeval. Full example:
```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from deepeval.integrations.llama_index import DeepEvalAnswerRelevancyEvaluator

documents = SimpleDirectoryReader("YOUR_DATA_DIRECTORY").load_data()
index = VectorStoreIndex.from_documents(documents)
rag_application = index.as_query_engine()

# An example input to your RAG application
user_input = "What is LlamaIndex?"

# LlamaIndex returns a response object that contains
# both the output string and retrieved nodes
response_object = rag_application.query(user_input)

evaluator = DeepEvalAnswerRelevancyEvaluator()
evaluation_result = evaluator.evaluate_response(
    query=user_input, response=response_object
)
print(evaluation_result)
```

Here's how you can import all 6 evaluators from deepeval:
```python
from deepeval.integrations.llama_index import (
    DeepEvalAnswerRelevancyEvaluator,
    DeepEvalFaithfulnessEvaluator,
    DeepEvalContextualRelevancyEvaluator,
    DeepEvalSummarizationEvaluator,
    DeepEvalBiasEvaluator,
    DeepEvalToxicityEvaluator,
)
```

To learn more about how to use deepeval's evaluation metrics with LlamaIndex and take advantage of its full LLM testing suite, visit their documentation.
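Because these evaluators implement LlamaIndex's BaseEvaluator interface, they should also compose with the tooling shown earlier. For example, a sketch plugging a DeepEval evaluator into BatchEvalRunner (an assumption based on the shared interface, not something the integration docs prescribe):

```python
from llama_index.core.evaluation import BatchEvalRunner
from deepeval.integrations.llama_index import DeepEvalAnswerRelevancyEvaluator

# Run a DeepEval-backed evaluator across many generated questions.
runner = BatchEvalRunner(
    {"answer_relevancy": DeepEvalAnswerRelevancyEvaluator()},
    workers=4,
)
eval_results = await runner.aevaluate_queries(
    index.as_query_engine(), queries=questions
)
```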