October 23, 2024

Enhance your prompts with meta prompting

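Welcome to our guide on meta prompting! In this guide, we'll explore how to take a basic prompt and refine it to improve the quality of a language model's output. We'll use news summarization as the running example.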

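Meta prompting is a technique in which an LLM is used to generate or improve prompts. Typically, a more capable model optimizes prompts for a less capable one. It is a process of using prompts to guide, structure, and optimize other prompts, which helps ensure they steer the LLM toward high-quality, relevant outputs. Here we use o1-preview, a more capable model with advanced reasoning skills, to improve a prompt for a GPT-4o-family model (the code below generates summaries with gpt-4o-mini and evaluates them with gpt-4o).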

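We aim to make your development journey with LLMs smoother and more accessible through this technique. Don't forget to try the Generate Anything feature in the playground: it's a great starting point for exploring meta prompting.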

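In this example, we'll begin with a simple prompt for summarizing news articles and then refine it step by step to see how the outputs improve. We'll use o1-preview to analyze and improve the prompt, adding detail and clarity along the way. Finally, we'll evaluate the outputs systematically to understand the impact of the refinement.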

import pandas as pd
import openai 
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
from pydantic import BaseModel
from datasets import load_dataset

client = openai.Client()

Importing the data

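Let's start by importing the bbc_news_alltime dataset from HuggingFace. The dataset contains BBC News articles, capturing everything published each month from 2017 up to the latest complete month. For this experiment we'll use a random sample of 100 articles from a single recent month (August 2024) to keep the content current and the workload manageable.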

ds = load_dataset("RealTimeData/bbc_news_alltime", "2024-08")
df = pd.DataFrame(ds['train']).sample(n=100, random_state=1)
df.head()
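(index) | title | published_date | authors | description | section | content | link | top_image
2662 | Laura Whitmore: I was gaslit after raising... | 2024-08-04 | https://www.facebook.com/bbcnews | The former Love Island host says things... | Culture | Television presenter Laura Whitmore has said... | http://www.bbc.co.uk/news/articles/c9wvwvzm7x7o | https://ichef.bbci.co.uk/ace/standard/2560/cps...
1865 | Errollyn Wallen appointed Master of the King's Music... | 2024-08-25 | https://www.facebook.com/bbcnews | Her best-known work is music composed for the 2012 Olympics... | Culture | Acclaimed composer and singer-songwriter Errollyn Wallen... | http://www.bbc.co.uk/news/articles/c4gl758g7zgo | https://ichef.bbci.co.uk/ace/standard/2560/cps...
2554 | SDLP: Matthew O'Toole backs Claire Hanna... | 2024-08-30 | https://www.facebook.com/bbcnews | Matthew O'Toole had been seen by some as a potential candidate... | Northern Ireland Politics | Matthew O'Toole leads his party's official opposition... | http://www.bbc.co.uk/news/articles/cvg41j7xrzdo | https://ichef.bbci.co.uk/ace/standard/3840/cps...
1338 | Rotherham rioters jailed - BBC News | 2024-08-20 | https://www.facebook.com/bbcnews | Two members of a mob involved in attacking a Hol... | South Yorkshire | Two people in Rotherham jailed over UK riots... | http://www.bbc.co.uk/news/articles/cwywggd7qw6o | https://ichef.bbci.co.uk/ace/standard/2560/cps...
1232 | BBC News - BBC iPlayer | 2024-08-02 | | | | JavaScript seems to be disabled. Please enable... | http://www.bbc.co.uk/news/10318089 |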
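The last row above shows that the sample can contain pages with no real article body (here, a "JavaScript is disabled" placeholder). A minimal sketch of a filter for such rows, assuming a simple word-count threshold and a placeholder phrase are good enough signals; `is_usable_article`, `min_words`, and the marker text are hypothetical and not part of the original notebook:

```python
def is_usable_article(text, min_words=50,
                      placeholder_markers=("javascript seems to be disabled",)):
    """Return True when `text` looks like a real article body.

    Heuristic only (an assumption, not the notebook's method): reject
    missing values, very short strings, and known placeholder pages.
    """
    # Missing values (None, NaN) are not strings
    if not isinstance(text, str):
        return False
    stripped = text.strip()
    # Too short to be worth summarizing
    if len(stripped.split()) < min_words:
        return False
    # Reject pages that carry a known placeholder phrase
    lowered = stripped.lower()
    return not any(marker in lowered for marker in placeholder_markers)
```

With the dataframe above, this could be applied as `df = df[df["content"].apply(is_usable_article)]` before generating summaries. The threshold and marker text would need tuning against the actual data.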

Let's start with a simple prompt and then use o1-preview to improve it. We want to summarize news articles, so that is what the prompt asks the model to do.

simple_prompt = "Summarize this news article: {article}"

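To improve the prompt, we need to give o1-preview the context and the goal we want to achieve. We can then ask it to generate a more detailed prompt that produces richer, more comprehensive news summaries.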

meta_prompt = """
Improve the following prompt to generate a more detailed summary. 
Adhere to prompt engineering best practices. 
Make sure the structure is clear and intuitive and contains the type of news, tags and sentiment analysis.

{simple_prompt}

Only return the prompt.
"""
def get_model_response(messages, model="o1-preview"):
    response = client.chat.completions.create(
        messages=messages,
        model=model,
    )
    return response.choices[0].message.content


complex_prompt = get_model_response([{"role": "user", "content": meta_prompt.format(simple_prompt=simple_prompt)}])
complex_prompt
'Please read the following news article and provide a comprehensive summary that includes:\n\n1. **Type of News**: Specify the category of the news article (e.g., Politics, Technology, Health, Sports, etc.).\n2. **Summary**: Write a concise and clear summary of the main points, ensuring the structure is logical and intuitive.\n3. **Tags**: List relevant keywords or tags associated with the article.\n4. **Sentiment Analysis**: Analyze the overall sentiment of the article (positive, negative, or neutral) and briefly explain your reasoning.\n\n**Article:**\n\n{article}'
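The generated prompt above ends with an `{article}` placeholder, but a model-written prompt is not guaranteed to keep it, and it may contain other braces (for example a JSON sample) that would make `str.format` raise. A minimal sketch of a check-and-fill step, assuming exactly one placeholder is expected; `fill_article` is a hypothetical helper, not part of the original notebook:

```python
def fill_article(prompt_template, article, placeholder="{article}"):
    """Insert `article` into a generated prompt template.

    Uses str.replace instead of str.format so that stray braces in the
    template or the article text cannot raise or be misinterpreted.
    """
    count = prompt_template.count(placeholder)
    if count != 1:
        raise ValueError(
            f"expected exactly one {placeholder!r} in the prompt, found {count}"
        )
    return prompt_template.replace(placeholder, article)
```

A call such as `fill_article(complex_prompt, row["content"])` would fail fast if the meta prompt returned a template without the placeholder, rather than silently sending a prompt with no article in it.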

Generating the summaries

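Now that we have both prompts, let's generate the summaries! For each entry in the dataset, we'll use both the simple and the enhanced prompt so we can compare their results. This lets us see directly how the o1-preview refinement leads to richer, more detailed summaries.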

def generate_response(prompt): 
    messages = [{"role": "user", "content": prompt}]
    response = get_model_response(messages, model="gpt-4o-mini")
    return response

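def generate_summaries(row):
    simple_summary = generate_response(simple_prompt.format(article=row["content"]))
    # Fill the {article} placeholder in the generated prompt; str.replace is used
    # because a model-written prompt may contain other braces that break str.format
    complex_summary = generate_response(complex_prompt.replace("{article}", row["content"]))
    return simple_summary, complex_summary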

Let's check that everything works and that we can generate summaries for the first news article.

generate_summaries(df.iloc[0])
('Television presenter Laura Whitmore has shared that the issues she attempted to address during her time on *Strictly Come Dancing* eight years ago are now surfacing, stating that she experienced "gaslighting" that made her concerns seem normalized. In a recent interview, she expressed the difficulties she faced, including being portrayed negatively and feeling "broken" during the competition. Whitmore indicated that she raised concerns about inappropriate behavior and is currently providing evidence for a BBC investigation, although she has not made an official complaint herself. The BBC is facing allegations of mistreatment towards contestants, prompting them to announce new welfare measures, including the presence of a chaperone during rehearsals. Other celebrities participating in the show have also made allegations against professional dancers, leading to growing scrutiny around conditions on the show. The BBC emphasized that it takes complaints very seriously and is committed to updating its support processes.',
 '1. **Type of News**: Entertainment\n\n2. **Summary**: Laura Whitmore, a television presenter, has spoken out about her experiences on Strictly Come Dancing, revealing that issues she attempted to address during her tenure on the show are now coming to light. In an interview with The Irish Times, she described feeling "gaslit" and suggested that her concerns, which she raised eight years ago, were not taken seriously at the time. Whitmore recalled that her participation left her feeling "broken" and criticized how she was portrayed during the show. She mentioned contributing evidence to an ongoing review involving incidents of alleged inappropriate behavior during her time on the show, although she did not make an official complaint. The BBC, which has been navigating its own controversy related to the treatment of contestants, stated it is taking these claims seriously and plans to enhance welfare measures on the show, including the introduction of a chaperone at rehearsals. Recent allegations from other contestants have further intensified the scrutiny of Strictly Come Dancing.\n\n3. **Tags**: Laura Whitmore, Strictly Come Dancing, BBC, allegations, inappropriate behavior, gaslighting, welfare measures, entertainment controversy\n\n4. **Sentiment Analysis**: The overall sentiment of the article is negative. It highlights serious allegations of mistreatment and inappropriate behavior associated with a popular television show, along with personal accounts from Whitmore that reflect emotional distress and professional struggles. The tone conveys a sense of urgency and seriousness regarding the issues raised, indicating a critical atmosphere within the entertainment industry related to contestant treatment.')

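Comparing the summaries from the simple and enhanced prompts, we can already see a clear difference. The simple summary gives a high-level overview of the article, while the enhanced output goes further: alongside the summary it categorizes the type of news, lists relevant tags, and includes a sentiment analysis.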

Now let's run this on the whole 100-article sample!

# Add new columns to the dataframe for storing summaries
df['simple_summary'] = None
df['complex_summary'] = None

# Use ThreadPoolExecutor to generate summaries concurrently
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(generate_summaries, row): index for index, row in df.iterrows()}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Generating Summaries"):
        index = futures[future]
        simple_summary, complex_summary = future.result()
        df.at[index, 'simple_summary'] = simple_summary
        df.at[index, 'complex_summary'] = complex_summary

df.head()
Generating Summaries: 100%|██████████| 100/100 [00:50<00:00,  1.98it/s]
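(index) | title | published_date | authors | description | section | content | link | top_image | simple_summary | complex_summary
2662 | Laura Whitmore: I was gaslit after raising... | 2024-08-04 | https://www.facebook.com/bbcnews | The former Love Island host says things... | Culture | Television presenter Laura Whitmore has said... | http://www.bbc.co.uk/news/articles/c9wvwvzm7x7o | https://ichef.bbci.co.uk/ace/standard/2560/cps... | Television presenter Laura Whitmore has spoken about... | 1. **Type of News**: Entertainment/Television\...
1865 | Errollyn Wallen appointed Master of the King's Music... | 2024-08-25 | https://www.facebook.com/bbcnews | Her best-known work is music composed for the 2012 Olympics... | Culture | Acclaimed composer and singer-songwriter Errollyn Wallen... | http://www.bbc.co.uk/news/articles/c4gl758g7zgo | https://ichef.bbci.co.uk/ace/standard/2560/cps... | Errollyn Wallen has been appointed Master of the King's Music... | 1. **Type of News**: Arts/Music\n\n2. **Summary...
2554 | SDLP: Matthew O'Toole backs Claire Hanna... | 2024-08-30 | https://www.facebook.com/bbcnews | Matthew O'Toole had been seen by some as a potential candidate... | Northern Ireland Politics | Matthew O'Toole leads his party's official opposition... | http://www.bbc.co.uk/news/articles/cvg41j7xrzdo | https://ichef.bbci.co.uk/ace/standard/3840/cps... | Matthew O'Toole, leader of the official opposition... | 1. **Type of News**: Politics\n\n2. **Summary**:...
1338 | Rotherham rioters jailed - BBC News | 2024-08-20 | https://www.facebook.com/bbcnews | Two members of a mob involved in attacking a Hol... | South Yorkshire | Two people in Rotherham jailed over UK riots... | http://www.bbc.co.uk/news/articles/cwywggd7qw6o | https://ichef.bbci.co.uk/ace/standard/2560/cps... | Two men, Nathan Palmer, 29, and Niven Matthewm... | 1. **Type of News**: Politics/Crime and Justice...
1232 | BBC News - BBC iPlayer | 2024-08-02 | | | | JavaScript seems to be disabled. Please enable... | http://www.bbc.co.uk/news/10318089 | | The article discusses the need to enable JavaS... | I cannot provide a summary of the article because JavaScript is disabled...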
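In the loop above, `future.result()` re-raises any exception from a worker, so a single failed API call aborts the whole run. A minimal sketch of a variant that records failures per row and keeps going, assuming that behavior is what you want; `run_concurrently` and `max_workers=8` are hypothetical choices, not part of the original notebook:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_concurrently(fn, items, max_workers=8):
    """Apply `fn` to each value in `items`, an iterable of (key, value) pairs.

    Returns (results, errors): two dicts keyed by `key`, holding the return
    value of each successful call and the exception of each failed one.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Map each future back to the key of the item it is processing
        futures = {executor.submit(fn, value): key for key, value in items}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception as exc:  # keep going; record the failure instead
                errors[key] = exc
    return results, errors
```

Because `df.iterrows()` yields `(index, row)` pairs, a call like `run_concurrently(generate_summaries, df.iterrows())` would return results keyed by dataframe index, and the keys in `errors` identify rows worth retrying.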

Evaluating the results

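To assess the difference in performance between the two prompts, we'll use a structured evaluation approach with an LLM acting as the judge. In other words, we'll use a language model to score and compare the outputs against specific criteria.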

What does "LLM as a judge" mean?

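Using an LLM as a judge means having a language model evaluate its own outputs or those of another model. It applies predefined criteria to assess aspects such as accuracy, clarity, and relevance. This approach helps us obtain consistent, repeatable assessments that are less exposed to individual human bias, which makes it easier to identify improvements between prompts. Our guide on getting started with OpenAI Evals gives an overview of how to begin with this approach.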

Here is the prompt we'll use for evaluation:

evaluation_prompt = """
You are an expert editor tasked with evaluating the quality of a news article summary. Below is the original article and the summary to be evaluated:

**Original Article**:  
{original_article}

**Summary**:  
{summary}

Please evaluate the summary based on the following criteria, using a scale of 1 to 5 (1 being the lowest and 5 being the highest). Be critical in your evaluation and only give high scores for exceptional summaries:

1. **Categorization and Context**: Does the summary clearly identify the type or category of news (e.g., Politics, Technology, Sports) and provide appropriate context?  
2. **Keyword and Tag Extraction**: Does the summary include relevant keywords or tags that accurately capture the main topics and themes of the article?  
3. **Sentiment Analysis**: Does the summary accurately identify the overall sentiment of the article and provide a clear, well-supported explanation for this sentiment?  
4. **Clarity and Structure**: Is the summary clear, well-organized, and structured in a way that makes it easy to understand the main points?  
5. **Detail and Completeness**: Does the summary provide a detailed account that includes all necessary components (type of news, tags, sentiment) comprehensively?  


Provide your scores and justifications for each criterion, ensuring a rigorous and detailed evaluation.
"""

class ScoreCard(BaseModel):
    justification: str
    categorization: int
    keyword_extraction: int
    sentiment_analysis: int
    clarity_structure: int
    detail_completeness: int
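The `ScoreCard` schema types each score as an `int`, but nothing in it enforces the 1 to 5 scale that the evaluation prompt asks for. A minimal sketch of a post-hoc range check on a score dictionary, assuming the field names above; `out_of_range_scores` is a hypothetical helper, not part of the original notebook:

```python
# Score fields of ScoreCard, in declaration order (justification is text, not a score)
SCORE_FIELDS = (
    "categorization",
    "keyword_extraction",
    "sentiment_analysis",
    "clarity_structure",
    "detail_completeness",
)


def out_of_range_scores(scores, low=1, high=5):
    """Return {field: value} for every score that falls outside [low, high]."""
    return {
        field: scores[field]
        for field in SCORE_FIELDS
        if not low <= scores[field] <= high
    }
```

Given a parsed `ScoreCard` instance `card`, `out_of_range_scores(card.model_dump())` returns an empty dict when all five scores respect the scale, and otherwise names the offending fields so the row can be re-evaluated or excluded.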

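Here's a pro tip: you can use meta prompting to refine your evaluation prompt too! Applying the same iterative enhancement to the prompt that instructs the LLM judge can make your evaluations more precise and insightful.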
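As a rough illustration of that tip, the meta prompt used earlier can be generalized into a small builder that takes any prompt plus a list of improvement goals. This is only a sketch: `build_meta_prompt` and the example goals are hypothetical, and the wording simply mirrors the meta prompt shown earlier in this guide.

```python
def build_meta_prompt(prompt, goals):
    """Wrap `prompt` in instructions asking a stronger model to improve it.

    `goals` is a list of plain-language requirements for the improved prompt.
    """
    goal_lines = "\n".join(f"- {goal}" for goal in goals)
    return (
        "Improve the following prompt. "
        "Adhere to prompt engineering best practices.\n"
        "The improved prompt should:\n"
        f"{goal_lines}\n\n"
        f"{prompt}\n\n"
        "Only return the prompt."
    )
```

For the judge, one might pass `evaluation_prompt` together with goals such as "define what each score from 1 to 5 means" and "keep the {original_article} and {summary} placeholders unchanged", then send the result to o1-preview through `get_model_response` as before.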

Let's use this prompt to evaluate our summaries!

def evaluate_summaries(row):
    # Build one judge request per summary, each paired with the original article
    simple_messages = [{"role": "user", "content": evaluation_prompt.format(original_article=row["content"], summary=row['simple_summary'])}]
    complex_messages = [{"role": "user", "content": evaluation_prompt.format(original_article=row["content"], summary=row['complex_summary'])}]

    # Structured outputs: parse the judge's reply directly into a ScoreCard
    simple_completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=simple_messages,
        response_format=ScoreCard)
    simple_evaluation = simple_completion.choices[0].message.parsed

    complex_completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=complex_messages,
        response_format=ScoreCard)
    complex_evaluation = complex_completion.choices[0].message.parsed

    return simple_evaluation, complex_evaluation

# Add new columns to the dataframe for storing evaluations
df['simple_evaluation'] = None
df['complex_evaluation'] = None

# Use ThreadPoolExecutor to evaluate summaries concurrently
with ThreadPoolExecutor() as executor:
    futures = {executor.submit(evaluate_summaries, row): index for index, row in df.iterrows()}
    for future in tqdm(as_completed(futures), total=len(futures), desc="Evaluating Summaries"):
        index = futures[future]
        simple_evaluation, complex_evaluation = future.result()
        df.at[index, 'simple_evaluation'] = simple_evaluation
        df.at[index, 'complex_evaluation'] = complex_evaluation

df.head()
Evaluating Summaries: 100%|██████████| 100/100 [01:42<00:00,  1.02s/it]
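(index) | title | published_date | authors | description | section | content | link | top_image | simple_summary | complex_summary | simple_evaluation | complex_evaluation
2662 | Laura Whitmore: I was gaslit after raising... | 2024-08-04 | https://www.facebook.com/bbcnews | The former Love Island host says things... | Culture | Television presenter Laura Whitmore has said... | http://www.bbc.co.uk/news/articles/c9wvwvzm7x7o | https://ichef.bbci.co.uk/ace/standard/2560/cps... | Television presenter Laura Whitmore has spoken about... | 1. **Type of News**: Entertainment/Television\... | categorization=4 keyword_extraction=3 sentimen... | categorization=5 keyword_extraction=5 sentimen...
1865 | Errollyn Wallen appointed Master of the King's Music... | 2024-08-25 | https://www.facebook.com/bbcnews | Her best-known work is music composed for the 2012 Olympics... | Culture | Acclaimed composer and singer-songwriter Errollyn Wallen... | http://www.bbc.co.uk/news/articles/c4gl758g7zgo | https://ichef.bbci.co.uk/ace/standard/2560/cps... | Errollyn Wallen has been appointed Master of the King's Music... | 1. **Type of News**: Arts/Music\n\n2. **Summary... | categorization=4 keyword_extraction=4 sentimen... | categorization=5 keyword_extraction=5 sentimen...
2554 | SDLP: Matthew O'Toole backs Claire Hanna... | 2024-08-30 | https://www.facebook.com/bbcnews | Matthew O'Toole had been seen by some as a potential candidate... | Northern Ireland Politics | Matthew O'Toole leads his party's official opposition... | http://www.bbc.co.uk/news/articles/cvg41j7xrzdo | https://ichef.bbci.co.uk/ace/standard/3840/cps... | Matthew O'Toole, leader of the official opposition... | 1. **Type of News**: Politics\n\n2. **Summary**... | categorization=5 keyword_extraction=4 sentimen... | categorization=5 keyword_extraction=5 sentimen...
1338 | Rotherham rioters jailed - BBC News | 2024-08-20 | https://www.facebook.com/bbcnews | Two members of a mob involved in attacking a Hol... | South Yorkshire | Two people in Rotherham jailed over UK riots... | http://www.bbc.co.uk/news/articles/cwywggd7qw6o | https://ichef.bbci.co.uk/ace/standard/2560/cps... | Two men, Nathan Palmer, 29, and Niven Matthewm... | 1. **Type of News**: Politics/Crime and Justice... | categorization=3 keyword_extraction=3 sentimen... | categorization=5 keyword_extraction=4 sentimen...
1232 | BBC News - BBC iPlayer | 2024-08-02 | None | None | None | JavaScript seems to be disabled. Please enable... | http://www.bbc.co.uk/news/10318089 | | The article discusses the need to enable JavaS... | I cannot provide a summary of the article because... | categorization=2 keyword_extraction=3 sentimen... | categorization=1 keyword_extraction=1 sentimen...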
import matplotlib.pyplot as plt

df["simple_scores"] = df["simple_evaluation"].apply(lambda x: [score for key, score in x.model_dump().items() if key != 'justification'])
df["complex_scores"] = df["complex_evaluation"].apply(lambda x: [score for key, score in x.model_dump().items() if key != 'justification'])


# Display labels for the criteria, in the same order as the ScoreCard score fields
criteria = [
    'Categorisation',
    'Keywords and Tags',
    'Sentiment Analysis',
    'Clarity and Structure',
    'Detail and Completeness'
]

# Calculate average scores for each criterion, per prompt
simple_avg_scores = df['simple_scores'].apply(pd.Series).mean()
complex_avg_scores = df['complex_scores'].apply(pd.Series).mean()


# Prepare data for plotting
avg_scores_df = pd.DataFrame({
    'Criteria': criteria,
    'Original Prompt': simple_avg_scores,
    'Improved Prompt': complex_avg_scores
})

# Plotting
ax = avg_scores_df.plot(x='Criteria', kind='bar', figsize=(6, 4))
plt.ylabel('Average Score')
plt.title('Comparison of Simple vs Complex Prompt Performance by Criterion')
plt.xticks(rotation=45, ha='right')
plt.tight_layout()
plt.legend(loc='upper left', bbox_to_anchor=(1, 1))
plt.show()
[Figure: bar chart of the average score per criterion, Original Prompt vs Improved Prompt]

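After evaluating the results, we found that while the basic prompt performed well on clarity and structure, the enhanced prompt scored notably higher on the other key criteria: categorization, keywords and tags, sentiment analysis, and detail and completeness. The summaries from the complex prompt were more informative, better organized, and richer in content.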
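Averages can hide how often the improved prompt actually wins on a given article. A minimal sketch of a paired comparison over the per-row score lists, assuming both lists are aligned row by row and every row has the same number of criteria; `compare_scores` is a hypothetical helper, not part of the original notebook:

```python
from statistics import mean


def compare_scores(simple_scores, complex_scores):
    """Compare two aligned lists of per-row score lists.

    Returns (mean_diff, win_rate): the mean of (complex - simple) for each
    criterion, and the share of rows where the complex total is strictly higher.
    """
    pairs = list(zip(simple_scores, complex_scores))
    n_criteria = len(simple_scores[0])
    # Average per-criterion improvement across all rows
    mean_diff = [mean(c[i] - s[i] for s, c in pairs) for i in range(n_criteria)]
    # Fraction of rows where the complex prompt scores higher overall
    wins = sum(sum(c) > sum(s) for s, c in pairs)
    return mean_diff, wins / len(pairs)
```

With the columns built above, this could be called as `compare_scores(df["simple_scores"].tolist(), df["complex_scores"].tolist())`. Ties count as non-wins here, which is a design choice rather than a standard.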

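This example shows how refining a prompt can substantially improve the quality of generated summaries. Although this is a simplified case, the benefits of prompt optimization are expected to be even more pronounced in production-grade applications, yielding outputs better aligned with specific goals and user needs.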

Conclusion

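Meta prompting is a powerful technique that can significantly improve the quality of a language model's outputs. Our exploration showed that starting from a simple prompt and refining it with o1-preview produced summaries that were more informative, better structured, and richer in content, with higher scores on key criteria such as categorization, keywords and tags, sentiment analysis, and completeness. The case highlights the value of prompt optimization: even in this simplified example the benefits are clear. In real-world applications, meta prompting with tools like o1-preview can raise language model performance further and better serve your specific goals and user needs.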