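2. Creating a synthetic Q&A dataset
We use davinci-instruct-beta-v3, a model specialized in following instructions, to generate questions from a given context. We then use davinci-instruct-beta-v3 again to answer those questions, given the same context.
This is expensive and slow, because the davinci engine is called once per section. You can simply download the final dataset instead.
We use the dataset created in the previous notebook.
2.1 Read in the data, and create a context
Create a context by concatenating the title, the heading and the content of each section.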
import pandas as pd
df = pd.read_csv('olympics-data/olympics_sections.csv')
df['context'] = df.title + "\n" + df.heading + "\n\n" + df.content
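df.head()

| | title | heading | content | tokens | context |
|---|---|---|---|---|---|
| 0 | 2020 Summer Olympics | Summary | The 2020 Summer Olympics (Japanese: 2020年夏季オリン... | 713 | 2020 Summer Olympics\nSummary\n\nThe 2020 Summ... |
| 1 | 2020 Summer Olympics | Host city selection | The International Olympic Committee (IOC) vote... | 126 | 2020 Summer Olympics\nHost city selection\n\nT... |
| 2 | 2020 Summer Olympics | Impact of the COVID-19 pandemic | In January 2020, concerns were raised about... | 369 | 2020 Summer Olympics\nImpact of the COVID-19 p... |
| 3 | 2020 Summer Olympics | Qualifying event cancellation and postponement | Concerns about the pandemic began to affect qualifying... | 298 | 2020 Summer Olympics\nQualifying event cancellation and postponement... |
| 4 | 2020 Summer Olympics | Effect on doping tests | Mandatory doping tests were being severely restricted... | 163 | 2020 Summer Olympics\nEffect on doping tests\n... |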
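As a minimal sketch of the concatenation above, the toy frame below (made-up rows, not the Olympics data) builds the same `context` column. The `fillna("")` step is an added assumption, not part of the notebook: with plain `+`, a missing title, heading or content would turn the whole context into NaN.

```python
import pandas as pd

# Toy rows standing in for olympics_sections.csv (illustrative values only)
toy = pd.DataFrame({
    "title": ["Page A", "Page B"],
    "heading": ["Summary", None],
    "content": ["First section text.", "Second section text."],
})

# Replace missing pieces with empty strings before concatenating
parts = toy[["title", "heading", "content"]].fillna("")
toy["context"] = parts["title"] + "\n" + parts["heading"] + "\n\n" + parts["content"]

print(repr(toy["context"][0]))
```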
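2.2 Create questions based on the context
Use davinci-instruct to generate a number of plausible questions relating to the contents of each Wikipedia section.
Note: We have used temperature=0, but experimenting with a higher temperature may yield more diverse questions.
WARNING: This step takes a long time and consumes a lot of tokens, because it calls davinci-instruct once per section to generate a list of questions.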
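Before starting a long run, it can help to put a rough upper bound on token usage. The sketch below is an assumption-laden estimate, not a billing calculation: `estimate_token_budget` and its `prompt_overhead` allowance are hypothetical, the `tokens` column is taken as an approximation of each prompt's size, and 257 is the `max_tokens` value used in this notebook.

```python
import pandas as pd

def estimate_token_budget(token_counts, max_completion_tokens=257, prompt_overhead=20):
    """Rough upper bound on tokens for one pass over all sections.

    prompt_overhead is a guessed per-call allowance for the instruction text.
    """
    n_sections = len(token_counts)
    prompt_tokens = int(sum(token_counts)) + n_sections * prompt_overhead
    completion_tokens = n_sections * max_completion_tokens
    return prompt_tokens + completion_tokens

# Token counts of the first five sections shown in the table above
sample_tokens = pd.Series([713, 126, 369, 298, 163])
print(estimate_token_budget(sample_tokens))
```

The answer-generation pass sends the questions as well as the context, so its bound would be somewhat higher than this figure.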
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))
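def get_questions(context):
    try:
        # A raw `prompt` is sent to the completions endpoint; chat.completions expects `messages`
        response = client.completions.create(
            model="davinci-instruct-beta-v3",
            prompt=f"Write questions based on the text below\n\nText: {context}\n\nQuestions:\n1.",
            temperature=0,
            max_tokens=257,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
            stop=["\n\n"],
        )
        return response.choices[0].text
    except Exception as e:
        # Report the failure and return an empty string so that df.apply can continue
        print(e)
        return ""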
df['questions']= df.context.apply(get_questions)
df['questions'] = "1." + df.questions
print(df[['questions']].values[0][0])
1. What is the 2020 Summer Olympics?
2. When did the 2020 Summer Olympics take place?
3. Who won the most medals at the 2020 Summer Olympics?
4. Who won the most gold medals at the 2020 Summer Olympics?
5. Who won the most medals at the 2020 Summer Olympics?
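The prompt is designed to generate a list of questions. The example questions above were generated from the Summary section of the 2020 Summer Olympics page.
We can see that questions 3 and 5 above are identical. Some generated questions can also be ambiguous when read without the context. We will show that, despite these limitations, we can still create a successful model.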
print(df.content.values[0])
The 2020 Summer Olympics (Japanese: 2020年夏季オリンピック, Hepburn: Nisen Nijū-nen Kaki Orinpikku), officially the Games of the XXXII Olympiad (第三十二回オリンピック競技大会, Dai Sanjūni-kai Orinpikku Kyōgi Taikai) and branded as Tokyo 2020 (東京2020, Tōkyō Nii Zero Nii Zero), was an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan, with some preliminary events that began on 21 July. Tokyo was selected as the host city during the 125th IOC Session in Buenos Aires, Argentina, on 7 September 2013. Originally scheduled to take place from 24 July to 9 August 2020, the event was postponed to 2021 in March 2020 as a result of the COVID-19 pandemic, the first such instance in the history of the Olympic Games (previous games had been cancelled but not rescheduled). However, the event retained the Tokyo 2020 name for marketing and branding purposes. It was largely held behind closed doors with no public spectators permitted due to the declaration of a state of emergency in the Greater Tokyo Area in response to the pandemic. The Summer Paralympics were held between 24 August and 5 September 2021, 16 days after the completion of the Olympics.The 2020 Games were the fourth Olympic Games to be held in Japan, following the Tokyo 1964 (Summer), Sapporo 1972 (Winter) and Nagano 1998 (Winter) games. Tokyo is the first city in Asia to hold the Summer Games twice. The 2020 Games were the second of three consecutive Olympics to be held in East Asia, following the 2018 Winter Olympics in Pyeongchang, South Korea and preceding the 2022 Winter Olympics in Beijing, China. New events were introduced in existing sports for 2020, including 3x3 basketball, freestyle BMX and mixed gender team events in a number of existing sports, as well as the return of madison cycling for men and an introduction of the same event for women. New IOC policies also allowed the host organizing committee to add new sports to the Olympic program for just one Games.
The disciplines added by the Japanese Olympic Committee were baseball and softball, karate, sport climbing, surfing and skateboarding, the last four of which made their Olympic debuts, and the last three of which will remain on the Olympic program.The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88). Host nation Japan finished third, setting a record for the most gold medals and total medals ever won by their delegation at an Olympic Games with 27 and 58. Great Britain finished fourth, with a total of 22 gold and 65 medals, becoming the first nation at the Summer Olympics to increase or equal their total medals won in the two Games subsequent to hosting them. The Russian delegation competing as the ROC (not to be confused with the Republic of China (Taiwan) which competed as Chinese Taipei, not ROC) finished fifth with 20 gold medals and third in the overall medal count, with 71 medals. Bermuda, the Philippines and Qatar won their first-ever Olympic gold medals. Burkina Faso, San Marino and Turkmenistan won their first-ever Olympic medals.
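The repeated questions noted earlier (3 and 5) could be filtered before answers are generated. The helper below is a sketch of one way to do that; `dedupe_questions` is a hypothetical name and this step is not part of the original notebook, which keeps the duplicates.

```python
import re

def dedupe_questions(numbered_text):
    """Split a numbered list into questions, drop repeats, and renumber."""
    seen = set()
    unique = []
    for line in numbered_text.splitlines():
        # Strip a leading "N." prefix and surrounding whitespace
        question = re.sub(r"^\s*\d+\.\s*", "", line).strip()
        key = question.lower()
        if question and key not in seen:
            seen.add(key)
            unique.append(question)
    return "\n".join(f"{i}. {q}" for i, q in enumerate(unique, start=1))

generated = """1. What is the 2020 Summer Olympics?
2. When did the 2020 Summer Olympics take place?
3. Who won the most medals at the 2020 Summer Olympics?
4. Who won the most gold medals at the 2020 Summer Olympics?
5. Who won the most medals at the 2020 Summer Olympics?"""

print(dedupe_questions(generated))
```

Only exact repeats (ignoring case) are removed; near-duplicates such as questions 3 and 4 are left alone, since they ask for different facts.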
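2.3 Create answers based on the context
Use davinci-instruct to answer the questions, given the relevant Wikipedia section contents.
Note: We have used temperature=0, but experimenting with a higher temperature may yield more diverse answers.
WARNING: This step takes a long time and consumes a lot of tokens, because it calls davinci-instruct once per section to answer all of its questions.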
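Because a long run makes many API calls, transient failures are likely. A retry wrapper with exponential backoff is one common way to handle them; the sketch below is generic Python, not an OpenAI feature, and `call_with_retries` is a hypothetical helper. It only helps if the wrapped function raises on failure instead of swallowing the exception.

```python
import time

def call_with_retries(fn, *args, max_attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call fn, retrying after failures with delays of base_delay * 1, 2, 4, ... seconds."""
    for attempt in range(max_attempts):
        try:
            return fn(*args, **kwargs)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)

# Demonstration with a function that fails twice before succeeding
attempts = []

def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("temporary failure")
    return "ok"

result = call_with_retries(flaky, sleep=lambda seconds: None)  # skip real waiting in the demo
print(result, len(attempts))
```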
def get_answers(row):
    try:
        # A raw `prompt` is sent to the completions endpoint, which takes `model` rather than `engine`
        response = client.completions.create(
            model="davinci-instruct-beta-v3",
            prompt=f"Write answer based on the text below\n\nText: {row.context}\n\nQuestions:\n{row.questions}\n\nAnswers:\n1.",
            temperature=0,
            max_tokens=257,
            top_p=1,
            frequency_penalty=0,
            presence_penalty=0,
        )
        return response.choices[0].text
    except Exception as e:
        # Report the failure and return an empty string so that df.apply can continue
        print(e)
        return ""
df['answers']= df.apply(get_answers, axis=1)
df['answers'] = "1." + df.answers
df = df.dropna().reset_index().drop('index',axis=1)
print(df[['answers']].values[0][0])
1. The 2020 Summer Olympics is an international multi-sport event held from 23 July to 8 August 2021 in Tokyo, Japan.
2. The 2020 Summer Olympics took place from 23 July to 8 August 2021.
3. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).
4. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).
5. The United States topped the medal count by both total golds (39) and total medals (113), with China finishing second by both respects (38 and 88).
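These are the answers to the questions above, based on the context of the first row (the Summary section).
We can see that answers 3-5 contain the correct answer, but instead of answering the question directly they extract a sentence verbatim. Despite these occasional lower-quality answers, we will show that the model can learn the task reasonably well, given a large number of examples.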
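To inspect individual question/answer pairs, the two numbered lists can be split and zipped together. This is a sketch with hypothetical helper names (`split_numbered`, `pair_questions_answers`); it assumes each item starts on its own line with an "N." prefix, and it silently drops unmatched items when the lists differ in length.

```python
import re

def split_numbered(text):
    """Return the items of a numbered list ("1. ...", "2. ...") as plain strings."""
    items = re.split(r"^[ \t]*\d+\.[ \t]*", text, flags=re.MULTILINE)
    return [item.strip() for item in items if item.strip()]

def pair_questions_answers(questions, answers):
    """Pair the i-th question with the i-th answer."""
    return list(zip(split_numbered(questions), split_numbered(answers)))

pairs = pair_questions_answers(
    "1. What is X?\n2. When was X?",
    "1. X is a thing.\n2. In 2021.",
)
for question, answer in pairs:
    print(question, "->", answer)
```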
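2.4 Save the Olympics Q&A dataset based on Wikipedia sections
We save the file for use in the next notebook.
df.to_csv('olympics-data/olympics_qa.csv', index=False)
2.5 Search file (DEPRECATED)
We create a search file (API reference), which can be used to retrieve the relevant context when a question is asked.
DEPRECATED: The /search endpoint is deprecated in favour of using embeddings. Embeddings are cheaper, faster, and can provide a better search experience. See the Question Answering Guide for a search implementation using embeddings.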
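The core of an embeddings-based search is ranking documents by cosine similarity to the query vector. The sketch below uses made-up 3-dimensional vectors in place of real embeddings and makes no API call; `rank_by_cosine` is a hypothetical helper, and in practice the vectors would come from an embedding model.

```python
import numpy as np

def rank_by_cosine(query_vec, doc_vecs):
    """Return document indices sorted by cosine similarity to the query, best first."""
    query = np.asarray(query_vec, dtype=float)
    docs = np.asarray(doc_vecs, dtype=float)
    sims = docs @ query / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
    order = np.argsort(-sims)
    return order, sims[order]

# Toy vectors standing in for section embeddings (illustrative values only)
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_vec = [0.9, 0.1, 0.0]

order, scores = rank_by_cosine(query_vec, doc_vecs)
print(order, scores.round(3))
```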
df = df[df.tokens<2000]
df[['context', 'tokens']].rename(columns={'context':'text','tokens':'metadata'}).to_json('olympics-data/olympics_search.jsonl', orient='records', lines=True)
search_file = client.files.create(
    file=open("olympics-data/olympics_search.jsonl", "rb"),
    purpose='search'
)
olympics_search_fileid = search_file.id
2.6 Answer questions based on the provided context
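We use a simple implementation of the answers endpoint. It calls the /search endpoint directly to search the indexed file for relevant sections that fit into the context, and then prompts the specified model to answer the question.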
from answers_with_ft import create_context, answer_question
print(create_context("Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?", olympics_search_fileid, max_len=400))
Athletics at the 2020 Summer Olympics – Women's 4 × 100 metres relay
Summary

The women's 4 × 100 metres relay event at the 2020 Summer Olympics took place on 5 and 6 August 2021 at the Japan National Stadium. There were 16 competing relay teams, with each team having 5 members from which 4 were selected in each round.

###

Athletics at the 2020 Summer Olympics – Men's 4 × 100 metres relay
Qualification

National Olympic Committees (NOCs) could qualify one relay team in one of three following ways: The top 8 NOCs at the 2019 World Athletics Championships qualified a relay team. The top 8 NOCs at the 2021 World Athletics Relays qualified a relay team. Where an NOC placed in the top 8 at both the 2019 World Championships and the 2021 World Relays, the quota place was allocated to the world top list as of 29 June 2021. In this case, 4 teams did so, so there are 4 places available through the world rankings.A total of five athletes may be entered for a relay team. Should a NOC have also entered individual athletes in the corresponding individual event (100 m), the entered individual athletes must be included in the total of five (5) athletes entered for the relay event. In addition of five, NOCs can nominate a maximum of one alternate athlete for each team. The qualifying period was originally from 1 May 2019 to 29 June 2020. Due to the COVID-19 pandemic, the period was suspended from 6 April 2020 to 30 November 2020, with the end date extended to 29 June 2021. The qualifying time standards could be obtained in various meets during the given period that have the approval of the IAAF. Both indoor and outdoor meets are eligible. The most recent Area Championships may be counted in the ranking, even if not during the qualifying period.
answer_question(olympics_search_fileid, "davinci-instruct-beta-v3",
                "Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?")
' Japan National Stadium'
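After we fine-tune the model for Q&A, we will be able to use it in place of davinci-instruct-beta-v3 and obtain better answers when the question cannot be answered from the context. One downside of davinci-instruct-beta-v3 is that it always attempts to answer the question, whether or not the relevant context is present. (Note that the next question asks about a future event, set in 2048.)
answer_question(olympics_search_fileid, "davinci-instruct-beta-v3",
                "Where did women's 4 x 100 metres relay event take place during the 2048 Summer Olympics?", max_len=1000)
' Japan National Stadium'
We can see that davinci tends to answer the question even when it cannot be answered from the provided context. The question concerns the 2048 Summer Olympics, which have not happened yet, and the retrieved content only contains results for 2020.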
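The context-assembly step described at the start of this section (take the ranked search results and pack as many as fit into the prompt) can be sketched as follows. This is not the code of `answers_with_ft.create_context`; `build_context` is a hypothetical stand-in that assumes the results arrive best match first with a known token count, and it reuses the `###` separator and 4-token allowance that appear elsewhere in this notebook.

```python
SEPARATOR = "\n\n###\n\n"
SEPARATOR_TOKENS = 4  # allowance per section for the separator

def build_context(ranked_sections, max_len=400):
    """ranked_sections: list of (text, token_count) tuples, best match first."""
    chosen, used = [], 0
    for text, n_tokens in ranked_sections:
        used += n_tokens + SEPARATOR_TOKENS
        if used > max_len:
            break  # the next section would exceed the budget
        chosen.append(text)
    return SEPARATOR.join(chosen)

# Toy search results (illustrative texts and token counts)
results = [("Section A", 58), ("Section B", 300), ("Section C", 100)]
print(build_context(results, max_len=400))
```

Because the loop stops at the first section that does not fit, a smaller, lower-ranked section is never used to fill the remaining space; that keeps the context in strict rank order.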
2.7 (Optional) Investigation into how likely the search endpoint is to return the relevant context
def check_context(title, heading, question, max_len=1800, search_model='ada', max_rerank=10):
    """
    Evaluate the performance of the search model in retrieving the correct context

    Parameters
    ----------
    title: str
        The title of the Wikipedia page
    heading: str
        The heading of the Wikipedia section
    question: str
        The question
    max_len: int
        The maximum length of the context
    search_model: str
        The search model to use - `ada` is most cost effective
    max_rerank: int
        The maximum number of reranking documents to use the search model on

    Returns
    -------
    rank: int
        The rank of the correct context
    token_length: int
        The number of tokens needed to obtain the correct context
    """
    try:
        # TODO: openai.Engine(search_model) is deprecated
        results = openai.Engine(search_model).search(
            search_model=search_model,
            query=question,
            max_rerank=max_rerank,
            file=olympics_search_fileid,
            return_metadata=True
        )
        index = -1
        returns = []
        cur_len = 0
        for result in results['data']:
            cur_len += int(result['metadata']) + 4  # we add 4 tokens for the separator `\n\n###\n\n`
            if cur_len > max_len:
                break
            returns.append(result['text'])
            res = result['text'].split('\n')
            if res[0] == title and res[1] == heading:
                index = len(returns) - 1
                break
        return index, cur_len
    except Exception as e:
        #print (e)
        return []
print(check_context("Athletics at the 2020 Summer Olympics – Women's 4 × 100 metres relay", "Summary", "Where did women's 4 x 100 metres relay event take place during the 2020 Summer Olympics?", max_len=10000))
(0, 58)
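We use the questions generated from each context to estimate how often the original context can be retrieved. These questions are noisy, so this is not a perfect estimate.
Our questions and answers are prefixed with numbered bullet points. Because of the way they were generated, the first number is missing, so we prepend "1." to the list of questions (and answers).
We calculate the rank of the section retrieved using ada search, and the number of tokens of context needed to retrieve the relevant section in full.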
ada_results = df.apply(lambda x: [
    check_context(
        x.title,
        x.heading,
        q[3:],            # remove the number prefix
        max_len=1000000,  # set a large number to get the full context
        search_model='ada',
        max_rerank=200,
    )
    for q in (x.questions).split('\n')  # split the questions
    if len(q) > 10                      # remove the empty questions
], axis=1)
ada_results.head()
0    [(132, 27104), (-1, 22939), (8, 2151), (2, 121...
1    [(4, 1737), (0, 130), (8, 744), (96, 17208), (...
2          [(0, 373), (0, 373), (-1, 40610), (1, 570)]
3            [(0, 302), (0, 302), (5, 968), (8, 1425)]
4                      [(0, 167), (0, 167), (2, 1442)]
Name: ada, dtype: object
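The `q[3:]` slice above assumes a single-digit number followed by a period and a space. For a question numbered 10 or higher it would leave a leading space. A regular expression handles any number of digits; `strip_number_prefix` below is a hypothetical helper, shown as a sketch rather than as the notebook's method.

```python
import re

def strip_number_prefix(line):
    """Remove a leading "N." prefix (any number of digits) and the whitespace after it."""
    return re.sub(r"^\s*\d+\.\s*", "", line)

print(repr(strip_number_prefix("3. Who won?")))
print(repr(strip_number_prefix("10. Who hosted?")))
print(repr("10. Who hosted?"[3:]))  # the fixed slice keeps a stray leading space
```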
out = pd.concat([ada_results], axis=1)
out.columns = ['ada']
out.to_csv('olympics-data/search_engine_results.csv')
def expand_lists(out):
    """
    Expand a pandas series containing lists into a series, where each list element becomes a value on its own

    Input is a row per paragraph, which has multiple questions
    Output is a row per question
    """
    cols = [pd.DataFrame(out[name].tolist()).stack().reset_index(level=1, drop=True).rename(name) for name in out.columns]
    return pd.concat(cols, axis=1)
out_expanded = expand_lists(out)
out_expanded['rank'] = out_expanded.ada.apply(lambda x: x[0] if x != [] else -2)
out_expanded['tokens'] = out_expanded.ada.apply(lambda x: x[1] if x != [] else -2)
within_2k = (out_expanded.tokens < 2000).mean()
print(f"{within_2k*100:.1f}% of relevant paragraphs are retrieved within the first 2k tokens")
74.3% of relevant paragraphs are retrieved within the first 2k tokens
On this dataset, the relevant context is retrieved within the first 2,000 tokens about 74% of the time.
outside_200 = (out_expanded['rank'] == -1).mean()
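print(f"{outside_200*100:.1f}% of relevant paragraphs are not retrieved within the first 200 results")
7.4% of relevant paragraphs are not retrieved within the first 200 results
In 7.4% of cases, the keyword-search part of the search algorithm fails to retrieve the relevant context within the first 200 results. In the remaining 18.3% of cases, the semantic search fails to place the relevant context within the first 2,000 tokens.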
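The three percentages partition the questions into outcomes: retrieved within the token budget, retrieved but beyond it, and not retrieved at all. The sketch below makes that bookkeeping explicit on toy data; `retrieval_breakdown` is a hypothetical helper and the toy ranks and token counts are illustrative, not taken from the dataset. It differs slightly from the notebook's `within_2k`, which does not filter on rank.

```python
import pandas as pd

def retrieval_breakdown(ranks, tokens, token_budget=2000):
    """Fraction of questions per outcome; rank -1 = not retrieved, -2 = processing error."""
    ranks = pd.Series(ranks)
    tokens = pd.Series(tokens)
    found = ranks >= 0
    return {
        "within_budget": float((found & (tokens < token_budget)).mean()),
        "beyond_budget": float((found & (tokens >= token_budget)).mean()),
        "not_retrieved": float((ranks == -1).mean()),
        "error": float((ranks == -2).mean()),
    }

toy_ranks = [0, 1, -1, 5, 0, 30, 2, 0, 3, -1]
toy_tokens = [100, 500, 40000, 2500, 300, 9000, 1500, 200, 1900, 30000]
breakdown = retrieval_breakdown(toy_ranks, toy_tokens)
print(breakdown)

# The same bookkeeping applied to the percentages reported above
print(round(100 - 74.3 - 7.4, 1))
```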
import matplotlib.pyplot as plt
# plot a histogram, and add axis descriptions and title
out_expanded[(out_expanded['rank'] >=0)&(out_expanded['rank'] <30)]['rank'].hist(bins=29)
plt.xlabel('rank')
plt.ylabel('count')
plt.title('Histogram of ranks of retrieved paragraphs')
plt.show()
out_expanded[(out_expanded.tokens>=0)&(out_expanded.tokens < 2000)]['tokens'].hist(bins=29)
plt.xlabel('tokens')
plt.ylabel('count')
plt.title('Histogram of the number of minimum tokens needed')
plt.show()
We can observe that the context is most likely to be returned as one of the first results, and most likely within the first 200-500 tokens.
# normalized value_counts
out_expanded['rank'].value_counts(normalize=True).sort_index()[:13]
-2     0.000063
-1     0.074428
 0     0.453420
 1     0.089515
 2     0.047146
 3     0.032437
 4     0.024139
 5     0.019676
 6     0.015967
 7     0.013452
 8     0.011189
 9     0.009869
 10    0.009178
Name: rank, dtype: float64
Probability of the relevant context being returned at each rank. (-2 means a processing error, -1 means the rank is greater than 200.)
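Summing these per-rank probabilities gives the chance that the relevant context appears among the top k results. The sketch below does this for ranks 0 to 10 using the values printed above: roughly 59% of questions have their context in the top 3 results and roughly 72% in the top 10.

```python
from itertools import accumulate

# Probabilities for ranks 0..10, copied from the value_counts output above
rank_probs = [0.453420, 0.089515, 0.047146, 0.032437, 0.024139,
              0.019676, 0.015967, 0.013452, 0.011189, 0.009869, 0.009178]

# cumulative[k - 1] is the probability that the context is among the top k results
cumulative = list(accumulate(rank_probs))

for k in (1, 3, 5, 10):
    print(f"top-{k}: {cumulative[k - 1]:.3f}")
```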