2023年6月16日

使用搜索API和重新排序进行问答

,

在浩瀚信息中寻找相关内容有时就像大海捞针,但别灰心,GPTs其实可以帮我们完成大部分工作。本指南探讨了一种利用多种AI技术增强现有搜索系统的方法,帮助我们过滤噪音信息。

GPT检索信息的两种方式为:

  1. 模拟人类浏览行为: GPT触发搜索,评估结果,并在必要时修改搜索查询。它还可以跟进特定的搜索结果,形成类似人类用户的思维链。
  2. 基于嵌入的检索: 为你的内容和用户查询计算嵌入向量,然后通过余弦相似度度量检索出最相关的内容。这项技术被谷歌等搜索引擎广泛使用

These approaches are both promising, but each has their shortcomings: the first one can be slow due to its iterative nature and the second one requires embedding your entire knowledge base in advance, continuously embedding new content and maintaining a vector database.

通过结合这些方法,并从重新排序技术中汲取灵感,我们找到了一种折中方案。这种方法可以在任何现有搜索系统之上实现,例如Slack搜索API,或是包含私有数据的内部ElasticSearch实例。其工作原理如下:

search_augmented_by_query_generation_and_embeddings_reranking.png

步骤1:搜索

  1. 用户提出一个问题。
  2. GPT生成一系列潜在的查询。
  3. 搜索查询是并行执行的。

步骤2:重新排序

  1. 每个结果的嵌入向量用于计算与用户问题生成的假设理想答案之间的语义相似度。
  2. 结果根据此相似度指标进行排序和筛选。

步骤3:回答

  1. 根据搜索结果,模型会生成用户问题的答案,包括参考文献和链接。

这种混合方法提供了相对较低的延迟,并且可以集成到任何现有的搜索端点中,无需维护向量数据库。让我们深入了解它!我们将以News API为例进行搜索。

设置

除了你的OPENAI_API_KEY外,你还需要在环境中包含一个NEWS_API_KEY。你可以在此获取API密钥。

%%capture
%env NEWS_API_KEY = YOUR_NEWS_API_KEY
# Dependencies
from datetime import date, timedelta  # date handling for fetching recent news
from IPython import display  # for pretty printing
import json  # for parsing the JSON api responses and model outputs
from numpy import dot  # for cosine similarity
from openai import OpenAI
import os  # for loading environment variables
import requests  # for making the API requests
from tqdm.notebook import tqdm  # for printing progress bars

client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

# Load environment variables
news_api_key = os.getenv("NEWS_API_KEY")

GPT_MODEL = "gpt-3.5-turbo"


# Helper functions
def json_gpt(input: str):
    completion = client.chat.completions.create(model=GPT_MODEL,
    messages=[
        {"role": "system", "content": "Output only valid JSON"},
        {"role": "user", "content": input},
    ],
    temperature=0.5)

    text = completion.choices[0].message.content
    parsed = json.loads(text)

    return parsed


def embeddings(input: list[str]) -> list[list[str]]:
    response = client.embeddings.create(model="text-embedding-3-small", input=input)
    return [data.embedding for data in response.data]

一切都始于用户的一个问题。

# User asks a question
USER_QUESTION = "Who won the NBA championship? And who was the MVP? Tell me a bit about the last game."

现在,为了尽可能详尽全面,我们使用该模型基于此问题生成一系列多样化查询。

QUERIES_INPUT = f"""
You have access to a search API that returns recent news articles.
Generate an array of search queries that are relevant to this question.
Use a variation of related keywords for the queries, trying to be as general as possible.
Include as many queries as you can think of, including and excluding terms.
For example, include queries like ['keyword_1 keyword_2', 'keyword_1', 'keyword_2'].
Be creative. The more queries you include, the more likely you are to find relevant results.

User question: {USER_QUESTION}

Format: {{"queries": ["query_1", "query_2", "query_3"]}}
"""

queries = json_gpt(QUERIES_INPUT)["queries"]

# Let's include the original question as well for good measure
queries.append(USER_QUESTION)

queries
['NBA championship winner',
 'MVP of NBA championship',
 'Last game of NBA championship',
 'NBA finals winner',
 'Most valuable player of NBA championship',
 'Finals game of NBA',
 'Who won the NBA finals',
 'NBA championship game summary',
 'NBA finals MVP',
 'Champion of NBA playoffs',
 'NBA finals last game highlights',
 'NBA championship series result',
 'NBA finals game score',
 'NBA finals game recap',
 'NBA champion team and player',
 'NBA finals statistics',
 'NBA championship final score',
 'NBA finals best player',
 'NBA playoffs champion and MVP',
 'NBA finals game analysis',
 'Who won the NBA championship? And who was the MVP? Tell me a bit about the last game.']

查询看起来不错,让我们执行搜索。

def search_news(
    query: str,
    news_api_key: str = news_api_key,
    num_articles: int = 50,
    from_datetime: str = "2023-06-01",  # the 2023 NBA finals were played in June 2023
    to_datetime: str = "2023-06-30",
) -> dict:
    response = requests.get(
        "https://newsapi.org/v2/everything",
        params={
            "q": query,
            "apiKey": news_api_key,
            "pageSize": num_articles,
            "sortBy": "relevancy",
            "from": from_datetime,
            "to": to_datetime,
        },
    )

    return response.json()


articles = []

for query in tqdm(queries):
    result = search_news(query)
    if result["status"] == "ok":
        articles = articles + result["articles"]
    else:
        raise Exception(result["message"])

# remove duplicates
articles = list({article["url"]: article for article in articles}.values())

print("Total number of articles:", len(articles))
print("Top 5 articles of query 1:", "\n")

for article in articles[0:5]:
    print("Title:", article["title"])
    print("Description:", article["description"])
    print("Content:", article["content"][0:100] + "...")
    print()
  0%|          | 0/21 [00:00<?, ?it/s]
Total number of articles: 554
Top 5 articles of query 1: 

Title: Nascar takes on Le Mans as LeBron James gets centenary race under way
Description: <ul><li>Nascar has presence at iconic race for first time since 1976</li><li>NBA superstar LeBron James waves flag as honorary starter</li></ul>The crowd chanted “U-S-A! U-S-A!” as Nascar driver lineup for the 24 Hours of Le Mans passed through the city cente…
Content: The crowd chanted U-S-A! U-S-A! as Nascar driver lineup for the 24 Hours of Le Mans passed through t...

Title: NBA finals predictions: Nuggets or Heat? Our writers share their picks
Description: Denver or Miami? Our contributors pick the winner, key players and dark horses before the NBA’s grand finale tips offA lot has been made of the importance of a balanced roster with continuity, but, somehow, still not enough. The Nuggets are the prime example …
Content: The Nuggets are here because 
A lot has been made of the importance of a balanced roster with conti...

Title: Unboxing: Michelob ULTRA and Artist Futura Enshrine the NBA Championship In Custom Hand-Painted Bottles
Description: As the 2022-2023 NBA Championship nears the end, Michelob ULTRA brings joy to sports fans who will gather to watch the showdown between the Denver Nuggets and Miami Heat. The beermaker teamed up with artist Futura to remix its newly-designed 2023 Champ Bottle…
Content: As the 2022-2023 NBA Championship nears the end, Michelob ULTRA brings joy to sports fans who will g...

Title: Futura and Michelob ULTRA Toast to the NBA Finals With Abstract Artwork Crafted From the Brand’s 2023 Limited-Edition Championship Bottles
Description: The sun is out to play, and so is Michelob ULTRA. With the 2022-2023 NBA Finals underway, the beermaker is back with its celebratory NBA Champ Bottles. This year, the self-proclaimed MVP of joy is dropping a limited-edition bottle made in collaboration with a…
Content: The sun is out to play, and so is Michelob ULTRA. With the 2022-2023 NBA Finals underway, the beerma...

Title: Signed and Delivered, Futura and Michelob ULTRA Will Gift Hand-Painted Bottles to This Year’s NBA Championship Team
Description: Michelob ULTRA, the MVP of joy and official beer sponsor of the NBA is back to celebrate with basketball lovers and sports fans around the globe as the NBA 2022-2023 season comes to a nail-biting close. In collaboration with artist Futura, Michelob ULTRA will…
Content: Michelob ULTRA, the MVP of joy and official beer sponsor of the NBA is back to celebrate with basket...

我们可以看到,搜索查询通常会返回大量结果,其中许多与用户最初提出的问题并不相关。为了提高最终答案的质量,我们使用嵌入技术对结果进行重新排序和过滤。

2. 重新排序

受到HyDE (Gao et al.)的启发,我们首先生成一个假设的理想答案,用于重新排序并将我们的结果与之比较。这有助于优先选择看起来像优质答案的结果,而不是那些与我们的问题相似的结果。以下是我们用于生成假设答案的提示。

HA_INPUT = f"""
Generate a hypothetical answer to the user's question. This answer will be used to rank search results. 
Pretend you have all the information you need to answer, but don't use any actual facts. Instead, use placeholders
like NAME did something, or NAME said something at PLACE. 

User question: {USER_QUESTION}

Format: {{"hypotheticalAnswer": "hypothetical answer text"}}
"""

hypothetical_answer = json_gpt(HA_INPUT)["hypotheticalAnswer"]

hypothetical_answer
'The NBA championship was won by TEAM NAME. The MVP was awarded to PLAYER NAME. The last game was held at STADIUM NAME, where both teams played with great energy and enthusiasm. It was a close game, but in the end, TEAM NAME emerged victorious.'

现在,让我们为搜索结果和假设答案生成嵌入向量。然后计算这些嵌入向量之间的余弦距离,从而获得语义相似度指标。请注意,由于OpenAI嵌入向量在API中返回时已归一化,我们可以直接计算点积来代替完整的余弦相似度计算。

hypothetical_answer_embedding = embeddings(hypothetical_answer)[0]
article_embeddings = embeddings(
    [
        f"{article['title']} {article['description']} {article['content'][0:100]}"
        for article in articles
    ]
)

# Calculate cosine similarity
cosine_similarities = []
for article_embedding in article_embeddings:
    cosine_similarities.append(dot(hypothetical_answer_embedding, article_embedding))

cosine_similarities[0:10]
[0.7854456526852069,
 0.8086023500072106,
 0.8002998147018501,
 0.7961229569526956,
 0.798354506673743,
 0.758216458795653,
 0.7753754083127359,
 0.7494958338411927,
 0.804733946801739,
 0.8405965885235218]

最后,我们使用这些相似度分数对结果进行排序和筛选。

scored_articles = zip(articles, cosine_similarities)

# Sort articles by cosine similarity
sorted_articles = sorted(scored_articles, key=lambda x: x[1], reverse=True)

# Print top 5 articles
print("Top 5 articles:", "\n")

for article, score in sorted_articles[0:5]:
    print("Title:", article["title"])
    print("Description:", article["description"])
    print("Content:", article["content"][0:100] + "...")
    print("Score:", score)
    print()
Top 5 articles: 

Title: NBA Finals: Denver Nuggets beat Miami Hea, lift thier first-ever NBA title
Description: Denver Nuggets won their maiden NBA Championship trophy defeating Miami Heat 94-89 in Game 5 of the NBA Final held on Tuesday at the Ball Arena in Denver
Content: Denver Nuggets won their maiden NBA Championship trophy defeating Miami Heat 94-89 in Game 5 of the ...
Score: 0.8445817523602124

Title: Photos: Denver Nuggets celebrate their first NBA title
Description: The Nuggets capped off an impressive postseason by beating the Miami Heat in the NBA Finals.
Content: Thousands of supporters watched along the streets of Denver, Colorado as the US National Basketball ...
Score: 0.842070667753606

Title: Denver Nuggets win first NBA championship title in Game 5 victory over Miami Heat
Description: The Denver Nuggets won their first NBA championship Monday night, downing the Miami Heat 94-89 at Ball Arena in Denver to take Game 5 of the NBA Finals.
Content: The Denver Nuggets won their first NBA championship Monday night, downing the Miami Heat 94-89 at Ba...
Score: 0.8409346078172385

Title: Denver Nuggets Capture Their First NBA Championship Behind Unbreakable Chemistry
Description: After 47 years of waiting, the Denver Nuggets are NBA champions. Led by Nikola Jokic and Jamal Murray, they reached the mountain top by staying true to themselves.
Content: DENVER, CO - JUNE 12: Jamal Murray (27) of the Denver Nuggets celebrates as he leaves the court ... ...
Score: 0.8405965885235218

Title: NBA Finals: Nikola Jokic, Denver Nuggets survive Miami Heat to secure franchise's first NBA championship
Description: In a rock-fight of a Game 5, the Denver Nuggets reached the NBA mountaintop from the foothills of the Rockies, winning their first-ever championship and setting Nikola Jokic's legacy as an all-timer in stone.
Content: DENVER, COLORADO - JUNE 12: Jamal Murray #27 of the Denver Nuggets reacts during the fourth quarter ...
Score: 0.8389716330890262

太棒了!这些结果看起来与我们最初的查询更加相关。现在,让我们使用前5个结果来生成最终答案。

3. 回答

formatted_top_results = [
    {
        "title": article["title"],
        "description": article["description"],
        "url": article["url"],
    }
    for article, _score in sorted_articles[0:5]
]

ANSWER_INPUT = f"""
Generate an answer to the user's question based on the given search results. 
TOP_RESULTS: {formatted_top_results}
USER_QUESTION: {USER_QUESTION}

Include as much information as possible in the answer. Reference the relevant search result urls as markdown links.
"""

completion = client.chat.completions.create(
    model=GPT_MODEL,
    messages=[{"role": "user", "content": ANSWER_INPUT}],
    temperature=0.5,
    stream=True,
)

text = ""
for chunk in completion:
    text += chunk.choices[0].delta.content
    display.clear_output(wait=True)
    display.display(display.Markdown(text))
<IPython.core.display.Markdown object>