Oct 20, 2023

Named Entity Recognition to Enrich Text

Named Entity Recognition (NER) is a Natural Language Processing task that identifies named entities (NEs) and classifies them into predefined semantic categories (such as persons, organizations, locations, events, time expressions, and quantities). By converting raw text into structured information, NER makes data more actionable, facilitating tasks like information extraction, data aggregation, analytics, and social media monitoring.

This notebook demonstrates how to carry out Named Entity Recognition with chat completion function calling, enriching a text with links to a knowledge base such as Wikipedia:

Text:

In Germany, in 1440, goldsmith Johannes Gutenberg invented the movable-type printing press. His work led to an information revolution and the unprecedented mass-spread of literature throughout Europe. Modelled on the design of the existing screw presses, a single Renaissance movable-type printing press could produce up to 3,600 pages per workday.

Text enriched with Wikipedia links:

In [Germany](https://en.wikipedia.org/wiki/Germany), in 1440, goldsmith [Johannes Gutenberg](https://en.wikipedia.org/wiki/Johannes_Gutenberg) invented the [movable-type printing press](https://en.wikipedia.org/wiki/Printing_press). His work led to an information revolution and the unprecedented mass-spread of literature throughout [Europe](https://en.wikipedia.org/wiki/Europe). Modelled on the design of the existing screw presses, a single [Renaissance](https://en.wikipedia.org/wiki/Renaissance) [movable-type printing press](https://en.wikipedia.org/wiki/Printing_press) could produce up to 3,600 pages per workday.

Inference costs: the notebook also illustrates how to estimate OpenAI API inference costs.

%pip install --upgrade openai --quiet
%pip install --upgrade nlpia2-wikipedia --quiet
%pip install --upgrade tenacity --quiet
Note: you may need to restart the kernel to use updated packages.

This notebook works with the latest OpenAI models gpt-3.5-turbo-0613 and gpt-4-0613.

import json
import logging
import os

import openai
import wikipedia

from typing import Optional
from IPython.display import display, Markdown
from tenacity import retry, wait_random_exponential, stop_after_attempt

logging.basicConfig(level=logging.INFO, format=' %(asctime)s - %(levelname)s - %(message)s')

OPENAI_MODEL = 'gpt-3.5-turbo-0613'

client = openai.OpenAI(api_key=os.environ.get("OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"))

We define a standard set of NER labels to showcase a wide range of use cases. However, for our specific task of enriching text with knowledge base links, only a subset is actually required.

labels = [
    "person",      # people, including fictional characters
    "fac",         # buildings, airports, highways, bridges
    "org",         # organizations, companies, agencies, institutions
    "gpe",         # geopolitical entities like countries, cities, states
    "loc",         # non-gpe locations
    "product",     # vehicles, foods, apparel, appliances, software, toys
    "event",       # named sports, scientific milestones, historical events
    "work_of_art", # titles of books, songs, movies
    "law",         # named laws, acts, or legislations
    "language",    # any named language
    "date",        # absolute or relative dates or periods
    "time",        # time units smaller than a day
    "percent",     # percentage (e.g., "twenty percent", "18%")
    "money",       # monetary values, including unit
    "quantity",    # measurements, e.g., weight or distance
]

The chat completions API takes a list of messages as input and returns a model-generated message as output. While the chat format is designed primarily for multi-turn conversations, it is equally effective for single-turn tasks without any prior conversation. For our purposes, we will specify one message each for the system, assistant, and user roles.

The system message (prompt) sets the assistant's behavior by defining its desired persona and task. We also delineate the specific set of entity labels we aim to identify.

Although one can instruct the model to format its response, it has to be noted that both gpt-3.5-turbo-0613 and gpt-4-0613 have been fine-tuned to discern when a function should be called, and to reply with JSON formatted according to the function signature. This capability streamlines our prompt and allows us to receive structured data directly from the model.

def system_message(labels):
    return f"""
You are an expert in Natural Language Processing. Your task is to identify common Named Entities (NER) in a given text.
The possible common Named Entities (NER) types are exclusively: ({", ".join(labels)})."""

Assistant messages usually store previous assistant responses. In our scenario, however, they can also be crafted to provide examples of the desired behavior. While OpenAI models are capable of zero-shot NER, we found that a one-shot approach produces more precise results.

def assisstant_message():
    return f"""
EXAMPLE:
    Text: 'In Germany, in 1440, goldsmith Johannes Gutenberg invented the movable-type printing press. His work led to an information revolution and the unprecedented mass-spread / 
    of literature throughout Europe. Modelled on the design of the existing screw presses, a single Renaissance movable-type printing press could produce up to 3,600 pages per workday.'
    {{
        "gpe": ["Germany", "Europe"],
        "date": ["1440"],
        "person": ["Johannes Gutenberg"],
        "product": ["movable-type printing press"],
        "event": ["Renaissance"],
        "quantity": ["3,600 pages"],
        "time": ["workday"]
    }}
--"""

The user message provides the specific text for the assistant task:

def user_message(text):
    return f"""
TASK:
    Text: {text}
"""

In an OpenAI API call, we can describe functions to gpt-3.5-turbo-0613 and gpt-4-0613 and have the model intelligently choose to output a JSON object containing the arguments needed to call those functions. Note that the chat completions API does not actually execute the function; instead, it provides the JSON output, which we can then use to call the function in our own code. For more details, refer to the OpenAI Function Calling Guide.
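Concretely, the arguments arrive as a JSON string that our own code must parse and dispatch to the actual Python function. A minimal sketch of that step, where the argument string and the stub dispatch table are purely illustrative:

```python
import json

# Illustrative: the shape of response_message.tool_calls[0].function.arguments
raw_arguments = '{"person": ["Johannes Gutenberg"], "gpe": ["Germany"]}'

# Dispatch table mapping the function name chosen by the model to real code
# (a stub stands in for the real enrich_entities here)
available_functions = {"enrich_entities": lambda text, label_entities: label_entities}

function_args = json.loads(raw_arguments)
result = available_functions["enrich_entities"]("some text", function_args)
print(result["gpe"])  # ['Germany']
```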

Our function, enrich_entities(text, label_entities), takes a block of text and a dictionary of identified labels and entities as parameters. It then associates the recognized entities with links to their corresponding Wikipedia articles.

@retry(wait=wait_random_exponential(min=1, max=10), stop=stop_after_attempt(5))
def find_link(entity: str) -> Optional[str]:
    """
    Finds a Wikipedia link for a given entity.
    """
    try:
        titles = wikipedia.search(entity)
        if titles:
            # naively consider the first result as the best
            page = wikipedia.page(titles[0])
            return page.url
    except (wikipedia.exceptions.WikipediaException) as ex:
        logging.error(f'Error occurred while searching for Wikipedia link for entity {entity}: {str(ex)}')

    return None
def find_all_links(label_entities:dict) -> dict:
    """ 
    Finds all Wikipedia links for the dictionary entities in the whitelist label list.
    """
    whitelist = ['event', 'gpe', 'org', 'person', 'product', 'work_of_art']
    
    return {e: find_link(e) for label, entities in label_entities.items() 
                            for e in entities
                            if label in whitelist}
def enrich_entities(text: str, label_entities: dict) -> str:
    """
    Enriches text with knowledge base links.
    """
    entity_link_dict = find_all_links(label_entities)
    logging.info(f"entity_link_dict: {entity_link_dict}")
    
    for entity, link in entity_link_dict.items():
        text = text.replace(entity, f"[{entity}]({link})")

    return text
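The replacement step on its own can be exercised without any Wikipedia lookups. A small sketch with a hard-coded link dictionary:

```python
sample = "In 1440, Johannes Gutenberg invented the movable-type printing press."
links = {"Johannes Gutenberg": "https://en.wikipedia.org/wiki/Johannes_Gutenberg"}

# Wrap each entity occurrence in a markdown link, as enrich_entities does
for entity, link in links.items():
    sample = sample.replace(entity, f"[{entity}]({link})")

print(sample)
# In 1440, [Johannes Gutenberg](https://en.wikipedia.org/wiki/Johannes_Gutenberg) invented the movable-type printing press.
```

Note that this naive str.replace approach rewrites every occurrence of an entity and can produce nested links when one entity is a substring of another (e.g. "Liverpool" inside "Liverpool F.C."); a production implementation would want boundary-aware matching.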

As previously highlighted, gpt-3.5-turbo-0613 and gpt-4-0613 have been fine-tuned to detect when a function should be called. Moreover, they can produce a JSON response that conforms to the function signature. Here is the process we follow:

  1. Define our function and its associated JSON Schema.
  2. Call the model with the messages, tools, and tool_choice parameters.
  3. Convert the output into a JSON object, then call the function with the arguments provided by the model.

In practice, one might want to re-invoke the model again by appending the function response as a new message, letting the model summarize the results back to the user. Nevertheless, for our purposes, this step is not needed.
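If that follow-up round-trip were desired, the function result would be appended as a tool message, echoing the tool_call_id from the model's response, before calling the model again. A sketch with illustrative values:

```python
def append_tool_result(messages: list, tool_call_id: str, content: str) -> list:
    """Append a tool-result message so a second chat.completions call can summarize it."""
    return messages + [{"role": "tool", "tool_call_id": tool_call_id, "content": content}]

followup_messages = append_tool_result(
    [{"role": "user", "content": "Enrich this text..."}],
    tool_call_id="call_abc123",          # illustrative; taken from response_message.tool_calls[0].id
    content="enriched text with links",  # the return value of enrich_entities
)
# followup_messages would then be passed back to client.chat.completions.create(...)
```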

Note that, in a real-world scenario, it is strongly recommended to build in a user confirmation flow before taking action.

Since we want the model to output a dictionary of labels and recognized entities:

{   
    "gpe": ["Germany", "Europe"],   
    "date": ["1440"],   
    "person": ["Johannes Gutenberg"],   
    "product": ["movable-type printing press"],   
    "event": ["Renaissance"],   
    "quantity": ["3,600 pages"],   
    "time": ["workday"]   
}   

we need to define the corresponding JSON schema to be passed to the tools parameter:

def generate_functions(labels: list) -> list:
    return [
        {
            "type": "function",
            "function": {
                "name": "enrich_entities",
                "description": "Enrich Text with Knowledge Base Links",
                "parameters": {
                    "type": "object",
                    "properties": {
                        # one array-of-strings property per NER label
                        label: {
                            "type": "array",
                            "items": {
                                "type": "string"
                            }
                        }
                        for label in labels
                    },
                    "additionalProperties": False
                },
            }
        }
    ]
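For reference, the parameters schema for two labels amounts to the following hand-written sketch, constraining each label key to an array of strings:

```python
# Hand-written equivalent of the schema for labels ["person", "gpe"]
parameters_schema = {
    "type": "object",
    "properties": {
        "person": {"type": "array", "items": {"type": "string"}},
        "gpe": {"type": "array", "items": {"type": "string"}},
    },
    "additionalProperties": False,
}

# A conforming arguments object the model could return:
example_args = {"person": ["Johannes Gutenberg"], "gpe": ["Germany"]}
assert set(example_args) <= set(parameters_schema["properties"])
```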

We can now invoke the model. Note that we instruct the API to use a specific function by setting the tool_choice parameter to {"type": "function", "function" : {"name": "enrich_entities"}}.

@retry(wait=wait_random_exponential(min=1, max=10), stop=stop_after_attempt(5))
def run_openai_task(labels, text):
    messages = [
          {"role": "system", "content": system_message(labels=labels)},
          {"role": "assistant", "content": assisstant_message()},
          {"role": "user", "content": user_message(text=text)}
      ]

    response = client.chat.completions.create(
        model=OPENAI_MODEL,
        messages=messages,
        tools=generate_functions(labels),
        tool_choice={"type": "function", "function": {"name": "enrich_entities"}},
        temperature=0,
        frequency_penalty=0,
        presence_penalty=0,
    )

    response_message = response.choices[0].message
    
    available_functions = {"enrich_entities": enrich_entities}  
    function_name = response_message.tool_calls[0].function.name
    
    function_to_call = available_functions[function_name]
    logging.info(f"function_to_call: {function_to_call}")

    function_args = json.loads(response_message.tool_calls[0].function.arguments)
    logging.info(f"function_args: {function_args}")

    function_response = function_to_call(text, function_args)

    return {"model_response": response, 
            "function_response": function_response}
text = """The Beatles were an English rock band formed in Liverpool in 1960, comprising John Lennon, Paul McCartney, George Harrison, and Ringo Starr."""
result = run_openai_task(labels, text)
 2023-10-20 18:05:51,729 - INFO - function_to_call: <function enrich_entities at 0x0000021D30C462A0>
 2023-10-20 18:05:51,730 - INFO - function_args: {'person': ['John Lennon', 'Paul McCartney', 'George Harrison', 'Ringo Starr'], 'org': ['The Beatles'], 'gpe': ['Liverpool'], 'date': ['1960']}
 2023-10-20 18:06:09,858 - INFO - entity_link_dict: {'John Lennon': 'https://en.wikipedia.org/wiki/John_Lennon', 'Paul McCartney': 'https://en.wikipedia.org/wiki/Paul_McCartney', 'George Harrison': 'https://en.wikipedia.org/wiki/George_Harrison', 'Ringo Starr': 'https://en.wikipedia.org/wiki/Ringo_Starr', 'The Beatles': 'https://en.wikipedia.org/wiki/The_Beatles', 'Liverpool': 'https://en.wikipedia.org/wiki/Liverpool'}
display(Markdown(f"""**Text:** {text}   
                     **Enriched_Text:** {result['function_response']}"""))
**Text:** The Beatles were an English rock band formed in Liverpool in 1960, comprising John Lennon, Paul McCartney, George Harrison, and Ringo Starr.
**Enriched_Text:** [The Beatles](https://en.wikipedia.org/wiki/The_Beatles) were an English rock band formed in [Liverpool](https://en.wikipedia.org/wiki/Liverpool) in 1960, comprising [John Lennon](https://en.wikipedia.org/wiki/John_Lennon), [Paul McCartney](https://en.wikipedia.org/wiki/Paul_McCartney), [George Harrison](https://en.wikipedia.org/wiki/George_Harrison), and [Ringo Starr](https://en.wikipedia.org/wiki/Ringo_Starr).

To estimate the inference costs, we can parse the response's "usage" field. Detailed token costs per model are available in the OpenAI Pricing Guide:

# estimate inference cost assuming gpt-3.5-turbo (4K context)
i_tokens  = result["model_response"].usage.prompt_tokens 
o_tokens = result["model_response"].usage.completion_tokens 

i_cost = (i_tokens / 1000) * 0.0015
o_cost = (o_tokens / 1000) * 0.002

print(f"""Token Usage
    Prompt: {i_tokens} tokens
    Completion: {o_tokens} tokens
    Cost estimation: ${round(i_cost + o_cost, 5)}""")
Token Usage
    Prompt: 331 tokens
    Completion: 47 tokens
    Cost estimation: $0.00059