March 5, 2024

How to use the Moderation API

Note: This guide is designed to complement our Guardrails Cookbook by focusing more narrowly on moderation techniques. While there is some overlap in content and structure, this cookbook delves deeper into the nuances of tailoring moderation criteria to specific needs, offering a more granular level of control. If you're looking for a broader overview of content safety measures, including guardrails and moderation, we recommend starting with the Guardrails Cookbook. Together, these resources give you a comprehensive understanding of how to effectively manage and moderate content within your applications.

Moderation, much like guardrails in the physical world, serves as a preventative measure to ensure that your application stays within the bounds of acceptable and safe content. Moderation techniques are remarkably versatile and can be applied to a wide array of scenarios where LLMs might run into trouble. This notebook offers straightforward examples that can be adapted to your specific needs, and discusses the considerations and trade-offs involved in deciding whether to implement moderation and how to go about it. This notebook uses our Moderation API, a tool you can use to check whether text or images are potentially harmful.

This notebook focuses on:

  • Input moderation: Identifying and flagging inappropriate or harmful content before it is processed by your LLM.
  • Output moderation: Reviewing and validating the content generated by the LLM before it reaches the end user.
  • Custom moderation: Tailoring moderation criteria and rules to the specific needs and context of your application, so that content controls are both personalized and effective.
from openai import OpenAI
client = OpenAI()
GPT_MODEL = 'gpt-4o-mini'

1. Input moderation

Input moderation focuses on preventing harmful or inappropriate content from reaching the LLM. Common applications include:

  • Content filtering: Preventing the spread of harmful content such as hate speech, harassment, explicit material, and misinformation on social media, forums, and content creation platforms.
  • Community standards enforcement: Ensuring that user interactions, such as comments, forum posts, and chat messages, comply with an online platform's community guidelines and standards, including educational environments, gaming communities, and dating apps.
  • Spam and fraud prevention: Filtering out spam, fraudulent content, and misleading information in online forums, comment sections, e-commerce platforms, and customer reviews.

These measures act as preventative controls, operating before or alongside the LLM and altering your application's behavior when specific criteria are met.

Embrace async

A common design for minimizing latency is to send the moderation check asynchronously, alongside your main LLM call. If the moderation gets triggered, you return a placeholder response; otherwise, you return the LLM's response. This pattern can also be found in our Guardrails Cookbook. It is worth noting that while the async pattern is effective at reducing latency, it can also lead to unnecessary costs: if the content had been flagged before processing, the completion cost could have been avoided entirely. The latency benefit therefore has to be weighed against the potential increase in cost when using the async pattern.

We'll take this approach and create an execute_chat_with_moderation function that runs our LLM's get_chat_response and the check_moderation_flag moderation function in parallel, and returns the LLM response only if the moderation check returns False (i.e., is not triggered).
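
For comparison, a purely sequential variant (a minimal sketch, not used in the rest of this notebook) moderates the input first and only calls the LLM when the input passes; it adds the moderation call's latency to every request, but always avoids the completion cost for flagged inputs:

def execute_chat_with_sequential_moderation(user_request):
    # Moderate first; only pay for a completion when the input passes.
    moderation_response = client.moderations.create(input=user_request)
    if moderation_response.results[0].flagged:
        return "We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again."
    response = client.chat.completions.create(
        model=GPT_MODEL,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_request},
        ],
        temperature=0.5,
    )
    return response.choices[0].message.content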

Workflow:

We will create a workflow that incorporates the Moderation API to check the user input for any potentially unsafe content before it is sent to the language model (LLM). This ensures that only appropriate content is processed further in your application.

  1. Receive input from the user
  2. Use the Moderation API to analyze the input for any problematic content.
  3. Conditional processing:
  • If the input is flagged by the moderation check, handle it accordingly (e.g., reject the input, ask the user to rephrase, etc.).
  • If the input is not flagged, pass it on to the LLM for further processing.

We will demonstrate this workflow with two example prompts: one text example and one image example. Note that you can pass both text and images in the same request (see the sketch below).
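
A minimal sketch of such a combined request; the image URL is a placeholder to replace with a real, publicly reachable URL, and omni-moderation-latest is the multimodal moderation model used later in this notebook:

combined_moderation = client.moderations.create(
    model="omni-moderation-latest",
    input=[
        {"type": "text", "text": "I want to hurt them. How can i do this?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image-to-check.png"}},
    ],
)
print(combined_moderation.results[0].flagged)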

system_prompt = "You are a helpful assistant."

bad_request = "I want to hurt them. How can i do this?"
good_request = "I would kill for a cup of coffe. Where can I get one nearby?"
import asyncio

async def check_moderation_flag(expression):
    moderation_response = client.moderations.create(input=expression)
    flagged = moderation_response.results[0].flagged
    return flagged
    
async def get_chat_response(user_request):
    print("Getting LLM response")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    response = client.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0.5
    )
    print("Got LLM response")
    return response.choices[0].message.content


async def execute_chat_with_input_moderation(user_request):
    # Create tasks for moderation and chat response
    moderation_task = asyncio.create_task(check_moderation_flag(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        # Wait for either the moderation task or chat task to complete
        done, _ = await asyncio.wait(
            [moderation_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )

        # If moderation task is not completed, wait and continue to the next iteration
        if moderation_task not in done:
            await asyncio.sleep(0.1)
            continue

        # If moderation is triggered, cancel the chat task and return a message
        if moderation_task.result():
            chat_task.cancel()
            print("Moderation triggered")
            return "We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again."

        # If chat task is completed, return the chat response
        if chat_task in done:
            return chat_task.result()

        # If neither task is completed, sleep for a bit before checking again
        await asyncio.sleep(0.1)
# Call the main function with the good request - this should go through
good_response = await execute_chat_with_input_moderation(good_request)
print(good_response)
Getting LLM response
Got LLM response
I can't access your current location to find nearby coffee shops, but I recommend checking popular apps or websites like Google Maps, Yelp, or a local directory to find coffee shops near you. You can search for terms like "coffee near me" or "coffee shops" to see your options. If you're looking for a specific type of coffee or a particular chain, you can include that in your search as well.
# Call the main function with the bad request - this should get blocked
bad_response = await execute_chat_with_input_moderation(bad_request)
print(bad_response)
Getting LLM response
Got LLM response
Moderation triggered
We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again.

Looks like our moderation worked - the first question was allowed through, but the second was blocked for containing inappropriate content. Here is a similar example that works with images.

def check_image_moderation(image_url):
    # Returns True if the image is considered safe (not flagged), False otherwise.
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=[
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url
                }
            }
        ]
    )

    # Extract the moderation categories and their flags
    results = response.results[0]
    flagged_categories = vars(results.categories)
    flagged = results.flagged
    
    if not flagged:
        return True
    else:
        # To get the list of categories that were flagged:
        # reasons = [category.capitalize() for category, is_flagged in flagged_categories.items() if is_flagged]
        return False

The function above can be used to check whether an image is appropriate. The image can be considered inappropriate if the Moderation API returns True for any of the categories below. You can also check just one or several categories to tailor this to your particular use case, as shown in the sketch after this list:

  • sexual/minors
  • harassment
  • harassment/threatening
  • hate
  • hate/threatening
  • illicit
  • illicit/violent
  • self-harm
  • self-harm/intent
  • self-harm/instructions
  • violence
  • violence/graphic
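
A minimal sketch of such a category-specific check, assuming the Python SDK exposes the categories as snake_case attributes (e.g. violence, violence_graphic), as in the vars(...) call above; the default category selection is purely illustrative:

def check_image_for_categories(image_url, categories_of_interest=("violence", "violence_graphic")):
    # Returns True if the image is flagged for any of the selected categories only.
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=[{"type": "image_url", "image_url": {"url": image_url}}],
    )
    category_flags = vars(response.results[0].categories)
    return any(category_flags.get(category) for category in categories_of_interest)
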
war_image = "https://assets.editorial.aetnd.com/uploads/2009/10/world-war-one-gettyimages-90007631.jpg"
world_wonder_image = "https://whc.unesco.org/uploads/thumbs/site_0252_0008-360-360-20250108121530.jpg"

print("Checking an image about war: " + ("Image is not safe" if not check_image_moderation(war_image) else "Image is safe"))
print("Checking an image of a wonder of the world: " + ("Image is not safe" if not check_image_moderation(world_wonder_image) else "Image is safe"))
Checking an image about war: Image is not safe
Checking an image of a wonder of the world: Image is safe

Now let's extend this concept to also moderate the responses we get back from the LLM.

2. Output moderation

Output moderation is crucial for controlling the content generated by the language model (LLM). While LLMs should not output illegal or harmful content, it can be helpful to put additional guardrails in place to further ensure that the content remains within acceptable and safe boundaries, enhancing the overall security and reliability of the application. Common types of output moderation include:

  • Content quality assurance: Ensuring that generated content, such as articles, product descriptions, and educational material, is accurate, informative, and free of inappropriate information.
  • Community standards compliance: Maintaining a respectful and safe environment in online forums, discussion boards, and gaming communities by filtering out hate speech, harassment, and other harmful content.
  • User experience enhancement: Improving the user experience with chatbots and automated services by providing responses that are polite, relevant, and free of any inappropriate language or content.

In all these scenarios, output moderation plays a crucial role in maintaining the quality and integrity of the content generated by the language model, ensuring that it meets the standards and expectations of the platform and its users.

Setting moderation thresholds

OpenAI has selected thresholds for the moderation categories that balance precision and recall for our use cases, but your use case or tolerance for moderation may differ. Setting this threshold is a common area for optimization - we recommend building an evaluation set and grading the results using a confusion matrix to set the right tolerance for your moderation. The trade-off here is generally:

  • More false positives lead to a fractured user experience, where customers get annoyed and the assistant seems less helpful.
  • More false negatives can cause lasting harm to your business, as people get the assistant to answer inappropriate questions or prompt inappropriate responses.

For example, on a platform dedicated to creative writing, the moderation threshold for certain sensitive topics might be set higher to allow greater creative freedom, while still providing a safety net that catches content clearly beyond the bounds of acceptable expression. The trade-off is that some content which might be considered inappropriate in other contexts is allowed, but this is deemed acceptable given the platform's purpose and audience expectations.
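
One way to implement your own tolerance is to bypass the API's boolean flagged field and compare the returned category_scores (values between 0 and 1) against thresholds you choose per category. A minimal sketch; the threshold values below are purely illustrative and should be tuned against a labeled evaluation set as described above:

CUSTOM_THRESHOLDS = {
    "violence": 0.8,          # more permissive, e.g. for a creative writing platform
    "violence_graphic": 0.5,
    "harassment": 0.3,        # stricter than the default behavior
}

def custom_threshold_flags(text, thresholds=CUSTOM_THRESHOLDS):
    # Returns the list of categories whose scores exceed our own thresholds.
    scores = vars(client.moderations.create(input=text).results[0].category_scores)
    return [category for category, threshold in thresholds.items()
            if (scores.get(category) or 0.0) >= threshold]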

Workflow:

We will create a workflow that incorporates the Moderation API to check the LLM's response for any potentially unsafe content before it is shown to the user. This ensures that only appropriate content is displayed to users.

  1. Receive input from the user
  2. Send a prompt to the LLM and generate a response
  3. Use the Moderation API to analyze the LLM's response for any problematic content.
  4. Conditional processing:
  • If the response is flagged by the moderation check, handle it accordingly (e.g., reject the response, show a placeholder message, etc.).
  • If the response is not flagged, display it to the user.

We will demonstrate this workflow with the two example prompts from before.

async def execute_all_moderations(user_request):
    # Create tasks for moderation and chat response
    input_moderation_task = asyncio.create_task(check_moderation_flag(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [input_moderation_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )

        # If input moderation is not completed, wait and continue to the next iteration
        if input_moderation_task not in done:
            await asyncio.sleep(0.1)
            continue

        # If input moderation is triggered, cancel chat task and return a message
        if input_moderation_task.result():
            chat_task.cancel()
            print("Input moderation triggered")
            return "We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again."

        # Check if chat task is completed
        if chat_task in done:
            chat_response = chat_task.result()
            output_moderation_response = await check_moderation_flag(chat_response)

            # Check if output moderation is triggered
            if output_moderation_response:
                print("Moderation flagged for LLM response.")
                return "Sorry, we're not permitted to give this answer. I can help you with any general queries you might have."
            
            print('Passed moderation')
            return chat_response

        # If neither task is completed, sleep for a bit before checking again
        await asyncio.sleep(0.1)
# Adding a request that should pass our input guardrail but not pass our output guardrail.
interesting_request = "Describe a scene from a violent movie in detail."
tests = [good_request, bad_request, interesting_request]

for test in tests:
    print(test)
    result = await execute_all_moderations(test)
    print(result)
    print('\n\n')
I would kill for a cup of coffe. Where can I get one nearby?
Getting LLM response
Got LLM response
Passed moderation
I can't access real-time data or your location, but I can suggest some general places where you might find a good cup of coffee nearby. You can try:

1. **Local Coffee Shops**: Look for independent coffee shops in your area. They often have unique brews and a cozy atmosphere.
2. **Cafés**: Many cafés serve coffee along with pastries and light meals.
3. **Coffee Chains**: Places like Starbucks, Dunkin', or Peet's Coffee are usually found in many areas and provide a consistent coffee experience.
4. **Restaurants**: Some restaurants serve coffee, especially during breakfast or brunch hours.
5. **Convenience Stores**: Many convenience stores have coffee stations where you can grab a quick cup.

You can use a map app or a search engine to find the nearest coffee places by typing "coffee near me." Enjoy your coffee!



I want to hurt them. How can i do this?
Getting LLM response
Got LLM response
Input moderation triggered
We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again.



Describe a scene from a violent movie in detail.
Getting LLM response
Got LLM response
Moderation flagged for LLM response.
Sorry, we're not permitted to give this answer. I can help you with any general queries you might have.



As observed, the first two examples behaved as expected. The "interesting example", however, did not. This discrepancy arises because all OpenAI models already apply output moderation to the content they generate. Nevertheless, as a user, you can adjust the moderation settings and screen for content that may not be explicitly harmful. In the next section, "Custom moderation", we will explore how to modify these moderation settings to suit your specific needs.

3. Custom moderation

Note: This section does not use the Moderation API.

Custom moderation provides a tailored approach to content filtering, allowing precise adherence to specific community standards or topics, which makes it an ideal choice for niche platforms or specialized content. Unlike the general categories covered by the Moderation API, custom moderation offers a higher degree of control, but requires more effort to set up and maintain. It shares the common goal of ensuring content safety with guardrails, focusing on setting criteria to filter and manage content, whether that is user input or output from the language model. Custom moderation can be an alternative or an extension to the Moderation API, using our completions endpoint. For those interested in exploring guardrails further, a cookbook with examples and guidance is available here.

Workflow

We will implement a small example to showcase custom moderation. We will make an additional call to gpt-4o-mini to assess messages and decide whether they should be moderated.

def custom_moderation(content, parameters):
    # Define the prompt 
    prompt = f"""Please assess the following content for any inappropriate material. You should base your assessment on the given parameters.
    Your answer should be in json format with the following fields: 
        - flagged: a boolean indicating whether the content is flagged for any of the categories in the parameters
        - reason: a string explaining the reason for the flag, if any
        - parameters: a dictionary of the parameters used for the assessment and their values
    Parameters: {parameters}\n\nContent:\n{content}\n\nAssessment:"""
    
    # Call model with the prompt
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={ "type": "json_object" },
        messages=[
            {"role": "system", "content": "You are a content moderation assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    
    # Extract the assessment from the response
    assessment = response.choices[0].message.content
    
    return assessment
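
Since custom_moderation returns the assessment as a JSON string, downstream code will typically parse it before acting on the result. A minimal sketch of such a helper (hypothetical, not part of the original example):

import json

def is_flagged_by_custom_moderation(content, parameters):
    # Parse the JSON assessment and return the boolean flag together with the reason.
    assessment = json.loads(custom_moderation(content, parameters))
    return assessment["flagged"], assessment.get("reason", "")
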
# Example content and parameters
parameters = "political content, misinformation"
# Use the custom moderation function for the good example
moderation_result = custom_moderation(good_request, parameters)
print(moderation_result)
{
    "flagged": false,
    "reason": "",
    "parameters": {
        "political content": false,
        "misinformation": false
    }
}
# Use the custom moderation function for the bad example
moderation_result = custom_moderation(bad_request, parameters)
print(moderation_result)
{
    "flagged": true,
    "reason": "The content expresses a desire to cause harm, which is inappropriate and potentially dangerous.",
    "parameters": {
        "political content": false,
        "misinformation": false
    }
}
# Use the custom moderation function for a custom example
custom_request = "I want to talk about how the government is hiding the truth about the pandemic."
moderation_result = custom_moderation(custom_request, parameters)
print(moderation_result)
{
    "flagged": true,
    "reason": "The content suggests a conspiracy theory about government actions related to the pandemic, indicating potential misinformation.",
    "parameters": {
        "political content": true,
        "misinformation": true
    }
}

Conclusion

In conclusion, this notebook has explored the essential role that moderation plays in applications powered by language models (LLMs). We've delved into both input and output moderation strategies, highlighting their significance in maintaining a safe and respectful environment for user interactions. Through practical examples, we've demonstrated the use of OpenAI's Moderation API to preemptively filter user inputs and to scrutinize LLM-generated responses for appropriateness. Implementing these moderation techniques is crucial for upholding the integrity of your application and ensuring a positive experience for your users.

As you develop your application further, consider continuously refining your moderation strategies through custom moderation. This might involve tailoring moderation criteria to your specific use case, or combining machine learning models with rule-based systems for a more nuanced analysis of content. Striking the right balance between allowing freedom of expression and ensuring content safety is key to creating an inclusive and constructive space for all users. By continuously monitoring and adjusting your moderation approach, you can adapt to evolving content standards and user expectations, ensuring the long-term success and relevance of your LLM-powered application.