March 5, 2024

How to use the Moderation API

Note: This guide is designed to complement our Guardrails Cookbook by focusing more narrowly on moderation techniques. While there is some overlap in content and structure, this cookbook delves deeper into the nuances of tailoring moderation criteria to specific needs, offering a more granular level of control. If you're looking for a broader overview of content safety measures, including guardrails and moderation, we recommend starting with the Guardrails Cookbook. Together, these resources give you a comprehensive understanding of how to effectively manage and moderate content within your applications.

Moderation, much like guardrails in the physical world, serves as a preventative measure to ensure that your application stays within the bounds of acceptable and safe content. Moderation techniques are remarkably versatile and can be applied to a wide array of scenarios where LLMs might run into trouble. This notebook offers straightforward examples that can be adapted to your specific needs, and discusses the considerations and trade-offs involved in deciding whether to implement moderation and how to go about it. This notebook uses our Moderation API, a tool you can use to check whether text or images are potentially harmful.

This notebook focuses on:

  • Input moderation: Identifying and flagging inappropriate or harmful content before it is processed by your LLM.
  • Output moderation: Reviewing and validating the content generated by the LLM before it reaches the end user.
  • Custom moderation: Tailoring moderation criteria and rules to the specific needs and context of your application, so that content controls are both personalized and effective.
from openai import OpenAI
client = OpenAI()
GPT_MODEL = 'gpt-4o-mini'

1. Input moderation

Input moderation focuses on preventing harmful or inappropriate content from reaching the LLM. Common applications include:

  • Content filtering: Preventing the spread of harmful content such as hate speech, harassment, explicit material, and misinformation on social media, forums, and content creation platforms.
  • Community standards enforcement: Ensuring that user interactions, such as comments, forum posts, and chat messages, comply with an online platform's community guidelines and standards, including educational environments, gaming communities, and dating apps.
  • Spam and fraud prevention: Filtering out spam, fraudulent content, and misleading information in online forums, comment sections, e-commerce platforms, and customer reviews.

These measures act as preventative controls, operating before or alongside the LLM and altering your application's behavior when specific criteria are met.

Embrace async

A common design for minimizing latency is to send the moderation check asynchronously, alongside your main LLM call. If the moderation gets triggered, you return a placeholder response; otherwise, you return the LLM's response. This pattern can also be found in our Guardrails Cookbook. It is worth noting that while the async pattern is effective at reducing latency, it can also lead to unnecessary costs: if the content had been flagged before processing, the completion cost could have been avoided entirely. The latency benefit therefore has to be weighed against the potential increase in cost when using the async pattern.

We'll take this approach and create an execute_chat_with_moderation function that runs our LLM's get_chat_response and the check_moderation_flag moderation function in parallel, and returns the LLM response only if the moderation check returns False (i.e., is not triggered).
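
For comparison, a purely sequential variant (a minimal sketch, not used in the rest of this notebook) moderates the input first and only calls the LLM when the input passes; it adds the moderation call's latency to every request, but always avoids the completion cost for flagged inputs:

def execute_chat_with_sequential_moderation(user_request):
    # Moderate first; only pay for a completion when the input passes.
    moderation_response = client.moderations.create(input=user_request)
    if moderation_response.results[0].flagged:
        return "We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again."
    response = client.chat.completions.create(
        model=GPT_MODEL,
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_request},
        ],
        temperature=0.5,
    )
    return response.choices[0].message.content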

Workflow:

We will create a workflow that incorporates the Moderation API to check the user input for any potentially unsafe content before it is sent to the language model (LLM). This ensures that only appropriate content is processed further in your application.

  1. Receive input from the user
  2. Use the Moderation API to analyze the input for any problematic content.
  3. Conditional processing:
  • If the input is flagged by the moderation check, handle it accordingly (e.g., reject the input, ask the user to rephrase, etc.).
  • If the input is not flagged, pass it on to the LLM for further processing.

We will demonstrate this workflow with two example prompts: one text example and one image example. Note that you can pass both text and images in the same request (see the sketch below).
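
A minimal sketch of such a combined request; the image URL is a placeholder to replace with a real, publicly reachable URL, and omni-moderation-latest is the multimodal moderation model used later in this notebook:

combined_moderation = client.moderations.create(
    model="omni-moderation-latest",
    input=[
        {"type": "text", "text": "I want to hurt them. How can i do this?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image-to-check.png"}},
    ],
)
print(combined_moderation.results[0].flagged)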

system_prompt = "You are a helpful assistant."

bad_request = "I want to hurt them. How can i do this?"
good_request = "I would kill for a cup of coffe. Where can I get one nearby?"
import asyncio

async def check_moderation_flag(expression):
    moderation_response = client.moderations.create(input=expression)
    flagged = moderation_response.results[0].flagged
    return flagged
    
async def get_chat_response(user_request):
    print("Getting LLM response")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    response = client.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0.5
    )
    print("Got LLM response")
    return response.choices[0].message.content


async def execute_chat_with_input_moderation(user_request):
    # Create tasks for moderation and chat response
    moderation_task = asyncio.create_task(check_moderation_flag(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        # Wait for either the moderation task or chat task to complete
        done, _ = await asyncio.wait(
            [moderation_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )

        # If moderation task is not completed, wait and continue to the next iteration
        if moderation_task not in done:
            await asyncio.sleep(0.1)
            continue

        # If moderation is triggered, cancel the chat task and return a message
        if moderation_task.result():
            chat_task.cancel()
            print("Moderation triggered")
            return "We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again."

        # If chat task is completed, return the chat response
        if chat_task in done:
            return chat_task.result()

        # If neither task is completed, sleep for a bit before checking again
        await asyncio.sleep(0.1)
# Call the main function with the good request - this should go through
good_response = await execute_chat_with_input_moderation(good_request)
print(good_response)
Getting LLM response
Got LLM response
I can't access your current location to find nearby coffee shops, but I recommend checking popular apps or websites like Google Maps, Yelp, or a local directory to find coffee shops near you. You can search for terms like "coffee near me" or "coffee shops" to see your options. If you're looking for a specific type of coffee or a particular chain, you can include that in your search as well.
# Call the main function with the bad request - this should get blocked
bad_response = await execute_chat_with_input_moderation(bad_request)
print(bad_response)
Getting LLM response
Got LLM response
Moderation triggered
We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again.

Looks like our moderation worked - the first question was allowed through, but the second was blocked for containing inappropriate content. Here is a similar example that works with images.

def check_image_moderation(image_url):
    # Returns True if the image is considered safe (not flagged), False otherwise.
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=[
            {
                "type": "image_url",
                "image_url": {
                    "url": image_url
                }
            }
        ]
    )

    # Extract the moderation categories and their flags
    results = response.results[0]
    flagged_categories = vars(results.categories)
    flagged = results.flagged
    
    if not flagged:
        return True
    else:
        # To get the list of categories that were flagged:
        # reasons = [category.capitalize() for category, is_flagged in flagged_categories.items() if is_flagged]
        return False

The function above can be used to check whether an image is appropriate. The image can be considered inappropriate if the Moderation API returns True for any of the categories below. You can also check just one or several categories to tailor this to your particular use case, as shown in the sketch after this list:

  • sexual/minors
  • harassment
  • harassment/threatening
  • hate
  • hate/threatening
  • illicit
  • illicit/violent
  • self-harm
  • self-harm/intent
  • self-harm/instructions
  • violence
  • violence/graphic
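
A minimal sketch of such a category-specific check, assuming the Python SDK exposes the categories as snake_case attributes (e.g. violence, violence_graphic), as in the vars(...) call above; the default category selection is purely illustrative:

def check_image_for_categories(image_url, categories_of_interest=("violence", "violence_graphic")):
    # Returns True if the image is flagged for any of the selected categories only.
    response = client.moderations.create(
        model="omni-moderation-latest",
        input=[{"type": "image_url", "image_url": {"url": image_url}}],
    )
    category_flags = vars(response.results[0].categories)
    return any(category_flags.get(category) for category in categories_of_interest)
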
war_image = "https://assets.editorial.aetnd.com/uploads/2009/10/world-war-one-gettyimages-90007631.jpg"
world_wonder_image = "https://whc.unesco.org/uploads/thumbs/site_0252_0008-360-360-20250108121530.jpg"

print("Checking an image about war: " + ("Image is not safe" if not check_image_moderation(war_image) else "Image is safe"))
print("Checking an image of a wonder of the world: " + ("Image is not safe" if not check_image_moderation(world_wonder_image) else "Image is safe"))
Checking an image about war: Image is not safe
Checking an image of a wonder of the world: Image is safe

Now let's extend this concept to also moderate the responses we get back from the LLM.

2. Output moderation

Output moderation is crucial for controlling the content generated by the language model (LLM). While LLMs should not output illegal or harmful content, it can be helpful to put additional guardrails in place to further ensure that the content remains within acceptable and safe boundaries, enhancing the overall security and reliability of the application. Common types of output moderation include:

  • Content quality assurance: Ensuring that generated content, such as articles, product descriptions, and educational material, is accurate, informative, and free of inappropriate information.
  • Community standards compliance: Maintaining a respectful and safe environment in online forums, discussion boards, and gaming communities by filtering out hate speech, harassment, and other harmful content.
  • User experience enhancement: Improving the user experience with chatbots and automated services by providing responses that are polite, relevant, and free of any inappropriate language or content.

In all these scenarios, output moderation plays a crucial role in maintaining the quality and integrity of the content generated by the language model, ensuring that it meets the standards and expectations of the platform and its users.

Setting moderation thresholds

OpenAI has selected thresholds for the moderation categories that balance precision and recall for our use cases, but your use case or tolerance for moderation may differ. Setting this threshold is a common area for optimization - we recommend building an evaluation set and grading the results using a confusion matrix to set the right tolerance for your moderation. The trade-off here is generally:

  • More false positives lead to a fractured user experience, where customers get annoyed and the assistant seems less helpful.
  • More false negatives can cause lasting harm to your business, as people get the assistant to answer inappropriate questions or prompt inappropriate responses.

For example, on a platform dedicated to creative writing, the moderation threshold for certain sensitive topics might be set higher to allow greater creative freedom, while still providing a safety net that catches content clearly beyond the bounds of acceptable expression. The trade-off is that some content which might be considered inappropriate in other contexts is allowed, but this is deemed acceptable given the platform's purpose and audience expectations.
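
One way to implement your own tolerance is to bypass the API's boolean flagged field and compare the returned category_scores (values between 0 and 1) against thresholds you choose per category. A minimal sketch; the threshold values below are purely illustrative and should be tuned against a labeled evaluation set as described above:

CUSTOM_THRESHOLDS = {
    "violence": 0.8,          # more permissive, e.g. for a creative writing platform
    "violence_graphic": 0.5,
    "harassment": 0.3,        # stricter than the default behavior
}

def custom_threshold_flags(text, thresholds=CUSTOM_THRESHOLDS):
    # Returns the list of categories whose scores exceed our own thresholds.
    scores = vars(client.moderations.create(input=text).results[0].category_scores)
    return [category for category, threshold in thresholds.items()
            if (scores.get(category) or 0.0) >= threshold]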

Workflow:

We will create a workflow that incorporates the Moderation API to check the LLM's response for any potentially unsafe content before it is shown to the user. This ensures that only appropriate content is displayed to users.

  1. Receive input from the user
  2. Send a prompt to the LLM and generate a response
  3. Use the Moderation API to analyze the LLM's response for any problematic content.
  4. Conditional processing:
  • If the response is flagged by the moderation check, handle it accordingly (e.g., reject the response, show a placeholder message, etc.).
  • If the response is not flagged, display it to the user.

We will demonstrate this workflow with the two example prompts from before.

async def execute_all_moderations(user_request):
    # Create tasks for moderation and chat response
    input_moderation_task = asyncio.create_task(check_moderation_flag(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [input_moderation_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )

        # If input moderation is not completed, wait and continue to the next iteration
        if input_moderation_task not in done:
            await asyncio.sleep(0.1)
            continue

        # If input moderation is triggered, cancel chat task and return a message
        if input_moderation_task.result():
            chat_task.cancel()
            print("Input moderation triggered")
            return "We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again."

        # Check if chat task is completed
        if chat_task in done:
            chat_response = chat_task.result()
            output_moderation_response = await check_moderation_flag(chat_response)

            # Check if output moderation is triggered
            if output_moderation_response:
                print("Moderation flagged for LLM response.")
                return "Sorry, we're not permitted to give this answer. I can help you with any general queries you might have."
            
            print('Passed moderation')
            return chat_response

        # If neither task is completed, sleep for a bit before checking again
        await asyncio.sleep(0.1)
# Adding a request that should pass our input guardrail but not pass our output guardrail.
interesting_request = "Describe a scene from a violent movie in detail."
tests = [good_request, bad_request, interesting_request]

for test in tests:
    print(test)
    result = await execute_all_moderations(test)
    print(result)
    print('\n\n')
I would kill for a cup of coffe. Where can I get one nearby?
Getting LLM response
Got LLM response
Passed moderation
I can't access real-time data or your location, but I can suggest some general places where you might find a good cup of coffee nearby. You can try:

1. **Local Coffee Shops**: Look for independent coffee shops in your area. They often have unique brews and a cozy atmosphere.
2. **Cafés**: Many cafés serve coffee along with pastries and light meals.
3. **Coffee Chains**: Places like Starbucks, Dunkin', or Peet's Coffee are usually found in many areas and provide a consistent coffee experience.
4. **Restaurants**: Some restaurants serve coffee, especially during breakfast or brunch hours.
5. **Convenience Stores**: Many convenience stores have coffee stations where you can grab a quick cup.

You can use a map app or a search engine to find the nearest coffee places by typing "coffee near me." Enjoy your coffee!



I want to hurt them. How can i do this?
Getting LLM response
Got LLM response
Input moderation triggered
We're sorry, but your input has been flagged as inappropriate. Please rephrase your input and try again.



Describe a scene from a violent movie in detail.
Getting LLM response
Got LLM response
Moderation flagged for LLM response.
Sorry, we're not permitted to give this answer. I can help you with any general queries you might have.



As observed, the first two examples behaved as expected. The "interesting example", however, did not. This discrepancy arises because all OpenAI models already apply output moderation to the content they generate. Nevertheless, as a user, you can adjust the moderation settings and screen for content that may not be explicitly harmful. In the next section, "Custom moderation", we will explore how to modify these moderation settings to suit your specific needs.

3. Custom moderation

Note: This section does not use the Moderation API.

Custom moderation provides a tailored approach to content filtering, allowing precise adherence to specific community standards or topics, which makes it an ideal choice for niche platforms or specialized content. Unlike the general categories covered by the Moderation API, custom moderation offers a higher degree of control, but requires more effort to set up and maintain. It shares the common goal of ensuring content safety with guardrails, focusing on setting criteria to filter and manage content, whether that is user input or output from the language model. Custom moderation can be an alternative or an extension to the Moderation API, using our completions endpoint. For those interested in exploring guardrails further, a cookbook with examples and guidance is available here.

Workflow

We will implement a small example to showcase custom moderation. We will make an additional call to gpt-4o-mini to assess messages and decide whether they should be moderated.

def custom_moderation(content, parameters):
    # Define the prompt 
    prompt = f"""Please assess the following content for any inappropriate material. You should base your assessment on the given parameters.
    Your answer should be in json format with the following fields: 
        - flagged: a boolean indicating whether the content is flagged for any of the categories in the parameters
        - reason: a string explaining the reason for the flag, if any
        - parameters: a dictionary of the parameters used for the assessment and their values
    Parameters: {parameters}\n\nContent:\n{content}\n\nAssessment:"""
    
    # Call model with the prompt
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        response_format={ "type": "json_object" },
        messages=[
            {"role": "system", "content": "You are a content moderation assistant."},
            {"role": "user", "content": prompt}
        ]
    )
    
    # Extract the assessment from the response
    assessment = response.choices[0].message.content
    
    return assessment
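
Since custom_moderation returns the assessment as a JSON string, downstream code will typically parse it before acting on the result. A minimal sketch of such a helper (hypothetical, not part of the original example):

import json

def is_flagged_by_custom_moderation(content, parameters):
    # Parse the JSON assessment and return the boolean flag together with the reason.
    assessment = json.loads(custom_moderation(content, parameters))
    return assessment["flagged"], assessment.get("reason", "")
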
# Example content and parameters
parameters = "political content, misinformation"
# Use the custom moderation function for the good example
moderation_result = custom_moderation(good_request, parameters)
print(moderation_result)
{
    "flagged": false,
    "reason": "",
    "parameters": {
        "political content": false,
        "misinformation": false
    }
}
# Use the custom moderation function for the bad example
moderation_result = custom_moderation(bad_request, parameters)
print(moderation_result)
{
    "flagged": true,
    "reason": "The content expresses a desire to cause harm, which is inappropriate and potentially dangerous.",
    "parameters": {
        "political content": false,
        "misinformation": false
    }
}
# Use the custom moderation function for a custom example
custom_request = "I want to talk about how the government is hiding the truth about the pandemic."
moderation_result = custom_moderation(custom_request, parameters)
print(moderation_result)
{
    "flagged": true,
    "reason": "The content suggests a conspiracy theory about government actions related to the pandemic, indicating potential misinformation.",
    "parameters": {
        "political content": true,
        "misinformation": true
    }
}

Conclusion

In conclusion, this notebook has explored the essential role that moderation plays in applications powered by language models (LLMs). We've delved into both input and output moderation strategies, highlighting their significance in maintaining a safe and respectful environment for user interactions. Through practical examples, we've demonstrated the use of OpenAI's Moderation API to preemptively filter user inputs and to scrutinize LLM-generated responses for appropriateness. Implementing these moderation techniques is crucial for upholding the integrity of your application and ensuring a positive experience for your users.

As you develop your application further, consider continuously refining your moderation strategies through custom moderation. This might involve tailoring moderation criteria to your specific use case, or combining machine learning models with rule-based systems for a more nuanced analysis of content. Striking the right balance between allowing freedom of expression and ensuring content safety is key to creating an inclusive and constructive space for all users. By continuously monitoring and adjusting your moderation approach, you can adapt to evolving content standards and user expectations, ensuring the long-term success and relevance of your LLM-powered application.