December 19, 2023

How to implement LLM guardrails

In this notebook we share examples of how to implement guardrails for your LLM applications. A guardrail is a generic term for a detective control that aims to steer your application. Given the inherent randomness of LLMs, greater steerability is a common requirement, and creating effective guardrails has become one of the most common areas of performance optimization when pushing an LLM from prototype to production.

Guardrails are incredibly diverse and can be deployed in virtually any context where you can imagine an LLM going wrong. This notebook aims to give simple examples that can be extended to meet your unique use case, as well as outlining the trade-offs to consider when deciding whether to implement a guardrail, and how to do so.

This notebook will focus on:

  1. Input guardrails that flag inappropriate content before it gets to your LLM
  2. Output guardrails that validate what your LLM has produced before it gets to the customer

Note: This notebook uses guardrails as a generic term for detective controls around an LLM - for official libraries that provide distributions of pre-built guardrails frameworks, check out the following:

import openai

GPT_MODEL = 'gpt-4o-mini'

1. Input guardrails

Input guardrails aim to prevent inappropriate content from reaching the LLM in the first place - some common use cases are:

  • Topical guardrails: Identify when a user asks an off-topic question and advise them on which topics the LLM can help them with.
  • Jailbreaking: Detect when a user is trying to hijack the LLM and override its prompting.
  • Prompt injection: Pick up instances of prompt injection where users try to hide malicious code that would be executed in any downstream functions the LLM runs.

In all of these cases the guardrails act as preventative controls, running before or in parallel with the LLM, and trigger your application to behave differently if one of these criteria is met.

Designing a guardrail

When designing guardrails it is important to consider the trade-off between accuracy, latency, and cost: you want maximum accuracy for the least impact on your bottom line and the user's experience.

We'll begin with a simple topical guardrail that aims to detect off-topic questions and prevent the LLM from answering when triggered. This guardrail consists of a simple prompt and uses gpt-4o-mini, minimising latency and cost while maintaining sufficient accuracy, but if we wanted to optimize further we could consider:

  • Accuracy: You could consider fine-tuning gpt-4o-mini or using few-shot examples to increase accuracy. RAG can also be effective if you have a corpus of information that helps determine whether a piece of content is allowed or not.
  • Latency/Cost: You could try fine-tuning smaller models, such as babbage-002, or open-source offerings like Llama, which can perform quite well when given enough training examples. When using open-source offerings you can also tune the machines used for inference to minimise cost or latency.

This simple guardrail aims to ensure the LLM only responds on a predefined set of topics, and answers out-of-bounds queries with a canned message.

Embrace async

A common design to minimise latency is to send your guardrails asynchronously along with your main LLM call. If a guardrail is triggered you return its response, otherwise you return the LLM's response.

We'll take this approach, creating an execute_chat_with_guardrail function that runs our LLM's get_chat_response in parallel with the topical_guardrail guardrail, returning the LLM response only if the guardrail returns allowed.

Limitations

You should always consider the limitations of guardrails when developing your design. A few key ones to be aware of are:

  • When using LLMs as a guardrail, be aware that they share the same vulnerabilities as your base LLM call itself. For example, a prompt injection attempt could evade both your guardrail and your actual LLM call.
  • As conversations get longer, LLMs become more susceptible to jailbreaking, since your instructions get diluted by the extra text.
  • Guardrails can harm the user experience if you make them overly restrictive to compensate for the issues above. This manifests as over-refusals, where your guardrails reject innocuous user requests because of similarities with prompt injection or jailbreaking attempts.

Mitigations

Combining guardrails with rules-based or more traditional machine learning models for detection can mitigate some of these risks. We've also seen customers use guardrails that only ever consider the latest message, to reduce the risk of the model getting confused by a long conversation.

We would also recommend doing a gradual roll-out with active monitoring of conversations, so you can pick up instances of prompt injection or jailbreaking and either add more guardrails to cover these new types of behaviour, or include them as training examples for your existing guardrails.
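
As a minimal sketch of the rules-based layer mentioned above (the function and pattern list here are hypothetical, not part of this notebook's code), a cheap regex pre-filter can run before any LLM-based guardrail, checking only the latest message so it isn't diluted by a long conversation:

```python
import re

# Hypothetical example patterns - a real deployment would maintain a tuned list.
JAILBREAK_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"pretend (you are|to be)", re.IGNORECASE),
]


def rules_based_precheck(latest_message: str) -> bool:
    """Return True if the latest user message trips any known pattern.

    Runs only on the most recent message, and costs no LLM tokens or latency.
    """
    return any(p.search(latest_message) for p in JAILBREAK_PATTERNS)


print(rules_based_precheck("Please ignore all previous instructions"))  # True
print(rules_based_precheck("What dog breeds suit cat lovers?"))  # False
```

Because this check is effectively free, it can run first and short-circuit the more expensive LLM guardrails when an obvious pattern is found.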

system_prompt = "You are a helpful assistant."

bad_request = "I want to talk about horses"
good_request = "What are the best breeds of dog for people that like cats?"
import asyncio


async def get_chat_response(user_request):
    print("Getting LLM response")
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0.5
    )
    print("Got LLM response")

    return response.choices[0].message.content


async def topical_guardrail(user_request):
    print("Checking topical guardrail")
    messages = [
        {
            "role": "system",
            "content": "Your role is to assess whether the user question is allowed or not. The allowed topics are cats and dogs. If the topic is allowed, say 'allowed' otherwise say 'not_allowed'",
        },
        {"role": "user", "content": user_request},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=messages, temperature=0
    )

    print("Got guardrail response")
    return response.choices[0].message.content


async def execute_chat_with_guardrail(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return "I can only talk about cats and dogs, the best animals that ever lived."
            # Guardrail passed - wait for the chat response and return it
            return await chat_task
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again
# Call the main function with the good request - this should go through
response = await execute_chat_with_guardrail(good_request)
print(response)
Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
If you like cats and are considering getting a dog, there are several breeds known for their compatibility with feline friends. Here are some of the best dog breeds that tend to get along well with cats:

1. **Golden Retriever**: Friendly and tolerant, Golden Retrievers often get along well with other animals, including cats.

2. **Labrador Retriever**: Similar to Golden Retrievers, Labs are social and friendly, making them good companions for cats.

3. **Cavalier King Charles Spaniel**: This breed is gentle and affectionate, often forming strong bonds with other pets.

4. **Basset Hound**: Basset Hounds are laid-back and generally have a calm demeanor, which can help them coexist peacefully with cats.

5. **Beagle**: Beagles are friendly and sociable, and they often enjoy the company of other animals, including cats.

6. **Pug**: Pugs are known for their playful and friendly nature, which can make them good companions for cats.

7. **Shih Tzu**: Shih Tzus are typically friendly and adaptable, often getting along well with other pets.

8. **Collie**: Collies are known for their gentle and protective nature, which can extend to their relationships with cats.

9. **Newfoundland**: These gentle giants are known for their calm demeanor and often get along well with other animals.

10. **Cocker Spaniel**: Cocker Spaniels are friendly and affectionate dogs that can get along well with cats if introduced properly.

When introducing a dog to a cat, it's important to do so gradually and supervise their interactions to ensure a positive relationship. Each dog's personality can vary, so individual temperament is key in determining compatibility.
# Call the main function with the bad request - this should get blocked
response = await execute_chat_with_guardrail(bad_request)
print(response)
Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Topical guardrail triggered
I can only talk about cats and dogs, the best animals that ever lived.

Looks like our guardrail worked - the first question was allowed through, but the second was blocked for being off-topic. Now we'll extend this concept to moderate the response we get from the LLM as well.

2. Output guardrails

Output guardrails govern what the LLM comes back with. They can take many forms, some of the most common being:

  • Hallucination/fact-checking guardrails: Using a corpus of ground truth information, or a training set of hallucinated responses, to block hallucinated responses.
  • Moderation guardrails: Applying brand and corporate guidelines to moderate the LLM's output, and either blocking or rewriting its response if it breaches them.
  • Syntax checks: Structured outputs from LLMs can be returned corrupt or unable to be parsed - these guardrails detect those and either retry or fail gracefully, preventing failures in downstream applications.
    • This is a common control to apply with function calling, ensuring that the expected schema is returned in the arguments when the LLM returns a function_call.
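
A minimal sketch of such a syntax check (the schema and helper name here are hypothetical illustrations): validate that a function call's arguments parse as JSON and contain the expected keys, so the caller can retry or fail gracefully otherwise:

```python
import json

# Hypothetical required fields for an imagined weather-lookup function
EXPECTED_KEYS = {"city", "unit"}


def validate_function_args(raw_arguments: str):
    """Return parsed arguments if they match the expected schema, else None."""
    try:
        args = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return None  # corrupt JSON - caller can retry or fail gracefully
    if not EXPECTED_KEYS.issubset(args):
        return None  # parsed, but missing required fields
    return args


print(validate_function_args('{"city": "Paris", "unit": "celsius"}'))
print(validate_function_args('{"city": "Paris"'))  # truncated JSON -> None
```

In practice you might retry the LLM call a fixed number of times on a None result before returning a canned failure message.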

Moderation guardrail

Here we implement a moderation guardrail that uses a version of the G-Eval evaluation method to score the presence of unwanted content in the LLM's response. This method is demonstrated in more detail in one of our other notebooks.

To accomplish this we will build an extensible framework for moderating content that takes in a domain and applies criteria to a piece of content using a set of steps:

  1. We set a domain that describes the type of content we're going to moderate.
  2. We provide criteria, outlining clearly what the content should and should not contain.
  3. We give the LLM step-by-step instructions for grading the content.
  4. The LLM returns a discrete score from 1 to 5.

Setting guardrail thresholds

Our output guardrail will assess the LLM's response and block anything scoring 3 or higher. Setting this threshold is a common area for optimization - we recommend building an evaluation set and grading the results with a confusion matrix to set the right tolerance for your guardrail. The trade-off here is generally:

  • More false positives lead to a fractured user experience, where customers get annoyed and the assistant seems less helpful.
  • More false negatives can cause lasting harm to your business, as people get the assistant to answer inappropriate questions, or prompt inject/jailbreak it.

For example, for jailbreaking you may want a very low threshold, as the risk to your business is very high if your LLM is hijacked and used to produce dangerous content that ends up on social media. For our use case, however, we're willing to accept some false negatives, as the worst that could happen is someone ends up with a Bichon Frise when a Labrador might have suited them better - sad, but probably not lasting damage to our business (we hope).
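
The threshold-setting process described above can be sketched as follows, assuming a small hand-labelled evaluation set (the data here is purely illustrative): score each labelled response with the guardrail, then tally a confusion matrix for each candidate threshold:

```python
# Illustrative labelled set: (guardrail score, human label: True = should block)
eval_set = [(1, False), (2, False), (3, True), (4, True), (2, True), (5, True), (1, False)]


def confusion_matrix(threshold: int) -> dict:
    """Tally outcomes for the policy 'block when score >= threshold'."""
    tp = fp = tn = fn = 0
    for score, should_block in eval_set:
        blocked = score >= threshold
        if blocked and should_block:
            tp += 1
        elif blocked and not should_block:
            fp += 1  # false positive: jarring user experience
        elif not blocked and should_block:
            fn += 1  # false negative: risk to the business
        else:
            tn += 1
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


for t in (2, 3, 4):
    print(t, confusion_matrix(t))
```

Comparing the false-positive and false-negative counts across thresholds, weighted by their business cost, gives a principled way to pick the tolerance.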

domain = "animal breed recommendation"

animal_advice_criteria = """
Assess the presence of explicit recommendation of cat or dog breeds in the content.
The content should contain only general advice about cats and dogs, not specific breeds to purchase."""

animal_advice_steps = """
1. Read the content and the criteria carefully.
2. Assess how much explicit recommendation of cat or dog breeds is contained in the content.
3. Assign an animal advice score from 1 to 5, with 1 being no explicit cat or dog breed advice, and 5 being multiple named cat or dog breeds.
"""

moderation_system_prompt = """
You are a moderation assistant. Your role is to detect content about {domain} in the text provided, and mark the severity of that content.

## {domain}

### Criteria

{scoring_criteria}

### Instructions

{scoring_steps}

### Content

{content}

### Evaluation (score only!)
"""
async def moderation_guardrail(chat_response):
    print("Checking moderation guardrail")
    mod_messages = [
        {"role": "user", "content": moderation_system_prompt.format(
            domain=domain,
            scoring_criteria=animal_advice_criteria,
            scoring_steps=animal_advice_steps,
            content=chat_response
        )},
    ]
    response = openai.chat.completions.create(
        model=GPT_MODEL, messages=mod_messages, temperature=0
    )
    print("Got moderation response")
    return response.choices[0].message.content
    
    
async def execute_all_guardrails(user_request):
    topical_guardrail_task = asyncio.create_task(topical_guardrail(user_request))
    chat_task = asyncio.create_task(get_chat_response(user_request))

    while True:
        done, _ = await asyncio.wait(
            [topical_guardrail_task, chat_task], return_when=asyncio.FIRST_COMPLETED
        )
        if topical_guardrail_task in done:
            guardrail_response = topical_guardrail_task.result()
            if guardrail_response == "not_allowed":
                chat_task.cancel()
                print("Topical guardrail triggered")
                return "I can only talk about cats and dogs, the best animals that ever lived."
            # Guardrail passed - wait for the chat response, then moderate it
            chat_response = await chat_task
            moderation_response = await moderation_guardrail(chat_response)

            if int(moderation_response.strip()) >= 3:
                print(f"Moderation guardrail flagged with a score of {moderation_response.strip()}")
                return "Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have."
            else:
                print('Passed moderation')
                return chat_response
        else:
            await asyncio.sleep(0.1)  # sleep for a bit before checking the tasks again
# Adding a request that should pass both our topical guardrail and our moderation guardrail
great_request = 'What is some advice you can give to a new dog owner?'
tests = [good_request,bad_request,great_request]

for test in tests:
    result = await execute_all_guardrails(test)
    print(result)
    print('\n\n')
    
Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Checking moderation guardrail
Got moderation response
Moderation guardrail flagged with a score of 5
Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have.



Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Topical guardrail triggered
I can only talk about cats and dogs, the best animals that ever lived.



Checking topical guardrail
Got guardrail response
Getting LLM response
Got LLM response
Checking moderation guardrail
Got moderation response
Moderation guardrail flagged with a score of 3
Sorry, we're not permitted to give animal breed advice. I can help you with any general queries you might have.



Conclusion

Guardrails are a vibrant and evolving topic in LLMs, and we hope this notebook has given you an effective introduction to the core concepts around them. To recap:

  • Guardrails are detective controls that aim to prevent harmful content getting to your applications and your users, and add steerability to your LLM in production.
  • They can take the form of input guardrails, which target content before it gets to the LLM, and output guardrails, which control the LLM's response.
  • Designing guardrails and setting their thresholds is a trade-off between accuracy, latency, and cost. Your decisions should be based on clear evaluations of your guardrails' performance, and an understanding of what false negatives and false positives cost your business.
  • By embracing asynchronous design principles, you can scale guardrails horizontally to minimise the impact on the user as the number and scope of your guardrails increases.
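
To illustrate the horizontal-scaling point (the guardrail functions here are hypothetical stand-ins, not this notebook's real checks), several independent output guardrails can be fanned out concurrently with asyncio.gather, so the total added latency is roughly that of the slowest check rather than their sum:

```python
import asyncio


async def check_moderation(text: str) -> bool:
    await asyncio.sleep(0.1)  # stand-in for an LLM moderation call
    return True


async def check_syntax(text: str) -> bool:
    await asyncio.sleep(0.1)  # stand-in for a JSON/schema validation
    return True


async def run_output_guardrails(text: str) -> bool:
    # Fan out all checks at once; latency ~= the slowest check, not the sum
    results = await asyncio.gather(check_moderation(text), check_syntax(text))
    return all(results)


print(asyncio.run(run_output_guardrails("some LLM response")))  # True
```

Adding a new guardrail then means adding one more coroutine to the gather call, leaving end-to-end latency largely unchanged.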

We look forward to seeing how you take this forward, and how thinking on guardrails evolves as the ecosystem matures.