Run Outlines using BentoML

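BentoML is an open-source model serving library for building performant and scalable AI applications with Python. It comes with the tools you need for serving optimization, model packaging, and production deployment.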

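In this guide, we will show you how to use BentoML to run programs written with Outlines, both locally and on BentoCloud, an AI inference platform for enterprise AI teams. The example source code in this guide is also available in the examples/bentoml/ directory.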

Import a model

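First, we need to download an LLM (Mistral-7B-v0.1 in this example; you can use any other LLM) and import the model into BentoML's Model Store. Let's install BentoML and the other dependencies from PyPI (preferably in a virtual environment):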

pip install -r requirements.txt

Then save the code snippet below as import_model.py and run python import_model.py.

Note: You need to accept the related conditions on Hugging Face first to gain access to Mistral-7B-v0.1.

import bentoml

MODEL_ID = "mistralai/Mistral-7B-v0.1"
BENTO_MODEL_TAG = MODEL_ID.lower().replace("/", "--")

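def import_model(model_id, bento_model_tag):
    # Imported lazily so that importing this module for BENTO_MODEL_TAG stays cheap.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Use the function arguments rather than the module-level constants.
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        low_cpu_mem_usage=True,
    )

    # Save the tokenizer and the weights into a new entry in the BentoML Model Store.
    with bentoml.models.create(bento_model_tag) as bento_model_ref:
        tokenizer.save_pretrained(bento_model_ref.path)
        model.save_pretrained(bento_model_ref.path)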


if __name__ == "__main__":
    import_model(MODEL_ID, BENTO_MODEL_TAG)

You can verify that the download succeeded by running:

$ bentoml models list

Tag                                          Module  Size        Creation Time
mistralai--mistral-7b-v0.1:m7lmf5ac2cmubnnz          13.49 GiB   2024-04-25 06:52:39
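
The tag name in the listing comes from the BENTO_MODEL_TAG expression in import_model.py, which lowercases the Hugging Face model ID and replaces the slash; the part after the colon is a version string that BentoML generates. A minimal sketch of that derivation, where to_bento_tag is a hypothetical helper name that does not appear in the original script:

```python
def to_bento_tag(model_id: str) -> str:
    """Mirror the BENTO_MODEL_TAG expression from import_model.py."""
    # Lowercase the ID and replace the org/name separator with "--".
    return model_id.lower().replace("/", "--")


tag = to_bento_tag("mistralai/Mistral-7B-v0.1")
print(tag)
```

If you switch to a different LLM, only MODEL_ID needs to change; the tag follows from it.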

Define a BentoML Service

Once the model is ready, we can define a BentoML Service to wrap its capabilities.

We will run the JSON-structured generation example from the README, with the following schema:

DEFAULT_SCHEMA = """{
    "title": "Character",
    "type": "object",
    "properties": {
        "name": {
            "title": "Name",
            "maxLength": 10,
            "type": "string"
        },
        "age": {
            "title": "Age",
            "type": "integer"
        },
        "armor": {"$ref": "#/definitions/Armor"},
        "weapon": {"$ref": "#/definitions/Weapon"},
        "strength": {
            "title": "Strength",
            "type": "integer"
        }
    },
    "required": ["name", "age", "armor", "weapon", "strength"],
    "definitions": {
        "Armor": {
            "title": "Armor",
            "description": "An enumeration.",
            "enum": ["leather", "chainmail", "plate"],
            "type": "string"
        },
        "Weapon": {
            "title": "Weapon",
            "description": "An enumeration.",
            "enum": ["sword", "axe", "mace", "spear", "bow", "crossbow"],
            "type": "string"
        }
    }
}"""

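First, we need to define a BentoML service by decorating an ordinary class (Outlines here) with the @bentoml.service decorator. We pass this decorator some configuration and the GPU on which we want this service to run in BentoCloud (here an L4 with 24GB of memory):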

import typing as t
import bentoml

from import_model import BENTO_MODEL_TAG

@bentoml.service(
    traffic={
        "timeout": 300,
    },
    resources={
        "gpu": 1,
        "gpu_type": "nvidia-l4",
    },
)
class Outlines:

    bento_model_ref = bentoml.models.get(BENTO_MODEL_TAG)

    def __init__(self) -> None:

        import outlines
        import torch
        self.model = outlines.models.transformers(
            self.bento_model_ref.path,
            device="cuda",
            model_kwargs={"torch_dtype": torch.float16},
        )

    ...

We then need to define an HTTP endpoint by decorating the generate method of the Outlines class with @bentoml.api:

    ...

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Give me a character description.",
        json_schema: t.Optional[str] = DEFAULT_SCHEMA,
    ) -> t.Dict[str, t.Any]:

        import outlines

        generator = outlines.generate.json(self.model, json_schema)
        character = generator(prompt)

        return character

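Here the @bentoml.api decorator defines generate as an HTTP endpoint that accepts a JSON request body with two fields: prompt and json_schema (optional, which allows HTTP clients to provide their own JSON schema). The type hints in the function signature are used to validate incoming JSON requests. You can define as many HTTP endpoints as you want by decorating other methods of the Outlines class with @bentoml.api.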
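
Because json_schema is typed as a string in the signature, a client that supplies its own schema would send it as a JSON-encoded string rather than as a nested object. The following client-side sketch (separate from service.py) uses only the standard library. CUSTOM_SCHEMA, build_payload, and call_generate are hypothetical names introduced for illustration, and the URL assumes the service is served locally on port 3000:

```python
import json
import urllib.request

# Hypothetical schema for illustration; any JSON Schema could be used here.
CUSTOM_SCHEMA = {
    "title": "Pet",
    "type": "object",
    "properties": {
        "name": {"type": "string", "maxLength": 10},
        "species": {"enum": ["cat", "dog"], "type": "string"},
    },
    "required": ["name", "species"],
}


def build_payload(prompt: str, schema: dict) -> bytes:
    """Encode the request body; the schema is serialized to a string first."""
    body = {"prompt": prompt, "json_schema": json.dumps(schema)}
    return json.dumps(body).encode("utf-8")


def call_generate(payload: bytes, url: str = "http://localhost:3000/generate") -> dict:
    """POST the payload to a running service (assumes the server is up)."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())


payload = build_payload("Describe a pet.", CUSTOM_SCHEMA)
print(payload.decode("utf-8"))
```

Calling call_generate(payload) only makes sense once the server is running; building the payload needs no server.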

Now you can save the service code above as service.py (or use the implementation in examples/bentoml/), and run it using the BentoML CLI.

Run locally for testing and debugging

Then you can run a server locally with:

bentoml serve .

The server is now active at http://localhost:3000. You can interact with it using the Swagger UI or in other ways:

CURL

curl -X 'POST' \
  'http://localhost:3000/generate' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -d '{
  "prompt": "Give me a character description."
}'

Python client

import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    response = client.generate(
        prompt="Give me a character description"
    )
    print(response)

Expected output:

{
  "name": "Aura",
  "age": 15,
  "armor": "plate",
  "weapon": "sword",
  "strength": 20
}
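
A client may want to sanity-check that a response respects the constraints in DEFAULT_SCHEMA. The sketch below is a hand-rolled check using only the standard library, not a full JSON Schema validator; check_character and the constant names are hypothetical, and the constraints are copied by hand from the schema shown earlier:

```python
import json

# The response shown above, as a client would receive it.
response_text = """{
  "name": "Aura",
  "age": 15,
  "armor": "plate",
  "weapon": "sword",
  "strength": 20
}"""

# Constraints copied by hand from DEFAULT_SCHEMA.
REQUIRED = ["name", "age", "armor", "weapon", "strength"]
ARMOR = {"leather", "chainmail", "plate"}
WEAPON = {"sword", "axe", "mace", "spear", "bow", "crossbow"}


def check_character(character: dict) -> list:
    """Return a list of problems; an empty list means every check passed."""
    problems = [f"missing field: {key}" for key in REQUIRED if key not in character]
    if problems:
        return problems
    if not isinstance(character["name"], str) or len(character["name"]) > 10:
        problems.append("name must be a string of at most 10 characters")
    for key in ("age", "strength"):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if not isinstance(character[key], int) or isinstance(character[key], bool):
            problems.append(f"{key} must be an integer")
    if character["armor"] not in ARMOR:
        problems.append("armor is not one of the allowed values")
    if character["weapon"] not in WEAPON:
        problems.append("weapon is not one of the allowed values")
    return problems


problems = check_character(json.loads(response_text))
print(problems)
```

For the response above, the returned list is empty.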

Deploy to BentoCloud

After the Service is ready, you can deploy it to BentoCloud for better management and scalability. Sign up if you do not have a BentoCloud account yet.

Make sure you have logged in to BentoCloud, then run the following command to deploy it.

bentoml deploy .

Once the application is up and running on BentoCloud, you can access it via the exposed URL.

Note: For custom deployment in your own infrastructure, use BentoML to generate an OCI-compliant image.