
Docker

We recommend starting with our pre-built Docker images to avoid compiling custom CUDA kernels and other dependencies.

Run the base LLM container

In this example, we'll use Mistral-7B-Instruct as the base model, but you can use any supported model from HuggingFace:

model=mistralai/Mistral-7B-Instruct-v0.1
volume=$PWD/data  # share a volume with the container as a weight cache

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:main --model-id $model
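
Once the container is running, you can sanity-check that the server is up before sending prompts. This assumes the TGI-style /health endpoint, which LoRAX (a fork of text-generation-inference) is expected to expose; if your version differs, the /generate calls below work as a smoke test just as well:

curl 127.0.0.1:8080/health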

Note

The main tag uses an image built from HEAD of the repository's main branch. For the latest stable image (built from a tagged release), use the latest tag instead.
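
For example, to run the same command against the latest stable release:

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:latest --model-id $model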

Note

The LoRAX server in the pre-built Docker image is configured to listen on port 80 (rather than the default port 3000), hence the -p 8080:80 mapping in the docker run command above.

Note

To use GPUs, you will need to install the NVIDIA Container Toolkit. We also recommend NVIDIA drivers with support for CUDA version 11.8 or higher.
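
A quick way to confirm that Docker can see your GPUs once the toolkit is installed (the CUDA image tag here is only an example):

docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi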

See the Launcher reference docs for all available options, or run the following command from within your container:

lorax-launcher --help
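
For example, you might raise the context limits at startup. These flag names come from the TGI-derived launcher and are assumptions here, so confirm them against your version's --help output:

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/predibase/lorax:main --model-id $model \
    --max-input-length 4096 --max-total-tokens 8192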

Prompt the base LLM

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64}}' \
    -H 'Content-Type: application/json'

To stream tokens back as they are generated, call the generate_stream endpoint instead:

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64}}' \
    -H 'Content-Type: application/json'

The same requests can be made with the Python client. Install it first:

pip install lorax-client

from lorax import Client

client = Client("http://127.0.0.1:8080")
prompt = "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]"

print(client.generate(prompt, max_new_tokens=64).generated_text)

To stream with the Python client:

from lorax import Client

client = Client("http://127.0.0.1:8080")
prompt = "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]"

text = ""
for response in client.generate_stream(prompt, max_new_tokens=64):
    # Skip special tokens (e.g. EOS) so that only generated text is accumulated
    if not response.token.special:
        text += response.token.text
print(text)

Prompt a LoRA adapter

To apply a LoRA adapter on top of the base model, pass an adapter_id in the request parameters; here we use a QLoRA adapter fine-tuned on GSM8K for this base model:

curl 127.0.0.1:8080/generate \
    -X POST \
    -d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64, "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"}}' \
    -H 'Content-Type: application/json'

Streaming works the same way:

curl 127.0.0.1:8080/generate_stream \
    -X POST \
    -d '{"inputs": "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]", "parameters": {"max_new_tokens": 64, "adapter_id": "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"}}' \
    -H 'Content-Type: application/json'

With the Python client (installed above):

from lorax import Client

client = Client("http://127.0.0.1:8080")
prompt = "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]"
adapter_id = "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"

print(client.generate(prompt, max_new_tokens=64, adapter_id=adapter_id).generated_text)

And to stream with the Python client:

from lorax import Client

client = Client("http://127.0.0.1:8080")
prompt = "[INST] Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether in April and May? [/INST]"
adapter_id = "vineetsharma/qlora-adapter-Mistral-7B-Instruct-v0.1-gsm8k"

text = ""
for response in client.generate_stream(prompt, max_new_tokens=64, adapter_id=adapter_id):
    if not response.token.special:
        text += response.token.text
print(text)