Install llama-index-llms-ipex-llm. This will also install ipex-llm and its dependencies.
In [ ]:
%pip install llama-index-llms-ipex-llm
In this example we use the HuggingFaceH4/zephyr-7b-alpha model for the demonstration. It requires updating the transformers and tokenizers packages.
In [ ]:
%pip install -U transformers==4.37.0 tokenizers==0.15.2
Before loading the Zephyr model, you need to define completion_to_prompt and messages_to_prompt for formatting prompts. This is essential for preparing inputs that the model can interpret accurately.
In [ ]:
# Transform a string into zephyr-specific input
def completion_to_prompt(completion):
    return f"<|system|>\n</s>\n<|user|>\n{completion}</s>\n<|assistant|>\n"


# Transform a list of chat messages into zephyr-specific input
def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"

    # ensure we start with a system prompt, insert blank if needed
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt

    # add final assistant prompt
    prompt = prompt + "<|assistant|>\n"

    return prompt
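As a quick standalone sanity check of the formatting, the helper produces prompts like the following. messages_to_prompt is reproduced here so the snippet runs on its own, and SimpleNamespace stands in for llama_index's ChatMessage:

```python
# Standalone check of the zephyr prompt formatting. SimpleNamespace is a
# stand-in for llama_index's ChatMessage, which exposes .role and .content.
from types import SimpleNamespace


def messages_to_prompt(messages):
    prompt = ""
    for message in messages:
        if message.role == "system":
            prompt += f"<|system|>\n{message.content}</s>\n"
        elif message.role == "user":
            prompt += f"<|user|>\n{message.content}</s>\n"
        elif message.role == "assistant":
            prompt += f"<|assistant|>\n{message.content}</s>\n"
    if not prompt.startswith("<|system|>\n"):
        prompt = "<|system|>\n</s>\n" + prompt
    return prompt + "<|assistant|>\n"


msgs = [SimpleNamespace(role="user", content="Hello!")]
formatted = messages_to_prompt(msgs)
print(formatted)
# A blank system turn is inserted automatically, followed by the user
# turn and a trailing <|assistant|> tag for the model to continue from.
```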
Basic Usage¶
Load the Zephyr model locally using IpexLLM.from_model_id. It loads the model directly in its Hugging Face format and converts it automatically to a low-bit format for inference.
In [ ]:
import warnings

warnings.filterwarnings(
    "ignore", category=UserWarning, message=".*padding_mask.*"
)

from llama_index.llms.ipex_llm import IpexLLM

llm = IpexLLM.from_model_id(
    model_name="HuggingFaceH4/zephyr-7b-alpha",
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    context_window=512,
    max_new_tokens=128,
    generate_kwargs={"do_sample": False},
    completion_to_prompt=completion_to_prompt,
    messages_to_prompt=messages_to_prompt,
)
Loading checkpoint shards: 0%| | 0/8 [00:00<?, ?it/s]
2024-04-11 21:36:54,739 - INFO - Converting the current model to sym_int4 format......
You can now use the loaded model for text completion and interactive chat.
Text Completion¶
In [ ]:
completion_response = llm.complete("Once upon a time, ")
print(completion_response.text)
in a far-off land, there was a young girl named Lily. Lily lived in a small village surrounded by lush green forests and rolling hills. She loved nothing more than spending her days exploring the woods and playing with her animal friends. One day, while wandering through the forest, Lily stumbled upon a magical tree. The tree was unlike any other she had ever seen. Its trunk was made of shimmering crystal, and its branches were adorned with sparkling jewels. Lily was immediately drawn to the tree and sat down to admire its beauty. Suddenly,
Streaming Text Completion¶
In [ ]:
response_iter = llm.stream_complete("Once upon a time, there's a little girl")
for response in response_iter:
    print(response.delta, end="", flush=True)
who loved to play with her toys. She had a favorite teddy bear named Ted, and a doll named Dolly. She would spend hours playing with them, imagining all sorts of adventures. One day, she decided to take Ted and Dolly on a real adventure. She packed a backpack with some snacks, a blanket, and a map. They set off on a hike in the nearby woods. The little girl was so excited that she could barely contain her joy. Ted and Dolly were happy to be along for the ride. They walked for what seemed like hours, but the little girl didn't mind
Chat¶
In [ ]:
from llama_index.core.llms import ChatMessage

message = ChatMessage(role="user", content="Explain Big Bang Theory briefly")
resp = llm.chat([message])
print(resp)
assistant: The Big Bang Theory is a popular American sitcom that aired from 2007 to 2019. The show follows the lives of two brilliant but socially awkward physicists, Leonard Hofstadter (Johnny Galecki) and Sheldon Cooper (Jim Parsons), and their friends and colleagues, Penny (Kaley Cuoco), Rajesh Koothrappali (Kunal Nayyar), and Howard Wolowitz (Simon Helberg). The show is set in Pasadena, California, and revolves around the characters' work at Caltech and
Streaming Chat¶
In [ ]:
message = ChatMessage(role="user", content="What is AI?")
resp = llm.stream_chat([message], max_tokens=256)
for r in resp:
    print(r.delta, end="")
AI stands for Artificial Intelligence. It refers to the development of computer systems that can perform tasks that typically require human intelligence, such as learning, reasoning, and problem-solving. AI involves the use of machine learning algorithms, natural language processing, and other advanced techniques to enable computers to understand and respond to human input in a more natural and intuitive way.
Save/Load Low-bit Model¶
Alternatively, you can save the low-bit model to disk once and later reload it with from_model_id_low_bit instead of from_model_id, even on a different machine. This is space-efficient, as the low-bit model takes up considerably less disk space than the original model. from_model_id_low_bit is also more efficient than from_model_id in terms of speed and memory usage, since it skips the model conversion step.
To save the low-bit model, use save_low_bit as follows.
In [ ]:
saved_lowbit_model_path = (
    "./zephyr-7b-alpha-low-bit"  # path to save low-bit model
)

llm._model.save_low_bit(saved_lowbit_model_path)
del llm
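To check the disk-space saving for yourself, you can compare the on-disk size of the saved low-bit directory against the original model's download. A minimal sketch (the path in the usage comment is a placeholder):

```python
import os


def dir_size_mb(path: str) -> float:
    """Total size in MiB of all files under the given directory tree."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total / (1024 * 1024)


# Example usage (placeholder path):
# print(f"{dir_size_mb('./zephyr-7b-alpha-low-bit'):.1f} MiB")
```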
Load the model from the saved low-bit model path as follows.
Note that the saved path for the low-bit model only includes the model itself, not the tokenizer. If you wish to have everything in one place, you will need to manually download or copy the tokenizer files from the original model's directory to the location where the low-bit model is saved.
In [ ]:
llm_lowbit = IpexLLM.from_model_id_low_bit(
    model_name=saved_lowbit_model_path,
    tokenizer_name="HuggingFaceH4/zephyr-7b-alpha",
    # tokenizer_name=saved_lowbit_model_path,  # copy the tokenizers to saved path if you want to use it this way
    context_window=512,
    max_new_tokens=64,
    completion_to_prompt=completion_to_prompt,
    generate_kwargs={"do_sample": False},
)
2024-04-11 21:38:06,151 - INFO - Converting the current model to sym_int4 format......
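If you would rather load the tokenizer from the low-bit path (the commented-out alternative above), one way to copy the tokenizer files over is sketched below. The file patterns are assumptions about a typical Hugging Face model directory, not part of the IPEX-LLM API:

```python
import glob
import os
import shutil


def copy_tokenizer_files(src_dir: str, dst_dir: str) -> list:
    """Copy common tokenizer artifacts (tokenizer*, special_tokens_map.json,
    *.model) from a model directory into the low-bit save path.

    Returns the sorted file names that were copied."""
    patterns = ["tokenizer*", "special_tokens_map.json", "*.model"]
    copied = set()
    os.makedirs(dst_dir, exist_ok=True)
    for pattern in patterns:
        for path in glob.glob(os.path.join(src_dir, pattern)):
            shutil.copy(path, dst_dir)
            copied.add(os.path.basename(path))
    return sorted(copied)
```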
Try streaming completion with the loaded low-bit model.
In [ ]:
response_iter = llm_lowbit.stream_complete("What is Large Language Model?")
for response in response_iter:
    print(response.delta, end="", flush=True)
A large language model (LLM) is a type of artificial intelligence (AI) model that is trained on a massive amount of text data. These models are capable of generating human-like responses to text inputs and can be used for various natural language processing (NLP) tasks, such as text classification, sentiment analysis