# Context Length Extrapolation
Long context extrapolation refers to an LLM's ability to process, at inference time, input longer than the texts it was trained on. The TurboMind engine now supports `LlamaDynamicNTKScalingRotaryEmbedding`, and its implementation is consistent with huggingface's.
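For intuition, here is a minimal NumPy sketch of the dynamic NTK idea, following the base-rescaling formula used in huggingface's `LlamaDynamicNTKScalingRotaryEmbedding`. The `dim`, `base`, and window defaults below are illustrative, not LMDeploy internals:

```python
import numpy as np


def dynamic_ntk_inv_freq(seq_len, dim=128, base=10000.0,
                         max_position_embeddings=4096, scaling_factor=2.5):
    """Rotary inverse frequencies with dynamic NTK scaling.

    When seq_len exceeds the trained window, the RoPE base is enlarged so
    low-frequency components are interpolated rather than extrapolated.
    """
    if seq_len > max_position_embeddings:
        base = base * ((scaling_factor * seq_len / max_position_embeddings) -
                       (scaling_factor - 1))**(dim / (dim - 2))
    return 1.0 / base**(np.arange(0, dim, 2) / dim)


print(dynamic_ntk_inv_freq(2048)[:3])   # within the trained window: unscaled
print(dynamic_ntk_inv_freq(65536)[:3])  # beyond it: stretched frequencies
```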
## Usage
You can enable context length extrapolation by modifying `TurbomindEngineConfig`: set `session_len` to the expected length, and set `rope_scaling_factor` to a number no less than 1.0.
Take `internlm2_5-7b-chat-1m` as an example. It supports a context length of up to 1 million tokens:
```python
from lmdeploy import pipeline, GenerationConfig, TurbomindEngineConfig

backend_config = TurbomindEngineConfig(
    rope_scaling_factor=2.5,
    session_len=1000000,
    max_batch_size=1,
    cache_max_entry_count=0.7,
    tp=4)
pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)
prompt = 'Use a long prompt to replace this sentence'
gen_config = GenerationConfig(top_p=0.8,
                              top_k=40,
                              temperature=0.8,
                              max_new_tokens=1024)
response = pipe(prompt, gen_config=gen_config)
print(response)
```
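Note that the prompt plus the generation budget must fit within `session_len`. A small sanity check, reusing the `pipe` and `prompt` objects above (the helper name is ours, not part of the LMDeploy API):

```python
def fits_in_session(pipe, prompt, session_len, max_new_tokens):
    # count prompt tokens and leave room for the tokens to be generated
    n_tokens = len(pipe.tokenizer.encode(prompt))
    return n_tokens + max_new_tokens <= session_len


assert fits_in_session(pipe, prompt, 1000000, 1024), 'prompt exceeds session_len'
```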
## Evaluation
We evaluate LMDeploy's long-context inference capability in several ways, including passkey retrieval, needle-in-a-haystack, and computing perplexity.
### Passkey Retrieval
You can try the following code to test how well LMDeploy retrieves a special key hidden in a long context.
```python
import time

import numpy as np

from lmdeploy import TurbomindEngineConfig, pipeline

session_len = 1000000
backend_config = TurbomindEngineConfig(
    rope_scaling_factor=2.5,
    session_len=session_len,
    max_batch_size=1,
    cache_max_entry_count=0.7,
    tp=4)
pipe = pipeline('internlm/internlm2_5-7b-chat-1m', backend_config=backend_config)


def passkey_retrieval(session_len, n_round=5):
    # create long context input
    tok = pipe.tokenizer
    task_description = 'There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.'
    garbage = 'The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again.'

    for _ in range(n_round):
        start = time.perf_counter()
        n_times = (session_len - 1000) // len(tok.encode(garbage))
        n_garbage_prefix = np.random.randint(0, n_times)
        n_garbage_suffix = n_times - n_garbage_prefix
        garbage_prefix = ' '.join([garbage] * n_garbage_prefix)
        garbage_suffix = ' '.join([garbage] * n_garbage_suffix)
        pass_key = np.random.randint(1, 50000)
        information_line = f'The pass key is {pass_key}. Remember it. {pass_key} is the pass key.'  # noqa: E501
        final_question = 'What is the pass key? The pass key is'
        lines = [
            task_description,
            garbage_prefix,
            information_line,
            garbage_suffix,
            final_question,
        ]
        # inference
        prompt = ' '.join(lines)
        response = pipe([prompt])
        print(pass_key, response)
        end = time.perf_counter()
        print(f'duration: {end - start} s')


passkey_retrieval(session_len, 5)
```
Running this test on A100-80G GPUs, each round takes about 364 seconds.
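The script above only prints each passkey alongside the raw response. To turn it into a pass/fail metric, check whether the generation contains the key; a minimal sketch, assuming `response` is the list returned by `pipe([prompt])` and each item exposes the generated text via `.text`:

```python
def round_passed(pass_key, response):
    # a round counts as passed when the generated text contains the key
    return str(pass_key) in response[0].text
```

Accumulating `round_passed` over the `n_round` iterations yields a retrieval accuracy you can compare across `session_len` settings.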
### Needle in a Haystack
OpenCompass provides a very useful tool for needle-in-a-haystack evaluation. For detailed instructions, please refer to the guide.
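For a quick self-contained check before setting up OpenCompass, the core of the protocol can be sketched directly with the `pipe` object from the passkey test: plant one "needle" sentence at several depths of a filler "haystack" and ask the model to recall it. The needle text, filler, and depth grid below are arbitrary choices:

```python
needle = ('The best thing to do in San Francisco is eat a sandwich '
          'and sit in Dolores Park on a sunny day.')
haystack_unit = 'The grass is green. The sky is blue. The sun is yellow. '
question = 'What is the best thing to do in San Francisco?'

n_units = 2000  # raise this to probe longer contexts
for depth in (0.0, 0.25, 0.5, 0.75, 1.0):
    cut = int(n_units * depth)
    prompt = (haystack_unit * cut + needle + ' ' +
              haystack_unit * (n_units - cut) + question)
    print(depth, pipe([prompt])[0].text)
```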
### Perplexity
The following code demonstrates how to compute perplexity with LMDeploy.
```python
from transformers import AutoTokenizer

from lmdeploy import TurbomindEngineConfig, pipeline

# load model and tokenizer
model_repoid_or_path = 'internlm/internlm2_5-7b-chat-1m'
backend_config = TurbomindEngineConfig(
    rope_scaling_factor=2.5,
    session_len=1000000,
    max_batch_size=1,
    cache_max_entry_count=0.7,
    tp=4)
pipe = pipeline(model_repoid_or_path, backend_config=backend_config)
tokenizer = AutoTokenizer.from_pretrained(model_repoid_or_path, trust_remote_code=True)

# get perplexity
text = 'Use a long prompt to replace this sentence'
input_ids = tokenizer.encode(text)
ppl = pipe.get_ppl(input_ids)[0]
print(ppl)
```
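`get_ppl` also makes it easy to check whether modeling quality holds up as the context grows: compute perplexity on increasingly long prefixes of the same document and watch for a blow-up past the trained window. A sketch under the same setup (the prefix lengths are arbitrary, and `text` should be replaced with a genuinely long document):

```python
input_ids = tokenizer.encode(text)
for n in (2048, 8192, 32768, 131072):
    if n > len(input_ids):
        break
    ppl = pipe.get_ppl(input_ids[:n])[0]
    print(f'{n} tokens: ppl = {ppl}')
```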