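Tokenizers

Tokenizers are a key component of any LLM. They convert raw text into token IDs, which index into the embedding vectors that the model understands.

In torchtune, tokenizers convert Message objects into token IDs and add any model-specific special tokens that are required.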

from torchtune.data import Message
from torchtune.models.phi3 import phi3_mini_tokenizer

sample = {
    "input": "user prompt",
    "output": "model response",
}

msgs = [
    Message(role="user", content=sample["input"]),
    Message(role="assistant", content=sample["output"])
]

p_tokenizer = phi3_mini_tokenizer("/tmp/Phi-3-mini-4k-instruct/tokenizer.model")
tokens, mask = p_tokenizer.tokenize_messages(msgs)
print(tokens)
# [1, 32010, 29871, 13, 1792, 9508, 32007, 29871, 13, 32001, 29871, 13, 4299, 2933, 32007, 29871, 13]
print(p_tokenizer.decode(tokens))
# '\nuser prompt \n \nmodel response \n'

Model tokenizers are usually built on an underlying byte-pair encoding (BPE) implementation such as SentencePiece or TikToken, both of which are supported in torchtune.
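To make the term concrete, here is a minimal sketch of the core idea behind byte-pair encoding: repeatedly merge the most frequent adjacent pair of symbols into a new symbol. It is a character-level toy, not how SentencePiece or TikToken are implemented, and the function names `most_frequent_pair` and `merge_pair` are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def most_frequent_pair(symbols: List[str]) -> Tuple[str, str]:
    """Return the adjacent pair that occurs most often (first seen wins ties)."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return max(pairs, key=pairs.get)


def merge_pair(symbols: List[str], pair: Tuple[str, str]) -> List[str]:
    """Replace every non-overlapping occurrence of `pair` with one merged symbol."""
    merged: List[str] = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


# Start from single characters and apply three merge rounds.
symbols = list("aaabdaaabac")
for _ in range(3):
    symbols = merge_pair(symbols, most_frequent_pair(symbols))

print(symbols)
# ['aaab', 'd', 'aaab', 'a', 'c']
```

A real tokenizer learns thousands of such merges from a large corpus, then assigns each resulting symbol an integer ID.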

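Downloading tokenizers from Hugging Face

Models hosted on Hugging Face are distributed together with the tokenizers they were trained with. When you use tune download, the tokenizer is downloaded automatically alongside the model weights. For example, this command downloads the Mistral-7B model weights and tokenizer: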

tune download mistralai/Mistral-7B-v0.1 --output-dir /tmp/Mistral-7B-v0.1 --hf-token <HF_TOKEN>
cd /tmp/Mistral-7B-v0.1/
ls tokenizer.model
# tokenizer.model

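Loading tokenizers from file

Once you have downloaded the tokenizer file, you can load it into the corresponding tokenizer class by pointing to the tokenizer model's file path, either in your config or in the constructor. If you downloaded the file to a different location, pass that custom path instead.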

# In code
from torchtune.models.mistral import mistral_tokenizer

m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")
type(m_tokenizer)
# <class 'torchtune.models.mistral._tokenizer.MistralTokenizer'>

# In config
tokenizer:
  _component_: torchtune.models.mistral.mistral_tokenizer
  path: /tmp/Mistral-7B-v0.1/tokenizer.model
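The config form and the code form describe the same call: `_component_` names a callable by its dotted path, and the remaining keys become keyword arguments. The sketch below illustrates that idea with a hypothetical helper, `instantiate_component`. It is not torchtune's own instantiation utility, which does more, and it uses a standard-library target so that it runs without torchtune or a tokenizer file.

```python
import importlib
from typing import Any, Dict


def instantiate_component(cfg: Dict[str, Any]) -> Any:
    """Resolve the dotted `_component_` path and call it with the other keys as kwargs."""
    kwargs = {key: value for key, value in cfg.items() if key != "_component_"}
    module_path, _, attr = cfg["_component_"].rpartition(".")
    target = getattr(importlib.import_module(module_path), attr)
    return target(**kwargs)


# Stand-in target from the standard library, used purely for illustration.
frac = instantiate_component(
    {"_component_": "fractions.Fraction", "numerator": 3, "denominator": 4}
)
print(frac)
# 3/4
```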

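Setting max sequence length

Setting the maximum sequence length lets you control memory usage and adhere to model specifications. Inputs that tokenize to more than max_seq_len tokens are truncated.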

# In code
from torchtune.models.mistral import mistral_tokenizer

m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model", max_seq_len=8192)

# Set an arbitrarily small seq len for demonstration
from torchtune.data import Message

m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model", max_seq_len=7)
msg = Message(role="user", content="hello world")
tokens, mask = m_tokenizer.tokenize_messages([msg])
print(len(tokens))
# 7
print(tokens)
# [1, 733, 16289, 28793, 6312, 28709, 2]
print(m_tokenizer.decode(tokens))
# '[INST] hello'

# In config
tokenizer:
  _component_: torchtune.models.mistral.mistral_tokenizer
  path: /tmp/Mistral-7B-v0.1/tokenizer.model
  max_seq_len: 8192
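One simple way to get the truncated output shown in the demonstration is to cut the list at max_seq_len and overwrite the last kept token with the end-of-sequence ID. The helper below, `truncate_tokens`, is a hypothetical sketch of that behavior. Whether the library does exactly this internally is an assumption.

```python
from typing import List, Optional


def truncate_tokens(
    tokens: List[int], max_seq_len: int, eos_id: Optional[int] = None
) -> List[int]:
    """Keep at most `max_seq_len` tokens; if anything was cut, end with `eos_id`."""
    truncated = tokens[:max_seq_len]
    if eos_id is not None and len(tokens) > max_seq_len and truncated:
        truncated[-1] = eos_id
    return truncated


# Untruncated IDs for the "hello world" user message, as printed elsewhere on this page.
full = [1, 733, 16289, 28793, 6312, 28709, 1526, 28705, 733, 28748, 16289, 28793]

print(truncate_tokens(full, max_seq_len=7, eos_id=2))
# [1, 733, 16289, 28793, 6312, 28709, 2]
```

Ending with the EOS ID still marks the end of the sequence, at the cost of dropping one extra content token.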

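Prompt templates

Prompt templates are enabled by passing them into any model tokenizer. See Prompt Templates for more details.

Special tokens

Special tokens are model-specific tags that are required to prompt the model. They differ from prompt templates because each one is assigned its own unique token ID. For an extended discussion of the difference between special tokens and prompt templates, see Prompt Templates.

Special tokens are added to your data automatically by the model tokenizer and require no additional configuration from you. For experimentation, you can also customize the special tokens by passing the path to a JSON file that contains the new special-token mapping. This does NOT modify the underlying tokenizer.model to support the new special token IDs; it is your responsibility to ensure that the tokenizer file encodes them correctly. Note also that some models require certain special tokens to be present in order to work properly, such as "<|eot_id|>" in Llama3 Instruct.

For example, here we change the "<|begin_of_text|>" and "<|end_of_text|>" token IDs in Llama3 Instruct: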

# tokenizer/special_tokens.json
{
    "added_tokens": [
        {
            "id": 128257,
            "content": "<|begin_of_text|>"
        },
        {
            "id": 128258,
            "content": "<|end_of_text|>"
        },
        # Remaining required special tokens
        ...
    ]
}

# In code
from torchtune.models.llama3 import llama3_tokenizer

tokenizer = llama3_tokenizer(
    path="/tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model",
    special_tokens_path="tokenizer/special_tokens.json",
)
print(tokenizer.special_tokens)
# {'<|begin_of_text|>': 128257, '<|end_of_text|>': 128258, ...}

# In config
tokenizer:
  _component_: torchtune.models.llama3.llama3_tokenizer
  path: /tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model
  special_tokens_path: tokenizer/special_tokens.json
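Because the tokenizer file itself is not updated, it can help to sanity-check a custom mapping before pointing special_tokens_path at it. The sketch below parses JSON in the same "added_tokens" layout into a token-to-ID dictionary and rejects duplicate IDs. `special_tokens_from_json` is a hypothetical helper, not torchtune's parser, and the inline JSON holds only the two tokens from the example, not the full set a real Llama3 file needs.

```python
import json
from typing import Dict

# Hypothetical file contents in the same "added_tokens" layout as the example.
raw = """
{
    "added_tokens": [
        {"id": 128257, "content": "<|begin_of_text|>"},
        {"id": 128258, "content": "<|end_of_text|>"}
    ]
}
"""


def special_tokens_from_json(text: str) -> Dict[str, int]:
    """Map each token string to its ID, rejecting duplicate IDs."""
    entries = json.loads(text)["added_tokens"]
    mapping = {entry["content"]: entry["id"] for entry in entries}
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("two special tokens share the same ID")
    return mapping


special_tokens = special_tokens_from_json(raw)
print(special_tokens)
# {'<|begin_of_text|>': 128257, '<|end_of_text|>': 128258}
```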

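Base tokenizers

BaseTokenizers are the underlying byte-pair encoding modules that perform the actual conversion from raw strings to token IDs and back. In torchtune, they are required to implement encode and decode methods, which the model tokenizer calls to convert between raw text and token IDs.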

from typing import Any, Dict, List, Protocol


class BaseTokenizer(Protocol):

    def encode(self, text: str, **kwargs: Dict[str, Any]) -> List[int]:
        """
        Given a string, return the encoded list of token ids.

        Args:
            text (str): The text to encode.
            **kwargs (Dict[str, Any]): kwargs.

        Returns:
            List[int]: The encoded list of token ids.
        """
        pass

    def decode(self, token_ids: List[int], **kwargs: Dict[str, Any]) -> str:
        """
        Given a list of token ids, return the decoded text, optionally including special tokens.

        Args:
            token_ids (List[int]): The list of token ids to decode.
            **kwargs (Dict[str, Any]): kwargs.

        Returns:
            str: The decoded text.
        """
        pass

If you load any model tokenizer, you can see that it calls its underlying BaseTokenizer to do the actual encoding and decoding.

from torchtune.models.mistral import mistral_tokenizer
from torchtune.modules.tokenizers import SentencePieceBaseTokenizer

m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")
# Mistral uses SentencePiece for its underlying BPE
sp_tokenizer = SentencePieceBaseTokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")

text = "hello world"

print(m_tokenizer.encode(text))
# [1, 6312, 28709, 1526, 2]

print(sp_tokenizer.encode(text))
# [1, 6312, 28709, 1526, 2]
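The encode/decode contract is small enough to satisfy with a toy class. The sketch below is a hypothetical `ByteTokenizer` that emits one token per UTF-8 byte and uses two made-up IDs (256 and 257) for BOS and EOS. It only illustrates the shape of the interface and is not a tokenizer that any model was trained with.

```python
from typing import Any, Dict, List


class ByteTokenizer:
    """Toy tokenizer: one token per UTF-8 byte, plus optional BOS/EOS IDs."""

    bos_id = 256  # made-up ID, outside the byte range 0..255
    eos_id = 257  # made-up ID, outside the byte range 0..255

    def encode(
        self,
        text: str,
        add_bos: bool = True,
        add_eos: bool = True,
        **kwargs: Dict[str, Any],
    ) -> List[int]:
        ids = list(text.encode("utf-8"))
        if add_bos:
            ids.insert(0, self.bos_id)
        if add_eos:
            ids.append(self.eos_id)
        return ids

    def decode(self, token_ids: List[int], **kwargs: Dict[str, Any]) -> str:
        # Drop BOS/EOS (any ID >= 256) and decode the remaining bytes.
        return bytes(t for t in token_ids if t < 256).decode("utf-8")


tok = ByteTokenizer()
ids = tok.encode("hi")
print(ids)
# [256, 104, 105, 257]
print(tok.decode(ids))
# hi
```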

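Model tokenizers

ModelTokenizers are specific to a particular model. They are required to implement the tokenize_messages method, which converts a list of Messages into a list of token IDs and a matching list of masks.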

from typing import Any, Dict, List, Optional, Protocol, Tuple

from torchtune.data import Message


class ModelTokenizer(Protocol):

    special_tokens: Dict[str, int]
    max_seq_len: Optional[int]

    def tokenize_messages(
        self, messages: List[Message], **kwargs: Dict[str, Any]
    ) -> Tuple[List[int], List[bool]]:
        """
        Given a list of messages, return a list of tokens and list of masks for
        the concatenated and formatted messages.

        Args:
            messages (List[Message]): The list of messages to tokenize.
            **kwargs (Dict[str, Any]): kwargs.

        Returns:
            Tuple[List[int], List[bool]]: The list of token ids and the list of masks.
        """
        pass

They are model-specific, and distinct from base tokenizers, because they add all the special tokens or prompt templates needed to prompt the model.

from torchtune.models.mistral import mistral_tokenizer
from torchtune.modules.tokenizers import SentencePieceBaseTokenizer
from torchtune.data import Message

m_tokenizer = mistral_tokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")
# Mistral uses SentencePiece for its underlying BPE
sp_tokenizer = SentencePieceBaseTokenizer("/tmp/Mistral-7B-v0.1/tokenizer.model")

text = "hello world"
msg = Message(role="user", content=text)

tokens, mask = m_tokenizer.tokenize_messages([msg])
print(tokens)
# [1, 733, 16289, 28793, 6312, 28709, 1526, 28705, 733, 28748, 16289, 28793]
print(sp_tokenizer.encode(text))
# [1, 6312, 28709, 1526, 2]
print(m_tokenizer.decode(tokens))
# [INST] hello world  [/INST]
print(sp_tokenizer.decode(sp_tokenizer.encode(text)))
# hello world
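The layering can be sketched without torchtune: a model tokenizer delegates raw text to a base encoder and then frames each message with special tokens, returning one mask entry per token. Everything in the block is hypothetical, including `ToyMessage`, `ToyChatTokenizer`, the four special tokens and their IDs, and the rule that each token inherits its message's `masked` flag. Real model tokenizers follow their own model's format.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class ToyMessage:
    """Stand-in for a chat message with only the fields this sketch needs."""

    role: str
    content: str
    masked: bool = False


class ToyChatTokenizer:
    """Byte-level encoder that frames messages with made-up special tokens."""

    special_tokens: Dict[str, int] = {"<bos>": 256, "<inst>": 257, "</inst>": 258, "<eos>": 259}
    max_seq_len: Optional[int] = None

    def _encode(self, text: str) -> List[int]:
        # "Base tokenizer" step: raw text to IDs, with no special tokens.
        return list(text.encode("utf-8"))

    def tokenize_messages(self, messages: List[ToyMessage]) -> Tuple[List[int], List[bool]]:
        st = self.special_tokens
        tokens: List[int] = []
        mask: List[bool] = []
        for i, msg in enumerate(messages):
            body = self._encode(msg.content)
            if msg.role == "user":
                ids = [st["<inst>"], *body, st["</inst>"]]
            else:
                ids = [*body, st["<eos>"]]
            if i == 0:
                ids = [st["<bos>"], *ids]
            tokens.extend(ids)
            # Every token inherits the masked flag of the message it came from.
            mask.extend([msg.masked] * len(ids))
        return tokens, mask


msgs = [ToyMessage("user", "hi", masked=True), ToyMessage("assistant", "yo")]
tokens, mask = ToyChatTokenizer().tokenize_messages(msgs)
print(tokens)
# [256, 257, 104, 105, 258, 121, 111, 259]
print(mask)
# [True, True, True, True, True, False, False, False]
```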