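Large Language Models

Integrating LLMs into structured NLP pipelines

The spacy-llm package integrates Large Language Models (LLMs) into spaCy, featuring a modular system for fast prototyping and prompting, and turning unstructured responses into robust outputs for various NLP tasks, with no training data required.

Config and implementation

The LLM component is implemented through the LLMWrapper class. It is accessible through a generic llm component factory as well as through task-specific component factories: llm_ner, llm_spancat, llm_rel, llm_textcat, llm_sentiment, llm_summarization, llm_entity_linker, llm_raw and llm_translation. For these factories, the GPT-3.5 model from OpenAI is used by default, but this can be customized.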

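LLMWrapper.__init__ method

Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and nlp.add_pipe.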
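As a minimal sketch of the shortcut described above, the component can be added by its string name. The task and model picks (`spacy.NER.v3`, `spacy.GPT-3-5.v3`) and the labels are illustrative choices taken from the tables in this document, and the helper names `build_llm_config` and `add_llm_component` are hypothetical. Using it on a real pipeline assumes spaCy and spacy-llm are installed and an API key is set.

```python
from typing import Any, Dict


def build_llm_config() -> Dict[str, Any]:
    """Config for the generic `llm` factory: one task plus one model."""
    return {
        "task": {
            "@llm_tasks": "spacy.NER.v3",
            "labels": ["PERSON", "ORGANISATION", "LOCATION"],
        },
        "model": {"@llm_models": "spacy.GPT-3-5.v3"},
    }


def add_llm_component(nlp: Any, name: str = "llm") -> Any:
    """Add the LLM component to an existing pipeline by its string name.

    `nlp` is expected to be a spaCy `Language` object, e.g. `spacy.blank("en")`.
    """
    return nlp.add_pipe("llm", name=name, config=build_llm_config())
```

With spaCy and spacy-llm available, `add_llm_component(spacy.blank("en"))` would return the created component, after which calling `nlp` on a text runs it like any other pipeline component.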

Name	Description
`name`	String name of the component instance. `llm` by default. str
keyword-only
`vocab`	The shared vocabulary. Vocab
`task`	An LLM Task can generate prompts and parse LLM responses. LLMTask
`model`	The LLM Model queries a specific LLM API. Callable[[Iterable[Any]], Iterable[Any]]
`cache`	Cache to use for caching prompts and responses per doc. Cache
`save_io`	Whether to save LLM I/O (prompts and responses) in the `Doc._.llm_io` custom attribute. bool

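LLMWrapper.__call__ method

Apply the pipe to one document. The document is modified in place and returned. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order.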

Name	Description
`doc`	The document to process. Doc
RETURNS	The processed document. Doc

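LLMWrapper.pipe method

Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order.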

Name	Description
`docs`	A stream of documents. Iterable[Doc]
keyword-only
`batch_size`	The number of documents to buffer. Defaults to `128`. int
YIELDS	The processed documents in order. Doc

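LLMWrapper.add_label method

Add a new label to the pipe's task. Alternatively, provide the labels upon the task definition, or through the [initialize] block of the config.

Name	Description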
`label`	The label to add. str
RETURNS	`0` if the label is already present, otherwise `1`. int

LLMWrapper.to_disk method

Serialize the pipe to disk.

Name	Description
`path`	A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]

LLMWrapper.from_disk method

Load the pipe from disk. Modifies the object in place and returns it.

Name	Description
`path`	A path to a directory. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The modified `LLMWrapper` object. LLMWrapper

LLMWrapper.to_bytes method

Serialize the pipe to a bytestring.

Name	Description
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The serialized form of the `LLMWrapper` object. bytes

LLMWrapper.from_bytes method

Load the pipe from a bytestring. Modifies the object in place and returns it.

Name	Description
`bytes_data`	The data to load from. bytes
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The `LLMWrapper` object. LLMWrapper

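LLMWrapper.labels property

The labels currently added to the component. Empty tuple if the LLM's task does not require labels.

Name	Description
RETURNS	The labels added to the component. Tuple[str, ...]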

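Tasks

In spacy-llm, a task defines an NLP problem or question and its solution using an LLM. It does so by implementing the following responsibilities:

Loading a prompt template and injecting documents' data into the prompt. Optionally, including few-shot examples in the prompt.
Splitting the prompt into several pieces following a map-reduce paradigm, if the prompt is too long to fit into the model's context and the task supports sharding prompts.
Parsing the LLM's responses back into structured information and validating the parsed output.

Two different task interfaces are supported: ShardingLLMTask and NonShardingLLMTask. Only the former supports the sharding of documents, i.e. splitting up prompts if they are too long.

All tasks are registered in the llm_tasks registry.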

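On sharding

"Sharding" usually describes distributing parts of a dataset across multiple storage units for easier processing and lookups. In spacy-llm we use this term (synonymously: "mapping") to describe the splitting up of prompts if they are too long for a model to handle, and "fusing" (synonymously: "reducing") to describe how the model responses for several shards are merged back together into a single document.

Prompts are broken up in a manner that always keeps the prompt in the template intact, meaning that the instructions to the LLM will always stay complete. The document content, however, will be split if the length of the fully rendered prompt exceeds the model's context length.

A toy example: let's assume a model has a context window of 25 tokens and the prompt template for our fictional, sharding-supporting task looks like this:

Depending on how tokens are counted exactly (this is a config setting), we might come up with n = 12 tokens for the number of tokens in the prompt instructions. Furthermore, let's assume that our text is "This has been amazing - I can't remember the last time I left the cinema so impressed." - which has roughly 19 tokens.

Considering we only have 13 tokens to add to our prompt before we hit the context limit, we'll have to split our prompt into two parts. Thus spacy-llm, assuming the task used supports sharding, will split the prompt into two (the default splitting strategy splits by tokens, but alternative splitting strategies, e.g. splitting by sentences, can be configured):

(Prompt 1/2)

(Prompt 2/2)

The reduction step is task-specific - a sentiment estimation task might e.g. do a weighted average of the sentiment scores. Note that prompt sharding introduces potential inaccuracies, as the LLM won't have access to the entire document at once. Depending on your use case this might or might not be problematic.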
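The arithmetic of the toy example can be sketched in plain Python. This is an illustration under stated assumptions, not spacy-llm's implementation: tokens are counted by whitespace splitting (which yields 17 tokens for the example text rather than the roughly 19 mentioned above), shards are cut purely by token count, and the helper names `count_tokens`, `shard_text` and `fuse_sentiment` are hypothetical.

```python
from typing import List

CONTEXT_WINDOW = 25      # model context length in tokens (from the toy example)
INSTRUCTION_TOKENS = 12  # tokens taken up by the prompt instructions (n = 12)


def count_tokens(text: str) -> int:
    """Assumption: whitespace splitting stands in for the configured counter."""
    return len(text.split())


def shard_text(text: str, budget: int) -> List[str]:
    """Map step: split the text into chunks of at most `budget` tokens."""
    tokens = text.split()
    return [" ".join(tokens[i:i + budget]) for i in range(0, len(tokens), budget)]


def fuse_sentiment(scores: List[float], shards: List[str]) -> float:
    """Reduce step: average the shard scores, weighted by shard length in tokens."""
    weights = [count_tokens(shard) for shard in shards]
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


text = ("This has been amazing - I can't remember the last time "
        "I left the cinema so impressed.")
budget = CONTEXT_WINDOW - INSTRUCTION_TOKENS  # 13 tokens left for document content
shards = shard_text(text, budget)             # two shards: 13 tokens and 4 tokens
```

Each shard would be rendered into its own prompt with the full instructions. If the model returned, say, hypothetical scores of 0.9 and 0.7 for the two shards, `fuse_sentiment([0.9, 0.7], shards)` would weight the first score more heavily because its shard is longer.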

`NonShardingLLMTask`

task.generate_prompts

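Takes a collection of documents, and returns a collection of "prompts", which can be of type Any. Often, prompts are of type str - but this is not enforced to allow for maximum flexibility in the framework.

Argument	Description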
`docs`	The input documents. Iterable[Doc]
RETURNS	The generated prompts. Iterable[Any]

task.parse_responses

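Takes a collection of LLM responses and the original documents, parses the responses into structured information, and sets the annotations on the documents. The parse_responses function is free to set the annotations in any way, including Doc fields like ents, spans or cats, or using custom defined fields.

The responses are of type Iterable[Any], though they will often be str objects. This depends on the return type of the model.

Argument	Description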
`docs`	The input documents. Iterable[Doc]
`responses`	The responses received from the LLM. Iterable[Any]
RETURNS	The annotated documents. Iterable[Doc]
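The two responsibilities of a non-sharding task can be sketched without any dependencies. Everything named here is hypothetical: `FakeDoc` is a stand-in for spaCy's Doc (only `text` and `user_data` are mimicked), and `QuestionTask` with its template is an invented toy task. A real task would additionally be registered in the llm_tasks registry.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable


@dataclass
class FakeDoc:
    """Hypothetical stand-in for spaCy's Doc, to keep the sketch dependency-free."""
    text: str
    user_data: Dict[str, Any] = field(default_factory=dict)


class QuestionTask:
    """Toy non-sharding task: one prompt per document, one response per prompt."""

    template = "Answer in one word.\nQuestion: {text}\nAnswer:"

    def generate_prompts(self, docs: Iterable[FakeDoc]) -> Iterable[str]:
        # Inject each document's text into the prompt template.
        for doc in docs:
            yield self.template.format(text=doc.text)

    def parse_responses(
        self, docs: Iterable[FakeDoc], responses: Iterable[str]
    ) -> Iterable[FakeDoc]:
        # Store the cleaned response on the document and hand the document back.
        for doc, response in zip(docs, responses):
            doc.user_data["answer"] = response.strip()
            yield doc


docs = [FakeDoc("What is the capital of France?")]
task = QuestionTask()
prompts = list(task.generate_prompts(docs))
annotated = list(task.parse_responses(docs, [" Paris \n"]))
```

The response string passed to `parse_responses` is hand-written here; in a pipeline it would come from the configured model.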

`ShardingLLMTask`

task.generate_prompts

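Takes a collection of documents, breaks them up into shards if necessary to fit all content into the model's context, and returns a collection of collections of "prompts" (i.e. each doc can have multiple shards, each of which has exactly one prompt), which can be of type Any. Often, prompts are of type str - but this is not enforced to allow for maximum flexibility in the framework.

Argument	Description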
`docs`	The input documents. Iterable[Doc]
RETURNS	The generated prompts. Iterable[Iterable[Any]]

task.parse_responses

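Receives a collection of collections of LLM responses (i.e. each doc can have multiple shards, each of which has exactly one prompt / prompt response) and the original shards, parses the responses into structured information, sets the annotations on the shards, and merges the doc shards back into single docs. The parse_responses function is free to set the annotations in any way, including Doc fields like ents, spans or cats, or using custom defined fields.

The responses are of type Iterable[Iterable[Any]], though they will often be str objects. This depends on the return type of the model.

Argument	Description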
`shards`	The input document shards. Iterable[Iterable[Doc]]
`responses`	The responses received from the LLM. Iterable[Iterable[Any]]
RETURNS	The annotated documents. Iterable[Doc]

Translation

The translation task translates texts from a defined or inferred source to a defined target language.

spacy.Translation.v1

spacy.Translation.v1 supports both zero-shot and few-shot prompting.

Argument	Description
`template`	Custom prompt template to send to LLM model. Defaults to translation.v1.jinja. str
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[TranslationTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `TranslationExample`. Optional[Type[FewshotExample]]
`source_lang`	Language to translate from. Doesn’t have to be set. Optional[str]
`target_lang`	Language to translate to. No default value, has to be set. str
`field`	Name of extension attribute to store translation in (i. e. the translation will be available in `doc._.{field}`). Defaults to `translation`. str

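To perform few-shot learning, you can write down a few examples in a separate file and provide these to be injected into the prompt to the LLM. The default reader spacy.FewShotReader.v1 supports .yml, .yaml, .json and .jsonl.

Raw prompting

Unlike all other tasks, spacy.Raw.vX doesn't provide a specific prompt wrapping the doc data to the model. Instead it instructs the model to reply to the doc content. This is handy for use cases like question answering (where each doc contains one question) or if you want to include customized prompts for each doc.

spacy.Raw.v1

Note that since this task may request arbitrary information, it doesn't do any parsing per se - the model response is stored in a custom Doc attribute (i.e. it can be accessed via doc._.{field}).

It supports both zero-shot and few-shot prompting.

Argument	Description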
`template`	Custom prompt template to send to LLM model. Defaults to raw.v1.jinja. str
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses`	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[RawTask]]
`prompt_example_type`	Type to use for fewshot examples. Defaults to `RawExample`. Optional[Type[FewshotExample]]
`field`	Name of extension attribute to store model reply in (i. e. the reply will be available in `doc._.{field}`). Defaults to `reply`. str

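To perform few-shot learning, you can write down a few examples in a separate file and provide these to be injected into the prompt to the LLM. The default reader spacy.FewShotReader.v1 supports .yml, .yaml, .json and .jsonl.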

Summarization

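A summarization task takes a document as input and generates a summary that is stored in an extension attribute.

spacy.Summarization.v1

The spacy.Summarization.v1 task supports both zero-shot and few-shot prompting.

Argument	Description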
`template`	Custom prompt template to send to LLM model. Defaults to summarization.v1.jinja. str
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[SummarizationTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `SummarizationExample`. Optional[Type[FewshotExample]]
`max_n_words`	Maximum number of words to be used in summary. Note that this should not be expected to work exactly. Defaults to `None`. Optional[int]
`field`	Name of extension attribute to store summary in (i. e. the summary will be available in `doc._.{field}`). Defaults to `summary`. str

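The summarization task prompts the model for a concise summary of the provided text. It optionally allows limiting the response to a certain number of words - note that this requirement will be included in the prompt, but the task doesn't perform a hard cut-off. It is hence possible that your summary exceeds max_n_words.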
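Because the limit is only a request in the prompt, callers who need a strict bound have to check it themselves. A minimal sketch, assuming words are counted by whitespace splitting; the helper names `exceeds_word_limit` and `truncate_words` are hypothetical and not part of spacy-llm:

```python
from typing import Optional


def exceeds_word_limit(summary: str, max_n_words: Optional[int]) -> bool:
    """Return True if the summary has more whitespace-separated words than allowed."""
    if max_n_words is None:  # no limit requested
        return False
    return len(summary.split()) > max_n_words


def truncate_words(summary: str, max_n_words: int) -> str:
    """Hard cut-off applied by the caller, since the task itself does not do one."""
    return " ".join(summary.split()[:max_n_words])
```

A caller could run `exceeds_word_limit` on the stored summary after processing and either truncate it or re-prompt.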

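To perform few-shot learning, you can write down a few examples in a separate file and provide these to be injected into the prompt to the LLM. The default reader spacy.FewShotReader.v1 supports .yml, .yaml, .json and .jsonl.

EL (Entity Linking)

The EL task links recognized entities (see NER) to those in a knowledge base (KB). The EL task prompts the LLM to select the most likely candidate from the KB, whose structure can be arbitrary.

Note that the documents processed by the entity linking task are expected to have recognized entities in their .ents attribute. This can be achieved by either running the NER task, using a trained spaCy NER model or setting the entities manually prior to running the EL task.

In order to be able to pull data from the KB, an object implementing the CandidateSelector protocol has to be provided. This requires two functions: (1) __call__() to fetch candidate entities for entity mentions in the text (assumed to be available in Doc.ents) and (2) get_entity_description() to fetch descriptions for any given entity ID. Descriptions can be empty, but ideally provide more context for entities stored in the KB.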
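The shape of such an object can be sketched with a dictionary-backed toy selector. This is a simplification under stated assumptions: mentions and candidates are plain strings here, whereas the real protocol operates on spaCy objects, and the exact signatures, the class name `DictCandidateSelector` and the entity IDs are hypothetical.

```python
from typing import Dict, Iterable, List


class DictCandidateSelector:
    """Toy candidate selector backed by in-memory dicts (hypothetical KB layout)."""

    def __init__(self, aliases: Dict[str, List[str]], descriptions: Dict[str, str]):
        self._aliases = aliases            # mention text -> candidate entity IDs
        self._descriptions = descriptions  # entity ID -> description

    def __call__(self, mentions: Iterable[str]) -> List[List[str]]:
        # (1) For each mention, return its candidate entity IDs (possibly none).
        return [self._aliases.get(mention, []) for mention in mentions]

    def get_entity_description(self, entity_id: str) -> str:
        # (2) Descriptions may be empty when the KB has none for this ID.
        return self._descriptions.get(entity_id, "")


selector = DictCandidateSelector(
    aliases={"Paris": ["E1", "E2"]},
    descriptions={"E1": "capital of France"},
)
candidates = selector(["Paris", "Atlantis"])
```

The mention "Atlantis" has no entry, so it gets an empty candidate list, and `E2` has no description, so an empty string is returned for it.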

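spacy-llm provides a CandidateSelector implementation (spacy.CandidateSelector.v1) that leverages a spaCy knowledge base - as used in an entity_linking component - to select candidates. This knowledge base can be loaded from an existing spaCy pipeline (note that the pipeline's EL component doesn't have to be trained) or from a separate .yaml file.

spacy.EntityLinker.v1

Supports zero- and few-shot prompting. Relies on a configurable component suggesting viable entities before letting the LLM pick the most likely candidate.

Argument	Description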
`template`	Custom prompt template to send to LLM model. Defaults to entity_linker.v1.jinja. str
`parse_responses`	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[EntityLinkerTask]]
`prompt_example_type`	Type to use for fewshot examples. Defaults to `ELExample`. Optional[Type[FewshotExample]]
`examples`	Optional callable that reads a file containing task examples for few-shot learning. If `None` is passed, zero-shot learning will be used. Defaults to `None`. ExamplesConfigType
`scorer`	Scorer function. Defaults to the metric used by spaCy to evaluate entity linking performance. Optional[Scorer]

spacy.CandidateSelector.v1

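spacy.CandidateSelector.v1 is an implementation of the CandidateSelector protocol required by spacy.EntityLinker.v1. The built-in candidate selector method allows loading existing knowledge bases in several ways, e.g. loading from a spaCy pipeline with a (not necessarily trained) entity linking component, and loading from a .yaml file describing the knowledge base. Either way the loaded data will be converted to a spaCy InMemoryLookupKB instance. The KB's selection capabilities are used to select the most likely entity candidates for the specified mentions.

Argument	Description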
`kb_loader`	KB loader object. InMemoryLookupKBLoader
`top_n`	Top-n candidates to include in the prompt. Defaults to 5. int

spacy.KBObjectLoader.v1

Adheres to the InMemoryLookupKBLoader interface required by spacy.CandidateSelector.v1. Loads a knowledge base from an existing spaCy pipeline.

Argument	Description
`path`	Path to KB file. Union[str,Path]
`nlp_path`	Path to serialized NLP pipeline. If None, path will be guessed. Optional[Union[Path, str]]
`desc_path`	Path to file with descriptions for entities. Optional[Union[Path, str]]
`ent_desc_reader`	Entity description reader. Defaults to an internal method expecting a CSV file without header row, with ";" as delimiters, and with two columns - one for the entities' IDs, one for their descriptions. Optional[EntDescReader]
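The default description file layout (no header row, ";" as delimiter, one column for IDs and one for descriptions) can be read with Python's csv module. This is a sketch of a reader for that layout, not the library's internal method; the function name `read_entity_descriptions` and the sample rows are hypothetical.

```python
import csv
import io
from typing import Dict, TextIO


def read_entity_descriptions(handle: TextIO) -> Dict[str, str]:
    """Read `entity_id;description` rows (no header row) into a dict."""
    reader = csv.reader(handle, delimiter=";")
    # Rows that do not have exactly two columns are skipped.
    return {row[0]: row[1] for row in reader if len(row) == 2}


sample = io.StringIO("E1;capital of France\nE2;city in Texas\n")
descriptions = read_entity_descriptions(sample)
```

For a file on disk, the same function can be called with `open(path, newline="", encoding="utf8")` as the handle.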

spacy.KBFileLoader.v1

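Adheres to the InMemoryLookupKBLoader interface required by spacy.CandidateSelector.v1. Loads a knowledge base from a knowledge base file. The KB .yaml file has to stick to the following format:

See here for a toy example of how such a KB file might look.

Argument	Description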
`path`	Path to KB file. Union[str,Path]

NER

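The NER task identifies non-overlapping entities in text.

spacy.NER.v3

Version 3 is fundamentally different from v1 and v2, as it implements Chain-of-Thought prompting, based on the PromptNER paper by Ashok and Lipton (2023). On an internal use-case, we have found this implementation to obtain significantly better accuracy - with an increase of F-score of up to 15 percentage points.

When no examples are specified, the v3 implementation will use a dummy example in the prompt. Technically this means that the task will always perform few-shot prompting under the hood.

Argument	Description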
`template`	Custom prompt template to send to LLM model. Defaults to ner.v3.jinja. str
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[NERTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `NERExample`. Optional[Type[FewshotExample]]
`scorer`	Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. Optional[Scorer]
`labels`	List of labels or str of comma-separated list of labels. Union[List[str], str]
`label_definitions`	Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. Optional[Dict[str, str]]
`description` (NEW)	A description of what to recognize or not recognize as entities. str
`normalizer`	Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. Optional[Callable[[str], str]]
`alignment_mode`	Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. str
`case_sensitive_matching`	Whether the search for entity strings is case-sensitive. Defaults to `False`. bool

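Note that the single_match parameter, used in v1 and v2, is not supported anymore, as the CoT parsing algorithm takes care of this automatically.

New to v3 is the fact that you can provide an explicit description of what entities should look like. You can use this feature in addition to label_definitions.

While not required, this task works best when both positive and negative examples are provided. The format differs from the files required for v1 and v2, as additional fields such as is_entity and reason should now be provided.

For a fully working example, see this usage example.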

spacy.NER.v2

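This version supports explicitly defining the provided labels with custom descriptions, and further supports zero-shot and few-shot prompting just like v1.

Argument	Description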
`template` (NEW)	Custom prompt template to send to LLM model. Defaults to ner.v2.jinja. str
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[NERTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `NERExample`. Optional[Type[FewshotExample]]
`scorer` (NEW)	Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. Optional[Scorer]
`labels`	List of labels or str of comma-separated list of labels. Union[List[str], str]
`label_definitions` (NEW)	Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. Optional[Dict[str, str]]
`normalizer`	Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. Optional[Callable[[str], str]]
`alignment_mode`	Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. str
`case_sensitive_matching`	Whether the search for entity strings is case-sensitive. Defaults to `False`. bool
`single_match`	Whether to match an entity in the LLM’s response only once (the first hit) or multiple times. Defaults to `False`. bool

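The parameters alignment_mode, case_sensitive_matching and single_match are identical to the v1 implementation. The format of few-shot examples is also the same.

New to v2 is the fact that you can write definitions for each label and provide them via the label_definitions argument. This lets you tell the LLM exactly what you're looking for rather than relying on the LLM to interpret its task given just the label name. Label descriptions are freeform so you can write whatever you want here, but a brief description along with some examples and counter examples seems to work quite well.

For a fully working example, see this usage example.

spacy.NER.v1

The original version of the built-in NER task supports both zero-shot and few-shot prompting.

Argument	Description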
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[NERTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `NERExample`. Optional[Type[FewshotExample]]
`scorer` (NEW)	Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. Optional[Scorer]
`labels`	Comma-separated list of labels. str
`normalizer`	Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. Optional[Callable[[str], str]]
`alignment_mode`	Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. str
`case_sensitive_matching`	Whether the search for entity strings is case-sensitive. Defaults to `False`. bool
`single_match`	Whether to match an entity in the LLM’s response only once (the first hit) or multiple times. Defaults to `False`. bool

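The NER task implementation doesn't currently ask the LLM for specific offsets, but simply expects a list of strings that represent the entities in the document. This means that a form of string matching is required. This can be configured by the following parameters:

The single_match parameter is typically set to False to allow for multiple matches. For instance, the response from the LLM might only mention the entity "Paris" once, but you'd still want to mark it every time it occurs in the document.
The case_sensitive_matching parameter is typically set to False to be robust against case variances in the LLM's output.
The alignment_mode argument is used to match entities as returned by the LLM to the tokens from the original Doc - specifically it's used as argument in the call to doc.char_span(). The "strict" mode will only keep spans that strictly adhere to the given token boundaries. "contract" will only keep those tokens that are fully within the given range, e.g. reducing "New Y" to "New". Finally, "expand" will expand the span to the next token boundaries, e.g. expanding "New Y" out to "New York".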
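The effect of the three parameters can be illustrated with a small, dependency-free sketch. It is a simplified reimplementation for illustration only: tokens are found with a regular expression rather than spaCy's tokenizer, spans are plain character offsets, and the helper names `find_matches` and `align` are hypothetical rather than spacy-llm or spaCy functions.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start_char, end_char)


def find_matches(text: str, entity: str, case_sensitive: bool = False,
                 single_match: bool = False) -> List[Span]:
    """Locate an entity string returned by the LLM in the document text."""
    flags = 0 if case_sensitive else re.IGNORECASE
    spans = [m.span() for m in re.finditer(re.escape(entity), text, flags)]
    return spans[:1] if single_match else spans


def align(text: str, span: Span, mode: str = "contract") -> Optional[Span]:
    """Snap a character span to token boundaries (words and punctuation marks)."""
    start, end = span
    tokens = [m.span() for m in re.finditer(r"\w+|[^\w\s]", text)]
    if mode == "strict":  # both ends must already sit on token boundaries
        ok = any(s == start for s, _ in tokens) and any(e == end for _, e in tokens)
        return span if ok else None
    if mode == "contract":  # keep only tokens fully inside the span
        kept = [(s, e) for s, e in tokens if s >= start and e <= end]
    elif mode == "expand":  # keep every token the span overlaps
        kept = [(s, e) for s, e in tokens if s < end and e > start]
    else:
        raise ValueError(f"unknown alignment mode: {mode}")
    return (kept[0][0], kept[-1][1]) if kept else None


text = "Paris is lovely. I prefer paris to New York."
partial = (35, 40)  # character offsets of "New Y"
```

With the defaults, `find_matches(text, "Paris")` finds both the capitalized and the lowercase occurrence, while `single_match=True` or `case_sensitive=True` keeps only the first. For the misaligned span "New Y", "strict" yields no span, "contract" yields "New" and "expand" yields "New York".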

SpanCat

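The SpanCat task identifies potentially overlapping entities in text.

spacy.SpanCat.v3

The built-in SpanCat v3 task is a simple adaptation of the NER v3 task to support overlapping entities and store its annotations in doc.spans.

Argument	Description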
`template`	Custom prompt template to send to LLM model. Defaults to `spancat.v3.jinja`. str
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[SpanCatTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `SpanCatExample`. Optional[Type[FewshotExample]]
`scorer` (NEW)	Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. Optional[Scorer]
`labels`	List of labels or str of comma-separated list of labels. Union[List[str], str]
`label_definitions`	Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. Optional[Dict[str, str]]
`description` (NEW)	A description of what to recognize or not recognize as entities. str
`spans_key`	Key of the `Doc.spans` dict to save the spans under. Defaults to `"sc"`. str
`normalizer`	Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. Optional[Callable[[str], str]]
`alignment_mode`	Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. str
`case_sensitive_matching`	Whether the search for entity strings is case-sensitive. Defaults to `False`. bool

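Note that the single_match parameter, used in v1 and v2, is not supported anymore, as the CoT parsing algorithm takes care of this automatically.

spacy.SpanCat.v2

The built-in SpanCat v2 task is a simple adaptation of the NER v2 task to support overlapping entities and store its annotations in doc.spans.

Argument	Description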
`template` (NEW)	Custom prompt template to send to LLM model. Defaults to `spancat.v2.jinja`. str
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[SpanCatTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `SpanCatExample`. Optional[Type[FewshotExample]]
`scorer` (NEW)	Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. Optional[Scorer]
`labels`	List of labels or str of comma-separated list of labels. Union[List[str], str]
`label_definitions` (NEW)	Optional dict mapping a label to a description of that label. These descriptions are added to the prompt to help instruct the LLM on what to extract. Defaults to `None`. Optional[Dict[str, str]]
`spans_key`	Key of the `Doc.spans` dict to save the spans under. Defaults to `"sc"`. str
`normalizer`	Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. Optional[Callable[[str], str]]
`alignment_mode`	Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. str
`case_sensitive_matching`	Whether the search for entity strings is case-sensitive. Defaults to `False`. bool
`single_match`	Whether to match an entity in the LLM’s response only once (the first hit) or multiple times. Defaults to `False`. bool

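Except for the spans_key parameter, the SpanCat v2 task reuses the configuration from the NER v2 task. Refer to its documentation for more insight.

spacy.SpanCat.v1

The original version of the built-in SpanCat task is a simple adaptation of the v1 NER task to support overlapping entities and store its annotations in doc.spans.

Argument	Description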
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[SpanCatTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `SpanCatExample`. Optional[Type[FewshotExample]]
`scorer` (NEW)	Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. Optional[Scorer]
`labels`	Comma-separated list of labels. str
`spans_key`	Key of the `Doc.spans` dict to save the spans under. Defaults to `"sc"`. str
`normalizer`	Function that normalizes the labels as returned by the LLM. If `None`, defaults to `spacy.LowercaseNormalizer.v1`. Optional[Callable[[str], str]]
`alignment_mode`	Alignment mode in case the LLM returns entities that do not align with token boundaries. Options are `"strict"`, `"contract"` or `"expand"`. Defaults to `"contract"`. str
`case_sensitive_matching`	Whether the search for entity strings is case-sensitive. Defaults to `False`. bool
`single_match`	Whether to match an entity in the LLM’s response only once (the first hit) or multiple times. Defaults to `False`. bool

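Except for the spans_key parameter, the SpanCat v1 task reuses the configuration from the NER v1 task. Refer to its documentation for more insight.

TextCat

The TextCat task labels documents with relevant categories.

spacy.TextCat.v3

On top of the functionality from v2, version 3 of the built-in TextCat task allows setting definitions of labels. Those definitions are included in the prompt.

Argument	Description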
`template`	Custom prompt template to send to LLM model. Defaults to `textcat.v3.jinja`. str
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[TextCatTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `TextCatExample`. Optional[Type[FewshotExample]]
`scorer` (NEW)	Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. Optional[Scorer]
`labels`	List of labels or str of comma-separated list of labels. Union[List[str], str]
`label_definitions` (NEW)	Dictionary of label definitions. Included in the prompt, if set. Defaults to `None`. Optional[Dict[str, str]]
`normalizer`	Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. Optional[Callable[[str], str]]
`exclusive_classes`	If set to `True`, only one label per document should be valid. If set to `False`, one document can have multiple labels. Defaults to `False`. bool
`allow_none`	When set to `True`, allows the LLM to not return any of the given label. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. bool
`verbose`	If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. bool

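The formatting of few-shot examples is the same as for the v1 implementation.

spacy.TextCat.v2

V2 includes all v1 functionality, with an improved prompt template.

Argument	Description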
`template` (NEW)	Custom prompt template to send to LLM model. Defaults to `textcat.v2.jinja`. str
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[TextCatTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `TextCatExample`. Optional[Type[FewshotExample]]
`scorer` (NEW)	Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. Optional[Scorer]
`labels`	List of labels or str of comma-separated list of labels. Union[List[str], str]
`normalizer`	Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. Optional[Callable[[str], str]]
`exclusive_classes`	If set to `True`, only one label per document should be valid. If set to `False`, one document can have multiple labels. Defaults to `False`. bool
`allow_none`	When set to `True`, allows the LLM to not return any of the given label. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. bool
`verbose`	If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. bool

The formatting of few-shot examples is the same as for the v1 implementation.

spacy.TextCat.v1

Version 1 of the built-in TextCat task supports both zero-shot and few-shot prompting.

Argument	Description
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[TextCatTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `TextCatExample`. Optional[Type[FewshotExample]]
`scorer` (NEW)	Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. Optional[Scorer]
`labels`	Comma-separated list of labels. str
`normalizer`	Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. Optional[Callable[[str], str]]
`exclusive_classes`	If set to `True`, only one label per document should be valid. If set to `False`, one document can have multiple labels. Defaults to `False`. bool
`allow_none`	When set to `True`, allows the LLM to not return any of the given label. The resulting dict in `doc.cats` will have `0.0` scores for all labels. Defaults to `True`. bool
`verbose`	If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. bool

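If you want to perform few-shot learning with a binary classifier (i.e. a text either should or should not be assigned to a given category), you can provide positive and negative examples with answers of "POS" or "NEG". "POS" means that this example should be assigned the class label defined in the configuration, "NEG" means it shouldn't. E.g. for spam classification: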
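A hypothetical pair of spam-classification examples could look as follows. The field names `text` and `answer` are assumptions about the few-shot example format, and the example texts and the helper `to_jsonl` are invented for illustration; the output is serialized as JSON Lines, one of the formats the default reader is described as supporting.

```python
import json
from typing import Dict, List

# Few-shot examples for a binary spam classifier. "POS" means the configured
# label applies, "NEG" means it does not. Field names are assumptions.
spam_examples: List[Dict[str, str]] = [
    {"text": "You won the lottery! Wire a fee of $200 to claim your prize.",
     "answer": "POS"},
    {"text": "Your order #123456789 has arrived.", "answer": "NEG"},
]


def to_jsonl(examples: List[Dict[str, str]]) -> str:
    """Serialize examples as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(example) for example in examples)


jsonl = to_jsonl(spam_examples)
```

Writing `jsonl` to a `.jsonl` file would give an examples file that can then be referenced from the task's `examples` setting.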

REL

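The REL task extracts relations between named entities.

spacy.REL.v1

The built-in REL task supports both zero-shot and few-shot prompting. It relies on an upstream NER component for entity extraction.

Argument	Description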
`template`	Custom prompt template to send to LLM model. Defaults to `rel.v3.jinja`. str
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[RELTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `RELExample`. Optional[Type[FewshotExample]]
`scorer` (NEW)	Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. Optional[Scorer]
`labels`	List of labels or str of comma-separated list of labels. Union[List[str], str]
`label_definitions`	Dictionary providing a description for each relation label. Defaults to `None`. Optional[Dict[str, str]]
`normalizer`	Function that normalizes the labels as returned by the LLM. If `None`, falls back to `spacy.LowercaseNormalizer.v1`. Defaults to `None`. Optional[Callable[[str], str]]
`verbose`	If set to `True`, warnings will be generated when the LLM returns invalid responses. Defaults to `False`. bool

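Note: the REL task relies on pre-extracted entities to make its prediction. Hence, you'll need to add a component that populates doc.ents with recognized spans to your spaCy pipeline and put it before the REL component.

For a fully working example, see this usage example.

Lemma

The Lemma task lemmatizes the provided text and updates the lemma_ attribute in the doc's tokens accordingly.

spacy.Lemma.v1

This task supports both zero-shot and few-shot prompting.

Argument	Description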
`template`	Custom prompt template to send to LLM model. Defaults to lemma.v1.jinja. str
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[LemmaTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `LemmaExample`. Optional[Type[FewshotExample]]
`scorer` (NEW)	Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. Optional[Scorer]

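The task prompts the LLM to lemmatize the passed text and return the lemmatized version as a list of tokens and their corresponding lemma. E.g. the text I'm buying ice cream for my friends should invoke the response

If, for any given text/doc instance, the number of lemmas returned by the LLM doesn't match the number of tokens from the pipeline's tokenizer, no lemmas are stored in the corresponding doc's tokens. Otherwise the tokens' .lemma_ property is updated with the lemma suggested by the LLM.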
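The all-or-nothing rule above can be sketched as follows. The response layout (one `token: lemma` pair per line), the token list and the helper names `parse_lemma_response` and `lemmas_for` are assumptions made for this illustration, not the library's parser.

```python
from typing import List, Optional, Tuple


def parse_lemma_response(response: str) -> List[Tuple[str, str]]:
    """Parse lines of the assumed form `token: lemma` into (token, lemma) pairs."""
    pairs = []
    for line in response.strip().splitlines():
        token, sep, lemma = line.rpartition(": ")
        if sep:  # skip lines that do not contain the separator
            pairs.append((token.strip(), lemma.strip()))
    return pairs


def lemmas_for(tokens: List[str], response: str) -> Optional[List[str]]:
    """Return one lemma per token, or None if the counts do not line up."""
    pairs = parse_lemma_response(response)
    if len(pairs) != len(tokens):
        return None  # mismatch: leave the document's lemmas untouched
    return [lemma for _, lemma in pairs]


tokens = ["I", "'m", "buying", "ice", "cream", "for", "my", "friends"]
response = (
    "I: I\n'm: be\nbuying: buy\nice: ice\n"
    "cream: cream\nfor: for\nmy: my\nfriends: friend"
)
```

Here the response has eight pairs for eight tokens, so every token gets a lemma. A response with fewer or more lines would make `lemmas_for` return `None`, mirroring the case where no lemmas are stored.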

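Sentiment

Performs sentiment analysis on provided texts. Scores between 0 and 1 are stored in Doc._.sentiment - the higher, the more positive. Note that in case of parsing issues (e.g. unexpected LLM responses) the value might be None.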
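Why the stored value can be None is easiest to see in a parsing sketch. This is an assumed, simplified parser, not the one shipped with spacy-llm: it takes the first number in the response and rejects anything outside the 0 to 1 range. The function name `parse_sentiment` is hypothetical.

```python
import re
from typing import Optional


def parse_sentiment(response: str) -> Optional[float]:
    """Extract a score in [0, 1] from an LLM response, or None if that fails."""
    match = re.search(r"-?\d+(?:\.\d+)?", response)
    if match is None:
        return None  # no number at all, e.g. a free-text answer
    score = float(match.group())
    return score if 0.0 <= score <= 1.0 else None
```

Downstream code reading `Doc._.sentiment` should therefore check for `None` before comparing or averaging scores.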

spacy.Sentiment.v1

This task supports both zero-shot and few-shot prompting.

Argument	Description
`template`	Custom prompt template to send to LLM model. Defaults to sentiment.v1.jinja. str
`examples`	Optional function that generates examples for few-shot learning. Defaults to `None`. Optional[Callable[[], Iterable[Any]]]
`parse_responses` (NEW)	Callable for parsing LLM responses for this task. Defaults to the internal parsing method for this task. Optional[TaskResponseParser[SentimentTask]]
`prompt_example_type` (NEW)	Type to use for fewshot examples. Defaults to `SentimentExample`. Optional[Type[FewshotExample]]
`scorer` (NEW)	Scorer function that evaluates the task performance on provided examples. Defaults to the metric used by spaCy. Optional[Scorer]
`field`	Name of extension attribute to store the sentiment score in (i. e. the score will be available in `doc._.{field}`). Defaults to `sentiment`. str

NoOp

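This task is only useful for testing - it tells the LLM to do nothing, and does not set any fields on the docs.

spacy.NoOp.v1

This task needs no further configuration.

Models

A model defines which LLM model to query, and how to query it. It can be a simple function taking a collection of prompts (consistent with the output type of task.generate_prompts()) and returning a collection of responses (consistent with the expected input of parse_responses). Generally speaking, it's a function of type Callable[[Iterable[Iterable[Any]]], Iterable[Iterable[Any]]], but specific implementations can have other signatures, like Callable[[Iterable[Iterable[str]]], Iterable[Iterable[str]]].

Note: the model signature expects a nested iterable so it's able to deal with sharded docs. Unsharded docs (i.e. those produced by non-sharding tasks) are reshaped to fit the expected data structure.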
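A minimal sketch of a function with this signature, useful for reasoning about the nesting: the outer level corresponds to documents, the inner level to the prompts of each document's shards. `echo_model` and `nest_prompts` are hypothetical names, and the echo behaviour is a placeholder for an actual LLM call.

```python
from typing import Iterable, List


def echo_model(prompts: Iterable[Iterable[str]]) -> Iterable[Iterable[str]]:
    """Dummy model: one response per prompt, keeping the per-document nesting."""
    return [[f"echo: {prompt}" for prompt in doc_prompts] for doc_prompts in prompts]


def nest_prompts(prompts: Iterable[str]) -> List[List[str]]:
    """Reshape flat prompts from a non-sharding task: one single-prompt list per doc."""
    return [[prompt] for prompt in prompts]


responses = echo_model(nest_prompts(["Summarize doc 1", "Summarize doc 2"]))
```

A sharded document simply contributes an inner list with more than one prompt, e.g. `echo_model([["part 1", "part 2"], ["whole doc"]])`.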

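Models via REST API

These models all take the same parameters, but note that the config should contain provider-specific keys and values, as it will be passed onwards to the provider's API.

Argument	Description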
`name`	Model name, i. e. any supported variant for this particular model. Default depends on the specific model (cf. below) str
`config`	Further configuration passed on to the model. Default depends on the specific model (cf. below). Dict[Any, Any]
`strict`	If `True`, raises an error if the LLM API returns a malformed response. Otherwise, return the error responses as is. Defaults to `True`. bool
`max_tries`	Max. number of tries for API request. Defaults to `5`. int
`max_request_time`	Max. time (in seconds) to wait for request to terminate before raising an exception. Defaults to `30.0`. float
`interval`	Time interval (in seconds) between API retries. Defaults to `1.0`. float
`endpoint`	Endpoint URL. Defaults to the provider’s standard URL, if available (which is not the case for providers with exclusively custom deployments, such as Azure) Optional[str]

Currently, these models are provided as part of the core library:

Model	Provider	Supported names	Default name	Default config
`spacy.GPT-4.v1`	OpenAI	`["gpt-4", "gpt-4-0314", "gpt-4-32k", "gpt-4-32k-0314"]`	`"gpt-4"`	`{}`
`spacy.GPT-4.v2`	OpenAI	`["gpt-4", "gpt-4-0314", "gpt-4-32k", "gpt-4-32k-0314"]`	`"gpt-4"`	`{temperature=0.0}`
`spacy.GPT-4.v3`	OpenAI	All names of GPT-4 models offered by OpenAI	`"gpt-4"`	`{temperature=0.0}`
`spacy.GPT-3-5.v1`	OpenAI	`["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-3.5-turbo-0613", "gpt-3.5-turbo-0613-16k", "gpt-3.5-turbo-instruct"]`	`"gpt-3.5-turbo"`	`{}`
`spacy.GPT-3-5.v2`	OpenAI	`["gpt-3.5-turbo", "gpt-3.5-turbo-16k", "gpt-3.5-turbo-0613", "gpt-3.5-turbo-0613-16k", "gpt-3.5-turbo-instruct"]`	`"gpt-3.5-turbo"`	`{temperature=0.0}`
`spacy.GPT-3-5.v3`	OpenAI	All names of GPT-3.5 models offered by OpenAI	`"gpt-3.5-turbo"`	`{temperature=0.0}`
`spacy.Davinci.v1`	OpenAI	`["davinci"]`	`"davinci"`	`{}`
`spacy.Davinci.v2`	OpenAI	`["davinci"]`	`"davinci"`	`{temperature=0.0, max_tokens=500}`
`spacy.Text-Davinci.v1`	OpenAI	`["text-davinci-003", "text-davinci-002"]`	`"text-davinci-003"`	`{}`
`spacy.Text-Davinci.v2`	OpenAI	`["text-davinci-003", "text-davinci-002"]`	`"text-davinci-003"`	`{temperature=0.0, max_tokens=1000}`
`spacy.Code-Davinci.v1`	OpenAI	`["code-davinci-002"]`	`"code-davinci-002"`	`{}`
`spacy.Code-Davinci.v2`	OpenAI	`["code-davinci-002"]`	`"code-davinci-002"`	`{temperature=0.0, max_tokens=500}`
`spacy.Curie.v1`	OpenAI	`["curie"]`	`"curie"`	`{}`
`spacy.Curie.v2`	OpenAI	`["curie"]`	`"curie"`	`{temperature=0.0, max_tokens=500}`
`spacy.Text-Curie.v1`	OpenAI	`["text-curie-001"]`	`"text-curie-001"`	`{}`
`spacy.Text-Curie.v2`	OpenAI	`["text-curie-001"]`	`"text-curie-001"`	`{temperature=0.0, max_tokens=500}`
`spacy.Babbage.v1`	OpenAI	`["babbage"]`	`"babbage"`	`{}`
`spacy.Babbage.v2`	OpenAI	`["babbage"]`	`"babbage"`	`{temperature=0.0, max_tokens=500}`
`spacy.Text-Babbage.v1`	OpenAI	`["text-babbage-001"]`	`"text-babbage-001"`	`{}`
`spacy.Text-Babbage.v2`	OpenAI	`["text-babbage-001"]`	`"text-babbage-001"`	`{temperature=0.0, max_tokens=500}`
`spacy.Ada.v1`	OpenAI	`["ada"]`	`"ada"`	`{}`
`spacy.Ada.v2`	OpenAI	`["ada"]`	`"ada"`	`{temperature=0.0, max_tokens=500}`
`spacy.Text-Ada.v1`	OpenAI	`["text-ada-001"]`	`"text-ada-001"`	`{}`
`spacy.Text-Ada.v2`	OpenAI	`["text-ada-001"]`	`"text-ada-001"`	`{temperature=0.0, max_tokens=500}`
`spacy.Azure.v1`	Microsoft, OpenAI	Arbitrary values	No default	`{temperature=0.0}`
`spacy.Command.v1`	Cohere	`["command", "command-light", "command-light-nightly", "command-nightly"]`	`"command"`	`{}`
`spacy.Claude-2-1.v1`	Anthropic	`["claude-2-1"]`	`"claude-2-1"`	`{}`
`spacy.Claude-2.v1`	Anthropic	`["claude-2", "claude-2-100k"]`	`"claude-2"`	`{}`
`spacy.Claude-1.v1`	Anthropic	`["claude-1", "claude-1-100k"]`	`"claude-1"`	`{}`
`spacy.Claude-1-0.v1`	Anthropic	`["claude-1.0"]`	`"claude-1.0"`	`{}`
`spacy.Claude-1-2.v1`	Anthropic	`["claude-1.2"]`	`"claude-1.2"`	`{}`
`spacy.Claude-1-3.v1`	Anthropic	`["claude-1.3", "claude-1.3-100k"]`	`"claude-1.3"`	`{}`
`spacy.Claude-instant-1.v1`	Anthropic	`["claude-instant-1", "claude-instant-1-100k"]`	`"claude-instant-1"`	`{}`
`spacy.Claude-instant-1-1.v1`	Anthropic	`["claude-instant-1.1", "claude-instant-1.1-100k"]`	`"claude-instant-1.1"`	`{}`
`spacy.PaLM.v1`	Google	`["chat-bison-001", "text-bison-001"]`	`"text-bison-001"`	`{temperature=0.0}`

To use these models, make sure that you have set the relevant API keys as environment variables.

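⚠️ A note on spacy.Azure.v1. Working with Azure OpenAI is slightly different from working with models from other providers:

In Azure, LLMs have to be made available by creating a deployment of a given model (e.g. GPT-3.5). This deployment can have an arbitrary name. The name argument, which everywhere else denotes the model name (e.g. claude-1.0, gpt-3.5), here refers to the deployment name.
Deployed Azure OpenAI models are reachable via a resource-specific base URL, usually of the form https://{resource}.openai.azure.com. This URL therefore has to be specified via the base_url argument.
Azure further expects the API version to be specified. The default value, set via the api_version argument, is currently 2023-05-15 but may be updated in the future.
Finally, since nothing about the model can be inferred from the deployment name, spacy-llm requires model_type to be set to either "completions" or "chat", depending on whether the deployed model is a completion or a chat model.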
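
The four points above can be gathered into one model block. What follows is a minimal sketch rather than an authoritative reference: it assumes the arguments are named name, base_url, api_version and model_type exactly as described above (check the signature of your installed spacy-llm version), and the deployment and resource names are hypothetical placeholders. It only builds and validates the settings as a plain dictionary and never contacts Azure.

```python
from typing import Any, Dict

# Allowed values for model_type, as stated in the note above.
VALID_MODEL_TYPES = {"completions", "chat"}


def azure_model_config(
    deployment: str,
    resource: str,
    model_type: str,
    api_version: str = "2023-05-15",
) -> Dict[str, Any]:
    """Build the settings of a [components.llm.model] block for spacy.Azure.v1."""
    if model_type not in VALID_MODEL_TYPES:
        raise ValueError(f"model_type must be one of {sorted(VALID_MODEL_TYPES)}")
    return {
        "@llm_models": "spacy.Azure.v1",
        # For Azure, name is the deployment name, not the model name.
        "name": deployment,
        "base_url": f"https://{resource}.openai.azure.com",
        "api_version": api_version,
        "model_type": model_type,
    }


# "my-gpt35-deployment" and "my-resource" are hypothetical placeholders.
azure_cfg = azure_model_config("my-gpt35-deployment", "my-resource", "chat")
print(azure_cfg)
```

Validating model_type up front mirrors the requirement that it must be stated explicitly, since it cannot be derived from the deployment name.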

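API Keys

Note that when using hosted services, you have to ensure that the proper API keys are set as environment variables, as described in the corresponding provider's documentation.

For example, when using OpenAI, you have to get an API key from openai.com and ensure that the key is set as an environment variable. The same applies to Cohere, Anthropic and PaLM: each provider's key has to be set in the environment before the pipeline runs.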
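
As a rough illustration, a pipeline script can check for the required keys before any request is made. This sketch makes an assumption that the text above does not confirm: the variable names OPENAI_API_KEY, CO_API_KEY, ANTHROPIC_API_KEY and PALM_API_KEY. Verify them against the provider's and spacy-llm's documentation. The helper missing_api_keys is hypothetical and not part of spacy-llm.

```python
import os
from typing import Dict, List, Mapping

# Assumed environment variable names per provider; verify before relying on them.
ASSUMED_KEY_NAMES: Dict[str, str] = {
    "OpenAI": "OPENAI_API_KEY",
    "Cohere": "CO_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "PaLM": "PALM_API_KEY",
}


def missing_api_keys(
    providers: List[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the variable names that are unset or empty for the given providers."""
    return [
        ASSUMED_KEY_NAMES[provider]
        for provider in providers
        if not env.get(ASSUMED_KEY_NAMES[provider])
    ]


# Report which keys would still have to be exported for an OpenAI-backed pipeline.
print(missing_api_keys(["OpenAI"]))
```

Passing the environment as an argument keeps the check easy to exercise without touching the real process environment.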

Models via HuggingFace

These models all take the same parameters:

Argument	Description
`name`	Model name, i.e. any supported variant for this particular model. str
`config_init`	Further configuration passed on to the construction of the model with `transformers.pipeline()`. Defaults to `{}`. Dict[str, Any]
`config_run`	Further configuration used during model inference. Defaults to `{}`. Dict[str, Any]

Currently, these models are provided as part of the core library:

Model	Provider	Supported names	HF directory
`spacy.Dolly.v1`	Databricks	`["dolly-v2-3b", "dolly-v2-7b", "dolly-v2-12b"]`	https://huggingface.co/databricks
`spacy.Falcon.v1`	TII	`["falcon-rw-1b", "falcon-7b", "falcon-7b-instruct", "falcon-40b-instruct"]`	https://huggingface.co/tiiuae
`spacy.Llama2.v1`	Meta AI	`["Llama-2-7b-hf", "Llama-2-13b-hf", "Llama-2-70b-hf"]`	https://huggingface.co/meta-llama
`spacy.Mistral.v1`	Mistral AI	`["Mistral-7B-v0.1", "Mistral-7B-Instruct-v0.1"]`	https://huggingface.co/mistralai
`spacy.StableLM.v1`	Stability AI	`["stablelm-base-alpha-3b", "stablelm-base-alpha-7b", "stablelm-tuned-alpha-3b", "stablelm-tuned-alpha-7b"]`	https://huggingface.co/stabilityai
`spacy.OpenLLaMA.v1`	OpenLM Research	`["open_llama_3b", "open_llama_7b", "open_llama_7b_v2", "open_llama_13b"]`	https://huggingface.co/openlm-research

Note that Hugging Face will download the model the first time you use it. You can define the cache directory by setting the environment variable HF_HOME.
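
A minimal sketch of setting the cache directory from within Python follows. The directory name hf_cache is a hypothetical placeholder, and the sketch assumes the variable is set before any Hugging Face library is imported, since such settings are typically read at import time.

```python
import os
from pathlib import Path

# Hypothetical cache location; any writable directory works.
cache_dir = Path.home() / "hf_cache"

# Set the variable before importing transformers so the library can pick it up.
os.environ["HF_HOME"] = str(cache_dir)
print(os.environ["HF_HOME"])
```

Exporting the variable in the shell that launches the pipeline achieves the same effect without any code changes.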

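Installation with HuggingFace

To use models from HuggingFace, ideally you have a GPU enabled and have installed transformers, torch and CUDA in your virtual environment. This allows you to use the setting device=cuda:0 in your config, which ensures that the model is loaded entirely on the GPU (and fails otherwise).

You can do so by installing spacy-llm with its transformers extra, e.g. python -m pip install "spacy-llm[transformers]" "transformers[sentencepiece]".

If you don't have access to a GPU, you can install accelerate and set device_map=auto instead, but be aware that this may result in some layers being distributed to the CPU or even the hard drive, which may ultimately result in extremely slow queries.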
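
The two device settings can be sketched as config fragments. This is an illustration under stated assumptions, not a verified configuration: it assumes that device and device_map are forwarded through config_init (the argument table above says config_init is passed on to transformers.pipeline()), and it parses the text with Python's standard configparser purely as a sanity check. spaCy itself reads configs with its own config system.

```python
from configparser import ConfigParser
from typing import Dict

# Model block for a HuggingFace-backed model that is loaded fully onto the GPU.
# Assumption: device is passed via config_init.
GPU_CONFIG = """
[components.llm.model]
@llm_models = "spacy.Dolly.v1"
name = "dolly-v2-3b"

[components.llm.model.config_init]
device = "cuda:0"
"""

# Variant without a GPU: let accelerate place the layers automatically.
CPU_FALLBACK_CONFIG = GPU_CONFIG.replace('device = "cuda:0"', 'device_map = "auto"')


def read_section(config_text: str, section: str) -> Dict[str, str]:
    """Parse the config text and return one section as a plain dict."""
    parser = ConfigParser(interpolation=None)
    parser.optionxform = str  # keep keys case-sensitive
    parser.read_string(config_text)
    return dict(parser[section])


print(read_section(GPU_CONFIG, "components.llm.model.config_init"))
print(read_section(CPU_FALLBACK_CONFIG, "components.llm.model.config_init"))
```

With the first variant, loading fails if the model does not fit on the GPU. The second variant trades that hard failure for potentially very slow inference.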

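LangChain models

To use LangChain for the API retrieval part, make sure you have installed it first.

Note that LangChain currently only supports Python 3.9 and beyond.

LangChain models in spacy-llm work slightly differently. langchain's models are parsed automatically, and each LLM class in langchain has one entry in spacy-llm's registry. Because langchain is designed with one class per API rather than per model, this results in registry entries like langchain.OpenAI.v1. In other words, there is one registry entry per API and not per model (family), unlike the REST- and HuggingFace-based entries.

The name of the model to be used has to be passed in via the name attribute.

Argument	Description
`name`	The name of a model supported by LangChain for this API. str
`config`	Configuration passed on to the LangChain model. Defaults to `{}`. Dict[Any, Any]
`query`	Function that executes the prompts. If `None`, defaults to `spacy.CallLangChain.v1`. Optional[Callable[["langchain.llms.BaseLLM", Iterable[Any]], Iterable[Any]]]

The default query (spacy.CallLangChain.v1) executes the prompts by running model(text) for each given textual prompt.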
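
The behavior described for the default query can be sketched in a few lines. This is not the actual implementation of spacy.CallLangChain.v1, only an illustration of the pattern "call the model once per prompt". The function call_model_per_prompt and the stand-in fake_llm are hypothetical names introduced here.

```python
from typing import Any, Callable, Iterable, List


def call_model_per_prompt(
    model: Callable[[str], Any], prompts: Iterable[str]
) -> List[Any]:
    """Run the model once per prompt and collect the responses in order."""
    return [model(prompt) for prompt in prompts]


# Stand-in for a LangChain LLM object: any callable that takes a prompt string.
def fake_llm(prompt: str) -> str:
    return f"response to: {prompt}"


responses = call_model_per_prompt(fake_llm, ["Hello", "World"])
print(responses)
```

A custom query function with the same shape could add batching or retries around the per-prompt call.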

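Cache

Interacting with LLMs, either through an external API or a local instance, is costly. Since developing an NLP pipeline generally involves a lot of exploration and prototyping, spacy-llm implements a built-in cache that keeps batches of documents stored on disk, which avoids reprocessing the same documents at each run.

Argument	Description
`path`	Cache directory. If `None`, no caching is performed, and this component will act as a NoOp. Defaults to `None`. Optional[Union[str,Path]]
`batch_size`	Number of docs in one batch (file). Once a batch is full, it will be persisted to disk. Defaults to 64. int
`max_batches_in_mem`	Max. number of batches to hold in memory. Allows you to limit the effect on your memory if you’re handling a lot of docs. Defaults to 4. int

When retrieving a document, the BatchCache will first determine which batch the document belongs to. If the batch isn't in memory, it will try to load the batch from disk and then move it into memory.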
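
The lookup logic can be illustrated with a toy cache. This is a simplified sketch and not spacy-llm's BatchCache: it stores plain strings as JSON instead of serialized docs, uses caller-supplied keys, and assumes max_batches_in_mem is at least 1. The class name ToyBatchCache is hypothetical.

```python
import json
import tempfile
from collections import OrderedDict
from pathlib import Path
from typing import Dict, Optional


class ToyBatchCache:
    """Toy batch cache: groups entries into batches, one JSON file per batch."""

    def __init__(self, path: Path, batch_size: int = 64, max_batches_in_mem: int = 4):
        self.path = path
        self.batch_size = batch_size
        self.max_batches_in_mem = max_batches_in_mem
        self._index: Dict[str, int] = {}      # entry key -> batch id
        self._loaded: OrderedDict = OrderedDict()  # batch id -> batch held in memory
        self._current: Dict[str, str] = {}    # batch that is still being filled
        self._next_batch_id = 0

    def add(self, key: str, value: str) -> None:
        self._current[key] = value
        if len(self._current) == self.batch_size:
            # The batch is full: persist it to disk and start a new one.
            batch_file = self.path / f"{self._next_batch_id}.json"
            batch_file.write_text(json.dumps(self._current), encoding="utf-8")
            for entry_key in self._current:
                self._index[entry_key] = self._next_batch_id
            self._current = {}
            self._next_batch_id += 1

    def get(self, key: str) -> Optional[str]:
        if key in self._current:
            return self._current[key]
        # First determine which batch the entry belongs to.
        batch_id = self._index.get(key)
        if batch_id is None:
            return None
        if batch_id not in self._loaded:
            # Not in memory: evict the oldest batch if needed, then load from disk.
            if self._loaded and len(self._loaded) >= self.max_batches_in_mem:
                self._loaded.popitem(last=False)
            batch_file = self.path / f"{batch_id}.json"
            self._loaded[batch_id] = json.loads(batch_file.read_text(encoding="utf-8"))
        return self._loaded[batch_id][key]


with tempfile.TemporaryDirectory() as tmp:
    cache = ToyBatchCache(Path(tmp), batch_size=2, max_batches_in_mem=1)
    for i in range(5):
        cache.add(f"doc{i}", f"result{i}")
    files_on_disk = sorted(p.name for p in Path(tmp).iterdir())
    first = cache.get("doc0")    # loads batch file 0.json into memory
    third = cache.get("doc3")    # evicts batch 0, then loads 1.json
    pending = cache.get("doc4")  # still in the unfinished batch
    unknown = cache.get("doc9")  # never added
```

Smaller values of max_batches_in_mem bound memory use at the cost of more disk reads, which is the trade-off the max_batches_in_mem argument exposes.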

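Note that since the cache is generated by a registered function, you can also provide your own registered function that returns your own cache implementation. If you wish to do so, ensure that your cache object adheres to the Protocol defined in spacy_llm.ty.Cache.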

Various functions

spacy.FewShotReader.v1

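This function is registered in spaCy's misc registry and reads in examples from a .yml, .yaml, .json or .jsonl file. It uses srsly to read in these files and parses them depending on the file extension.

Argument	Description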
`path`	Path to an examples file with suffix `.yml`, `.yaml`, `.json` or `.jsonl`. Union[str,Path]
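
The extension-based dispatch can be sketched with the standard library. This is an illustration only and not the registered function: the real reader uses srsly and also supports YAML, whereas this sketch handles only .json and .jsonl to stay dependency-free. The function read_examples and the example records are hypothetical.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, List


def read_examples(path: Path) -> List[Any]:
    """Read few-shot examples, choosing the parser from the file suffix."""
    suffix = path.suffix.lower()
    text = path.read_text(encoding="utf-8")
    if suffix == ".json":
        data = json.loads(text)
        return data if isinstance(data, list) else [data]
    if suffix == ".jsonl":
        # One JSON object per non-empty line.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    raise ValueError(f"Unsupported suffix in this sketch: {suffix}")


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical example records; the expected fields depend on the task.
    json_file = Path(tmp) / "examples.json"
    json_file.write_text('[{"text": "I love it", "answer": "POS"}]', encoding="utf-8")
    jsonl_file = Path(tmp) / "examples.jsonl"
    jsonl_file.write_text('{"text": "a"}\n{"text": "b"}\n', encoding="utf-8")
    from_json = read_examples(json_file)
    from_jsonl = read_examples(jsonl_file)
```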

spacy.FileReader.v1

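This function is registered in spaCy's misc registry. It reads the file provided to path and returns a str representation of its contents. It is typically used to read Jinja files containing the prompt template.

Argument	Description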
`path`	Path to the file to be read. Union[str,Path]
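
In plain Python, the described behavior amounts to reading a file into a string. The sketch below is an assumption-level illustration and not the registered function itself. The name read_file and the template text are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Union


def read_file(path: Union[str, Path]) -> str:
    """Return the contents of the file at path as a string."""
    return Path(path).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical Jinja prompt template written to a temporary file.
    template_path = Path(tmp) / "template.jinja"
    template_path.write_text("Classify the text: {{ text }}", encoding="utf-8")
    template = read_file(template_path)
```

The template is returned unrendered. Filling in the placeholders is left to whatever consumes the string.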

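Normalizer functions

These functions provide simple normalizations for string comparisons, e.g. between a list of specified labels and a label given in the raw text of the LLM response. They are registered in spaCy's misc registry and have the signature Callable[[str], str].

spacy.StripNormalizer.v1: only applies text.strip()
spacy.LowercaseNormalizer.v1: applies text.strip().lower() to compare strings in a case-insensitive way.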
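
The effect of the two normalizers on label matching can be shown with plain functions. This is a sketch of the described behavior and not the registered implementations. The helper match_label and the label set are hypothetical.

```python
from typing import Callable, Dict, Iterable, Optional


def strip_normalizer(text: str) -> str:
    """Mirror of the described strip-only normalization."""
    return text.strip()


def lowercase_normalizer(text: str) -> str:
    """Mirror of the described case-insensitive normalization."""
    return text.strip().lower()


def match_label(
    response: str, labels: Iterable[str], normalize: Callable[[str], str]
) -> Optional[str]:
    """Map a raw LLM response onto one of the specified labels, if any matches."""
    lookup: Dict[str, str] = {normalize(label): label for label in labels}
    return lookup.get(normalize(response))


labels = ["PERSON", "ORG"]
strict = match_label(" person \n", labels, strip_normalizer)       # no match: case differs
relaxed = match_label(" person \n", labels, lowercase_normalizer)  # matches "PERSON"
```

Normalizing both the labels and the response with the same function is what makes the comparison symmetric.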
