Transformer

class, v3

String name: transformer

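Pipeline component for multi-task learning with transformer models

This pipeline component lets you use transformer models in your pipeline. It supports all models that are available via the HuggingFace transformers library. Usually you will connect subsequent components to the shared transformer using the `TransformerListener` layer. This works similarly to spaCy's `Tok2Vec` component and `Tok2VecListener` sublayer.

The component assigns the output of the transformer to the `Doc`'s extension attributes. We also calculate an alignment between the word-piece tokens and the spaCy tokenization, so that the last hidden states can be used to set the `Doc.tensor` attribute. When multiple word-piece tokens align to the same spaCy token, the spaCy token receives the sum of their values. To access the values, use the custom `Doc._.trf_data` attribute. The package also adds the function registries `@span_getters` and `@annotation_setters`, with several built-in registered functions. For more details, see the usage documentation.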
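The summing rule for aligned word pieces can be illustrated with a small, library-free sketch. It assumes the alignment is given as a list of per-token lengths plus a flat list of word-piece row indices, which mirrors the ragged layout described for `align` but is not the package's own code. The function name `sum_aligned_rows` and the toy data are hypothetical.

```python
from typing import List


def sum_aligned_rows(
    wordpiece_rows: List[List[float]],
    lengths: List[int],
    indices: List[int],
) -> List[List[float]]:
    """Sum the word-piece rows aligned to each token.

    lengths[i] is the number of word pieces token i aligns to; indices is
    the flat list of word-piece row numbers, token after token.
    """
    width = len(wordpiece_rows[0])
    token_rows = []
    offset = 0
    for n in lengths:
        total = [0.0] * width
        for wp in indices[offset:offset + n]:
            for j, value in enumerate(wordpiece_rows[wp]):
                total[j] += value
        token_rows.append(total)
        offset += n
    return token_rows


# Toy data: four word-piece rows of width 2, aligned to three tokens.
wordpieces = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.5, 0.5]]
lengths = [1, 2, 1]  # the second token is split into two word pieces
indices = [0, 1, 2, 3]
token_vectors = sum_aligned_rows(wordpieces, lengths, indices)
```

Here the second token receives the element-wise sum of word-piece rows 1 and 2, while the other tokens simply copy their single aligned row.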

Assigned Attributes

The component sets the following custom extension attribute:

Location	Value
`Doc._.trf_data`	Transformer tokens and outputs for the `Doc` object. TransformerData

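Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the `config` argument on `nlp.add_pipe` or in your `config.cfg` for training. See the model architectures documentation for details on the transformer architectures and their arguments and hyperparameters.

Setting	Description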
`max_batch_items`	Maximum size of a padded batch. Defaults to `4096`. int
`set_extra_annotations`	Function that takes a batch of `Doc` objects and transformer outputs to set additional annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. Defaults to `null_annotation_setter` (no additional annotations). Callable[[List[Doc],FullTransformerBatch], None]
`model`	The Thinc `Model` wrapping the transformer. Defaults to TransformerModel. Model[List[Doc],FullTransformerBatch]
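As a minimal sketch of such an override, the settings from the table can be written as a plain Python dictionary. The values simply repeat the listed defaults, and the variable name `transformer_config` is an assumption for illustration; the `model` setting is left out so that the factory default applies.

```python
# Values repeat the defaults listed in the table above.
transformer_config = {
    "max_batch_items": 4096,
    "set_extra_annotations": {
        # Registered function name taken from the built-in annotation setter.
        "@annotation_setters": "spacy-transformers.null_annotation_setter.v1"
    },
}

# With spaCy and spacy-transformers installed, the override would be passed as:
#   nlp.add_pipe("transformer", config=transformer_config)
```

Only the keys you want to change need to appear in the dictionary.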

explosion/spacy-transformers/master/spacy_transformers/pipeline_component.py

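Transformer.__init__ method

Construct a `Transformer` component. One or more subsequent spaCy components can use the transformer outputs as features in their models, with gradients backpropagated to the single shared weights. The activations from the transformer are saved in the `Doc._.trf_data` extension attribute. You can also provide a callback to set additional annotations. In your application, you would normally use a shortcut for this and instantiate the component using its string name and `nlp.add_pipe`.

Name	Description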
`vocab`	The shared vocabulary. Vocab
`model`	The Thinc `Model` wrapping the transformer. Usually you will want to use the TransformerModel layer for this. Model[List[Doc],FullTransformerBatch]
`set_extra_annotations`	Function that takes a batch of `Doc` objects and transformer outputs and stores the annotations on the `Doc`. The `Doc._.trf_data` attribute is set prior to calling the callback. By default, no additional annotations are set. Callable[[List[Doc],FullTransformerBatch], None]
keyword-only
`name`	String name of the component instance. Used to add entries to the `losses` during training. str
`max_batch_items`	Maximum size of a padded batch. Defaults to `128*32`. int

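Transformer.__call__ method

Apply the pipe to one document. The document is modified in place and returned. This usually happens under the hood when the `nlp` object is called on a text and all pipeline components are applied to the `Doc` in order. Both `__call__` and `pipe` delegate to the `predict` and `set_annotations` methods.

Name	Description
`doc`	The document to process. Doc
RETURNS	The processed document. Doc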

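Transformer.pipe method

Apply the pipe to a stream of documents. This usually happens under the hood when the `nlp` object is called on a text and all pipeline components are applied to the `Doc` in order. Both `__call__` and `pipe` delegate to the `predict` and `set_annotations` methods.

Name	Description
`stream`	A stream of documents. Iterable[Doc]
keyword-only
`batch_size`	The number of documents to buffer. Defaults to `128`. int
YIELDS	The processed documents in order. Doc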
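The delegation pattern shared by `__call__` and `pipe` can be sketched without spaCy. The class below is a hypothetical stand-in that uses plain dictionaries instead of `Doc` objects; it is meant to show the control flow only, not the component's real implementation.

```python
from itertools import islice
from typing import Iterable, Iterator, List


class ToyComponent:
    """Hypothetical stand-in for a pipeline component; docs are plain dicts."""

    def predict(self, docs: List[dict]) -> List[int]:
        # Compute outputs without touching the docs.
        return [len(doc["text"]) for doc in docs]

    def set_annotations(self, docs: List[dict], scores: List[int]) -> None:
        # Write the outputs onto the docs.
        for doc, score in zip(docs, scores):
            doc["n_chars"] = score

    def __call__(self, doc: dict) -> dict:
        self.set_annotations([doc], self.predict([doc]))
        return doc

    def pipe(self, stream: Iterable[dict], *, batch_size: int = 128) -> Iterator[dict]:
        stream = iter(stream)
        while True:
            # Buffer up to batch_size docs, then process them together.
            batch = list(islice(stream, batch_size))
            if not batch:
                break
            self.set_annotations(batch, self.predict(batch))
            yield from batch


component = ToyComponent()
texts = [{"text": "a"}, {"text": "bcd"}, {"text": "ef"}]
processed = list(component.pipe(texts, batch_size=2))
```

Keeping `predict` free of side effects and concentrating all writes in `set_annotations` is what lets both entry points share the same two steps.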

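Transformer.initialize method

Initialize the component for training and return an `Optimizer`. `get_examples` should be a function that returns an iterable of `Example` objects. At least one example should be supplied. The data examples are used to initialize the model of the component and can either be the full training data or a representative sample. Initialization includes validating the network, inferring missing shapes and setting up the label scheme based on the data. This method is typically called by `Language.initialize`.

Name	Description
`get_examples`	Function that returns gold-standard annotations in the form of `Example` objects. Must contain at least one `Example`. Callable[[], Iterable[Example]]
keyword-only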
`nlp`	The current `nlp` object. Defaults to `None`. Optional[Language]

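Transformer.predict method

Apply the component's model to a batch of `Doc` objects without modifying them.

Name	Description
`docs`	The documents to predict. Iterable[Doc]
RETURNS	The model's prediction for each document.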

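Transformer.set_annotations method

Assign the extracted features to the `Doc` objects. By default, the `TransformerData` object is written to the `Doc._.trf_data` attribute. If a `set_extra_annotations` callback was provided, it is called afterwards.

Name	Description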
`docs`	The documents to modify. Iterable[Doc]
`scores`	The scores to set, produced by `Transformer.predict`.

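Transformer.update method

Prepare for an update to the transformer. Like the `Tok2Vec` component, the `Transformer` component is unusual in that it does not receive "gold standard" annotations to calculate a weight update. The optimal output of the transformer data is unknown: it is a hidden layer inside the network that is updated by backpropagating from the output layers.

The `Transformer` component therefore does not perform a weight update during its own `update` method. Instead, it runs its transformer model and passes the output and the backpropagation callback to any downstream components that have been connected to it via the `TransformerListener` sublayer. If there are multiple listeners, the last layer will actually backpropagate to the transformer and call the optimizer, while the others simply increment the gradients.

Name	Description
`examples`	A batch of `Example` objects. Only the `Example.predicted` `Doc` object is used, the reference `Doc` is ignored. Iterable[Example]
keyword-only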
`drop`	The dropout rate. float
`sgd`	An optimizer. Will be created via `create_optimizer` if not set. Optional[Optimizer]
`losses`	Optional record of the loss during training. Updated using the component name as the key. Optional[Dict[str, float]]
RETURNS	The updated `losses` dictionary. Dict[str, float]
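The hand-off to multiple listeners described for `update` can be pictured with a toy sketch. It assumes scalar gradients and a known number of listeners; the class `ToySharedLayer` and its attributes are hypothetical and stand in for the shared transformer, not for the actual listener code.

```python
from typing import Callable, List, Optional


class ToySharedLayer:
    """Hypothetical shared layer with a fixed number of downstream listeners."""

    def __init__(self, n_listeners: int) -> None:
        self.n_listeners = n_listeners
        self.pending: List[float] = []  # gradients received so far
        self.backprop_calls = 0
        self.last_total: Optional[float] = None

    def update(self) -> Callable[[float], None]:
        # Run the "forward pass" and hand a callback to the listeners.
        self.pending = []

        def receive_gradient(d_output: float) -> None:
            # Every listener adds its gradient to the accumulator.
            self.pending.append(d_output)
            if len(self.pending) == self.n_listeners:
                # Only the last listener triggers the single backward pass.
                self.last_total = sum(self.pending)
                self.backprop_calls += 1

        return receive_gradient


shared = ToySharedLayer(n_listeners=3)
callback = shared.update()
for grad in (0.5, 0.25, 0.25):
    callback(grad)
```

After the loop, the shared layer has run one backward pass over the accumulated gradient rather than three separate ones.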

Transformer.create_optimizer method

Create an optimizer for the pipeline component.

Name	Description
RETURNS	The optimizer. Optimizer

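Transformer.use_params method, context manager

Modify the pipe's model to use the given parameter values. At the end of the context, the original parameters are restored.

Name	Description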
`params`	The parameter values to use in the model. dict
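The swap-and-restore behavior of such a context manager can be sketched in a few lines. `ToyModel` and its parameter names are hypothetical; the sketch only shows the general pattern of temporarily replacing values and restoring them on exit.

```python
from contextlib import contextmanager
from typing import Dict, Iterator


class ToyModel:
    """Hypothetical model that stores its weights in a plain dict."""

    def __init__(self) -> None:
        self.params: Dict[str, float] = {"W": 1.0, "b": 0.0}

    @contextmanager
    def use_params(self, params: Dict[str, float]) -> Iterator[None]:
        backup = dict(self.params)
        self.params.update(params)
        try:
            yield
        finally:
            # Restore the original values even if the body raises.
            self.params = backup


model = ToyModel()
with model.use_params({"W": 0.9}):
    inside = model.params["W"]
after = model.params["W"]
```

A typical reason to use this pattern is to evaluate or save a model with a temporary set of weights without losing the ones being trained.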

Transformer.to_disk method

Serialize the pipe to disk.

Name	Description
`path`	A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]

Transformer.from_disk method

Load the pipe from disk. Modifies the object in place and returns it.

Name	Description
`path`	A path to a directory. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The modified `Transformer` object. Transformer

Transformer.to_bytes method

Serialize the pipe to a bytestring.

Name	Description
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The serialized form of the `Transformer` object. bytes

Transformer.from_bytes method

Load the pipe from a bytestring. Modifies the object in place and returns it.

Name	Description
`bytes_data`	The data to load from. bytes
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The `Transformer` object. Transformer

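Serialization fields

During serialization, spaCy exports several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the `exclude` argument.

Name	Description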
`vocab`	The shared `Vocab`.
`cfg`	The config file. You usually don’t want to exclude this.
`model`	The binary model data. You usually don’t want to exclude this.
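The effect of `exclude` can be shown with a toy serializer over the three field names in the table. The function `toy_to_bytes` and the sample field contents are hypothetical, and `pickle` is used here only as a convenient stand-in for the real serialization format.

```python
import pickle
from typing import Dict, Iterable


def toy_to_bytes(fields: Dict[str, object], *, exclude: Iterable[str] = ()) -> bytes:
    """Serialize every field whose name is not listed in `exclude`."""
    skipped = set(exclude)
    kept = {name: value for name, value in fields.items() if name not in skipped}
    return pickle.dumps(kept)


fields = {
    "vocab": ["the", "a"],
    "cfg": {"max_batch_items": 4096},
    "model": b"\x00\x01",
}
restored = pickle.loads(toy_to_bytes(fields, exclude=["vocab"]))
```

Excluding `vocab` is the common case, since the vocabulary is shared with the rest of the pipeline; dropping `cfg` or `model` would leave an object that cannot be restored on its own.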

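TransformerData dataclass

Transformer tokens and outputs for one `Doc` object. The transformer models return tensors that refer to a whole padded batch of documents. These tensors are wrapped into the `FullTransformerBatch` object. The `FullTransformerBatch` then splits out the per-document data, which is handled by this class. Instances of this class are typically assigned to the `Doc._.trf_data` extension attribute.

Name	Description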
`tokens`	A slice of the tokens data produced by the tokenizer. This may have several fields, including the token IDs, the texts and the attention mask. See the `transformers.BatchEncoding` object for details. dict
`model_output`	The model output from the transformer model, determined by the model and transformer config. New in `spacy-transformers` v1.1.0. transformers.file_utils.ModelOutput
`tensors`	The `model_output` in the earlier `transformers` tuple format converted using `ModelOutput.to_tuple()`. Returns `Tuple` instead of `List` as of `spacy-transformers` v1.1.0. Tuple[Union[FloatsXd, List[FloatsXd]]]
`align`	Alignment from the `Doc`’s tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. Ragged
`width`	The width of the last hidden layer. int

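TransformerData.empty classmethod

Create an empty `TransformerData` container.

Name	Description
RETURNS	The container. TransformerData

In `spacy-transformers` v1.0, the model output is stored in `TransformerData.tensors` as `List[Union[FloatsXd]]` and only includes the activations for the `Doc` from the transformer. Usually the last 3-dimensional tensor is the most important, as it provides the final hidden state. Activations that are 2-dimensional are generally attention weights. The details of this variable depend on the underlying transformer model.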

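FullTransformerBatch dataclass

Holds a batch of input and output objects for a transformer model. The data can then be split into a list of `TransformerData` objects to associate the outputs with each `Doc` in the batch.

Name	Description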
`spans`	The batch of input spans. The outer list refers to the `Doc` objects in the batch, and the inner list contains the spans for that `Doc`. Note that spans are allowed to overlap or exclude tokens, but each `Span` can only refer to one `Doc` (by definition). This means that within a `Doc`, the regions of the output tensors that correspond to each `Span` may overlap or have gaps, but for each `Doc`, there is a non-overlapping contiguous slice of the outputs. List[List[Span]]
`tokens`	The output of the tokenizer. transformers.BatchEncoding
`model_output`	The model output from the transformer model, determined by the model and transformer config. New in `spacy-transformers` v1.1.0. transformers.file_utils.ModelOutput
`tensors`	The `model_output` in the earlier `transformers` tuple format converted using `ModelOutput.to_tuple()`. Returns `Tuple` instead of `List` as of `spacy-transformers` v1.1.0. Tuple[Union[torch.Tensor, Tuple[torch.Tensor]]]
`align`	Alignment from the spaCy tokenization to the wordpieces. This is a ragged array, where `align.lengths[i]` indicates the number of wordpiece tokens that token `i` aligns against. The actual indices are provided at `align[i].dataXd`. Ragged
`doc_data`	The outputs, split per `Doc` object. List[TransformerData]

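FullTransformerBatch.unsplit_by_doc method

Return a new `FullTransformerBatch` from a split batch of activations, using the current object's spans, tokens and alignment. This is used during the backward pass to construct the gradients that are passed back into the transformer model.

Name	Description
`arrays`	The split batch of activations. List[List[Floats3d]]
RETURNS	The transformer batch. FullTransformerBatch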

FullTransformerBatch.split_by_doc method

Split a `TransformerData` object that represents a batch into a list with one `TransformerData` per `Doc`.

Name	Description
RETURNS	The split batch. List[TransformerData]
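Because each `Doc` owns a contiguous slice of the batch outputs, splitting and re-joining can be sketched with plain lists. The functions below are hypothetical helpers that assume one output row per span and a known number of spans per document; they illustrate the idea behind `split_by_doc` and `unsplit_by_doc` rather than reproduce them.

```python
from typing import List, Sequence


def split_rows_by_doc(rows: Sequence, spans_per_doc: List[int]) -> List[list]:
    """Slice a flat batch of per-span rows into one contiguous chunk per doc."""
    chunks = []
    start = 0
    for n_spans in spans_per_doc:
        chunks.append(list(rows[start:start + n_spans]))
        start += n_spans
    return chunks


def unsplit_rows(chunks: List[list]) -> list:
    """Inverse operation: concatenate the per-doc chunks back into one batch."""
    return [row for chunk in chunks for row in chunk]


# Three docs with 2, 1 and 3 spans respectively.
batch_rows = ["d0-s0", "d0-s1", "d1-s0", "d2-s0", "d2-s1", "d2-s2"]
per_doc = split_rows_by_doc(batch_rows, [2, 1, 3])
rejoined = unsplit_rows(per_doc)
```

The round trip returns the original batch order, which is what allows per-document gradients to be reassembled for the backward pass.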

In `spacy-transformers` v1.0, the model output is stored in `FullTransformerBatch.tensors` as `List[torch.Tensor]`.

Span getters

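Span getters are functions that take a batch of `Doc` objects and return, for each document, a list of `Span` objects to be processed by the transformer. This is used to manage long documents by cutting them into smaller sequences before running the transformer. The spans are allowed to overlap, and you can also omit sections of the `Doc` if they are not relevant.

Span getters can be referenced in the `[components.transformer.model.get_spans]` block of the config to customize the sequences processed by the transformer. You can also register custom span getters using the `@spacy.registry.span_getters` decorator.

Name	Description
`docs`	A batch of `Doc` objects. Iterable[Doc]
RETURNS	The spans to be processed by the transformer. List[List[Span]]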

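doc_spans.v1 registered function

Create a span getter that uses the whole document as its spans. This is the best approach if your `Doc` objects already refer to relatively short texts.

sent_spans.v1 registered function

Create a span getter that uses sentence boundary markers to extract the spans. This requires sentence boundaries to be set (e.g. by the `Sentencizer`) and may result in somewhat uneven batches, depending on the sentence lengths. However, it does provide the transformer with more meaningful windows to attend over.

To set sentence boundaries with the `sentencizer` during training, add a `sentencizer` to the beginning of the pipeline and include it in `[training.annotating_components]` so that it sets the sentence boundaries before the `transformer` component runs.

strided_spans.v1 registered function

Create a span getter for strided spans. If you set `window` and `stride` to the same value, the spans will cover each token once. Setting `stride` lower than `window` allows for an overlap, so that some tokens are counted twice. This can be desirable, because it allows all tokens to have both a left and a right context.

Name	Description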
`window`	The window size. int
`stride`	The stride size. int
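The interaction between `window` and `stride` is easiest to see on token offsets. The helper below is a hypothetical sketch that returns `(start, end)` pairs for a document of `n_tokens` tokens; it is not the registered function itself, and its handling of the final, shorter window is an assumption.

```python
from typing import List, Tuple


def strided_windows(n_tokens: int, window: int, stride: int) -> List[Tuple[int, int]]:
    """Return (start, end) token offsets for windows of `window` tokens, stepping by `stride`."""
    if window < 1 or stride < 1:
        raise ValueError("window and stride must be positive")
    windows = []
    start = 0
    while start < n_tokens:
        end = min(start + window, n_tokens)
        windows.append((start, end))
        if end == n_tokens:
            # The document is fully covered; stop before emitting redundant tails.
            break
        start += stride
    return windows


no_overlap = strided_windows(10, window=4, stride=4)
overlapping = strided_windows(10, window=4, stride=2)
```

With equal values every token falls in exactly one window. With `stride=2` and `window=4`, interior tokens appear in two windows, so each of them is seen with context on both sides in at least one window.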

Annotation setters registered function

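Annotation setters are functions that take a batch of `Doc` objects and a `FullTransformerBatch` and can set additional annotations on the `Doc`, e.g. to set custom or built-in attributes. You can register custom annotation setters using the `@registry.annotation_setters` decorator.

Name	Description
`docs`	A batch of `Doc` objects. List[Doc]
`trf_data`	The transformers data for the batch. FullTransformerBatch

The following built-in functions are available:

Name	Description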
`spacy-transformers.null_annotation_setter.v1`	Don’t set any additional annotations.
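The shape of a custom annotation setter can be sketched with stand-in objects. The factory `configure_width_setter`, the `trf_width` key and the toy data are all hypothetical; dictionaries replace `Doc` objects and `SimpleNamespace` replaces `FullTransformerBatch` and `TransformerData`, so the example runs without spaCy installed.

```python
from types import SimpleNamespace
from typing import Callable, List


def configure_width_setter() -> Callable[[List[dict], SimpleNamespace], None]:
    """Hypothetical factory returning a callback with the (docs, trf_data) signature."""

    def setter(docs: List[dict], trf_data: SimpleNamespace) -> None:
        # trf_data.doc_data holds one entry per doc, in the same order as docs.
        for doc, data in zip(docs, trf_data.doc_data):
            doc["trf_width"] = data.width

    return setter


# Toy stand-ins for a batch of two documents and their transformer outputs.
docs = [{"text": "first"}, {"text": "second"}]
batch = SimpleNamespace(
    doc_data=[SimpleNamespace(width=768), SimpleNamespace(width=768)]
)
configure_width_setter()(docs, batch)
```

In a real pipeline the factory would be decorated with `@registry.annotation_setters` under a name of your choosing and referenced from the `set_extra_annotations` setting.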
