SpanCategorizer

class, experimental, v3.1

String name: spancat. Trainable: yes

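Pipeline component for labeling potentially overlapping spans of text

A span categorizer consists of two parts: a suggester function that proposes candidate spans, which may or may not overlap, and a labeler model that predicts zero or more labels for each candidate.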

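This component comes in two forms: `spancat` and `spancat_singlelabel` (added in spaCy v3.5.1). When you need to perform multi-label classification on your spans, use `spancat`. The `spancat` component uses a `Logistic` layer in which the output probability of each class is independent of the others. If you instead need to predict at most one true class for a span, use `spancat_singlelabel`. It uses a `Softmax` layer and treats the task as a multi-class problem.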
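
The difference between the two variants can be illustrated without spaCy. The following is a minimal sketch in plain Python, not the library's implementation: the function names, the toy scores and the convention that the last score belongs to the negative label are all assumptions made for illustration, and the sketch ignores the threshold in the single-label case.

```python
import math

def multilabel_decode(logits, labels, threshold=0.5):
    """Logistic-style decoding: each label is scored independently."""
    probs = [1.0 / (1.0 + math.exp(-x)) for x in logits]
    # Keep every label whose own probability reaches the threshold
    return [label for label, p in zip(labels, probs) if p >= threshold]

def singlelabel_decode(logits, labels, negative_label="<NEG>"):
    """Softmax-style decoding: labels compete, at most one is returned."""
    # Assumption: the last logit belongs to a special negative label
    all_labels = list(labels) + [negative_label]
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    # A span whose best class is the negative label gets no annotation
    return [] if all_labels[best] == negative_label else [all_labels[best]]

labels = ["ORG", "GPE"]
print(multilabel_decode([2.0, 1.0], labels))        # both labels pass
print(multilabel_decode([-2.0, -1.0], labels))      # no label passes
print(singlelabel_decode([2.0, 1.0, 0.0], labels))  # only the top label
print(singlelabel_decode([0.0, 0.0, 3.0], labels))  # negative label wins
```

With independent probabilities a span can receive several labels or none, whereas the softmax variant returns one label at most.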

Predicted spans are saved in a SpanGroup on the doc under `doc.spans[spans_key]`, where `spans_key` is a component config setting. Individual span scores are stored in `doc.spans[spans_key].attrs["scores"]`.

Assigned Attributes

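Predictions will be saved to `Doc.spans[spans_key]` as a SpanGroup. The scores for the spans in the SpanGroup will be saved in `SpanGroup.attrs["scores"]`.

`spans_key` defaults to `"sc"`, but can be passed as a parameter. The `spancat` component will overwrite any existing spans under the spans key `doc.spans[spans_key]`.

Location	Value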
`Doc.spans[spans_key]`	The annotated spans. SpanGroup
`Doc.spans[spans_key].attrs["scores"]`	The score for each span in the `SpanGroup`. Floats1d
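
As a rough illustration of where the annotations end up, the sketch below initializes a blank English pipeline on a single toy example and then reads the span group and its scores. The text, labels and character offsets are made up for illustration, and because the model is untrained the predicted spans themselves are not meaningful.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
nlp.add_pipe("spancat")  # uses the default spans_key "sc"

# One toy example with two overlapping gold spans, given as character offsets
text = "Welcome to the Bank of China."
annotations = {"spans": {"sc": [(15, 28, "ORG"), (23, 28, "GPE")]}}
example = Example.from_dict(nlp.make_doc(text), annotations)
nlp.initialize(get_examples=lambda: [example])

doc = nlp(text)
group = doc.spans["sc"]          # SpanGroup with the predicted spans
scores = group.attrs["scores"]   # one score per span in the group
for span, score in zip(group, scores):
    print(span.text, span.label_, float(score))
```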

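Config and implementation

The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the `config` argument on `nlp.add_pipe` or in your `config.cfg` for training. See the model architectures documentation for details on the architectures and their arguments and hyperparameters.

Setting	Description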
`suggester`	A function that suggests spans. Spans are returned as a ragged array with two integer columns, for the start and end positions. Defaults to `ngram_suggester`. Callable[[Iterable[Doc], Optional[Ops]],Ragged]
`model`	A model instance that is given a list of documents and `(start, end)` indices representing candidate span offsets. The model predicts a probability for each category for each span. Defaults to SpanCategorizer. Model[Tuple[List[Doc],Ragged],Floats2d]
`spans_key`	Key of the `Doc.spans` dict to save the spans under. During initialization and training, the component will look for spans on the reference document under the same key. Defaults to `"sc"`. str
`threshold`	Minimum probability to consider a prediction positive. Spans with a positive prediction will be saved on the Doc. Meant to be used in combination with the multi-label `spancat` component with a `Logistic` scoring layer. Defaults to `0.5`. float
`max_positive`	Maximum number of labels to consider positive per span. Defaults to `None`, indicating no limit. Meant to be used together with the `spancat` component and defaults to 0 with `spancat_singlelabel`. Optional[int]
`scorer`	The scoring method. Defaults to `Scorer.score_spans` for `Doc.spans[spans_key]` with overlapping spans allowed. Optional[Callable]
`add_negative_label` v3.5.1	Whether to learn to predict a special negative label for each unannotated `Span`. This should be `True` when using a `Softmax` classifier layer, so it is `True` by default for `spancat_singlelabel`. Spans with negative labels and their scores are not stored as annotations. bool
`negative_weight` v3.5.1	Multiplier for the loss terms. It can be used to downweight the negative samples if there are too many. It is only used when `add_negative_label` is `True`. Defaults to `1.0`. float
`allow_overlap` v3.5.1	If `True`, the data is assumed to contain overlapping spans. It is only available when `max_positive` is exactly 1. Defaults to `True`. bool
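
A minimal sketch of overriding some of these settings through `nlp.add_pipe`. The key name `"labeled_spans"` is an arbitrary choice for illustration, and the blank English pipeline is only an assumption to keep the example self-contained.

```python
import spacy

nlp = spacy.blank("en")
config = {
    "threshold": 0.5,
    "spans_key": "labeled_spans",  # hypothetical key, replaces the default "sc"
    "max_positive": None,
    "suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]},
}
spancat = nlp.add_pipe("spancat", config=config)
print(nlp.pipe_names)
print(spancat.key)  # the spans key the component reads from and writes to
```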

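If you set a non-default value for `spans_key`, you also need to update `[training.score_weights]` so that the weights are computed correctly. For instance, for `spans_key == "myspankey"`, include the following in your configuration: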
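
A sketch of such a block, assuming the score names follow the `spans_{spans_key}_f`, `_p` and `_r` pattern and that the weighting mirrors the defaults used for the `"sc"` key (only the F-score contributes to the final score):

```ini
[training.score_weights]
spans_myspankey_f = 1.0
spans_myspankey_p = 0.0
spans_myspankey_r = 0.0
```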

explosion/spaCy/master/spacy/pipeline/spancat.py

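SpanCategorizer.__init__ method

Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and `nlp.add_pipe`.

Name	Description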
`vocab`	The shared vocabulary. Vocab
`model`	A model instance that is given a list of documents and `(start, end)` indices representing candidate span offsets. The model predicts a probability for each category for each span. Model[Tuple[List[Doc],Ragged],Floats2d]
`suggester`	A function that suggests spans. Spans are returned as a ragged array with two integer columns, for the start and end positions. Callable[[Iterable[Doc], Optional[Ops]],Ragged]
`name`	String name of the component instance. Used to add entries to the `losses` during training. str
keyword-only
`spans_key`	Key of the `Doc.spans` dict to save the spans under. During initialization and training, the component will look for spans on the reference document under the same key. Defaults to `"sc"`. str
`threshold`	Minimum probability to consider a prediction positive. Spans with a positive prediction will be saved on the Doc. Defaults to `0.5`. float
`max_positive`	Maximum number of labels to consider positive per span. Defaults to `None`, indicating no limit. Optional[int]
`allow_overlap` v3.5.1	If `True`, the data is assumed to contain overlapping spans. It is only available when `max_positive` is exactly 1. Defaults to `True`. bool
`add_negative_label` v3.5.1	Whether to learn to predict a special negative label for each unannotated `Span`. This should be `True` when using a `Softmax` classifier layer, so it is `True` by default for `spancat_singlelabel`. Spans with negative labels and their scores are not stored as annotations. bool
`negative_weight` v3.5.1	Multiplier for the loss terms. It can be used to downweight the negative samples if there are too many. It is only used when `add_negative_label` is `True`. Defaults to `1.0`. float

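SpanCategorizer.__call__ method

Apply the pipe to one document. The document is modified in place and returned. This usually happens under the hood when the `nlp` object is called on a text and all pipeline components are applied to the `Doc` in order. Both `__call__` and `pipe` delegate to the `predict` and `set_annotations` methods.

Name	Description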
`doc`	The document to process. Doc
RETURNS	The processed document. Doc

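SpanCategorizer.pipe method

Apply the pipe to a stream of documents. This usually happens under the hood when the `nlp` object is called on a text and all pipeline components are applied to the `Doc` in order. Both `__call__` and `pipe` delegate to the `predict` and `set_annotations` methods.

Name	Description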
`stream`	A stream of documents. Iterable[Doc]
keyword-only
`batch_size`	The number of documents to buffer. Defaults to `128`. int
YIELDS	The processed documents in order. Doc

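SpanCategorizer.initialize method

Initialize the component for training. `get_examples` should be a function that returns an iterable of `Example` objects. At least one example should be supplied. The data examples are used to initialize the model of the component and can either be the full training data or a representative sample. Initialization includes validating the network, inferring missing shapes and setting up the label scheme based on the data. This method is typically called by `Language.initialize` and lets you customize the arguments it receives via the `[initialize.components]` block in the config.

Name	Description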
`get_examples`	Function that returns gold-standard annotations in the form of `Example` objects. Must contain at least one `Example`. Callable[[], Iterable[Example]]
keyword-only
`nlp`	The current `nlp` object. Defaults to `None`. Optional[Language]
`labels`	The label information to add to the component, as provided by the `label_data` property after initialization. To generate a reusable JSON file from your data, you should run the `init labels` command. If no labels are provided, the `get_examples` callback is used to extract the labels from the data, which may be a lot slower. Optional[Iterable[str]]

SpanCategorizer.predict method

Apply the component's model to a batch of `Doc` objects, without modifying them.

Name	Description
`docs`	The documents to predict. Iterable[Doc]
RETURNS	The model's prediction for each document.

SpanCategorizer.set_annotations method

Modify a batch of `Doc` objects, using pre-computed scores.

Name	Description
`docs`	The documents to modify. Iterable[Doc]
`scores`	The scores to set, produced by `SpanCategorizer.predict`.

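SpanCategorizer.update method

Learn from a batch of `Example` objects containing the predictions and gold-standard annotations, and update the component's model. Delegates to `predict` and `get_loss`.

Name	Description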
`examples`	A batch of `Example` objects to learn from. Iterable[Example]
keyword-only
`drop`	The dropout rate. float
`sgd`	An optimizer. Will be created via `create_optimizer` if not set. Optional[Optimizer]
`losses`	Optional record of the loss during training. Updated using the component name as the key. Optional[Dict[str, float]]
RETURNS	The updated `losses` dictionary. Dict[str, float]
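
A minimal sketch of calling `update` directly, assuming a blank English pipeline and a single toy training example (text, labels and offsets are made up). In practice the training loop is usually driven by `nlp.update` or the `spacy train` command rather than by calling the component yourself.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat")

# Toy training data: two overlapping gold spans under the default key "sc"
text = "Welcome to the Bank of China."
annotations = {"spans": {"sc": [(15, 28, "ORG"), (23, 28, "GPE")]}}
examples = [Example.from_dict(nlp.make_doc(text), annotations)]

# nlp.initialize returns the optimizer that is passed on as `sgd`
optimizer = nlp.initialize(get_examples=lambda: examples)
losses = {}
for _ in range(5):
    # The loss is accumulated under the component name
    spancat.update(examples, sgd=optimizer, losses=losses)
print(losses)
```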

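SpanCategorizer.set_candidates method, v3.3

Use the suggester to add a list of `Span` candidates to a list of `Doc` objects. This method is intended to be used for debugging purposes.

Name	Description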
`docs`	The documents to modify. Iterable[Doc]
`candidates_key`	Key of the Doc.spans dict to save the candidate spans under. str
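
A short sketch of inspecting the suggester output this way. It assumes spaCy v3.3 or later, a blank English pipeline and the default n-gram suggester with sizes 1, 2 and 3; the sentence is made up.

```python
import spacy

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat")  # default suggester: n-grams of sizes 1, 2, 3

docs = [nlp.make_doc("Bank of China opens")]  # 4 tokens
spancat.set_candidates(docs, candidates_key="candidates")

# Every candidate span proposed by the suggester, before any labeling
for span in docs[0].spans["candidates"]:
    print(span.start, span.end, span.text)
```

For a 4-token document this yields 4 + 3 + 2 = 9 candidates, which makes it easy to check whether the suggester can propose the spans you expect the model to label.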

SpanCategorizer.get_loss method

Find the loss and gradient of loss for the batch of documents and their predicted scores.

Name	Description
`examples`	The batch of examples. Iterable[Example]
`spans_scores`	Scores representing the model’s predictions. Tuple[Ragged,Floats2d]
RETURNS	The loss and the gradient, i.e. `(loss, gradient)`. Tuple[float, float]

SpanCategorizer.create_optimizer method

Create an optimizer for the pipeline component.

Name	Description
RETURNS	The optimizer. Optimizer

SpanCategorizer.use_params method, contextmanager

Modify the pipe's model to use the given parameter values.

Name	Description
`params`	The parameter values to use in the model. dict

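SpanCategorizer.add_label method

Add a new label to the pipe. Raises an error if the output dimension is already set, or if the model has already been fully initialized. Note that you don't have to call this method if you provide a representative data sample to the `initialize` method. In this case, all labels found in the sample will be automatically added to the model, and the output dimension will be inferred automatically.

Name	Description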
`label`	The label to add. str
RETURNS	`0` if the label is already present, otherwise `1`. int
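
A small sketch of the return values described above, assuming a freshly added, not yet initialized component. The label name `MY_LABEL` is arbitrary.

```python
import spacy

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat")

first = spancat.add_label("MY_LABEL")   # new label
second = spancat.add_label("MY_LABEL")  # label already present
print(first, second, spancat.labels)
```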

SpanCategorizer.to_disk method

Serialize the pipe to disk.

Name	Description
`path`	A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]

SpanCategorizer.from_disk method

Load the pipe from disk. Modifies the object in place and returns it.

Name	Description
`path`	A path to a directory. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The modified `SpanCategorizer` object. SpanCategorizer

SpanCategorizer.to_bytes method

Serialize the pipe to a bytestring.

Name	Description
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The serialized form of the `SpanCategorizer` object. bytes

SpanCategorizer.from_bytes method

Load the pipe from a bytestring. Modifies the object in place and returns it.

Name	Description
`bytes_data`	The data to load from. bytes
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The `SpanCategorizer` object. SpanCategorizer

SpanCategorizer.labels property

The labels currently added to the component.

Name	Description
RETURNS	The labels added to the component. Tuple[str, …]

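SpanCategorizer.label_data property

The labels currently added to the component and their internal meta information. This is the data generated by `init labels` and used by `SpanCategorizer.initialize` to initialize the model with a pre-defined label set.

Name	Description
RETURNS	The label data added to the component. Tuple[str, …]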

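Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the `exclude` argument.

Name	Description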
`vocab`	The shared `Vocab`.
`cfg`	The config file. You usually don’t want to exclude this.
`model`	The binary model data. You usually don’t want to exclude this.

Suggesters (registered functions)

spacy.ngram_suggester.v1

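Suggest all spans of the given lengths. Spans are returned as a ragged array of integers. The array has two columns, indicating the start and end position.

Name	Description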
`sizes`	The phrase lengths to suggest. For example, `[1, 2]` will suggest phrases consisting of 1 or 2 tokens. List[int]
CREATES	The suggester function. Callable[[Iterable[Doc], Optional[Ops]],Ragged]
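
The enumeration described above can be sketched in plain Python. This is not spaCy's implementation: the helper name `ngram_offsets` is hypothetical, it handles a single document, and it returns a plain list of tuples rather than a ragged array.

```python
from typing import List, Tuple

def ngram_offsets(n_tokens: int, sizes: List[int]) -> List[Tuple[int, int]]:
    """Enumerate (start, end) token offsets for every n-gram of the given sizes."""
    offsets = []
    for size in sizes:
        # A span of `size` tokens can start at positions 0 .. n_tokens - size
        for start in range(0, n_tokens - size + 1):
            offsets.append((start, start + size))
    return offsets

print(ngram_offsets(3, [1, 2]))
# [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)]
```

The number of candidates grows with both the document length and the number of sizes, so large `sizes` lists make the labeling step proportionally more expensive.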

spacy.ngram_range_suggester.v1

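Suggest all spans that are at least `min_size` long and at most `max_size` long (both inclusive). Spans are returned as a ragged array of integers. The array has two columns, indicating the start and end position.

Name	Description
`min_size`	The minimal phrase length to suggest (inclusive). int
`max_size`	The maximal phrase length to suggest (inclusive). int
CREATES	The suggester function. Callable[[Iterable[Doc], Optional[Ops]],Ragged]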
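
A sketch of building this suggester through the registry and calling it directly, outside of any pipeline. The sentence and the size bounds are arbitrary, and looking the function up via `registry.misc` is just one way to obtain it.

```python
import spacy
from spacy import registry

nlp = spacy.blank("en")
make_suggester = registry.misc.get("spacy.ngram_range_suggester.v1")
suggester = make_suggester(min_size=2, max_size=4)

docs = [nlp.make_doc("Welcome to the Bank of China")]  # 6 tokens
candidates = suggester(docs)  # Ragged: one row of (start, end) per candidate

for start, end in candidates.dataXd:
    print(docs[0][int(start):int(end)].text)
```

For 6 tokens and sizes 2 to 4, the suggester proposes 5 + 4 + 3 = 12 spans.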

spacy.preset_spans_suggester.v1

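Suggest all spans that are already stored in `doc.spans[spans_key]`. This is useful when an upstream component, such as a SpanRuler or SpanFinder, is used to set the spans on the Doc.

Name	Description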
`spans_key`	Key of `Doc.spans` that provides spans to suggest. str
CREATES	The suggester function. Callable[[Iterable[Doc], Optional[Ops]],Ragged]
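
In a training config, this suggester could be selected roughly as follows. The component name `spancat` and the key `my_spans` are assumptions for illustration; the key has to match the one the upstream component writes to.

```ini
[components.spancat.suggester]
@misc = "spacy.preset_spans_suggester.v1"
spans_key = "my_spans"
```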
