SpanCategorizer
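A span categorizer consists of two parts: a suggester function that proposes candidate spans, which may or may not overlap, and a labeler model that predicts zero or more labels for each candidate.
This component comes in two forms: spancat and spancat_singlelabel (added in spaCy v3.5.1). When you need to perform multi-label classification on your spans, use spancat. The spancat component uses a Logistic layer where the output class probabilities are independent for each class. However, if you need to predict at most one true class for a span, use spancat_singlelabel. It uses a Softmax layer and treats the task as a multi-class problem.
Predicted spans will be saved in a SpanGroup on the doc under doc.spans[spans_key], where spans_key is a component config setting. Individual span scores are stored in doc.spans[spans_key].attrs["scores"].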
Assigned Attributes
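Predictions will be saved to Doc.spans[spans_key] as a SpanGroup. The scores for the spans in the SpanGroup will be saved in SpanGroup.attrs["scores"].
spans_key defaults to "sc", but can be passed as a parameter. The spancat component will overwrite any existing spans under the spans key doc.spans[spans_key].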
| Location | Value |
|---|---|
| Doc.spans[spans_key] | The annotated spans. SpanGroup |
| Doc.spans[spans_key].attrs["scores"] | The score for each span in the SpanGroup. Floats1d |
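As a rough illustration of where these attributes end up, the sketch below initializes a spancat component on a single made-up example and reads the span group back. The sentence, labels and character offsets are assumptions chosen for illustration. An untrained model gives arbitrary predictions, so only the structure of the output is meaningful here.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat", config={"spans_key": "sc"})

# Hypothetical example: spans are given as (start_char, end_char, label).
example = Example.from_dict(
    nlp.make_doc("Alice visited Berlin"),
    {"spans": {"sc": [(0, 5, "PERSON"), (14, 20, "LOC")]}},
)
nlp.initialize(get_examples=lambda: [example])

doc = nlp("Alice visited Berlin")
group = doc.spans["sc"]         # SpanGroup holding the predicted spans
scores = group.attrs["scores"]  # one score per span in the group
for span, score in zip(group, scores):
    print(span.text, span.label_, float(score))
```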
Config and implementation
The default config is defined by the pipeline component factory and describes how the component should be configured. You can override its settings via the config argument on nlp.add_pipe or in your config.cfg for training. See the model architectures documentation for details on the architectures and their arguments and hyperparameters.
| Setting | Description |
|---|---|
| suggester | A function that suggests spans. Spans are returned as a ragged array with two integer columns, for the start and end positions. Defaults to ngram_suggester. Callable[[Iterable[Doc], Optional[Ops]], Ragged] |
| model | A model instance that is given a list of documents and (start, end) indices representing candidate span offsets. The model predicts a probability for each category for each span. Defaults to SpanCategorizer. Model[Tuple[List[Doc], Ragged], Floats2d] |
| spans_key | Key of the Doc.spans dict to save the spans under. During initialization and training, the component will look for spans on the reference document under the same key. Defaults to "sc". str |
| threshold | Minimum probability to consider a prediction positive. Spans with a positive prediction will be saved on the Doc. Meant to be used in combination with the multi-label spancat component with a Logistic scoring layer. Defaults to 0.5. float |
| max_positive | Maximum number of labels to consider positive per span. Defaults to None, indicating no limit. Meant to be used together with the spancat component and defaults to 0 with spancat_singlelabel. Optional[int] |
| scorer | The scoring method. Defaults to Scorer.score_spans for Doc.spans[spans_key] with overlapping spans allowed. Optional[Callable] |
| add_negative_label (v3.5.1) | Whether to learn to predict a special negative label for each unannotated Span. This should be True when using a Softmax classifier layer, so it is True by default for spancat_singlelabel. Spans with negative labels and their scores are not stored as annotations. bool |
| negative_weight (v3.5.1) | Multiplier for the loss terms. It can be used to downweight the negative samples if there are too many. It is only used when add_negative_label is True. Defaults to 1.0. float |
| allow_overlap (v3.5.1) | If True, the data is assumed to contain overlapping spans. It is only available when max_positive is exactly 1. Defaults to True. bool |
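The settings above can be overridden when the component is added. The following is a minimal sketch, assuming a blank English pipeline. The particular values and the second key "sc_single" are illustrative choices, not recommendations.

```python
import spacy

nlp = spacy.blank("en")

# Override a few settings from the table above (values are illustrative).
config = {
    "spans_key": "sc",
    "threshold": 0.5,
    "max_positive": None,
    "suggester": {"@misc": "spacy.ngram_suggester.v1", "sizes": [1, 2, 3]},
}
spancat = nlp.add_pipe("spancat", config=config)

# The single-label variant is a separate factory (spaCy v3.5.1 or later).
single = nlp.add_pipe("spancat_singlelabel", config={"spans_key": "sc_single"})

print(nlp.pipe_names)
```

The same keys can be set in the [components.spancat] block of a training config.cfg.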
explosion/spaCy/master/spacy/pipeline/spancat.py
SpanCategorizer.__init__ method
Create a new pipeline instance. In your application, you would normally use a shortcut for this and instantiate the component using its string name and nlp.add_pipe.
| Name | Description |
|---|---|
| vocab | The shared vocabulary. Vocab |
| model | A model instance that is given a list of documents and (start, end) indices representing candidate span offsets. The model predicts a probability for each category for each span. Model[Tuple[List[Doc], Ragged], Floats2d] |
| suggester | A function that suggests spans. Spans are returned as a ragged array with two integer columns, for the start and end positions. Callable[[Iterable[Doc], Optional[Ops]], Ragged] |
| name | String name of the component instance. Used to add entries to the losses during training. str |
| keyword-only | |
| spans_key | Key of the Doc.spans dict to save the spans under. During initialization and training, the component will look for spans on the reference document under the same key. Defaults to "sc". str |
| threshold | Minimum probability to consider a prediction positive. Spans with a positive prediction will be saved on the Doc. Defaults to 0.5. float |
| max_positive | Maximum number of labels to consider positive per span. Defaults to None, indicating no limit. Optional[int] |
| allow_overlap (v3.5.1) | If True, the data is assumed to contain overlapping spans. It is only available when max_positive is exactly 1. Defaults to True. bool |
| add_negative_label (v3.5.1) | Whether to learn to predict a special negative label for each unannotated Span. This should be True when using a Softmax classifier layer, so it is True by default for spancat_singlelabel. Spans with negative labels and their scores are not stored as annotations. bool |
| negative_weight (v3.5.1) | Multiplier for the loss terms. It can be used to downweight the negative samples if there are too many. It is only used when add_negative_label is True. Defaults to 1.0. float |
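SpanCategorizer.__call__ method
Apply the pipe to one document. The document is modified in place and returned. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order. Both __call__ and pipe delegate to the predict and set_annotations methods.
| Name | Description |
|---|---|
| doc | The document to process. Doc |
| RETURNS | The processed document. Doc |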
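SpanCategorizer.pipe method
Apply the pipe to a stream of documents. This usually happens under the hood when the nlp object is called on a text and all pipeline components are applied to the Doc in order. Both __call__ and pipe delegate to the predict and set_annotations methods.
| Name | Description |
|---|---|
| stream | A stream of documents. Iterable[Doc] |
| keyword-only | |
| batch_size | The number of documents to buffer. Defaults to 128. int |
| YIELDS | The processed documents in order. Doc |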
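Calling the component directly, rather than through the nlp object, might look like the sketch below. The training example and the input sentences are made up for illustration, and the component is initialized first because an uninitialized model cannot predict.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat", config={"spans_key": "sc"})

# Hypothetical example used only to initialize the model.
example = Example.from_dict(
    nlp.make_doc("Alice visited Berlin"),
    {"spans": {"sc": [(0, 5, "PERSON"), (14, 20, "LOC")]}},
)
nlp.initialize(get_examples=lambda: [example])

# __call__: process a single Doc, which is modified in place and returned.
doc = spancat(nlp.make_doc("Alice visited Berlin"))

# pipe: process a stream of Docs in batches and yield them in order.
stream = (nlp.make_doc(text) for text in ["Bob lives in Paris", "Carol likes Rome"])
docs = list(spancat.pipe(stream, batch_size=2))
```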
SpanCategorizer.initialize method
Initialize the component for training. get_examples should be a function that returns an iterable of Example objects. At least one example should be supplied. The data examples are used to initialize the model of the component and can either be the full training data or a representative sample. Initialization includes validating the network, inferring missing shapes and setting up the label scheme based on the data. This method is typically called by Language.initialize and lets you customize the arguments it receives via the [initialize.components] block in the config.
| Name | Description |
|---|---|
| get_examples | Function that returns gold-standard annotations in the form of Example objects. Must contain at least one Example. Callable[[], Iterable[Example]] |
| keyword-only | |
| nlp | The current nlp object. Defaults to None. Optional[Language] |
| labels | The label information to add to the component, as provided by the label_data property after initialization. To generate a reusable JSON file from your data, you should run the init labels command. If no labels are provided, the get_examples callback is used to extract the labels from the data, which may be a lot slower. Optional[Iterable[str]] |
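A minimal sketch of calling initialize directly on the component is shown below, assuming one made-up example. Here no labels argument is passed, so the labels are extracted from the example data.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat", config={"spans_key": "sc"})

# Hypothetical gold-standard example with two annotated spans under "sc".
example = Example.from_dict(
    nlp.make_doc("Alice visited Berlin"),
    {"spans": {"sc": [(0, 5, "PERSON"), (14, 20, "LOC")]}},
)

# get_examples is a callable returning an iterable of Example objects.
spancat.initialize(lambda: [example], nlp=nlp)
print(spancat.labels)
```

In a full training run this call is normally made for you by Language.initialize.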
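SpanCategorizer.predict method
Apply the component’s model to a batch of Doc objects without modifying them.
| Name | Description |
|---|---|
| docs | The documents to predict. Iterable[Doc] |
| RETURNS | The model’s prediction for each document. |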
SpanCategorizer.set_annotations method
Modify a batch of Doc objects using pre-computed scores.
| Name | Description |
|---|---|
| docs | The documents to modify. Iterable[Doc] |
| scores | The scores to set, produced by SpanCategorizer.predict. |
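The two-step flow of predict followed by set_annotations can be sketched as follows. The example data is made up, and the component is initialized first so that the model has weights to predict with.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat", config={"spans_key": "sc"})
example = Example.from_dict(
    nlp.make_doc("Alice visited Berlin"),
    {"spans": {"sc": [(0, 5, "PERSON"), (14, 20, "LOC")]}},
)
nlp.initialize(get_examples=lambda: [example])

doc = nlp.make_doc("Alice visited Berlin")

# Step 1: compute predictions; the Doc is left untouched.
scores = spancat.predict([doc])
untouched = "sc" not in doc.spans

# Step 2: write the predictions to doc.spans["sc"].
spancat.set_annotations([doc], scores)
```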
SpanCategorizer.update method
Learn from a batch of Example objects containing the predictions and gold-standard annotations, and update the component’s model. Delegates to predict and get_loss.
| Name | Description |
|---|---|
| examples | A batch of Example objects to learn from. Iterable[Example] |
| keyword-only | |
| drop | The dropout rate. float |
| sgd | An optimizer. Will be created via create_optimizer if not set. Optional[Optimizer] |
| losses | Optional record of the loss during training. Updated using the component name as the key. Optional[Dict[str, float]] |
| RETURNS | The updated losses dictionary. Dict[str, float] |
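A single update step might look like the sketch below. The one-example batch and the dropout value are assumptions for illustration; a real training loop would iterate over shuffled batches for many steps.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat", config={"spans_key": "sc"})
example = Example.from_dict(
    nlp.make_doc("Alice visited Berlin"),
    {"spans": {"sc": [(0, 5, "PERSON"), (14, 20, "LOC")]}},
)

# Language.initialize returns an optimizer that can be passed as sgd.
optimizer = nlp.initialize(get_examples=lambda: [example])

# One update step; the loss is recorded under the component name.
losses = spancat.update([example], drop=0.1, sgd=optimizer, losses={})
print(losses)
```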
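SpanCategorizer.set_candidates method (v3.3)
Use the suggester to add a list of Span candidates to a list of Doc objects. This method is intended to be used for debugging purposes.
| Name | Description |
|---|---|
| docs | The documents to modify. Iterable[Doc] |
| candidates_key | Key of the Doc.spans dict to save the candidate spans under. str |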
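For debugging, the candidates produced by the suggester can be inspected without training anything. The sketch below assumes the default spancat suggester settings and a made-up three-token sentence.

```python
import spacy

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat")

doc = nlp.make_doc("Alice visited Berlin")

# Store every span proposed by the suggester under doc.spans["candidates"].
spancat.set_candidates([doc], candidates_key="candidates")
print([span.text for span in doc.spans["candidates"]])
```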
SpanCategorizer.get_loss method
Find the loss and gradient of loss for the batch of documents and their predicted scores.
| Name | Description |
|---|---|
| examples | The batch of examples. Iterable[Example] |
| spans_scores | Scores representing the model’s predictions. Tuple[Ragged, Floats2d] |
| RETURNS | The loss and the gradient, i.e. (loss, gradient). Tuple[float, float] |
SpanCategorizer.create_optimizer method
Create an optimizer for the pipeline component.
| Name | Description |
|---|---|
| RETURNS | The optimizer. Optimizer |
SpanCategorizer.use_params method, context manager
Modify the pipe’s model to use the given parameter values.
| Name | Description |
|---|---|
| params | The parameter values to use in the model. dict |
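SpanCategorizer.add_label method
Add a new label to the pipe. Raises an error if the output dimension is already set, or if the model has already been fully initialized. Note that you don’t have to call this method if you provide a representative data sample to the initialize method. In this case, all labels found in the sample will be automatically added to the model, and the output dimension will be inferred automatically.
| Name | Description |
|---|---|
| label | The label to add. str |
| RETURNS | 0 if the label is already present, otherwise 1. int |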
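The return value described in the table can be checked with a short sketch. The label name is an arbitrary example, and the component is deliberately left uninitialized so that labels can still be added.

```python
import spacy

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat")

first = spancat.add_label("PERSON")   # label is new
second = spancat.add_label("PERSON")  # label is already present
print(first, second, spancat.labels)
```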
SpanCategorizer.to_disk method
Serialize the pipe to disk.
| Name | Description |
|---|---|
| path | A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or Path-like objects. Union[str, Path] |
| keyword-only | |
| exclude | String names of serialization fields to exclude. Iterable[str] |
SpanCategorizer.from_disk method
Load the pipe from disk. Modifies the object in place and returns it.
| Name | Description |
|---|---|
| path | A path to a directory. Paths may be either strings or Path-like objects. Union[str, Path] |
| keyword-only | |
| exclude | String names of serialization fields to exclude. Iterable[str] |
| RETURNS | The modified SpanCategorizer object. SpanCategorizer |
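A round trip through to_disk and from_disk could be sketched as follows. The temporary directory, the subdirectory name "spancat" and the training example are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.training import Example

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat", config={"spans_key": "sc"})
example = Example.from_dict(
    nlp.make_doc("Alice visited Berlin"),
    {"spans": {"sc": [(0, 5, "PERSON"), (14, 20, "LOC")]}},
)
nlp.initialize(get_examples=lambda: [example])

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "spancat"  # created by to_disk if it does not exist
    spancat.to_disk(path)

    # Load into a fresh component created with the same settings.
    nlp2 = spacy.blank("en")
    restored = nlp2.add_pipe("spancat", config={"spans_key": "sc"})
    restored.from_disk(path)

print(restored.labels)
```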
SpanCategorizer.to_bytes method
Serialize the pipe to a bytestring.
| Name | Description |
|---|---|
| keyword-only | |
| exclude | String names of serialization fields to exclude. Iterable[str] |
| RETURNS | The serialized form of the SpanCategorizer object. bytes |
SpanCategorizer.from_bytes method
Load the pipe from a bytestring. Modifies the object in place and returns it.
| Name | Description |
|---|---|
| bytes_data | The data to load from. bytes |
| keyword-only | |
| exclude | String names of serialization fields to exclude. Iterable[str] |
| RETURNS | The SpanCategorizer object. SpanCategorizer |
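The in-memory equivalent with to_bytes and from_bytes might look like this sketch, again using a made-up example to initialize the component before serializing it.

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")
spancat = nlp.add_pipe("spancat", config={"spans_key": "sc"})
example = Example.from_dict(
    nlp.make_doc("Alice visited Berlin"),
    {"spans": {"sc": [(0, 5, "PERSON"), (14, 20, "LOC")]}},
)
nlp.initialize(get_examples=lambda: [example])

data = spancat.to_bytes()

# Load the bytes into a fresh component created with the same settings.
nlp2 = spacy.blank("en")
restored = nlp2.add_pipe("spancat", config={"spans_key": "sc"})
restored.from_bytes(data)
print(restored.labels)
```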
SpanCategorizer.labels property
The labels currently added to the component.
| Name | Description |
|---|---|
| RETURNS | The labels added to the component. Tuple[str, …] |
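SpanCategorizer.label_data property
The labels currently added to the component and their internal meta information. This is the data generated by init labels and used by SpanCategorizer.initialize to initialize the model with a pre-defined label set.
| Name | Description |
|---|---|
| RETURNS | The label data added to the component. Tuple[str, …] |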
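Serialization fields
During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.
| Name | Description |
|---|---|
| vocab | The shared Vocab. |
| cfg | The config file. You usually don’t want to exclude this. |
| model | The binary model data. You usually don’t want to exclude this. |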
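Suggesters registered functions
spacy.ngram_suggester.v1
Suggest all spans of the given lengths. Spans are returned as a ragged array of integers. The array has two columns, indicating the start and end position.
| Name | Description |
|---|---|
| sizes | The phrase lengths to suggest. For example, [1, 2] will suggest phrases consisting of 1 or 2 tokens. List[int] |
| CREATES | The suggester function. Callable[[Iterable[Doc], Optional[Ops]], Ragged] |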
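To see what the ragged array looks like, the registered function can be resolved by name and called on a Doc. This is a sketch with a made-up four-token sentence; with sizes [1, 2] it should propose four unigrams followed by three bigrams.

```python
import spacy
from spacy.util import registry

nlp = spacy.blank("en")
doc = nlp.make_doc("This is a sentence")

# Resolve the registered factory and build a suggester for 1- and 2-token spans.
suggester = registry.misc.get("spacy.ngram_suggester.v1")(sizes=[1, 2])
candidates = suggester([doc])

# dataXd holds one (start, end) row per candidate span;
# lengths holds the number of candidates per Doc.
print(candidates.dataXd)
print(candidates.lengths)
```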
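spacy.ngram_range_suggester.v1
Suggest all spans that are at least min_size and at most max_size long (both inclusive). Spans are returned as a ragged array of integers. The array has two columns, indicating the start and end position.
| Name | Description |
|---|---|
| min_size | The minimal phrase length to suggest (inclusive). int |
| max_size | The maximal phrase length to suggest (inclusive). int |
| CREATES | The suggester function. Callable[[Iterable[Doc], Optional[Ops]], Ragged] |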
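The range variant can be exercised the same way. In this sketch, on a made-up four-token sentence, a range of 1 to 3 tokens should give 4 + 3 + 2 candidate spans.

```python
import spacy
from spacy.util import registry

nlp = spacy.blank("en")
doc = nlp.make_doc("This is a sentence")

# Both bounds are inclusive, so this covers span lengths 1, 2 and 3.
suggester = registry.misc.get("spacy.ngram_range_suggester.v1")(min_size=1, max_size=3)
candidates = suggester([doc])
print(candidates.dataXd.shape)
```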
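spacy.preset_spans_suggester.v1
Suggest all spans that are already stored in doc.spans[spans_key]. This is useful when an upstream component such as SpanRuler or SpanFinder is used to set the spans on the Doc.
| Name | Description |
|---|---|
| spans_key | Key of Doc.spans that provides spans to suggest. str |
| CREATES | The suggester function. Callable[[Iterable[Doc], Optional[Ops]], Ragged] |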
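A sketch of the preset suggester is shown below, assuming a spaCy version that ships this registered function. The spans are set by hand under the hypothetical key "candidates" to stand in for an upstream component.

```python
import spacy
from spacy.util import registry

nlp = spacy.blank("en")
doc = nlp.make_doc("This is a sentence")

# Stand-in for spans set by an upstream component such as SpanRuler.
doc.spans["candidates"] = [doc[0:2], doc[2:4]]

suggester = registry.misc.get("spacy.preset_spans_suggester.v1")(spans_key="candidates")
candidates = suggester([doc])
print(candidates.dataXd)
```

To use it in a pipeline, the suggester block of the spancat config would reference the same registered name and spans_key.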