Model architectures
A model architecture is a function that wires up a Model instance, which you can then use in a pipeline component or as a layer of a larger network. This page documents spaCy's built-in architectures for different NLP tasks. All trainable built-in components expect a model argument defined in the config and document their default architecture. Custom architectures can be registered using the @spacy.registry.architectures decorator and used as part of the training config. Also see the usage documentation on layers and model architectures.
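For illustration, here is a minimal sketch of the registration mechanism; the name my_org.CustomTok2Vec.v1 is hypothetical, and the body simply delegates to the built-in spacy.HashEmbedCNN.v2 (documented below) rather than composing its own Thinc layers.

```python
from typing import List

import spacy
from spacy.tokens import Doc
from thinc.api import Model
from thinc.types import Floats2d


@spacy.registry.architectures("my_org.CustomTok2Vec.v1")  # hypothetical name
def build_custom_tok2vec(width: int, depth: int) -> Model[List[Doc], List[Floats2d]]:
    # For illustration only: look up a built-in architecture in the registry
    # and reuse it. A real custom architecture would wire up its own layers.
    hash_embed_cnn = spacy.registry.architectures.get("spacy.HashEmbedCNN.v2")
    return hash_embed_cnn(
        width=width,
        depth=depth,
        embed_size=2000,
        window_size=1,
        maxout_pieces=3,
        subword_features=True,
        pretrained_vectors=False,
    )
```

The registered name can then be referenced from a training config via @architectures = "my_org.CustomTok2Vec.v1".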
Tok2Vec architectures
spacy.Tok2Vec.v2
Construct a tok2vec model out of two subnetworks: one for embedding and one for encoding. See the "Embed, Encode, Attend, Predict" blog post for background.
| Name | Description |
|---|---|
embed | Embed tokens into context-independent word vector representations. For example, CharacterEmbed or MultiHashEmbed. Model[List[Doc], List[Floats2d]] |
encode | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, MaxoutWindowEncoder. Model[List[Floats2d], List[Floats2d]] |
| CREATES | The model using the architecture. Model[List[Doc], List[Floats2d]] |
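As a sketch of how the two subnetworks are wired together (the embedding and encoding architectures are documented below; the widths and other values here are illustrative assumptions and must agree between embed and encode):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "tok2vec",
    config={
        "model": {
            "@architectures": "spacy.Tok2Vec.v2",
            # Embedding subnetwork: context-independent word representations.
            "embed": {
                "@architectures": "spacy.MultiHashEmbed.v2",
                "width": 96,
                "attrs": ["NORM", "PREFIX", "SUFFIX", "SHAPE"],
                "rows": [5000, 2500, 2500, 2500],
                "include_static_vectors": False,
            },
            # Encoding subnetwork: contextualizes the embeddings with a CNN.
            "encode": {
                "@architectures": "spacy.MaxoutWindowEncoder.v2",
                "width": 96,
                "window_size": 1,
                "maxout_pieces": 3,
                "depth": 4,
            },
        }
    },
)
```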
spacy.HashEmbedCNN.v2
Build spaCy's "standard" tok2vec layer. This layer is defined by a MultiHashEmbed embedding layer that uses subword features, and a MaxoutWindowEncoder encoding layer consisting of a CNN and a layer-normalized maxout activation function.
| Name | Description |
|---|---|
width | The width of the input and output. These are required to be the same, so that residual connections can be used. Recommended values are 96, 128 or 300. int |
depth | The number of convolutional layers to use. Recommended values are between 2 and 8. int |
embed_size | The number of rows in the hash embedding tables. This can be surprisingly small, due to the use of the hash embeddings. Recommended values are between 2000 and 10000. int |
window_size | The number of tokens on either side to concatenate during the convolutions. The receptive field of the CNN will be depth * window_size * 2 + 1, so a 4-layer network with a window size of 2 will be sensitive to 17 words at a time. Recommended value is 1. int |
maxout_pieces | The number of pieces to use in the maxout non-linearity. If 1, the Mish non-linearity is used instead. Recommended values are 1-3. int |
subword_features | Whether to also embed subword features, specifically the prefix, suffix and word shape. This is recommended for alphabetic languages like English, but not if single-character tokens are used for a language such as Chinese. bool |
pretrained_vectors | Whether to also use static vectors. bool |
| CREATES | The model using the architecture. Model[List[Doc], List[Floats2d]] |
spacy.Tok2VecListener.v1
A listener is used as a sublayer within a component such as a DependencyParser, EntityRecognizer or TextCategorizer. Usually you'll have multiple listeners connecting to a single upstream Tok2Vec component that's earlier in the pipeline. The listener layers act as proxies, passing the predictions from the Tok2Vec component into downstream components, and communicating gradients back upstream.
Instead of defining its own Tok2Vec instance, a model architecture like Tagger can define a listener as its tok2vec argument that connects to the shared tok2vec component in the pipeline.
Listeners work by caching the Tok2Vec output for a given batch of Doc objects. This means that in order for a component to work with the listener, the batch of Doc objects passed to the listener must be the same as the batch of Doc objects passed to the Tok2Vec. As a result, any manipulation of the Doc objects that would affect Tok2Vec output, such as creating special contexts or removing Doc objects for which no prediction can be made, must happen inside the model, after the call to the Tok2Vec component.
| Name | Description |
|---|---|
width | The width of the vectors produced by the “upstream” Tok2Vec component. int |
upstream | A string to identify the “upstream” Tok2Vec component to communicate with. By default, the upstream name is the wildcard string "*", but you could also specify the name of the Tok2Vec component. You’ll almost never have multiple upstream Tok2Vec components, so the wildcard string will almost always be fine. str |
| CREATES | The model using the architecture. Model[List[Doc], List[Floats2d]] |
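A sketch of the shared setup this listener enables (assuming the default tok2vec model with a width of 96): one tok2vec component feeds a downstream tagger whose tok2vec sublayer is a listener rather than its own embedding network.

```python
import spacy

nlp = spacy.blank("en")
# Shared upstream component, using the default tok2vec model.
nlp.add_pipe("tok2vec")
# Downstream tagger that listens to the shared component instead of
# embedding its own tok2vec subnetwork.
nlp.add_pipe(
    "tagger",
    config={
        "model": {
            "@architectures": "spacy.Tagger.v2",
            "tok2vec": {
                "@architectures": "spacy.Tok2VecListener.v1",
                "width": 96,      # must match the width of the upstream Tok2Vec
                "upstream": "*",  # wildcard: connect to any upstream Tok2Vec
            },
        }
    },
)
```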
spacy.MultiHashEmbed.v2
Construct an embedding layer that separately embeds a number of lexical attributes using hash embedding, concatenates the results, and passes it through a feed-forward subnetwork to build a mixed representation. The features used can be configured with the attrs argument. The suggested attributes are NORM, PREFIX, SUFFIX and SHAPE. This lets the model take into account some subword information, without constructing a fully character-based representation. If pretrained vectors are available, they can be included in the representation as well, with the vectors table kept static (i.e. it's not updated).
| Name | Description |
|---|---|
width | The output width. Also used as the width of the embedding tables. Recommended values are between 64 and 300. If static vectors are included, a learned linear layer is used to map the vectors to the specified width before concatenating it with the other embedding outputs. A single maxout layer is then used to reduce the concatenated vectors to the final width. int |
attrs | The token attributes to embed. A separate embedding table will be constructed for each attribute. List[Union[int, str]] |
rows | The number of rows for each embedding table. Thanks to the hashing trick, the layer needs surprisingly few rows: generally between 2000 and 10000 rows is sufficient, even for very large vocabularies. A number of rows must be specified for each table, so the rows list must be of the same length as the attrs parameter. List[int] |
include_static_vectors | Whether to also use static word vectors. Requires a vectors table to be loaded in the Doc objects’ vocab. bool |
| CREATES | The model using the architecture. Model[List[Doc], List[Floats2d]] |
spacy.CharacterEmbed.v2
Construct an embedded representation based on character embeddings, using a feed-forward network. A fixed number of UTF-8 byte characters are used for each word, taken from the beginning and end of the word equally. Padding is used in the center for words that are too short.
For instance, let's say nC=4, and the word is "jumping". The characters used will be "jung" (two from the start, two from the end). If we had nC=8, the characters would be "jumpping": four from the start, four from the end. This ensures that the final character is always in the last position, instead of being in an arbitrary position depending on the word length.
The characters are embedded in an embedding table with the given number of rows, and the vectors are concatenated. A hash-embedded vector of the NORM of the word is also concatenated on, and the result is passed through a feed-forward network to construct a single vector to represent the information.
| Name | Description |
|---|---|
width | The width of the output vector and the NORM hash embedding. int |
rows | The number of rows in the NORM hash embedding table. int |
nM | The dimensionality of the character embeddings. Recommended values are between 16 and 64. int |
nC | The number of UTF-8 bytes to embed per word. Recommended values are between 3 and 8, although it may depend on the length of words in the language. int |
| CREATES | The model using the architecture. Model[List[Doc], List[Floats2d]] |
spacy.MaxoutWindowEncoder.v2
Encode context using convolutions with maxout activation, layer normalization and residual connections.
| Name | Description |
|---|---|
width | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between 64 and 300. int |
window_size | The number of words to concatenate around each token to construct the convolution. Recommended value is 1. int |
maxout_pieces | The number of maxout pieces to use. Recommended values are 2 or 3. int |
depth | The number of convolutional layers. Recommended value is 4. int |
| CREATES | The model using the architecture. Model[List[Floats2d], List[Floats2d]] |
spacy.MishWindowEncoder.v2
Encode context using convolutions with the Mish activation function, layer normalization and residual connections.
| Name | Description |
|---|---|
width | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between 64 and 300. int |
window_size | The number of words to concatenate around each token to construct the convolution. Recommended value is 1. int |
depth | The number of convolutional layers. Recommended value is 4. int |
| CREATES | The model using the architecture. Model[List[Floats2d], List[Floats2d]] |
spacy.TorchBiLSTMEncoder.v1
Encode context using bidirectional LSTM layers. Requires PyTorch.
| Name | Description |
|---|---|
width | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between 64 and 300. int |
depth | The number of recurrent layers, for instance depth=2 results in stacking two LSTMs together. int |
dropout | Creates a Dropout layer on the outputs of each LSTM layer except the last layer. Set to 0.0 to disable this functionality. float |
| CREATES | The model using the architecture. Model[List[Floats2d], List[Floats2d]] |
spacy.StaticVectors.v2
Embed Doc objects with their vocab's vectors table, applying a learned linear projection to control the dimensionality. Unknown tokens are mapped to a zero vector. See the documentation on static vectors for details.
| Name | Description |
|---|---|
nO | The output width of the layer, after the linear projection. Optional[int] |
nM | The width of the static vectors. Optional[int] |
dropout | Optional dropout rate. If set, it’s applied per dimension over the whole batch. Defaults to None. Optional[float] |
init_W | The initialization function. Defaults to glorot_uniform_init. Callable[[Ops, Tuple[int, ...]], FloatsXd] |
key_attr | This setting is ignored in spaCy v3.6+. To set a custom key attribute for vectors, configure it through Vectors or spacy init vectors. Defaults to "ORTH". str |
| CREATES | The model using the architecture. Model[List[Doc], Ragged] |
spacy.FeatureExtractor.v1
Extract arrays of input features from Doc objects. Expects a list of feature names to extract, which should refer to token attributes.
| Name | Description |
|---|---|
columns | The token attributes to extract. List[Union[int, str]] |
| CREATES | The created feature extraction layer. Model[List[Doc], List[Ints2d]] |
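A small usage sketch (the attribute choice is illustrative): the layer can be built from the architectures registry and run on its own to inspect the extracted attribute arrays.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("spaCy extracts token attributes")

# Build the feature extraction layer from the architectures registry.
build_extractor = spacy.registry.architectures.get("spacy.FeatureExtractor.v1")
extractor = build_extractor(columns=["NORM", "PREFIX", "SUFFIX"])

# One Ints2d array per Doc: one row per token, one column per attribute.
arrays = extractor.predict([doc])
print(arrays[0].shape)  # (number of tokens, 3)
```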
Transformer architectures
The following architectures are provided by the package spacy-transformers. See the usage documentation for how to integrate the architectures into your training config.
spacy-transformers.TransformerModel.v3
Load and wrap a transformer model from the HuggingFace transformers library. You can use any transformer that has pretrained weights and a PyTorch implementation. The name variable is passed through to the underlying library, so it can be either a string or a path. If it's a string, the pretrained weights will be downloaded via the transformers library if they are not already available locally.
In order to support longer documents, the TransformerModel layer allows you to pass in a get_spans function that will divide up the Doc objects before passing them through the transformer. Your spans are allowed to overlap or exclude tokens. This layer is usually used directly by the Transformer component, which allows you to share the transformer weights across your pipeline. For a layer that's configured for use in other components, see Tok2VecTransformer.
| Name | Description |
|---|---|
name | Any model name that can be loaded by transformers.AutoModel. str |
get_spans | Function that takes a batch of Doc object and returns lists of Span objects to process by the transformer. See here for built-in options and examples. Callable[[List[Doc]], List[Span]] |
tokenizer_config | Tokenizer settings passed to transformers.AutoTokenizer. Dict[str, Any] |
transformer_config | Transformer settings passed to transformers.AutoConfig. Dict[str, Any] |
mixed_precision | Replace whitelisted ops by half-precision counterparts. Speeds up training and prediction on GPUs with Tensor Cores and reduces GPU memory use. bool |
grad_scaler_config | Configuration to pass to thinc.api.PyTorchGradScaler during training when mixed_precision is enabled. Dict[str, Any] |
| CREATES | The model using the architecture. Model[List[Doc], FullTransformerBatch] |
The transformer_config argument was added in spacy-transformers.TransformerModel.v2. The mixed_precision and grad_scaler_config arguments were added in spacy-transformers.TransformerModel.v3.
The other arguments are shared between all versions.
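A sketch of plugging this architecture into the Transformer pipeline component (requires spacy-transformers to be installed; the model name and span-getter settings below are illustrative assumptions):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "transformer",
    config={
        "model": {
            "@architectures": "spacy-transformers.TransformerModel.v3",
            "name": "roberta-base",  # any transformers.AutoModel name or path
            "tokenizer_config": {"use_fast": True},
            "transformer_config": {},
            "mixed_precision": False,
            # Split long documents into overlapping strided spans.
            "get_spans": {
                "@span_getters": "spacy-transformers.strided_spans.v1",
                "window": 128,
                "stride": 96,
            },
        }
    },
)
```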
spacy-transformers.TransformerListener.v1
Create a TransformerListener layer, which will connect to a Transformer component earlier in the pipeline. The layer takes a list of Doc objects as input, and produces a list of 2-dimensional arrays as output, with each array having one row per token. Most spaCy models expect a sublayer with this signature, making it easy to connect them to a transformer model via this sublayer. Transformer models usually operate over wordpieces, which usually don't align one-to-one against spaCy tokens. The layer therefore requires a reduction operation in order to calculate a single token vector given zero or more wordpiece vectors.
| Name | Description |
|---|---|
pooling | A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see reduce_mean) is usually a good choice. Model[Ragged,Floats2d] |
grad_factor | Reweight gradients from the component before passing them upstream. You can set this to 0 to “freeze” the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at 1.0 is usually fine. float |
upstream | A string to identify the “upstream” Transformer component to communicate with. By default, the upstream name is the wildcard string "*", but you could also specify the name of the Transformer component. You’ll almost never have multiple upstream Transformer components, so the wildcard string will almost always be fine. str |
| CREATES | The model using the architecture. Model[List[Doc], List[Floats2d]] |
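A sketch of connecting a downstream tagger to a shared Transformer component through this listener (requires spacy-transformers; the pooling layer and grad_factor shown are the usual choices, included here as assumptions):

```python
import spacy

nlp = spacy.blank("en")
# Upstream Transformer component with its default model config.
nlp.add_pipe("transformer")
nlp.add_pipe(
    "tagger",
    config={
        "model": {
            "@architectures": "spacy.Tagger.v2",
            "tok2vec": {
                "@architectures": "spacy-transformers.TransformerListener.v1",
                "grad_factor": 1.0,
                "upstream": "*",
                # Average the wordpiece vectors aligned to each token.
                "pooling": {"@layers": "reduce_mean.v1"},
            },
        }
    },
)
```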
spacy-transformers.Tok2VecTransformer.v3
Use a transformer as a Tok2Vec layer directly. This does not allow multiple components to share the transformer weights, and does not allow the transformer to set annotations into the Doc object, but it's a simpler solution if you only need the transformer within one component.
| Name | Description |
|---|---|
get_spans | Function that takes a batch of Doc object and returns lists of Span objects to process by the transformer. See here for built-in options and examples. Callable[[List[Doc]], List[Span]] |
tokenizer_config | Tokenizer settings passed to transformers.AutoTokenizer. Dict[str, Any] |
transformer_config | Settings to pass to the transformers forward pass. Dict[str, Any] |
pooling | A reduction layer used to calculate the token vectors based on zero or more wordpiece vectors. If in doubt, mean pooling (see reduce_mean) is usually a good choice. Model[Ragged,Floats2d] |
grad_factor | Reweight gradients from the component before passing them upstream. You can set this to 0 to “freeze” the transformer weights with respect to the component, or use it to make some components more significant than others. Leaving it at 1.0 is usually fine. float |
mixed_precision | Replace whitelisted ops by half-precision counterparts. Speeds up training and prediction on GPUs with Tensor Cores and reduces GPU memory use. bool |
grad_scaler_config | Configuration to pass to thinc.api.PyTorchGradScaler during training when mixed_precision is enabled. Dict[str, Any] |
| CREATES | The model using the architecture. Model[List[Doc], List[Floats2d]] |
The transformer_config argument was added in spacy-transformers.Tok2VecTransformer.v2. The mixed_precision and grad_scaler_config arguments were added in spacy-transformers.Tok2VecTransformer.v3.
The other arguments are shared between all versions.
Curated Transformer architectures
The following architectures are provided by the package spacy-curated-transformers.
See the usage documentation for how to integrate the architectures into your training config.
When loading the model from the Hugging Face Hub, the model config's parameters must be the same as the hyperparameters used by the pretrained model. The init fill-curated-transformer CLI command can be used to automatically fill in these values.
spacy-curated-transformers.AlbertTransformer.v1
Construct an ALBERT transformer model.
| Name | Description |
|---|---|
vocab_size | Vocabulary size. int |
with_spans | Callback that constructs a span generator model. Callable |
piece_encoder | The piece encoder to segment input tokens. Model |
attention_probs_dropout_prob | Dropout probability of the self-attention layers. float |
embedding_width | Width of the embedding representations. int |
hidden_act | Activation used by the point-wise feed-forward layers. str |
hidden_dropout_prob | Dropout probability of the point-wise feed-forward and embedding layers. float |
hidden_width | Width of the final representations. int |
intermediate_width | Width of the intermediate projection layer in the point-wise feed-forward layer. int |
layer_norm_eps | Epsilon for layer normalization. float |
max_position_embeddings | Maximum length of position embeddings. int |
model_max_length | Maximum length of model inputs. int |
num_attention_heads | Number of self-attention heads. int |
num_hidden_groups | Number of layer groups whose constituents share parameters. int |
num_hidden_layers | Number of hidden layers. int |
padding_idx | Index of the padding meta-token. int |
type_vocab_size | Type vocabulary size. int |
mixed_precision | Use mixed-precision training. bool |
grad_scaler_config | Configuration passed to the PyTorch gradient scaler. dict |
| CREATES | The model using the architecture. Model |
spacy-curated-transformers.BertTransformer.v1
Construct a BERT transformer model.
| Name | Description |
|---|---|
vocab_size | Vocabulary size. int |
with_spans | Callback that constructs a span generator model. Callable |
piece_encoder | The piece encoder to segment input tokens. Model |
attention_probs_dropout_prob | Dropout probability of the self-attention layers. float |
hidden_act | Activation used by the point-wise feed-forward layers. str |
hidden_dropout_prob | Dropout probability of the point-wise feed-forward and embedding layers. float |
hidden_width | Width of the final representations. int |
intermediate_width | Width of the intermediate projection layer in the point-wise feed-forward layer. int |
layer_norm_eps | Epsilon for layer normalization. float |
max_position_embeddings | Maximum length of position embeddings. int |
model_max_length | Maximum length of model inputs. int |
num_attention_heads | Number of self-attention heads. int |
num_hidden_layers | Number of hidden layers. int |
padding_idx | Index of the padding meta-token. int |
type_vocab_size | Type vocabulary size. int |
mixed_precision | Use mixed-precision training. bool |
grad_scaler_config | Configuration passed to the PyTorch gradient scaler. dict |
| CREATES | The model using the architecture. Model |
spacy-curated-transformers.CamembertTransformer.v1
Construct a CamemBERT transformer model.
| Name | Description |
|---|---|
vocab_size | Vocabulary size. int |
with_spans | Callback that constructs a span generator model. Callable |
piece_encoder | The piece encoder to segment input tokens. Model |
attention_probs_dropout_prob | Dropout probability of the self-attention layers. float |
hidden_act | Activation used by the point-wise feed-forward layers. str |
hidden_dropout_prob | Dropout probability of the point-wise feed-forward and embedding layers. float |
hidden_width | Width of the final representations. int |
intermediate_width | Width of the intermediate projection layer in the point-wise feed-forward layer. int |
layer_norm_eps | Epsilon for layer normalization. float |
max_position_embeddings | Maximum length of position embeddings. int |
model_max_length | Maximum length of model inputs. int |
num_attention_heads | Number of self-attention heads. int |
num_hidden_layers | Number of hidden layers. int |
padding_idx | Index of the padding meta-token. int |
type_vocab_size | Type vocabulary size. int |
mixed_precision | Use mixed-precision training. bool |
grad_scaler_config | Configuration passed to the PyTorch gradient scaler. dict |
| CREATES | The model using the architecture. Model |
spacy-curated-transformers.RobertaTransformer.v1
Construct a RoBERTa transformer model.
| Name | Description |
|---|---|
vocab_size | Vocabulary size. int |
with_spans | Callback that constructs a span generator model. Callable |
piece_encoder | The piece encoder to segment input tokens. Model |
attention_probs_dropout_prob | Dropout probability of the self-attention layers. float |
hidden_act | Activation used by the point-wise feed-forward layers. str |
hidden_dropout_prob | Dropout probability of the point-wise feed-forward and embedding layers. float |
hidden_width | Width of the final representations. int |
intermediate_width | Width of the intermediate projection layer in the point-wise feed-forward layer. int |
layer_norm_eps | Epsilon for layer normalization. float |
max_position_embeddings | Maximum length of position embeddings. int |
model_max_length | Maximum length of model inputs. int |
num_attention_heads | Number of self-attention heads. int |
num_hidden_layers | Number of hidden layers. int |
padding_idx | Index of the padding meta-token. int |
type_vocab_size | Type vocabulary size. int |
mixed_precision | Use mixed-precision training. bool |
grad_scaler_config | Configuration passed to the PyTorch gradient scaler. dict |
| CREATES | The model using the architecture. Model |
spacy-curated-transformers.XlmrTransformer.v1
Construct an XLM-RoBERTa transformer model.
| Name | Description |
|---|---|
vocab_size | Vocabulary size. int |
with_spans | Callback that constructs a span generator model. Callable |
piece_encoder | The piece encoder to segment input tokens. Model |
attention_probs_dropout_prob | Dropout probability of the self-attention layers. float |
hidden_act | Activation used by the point-wise feed-forward layers. str |
hidden_dropout_prob | Dropout probability of the point-wise feed-forward and embedding layers. float |
hidden_width | Width of the final representations. int |
intermediate_width | Width of the intermediate projection layer in the point-wise feed-forward layer. int |
layer_norm_eps | Epsilon for layer normalization. float |
max_position_embeddings | Maximum length of position embeddings. int |
model_max_length | Maximum length of model inputs. int |
num_attention_heads | Number of self-attention heads. int |
num_hidden_layers | Number of hidden layers. int |
padding_idx | Index of the padding meta-token. int |
type_vocab_size | Type vocabulary size. int |
mixed_precision | Use mixed-precision training. bool |
grad_scaler_config | Configuration passed to the PyTorch gradient scaler. dict |
| CREATES | The model using the architecture. Model |
spacy-curated-transformers.ScalarWeight.v1
Construct a model that accepts a list of transformer layer outputs and returns a weighted representation of the same.
| Name | Description |
|---|---|
num_layers | Number of transformer hidden layers. int |
dropout_prob | Dropout probability. float |
mixed_precision | Use mixed-precision training. bool |
grad_scaler_config | Configuration passed to the PyTorch gradient scaler. dict |
| CREATES | The model using the architecture. Model[ScalarWeightInT, ScalarWeightOutT] |
spacy-curated-transformers.TransformerLayersListener.v1
Construct a listener layer that communicates with one or more upstream Transformer components. This layer extracts the output of the last transformer layer and performs pooling over the individual pieces of each Doc token, returning their corresponding representations. The upstream name can either be the wildcard string '*', or the name of the Transformer component.
In almost all cases, the wildcard string will suffice as there'll only be one upstream Transformer component. But in certain situations, e.g. you have disjoint datasets for specific tasks, or you'd like to use a pretrained pipeline but a downstream task requires its own token representations, you could end up with more than one Transformer component in the pipeline.
| Name | Description |
|---|---|
layers | The number of layers produced by the upstream transformer component, excluding the embedding layer. int |
width | The width of the vectors produced by the upstream transformer component. int |
pooling | Model that is used to perform pooling over the piece representations. Model |
upstream_name | A string to identify the ‘upstream’ Transformer component to communicate with. str |
grad_factor | Factor to multiply gradients with. float |
| CREATES | A model that returns the relevant vectors from an upstream transformer component. Model[List[Doc], List[Floats2d]] |
spacy-curated-transformers.LastTransformerLayerListener.v1
Construct a listener layer that communicates with one or more upstream Transformer components. This layer extracts the output of the last transformer layer and performs pooling over the individual pieces of each Doc token, returning their corresponding representations. The upstream name should either be the wildcard string '*', or the name of the Transformer component.
In almost all cases, the wildcard string will suffice as there'll only be one upstream Transformer component. But in certain situations, e.g. you have disjoint datasets for specific tasks, or you'd like to use a pretrained pipeline but a downstream task requires its own token representations, you could end up with more than one Transformer component in the pipeline.
| Name | Description |
|---|---|
width | The width of the vectors produced by the upstream transformer component. int |
pooling | Model that is used to perform pooling over the piece representations. Model |
upstream_name | A string to identify the ‘upstream’ Transformer component to communicate with. str |
grad_factor | Factor to multiply gradients with. float |
| CREATES | A model that returns the relevant vectors from an upstream transformer component. Model[List[Doc], List[Floats2d]] |
spacy-curated-transformers.ScalarWeightingListener.v1
Construct a listener layer that communicates with one or more upstream Transformer components. This layer calculates a weighted representation of all transformer layer outputs and performs pooling over the individual pieces of each Doc token, returning their corresponding representations.
Requires its upstream Transformer components to return all layer outputs from their models. The upstream name should either be the wildcard string '*', or the name of the Transformer component.
In almost all cases, the wildcard string will suffice as there'll only be one upstream Transformer component. But in certain situations, e.g. you have disjoint datasets for specific tasks, or you'd like to use a pretrained pipeline but a downstream task requires its own token representations, you could end up with more than one Transformer component in the pipeline.
| Name | Description |
|---|---|
width | The width of the vectors produced by the upstream transformer component. int |
weighting | Model that is used to perform the weighting of the different layer outputs. Model |
pooling | Model that is used to perform pooling over the piece representations. Model |
upstream_name | A string to identify the ‘upstream’ Transformer component to communicate with. str |
grad_factor | Factor to multiply gradients with. float |
| CREATES | A model that returns the relevant vectors from an upstream transformer component. Model[List[Doc], List[Floats2d]] |
spacy-curated-transformers.BertWordpieceEncoder.v1
Construct a WordPiece piece encoder model that accepts a list of token sequences or documents and returns a corresponding list of piece identifiers. This encoder also splits each token on punctuation characters, as expected by most BERT models.
This model must be separately initialized using an appropriate loader.
spacy-curated-transformers.ByteBpeEncoder.v1
Construct a Byte-BPE piece encoder model that accepts a list of token sequences or documents and returns a corresponding list of piece identifiers.
This model must be separately initialized using an appropriate loader.
spacy-curated-transformers.CamembertSentencepieceEncoder.v1
Construct a SentencePiece piece encoder model that accepts a list of token sequences or documents and returns a corresponding list of piece identifiers with CamemBERT post-processing applied.
This model must be separately initialized using an appropriate loader.
spacy-curated-transformers.CharEncoder.v1
Construct a character piece encoder model that accepts a list of token sequences or documents and returns a corresponding list of piece identifiers.
This model must be separately initialized using an appropriate loader.
spacy-curated-transformers.SentencepieceEncoder.v1
Construct a SentencePiece piece encoder model that accepts a list of token sequences or documents and returns a corresponding list of piece identifiers.
This model must be separately initialized using an appropriate loader.
spacy-curated-transformers.WordpieceEncoder.v1
Construct a WordPiece piece encoder model that accepts a list of token sequences or documents and returns a corresponding list of piece identifiers. This encoder also splits each token on punctuation characters, as expected by most BERT models.
This model must be separately initialized using an appropriate loader.
spacy-curated-transformers.XlmrSentencepieceEncoder.v1
Construct a SentencePiece piece encoder model that accepts a list of token sequences or documents and returns a corresponding list of piece identifiers with XLM-RoBERTa post-processing applied.
This model must be separately initialized using an appropriate loader.
Pretraining architectures
The spacy pretrain command lets you initialize a Tok2Vec layer in your pipeline with information from raw text. To this end, additional layers are added to build a network for a temporary task that forces the Tok2Vec layer to learn something about sentence structure and word cooccurrence statistics. Two pretraining objectives are available, both of which are variants of the cloze task introduced for BERT by Devlin et al. (2018).
For more information, see the section on pretraining.
spacy.PretrainVectors.v1
Predict the word's vector from a static embeddings table as a pretraining objective for a Tok2Vec layer. To use this objective, make sure that the initialize.vectors section in the config refers to a model with static vectors.
| Name | Description |
|---|---|
maxout_pieces | The number of maxout pieces to use. Recommended values are 2 or 3. int |
hidden_size | Size of the hidden layer of the model. int |
loss | The loss function can be either "cosine" or "L2". We typically recommend to use "cosine". str |
| CREATES | A callable function that can create the Model, given the vocab of the pipeline and the tok2vec layer to pretrain. Callable[[Vocab,Model],Model] |
spacy.PretrainCharacters.v1
Predict some number of leading and trailing UTF-8 bytes as a pretraining objective for a Tok2Vec layer.
| Name | Description |
|---|---|
maxout_pieces | The number of maxout pieces to use. Recommended values are 2 or 3. int |
hidden_size | Size of the hidden layer of the model. int |
n_characters | The window of characters - e.g. if n_characters = 2, the model will try to predict the first two and last two characters of the word. int |
| CREATES | A callable function that can create the Model, given the vocab of the pipeline and the tok2vec layer to pretrain. Callable[[Vocab,Model],Model] |
Parser & NER architectures
spacy.TransitionBasedParser.v2
Build a transition-based parser model. Can apply to NER or dependency parsing. Transition-based parsing is an approach to structured prediction where the task of predicting the structure is mapped to a series of state transitions. You might find this tutorial helpful for background information. The neural network state prediction model consists of either two or three subnetworks:
- tok2vec: Map each token into a vector representation. This subnetwork is run once for each batch.
- lower: Construct a feature-specific vector for each (token, feature) pair. This is also run once for each batch. Constructing the state representation is then a matter of summing the component features and applying the non-linearity.
- upper (optional): A feed-forward network that predicts scores from the state representation. If not present, the output from the lower model is used as action scores directly.
| Name | Description |
|---|---|
tok2vec | Subnetwork to map tokens into vector representations. Model[List[Doc], List[Floats2d]] |
state_type | Which task to extract features for. Possible values are “ner” and “parser”. str |
extra_state_tokens | Whether to use an expanded feature set when extracting the state tokens. Slightly slower, but sometimes improves accuracy slightly. Defaults to False. bool |
hidden_width | The width of the hidden layer. int |
maxout_pieces | How many pieces to use in the state prediction layer. Recommended values are 1, 2 or 3. If 1, the maxout non-linearity is replaced with a Relu non-linearity if use_upper is True, and no non-linearity if False. int |
use_upper | Whether to use an additional hidden layer after the state vector in order to predict the action scores. It is recommended to set this to False for large pretrained models such as transformers, and True for smaller networks. The upper layer is computed on CPU, which becomes a bottleneck on larger GPU-based models, where it’s also less necessary. bool |
nO | The number of actions the model will predict between. Usually inferred from data at the beginning of training, or loaded from disk. int |
| CREATES | The model using the architecture. Model[List[Docs], List[List[Floats2d]]] |
TransitionBasedParser.v1 had the exact same signature, but the use_upper argument was True by default.
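A sketch of a full model block for an ner component using this architecture (all hyperparameter values are illustrative assumptions; the embedded tok2vec is the HashEmbedCNN layer described earlier):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "ner",
    config={
        "model": {
            "@architectures": "spacy.TransitionBasedParser.v2",
            "state_type": "ner",          # or "parser" for dependency parsing
            "extra_state_tokens": False,
            "hidden_width": 64,
            "maxout_pieces": 2,
            "use_upper": True,
            "nO": None,                   # inferred from the data
            "tok2vec": {
                "@architectures": "spacy.HashEmbedCNN.v2",
                "width": 96,
                "depth": 4,
                "embed_size": 2000,
                "window_size": 1,
                "maxout_pieces": 3,
                "subword_features": True,
                "pretrained_vectors": False,
            },
        }
    },
)
```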
Tagging architectures
spacy.Tagger.v2
Build a tagger model, using a provided token-to-vector component. The tagger model adds a linear layer with softmax activation to predict scores given the token vectors.
| Name | Description |
|---|---|
tok2vec | Subnetwork to map tokens into vector representations. Model[List[Doc], List[Floats2d]] |
nO | The number of tags to output. Inferred from the data if None. Optional[int] |
normalize | Normalize probabilities during inference. Defaults to False. bool |
| CREATES | The model using the architecture. Model[List[Doc], List[Floats2d]] |
The normalize argument was added in spacy.Tagger.v2. spacy.Tagger.v1 always normalizes probabilities during inference.
The other arguments are shared between all versions.
Text classification architectures
A text classification architecture needs to take a Doc as input and produce a score for each potential label class. Textcat challenges can be binary (e.g. sentiment analysis) or involve multiple possible labels. Multi-label challenges can either have mutually exclusive labels (each example has exactly one label), or multiple labels may be applicable at the same time.
As the properties of text classification problems can vary widely, we provide several different built-in architectures. It is recommended to experiment with different architectures and settings to determine what works best on your specific data and challenge.
spacy.TextCatEnsemble.v2
Stacked ensemble of a linear bag-of-words model and a neural network model. The neural network is built upon a Tok2Vec layer and uses attention. The setting for whether or not this model should cater for multi-label classification is taken from the linear model, where it is stored in model.attrs["multi_label"].
| Name | Description |
|---|---|
linear_model | The linear bag-of-words model. Model[List[Doc],Floats2d] |
tok2vec | The tok2vec layer to build the neural network upon. Model[List[Doc], List[Floats2d]] |
nO | Output dimension, determined by the number of different labels. If not set, the TextCategorizer component will set it when initialize is called. Optional[int] |
| CREATES | The model using the architecture. Model[List[Doc], Floats2d] |
TextCatEnsemble.v1 was functionally similar, but used an internal tok2vec instead of taking it as argument:
| Name | Description |
|---|---|
exclusive_classes | Whether or not categories are mutually exclusive. bool |
pretrained_vectors | Whether or not pretrained vectors will be used in addition to the feature vectors. bool |
width | Output dimension of the feature encoding step. int |
embed_size | Input dimension of the feature encoding step. int |
conv_depth | Depth of the tok2vec layer. int |
window_size | The number of contextual vectors to concatenate from the left and from the right. int |
ngram_size | Determines the maximum length of the n-grams in the BOW model. For instance, ngram_size=3 would give unigram, trigram and bigram features. int |
dropout | The dropout rate. float |
nO | Output dimension, determined by the number of different labels. If not set, the TextCategorizer component will set it when initialize is called. Optional[int] |
| CREATES | The model using the architecture. Model[List[Doc], Floats2d] |
spacy.TextCatBOW.v3
An n-gram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short.
| Name | Description |
|---|---|
exclusive_classes | Whether or not categories are mutually exclusive. bool |
ngram_size | Determines the maximum length of the n-grams in the BOW model. For instance, ngram_size=3 would give unigram, trigram and bigram features. int |
no_output_layer | Whether or not to add an output layer to the model (Softmax activation if exclusive_classes is True, else Logistic). bool |
length | The size of the weights vector. The length will be rounded up to the next power of two if it is not a power of two. Defaults to 262144. int |
nO | Output dimension, determined by the number of different labels. If not set, the TextCategorizer component will set it when initialize is called. Optional[int] |
| CREATES | The model using the architecture. Model[List[Doc], Floats2d] |
- TextCatBOW.v1 was not yet resizable. Since v2, new labels can be added to this component after training.
- TextCatBOW.v1 and TextCatBOW.v2 used an erroneous sparse linear layer that only used a small number of the allocated parameters.
- TextCatBOW.v1 and TextCatBOW.v2 did not have the length argument.
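A sketch of using this architecture for a mutually exclusive textcat component (the values shown are illustrative; use the textcat_multilabel component with exclusive_classes=False for non-exclusive labels):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "textcat",
    config={
        "model": {
            "@architectures": "spacy.TextCatBOW.v3",
            "exclusive_classes": True,
            "ngram_size": 1,
            "no_output_layer": False,
            "length": 262144,
            "nO": None,  # set by the TextCategorizer at initialization
        }
    },
)
```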
spacy.TextCatParametricAttention.v1
A neural network model built upon Tok2Vec that uses parametric attention to attend to tokens that are relevant to text classification.
| Name | Description |
|---|---|
tok2vec | The tok2vec layer to build the neural network upon. Model[List[Doc], List[Floats2d]] |
exclusive_classes | Whether or not categories are mutually exclusive. bool |
nO | Output dimension, determined by the number of different labels. If not set, the TextCategorizer component will set it when initialize is called. Optional[int] |
| CREATES | The model using the architecture. Model[List[Doc], Floats2d] |
spacy.TextCatReduce.v1
A classifier that pools token hidden representations of each Doc using first, max or mean reduction and then applies a classification layer. Reductions are concatenated when multiple reductions are used.
| Name | Description |
|---|---|
exclusive_classes | Whether or not categories are mutually exclusive. bool |
tok2vec | The tok2vec layer of the model. Model |
use_reduce_first | Pool by using the hidden representation of the first token of a Doc. bool |
use_reduce_last | Pool by using the hidden representation of the last token of a Doc. bool |
use_reduce_max | Pool by taking the maximum values of the hidden representations of a Doc. bool |
use_reduce_mean | Pool by taking the mean of all hidden representations of a Doc. bool |
nO | Output dimension, determined by the number of different labels. If not set, the TextCategorizer component will set it when initialize is called. Optional[int] |
| CREATES | The model using the architecture. Model[List[Doc], Floats2d] |
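A sketch of a mean-pooled classifier built from this architecture on top of the standard tok2vec layer (all values are illustrative assumptions):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "textcat_multilabel",
    config={
        "model": {
            "@architectures": "spacy.TextCatReduce.v1",
            "exclusive_classes": False,
            "use_reduce_first": False,
            "use_reduce_last": False,
            "use_reduce_max": False,
            "use_reduce_mean": True,
            "nO": None,
            "tok2vec": {
                "@architectures": "spacy.HashEmbedCNN.v2",
                "width": 96,
                "depth": 4,
                "embed_size": 2000,
                "window_size": 1,
                "maxout_pieces": 3,
                "subword_features": True,
                "pretrained_vectors": False,
            },
        }
    },
)
```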
Span classification architectures
spacy.SpanCategorizer.v1
Build a span categorizer model to power a SpanCategorizer component. The model requires: a token-to-vector model, a reducer model to map the sequence of vectors for each span down to a single vector, and a scorer model to map the vectors to probabilities.
| Name | Description |
|---|---|
tok2vec | The token-to-vector model. Model[List[Doc], List[Floats2d]] |
reducer | The reducer model. Model[Ragged,Floats2d] |
scorer | The scorer model. Model[Floats2d,Floats2d] |
| CREATES | The model using the architecture. Model[Tuple[List[Doc], Ragged], Floats2d] |
spacy.mean_max_reducer.v1
Reduce sequences by concatenating their mean and max pooled vectors, and then combine the concatenated vectors with a hidden layer.
| Name | Description |
|---|---|
hidden_size | The size of the hidden layer. int |
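A sketch combining the span categorizer architecture above with this reducer in a spancat component (the scorer layer spacy.LinearLogistic.v1 and all sizes are assumptions made for illustration):

```python
import spacy

nlp = spacy.blank("en")
nlp.add_pipe(
    "spancat",
    config={
        "spans_key": "sc",
        "model": {
            "@architectures": "spacy.SpanCategorizer.v1",
            "tok2vec": {
                "@architectures": "spacy.HashEmbedCNN.v2",
                "width": 96,
                "depth": 4,
                "embed_size": 2000,
                "window_size": 1,
                "maxout_pieces": 3,
                "subword_features": True,
                "pretrained_vectors": False,
            },
            # Map the sequence of vectors for each span down to a single vector.
            "reducer": {
                "@architectures": "spacy.mean_max_reducer.v1",
                "hidden_size": 128,
            },
            # Map each span vector to label probabilities (assumed default scorer).
            "scorer": {"@layers": "spacy.LinearLogistic.v1"},
        },
    },
)
```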
Entity linking architectures
An EntityLinker component disambiguates textual mentions (tagged as named entities) to unique identifiers, grounding the named entities into the "real world". This requires 3 main components:
- A KnowledgeBase (KB) holding the unique identifiers, potential synonyms and prior probabilities.
- A candidate generation step to produce a set of likely identifiers, given a certain textual mention.
- A machine learning Model that picks the most plausible ID from the set of candidates.
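To make the three ingredients above concrete, here is a hedged sketch of building a tiny in-memory knowledge base; the entity IDs, frequencies, probabilities and vectors are invented for illustration.

```python
import spacy
from spacy.kb import InMemoryLookupKB

nlp = spacy.blank("en")
# The vector length must match the nO dimension of the EntityLinker model.
kb = InMemoryLookupKB(vocab=nlp.vocab, entity_vector_length=64)

# Unique identifiers with prior frequencies and entity vectors.
kb.add_entity(entity="Q42", freq=20, entity_vector=[0.1] * 64)
kb.add_entity(entity="Q463035", freq=5, entity_vector=[0.2] * 64)

# Aliases map surface mentions to candidate entities with prior probabilities.
kb.add_alias(alias="Douglas Adams", entities=["Q42", "Q463035"], probabilities=[0.8, 0.1])
```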
spacy.EntityLinker.v2
The EntityLinker model architecture is a Thinc Model with a Linear output layer.
| Name | Description |
|---|---|
tok2vec | The tok2vec layer of the model. Model |
nO | Output dimension, determined by the length of the vectors encoding each entity in the KB. If the nO dimension is not set, the entity linking component will set it when initialize is called. Optional[int] |
| CREATES | The model using the architecture. Model[List[Doc], Floats2d] |
spacy.EmptyKB.v1
A function that creates an empty KnowledgeBase from a Vocab instance.
| Name | Description |
|---|---|
entity_vector_length | The length of the vectors encoding each entity in the KB. Defaults to 64. int |
spacy.EmptyKB.v2
A function that creates an empty KnowledgeBase from a Vocab instance. This is the default when a new entity linker component is created. It returns a Callable[[Vocab, int], InMemoryLookupKB].
spacy.KBFromFile.v1
A function that reads an existing KnowledgeBase from file.
| Name | Description |
|---|---|
kb_path | The location of the KB that was stored to file. Path |
spacy.CandidateGenerator.v1
A function that takes as input a KnowledgeBase and a Span object denoting a named entity, and returns a list of plausible Candidate objects. The default CandidateGenerator uses the text of a mention to find its potential aliases in the KnowledgeBase. Note that this function is case-dependent.
spacy.CandidateBatchGenerator.v1
A function that takes as input a KnowledgeBase and an Iterable of Span objects denoting named entities, and returns a list of plausible Candidate objects per specified Span. The default CandidateBatchGenerator uses the text of a mention to find its potential aliases in the KnowledgeBase. Note that this function is case-dependent.
Coreference resolution (experimental)
A CoreferenceResolver component identifies tokens that refer to the same entity, and a SpanResolver component infers spans from single tokens. Together these components can be used to reproduce traditional coreference models. The SpanResolver can be omitted if working with only token-level clusters is acceptable.
spacy-experimental.Coref.v1 (experimental)
The Coref model architecture is a Thinc Model.
| Name | Description |
|---|---|
tok2vec | The tok2vec layer of the model. Model |
distance_embedding_size | A representation of the distance between candidates. int |
dropout | The dropout to use internally. Unlike some Thinc models, this has separate dropout for the internal PyTorch layers. float |
hidden_size | Size of the main internal layers. int |
depth | Depth of the internal network. int |
antecedent_limit | How many candidate antecedents to keep after rough scoring. This has a significant effect on memory usage. Typical values would be 50 to 200, or higher for very long documents. int |
antecedent_batch_size | Internal batch size. int |
| CREATES | The model using the architecture. Model[List[Doc], Floats2d] |
spacy-experimental.SpanResolver.v1 (experimental)
The SpanResolver model architecture is a Thinc Model. Note that MentionClusters is List[List[Tuple[int, int]]].
| Name | Description |
|---|---|
tok2vec | The tok2vec layer of the model. Model |
hidden_size | Size of the main internal layers. int |
distance_embedding_size | A representation of the distance between two candidates. int |
conv_channels | The number of channels in the internal CNN. int |
window_size | The number of neighboring tokens to consider in the internal CNN. 1 means consider one token on each side. int |
max_distance | The longest possible length of a predicted span. int |
prefix | The prefix that indicates spans to use for input data. string |
| CREATES | The model using the architecture. Model[List[Doc], List[MentionClusters]] |