Legacy

Legacy functions and architectures

Archived implementations available through spacy-legacy

The spacy-legacy package includes outdated registered functions and architectures. It is installed automatically as a dependency of spaCy, and provides backwards compatibility for archived functions that projects may still be using.

You can find the detailed documentation of each such legacy function on this page.

Architectures

These functions are available from @spacy.registry.architectures.

spacy.Tok2Vec.v1

The spacy.Tok2Vec.v1 architecture expected an encode model of type Model[Floats2d, Floats2d], such as spacy.MaxoutWindowEncoder.v1 or spacy.MishWindowEncoder.v1.

Construct a tok2vec model out of two subnetworks: one for embedding and one for encoding. See the "Embed, Encode, Attend, Predict" blog post for background.

Name | Description
embed | Embed tokens into context-independent word vector representations. For example, CharacterEmbed or MultiHashEmbed. Model[List[Doc], List[Floats2d]]
encode | Encode context into the embeddings, using an architecture such as a CNN, BiLSTM or transformer. For example, MaxoutWindowEncoder.v1. Model[Floats2d, Floats2d]

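Like other registered architectures, spacy.Tok2Vec.v1 is normally referenced from the training config rather than called directly. The block below is a minimal sketch, not an official example: the MultiHashEmbed.v1 parameter names and all concrete values are assumptions to be adapted to your pipeline.

```ini
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v1"

[components.tok2vec.model.embed]
# Any layer of type Model[List[Doc], List[Floats2d]] fits the embed slot;
# these MultiHashEmbed.v1 parameter names and values are assumptions.
@architectures = "spacy.MultiHashEmbed.v1"
width = 96
rows = 2000
also_embed_subwords = true
also_use_static_vectors = false

[components.tok2vec.model.encode]
# The encode slot expects a Model[Floats2d, Floats2d] layer such as
# spacy.MaxoutWindowEncoder.v1; values follow the recommendations
# documented in the table below.
@architectures = "spacy.MaxoutWindowEncoder.v1"
width = 96
window_size = 1
maxout_pieces = 3
depth = 4
```
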
spacy.MaxoutWindowEncoder.v1

The spacy.MaxoutWindowEncoder.v1 architecture produced a model of type Model[Floats2d, Floats2d]. Since spacy.MaxoutWindowEncoder.v2, the output type has been changed to Model[List[Floats2d], List[Floats2d]].

Encode context using convolutions with maxout activation, layer normalization and residual connections.

Name | Description
width | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between 64 and 300. int
window_size | The number of words to concatenate around each token to construct the convolution. Recommended value is 1. int
maxout_pieces | The number of maxout pieces to use. Recommended values are 2 or 3. int
depth | The number of convolutional layers. Recommended value is 4. int

spacy.MishWindowEncoder.v1

The spacy.MishWindowEncoder.v1 architecture produced a model of type Model[Floats2d, Floats2d]. Since spacy.MishWindowEncoder.v2, the output type has been changed to Model[List[Floats2d], List[Floats2d]].

Encode context using convolutions with mish activation, layer normalization and residual connections.

Name | Description
width | The input and output width. These are required to be the same, to allow residual connections. This value will be determined by the width of the inputs. Recommended values are between 64 and 300. int
window_size | The number of words to concatenate around each token to construct the convolution. Recommended value is 1. int
depth | The number of convolutional layers. Recommended value is 4. int

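In a config, spacy.MishWindowEncoder.v1 is a drop-in replacement for spacy.MaxoutWindowEncoder.v1 in the encode slot of spacy.Tok2Vec.v1; it simply has no maxout_pieces setting. A minimal, illustrative block (values are assumptions):

```ini
[components.tok2vec.model.encode]
# Same slot as the maxout encoder above, minus maxout_pieces.
@architectures = "spacy.MishWindowEncoder.v1"
width = 96
window_size = 1
depth = 4
```
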
spacy.HashEmbedCNN.v1

Identical to spacy.HashEmbedCNN.v2 except using spacy.StaticVectors.v1 if vectors are included.

spacy.MultiHashEmbed.v1

Identical to spacy.MultiHashEmbed.v2 except with spacy.StaticVectors.v1 if vectors are included.

spacy.CharacterEmbed.v1

Identical to spacy.CharacterEmbed.v2 except using spacy.StaticVectors.v1 if vectors are included.

spacy.TextCatEnsemble.v1

The spacy.TextCatEnsemble.v1 architecture built the tok2vec and linear_model internally. Since spacy.TextCatEnsemble.v2, this has been refactored so that the TextCatEnsemble takes these two sublayers as input arguments.

Stacked ensemble of a bag-of-words model and a neural network model. The neural network has an internal CNN Tok2Vec layer and uses attention.

Name | Description
exclusive_classes | Whether or not categories are mutually exclusive. bool
pretrained_vectors | Whether or not pretrained vectors will be used in addition to the feature vectors. bool
width | Output dimension of the feature encoding step. int
embed_size | Input dimension of the feature encoding step. int
conv_depth | Depth of the tok2vec layer. int
window_size | The number of contextual vectors to concatenate from the left and from the right. int
ngram_size | Determines the maximum length of the n-grams in the BOW model. For instance, ngram_size=3 would give unigram, bigram and trigram features. int
dropout | The dropout rate. float
nO | Output dimension, determined by the number of different labels. If not set, the TextCategorizer component will set it when initialize is called. Optional[int]

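Because spacy.TextCatEnsemble.v1 built its tok2vec and linear_model internally, every setting was passed to the architecture itself rather than to sublayers. The sketch below only uses the parameters listed in the table above; the concrete values are illustrative assumptions:

```ini
[components.textcat.model]
# All settings live directly on the v1 ensemble; there are no sublayer blocks.
@architectures = "spacy.TextCatEnsemble.v1"
exclusive_classes = false
pretrained_vectors = false
width = 64
embed_size = 2000
conv_depth = 2
window_size = 1
ngram_size = 1
dropout = 0.0
nO = null
```
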
spacy.TextCatCNN.v1

Since spacy.TextCatCNN.v2, this architecture has become resizable, which means that you can add labels to a previously trained textcat. TextCatCNN v1 did not yet support that. TextCatCNN has been replaced by the more general TextCatReduce layer. TextCatCNN is identical to TextCatReduce with the settings use_reduce_mean=true, use_reduce_first=false, use_reduce_last=false and use_reduce_max=false.

A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster.

Name | Description
exclusive_classes | Whether or not categories are mutually exclusive. bool
tok2vec | The tok2vec layer of the model. Model
nO | Output dimension, determined by the number of different labels. If not set, the TextCategorizer component will set it when initialize is called. Optional[int]

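Unlike the v1 ensemble, TextCatCNN receives its tok2vec layer as a sublayer of the model block. In the sketch below, the spacy.HashEmbedCNN.v1 parameter names and values are assumptions, since they are not documented on this page:

```ini
[components.textcat.model]
@architectures = "spacy.TextCatCNN.v1"
exclusive_classes = false
nO = null

[components.textcat.model.tok2vec]
# These HashEmbedCNN.v1 parameter names and values are assumptions.
@architectures = "spacy.HashEmbedCNN.v1"
pretrained_vectors = null
width = 96
depth = 4
embed_size = 2000
window_size = 1
maxout_pieces = 3
subword_features = true
```
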
spacy.TextCatCNN.v2

A neural network model where token vectors are calculated using a CNN. The vectors are mean pooled and used as features in a feed-forward network. This architecture is usually less accurate than the ensemble, but runs faster.

TextCatCNN has been replaced by the more general TextCatReduce layer. TextCatCNN is identical to TextCatReduce with the settings use_reduce_mean=true, use_reduce_first=false, use_reduce_last=false and use_reduce_max=false.

Name | Description
exclusive_classes | Whether or not categories are mutually exclusive. bool
tok2vec | The tok2vec layer of the model. Model
nO | Output dimension, determined by the number of different labels. If not set, the TextCategorizer component will set it when initialize is called. Optional[int]

TextCatCNN.v1 had the exact same signature, but was not yet resizable. Since v2, new labels can be added to this component, even after training.

spacy.TextCatBOW.v1

Since spacy.TextCatBOW.v2, this architecture has become resizable, which means that you can add labels to a previously trained textcat. TextCatBOW v1 did not yet support that. Versions of this model before spacy.TextCatBOW.v3 used an erroneous sparse linear layer that only used a small fraction of the allocated parameters.

An n-gram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short.

Name | Description
exclusive_classes | Whether or not categories are mutually exclusive. bool
ngram_size | Determines the maximum length of the n-grams in the BOW model. For instance, ngram_size=3 would give unigram, bigram and trigram features. int
no_output_layer | Whether or not to add an output layer to the model (Softmax activation if exclusive_classes is True, else Logistic). bool
nO | Output dimension, determined by the number of different labels. If not set, the TextCategorizer component will set it when initialize is called. Optional[int]

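The BOW model takes no sublayers, so the whole model block consists only of the parameters from the table above. A minimal sketch; the values shown are illustrative:

```ini
[components.textcat.model]
@architectures = "spacy.TextCatBOW.v1"
exclusive_classes = false
ngram_size = 1
no_output_layer = false
nO = null
```
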
spacy.TextCatBOW.v2

Versions of this model before spacy.TextCatBOW.v3 used an erroneous sparse linear layer that only used a small fraction of the allocated parameters.

An n-gram "bag-of-words" model. This architecture should run much faster than the others, but may not be as accurate, especially if texts are short.

Name | Description
exclusive_classes | Whether or not categories are mutually exclusive. bool
ngram_size | Determines the maximum length of the n-grams in the BOW model. For instance, ngram_size=3 would give unigram, bigram and trigram features. int
no_output_layer | Whether or not to add an output layer to the model (Softmax activation if exclusive_classes is True, else Logistic). bool
nO | Output dimension, determined by the number of different labels. If not set, the TextCategorizer component will set it when initialize is called. Optional[int]

spacy.TransitionBasedParser.v1

Identical to spacy.TransitionBasedParser.v2, except the use_upper argument was set to true by default.

Layers

These functions are available from @spacy.registry.layers.

spacy.StaticVectors.v1

Identical to spacy.StaticVectors.v2 except for the handling of tokens without vectors.

Loggers

These functions are available from @spacy.registry.loggers.

spacy.ConsoleLogger.v1

Writes the results of a training step to the console in a tabular format.

Note that the cumulative loss keeps increasing within one epoch, but should start decreasing across epochs.

Name | Description
progress_bar | Whether the logger should print the progress bar. bool

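Loggers are selected in the [training.logger] block of the config. A minimal sketch using this legacy logger, with the one parameter documented above:

```ini
[training.logger]
@loggers = "spacy.ConsoleLogger.v1"
progress_bar = false
```
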
Logging utilities for spaCy are implemented in the spacy-loggers repo, and the functions are typically available from @spacy.registry.loggers.

More documentation can be found in that repo's README.