Overview

Library architecture

The central data structures in spaCy are the Language class, the Vocab and the Doc object. The Language class is used to process a text and turn it into a Doc object; it is typically stored as a variable called nlp. The Doc object owns the sequence of tokens and all their annotations. By centralizing strings, word vectors and lexical attributes in the Vocab, we avoid storing multiple copies of this data. This saves memory and ensures there is a single source of truth.
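As a minimal illustration of how these objects relate (using a blank English pipeline, so no trained model download is required):

```python
import spacy

# Build a blank English pipeline: just the tokenizer, no trained components.
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a U.K. startup.")

# The Doc owns the token sequence; strings and lexical attributes live
# once in the Vocab, which the Language and the Doc share.
assert doc.vocab is nlp.vocab
print([token.text for token in doc])
```

Because the Vocab is shared, two Doc objects produced by the same nlp never store duplicate copies of the same string data.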

Text annotations are also designed to follow the single-source-of-truth principle: the Doc object owns the data, while Span and Token are views that point into it. The Doc object is constructed by the Tokenizer and then modified in place by the components of the pipeline. The Language object coordinates these components: it takes raw text, sends it through the pipeline and returns an annotated document. It also manages training and serialization.
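The view relationship described above can be sketched like this: a Span or Token is backed by the Doc's data rather than being a copy of it.

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("San Francisco considers banning sidewalk delivery robots")

span = doc[0:2]   # Span: a slice of the Doc, backed by the same data
token = doc[0]    # Token: a single-position view

# Both views point back at the Doc that owns the annotations.
assert span.doc is doc
assert token.doc is doc
print(span.text)
```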

[Figure: Library architecture]

Container objects

Doc: A container for accessing linguistic annotations.
DocBin: A collection of Doc objects for efficient binary serialization. Also used for training data.
Example: A collection of training annotations, containing two Doc objects: the reference data and the predictions.
Language: Processing class that turns text into Doc objects. Different languages implement their own subclasses of it. The variable is typically called nlp.
Lexeme: An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc.
Span: A slice from a Doc object.
SpanGroup: A named collection of spans belonging to a Doc.
Token: An individual token, i.e. a word, punctuation symbol, whitespace, etc.
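A short sketch of two of the less obvious containers, Lexeme and SpanGroup (the span-group key "cities" is an arbitrary name chosen for illustration):

```python
import spacy
from spacy.tokens import Span

nlp = spacy.blank("en")
doc = nlp("Berlin is a city in Germany")

# Lexeme: a context-independent vocabulary entry. Unlike a Token,
# it has no part-of-speech tag or dependency label.
lexeme = nlp.vocab["Berlin"]
assert lexeme.is_alpha

# SpanGroup: a named collection of spans stored on the Doc.
doc.spans["cities"] = [Span(doc, 0, 1), Span(doc, 5, 6)]
print([span.text for span in doc.spans["cities"]])
```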

Processing pipeline

The processing pipeline consists of one or more pipeline components that are called on the Doc in order. The tokenizer runs before the components. Pipeline components can be added using Language.add_pipe. They can contain a statistical model and trained weights, or only make rule-based modifications to the Doc. spaCy provides a range of built-in components for different language processing tasks and also allows adding custom components.
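The steps above can be sketched as follows; "token_counter" is a hypothetical custom component name used only for this example.

```python
import spacy
from spacy.language import Language

@Language.component("token_counter")  # hypothetical rule-based component
def token_counter(doc):
    # Components receive the Doc, may modify it in place, and return it.
    doc.user_data["n_tokens"] = len(doc)
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")             # built-in rule-based component
nlp.add_pipe("token_counter", last=True)

doc = nlp("This is a sentence. This is another one.")
print(nlp.pipe_names)
print(doc.user_data["n_tokens"])
```

The tokenizer is not part of pipe_names: it always runs first, before any component.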

[Figure: The processing pipeline]
AttributeRuler: Set token attributes using matcher rules.
DependencyParser: Predict syntactic dependencies.
EditTreeLemmatizer: Predict base forms of words.
EntityLinker: Disambiguate named entities to nodes in a knowledge base.
EntityRecognizer: Predict named entities, e.g. persons or products.
EntityRuler: Add entity spans to the Doc using token-based rules or exact phrase matches.
Lemmatizer: Determine the base forms of words using rules and lookups.
Morphologizer: Predict morphological features and coarse-grained part-of-speech tags.
SentenceRecognizer: Predict sentence boundaries.
Sentencizer: Implement rule-based sentence boundary detection that doesn't require the dependency parse.
Tagger: Predict part-of-speech tags.
TextCategorizer: Predict categories or labels over the whole document.
Tok2Vec: Apply a "token-to-vector" model and set its outputs.
Tokenizer: Segment raw text and create Doc objects from the words.
TrainablePipe: Class that all trainable pipeline components inherit from.
Transformer: Use a transformer model and set its outputs.
Other functions: Automatically apply something to the Doc, e.g. to merge spans of tokens.

Matchers

Matchers help you find and extract information from Doc objects, based on match patterns describing the sequences you're looking for. A matcher operates on a Doc and gives you access to the matched tokens in context.

DependencyMatcher: Match sequences of tokens based on dependency trees using Semgrex operators.
Matcher: Match sequences of tokens based on pattern rules, similar to regular expressions.
PhraseMatcher: Match sequences of tokens based on phrases.
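A minimal Matcher sketch: the pattern below matches the token "hello" (case-insensitive) followed by a punctuation token, and each match comes back as token indices into the Doc.

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One pattern: "hello" in any casing, followed by punctuation.
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}]
matcher.add("Greeting", [pattern])

doc = nlp("Hello, world! Hello world")
matches = matcher(doc)
for match_id, start, end in matches:
    # start/end are token indices, so the match is a Span in context.
    print(doc[start:end].text)
```

Only the first "Hello" matches here, because the second is followed by a word rather than punctuation.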

Other classes

Corpus: Class for managing annotated corpora for training and evaluation data.
KnowledgeBase: Abstract base class for storage and retrieval of data for entity linking.
InMemoryLookupKB: Implementation of KnowledgeBase storing all data in memory.
Candidate: Object associating a textual mention with a specific entity contained in a KnowledgeBase.
Lookups: Container for convenient access to large lookup tables and dictionaries.
MorphAnalysis: A morphological analysis.
Morphology: Store morphological analyses and map them to and from hash values.
Scorer: Compute evaluation scores.
StringStore: Map strings to and from hash values.
Vectors: Container class for vector data keyed by string.
Vocab: The shared vocabulary that stores strings and gives you access to Lexeme objects.
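For instance, the StringStore maps strings to 64-bit hashes and back, which is how the shared Vocab avoids storing the same string more than once:

```python
import spacy

nlp = spacy.blank("en")

# Adding a string returns its hash; the mapping works in both directions.
coffee_hash = nlp.vocab.strings.add("coffee")
assert nlp.vocab.strings[coffee_hash] == "coffee"
assert nlp.vocab.strings["coffee"] == coffee_hash
print(coffee_hash)
```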