Library architecture
The central data structures in spaCy are the Language class, the Vocab and the Doc object. The Language class is used to process a text and turn it into a Doc object; it is typically stored as a variable called nlp. The Doc object owns the sequence of tokens and all their annotations. By centralizing strings, word vectors and lexical attributes in the Vocab, we avoid storing multiple copies of this data. This saves memory and ensures there is a single source of truth.
Text annotations are designed on the same single-source-of-truth principle: the Doc object owns the data, while Span and Token are views that point into it. The Doc object is constructed by the Tokenizer and then modified in place by the components of the processing pipeline. The Language object coordinates these components: it takes raw text, sends it through the pipeline and returns an annotated document. It also orchestrates training and serialization.
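The relationship between these objects can be sketched in a few lines. This is a minimal example using a blank English pipeline (no trained components required); the sample sentence is arbitrary:

```python
import spacy

# The Language object (conventionally called nlp) turns text into a Doc
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a U.K. startup.")

# Token and Span are views into the Doc's data, not copies of it
token = doc[0]    # the first Token
span = doc[0:2]   # a Span covering the first two tokens

# Strings and lexical attributes live in the shared Vocab,
# so the Doc and the Language object reference the same store
assert doc.vocab is nlp.vocab
```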
Container objects
| Name | Description |
|---|---|
| Doc | A container for accessing linguistic annotations. |
| DocBin | A collection of Doc objects for efficient binary serialization. Also used for training data. |
| Example | A collection of training annotations, containing two Doc objects: the reference data and the predictions. |
| Language | Processing class that turns text into Doc objects. Different languages implement their own subclasses of it. The variable is typically called nlp. |
| Lexeme | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. |
| Span | A slice from a Doc object. |
| SpanGroup | A named collection of spans belonging to a Doc. |
| Token | An individual token — i.e. a word, punctuation symbol, whitespace, etc. |
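As one illustration of the container objects above, DocBin packs many Doc objects into a single compact byte string and restores them against a shared Vocab. A minimal sketch with two throwaway texts:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs = [nlp(text) for text in ["First text.", "Second text."]]

# DocBin serializes a collection of Docs into one byte string
doc_bin = DocBin(docs=docs)
data = doc_bin.to_bytes()

# Restore the Docs later (e.g. in another process), reusing the shared Vocab
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
```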
Processing pipeline
The processing pipeline consists of one or more pipeline components that are called on the Doc in order. The tokenizer runs before the components. Pipeline components can be added using Language.add_pipe. They can contain a statistical model and trained weights, or only make rule-based modifications to the Doc. spaCy provides a range of built-in components for different language processing tasks and also allows adding custom components.
| Name | Description |
|---|---|
| AttributeRuler | Set token attributes using matcher rules. |
| DependencyParser | Predict syntactic dependencies. |
| EditTreeLemmatizer | Predict base forms of words. |
| EntityLinker | Disambiguate named entities to nodes in a knowledge base. |
| EntityRecognizer | Predict named entities, e.g. persons or products. |
| EntityRuler | Add entity spans to the Doc using token-based rules or exact phrase matches. |
| Lemmatizer | Determine the base forms of words using rules and lookups. |
| Morphologizer | Predict morphological features and coarse-grained part-of-speech tags. |
| SentenceRecognizer | Predict sentence boundaries. |
| Sentencizer | Implement rule-based sentence boundary detection that doesn't require the dependency parse. |
| Tagger | Predict part-of-speech tags. |
| TextCategorizer | Predict categories or labels over the whole document. |
| Tok2Vec | Apply a "token-to-vector" model and set its outputs. |
| Tokenizer | Segment raw text and create Doc objects from the words. |
| TrainablePipe | Class that all trainable pipeline components inherit from. |
| Transformer | Use a transformer model and set its outputs. |
| Other functions | Automatically apply something to the Doc, e.g. to merge spans of tokens. |
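Adding built-in and custom components with Language.add_pipe can be sketched as follows. The component name "debug_length" and its behavior are made up for illustration:

```python
import spacy
from spacy.language import Language

# Register a minimal custom component: it receives the Doc,
# can inspect or modify it, and must return it
@Language.component("debug_length")
def debug_length(doc):
    print(f"{len(doc)} tokens")
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")    # built-in rule-based component
nlp.add_pipe("debug_length")   # custom component, appended at the end

doc = nlp("This is a sentence. This is another.")
```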
Matchers
Matchers help you find and extract information from Doc objects, based on match patterns describing the sequences you're looking for. A matcher operates on a Doc and gives you access to the matched tokens in context.
| Name | Description |
|---|---|
| DependencyMatcher | Match sequences of tokens based on dependency trees using Semgrex operators. |
| Matcher | Match sequences of tokens, based on pattern rules, similar to regular expressions. |
| PhraseMatcher | Match sequences of tokens based on phrases. |
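A minimal token-based Matcher sketch: the pattern below matches "hello" followed by an optional punctuation token and "world" (the pattern and sample text are illustrative):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Each dict describes one token; OP "?" makes the punctuation optional
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
# Each match is a (match_id, start, end) tuple of token indices,
# so the matched tokens stay accessible in their Doc context
matches = matcher(doc)
matched_texts = {doc[start:end].text for _, start, end in matches}
```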
Other classes
| Name | Description |
|---|---|
| Corpus | Class for managing annotated corpora for training and evaluation data. |
| KnowledgeBase | Abstract base class for storage and retrieval of data for entity linking. |
| InMemoryLookupKB | Implementation of KnowledgeBase storing all data in memory. |
| Candidate | Object associating a textual mention with a specific entity contained in a KnowledgeBase. |
| Lookups | Container for convenient access to large lookup tables and dictionaries. |
| MorphAnalysis | A morphological analysis. |
| Morphology | Store morphological analyses and map them to and from hash values. |
| Scorer | Compute evaluation scores. |
| StringStore | Map strings to and from hash values. |
| Vectors | Container class for vector data keyed by string. |
| Vocab | The shared vocabulary that stores strings and gives you access to Lexeme objects. |
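To make the StringStore, Vocab and Lexeme relationship concrete, here is a small sketch (the word "coffee" is an arbitrary example):

```python
import spacy

nlp = spacy.blank("en")

# The StringStore maps strings to 64-bit hash values and back
h = nlp.vocab.strings.add("coffee")

# Looking up a string in the Vocab gives a Lexeme: a context-independent
# word type, with no part-of-speech tag or dependency information
lexeme = nlp.vocab["coffee"]
```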