Library architecture
The central data structures in spaCy are the Language class, the Vocab and the Doc object. The Language class is used to process a text and turn it into a Doc object; it is typically stored as a variable called nlp. The Doc object owns the sequence of tokens and all their annotations. By centralizing strings, word vectors and lexical attributes in the Vocab, we avoid storing multiple copies of this data. This saves memory and ensures there is a single source of truth.
Text annotations are designed on the same single-source-of-truth principle: the Doc object owns the data, while Span and Token are views that point into it. The Doc object is constructed by the Tokenizer and then modified in place by the components of the processing pipeline. The Language object coordinates these components: it takes raw text, sends it through the pipeline and returns an annotated document. It also orchestrates training and serialization.
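The relationship between these objects can be sketched in a few lines. This is a minimal example using a blank English pipeline (no trained components required); the sample sentence is arbitrary:

```python
import spacy

# The Language object (conventionally called nlp) turns text into a Doc
nlp = spacy.blank("en")
doc = nlp("Apple is looking at buying a U.K. startup.")

# Token and Span are views into the Doc's data, not copies of it
token = doc[0]    # the first Token
span = doc[0:2]   # a Span covering the first two tokens

# Strings and lexical attributes live in the shared Vocab,
# so the Doc and the Language object reference the same store
assert doc.vocab is nlp.vocab
```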
Container objects
| Name | Description |
|---|---|
| Doc | A container for accessing linguistic annotations. |
| DocBin | A collection of Doc objects for efficient binary serialization. Also used for training data. |
| Example | A collection of training annotations, containing two Doc objects: the reference data and the predictions. |
| Language | Processing class that turns text into Doc objects. Different languages implement their own subclasses of it. The variable is typically called nlp. |
| Lexeme | An entry in the vocabulary. It's a word type with no context, as opposed to a word token. It therefore has no part-of-speech tag, dependency parse etc. |
| Span | A slice from a Doc object. |
| SpanGroup | A named collection of spans belonging to a Doc. |
| Token | An individual token — i.e. a word, punctuation symbol, whitespace, etc. |
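As one illustration of the container objects above, DocBin packs many Doc objects into a single compact byte string and restores them against a shared Vocab. A minimal sketch with two throwaway texts:

```python
import spacy
from spacy.tokens import DocBin

nlp = spacy.blank("en")
docs = [nlp(text) for text in ["First text.", "Second text."]]

# DocBin serializes a collection of Docs into one byte string
doc_bin = DocBin(docs=docs)
data = doc_bin.to_bytes()

# Restore the Docs later (e.g. in another process), reusing the shared Vocab
restored = list(DocBin().from_bytes(data).get_docs(nlp.vocab))
```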
Processing pipeline
The processing pipeline consists of one or more pipeline components that are called on the Doc in order. The tokenizer runs before the components. Pipeline components can be added using Language.add_pipe. They can contain a statistical model and trained weights, or only make rule-based modifications to the Doc. spaCy provides a range of built-in components for different language processing tasks and also allows adding custom components.
| Name | Description |
|---|---|
| AttributeRuler | Set token attributes using matcher rules. |
| DependencyParser | Predict syntactic dependencies. |
| EditTreeLemmatizer | Predict base forms of words. |
| EntityLinker | Disambiguate named entities to nodes in a knowledge base. |
| EntityRecognizer | Predict named entities, e.g. persons or products. |
| EntityRuler | Add entity spans to the Doc using token-based rules or exact phrase matches. |
| Lemmatizer | Determine the base forms of words using rules and lookups. |
| Morphologizer | Predict morphological features and coarse-grained part-of-speech tags. |
| SentenceRecognizer | Predict sentence boundaries. |
| Sentencizer | Implement rule-based sentence boundary detection that doesn't require the dependency parse. |
| Tagger | Predict part-of-speech tags. |
| TextCategorizer | Predict categories or labels over the whole document. |
| Tok2Vec | Apply a "token-to-vector" model and set its outputs. |
| Tokenizer | Segment raw text and create Doc objects from the words. |
| TrainablePipe | Class that all trainable pipeline components inherit from. |
| Transformer | Use a transformer model and set its outputs. |
| Other functions | Automatically apply something to the Doc, e.g. to merge spans of tokens. |
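Adding built-in and custom components with Language.add_pipe can be sketched as follows. The component name "debug_length" and its behavior are made up for illustration:

```python
import spacy
from spacy.language import Language

# Register a minimal custom component: it receives the Doc,
# can inspect or modify it, and must return it
@Language.component("debug_length")
def debug_length(doc):
    print(f"{len(doc)} tokens")
    return doc

nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")    # built-in rule-based component
nlp.add_pipe("debug_length")   # custom component, appended at the end

doc = nlp("This is a sentence. This is another.")
```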
Matchers
Matchers help you find and extract information from Doc objects, based on match patterns describing the sequences you're looking for. A matcher operates on a Doc and gives you access to the matched tokens in context.
| Name | Description |
|---|---|
| DependencyMatcher | Match sequences of tokens based on dependency trees using Semgrex operators. |
| Matcher | Match sequences of tokens, based on pattern rules, similar to regular expressions. |
| PhraseMatcher | Match sequences of tokens based on phrases. |
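A minimal token-based Matcher sketch: the pattern below matches "hello" followed by an optional punctuation token and "world" (the pattern and sample text are illustrative):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# Each dict describes one token; OP "?" makes the punctuation optional
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True, "OP": "?"}, {"LOWER": "world"}]
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")
# Each match is a (match_id, start, end) tuple of token indices,
# so the matched tokens stay accessible in their Doc context
matches = matcher(doc)
matched_texts = {doc[start:end].text for _, start, end in matches}
```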
Other classes
| Name | Description |
|---|---|
| Corpus | Class for managing annotated corpora for training and evaluation data. |
| KnowledgeBase | Abstract base class for storage and retrieval of data for entity linking. |
| InMemoryLookupKB | Implementation of KnowledgeBase storing all data in memory. |
| Candidate | Object associating a textual mention with a specific entity contained in a KnowledgeBase. |
| Lookups | Container for convenient access to large lookup tables and dictionaries. |
| MorphAnalysis | A morphological analysis. |
| Morphology | Store morphological analyses and map them to and from hash values. |
| Scorer | Compute evaluation scores. |
| StringStore | Map strings to and from hash values. |
| Vectors | Container class for vector data keyed by string. |
| Vocab | The shared vocabulary that stores strings and gives you access to Lexeme objects. |
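To make the StringStore, Vocab and Lexeme relationship concrete, here is a small sketch (the word "coffee" is an arbitrary example):

```python
import spacy

nlp = spacy.blank("en")

# The StringStore maps strings to 64-bit hash values and back
h = nlp.vocab.strings.add("coffee")

# Looking up a string in the Vocab gives a Lexeme: a context-independent
# word type, with no part-of-speech tag or dependency information
lexeme = nlp.vocab["coffee"]
```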