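Doc

class

A container for accessing linguistic annotations.

A Doc is a sequence of Token objects. It lets you access sentences and named entities, export annotations to numpy arrays, and serialize losslessly to compressed binary strings. The Doc object holds an array of TokenC structs. The Python-level Token and Span objects are views of this array, i.e. they don't own the data themselves.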

Doc.__init__ method

Construct a Doc object. The most common way to get a Doc object is via the nlp object.

Name	Description
`vocab`	A storage container for lexical types. Vocab
`words`	A list of strings or integer hash values to add to the document as words. Optional[List[Union[str,int]]]
`spaces`	A list of boolean values indicating whether each word has a subsequent space. Must have the same length as `words`, if specified. Defaults to a sequence of `True`. Optional[List[bool]]
keyword-only
`user_data`	Optional extra data to attach to the Doc. Dict
`tags` v3.0	A list of strings, of the same length as `words`, to assign as `token.tag` for each word. Defaults to `None`. Optional[List[str]]
`pos` v3.0	A list of strings, of the same length as `words`, to assign as `token.pos` for each word. Defaults to `None`. Optional[List[str]]
`morphs` v3.0	A list of strings, of the same length as `words`, to assign as `token.morph` for each word. Defaults to `None`. Optional[List[str]]
`lemmas` v3.0	A list of strings, of the same length as `words`, to assign as `token.lemma` for each word. Defaults to `None`. Optional[List[str]]
`heads` v3.0	A list of values, of the same length as `words`, to assign as the head for each word. Head indices are the absolute position of the head in the `Doc`. Defaults to `None`. Optional[List[int]]
`deps` v3.0	A list of strings, of the same length as `words`, to assign as `token.dep` for each word. Defaults to `None`. Optional[List[str]]
`sent_starts` v3.0	A list of values, of the same length as `words`, to assign as `token.is_sent_start`. Will be overridden by heads if `heads` is provided. Defaults to `None`. Optional[List[Union[bool, int, None]]]
`ents` v3.0	A list of strings, of the same length as `words`, to assign as the token-based IOB tag for each word. Defaults to `None`. Optional[List[str]]
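As a minimal sketch of the constructor arguments listed above, the snippet below builds a Doc by hand from `words` and `spaces` and then adds per-token tags. The example sentence and the tag values are illustrative assumptions, and a bare `Vocab()` is used only to keep the snippet self-contained.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# A bare Vocab is enough here; in practice you would usually reuse nlp.vocab.
vocab = Vocab()
words = ["Hello", "world", "!"]
spaces = [True, False, False]  # only "Hello" is followed by a space

doc = Doc(vocab, words=words, spaces=spaces)
print(doc.text)

# Keyword-only annotations must have the same length as `words`.
tagged = Doc(vocab, words=words, spaces=spaces, tags=["UH", "NN", "."])
print([token.tag_ for token in tagged])
```

If `spaces` is omitted, every word is assumed to be followed by a space, so `doc.text` would end in a trailing space.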

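Doc.__getitem__ method

Get a Token object at position i, where i is an integer. Negative indexing is supported and follows the usual Python semantics, i.e. doc[-2] is equivalent to doc[len(doc) - 2].

Name	Description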
`i`	The index of the token. int
RETURNS	The token at `doc[i]`. Token

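Get a Span object, starting at position start (token index) and ending at position end (token index). For instance, doc[2:5] produces a span consisting of tokens 2, 3 and 4. Stepped slices (e.g. doc[start : end : step]) are not supported, because Span objects must be contiguous (they cannot have gaps). You can use negative indices and open-ended ranges, which follow the usual Python semantics.

Name	Description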
`start_end`	The slice of the document to get. Tuple[int, int]
RETURNS	The span at `doc[start:end]`. Span
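The indexing and slicing rules above can be sketched as follows. The sentence is an illustrative assumption; the Doc is built by hand so no trained pipeline is needed.

```python
from spacy.tokens import Doc, Span, Token
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Give", "it", "back", "!", "He", "pleaded", "."])

first = doc[0]    # a Token
last = doc[-1]    # same as doc[len(doc) - 1]
span = doc[1:3]   # a Span covering tokens 1 and 2
tail = doc[-3:]   # open-ended range with a negative start

print(first.text, last.text, span.text, [t.text for t in tail])
```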

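Doc.__iter__ method

Iterate over Token objects, from which the annotations can be easily accessed.

This is the main way of accessing Token objects, which in turn are the main way annotations are accessed from Python. If faster-than-Python speeds are required, you can instead access the annotations as a numpy array, or access the underlying C data directly from Cython.

Name	Description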
YIELDS	A `Token` object. Token

Doc.__len__ method

Get the number of tokens in the document.

Name	Description
RETURNS	The number of tokens in the document. int
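A short sketch of iteration and length, assuming a hand-built Doc with illustrative words:

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Give", "it", "back"])

# Iterating yields Token objects; token.i is the token's index in the Doc.
texts = [token.text for token in doc]
indices = [token.i for token in doc]
n_tokens = len(doc)

print(texts, indices, n_tokens)
```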

Doc.set_extension classmethod

Define a custom attribute on the Doc which becomes available via Doc._. For details, see the documentation on custom attributes.

Name	Description
`name`	Name of the attribute to set by the extension. For example, `"my_attr"` will be available as `doc._.my_attr`. str
`default`	Optional default value of the attribute if no getter or method is defined. Optional[Any]
`method`	Set a custom method on the object, for example `doc._.compare(other_doc)`. Optional[Callable[[Doc, …], Any]]
`getter`	Getter function that takes the object and returns an attribute value. Is called when the user accesses the `._` attribute. Optional[Callable[[Doc], Any]]
`setter`	Setter function that takes the `Doc` and a value, and modifies the object. Is called when the user writes to the `Doc._` attribute. Optional[Callable[[Doc, Any], None]]
`force`	Force overwriting existing attribute. bool
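The three kinds of extension (default value, getter, method) might be registered as in the sketch below. The attribute names `note`, `n_chars` and `has_word` are hypothetical and chosen only for illustration; `force=True` is passed so the snippet can be re-run without hitting an "already registered" error.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Hypothetical extension names, registered once on the Doc class.
Doc.set_extension("note", default=None, force=True)
Doc.set_extension("n_chars", getter=lambda doc: len(doc.text), force=True)
Doc.set_extension(
    "has_word",
    method=lambda doc, word: word in [t.text for t in doc],
    force=True,
)

doc = Doc(Vocab(), words=["hello", "world"], spaces=[True, False])
doc._.note = "greeting"  # writable because it has a default and no getter

print(doc._.note, doc._.n_chars, doc._.has_word("world"))
```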

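Doc.get_extension classmethod

Look up a previously registered extension by name. Returns a 4-tuple (default, method, getter, setter) if the extension is registered. Raises a KeyError otherwise.

Name	Description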
`name`	Name of the extension. str
RETURNS	A `(default, method, getter, setter)` tuple of the extension. Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]

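Doc.has_extension classmethod

Check whether an extension has been registered on the Doc class.

Name	Description
`name`	Name of the extension to check. str
RETURNS	Whether the extension has been registered. bool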

Doc.remove_extension classmethod

Remove a previously registered extension.

Name	Description
`name`	Name of the extension. str
RETURNS	A `(default, method, getter, setter)` tuple of the removed extension. Tuple[Optional[Any], Optional[Callable], Optional[Callable], Optional[Callable]]
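The lookup, check and removal methods can be exercised together, as in this sketch. The extension name `tmp_note` is a hypothetical placeholder.

```python
from spacy.tokens import Doc

# "tmp_note" is a hypothetical attribute name used only for this demonstration.
Doc.set_extension("tmp_note", default="n/a", force=True)

registered = Doc.has_extension("tmp_note")
default, method, getter, setter = Doc.get_extension("tmp_note")

# remove_extension returns the same 4-tuple that was registered.
removed = Doc.remove_extension("tmp_note")
still_registered = Doc.has_extension("tmp_note")

print(registered, default, removed[0], still_registered)
```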

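Doc.char_span method

Create a Span object from the slice doc.text[start_idx:end_idx]. Returns None if the character indices don't map to a valid span using the default alignment mode `"strict"`.

Name	Description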
`start`	The index of the first character of the span. int
`end`	The index of the last character after the span. int
`label`	A label to attach to the span, e.g. for named entities. Union[int, str]
`kb_id`	An ID from a knowledge base to capture the meaning of a named entity. Union[int, str]
`vector`	A meaning representation of the span. numpy.ndarray[ndim=1, dtype=float32]
`alignment_mode`	How character indices snap to token boundaries. Options: `"strict"` (no snapping), `"contract"` (span of all tokens completely within the character span), `"expand"` (span of all tokens at least partially covered by the character span). Defaults to `"strict"`. str
`span_id` v3.3.1	An identifier to associate with the span. Union[int, str]
RETURNS	The newly constructed object or `None`. Optional[Span]
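To make the three alignment modes concrete, the sketch below requests character offsets that end in the middle of a token. The sentence and the `"GPE"` label are illustrative assumptions; offsets are passed positionally.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(
    Vocab(),
    words=["I", "like", "New", "York"],
    spaces=[True, True, True, False],
)  # doc.text == "I like New York"

# Characters 7..15 line up exactly with the tokens "New York".
exact = doc.char_span(7, 15, label="GPE")

# Characters 7..13 end inside "York", so strict alignment fails.
strict = doc.char_span(7, 13)
expanded = doc.char_span(7, 13, alignment_mode="expand")
contracted = doc.char_span(7, 13, alignment_mode="contract")

print(exact.text, strict, expanded.text, contracted.text)
```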

Doc.set_ents method v3.0

Set the named entities in the document.

Name	Description
`entities`	Spans with labels to set as entities. List[Span]
keyword-only
`blocked`	Spans to set as “blocked” (never an entity) for spaCy’s built-in NER component. Other components may ignore this setting. Optional[List[Span]]
`missing`	Spans with missing/unknown entity information. Optional[List[Span]]
`outside`	Spans outside of entities (O in IOB). Optional[List[Span]]
`default`	How to set entity annotation for tokens outside of any provided spans. Options: `"blocked"`, `"missing"`, `"outside"` and `"unmodified"` (preserve current state). Defaults to `"outside"`. str
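A minimal sketch of how the `default` argument affects tokens outside the provided spans. The sentence and entity labels are illustrative assumptions.

```python
from spacy.tokens import Doc, Span
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Mr.", "Best", "flew", "to", "New", "York"])

# With default="outside", all other tokens are marked as O.
doc.set_ents([Span(doc, 0, 2, label="PERSON")])
iob_after_first = doc[2].ent_iob_

# default="unmodified" keeps the entity that was set in the first call.
doc.set_ents([Span(doc, 4, 6, label="GPE")], default="unmodified")
entities = [(ent.text, ent.label_) for ent in doc.ents]

print(iob_after_first, entities)
```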

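Doc.similarity method (needs model)

Make a semantic similarity estimate. The default estimate is cosine similarity using an average of word vectors.

Name	Description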
`other`	The object to compare with. By default, accepts `Doc`, `Span`, `Token` and `Lexeme` objects. Union[Doc,Span,Token,Lexeme]
RETURNS	A scalar similarity score. Higher is more similar. float

Doc.count_by method

Count the frequencies of a given attribute. Produces a dict of {attr (int): count (int)} frequencies, keyed by the values of the given attribute ID.

Name	Description
`attr_id`	The attribute ID. int
RETURNS	A dictionary mapping attribute values to integer counts. Dict[int, int]
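Because the keys are integer hashes rather than strings, a lookup goes through the string store, as in this sketch (the word list is an illustrative assumption):

```python
from spacy.attrs import ORTH
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["apple", "apple", "orange", "banana"])
counts = doc.count_by(ORTH)

# Keys are hash values; resolve them via the vocab's StringStore.
apple_id = doc.vocab.strings["apple"]
readable = {doc.vocab.strings[key]: value for key, value in counts.items()}

print(counts[apple_id], readable)
```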

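Doc.get_lca_matrix method

Calculate the lowest common ancestor (LCA) matrix for a given Doc. Returns an LCA matrix containing the integer index of the ancestor, or -1 if no common ancestor is found, e.g. if the span excludes a necessary ancestor.

Name	Description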
RETURNS	The lowest common ancestor matrix of the `Doc`. numpy.ndarray[ndim=2, dtype=int32]
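The sketch below computes the matrix for a small hand-annotated tree. The heads and dependency labels are assumptions chosen for illustration ("is" is the root, "a" attaches to "test").

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

# Heads are absolute token indices; token 1 ("is") is its own head (root).
doc = Doc(
    Vocab(),
    words=["This", "is", "a", "test"],
    heads=[1, 1, 3, 1],
    deps=["nsubj", "ROOT", "det", "attr"],
)

lca = doc.get_lca_matrix()

# lca[i, j] is the index of the lowest common ancestor of tokens i and j.
print(lca.shape, int(lca[2, 3]), int(lca[0, 2]))
```

Here the LCA of "a" and "test" is "test" itself (index 3), while "This" and "a" only meet at the root "is" (index 1).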

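Doc.has_annotation method

Check whether the doc contains annotation on a Token attribute.

Changed in v3.0

This method replaces the previous boolean attributes like Doc.is_tagged, Doc.is_parsed or Doc.is_sentenced.

Name	Description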
`attr`	The attribute string name or int ID. Union[int, str]
keyword-only
`require_complete`	Whether to check that the attribute is set on every token in the doc. Defaults to `False`. bool
RETURNS	Whether the specified annotation is present in the doc. bool
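A brief sketch comparing a Doc with and without tag annotation. The words and tags are illustrative assumptions.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["The", "cat", "sat"]
plain = Doc(Vocab(), words=words)
tagged = Doc(Vocab(), words=words, tags=["DT", "NN", "VBD"])

plain_has_tags = plain.has_annotation("TAG")
tagged_has_tags = tagged.has_annotation("TAG")
tagged_complete = tagged.has_annotation("TAG", require_complete=True)
tagged_has_deps = tagged.has_annotation("DEP")

print(plain_has_tags, tagged_has_tags, tagged_complete, tagged_has_deps)
```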

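Doc.to_array method

Export given token attributes to a numpy ndarray. If attr_ids is a sequence of M attributes, the output array will be of shape (N, M), where N is the length of the Doc (in tokens). If attr_ids is a single attribute, the output shape will be (N,). You can specify attributes by integer ID (e.g. spacy.attrs.LEMMA) or string name (e.g. "LEMMA" or "lemma"). The values will be 64-bit integers.

Returns a 2D array with one row per token and one column per attribute (when attr_ids is a list), or a 1D numpy array with one element per token (when attr_ids is a single value).

Name	Description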
`attr_ids`	A list of attributes (int IDs or string names) or a single attribute (int ID or string name). Union[int, str, List[Union[int, str]]]
RETURNS	The exported attributes as a numpy array. Union[numpy.ndarray[ndim=2, dtype=uint64],numpy.ndarray[ndim=1, dtype=uint64]]
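The two output shapes can be seen in the sketch below. It assumes a blank English pipeline (`spacy.blank("en")`), which needs no trained model but does populate lexical attributes such as LOWER and IS_ALPHA; the text is illustrative.

```python
import spacy
from spacy.attrs import IS_ALPHA, LOWER

nlp = spacy.blank("en")  # tokenizer only, no trained components
doc = nlp("Hello world!")

# A list of attributes gives one row per token, one column per attribute.
table = doc.to_array([LOWER, IS_ALPHA])

# A single attribute (here by string name) gives a 1D array.
column = doc.to_array("LOWER")

print(table.shape, column.shape, table.dtype)
```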

Doc.from_array method

Load attributes from a numpy array. Write to a Doc object, from an (M, N) array of attributes.

Name	Description
`attrs`	A list of attribute ID ints. List[int]
`array`	The attribute values to load. numpy.ndarray[ndim=2, dtype=int32]
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The `Doc` itself. Doc
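One way to use this method is as the inverse of `Doc.to_array`: export attributes from one Doc and load them into another with the same tokens. The sketch below assumes hand-written tags and lemmas for illustration.

```python
from spacy.attrs import LEMMA, TAG
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["cats", "sleep"]
source = Doc(
    Vocab(), words=words, tags=["NNS", "VBP"], lemmas=["cat", "sleep"]
)

# Export two attributes, then load them into an unannotated copy.
values = source.to_array([TAG, LEMMA])
target = Doc(source.vocab, words=words)
target.from_array([TAG, LEMMA], values)

print([(t.text, t.tag_, t.lemma_) for t in target])
```

The target shares the source's vocab so that the integer hash values in the array resolve to the same strings.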

Doc.from_docs staticmethod v3.0

Concatenate multiple Doc objects to form a new one. Raises an error if the Doc objects do not all share the same Vocab.

Name	Description
`docs`	A list of `Doc` objects. List[Doc]
`ensure_whitespace`	Insert a space between two adjacent docs whenever the first doc does not end in whitespace. bool
`attrs`	Optional list of attribute ID ints or attribute name strings. Optional[List[Union[str, int]]]
keyword-only
`exclude` v3.3	String names of Doc attributes to exclude. Supported: `spans`, `tensor`, `user_data`. Iterable[str]
RETURNS	The new `Doc` object containing the other docs, or `None` if `docs` is empty or `None`. Optional[Doc]
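A minimal sketch of concatenation, assuming two hand-built Docs that share one Vocab. Neither Doc ends in whitespace, so the default `ensure_whitespace` behavior inserts a space at the join.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

vocab = Vocab()  # all Docs must share the same Vocab
docs = [
    Doc(vocab, words=["Hello", "there"], spaces=[True, False]),
    Doc(vocab, words=["How", "are", "you"], spaces=[True, True, False]),
]

merged = Doc.from_docs(docs)
print(merged.text, len(merged))
```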

Doc.to_disk method

Save the current state to a directory.

Name	Description
`path`	A path to a directory, which will be created if it doesn’t exist. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]

Doc.from_disk method

Load state from a directory. Modifies the object in place and returns it.

Name	Description
`path`	A path to a directory. Paths may be either strings or `Path`-like objects. Union[str,Path]
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The modified `Doc` object. Doc

Doc.to_bytes method

Serialize, i.e. export the document contents to a binary string.

Name	Description
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	A losslessly serialized copy of the `Doc`, including all annotations. bytes

Doc.from_bytes method

Deserialize, i.e. import the document contents from a binary string.

Name	Description
`data`	The string to load from. bytes
keyword-only
`exclude`	String names of serialization fields to exclude. Iterable[str]
RETURNS	The `Doc` object. Doc
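A round trip through `to_bytes` and `from_bytes` might look like the sketch below. The words and tags are illustrative assumptions, and `exclude=["tensor"]` is shown only as an example of skipping a serialization field.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(
    Vocab(),
    words=["Give", "it", "back"],
    spaces=[True, True, False],
    tags=["VB", "PRP", "RB"],
)

# Serialize without the tensor field, then load into a fresh Doc.
data = doc.to_bytes(exclude=["tensor"])
restored = Doc(doc.vocab).from_bytes(data)

print(type(data).__name__, restored.text, [t.tag_ for t in restored])
```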

Doc.to_json method

Serialize a document to JSON. Note that this format differs from the deprecated JSON training format.

Name	Description
`underscore`	Optional list of string names of custom `Doc` attributes. Attribute values need to be JSON-serializable. Values will be added to an `"_"` key in the data, e.g. `"_": {"foo": "bar"}`. Optional[List[str]]
RETURNS	The data in JSON format. Dict[str, Any]

Doc.from_json method v3.3.1

Deserialize a document from JSON, i.e. generate a document from the provided JSON data as generated by Doc.to_json().

Name	Description
`doc_json`	The Doc data in JSON format from `Doc.to_json`. Dict[str, Any]
keyword-only
`validate`	Whether to validate the JSON input against the expected schema for detailed debugging. Defaults to `False`. bool
RETURNS	A `Doc` corresponding to the provided JSON. Doc
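The JSON round trip can be sketched as follows, assuming spaCy v3.3.1 or later (where `Doc.from_json` is available). The text is illustrative; only the `text` and `tokens` keys are inspected here.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Hello", "world"], spaces=[True, False])

data = doc.to_json()
# Each token entry records its character offsets into data["text"].
token_starts = [token["start"] for token in data["tokens"]]

rebuilt = Doc(doc.vocab).from_json(data)
print(data["text"], token_starts, rebuilt.text)
```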

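Doc.retokenize contextmanager

Context manager to handle retokenization of the Doc. Modifications to the Doc's tokenization are stored, and then made all at once when the context manager exits. This is much more efficient, and less error-prone. All views of the Doc (Span and Token) created before the retokenization are invalidated, although they may accidentally continue to work.

Name	Description
RETURNS	The retokenizer. Retokenizer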

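Retokenizer.merge method

Mark a span for merging. The attrs will be applied to the resulting token (if they're context-dependent token attributes like LEMMA or DEP) or to the underlying lexeme (if they're context-independent lexical attributes like LOWER or IS_STOP). Writable custom extension attributes can be provided using the "_" key and specifying a dictionary that maps attribute names to values.

Name	Description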
`span`	The span to merge. Span
`attrs`	Attributes to set on the merged token. Dict[Union[str, int], Any]
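A minimal sketch of merging two tokens into one inside the `Doc.retokenize` context manager. The sentence and the lemma value are illustrative assumptions.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["I", "like", "David", "Bowie"])

# The merge is only applied when the `with` block exits.
with doc.retokenize() as retokenizer:
    retokenizer.merge(doc[2:4], attrs={"LEMMA": "David Bowie"})

print(len(doc), doc[2].text, doc[2].lemma_)
```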

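Retokenizer.split method

Mark a token for splitting, into the specified orths. The heads are required to specify how the new subtokens should be integrated into the dependency tree. Each entry in the list of per-token heads can either be a token in the original document, e.g. doc[2], or a tuple consisting of a token in the original document and a subtoken index. For example, (doc[3], 1) will attach the subtoken to the second subtoken of doc[3].

This mechanism allows attaching subtokens to other newly created subtokens, without having to keep track of the changing token indices. If the specified head token will be split within the retokenizer block and no subtoken index is specified, it will default to 0. Attributes to set on subtokens can be provided as a list of values. They'll be applied to the resulting token (if they're context-dependent token attributes like LEMMA or DEP) or to the underlying lexeme (if they're context-independent lexical attributes like LOWER or IS_STOP).

Name	Description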
`token`	The token to split. Token
`orths`	The verbatim text of the split tokens. Needs to match the text of the original token. List[str]
`heads`	List of `token` or `(token, subtoken)` tuples specifying the tokens to attach the newly split subtokens to. List[Union[Token, Tuple[Token, int]]]
`attrs`	Attributes to set on all split tokens. Attribute names mapped to list of per-token attribute values. Dict[Union[str, int], List[Any]]

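Doc.ents property (needs model)

The named entities in the document. Returns a tuple of named entity Span objects, if the entity recognizer has been applied.

Name	Description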
RETURNS	Entities in the document, one `Span` per entity. Tuple[Span]

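Doc.spans property

A dictionary of named span groups, to store and access additional span annotations. You can write to it by assigning a list of Span objects or a SpanGroup to a given key.

Name	Description
RETURNS	The span groups assigned to the document. Dict[str,SpanGroup]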
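Assigning a plain list of spans to a key can be sketched as below. The key `"errors"` and the words are hypothetical; the assigned list is stored as a SpanGroup.

```python
from spacy.tokens import Doc, SpanGroup
from spacy.vocab import Vocab

doc = Doc(Vocab(), words=["Their", "goi", "ng", "home"])

# "errors" is a hypothetical key; unlike doc.ents, groups may hold any spans.
doc.spans["errors"] = [doc[0:1], doc[1:3]]
group = doc.spans["errors"]

print(type(group).__name__, [span.text for span in group])
```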

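Doc.cats property (needs model)

Maps a label to a score for categories applied to the document. Typically set by the TextCategorizer.

Name	Description
RETURNS	The text categories mapped to scores. Dict[str, float]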

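Doc.noun_chunks property (needs model)

Iterate over the base noun phrases in the document. Yields base noun-phrase Span objects, if the document has been syntactically parsed. A base noun phrase, or "NP chunk", is a noun phrase that does not permit other NPs to be nested within it, so it contains no NP-level coordination, no prepositional phrases, and no relative clauses.

To customize the noun chunk iterator in a loaded pipeline, modify nlp.vocab.get_noun_chunks. If the noun_chunk syntax iterator has not been implemented for the given language, a NotImplementedError is raised.

Name	Description
YIELDS	Noun chunks in the document. Span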

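Doc.sents property (needs model)

Iterate over the sentences in the document. Sentence spans have no label.

This property is only available when sentence boundaries have been set on the document by the parser, senter, sentencizer or some custom function. It will raise an error otherwise.

Name	Description
YIELDS	Sentences in the document. Span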
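Sentence boundaries do not have to come from a trained component. As a sketch, the snippet below sets them by hand through the constructor's `sent_starts` argument; the words and boundary flags are illustrative assumptions.

```python
from spacy.tokens import Doc
from spacy.vocab import Vocab

words = ["This", "is", "one", ".", "Here", "is", "another", "."]
# True marks the first token of each sentence.
sent_starts = [True, False, False, False, True, False, False, False]

doc = Doc(Vocab(), words=words, sent_starts=sent_starts)
sentences = list(doc.sents)

print([sent.text for sent in sentences])
```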

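Doc.has_vector property (needs model)

A boolean value indicating whether a word vector is associated with the object.

Name	Description
RETURNS	Whether the document has vector data attached. bool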

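Doc.vector property (needs model)

A real-valued meaning representation. Defaults to an average of the token vectors.

Name	Description
RETURNS	A 1-dimensional array representing the document's vector. numpy.ndarray[ndim=1, dtype=float32]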

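Doc.vector_norm property (needs model)

The L2 norm of the document's vector representation.

Name	Description
RETURNS	The L2 norm of the vector representation. float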

Attributes

Name	Description
`text`	A string representation of the document text. str
`text_with_ws`	An alias of `Doc.text`, provided for duck-type compatibility with `Span` and `Token`. str
`mem`	The document’s local memory heap, for all C data it owns. cymem.Pool
`vocab`	The store of lexical types. Vocab
`tensor`	Container for dense vector representations. numpy.ndarray
`user_data`	A generic storage area, for user custom data. Dict[str, Any]
`lang`	Language of the document’s vocabulary. int
`lang_`	Language of the document’s vocabulary. str
`sentiment`	The document’s positivity/negativity score, if available. float
`user_hooks`	A dictionary that allows customization of the `Doc`’s properties. Dict[str, Callable]
`user_token_hooks`	A dictionary that allows customization of properties of `Token` children. Dict[str, Callable]
`user_span_hooks`	A dictionary that allows customization of properties of `Span` children. Dict[str, Callable]
`has_unknown_spaces`	Whether the document was constructed without known spacing between tokens (typically when created from gold tokenization). bool
`_`	User space for adding custom attribute extensions. Underscore

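Serialization fields

During serialization, spaCy will export several data fields used to restore different aspects of the object. If needed, you can exclude them from serialization by passing in the string names via the exclude argument.

Name	Description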
`text`	The value of the `Doc.text` attribute.
`sentiment`	The value of the `Doc.sentiment` attribute.
`tensor`	The value of the `Doc.tensor` attribute.
`user_data`	The value of the `Doc.user_data` dictionary.
`user_data_keys`	The keys of the `Doc.user_data` dictionary.
`user_data_values`	The values of the `Doc.user_data` dictionary.
