Corpus

class v3

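An annotated corpus

This class manages annotated corpora and can be used for training and development datasets in the DocBin (.spacy) format. To customize how data is loaded during training, you can register your own data readers and batchers. See the usage guide on data utilities for more details and examples.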

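Config and implementation

spacy.Corpus.v1 is a registered function that creates a Corpus of training or evaluation data. It takes the same arguments as the Corpus class and returns a callable that yields Example objects. You can replace it with your own registered function in the @readers registry to customize data loading and streaming.

Name	Description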
`path`	The directory or filename to read from. Expects data in spaCy’s binary `.spacy` format. Path
`gold_preproc`	Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See `Corpus` for details. bool
`max_length`	Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. int
`limit`	Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. int
`augmenter`	Apply some simple data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don’t have smart quotes, or only have smart quotes, etc. Defaults to `None`. Optional[Callable]
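
To illustrate how a registered function in the @readers registry can stand in for spacy.Corpus.v1, here is a minimal sketch. The registry name `toy_text_reader.v1`, the function `create_toy_reader` and its `texts` argument are hypothetical and exist only for this illustration. The reader simply pairs each unannotated Doc with a copy of itself, so it carries no gold annotations.

```python
from typing import Callable, Iterable, List

import spacy
from spacy.language import Language
from spacy.training import Example


# Hypothetical reader name, used only for this sketch.
@spacy.registry.readers("toy_text_reader.v1")
def create_toy_reader(texts: List[str]) -> Callable[[Language], Iterable[Example]]:
    """Return a callable that takes the nlp object and yields Example objects."""

    def read_examples(nlp: Language) -> Iterable[Example]:
        for text in texts:
            doc = nlp.make_doc(text)
            # Predicted and reference docs are identical here (no annotations).
            yield Example(doc, doc.copy())

    return read_examples


nlp = spacy.blank("en")

# Look the function up by name, the same way a config would resolve it.
make_reader = spacy.registry.readers.get("toy_text_reader.v1")
reader = make_reader(texts=["First text.", "Second text."])
examples = list(reader(nlp))

# The built-in reader is available from the same registry.
builtin_reader = spacy.registry.readers.get("spacy.Corpus.v1")
```

In a training config, such a function would be referenced by its registered name in the @readers setting of a corpus block, with its arguments supplied as settings of that block.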

explosion/spaCy/master/spacy/training/corpus.py

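Corpus.__init__ method

Create a Corpus for iterating Example objects from a file or directory of .spacy data files. The gold_preproc setting lets you specify whether to set up the Example object with gold-standard sentences and tokens for the predictions. Gold preprocessing helps the annotations align to the tokenization, and may result in sequences of more consistent length. However, it may reduce runtime accuracy because of the resulting train/test skew.

Name	Description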
`path`	The directory or filename to read from. Union[str,Path]
keyword-only
`gold_preproc`	Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to `False`. bool
`max_length`	Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to `0` for no limit. int
`limit`	Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. int
`augmenter`	Optional data augmentation callback. Callable[[Language,Example], Iterable[Example]]
`shuffle`	Whether to shuffle the examples. Defaults to `False`. bool
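
The effect of gold_preproc and limit can be sketched with a small DocBin file. This is only an illustration under assumptions: the sample sentences and the temporary file name are made up, and a blank English pipeline stands in for a real one. The reference Doc keeps "well-known" as a single token, which is a hand-chosen tokenization rather than one produced by the tokenizer.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import Doc, DocBin
from spacy.training import Corpus

nlp = spacy.blank("en")

# Reference docs with a hand-chosen ("gold") tokenization.
gold_doc = Doc(nlp.vocab, words=["A", "well-known", "fact"], spaces=[True, True, False])
other_doc = Doc(nlp.vocab, words=["Another", "document"], spaces=[True, False])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    DocBin(docs=[gold_doc, other_doc]).to_disk(path)

    # Default: the predicted doc is re-tokenized from the raw text.
    plain = list(Corpus(path)(nlp))

    # gold_preproc=True: the predicted doc reuses the reference tokens.
    gold = list(Corpus(path, gold_preproc=True)(nlp))

    # limit=1: stop after the first document of this single file.
    limited = list(Corpus(path, limit=1)(nlp))

gold_tokens = [token.text for token in gold[0].predicted]
plain_tokens = [token.text for token in plain[0].predicted]
print(gold_tokens)
print(plain_tokens)
```

Comparing the two printed token lists shows whether the pipeline's own tokenizer agrees with the gold tokenization, which is the source of the train/test skew mentioned above.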

Corpus.__call__ method

Yield examples from the data.

Name	Description
`nlp`	The current `nlp` object. Language
YIELDS	The examples. Example
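
As a rough sketch of what calling a Corpus yields, the snippet below writes one annotated Doc to a temporary .spacy file and reads it back. The sentence, the entity labels and the file name are illustrative assumptions. Each yielded Example holds the annotated reference Doc and a fresh, unannotated predicted Doc.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin, Span
from spacy.training import Corpus

nlp = spacy.blank("en")

# Build one annotated document; tokens are: Apple, is, based, in, Cupertino.
doc = nlp("Apple is based in Cupertino")
doc.ents = [Span(doc, 0, 1, label="ORG"), Span(doc, 4, 5, label="GPE")]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "dev.spacy"
    DocBin(docs=[doc]).to_disk(path)

    corpus = Corpus(path)
    examples = list(corpus(nlp))  # calling the corpus yields Example objects

example = examples[0]
# The reference doc carries the stored annotations.
reference_ents = [(ent.text, ent.label_) for ent in example.reference.ents]
# The predicted doc is created from the raw text and has no entities yet.
predicted_ents = list(example.predicted.ents)
```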

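JsonlCorpus class

Iterate Doc objects from a file or directory of JSONL (newline-delimited JSON) formatted raw text files. Can be used to read the raw text corpus for language model pretraining from a JSONL file.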

Example
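
A minimal sketch of reading a JSONL file, assuming a blank English pipeline and a temporary file whose name and contents are made up for illustration. Each record uses the "text" key.

```python
import json
import tempfile
from pathlib import Path

import spacy
from spacy.training import JsonlCorpus

nlp = spacy.blank("en")
records = [
    {"text": "Raw text for pretraining."},
    {"text": "One JSON object per line."},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "texts.jsonl"
    # Write newline-delimited JSON: one record per line.
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")

    corpus = JsonlCorpus(path)
    # Consume the generator while the temporary file still exists.
    texts = [example.reference.text for example in corpus(nlp)]
```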

JsonlCorpus.__init__ method

Initialize the reader.

Name	Description
`path`	The directory or filename to read from. Expects newline-delimited JSON with a key `"text"` for each record. Union[str,Path]
keyword-only
`min_length`	Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. int
`max_length`	Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. int
`limit`	Limit corpus to a subset of examples, e.g. for debugging. Defaults to `0` for no limit. int
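
The length filters can be sketched as follows. The texts are illustrative and chosen so that their token counts (1, 5 and 12 with the blank English tokenizer) fall clearly below, inside and above the configured range, which avoids relying on how the exact boundary values are treated.

```python
import json
import tempfile
from pathlib import Path

import spacy
from spacy.training import JsonlCorpus

nlp = spacy.blank("en")
sample_texts = [
    "Hi",  # 1 token: shorter than min_length
    "This is a short text",  # 5 tokens: inside the range
    "one two three four five six seven eight nine ten eleven twelve",  # 12 tokens
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "texts.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for text in sample_texts:
            f.write(json.dumps({"text": text}) + "\n")

    # Skip documents that are too short or too long (lengths are in tokens).
    corpus = JsonlCorpus(path, min_length=2, max_length=8)
    kept = [example.reference.text for example in corpus(nlp)]
```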

JsonlCorpus.__call__ method

Yield examples from the data.

Name	Description
`nlp`	The current `nlp` object. Language
YIELDS	The examples. Example

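PlainTextCorpus class v3.5.1

Iterate over documents from a plain text file. Can be used to read raw text corpora for language model pretraining. The expected file format is:

UTF-8 encoding
One document per line
Blank lines are ignored.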

Example
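
A minimal sketch of the expected file format, assuming spaCy v3.5.1 or later and a blank English pipeline. The file name and its contents are made up for illustration. The empty line between the two documents is expected to be skipped.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.training import PlainTextCorpus

nlp = spacy.blank("en")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "texts.txt"
    # UTF-8 encoded, one document per line, with a blank line in between.
    path.write_text("First document.\n\nSecond document.\n", encoding="utf-8")

    corpus = PlainTextCorpus(path)
    # Consume the generator while the temporary file still exists.
    documents = [example.reference.text for example in corpus(nlp)]
```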

PlainTextCorpus.__init__ method

Initialize the reader.

Name	Description
`path`	The directory or filename to read from. Expects newline-delimited documents in UTF-8 encoding. Union[str,Path]
keyword-only
`min_length`	Minimum document length (in tokens). Shorter documents will be skipped. Defaults to `0`, which indicates no limit. int
`max_length`	Maximum document length (in tokens). Longer documents will be skipped. Defaults to `0`, which indicates no limit. int

PlainTextCorpus.__call__ method

Yield examples from the data.

Name	Description
`nlp`	The current `nlp` object. Language
YIELDS	The examples. Example
