Other

Corpus

class v3
An annotated corpus

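This class manages annotated corpora and can be used for training and development datasets in the DocBin (.spacy) format. To customize the data loading during training, you can register your own data readers and batchers. See the usage guide on data utilities for more details and examples.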

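Config and implementation

spacy.Corpus.v1 is a registered function that creates a Corpus of training or evaluation data. It takes the same arguments as the Corpus class and returns a callable that yields Example objects. You can replace it with your own registered function in the @readers registry to customize the data loading and streaming.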

Name | Description
path | The directory or filename to read from. Expects data in spaCy’s binary .spacy format. (Path)
gold_preproc | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See Corpus for details. (bool)
max_length | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to 0 for no limit. (int)
limit | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. (int)
augmenter | Apply some simple data augmentation, where tokens are replaced with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don’t have smart quotes, or only have smart quotes, etc. Defaults to None. (Optional[Callable])
explosion/spaCy/master/spacy/training/corpus.py
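As noted in this section, spacy.Corpus.v1 can be swapped for your own function in the @readers registry. The following is a minimal sketch of what such a function might look like, not the library's own implementation. The registry name `example_docbin_reader.v1` and the helper names `create_docbin_reader` and `read_examples` are hypothetical, and the sketch assumes a single .spacy file rather than a directory.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterator

import spacy
from spacy.language import Language
from spacy.tokens import DocBin
from spacy.training import Example


# "example_docbin_reader.v1" is a hypothetical name chosen for this sketch.
@spacy.registry.readers("example_docbin_reader.v1")
def create_docbin_reader(
    path: Path, limit: int = 0
) -> Callable[[Language], Iterator[Example]]:
    def read_examples(nlp: Language) -> Iterator[Example]:
        doc_bin = DocBin().from_disk(path)
        for i, reference in enumerate(doc_bin.get_docs(nlp.vocab)):
            if limit and i >= limit:
                break
            # Pair a freshly tokenized doc with the annotated reference doc.
            yield Example(nlp.make_doc(reference.text), reference)

    return read_examples


# Demo on a temporary .spacy file so the sketch runs on its own.
nlp = spacy.blank("en")
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    DocBin(docs=[nlp("First document."), nlp("Second document.")]).to_disk(path)
    reader = create_docbin_reader(path, limit=1)
    reader_examples = list(reader(nlp))

print([eg.reference.text for eg in reader_examples])
```

In a training config, a function registered this way would typically be referenced by its registry name in the @readers entry of a corpus block, in place of spacy.Corpus.v1.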

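Corpus.__init__ method

Create a Corpus for iterating Example objects from a file or directory of .spacy data files. The gold_preproc setting lets you specify whether to set up the Example object with gold-standard sentences and tokens for the predictions. Gold preprocessing helps the annotations align to the tokenization, and may result in sequences of more consistent length. However, it may reduce runtime accuracy due to train/test skew.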
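To make the effect of gold_preproc concrete, here is a small sketch that loads the same reference document twice, once with the default setting and once with gold_preproc=True. The hand-built reference tokenization ("well-known" as one token) and the temporary file name are assumptions made for illustration only.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import Doc, DocBin
from spacy.training import Corpus

nlp = spacy.blank("en")
# Reference doc with a gold tokenization chosen by hand for this sketch.
reference = Doc(nlp.vocab, words=["well-known", "fact"], spaces=[True, False])

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "dev.spacy"
    DocBin(docs=[reference]).to_disk(path)
    # Default: the predicted doc is produced by the pipeline's own tokenizer.
    default_eg = list(Corpus(path)(nlp))[0]
    # gold_preproc=True: the predicted doc reuses the gold tokens.
    gold_eg = list(Corpus(path, gold_preproc=True)(nlp))[0]

print([token.text for token in default_eg.predicted])
print([token.text for token in gold_eg.predicted])
```

With gold preprocessing, the predicted side starts from the reference tokens, so the two sides align exactly. Without it, the tokenizer decides how the raw text is split, which is closer to what the pipeline sees at runtime.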

Name | Description
path | The directory or filename to read from. (Union[str,Path])
keyword-only |
gold_preproc | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. Defaults to False. (bool)
max_length | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to 0 for no limit. (int)
limit | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. (int)
augmenter | Optional data augmentation callback. (Callable[[Language,Example], Iterable[Example]])
shuffle | Whether to shuffle the examples. Defaults to False. (bool)
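The augmenter argument takes a callback that receives the nlp object and an Example and yields one or more Example objects. Below is a hedged sketch of a custom callback. `lowercase_augmenter` is a hypothetical function written for this illustration, not one of spaCy's built-in augmenters, and it deliberately keeps only tokens and spacing in the lowercased copy.

```python
import tempfile
from pathlib import Path
from typing import Iterator

import spacy
from spacy.language import Language
from spacy.tokens import Doc, DocBin
from spacy.training import Corpus, Example


def lowercase_augmenter(nlp: Language, example: Example) -> Iterator[Example]:
    """Hypothetical augmenter: yield the original plus a lowercased copy."""
    yield example
    ref = example.reference
    # Simplification: the copy keeps tokens and spacing only, and drops
    # any other annotations (tags, entities, ...) on the reference doc.
    lowered = Doc(
        nlp.vocab,
        words=[token.lower_ for token in ref],
        spaces=[bool(token.whitespace_) for token in ref],
    )
    yield Example(nlp.make_doc(lowered.text), lowered)


nlp = spacy.blank("en")
with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    DocBin(docs=[nlp("Hello World")]).to_disk(path)
    corpus = Corpus(path, augmenter=lowercase_augmenter)
    augmented = [eg.reference.text for eg in corpus(nlp)]

print(augmented)
```

A real augmenter would usually carry the annotations over to the variant as well; the sketch skips that step to stay short.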

Corpus.__call__ method

Yield examples from the data.

Name | Description
nlp | The current nlp object. (Language)
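A short sketch of calling a Corpus, assuming a blank English pipeline and a temporary .spacy file created only for this illustration. It also uses limit to cap the number of examples.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.tokens import DocBin
from spacy.training import Corpus

nlp = spacy.blank("en")
docs = [
    nlp("This is a sentence."),
    nlp("Another short document."),
    nlp("A third one."),
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.spacy"
    # Serialize the docs in the binary .spacy format that Corpus expects.
    DocBin(docs=docs).to_disk(path)
    corpus = Corpus(path, limit=2)
    # Calling the corpus with the nlp object yields Example objects.
    train_examples = list(corpus(nlp))

for eg in train_examples:
    print(eg.reference.text)
```

Each Example pairs a predicted doc, built from the raw text, with the reference doc read from disk.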

JsonlCorpus

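Iterate Doc objects from a file or directory of JSONL (newline-delimited JSON) formatted raw text files. Can be used to read the raw text corpus for language model pretraining from JSONL files.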

Example
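A minimal sketch of reading raw text with JsonlCorpus. The file name, the sample records, and the min_length value are assumptions chosen for illustration. Each record carries its text under the "text" key.

```python
import json
import tempfile
from pathlib import Path

import spacy
from spacy.training import JsonlCorpus

nlp = spacy.blank("en")
records = [{"text": "Hello"}, {"text": "This is a longer raw text."}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "raw_text.jsonl"
    # Write one JSON object per line.
    with path.open("w", encoding="utf8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    # Documents shorter than two tokens are skipped.
    corpus = JsonlCorpus(path, min_length=2)
    jsonl_texts = [eg.reference.text for eg in corpus(nlp)]

print(jsonl_texts)
```

The single-token record is filtered out by min_length, so only the longer text is yielded.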

JsonlCorpus.__init__ method

Initialize the reader.

Name | Description
path | The directory or filename to read from. Expects newline-delimited JSON with a key "text" for each record. (Union[str,Path])
keyword-only |
min_length | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to 0, which indicates no limit. (int)
max_length | Maximum document length (in tokens). Longer documents will be skipped. Defaults to 0, which indicates no limit. (int)
limit | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. (int)

JsonlCorpus.__call__ method

Yield examples from the data.

Name | Description
nlp | The current nlp object. (Language)

PlainTextCorpus v3.5.1

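Iterate over documents from a plain text file. Can be used to read the raw text corpus for language model pretraining. The expected file format is:

  • UTF-8 encoding
  • One document per line
  • Blank lines are ignored.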

Example
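A minimal sketch of reading a plain text file with PlainTextCorpus, assuming a spaCy version that ships this class (v3.5.1 or later, per the version tag above). The file name and its contents are made up for illustration.

```python
import tempfile
from pathlib import Path

import spacy
from spacy.training.corpus import PlainTextCorpus

nlp = spacy.blank("en")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "raw_text.txt"
    # One document per line, UTF-8 encoded, with a blank line in between.
    path.write_text("First document.\n\nSecond document.\n", encoding="utf8")
    corpus = PlainTextCorpus(path)
    plain_texts = [eg.reference.text for eg in corpus(nlp)]

print(plain_texts)
```

The blank line between the two documents is ignored, so two examples are produced.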

PlainTextCorpus.__init__ method

Initialize the reader.

Name | Description
path | The directory or filename to read from. Expects newline-delimited documents in UTF-8 format. (Union[str,Path])
keyword-only |
min_length | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to 0, which indicates no limit. (int)
max_length | Maximum document length (in tokens). Longer documents will be skipped. Defaults to 0, which indicates no limit. (int)

PlainTextCorpus.__call__ method

Yield examples from the data.

Name | Description
nlp | The current nlp object. (Language)