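Top-level Functions
spacy.load function
Load a pipeline using the name of an installed package, a string path or a Path-like object.
spaCy will try resolving the load argument in this order. If a pipeline is loaded from a string name, spaCy will assume it's a Python package, import it and call the package's own load() method. If a pipeline is loaded from a path, spaCy will assume it's a data directory, load its config.cfg and use the language and pipeline information to construct the Language class. The data will be loaded in via Language.from_disk. Loading a pipeline from a package will also import any custom code, if present, whereas loading from a directory does not. For these cases, you need to manually import your custom code.
| Name | Description |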
|---|---|
name | Pipeline to load, i.e. package name or path. Union[str,Path] |
| keyword-only | |
vocab | Optional shared vocab to pass in on initialization. If True (default), a new Vocab object will be created. Union[Vocab, bool] |
disable | Name(s) of pipeline component(s) to disable. Disabled pipes will be loaded but they won’t be run unless you explicitly enable them by calling nlp.enable_pipe. Is merged with the config entry nlp.disabled. Union[str, Iterable[str]] |
enable v3.4 | Name(s) of pipeline component(s) to enable. All other pipes will be disabled. Union[str, Iterable[str]] |
exclude v3.0 | Name(s) of pipeline component(s) to exclude. Excluded components won’t be loaded. Union[str, Iterable[str]] |
config v3.0 | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. "components.name.value". Union[Dict[str, Any],Config] |
| RETURNS | A Language object with the loaded pipeline. Language |
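Essentially, spacy.load() is a convenience wrapper that reads the pipeline's config.cfg, uses the language and pipeline information to construct a Language object, loads in the model data and weights, and returns it.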
spacy.blank function
Create a blank pipeline of a given language class. This function is the twin of spacy.load().
| Name | Description |
|---|---|
name | IETF language tag, such as ‘en’, of the language class to load. str |
| keyword-only | |
vocab | Optional shared vocab to pass in on initialization. If True (default), a new Vocab object will be created. Union[Vocab, bool] |
config v3.0 | Optional config overrides, either as nested dict or dict keyed by section value in dot notation, e.g. "components.name.value". Union[Dict[str, Any],Config] |
meta | Optional meta overrides for nlp.meta. Dict[str, Any] |
| RETURNS | An empty Language object of the appropriate subclass. Language |
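The config argument of spacy.load and spacy.blank accepts overrides either as a nested dict or as a flat dict keyed in dot notation. The sketch below only illustrates how the two forms relate. The helper to_nested is hypothetical and not part of spaCy, and the override keys are examples.

```python
from typing import Any, Dict


def to_nested(flat: Dict[str, Any]) -> Dict[str, Any]:
    """Expand dot-notation keys into a nested dict (illustrative helper, not spaCy API)."""
    nested: Dict[str, Any] = {}
    for dotted_key, value in flat.items():
        *sections, leaf = dotted_key.split(".")
        node = nested
        for section in sections:
            # Create intermediate sections on demand.
            node = node.setdefault(section, {})
        node[leaf] = value
    return nested


# The same overrides written in dot notation and as a nested dict.
flat_overrides = {"nlp.batch_size": 64, "nlp.pipeline": []}
nested_overrides = to_nested(flat_overrides)
print(nested_overrides)
```

Dot notation is usually shorter when only one or two values change, while the nested form is easier to build programmatically.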
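spacy.info function
The same as the info command. Pretty-print information about your installation, installed pipelines and local setup from within spaCy.
| Name | Description |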
|---|---|
model | Optional pipeline, i.e. a package name or path (optional). Optional[str] |
| keyword-only | |
markdown | Print information as Markdown. bool |
silent | Don’t print anything, just return. bool |
spacy.explain function
Get a description for a given POS tag, dependency label or entity type. For a list of available terms, see glossary.py.
| Name | Description |
|---|---|
term | Term to explain. str |
| RETURNS | The explanation, or None if not found in the glossary. Optional[str] |
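spacy.prefer_gpu function
Allocate data and perform operations on GPU, if available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any pipelines.
| Name | Description |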
|---|---|
gpu_id | Device index to select. Defaults to 0. int |
| RETURNS | Whether the GPU was activated. bool |
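spacy.require_gpu function
Allocate data and perform operations on GPU. Will raise an error if no GPU is available. If data has already been allocated on CPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any pipelines.
| Name | Description |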
|---|---|
gpu_id | Device index to select. Defaults to 0. int |
| RETURNS | True bool |
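spacy.require_cpu function v3.0.0
Allocate data and perform operations on CPU. If data has already been allocated on GPU, it will not be moved. Ideally, this function should be called right after importing spaCy and before loading any pipelines.
| Name | Description |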
|---|---|
| RETURNS | True bool |
displaCy
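As of v2.0, spaCy comes with a built-in visualization suite. For more info and examples, see the usage guide on visualizing spaCy.
displacy.serve method
Serve a dependency parse tree or named entity visualization to view it in your browser. Will run a simple web server.
| Name | Description |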
|---|---|
docs | Document(s) or span(s) to visualize. Union[Iterable[Union[Doc,Span]],Doc,Span] |
style v3.3 | Visualization style, "dep", "ent" or "span". Defaults to "dep". str |
page | Render markup as full HTML page. Defaults to True. bool |
minify | Minify HTML markup. Defaults to False. bool |
options | Visualizer-specific options, e.g. colors. Dict[str, Any] |
manual | Don’t parse Doc and instead expect a dict or list of dicts. See here for formats and examples. Defaults to False. bool |
port | Port to serve visualization. Defaults to 5000. int |
host | Host to serve visualization. Defaults to "0.0.0.0". str |
auto_select_port v3.5 | If True, automatically switch to a different port if the specified port is already in use. Defaults to False. bool |
displacy.render method
Render a dependency parse tree or named entity visualization.
| Name | Description |
|---|---|
docs | Document(s) or span(s) to visualize. Union[Iterable[Union[Doc,Span, dict]],Doc,Span, dict] |
style | Visualization style, "dep", "ent" or "span" v3.3. Defaults to "dep". str |
page | Render markup as full HTML page. Defaults to False. bool |
minify | Minify HTML markup. Defaults to False. bool |
options | Visualizer-specific options, e.g. colors. Dict[str, Any] |
manual | Don’t parse Doc and instead expect a dict or list of dicts. See here for formats and examples. Defaults to False. bool |
jupyter | Explicitly enable or disable ”Jupyter mode” to return markup ready to be rendered in a notebook. Detected automatically if None (default). Optional[bool] |
| RETURNS | The rendered HTML markup. str |
displacy.parse_deps method
Generate a dependency parse in {'words': [], 'arcs': []} format. For use with the manual=True argument in displacy.render.
| Name | Description |
|---|---|
orig_doc | Doc or span to parse dependencies. Union[Doc,Span] |
options | Dependency parse specific visualisation options. Dict[str, Any] |
| RETURNS | Generated dependency parse keyed by words and arcs. dict |
displacy.parse_ents method
Generate named entities in [{start: i, end: i, label: 'label'}] format. For use with the manual=True argument in displacy.render.
| Name | Description |
|---|---|
doc | Doc to parse entities. Doc |
options | NER-specific visualisation options. Dict[str, Any] |
| RETURNS | Generated entities keyed by text (original text) and ents. dict |
displacy.parse_spans method
Generate spans in [{start_token: i, end_token: i, label: 'label'}] format. For use with the manual=True argument in displacy.render.
| Name | Description |
|---|---|
doc | Doc to parse entities. Doc |
options | Span-specific visualisation options. Dict[str, Any] |
| RETURNS | Generated spans keyed by text (original text), spans and tokens. dict |
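Visualizer data structures
You can use displaCy's data format to manually render data. This can be useful if you want to visualize output from other libraries. You can find examples of displaCy's different data formats below.
Dependency Visualizer data structure
| Dictionary Key | Description |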
|---|---|
words | List of dictionaries describing a word token (see structure below). List[Dict[str, Any]] |
arcs | List of dictionaries describing the relations between words (see structure below). List[Dict[str, Any]] |
| optional | |
title | Title of the visualization. Optional[str] |
settings | Dependency Visualizer options (see here). Dict[str, Any] |
| Dictionary Key | Description |
|---|---|
text | Text content of the word. str |
tag | Fine-grained part-of-speech. str |
lemma | Base form of the word. Optional[str] |
| Dictionary Key | Description |
|---|---|
start | The index of the starting token. int |
end | The index of the ending token. int |
label | The type of dependency relation. str |
dir | Direction of the relation (left, right). str |
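As a minimal sketch of the structure described above, the dict below encodes a parse of "This is a sentence" by hand. The tags and labels are illustrative, and check_parse is a hypothetical helper that only verifies that every arc points at an existing word.

```python
from typing import Any, Dict

# Hand-built input for the dependency visualizer.
parse: Dict[str, Any] = {
    "words": [
        {"text": "This", "tag": "DT"},
        {"text": "is", "tag": "VBZ"},
        {"text": "a", "tag": "DT"},
        {"text": "sentence", "tag": "NN"},
    ],
    "arcs": [
        {"start": 0, "end": 1, "label": "nsubj", "dir": "left"},
        {"start": 2, "end": 3, "label": "det", "dir": "left"},
        {"start": 1, "end": 3, "label": "attr", "dir": "right"},
    ],
}


def check_parse(data: Dict[str, Any]) -> bool:
    """Return True if all arcs reference valid word indices and directions."""
    n_words = len(data["words"])
    for arc in data["arcs"]:
        if not (0 <= arc["start"] < n_words and 0 <= arc["end"] < n_words):
            return False
        if arc["dir"] not in ("left", "right"):
            return False
    return True


is_valid = check_parse(parse)
print(is_valid)
# With spaCy installed, the dict can be rendered with:
# displacy.render(parse, style="dep", manual=True)
```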
Named Entity Recognition data structure
| Dictionary Key | Description |
|---|---|
text | String representation of the document text. str |
ents | List of dictionaries describing entities (see structure below). List[Dict[str, Any]] |
| optional | |
title | Title of the visualization. Optional[str] |
settings | Entity Visualizer options (see here). Dict[str, Any] |
| Dictionary Key | Description |
|---|---|
start | The index of the first character of the entity. int |
end | The index of the last character of the entity. (not inclusive) int |
label | Label attached to the entity. str |
| optional | |
kb_id | KnowledgeBase ID. str |
kb_url | KnowledgeBase URL. str |
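The entity offsets are character offsets with an exclusive end, so they can be used directly as a Python slice. A small hand-written example (the sentence and label are illustrative):

```python
# Hand-built input for the entity visualizer.
ner_data = {
    "text": "But Google is starting from behind.",
    "ents": [{"start": 4, "end": 10, "label": "ORG"}],
    "title": None,
}

# Slice the text with each entity's offsets to recover the entity string.
entity_strings = [
    (ner_data["text"][ent["start"]:ent["end"]], ent["label"])
    for ent in ner_data["ents"]
]
print(entity_strings)
# With spaCy installed, the dict can be rendered with:
# displacy.render(ner_data, style="ent", manual=True)
```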
Span Classification data structure
| Dictionary Key | Description |
|---|---|
text | String representation of the document text. str |
spans | List of dictionaries describing spans (see structure below). List[Dict[str, Any]] |
tokens | List of word tokens. List[str] |
| optional | |
title | Title of the visualization. Optional[str] |
settings | Span Visualizer options (see here). Dict[str, Any] |
| Dictionary Key | Description |
|---|---|
start_token | The index of the first token of the span in tokens. int |
end_token | The index of the token after the last token of the span in tokens (not inclusive). int |
label | Label attached to the span. str |
| optional | |
kb_id | KnowledgeBase ID. str |
kb_url | KnowledgeBase URL. str |
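Unlike entities, spans are addressed by token index and may overlap. The sketch below assumes end_token is exclusive, as in Python slicing. The sentence and labels are illustrative.

```python
# Hand-built input for the span visualizer with two overlapping spans.
span_data = {
    "text": "Welcome to the Bank of China.",
    "spans": [
        {"start_token": 3, "end_token": 6, "label": "ORG"},
        {"start_token": 5, "end_token": 6, "label": "GPE"},
    ],
    "tokens": ["Welcome", "to", "the", "Bank", "of", "China", "."],
}

# Recover the token sequence covered by each span.
covered = [
    (" ".join(span_data["tokens"][s["start_token"]:s["end_token"]]), s["label"])
    for s in span_data["spans"]
]
print(covered)
# With spaCy installed, the dict can be rendered with:
# displacy.render(span_data, style="span", manual=True)
```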
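Visualizer options
The options argument lets you specify additional settings for each visualizer. If a setting is not present in the options, the default value will be used.
Dependency Visualizer options
| Name | Description |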
|---|---|
fine_grained | Use fine-grained part-of-speech tags (Token.tag_) instead of coarse-grained tags (Token.pos_). Defaults to False. bool |
add_lemma | Print the lemmas in a separate row below the token texts. Defaults to False. bool |
collapse_punct | Attach punctuation to tokens. Can make the parse more readable, as it prevents long arcs to attach punctuation. Defaults to True. bool |
collapse_phrases | Merge noun phrases into one token. Defaults to False. bool |
compact | “Compact mode” with square arrows that takes up less space. Defaults to False. bool |
color | Text color. Can be provided in any CSS legal format as a string e.g.: "#00ff00", "rgb(0, 255, 0)", "hsl(120, 100%, 50%)" and "green" all correspond to the color green (without transparency). Defaults to "#000000". str |
bg | Background color. Can be provided in any CSS legal format as a string e.g.: "#00ff00", "rgb(0, 255, 0)", "hsl(120, 100%, 50%)" and "green" all correspond to the color green (without transparency). Defaults to "#ffffff". str |
font | Font name or font family for all text. Defaults to "Arial". str |
offset_x | Spacing on left side of the SVG in px. Defaults to 50. int |
arrow_stroke | Width of arrow path in px. Defaults to 2. int |
arrow_width | Width of arrow head in px. Defaults to 10 in regular mode and 8 in compact mode. int |
arrow_spacing | Spacing between arrows in px to avoid overlaps. Defaults to 20 in regular mode and 12 in compact mode. int |
word_spacing | Vertical spacing between words and arcs in px. Defaults to 45. int |
distance | Distance between words in px. Defaults to 175 in regular mode and 150 in compact mode. int |
Named Entity Visualizer options
| Name | Description |
|---|---|
ents | Entity types to highlight or None for all types (default). Optional[List[str]] |
colors | Color overrides. Entity types should be mapped to color names or values. Dict[str, str] |
template | Optional template to overwrite the HTML used to render entity spans. Should be a format string and can use {bg}, {text} and {label}. See templates.py for examples. Optional[str] |
kb_url_template v3.2.1 | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in. Optional[str] |
Span Visualizer options
| Name | Description |
|---|---|
spans_key | Which spans key to render spans from. Default is "sc". str |
templates | Dictionary containing the keys "span", "slice", and "start". These dictate how the overall span, a span slice, and the starting token will be rendered. Optional[Dict[str, str]] |
kb_url_template | Optional template to construct the KB url for the entity to link to. Expects a python f-string format with single field to fill in. Optional[str] |
colors | Color overrides. Entity types should be mapped to color names or values. Dict[str, str] |
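By default, displaCy comes with colors for all entity types used by spaCy's trained pipelines, for both the entity and the span visualizer. If you're using custom entity types, you can use the colors setting to add your own colors for them. Your application or pipeline package can also expose a spacy_displacy_colors entry point to add custom labels and their colors automatically.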
By default, displaCy links to # for entities without a kb_id set on their
span. If you wish to link an entity to their URL then consider using the
kb_url_template option from above. For example if the kb_id on a span is
Q95 and this is a Wikidata identifier then this option can be set to
https://www.wikidata.org/wiki/{}. Clicking on your entity in the rendered HTML
should redirect you to their Wikidata page, in this case
https://www.wikidata.org/wiki/Q95.
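The template is filled in with plain string formatting, which can be checked without rendering anything. The options dict below is a sketch of how the setting would be passed along.

```python
# Template with a single field, as described above.
kb_url_template = "https://www.wikidata.org/wiki/{}"
kb_id = "Q95"

# Filling the single field with the span's kb_id yields the link target.
kb_url = kb_url_template.format(kb_id)
print(kb_url)

# The template is passed to the visualizer through the options argument.
options = {"kb_url_template": kb_url_template}
```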
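registry v3.0
spaCy's function registry extends Thinc's registry and allows you to map
strings to functions. You can register functions to create architectures,
optimizers, schedules and more, and then refer to them and set their
arguments in your config file. Python type hints are used to validate the
inputs. See the Thinc docs for details on the registry methods and our helper
library catalogue for some background on the concept of function registries.
spaCy also uses the function registry for language subclasses, model
architectures, lookups and pipeline component factories.
| Registry name | Description |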
|---|---|
architectures | Registry for functions that create model architectures. Can be used to register custom model architectures and reference them in the config.cfg. |
augmenters | Registry for functions that create data augmentation callbacks for corpora and other training data iterators. |
batchers | Registry for training and evaluation data batchers. |
callbacks | Registry for custom callbacks to modify the nlp object before training. |
displacy_colors | Registry for custom color scheme for the displacy NER visualizer. Automatically reads from entry points. |
factories | Registry for functions that create pipeline components. Added automatically when you use the @spacy.component decorator and also reads from entry points. |
initializers | Registry for functions that create initializers. |
languages | Registry for language-specific Language subclasses. Automatically reads from entry points. |
layers | Registry for functions that create layers. |
loggers | Registry for functions that log training results. |
lookups | Registry for large lookup tables available via vocab.lookups. |
losses | Registry for functions that create losses. |
misc | Registry for miscellaneous functions that return data assets, knowledge bases or anything else you may need. |
optimizers | Registry for functions that create optimizers. |
readers | Registry for file and data readers, including training and evaluation data readers like Corpus. |
schedules | Registry for functions that create schedules. |
scorers | Registry for functions that create scoring methods for use with the Scorer. Scoring methods are called with Iterable[Example] and arbitrary **kwargs and return scores as Dict[str, Any]. |
tokenizers | Registry for tokenizer factories. Registered functions should return a callback that receives the nlp object and returns a Tokenizer or a custom callable. |
spacy-transformers registry
The following registries are added by the spacy-transformers package.
See the Transformer API reference and usage docs for details.
| Registry name | Description |
|---|---|
span_getters | Registry for functions that take a batch of Doc objects and return a list of Span objects to process by the transformer, e.g. sentences. |
annotation_setters | Registry for functions that create annotation setters. Annotation setters are functions that take a batch of Doc objects and a FullTransformerBatch and can set additional annotations on the Doc. |
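Loggers v3.0
A logger records the training results. When a logger is created, two functions are returned: one for logging the information for each training step, and a second function that is called to finalize the logging when the training is finished. To log each training step, a dictionary is passed on from spacy train, including information such as the training loss and the accuracy scores on the development set.
The built-in, default logger is the ConsoleLogger, which prints results to the console in tabular format and saves them to a jsonl file. The spacy-loggers package, included as a dependency of spaCy, enables other loggers, such as one that sends results to a Weights & Biases dashboard.
Instead of using one of the built-in loggers, you can implement your own.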
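The two-function contract can be sketched in plain Python. Everything here is illustrative: make_list_logger is a hypothetical factory, the setup signature (nlp, stdout, stderr) and the dictionary keys "step" and "score" are assumptions about what the training loop passes in, and registering the factory in the loggers registry is left out.

```python
import sys
from typing import IO, Any, Callable, Dict, List, Optional, Tuple


def make_list_logger(records: List[Dict[str, Any]]) -> Callable:
    """Hypothetical logger factory that collects one row per training step."""

    def setup(
        nlp: Any, stdout: IO = sys.stdout, stderr: IO = sys.stderr
    ) -> Tuple[Callable[[Optional[Dict[str, Any]]], None], Callable[[], None]]:
        def log_step(info: Optional[Dict[str, Any]]) -> None:
            # Called once per training step; None is treated as "nothing to log".
            if info is not None:
                records.append({"step": info["step"], "score": info["score"]})

        def finalize() -> None:
            # Called once when training is finished.
            stdout.write(f"logged {len(records)} steps\n")

        return log_step, finalize

    return setup


# Simulate what a training loop would do with the two functions.
records: List[Dict[str, Any]] = []
log_step, finalize = make_list_logger(records)(nlp=None)
log_step({"step": 0, "score": 0.1, "losses": {"ner": 12.3}})
log_step({"step": 200, "score": 0.6, "losses": {"ner": 4.5}})
finalize()
```

Keeping the per-step function and the finalizer separate lets a logger hold resources such as an open file across steps and release them exactly once at the end.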
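spacy.ConsoleLogger.v2 registered function
Writes the results of a training step to the console in a tabular format and saves them to a jsonl file.
Note that the cumulative loss keeps increasing within one epoch, but should start decreasing across epochs.
| Name | Description |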
|---|---|
progress_bar | Whether the logger should print a progress bar tracking the steps till the next evaluation pass (default: False). bool |
console_output | Whether the logger should print the logs in the console (default: True). bool |
output_file | The file to save the training logs to (default: None). Optional[Union[str,Path]] |
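spacy.ConsoleLogger.v3 registered function
Writes the results of a training step to the console in a tabular format and optionally saves them to a jsonl file.
| Name | Description |
|---|---|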
progress_bar | Type of progress bar to show in the console: "train", "eval" or None. The bar tracks the number of steps until training.max_steps and training.eval_frequency are reached respectively (default: None). Optional[str] |
console_output | Whether the logger should print the logs in the console (default: True). bool |
output_file | The file to save the training logs to (default: None). Optional[Union[str,Path]] |
Readers
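File readers v3.0
The following file readers are provided by our serialization library srsly. All registered functions take one argument path, pointing to the file path to load.
| Name | Description |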
|---|---|
srsly.read_json.v1 | Read data from a JSON file. |
srsly.read_jsonl.v1 | Read data from a JSONL (newline-delimited JSON) file. |
srsly.read_yaml.v1 | Read data from a YAML file. |
srsly.read_msgpack.v1 | Read data from a binary MessagePack file. |
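spacy.read_labels.v1 registered function
Read a JSON-formatted labels file generated with init labels. Typically used in the [initialize] block of the training config to speed up the model initialization process and provide pre-generated label sets.
| Name | Description |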
|---|---|
path | The path to the labels file generated with init labels. Path |
require | Whether to require the file to exist. If set to False and the labels file doesn’t exist, the loader will return None and the initialize method will extract the labels from the data. Defaults to False. bool |
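| CREATES | The list of labels. List[str] |
Corpus readers v3.0
Corpus readers are registered functions that load data and return a function that takes the current nlp object and yields Example objects that can be used for training and pretraining. You can replace it with your own registered function in the @readers registry to customize the data loading and streaming.
spacy.Corpus.v1 registered function
The Corpus reader manages annotated corpora and can be used for training and development datasets in the DocBin (.spacy) format. Also see the Corpus class.
| Name | Description |
|---|---|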
path | The directory or filename to read from. Expects data in spaCy’s binary .spacy format. Union[str,Path] |
gold_preproc | Whether to set up the Example object with gold-standard sentences and tokens for the predictions. See Corpus for details. bool |
max_length | Maximum document length. Longer documents will be split into sentences, if sentence boundaries are available. Defaults to 0 for no limit. int |
limit | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. int |
augmenter | Apply some simple data augmentation, where we replace tokens with variations. This is especially useful for punctuation and case replacement, to help generalize beyond corpora that don’t have smart quotes, or only have smart quotes, etc. Defaults to None. Optional[Callable] |
| CREATES | The corpus reader. Corpus |
spacy.JsonlCorpus.v1 registered function
Create Example objects from a JSONL (newline-delimited JSON) file of texts keyed by "text". Can be used to read the raw text corpus for language model pretraining from a JSONL file. Also see the JsonlCorpus class.
| Name | Description |
|---|---|
path | The directory or filename to read from. Expects newline-delimited JSON with a key "text" for each record. Union[str,Path] |
min_length | Minimum document length (in tokens). Shorter documents will be skipped. Defaults to 0, which indicates no limit. int |
max_length | Maximum document length (in tokens). Longer documents will be skipped. Defaults to 0, which indicates no limit. int |
limit | Limit corpus to a subset of examples, e.g. for debugging. Defaults to 0 for no limit. int |
| CREATES | The corpus reader. JsonlCorpus |
Batchers v3.0
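A data batcher implements a batching strategy that essentially turns a stream of items into a stream of batches, with each batch consisting of one item or a list of items. During training, the models update their weights after processing one batch at a time. Typical batching strategies include presenting the training data as a stream of batches with similar sizes, or with increasing batch sizes. See the Thinc documentation on schedules for a few standard examples.
Instead of using one of the built-in batchers listed here, you can also implement your own, which may or may not use a custom schedule.
spacy.batch_by_words.v1 registered function
Create minibatches of roughly a given number of words. If any examples are longer than the specified batch length, they will appear in a batch by themselves, or be discarded if discard_oversize is set to True. The seqs argument can be a list of strings, Doc objects or Example objects.
| Name | Description |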
|---|---|
seqs | The sequences to minibatch. Iterable[Any] |
size | The target number of words per batch. Can also be a block referencing a schedule, e.g. compounding. Union[int, Sequence[int]] |
tolerance | What percentage of the size to allow batches to exceed. float |
discard_oversize | Whether to discard sequences that by themselves exceed the tolerated size. bool |
get_length | Optional function that receives a sequence item and returns its length. Defaults to the built-in len() if not set. Optional[Callable[[Any], int]] |
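| CREATES | The batcher that takes an iterable of items and returns batches. Callable[[Iterable[Any]], Iterable[List[Any]]] |
spacy.batch_by_sequence.v1 registered function
Create a batcher that creates batches of the specified size.
| Name | Description |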
|---|---|
size | The target number of items per batch. Can also be a block referencing a schedule, e.g. compounding. Union[int, Sequence[int]] |
get_length | Optional function that receives a sequence item and returns its length. Defaults to the built-in len() if not set. Optional[Callable[[Any], int]] |
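| CREATES | The batcher that takes an iterable of items and returns batches. Callable[[Iterable[Any]], Iterable[List[Any]]] |
spacy.batch_by_padded.v1 registered function
Minibatch a sequence by the size of padded batches that would result, with sequences binned by length within a window. The padded size is defined as the maximum length of sequences within the batch multiplied by the number of sequences in the batch.
| Name | Description |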
|---|---|
size | The largest padded size to batch sequences into. Can also be a block referencing a schedule, e.g. compounding. Union[int, Sequence[int]] |
buffer | The number of sequences to accumulate before sorting by length. A larger buffer will result in more even sizing, but if the buffer is very large, the iteration order will be less random, which can result in suboptimal training. int |
discard_oversize | Whether to discard sequences that are by themselves longer than the largest padded batch size. bool |
get_length | Optional function that receives a sequence item and returns its length. Defaults to the built-in len() if not set. Optional[Callable[[Any], int]] |
| CREATES | The batcher that takes an iterable of items and returns batches. Callable[[Iterable[Any]], Iterable[List[Any]]] |
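The padded-size definition can be worked through with a few numbers. The sketch below is not spaCy's batching code: padded_size and total_padded are hypothetical helpers that only show why grouping sequences of similar length reduces padding.

```python
from typing import List


def padded_size(lengths: List[int]) -> int:
    """Padded size of one batch: longest sequence times number of sequences."""
    return max(lengths) * len(lengths) if lengths else 0


def total_padded(lengths: List[int], per_batch: int) -> int:
    """Sum of padded sizes when cutting the list into consecutive batches."""
    batches = [lengths[i:i + per_batch] for i in range(0, len(lengths), per_batch)]
    return sum(padded_size(batch) for batch in batches)


lengths = [2, 9, 3, 10]

# Batching in arrival order pairs short sequences with long ones.
unsorted_total = total_padded(lengths, per_batch=2)
# Sorting by length first keeps similar lengths together.
sorted_total = total_padded(sorted(lengths), per_batch=2)
print(unsorted_total, sorted_total)
```

Here the unsorted batches cost 38 padded positions and the length-sorted ones 26, which is the trade-off the buffer setting controls: a larger buffer allows more even sizing at the cost of less random iteration order.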
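Augmenters v3.0
Data augmentation is the process of applying small modifications to the training data. It can be especially useful for punctuation and case replacement, for example, if your corpus only uses smart quotes and you want to include variations using regular quotes, or to make the model less sensitive to capitalization by including a mix of capitalized and lowercase examples. See the usage guide for details and examples.
spacy.orth_variants.v1 registered function
Create a data augmentation callback that uses orth-variant replacement. The callback can be added to a corpus or other data iterator during training. It's especially useful for punctuation and case replacement, to help generalize beyond corpora that don't have smart quotes, or only have smart quotes, etc.
| Name | Description |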
|---|---|
level | The percentage of texts that will be augmented. float |
lower | The percentage of texts that will be lowercased. float |
orth_variants | A dictionary containing the single and paired orth variants. Typically loaded from a JSON file. See en_orth_variants.json for an example. Dict[str, Dict[List[Union[str, List[str]]]]] |
| CREATES | A function that takes the current nlp object and an Example and yields augmented Example objects. Callable[[Language,Example], Iterator[Example]] |
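spacy.lower_case.v1 registered function
Create a data augmentation callback that lowercases documents. The callback can be added to a corpus or other data iterator during training. It's especially useful for making the model less sensitive to capitalization.
| Name | Description |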
|---|---|
level | The percentage of texts that will be augmented. float |
| CREATES | A function that takes the current nlp object and an Example and yields augmented Example objects. Callable[[Language,Example], Iterator[Example]] |
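The effect of level can be illustrated on plain strings. This is a simplified sketch, not the registered function: lower_case_texts is a hypothetical stand-in that works on strings instead of Example objects and takes an explicit random generator so the result is reproducible.

```python
import random
from typing import Iterable, Iterator


def lower_case_texts(
    texts: Iterable[str], level: float, rng: random.Random
) -> Iterator[str]:
    """Yield each text lowercased with probability `level`, otherwise unchanged."""
    for text in texts:
        if rng.random() < level:
            yield text.lower()
        else:
            yield text


texts = ["Apple is looking at buying a U.K. startup", "San Francisco"]
rng = random.Random(0)

# level=1.0 augments every text, level=0.0 leaves all texts untouched.
all_lower = list(lower_case_texts(texts, level=1.0, rng=rng))
unchanged = list(lower_case_texts(texts, level=0.0, rng=rng))
print(all_lower)
```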
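Callbacks v3.0
The config supports callbacks at several points in the lifecycle that can be used to modify the nlp object.
spacy.copy_from_base_model.v1 registered function
Copy the tokenizer and/or vocab from the specified models. It's similar to the v2 base model option and useful in combination with sourced components when fine-tuning an existing pipeline. The vocab includes the lookups and the vectors from the specified model. Intended for use in [initialize.before_init].
| Name | Description |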
|---|---|
tokenizer | The pipeline to copy the tokenizer from. Defaults to None. Optional[str] |
vocab | The pipeline to copy the vocab from. The vocab includes the lookups and vectors. Defaults to None. Optional[str] |
| CREATES | A function that takes the current nlp object and modifies its tokenizer and vocab. Callable[[Language], None] |
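spacy.models_with_nvtx_range.v1 registered function
Recursively wrap the models in each pipe using NVTX range markers. These markers aid in GPU profiling by attributing specific operations to a Model's forward or backprop passes.
| Name | Description |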
|---|---|
forward_color | Color identifier for forward passes. Defaults to -1. int |
backprop_color | Color identifier for backpropagation passes. Defaults to -1. int |
| CREATES | A function that takes the current nlp and wraps forward/backprop passes in NVTX ranges. Callable[[Language],Language] |
spacy.models_and_pipes_with_nvtx_range.v1 registered function v3.4
Recursively wrap both the models and methods of each pipe using NVTX range markers. By default, the following methods are wrapped: pipe, predict, set_annotations, update, rehearse, get_loss, initialize, begin_update, finish_update.
| Name | Description |
|---|---|
forward_color | Color identifier for model forward passes. Defaults to -1. int |
backprop_color | Color identifier for model backpropagation passes. Defaults to -1. int |
additional_pipe_functions | Additional pipeline methods to wrap. Keys are pipeline names and values are lists of method identifiers. Defaults to None. Optional[Dict[str, List[str]]] |
| CREATES | A function that takes the current nlp and wraps pipe models and methods in NVTX ranges. Callable[[Language],Language] |
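Training data and alignment
training.offsets_to_biluo_tags function
Encode labelled spans into per-token tags, using the BILUO scheme (Begin, In, Last, Unit, Out). Returns a list of strings, describing the tags. Each tag string will be in the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". The string "-" is used where the entity offsets don't align with the tokenization in the Doc object. The training algorithm will view these as missing values. O denotes a non-entity token. B denotes the beginning of a multi-token entity, I the inside of an entity of three or more tokens, and L the end of an entity of two or more tokens. U denotes a single-token entity.
| Name | Description |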
|---|---|
doc | The document that the entity offsets refer to. The output tags will refer to the token boundaries within the document. Doc |
entities | A sequence of (start, end, label) triples. start and end should be character-offset integers denoting the slice into the original string. List[Tuple[int, int, Union[str, int]]] |
missing | The label used for missing values, e.g. if tokenization doesn’t align with the entity offsets. Defaults to "O". str |
| RETURNS | A list of strings, describing the BILUO tags. List[str] |
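How the B, I, L and U actions depend on entity length can be shown with a short sketch. biluo_for_entity is a hypothetical helper for illustration. It works on token counts only and does not perform the character-offset alignment that the function above does.

```python
from typing import List


def biluo_for_entity(n_tokens: int, label: str) -> List[str]:
    """Return the BILUO tags for a single entity spanning `n_tokens` tokens."""
    if n_tokens < 1:
        raise ValueError("an entity must cover at least one token")
    if n_tokens == 1:
        # Single-token entities are tagged as a Unit.
        return [f"U-{label}"]
    # Multi-token entities start with B, end with L, and use I in between.
    return [f"B-{label}"] + [f"I-{label}"] * (n_tokens - 2) + [f"L-{label}"]


one = biluo_for_entity(1, "LOC")
two = biluo_for_entity(2, "ORG")
four = biluo_for_entity(4, "ORG")
print(one, two, four)
```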
training.biluo_tags_to_offsets function
Encode per-token tags following the BILUO scheme into entity offsets.
| Name | Description |
|---|---|
doc | The document that the BILUO tags refer to. Doc |
tags | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". List[str] |
| RETURNS | A sequence of (start, end, label) triples. start and end will be character-offset integers denoting the slice into the original string. List[Tuple[int, int, str]] |
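training.biluo_tags_to_spans function
Encode per-token tags following the BILUO scheme into Span objects. This can be used to create entity spans from token-based tags, e.g. to overwrite the doc.ents.
| Name | Description |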
|---|---|
doc | The document that the BILUO tags refer to. Doc |
tags | A sequence of BILUO tags with each tag describing one token. Each tag string will be of the form of either "", "O" or "{action}-{label}", where action is one of "B", "I", "L", "U". List[str] |
| RETURNS | A sequence of Span objects with added entity labels. List[Span] |
training.biluo_to_iob function
Convert a sequence of BILUO tags to IOB tags. This is useful if you want to use the BILUO tags with a model that only supports IOB tags.
| Name | Description |
|---|---|
tags | A sequence of BILUO tags. Iterable[str] |
| RETURNS | A list of IOB tags. List[str] |
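The mapping between the two schemes is small enough to write out. The function below is an illustrative re-implementation, not the library code: IOB has no L or U actions, so L becomes I and U becomes B.

```python
from typing import Iterable, List


def biluo_to_iob_sketch(tags: Iterable[str]) -> List[str]:
    """Map BILUO tags to IOB tags: U-X -> B-X, L-X -> I-X, others unchanged."""
    iob = []
    for tag in tags:
        if tag.startswith("U-"):
            tag = "B-" + tag[2:]
        elif tag.startswith("L-"):
            tag = "I-" + tag[2:]
        iob.append(tag)
    return iob


biluo_tags = ["O", "B-ORG", "I-ORG", "L-ORG", "O", "U-GPE"]
iob_tags = biluo_to_iob_sketch(biluo_tags)
print(iob_tags)
# With spaCy installed, the same conversion is available as:
# from spacy.training import biluo_to_iob
```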
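training.iob_to_biluo function
Convert a sequence of IOB tags to BILUO tags. This is useful if you want to use the IOB tags with a model that only supports BILUO tags.
| Name | Description |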
|---|---|
tags | A sequence of IOB tags. Iterable[str] |
| RETURNS | A list of BILUO tags. List[str] |
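Utility functions
spaCy comes with a small collection of utility functions located in spacy/util.py. Because utility functions are mostly intended for internal use within spaCy, their behavior may change with future releases. The functions documented on this page should be safe to use and we'll try to ensure backwards compatibility. However, we recommend having additional tests in place if your application depends on any of spaCy's utilities.
util.get_lang_class function
Import and load a Language class. Allows lazy-loading language data and
importing languages using the two-letter language code. To add a language
code for a custom language class, you can register it using the
@registry.languages decorator.
| Name | Description |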
|---|---|
lang | Two-letter language code, e.g. "en". str |
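| RETURNS | The respective subclass. Language |
util.lang_class_is_loaded function
Check whether a Language subclass is already loaded. Language subclasses are loaded lazily to avoid expensive setup code associated with the language data.
| Name | Description |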
|---|---|
name | Two-letter language code, e.g. "en". str |
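| RETURNS | Whether the class has been loaded. bool |
util.load_model function
Load a pipeline from a package or data path. If called with a string name, spaCy will assume the pipeline is a Python package and import and call its load() method. If called with a path, spaCy will assume it's a data directory, read the language and pipeline settings from the config.cfg and create a Language object. The model data will then be loaded in via Language.from_disk.
| Name | Description |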
|---|---|
name | Package name or path. str |
| keyword-only | |
vocab | Optional shared vocab to pass in on initialization. If True (default), a new Vocab object will be created. Union[Vocab, bool] |
disable | Name(s) of pipeline component(s) to disable. Disabled pipes will be loaded but they won’t be run unless you explicitly enable them by calling nlp.enable_pipe. Union[str, Iterable[str]] |
enable v3.4 | Name(s) of pipeline component(s) to enable. All other pipes will be disabled, but can be enabled again using nlp.enable_pipe. Union[str, Iterable[str]] |
exclude | Name(s) of pipeline component(s) to exclude. Excluded components won’t be loaded. Union[str, Iterable[str]] |
config v3.0 | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. "nlp.pipeline". Union[Dict[str, Any],Config] |
| RETURNS | Language class with the loaded pipeline. Language |
util.load_model_from_init_py function
A helper function to use in the load() method of a pipeline package's __init__.py.
| Name | Description |
|---|---|
init_file | Path to package’s __init__.py, i.e. __file__. Union[str,Path] |
| keyword-only | |
vocab v3.0 | Optional shared vocab to pass in on initialization. If True (default), a new Vocab object will be created. Union[Vocab, bool] |
disable | Name(s) of pipeline component(s) to disable. Disabled pipes will be loaded but they won’t be run unless you explicitly enable them by calling nlp.enable_pipe. Union[str, Iterable[str]] |
enable v3.4 | Name(s) of pipeline component(s) to enable. All other pipes will be disabled, but can be enabled again using nlp.enable_pipe. Union[str, Iterable[str]] |
exclude v3.0 | Name(s) of pipeline component(s) to exclude. Excluded components won’t be loaded. Union[str, Iterable[str]] |
config v3.0 | Config overrides as nested dict or flat dict keyed by section values in dot notation, e.g. "nlp.pipeline". Union[Dict[str, Any],Config] |
| RETURNS | Language class with the loaded pipeline. Language |
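util.load_config function v3.0
Load a pipeline's config.cfg from a file path. The config typically includes details about the components and how they're created, as well as all training settings and hyperparameters.
| Name | Description |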
|---|---|
path | Path to the pipeline’s config.cfg. Union[str,Path] |
overrides | Optional config overrides to replace in loaded config. Can be provided as nested dict, or as flat dict with keys in dot notation, e.g. "nlp.pipeline". Dict[str, Any] |
interpolate | Whether to interpolate the config and replace variables like ${paths.train} with their values. Defaults to False. bool |
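| RETURNS | The pipeline's config. Config |
util.load_meta function v3.0
Get a pipeline's meta.json from a file path and validate its contents. The meta typically includes details about author, licensing, data sources and version.
| Name | Description |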
|---|---|
path | Path to the pipeline’s meta.json. Union[str,Path] |
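| RETURNS | The pipeline's meta data. Dict[str, Any] |
util.get_installed_models function v3.0
List all pipeline packages installed in the current environment. This will include any spaCy pipeline that was packaged with spacy package. Under the hood, pipeline packages expose a Python entry point that spaCy can check, without having to load the nlp object.
| Name | Description |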
|---|---|
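| RETURNS | The string names of the pipelines installed in the current environment. List[str] |
util.is_package function
Check if a string maps to a package installed via pip. Mainly used to validate pipeline packages.
| Name | Description |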
|---|---|
name | Name of package. str |
| RETURNS | True if installed package, False if not. bool |
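util.get_package_path function
Get the path to an installed package. Mainly used to resolve the location of pipeline packages. Currently imports the package to find its path.
| Name | Description |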
|---|---|
package_name | Name of installed package. str |
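| RETURNS | Path to the pipeline package directory. Path |
util.is_in_jupyter function
Check if the user is running spaCy from a Jupyter notebook by detecting the IPython kernel. Mainly used for the displacy visualizer.
| Name | Description |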
|---|---|
| RETURNS | True if in Jupyter, False if not. bool |
util.compile_prefix_regex function
Compile a sequence of prefix rules into a regex object.
| Name | Description |
|---|---|
entries | The prefix rules, e.g. lang.punctuation.TOKENIZER_PREFIXES. Iterable[Union[str,Pattern]] |
| RETURNS | The regex object to be used for Tokenizer.prefix_search. Pattern |
util.compile_suffix_regex function
Compile a sequence of suffix rules into a regex object.
| Name | Description |
|---|---|
entries | The suffix rules, e.g. lang.punctuation.TOKENIZER_SUFFIXES. Iterable[Union[str,Pattern]] |
| RETURNS | The regex object to be used for Tokenizer.suffix_search. Pattern |
util.compile_infix_regex function
Compile a sequence of infix rules into a regex object.
| Name | Description |
|---|---|
entries | The infix rules, e.g. lang.punctuation.TOKENIZER_INFIXES. Iterable[Union[str,Pattern]] |
| RETURNS | The regex object to be used for Tokenizer.infix_finditer. Pattern |
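What "compiling rules into a regex object" amounts to for the three functions above can be sketched with the standard re module. The rule lists are made up for illustration, and joining the rules with | while anchoring prefixes with ^ and suffixes with $ is an assumption about how such rules are combined, not a copy of the library code.

```python
import re

# Illustrative rule lists; real pipelines use the language's punctuation rules.
prefixes = [r"\(", r'"', r"\$"]
suffixes = [r"\)", r'"', r"%"]
infixes = [r"-", r"/"]

# Prefixes must match at the start of the string, suffixes at the end,
# and infixes anywhere inside it.
prefix_re = re.compile("|".join("^" + piece for piece in prefixes))
suffix_re = re.compile("|".join(piece + "$" for piece in suffixes))
infix_re = re.compile("|".join(infixes))

prefix_match = prefix_re.search('("hello').group()
suffix_match = suffix_re.search("50%").group()
infix_matches = [m.group() for m in infix_re.finditer("read-write/exec")]
print(prefix_match, suffix_match, infix_matches)
```

The bound methods prefix_re.search, suffix_re.search and infix_re.finditer have the shape that the tables above describe for Tokenizer.prefix_search, Tokenizer.suffix_search and Tokenizer.infix_finditer.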
util.minibatch function
Iterate over batches of items. size may be an iterator, so that the batch size can vary on each step.
| Name | Description |
|---|---|
items | The items to batch up. Iterable[Any] |
size | The batch size(s). Union[int, Sequence[int]] |
| YIELDS | The batches. |
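The batching behavior, including a size that changes per step, can be sketched with itertools. minibatch_sketch is an illustrative stand-in rather than the library function. It stops when either the items or the sizes run out.

```python
import itertools
from typing import Any, Iterable, Iterator, List, Sequence, Union


def minibatch_sketch(
    items: Iterable[Any], size: Union[int, Sequence[int]]
) -> Iterator[List[Any]]:
    """Yield lists of items; `size` is a fixed int or a sequence of batch sizes."""
    sizes = itertools.repeat(size) if isinstance(size, int) else iter(size)
    item_iter = iter(items)
    for batch_size in sizes:
        # Take the next `batch_size` items; an empty batch means we are done.
        batch = list(itertools.islice(item_iter, batch_size))
        if not batch:
            return
        yield batch


fixed = list(minibatch_sketch(range(7), 3))
growing = list(minibatch_sketch(range(7), [1, 2, 4]))
print(fixed)
print(growing)
```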
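util.filter_spans function
Filter a sequence of Span objects and remove duplicates or overlaps. Useful for creating named entities (where one token can only be part of one entity) or when merging spans with Retokenizer.merge. When spans overlap, the (first) longest span is preferred over shorter spans.
| Name | Description |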
|---|---|
spans | The spans to filter. Iterable[Span] |
| RETURNS | The filtered spans. List[Span] |
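The preference for the first longest span can be sketched with plain (start, end, label) tuples, where end is exclusive. filter_spans_sketch is an illustrative re-implementation on tuples, not the library function, which operates on Span objects.

```python
from typing import List, Set, Tuple

SpanTuple = Tuple[int, int, str]  # (start token, end token exclusive, label)


def filter_spans_sketch(spans: List[SpanTuple]) -> List[SpanTuple]:
    """Keep the longest spans first and drop anything overlapping a kept span."""
    # Longest spans first; among equal lengths, the earliest start wins.
    ordered = sorted(spans, key=lambda s: (s[1] - s[0], -s[0]), reverse=True)
    kept: List[SpanTuple] = []
    seen_tokens: Set[int] = set()
    for start, end, label in ordered:
        if seen_tokens.isdisjoint(range(start, end)):
            kept.append((start, end, label))
            seen_tokens.update(range(start, end))
    # Return the survivors in document order.
    return sorted(kept, key=lambda s: s[0])


spans = [(0, 2, "A"), (1, 4, "B"), (3, 4, "C"), (5, 6, "D"), (0, 2, "A")]
filtered = filter_spans_sketch(spans)
print(filtered)
```

In this example span B covers tokens 1 to 3 and is the longest, so A (including its duplicate) and C are dropped because they share tokens with it, while D survives.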
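util.get_words_and_spaces function v3.0
Given a list of words and a text, reconstruct the original tokens and return a list of words and spaces that can be used to create a Doc. This can help recover destructive tokenization that didn't preserve any whitespace information.
| Name | Description |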
|---|---|
words | The list of words. Iterable[str] |
text | The original text. str |
| RETURNS | A list of words and a list of boolean values indicating whether the word at this position is followed by a space. Tuple[List[str], List[bool]] |
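The idea of recovering the spaces list from the original text can be sketched as follows. words_and_spaces_sketch is a simplified, hypothetical helper: it assumes every word occurs in the text in order and that words are separated by at most one space, so it does not cover runs of whitespace.

```python
from typing import List, Tuple


def words_and_spaces_sketch(words: List[str], text: str) -> Tuple[List[str], List[bool]]:
    """Align words against the text and record whether a space follows each one."""
    spaces: List[bool] = []
    position = 0
    for word in words:
        # Find the word at or after the current position (raises if missing).
        start = text.index(word, position)
        position = start + len(word)
        followed_by_space = position < len(text) and text[position] == " "
        spaces.append(followed_by_space)
        if followed_by_space:
            position += 1
    return list(words), spaces


words, spaces = words_and_spaces_sketch(["Hello", ",", "world", "!"], "Hello, world!")
print(words, spaces)
# With spaCy installed, the two lists can be used to build a Doc:
# Doc(nlp.vocab, words=words, spaces=spaces)
```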