camel.datagen.source2synth package

Submodules

camel.datagen.source2synth.data_processor module

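class camel.datagen.source2synth.data_processor.DataCurator(config: ProcessorConfig, rng: Random)

Bases: object

Manages and curates datasets of multi-hop question-answer pairs.

This class handles dataset management tasks, including quality filtering, complexity filtering, deduplication, and dataset sampling.

config

Configuration for data curation parameters.

Type: ProcessorConfig

rng

Random number generator for reproducible sampling.

Type: random.Random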

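curate_dataset(examples: List[Dict[str, Any]]) → List[Dict[str, Any]]

Manage and curate a dataset through multiple filtering stages.

Parameters: examples (List[Dict[str, Any]]) – List of examples to curate.
Returns: The curated dataset, filtered to meet the quality criteria.
Return type: List[Dict[str, Any]]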
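The staged curation described for this class (quality filtering, complexity filtering, deduplication, sampling) can be pictured with a small stand-alone sketch. This is not CAMEL's implementation: the function name `curate_sketch`, the `"complexity"` key, and the specific filtering rules are hypothetical and chosen only to illustrate the order of the stages and the role of the injected `random.Random`.

```python
import random
from typing import Any, Dict, List


def curate_sketch(
    examples: List[Dict[str, Any]],
    rng: random.Random,
    complexity_threshold: float = 0.5,
    dataset_size: int = 1000,
) -> List[Dict[str, Any]]:
    """Illustrative stand-in for a multi-stage curation pass (not CAMEL's code)."""
    # Stage 1: quality filter. Here "quality" only means a non-empty
    # question and answer (a hypothetical rule).
    kept = [ex for ex in examples if ex.get("question") and ex.get("answer")]

    # Stage 2: complexity filter against a hypothetical per-example score.
    kept = [ex for ex in kept if ex.get("complexity", 0.0) >= complexity_threshold]

    # Stage 3: deduplicate on normalized question text, keeping the first hit.
    seen = set()
    unique = []
    for ex in kept:
        key = ex["question"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(ex)

    # Stage 4: sample down to the target size with the injected RNG so that
    # the same seed yields the same subset.
    if len(unique) > dataset_size:
        unique = rng.sample(unique, dataset_size)
    return unique


examples = [
    {"question": "Q1", "answer": "A", "complexity": 0.9},
    {"question": "q1 ", "answer": "A", "complexity": 0.8},  # duplicate of Q1
    {"question": "Q2", "answer": "", "complexity": 0.9},    # fails quality
    {"question": "Q3", "answer": "B", "complexity": 0.1},   # fails complexity
    {"question": "Q4", "answer": "C", "complexity": 0.7},
]
curated = curate_sketch(examples, random.Random(0))
```

Passing the generator in, rather than creating one inside the function, is what makes the sampling stage repeatable across runs with the same seed.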

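class camel.datagen.source2synth.data_processor.ExampleConstructor(config: ProcessorConfig, multi_hop_agent: MultiHopGeneratorAgent | None = None)

Bases: object

Constructs training examples from raw text data.

This class builds training examples by preprocessing text, extracting information pairs, and generating question-answer pairs.

config

Configuration for example construction.

Type: ProcessorConfig

multi_hop_agent

Agent used to generate QA pairs.

Type: Optional[MultiHopGeneratorAgent]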

construct_examples(raw_data: List[Dict[str, Any]]) → List[Dict[str, Any]]

Construct training examples from raw data.

Parameters:

raw_data (List[Dict[str, Any]]) – List of raw data dictionaries containing text and metadata.

Returns:

List of constructed examples with QA pairs and metadata.

Return type:

List[Dict[str, Any]]

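class camel.datagen.source2synth.data_processor.UserDataProcessor(config: ProcessorConfig | None = None)

Bases: object

A processor for generating multi-hop question-answer pairs from user data.

This class processes text data and generates multi-hop question-answer pairs using either an AI model or rule-based approaches. It manages the entire pipeline, from text preprocessing to dataset curation.

config

Configuration for data processing parameters.

Type: ProcessorConfig

rng

Random number generator for reproducibility.

Type: random.Random

multi_hop_agent

Agent used to generate QA pairs.

Type: Optional[MultiHopGeneratorAgent]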

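process_batch(texts: List[str], sources: List[str] | None = None) → List[Dict[str, Any]]

Process multiple texts in a batch to generate multi-hop QA pairs.

Parameters:

texts (List[str]) – List of input texts to process.
sources (Optional[List[str]], optional) – List of source identifiers. (default: None)

Returns:

List of processed examples with QA pairs and metadata.

Return type:

List[Dict[str, Any]]

Raises:

ValueError – If the length of sources does not match the length of texts.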
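The length check behind the documented `ValueError` can be sketched as follows. This is a hypothetical stand-in, not the library's code: the name `process_batch_sketch` and the returned dictionary keys are invented for illustration, and falling back to the `"user_input"` label when `sources` is `None` is an assumption borrowed from the default documented for `process_text`.

```python
from typing import Any, Dict, List, Optional


def process_batch_sketch(
    texts: List[str], sources: Optional[List[str]] = None
) -> List[Dict[str, Any]]:
    """Illustrative stand-in: pair each text with a source label."""
    if sources is None:
        # Assumption: reuse the default label documented for process_text.
        sources = ["user_input"] * len(texts)
    elif len(sources) != len(texts):
        # Mirrors the documented failure mode: one source per text is required.
        raise ValueError("Length of sources must match length of texts")
    return [{"text": t, "source": s} for t, s in zip(texts, sources)]


paired = process_batch_sketch(["first text", "second text"], ["doc_a", "doc_b"])
defaulted = process_batch_sketch(["only text"])
```

Validating the lengths up front avoids `zip` silently truncating the longer list, which would otherwise drop texts or sources without any error.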

process_text(text: str, source: str = 'user_input') → List[Dict[str, Any]]

Process a single text to generate multi-hop QA pairs.

Parameters:

text (str) – Input text to process.
source (str, optional) – Source identifier for the text. (default: "user_input")

Returns:

List of processed examples with QA pairs and metadata.

Return type:

List[Dict[str, Any]]

camel.datagen.source2synth.models module

class camel.datagen.source2synth.models.ContextPrompt(*, main_context: str, related_contexts: List[str] | None = None)

Bases: BaseModel

A context prompt for generating multi-hop question-answer pairs.

main_context

The primary context for generating QA pairs.

Type: str

related_contexts

Additional related contexts.

Type: Optional[List[str]]

main_context: str

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

related_contexts: List[str] | None

class camel.datagen.source2synth.models.MultiHopQA(*, question: str, reasoning_steps: List[ReasoningStep], answer: str, supporting_facts: List[str], type: str)

Bases: BaseModel

A multi-hop question-answer pair with reasoning steps and supporting facts.

question

The question requiring multi-hop reasoning.

Type: str

reasoning_steps

List of reasoning steps needed to answer the question.

Type: List[ReasoningStep]

answer

The final answer to the question.

Type: str

supporting_facts

List of facts supporting the reasoning.

Type: List[str]

type

The type of the question-answer pair.

Type: str

class Config

Bases: object

json_schema_extra: ClassVar[Dict[str, Any]] = {'example': {'answer': 'Paris', 'question': 'What is the capital of France?', 'reasoning_steps': [{'step': 'Identify the country France.'}, {'step': 'Find the capital city of France.'}], 'supporting_facts': ['France is a country in Europe.', 'Paris is the capital city of France.'], 'type': 'multi_hop_qa'}}

answer: str

model_config: ClassVar[ConfigDict] = {'json_schema_extra': {'example': {'answer': 'Paris', 'question': 'What is the capital of France?', 'reasoning_steps': [{'step': 'Identify the country France.'}, {'step': 'Find the capital city of France.'}], 'supporting_facts': ['France is a country in Europe.', 'Paris is the capital city of France.'], 'type': 'multi_hop_qa'}}}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

question: str

reasoning_steps: List[ReasoningStep]

supporting_facts: List[str]

type: str
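To make the field layout concrete, the sketch below loads the `json_schema_extra` example shown above into plain dataclasses. These are hypothetical stand-ins (`MultiHopQASketch`, `ReasoningStepSketch`, and `from_dict` are not part of CAMEL) that only mirror the documented field names; the real classes are pydantic models and perform their own validation.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class ReasoningStepSketch:
    # Mirrors the single documented field of ReasoningStep.
    step: str


@dataclass
class MultiHopQASketch:
    # Field names follow the documented MultiHopQA signature.
    question: str
    reasoning_steps: List[ReasoningStepSketch]
    answer: str
    supporting_facts: List[str]
    type: str

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "MultiHopQASketch":
        """Build the nested step objects, then the QA record itself."""
        steps = [ReasoningStepSketch(**s) for s in data["reasoning_steps"]]
        return cls(
            question=data["question"],
            reasoning_steps=steps,
            answer=data["answer"],
            supporting_facts=list(data["supporting_facts"]),
            type=data["type"],
        )


# The example payload from json_schema_extra, copied as-is.
example = {
    "answer": "Paris",
    "question": "What is the capital of France?",
    "reasoning_steps": [
        {"step": "Identify the country France."},
        {"step": "Find the capital city of France."},
    ],
    "supporting_facts": [
        "France is a country in Europe.",
        "Paris is the capital city of France.",
    ],
    "type": "multi_hop_qa",
}
qa = MultiHopQASketch.from_dict(example)
```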

class camel.datagen.source2synth.models.ReasoningStep(*, step: str)

Bases: BaseModel

A single step in a multi-hop reasoning process.

step

The textual description of the reasoning step.

Type: str

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

step: str

camel.datagen.source2synth.user_data_processor_config module

class camel.datagen.source2synth.user_data_processor_config.ProcessorConfig(*, seed: int = <factory>, min_length: int = 50, max_length: int = 512, complexity_threshold: float = 0.5, dataset_size: int = 1000, use_ai_model: bool = True, hop_generating_agent: ~camel.agents.multi_hop_generator_agent.MultiHopGeneratorAgent = <factory>)

Bases: BaseModel

Data processing configuration class.

complexity_threshold: float

dataset_size: int

hop_generating_agent: MultiHopGeneratorAgent

max_length: int

min_length: int

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'frozen': False, 'protected_namespaces': (), 'validate_assignment': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

seed: int

use_ai_model: bool
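The defaults in the signature above, and the way a seed feeds a reproducible `random.Random`, can be illustrated with a stand-alone dataclass. `ProcessorConfigSketch` is a hypothetical stand-in, not the real pydantic model: it omits `hop_generating_agent`, and because the signature only shows `<factory>` for `seed`, the random-integer factory used here is an assumption.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ProcessorConfigSketch:
    # Assumption: the real seed factory is not shown in the signature;
    # a random integer stands in for it here.
    seed: int = field(default_factory=lambda: random.randint(0, 1000))
    # The remaining defaults are copied from the documented signature.
    min_length: int = 50
    max_length: int = 512
    complexity_threshold: float = 0.5
    dataset_size: int = 1000
    use_ai_model: bool = True


# Fixing the seed makes any sampling driven by it repeatable.
config = ProcessorConfigSketch(seed=42, dataset_size=3)
rng_a = random.Random(config.seed)
rng_b = random.Random(config.seed)
sample_a = rng_a.sample(range(100), config.dataset_size)
sample_b = rng_b.sample(range(100), config.dataset_size)
```

Two generators built from the same seed draw the same sample, which is the property the `rng` attributes on `DataCurator` and `UserDataProcessor` are documented to provide.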

Module contents

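class camel.datagen.source2synth.DataCurator(config: ProcessorConfig, rng: Random)

Bases: object

Manages and curates datasets of multi-hop question-answer pairs.

This class handles dataset management tasks, including quality filtering, complexity filtering, deduplication, and dataset sampling.

config

Configuration for data curation parameters.

Type: ProcessorConfig

rng

Random number generator for reproducible sampling.

Type: random.Random

curate_dataset(examples: List[Dict[str, Any]]) → List[Dict[str, Any]]

Manage and curate a dataset through multiple filtering stages.

Parameters: examples (List[Dict[str, Any]]) – List of examples to curate.
Returns: The curated dataset, filtered to meet the quality criteria.
Return type: List[Dict[str, Any]]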

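class camel.datagen.source2synth.ExampleConstructor(config: ProcessorConfig, multi_hop_agent: MultiHopGeneratorAgent | None = None)

Bases: object

Constructs training examples from raw text data.

This class builds training examples by preprocessing text, extracting information pairs, and generating question-answer pairs.

config

Configuration for example construction.

Type: ProcessorConfig

multi_hop_agent

Agent used to generate QA pairs.

Type: Optional[MultiHopGeneratorAgent]

construct_examples(raw_data: List[Dict[str, Any]]) → List[Dict[str, Any]]

Construct training examples from raw data.

Parameters:

raw_data (List[Dict[str, Any]]) – List of raw data dictionaries containing text and metadata.

Returns:

List of constructed examples with QA pairs and metadata.

Return type:

List[Dict[str, Any]]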

class camel.datagen.source2synth.MultiHopQA(*, question: str, reasoning_steps: List[ReasoningStep], answer: str, supporting_facts: List[str], type: str)

Bases: BaseModel

A multi-hop question-answer pair with reasoning steps and supporting facts.

question

The question requiring multi-hop reasoning.

Type: str

reasoning_steps

List of reasoning steps needed to answer the question.

Type: List[ReasoningStep]

answer

The final answer to the question.

Type: str

supporting_facts

List of facts supporting the reasoning.

Type: List[str]

type

The type of the question-answer pair.

Type: str

class Config

Bases: object

json_schema_extra: ClassVar[Dict[str, Any]] = {'example': {'answer': 'Paris', 'question': 'What is the capital of France?', 'reasoning_steps': [{'step': 'Identify the country France.'}, {'step': 'Find the capital city of France.'}], 'supporting_facts': ['France is a country in Europe.', 'Paris is the capital city of France.'], 'type': 'multi_hop_qa'}}

answer: str

model_config: ClassVar[ConfigDict] = {'json_schema_extra': {'example': {'answer': 'Paris', 'question': 'What is the capital of France?', 'reasoning_steps': [{'step': 'Identify the country France.'}, {'step': 'Find the capital city of France.'}], 'supporting_facts': ['France is a country in Europe.', 'Paris is the capital city of France.'], 'type': 'multi_hop_qa'}}}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

question: str

reasoning_steps: List[ReasoningStep]

supporting_facts: List[str]

type: str

class camel.datagen.source2synth.ProcessorConfig(*, seed: int = <factory>, min_length: int = 50, max_length: int = 512, complexity_threshold: float = 0.5, dataset_size: int = 1000, use_ai_model: bool = True, hop_generating_agent: ~camel.agents.multi_hop_generator_agent.MultiHopGeneratorAgent = <factory>)

Bases: BaseModel

Data processing configuration class.

complexity_threshold: float

dataset_size: int

hop_generating_agent: MultiHopGeneratorAgent

max_length: int

min_length: int

model_config: ClassVar[ConfigDict] = {'arbitrary_types_allowed': True, 'frozen': False, 'protected_namespaces': (), 'validate_assignment': True}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

seed: int

use_ai_model: bool

class camel.datagen.source2synth.ReasoningStep(*, step: str)

Bases: BaseModel

A single step in a multi-hop reasoning process.

step

The textual description of the reasoning step.

Type: str

model_config: ClassVar[ConfigDict] = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

step: str

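class camel.datagen.source2synth.UserDataProcessor(config: ProcessorConfig | None = None)

Bases: object

A processor for generating multi-hop question-answer pairs from user data.

This class processes text data and generates multi-hop question-answer pairs using either an AI model or rule-based approaches. It manages the entire pipeline, from text preprocessing to dataset curation.

config

Configuration for data processing parameters.

Type: ProcessorConfig

rng

Random number generator for reproducibility.

Type: random.Random

multi_hop_agent

Agent used to generate QA pairs.

Type: Optional[MultiHopGeneratorAgent]

process_batch(texts: List[str], sources: List[str] | None = None) → List[Dict[str, Any]]

Process multiple texts in a batch to generate multi-hop QA pairs.

Parameters:

texts (List[str]) – List of input texts to process.
sources (Optional[List[str]], optional) – List of source identifiers. (default: None)

Returns:

List of processed examples with QA pairs and metadata.

Return type:

List[Dict[str, Any]]

Raises:

ValueError – If the length of sources does not match the length of texts.

process_text(text: str, source: str = 'user_input') → List[Dict[str, Any]]

Process a single text to generate multi-hop QA pairs.

Parameters:

text (str) – Input text to process.
source (str, optional) – Source identifier for the text. (default: "user_input")

Returns:

List of processed examples with QA pairs and metadata.

Return type:

List[Dict[str, Any]]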