torch_geometric.data.OnDiskDataset

class OnDiskDataset(root: str, transform: Optional[Callable] = None, pre_filter: Optional[Callable] = None, backend: str = 'sqlite', schema: Union[Any, Dict[str, Any], Tuple[Any], List[Any]] = object, log: bool = True)[source]

Bases: Dataset

A base class for creating large graph datasets which do not easily fit into CPU memory at once, by leveraging a Database backend for on-disk storage and access of data objects.

Parameters:
  • root (str) – Root directory where the dataset should be saved.

  • transform (callable, optional) – A function/transform that takes in a Data or HeteroData object and returns a transformed version. The data object will be transformed before every access. (default: None)

  • pre_filter (callable, optional) – A function that takes in a Data or HeteroData object and returns a boolean value, indicating whether the data object should be included in the final dataset. (default: None)

  • backend (str) – The Database backend to use (one of "sqlite" or "rocksdb"). (default: "sqlite")

  • schema (Any, Tuple[Any], List[Any] or Dict[str, Any], optional) – The schema of the input data. Can take int, float, str, object, or a dictionary with dtype and size keys (for specifying tensor data) as input, and can be nested as a tuple or dictionary. Specifying the schema will improve efficiency, since by default the database will use python pickling for serializing and deserializing. If specified to anything different than object, implementations of OnDiskDataset need to override the serialize() and deserialize() methods. (default: object)

  • log (bool, optional) – Whether to print any console output while downloading and processing the dataset. (default: True)
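
The schema-driven storage idea can be sketched independently of torch_geometric using only the standard library: a sqlite table whose rows hold pickled blobs, mirroring the default schema=object behavior described above. The class name DiskStore and its methods are hypothetical illustrations, not part of the torch_geometric API:

```python
import pickle
import sqlite3


class DiskStore:
    """Minimal sqlite-backed store that pickles whole objects,
    mimicking the default ``schema=object`` behaviour."""

    def __init__(self, path: str = ':memory:'):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            'CREATE TABLE IF NOT EXISTS store (id INTEGER PRIMARY KEY, data BLOB)')

    def append(self, obj) -> None:
        # Serialize with pickle, as the default object schema would.
        self.conn.execute('INSERT INTO store (data) VALUES (?)',
                          (pickle.dumps(obj),))

    def get(self, idx: int):
        # sqlite INTEGER PRIMARY KEY ids start at 1, so shift the 0-based index.
        row = self.conn.execute(
            'SELECT data FROM store WHERE id = ?', (idx + 1,)).fetchone()
        return pickle.loads(row[0])

    def close(self) -> None:
        self.conn.close()


store = DiskStore()
store.append({'x': [1.0, 2.0], 'edge_index': [[0, 1], [1, 0]]})
print(store.get(0)['x'])  # the object round-trips through the database
store.close()
```

Specifying a concrete schema avoids exactly this pickle round trip, which is why it improves efficiency.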

property processed_file_names: str

The name of the files in the self.processed_dir folder that must be present in order to skip processing.

Return type:

str

property db: Database

Returns the underlying Database.

Return type:

Database

close() → None[source]

Closes the connection to the underlying database.

Return type:

None

serialize(data: BaseData) → Any[source]

Serializes the Data or HeteroData object into the expected database schema.

Return type:

Any

deserialize(data: Any) → BaseData[source]

Deserializes the database entry into a Data or HeteroData object.

Return type:

BaseData
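
When a non-object schema is used, serialize() and deserialize() must map between the data object and the schema's flat representation. A minimal stdlib sketch of that round trip for a single float32 vector field, with hypothetical free functions standing in for the overridden methods:

```python
import struct


def serialize(data: dict) -> bytes:
    """Pack data['x'] (a list of floats) into raw float32 bytes for the DB."""
    x = data['x']
    return struct.pack(f'{len(x)}f', *x)


def deserialize(blob: bytes) -> dict:
    """Unpack the raw bytes back into a data-like dict."""
    n = len(blob) // 4  # a float32 occupies 4 bytes
    return {'x': list(struct.unpack(f'{n}f', blob))}


row = serialize({'x': [1.0, 2.0, 3.0]})
print(deserialize(row))  # {'x': [1.0, 2.0, 3.0]}
```

The two functions must be exact inverses of each other, since every get() call replays deserialize() on what append() stored via serialize().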

append(data: BaseData) → None[source]

Appends the data object to the dataset.

Return type:

None

extend(data_list: Sequence[BaseData], batch_size: Optional[int] = None) → None[source]

Extends the dataset by a sequence of data objects.

Return type:

None
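
The role of batch_size here can be sketched with the stdlib alone: write the sequence in chunks so that one transaction never holds the whole list. The function below is a hypothetical illustration of that chunking, not the torch_geometric implementation:

```python
import pickle
import sqlite3
from typing import Optional, Sequence


def extend(conn: sqlite3.Connection, data_list: Sequence,
           batch_size: Optional[int] = None) -> None:
    """Insert objects in chunks of ``batch_size`` (all at once if None)."""
    batch_size = batch_size or len(data_list) or 1
    for start in range(0, len(data_list), batch_size):
        chunk = data_list[start:start + batch_size]
        conn.executemany('INSERT INTO store (data) VALUES (?)',
                         [(pickle.dumps(obj),) for obj in chunk])
        conn.commit()  # one commit per batch keeps memory usage bounded


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE store (id INTEGER PRIMARY KEY, data BLOB)')
extend(conn, [{'y': i} for i in range(10)], batch_size=4)
print(conn.execute('SELECT COUNT(*) FROM store').fetchone()[0])  # 10
```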

get(idx: int) → BaseData[source]

Gets the data object at index idx.

Return type:

BaseData

multi_get(indices: Union[Iterable[int], Tensor, slice, range], batch_size: Optional[int] = None) → List[BaseData][source]

Gets a list of data objects from the specified indices.

Return type:

List[BaseData]
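
Fetching many indices in one query, rather than one get() per index, is what makes multi_get worthwhile on a database backend. A stdlib sketch of that batched lookup (the function is hypothetical; it also preserves the requested order, which a bare SQL IN clause does not guarantee):

```python
import sqlite3
from typing import Iterable, List


def multi_get(conn: sqlite3.Connection, indices: Iterable[int]) -> List:
    """Fetch several rows with a single SELECT, in the requested order."""
    ids = [i + 1 for i in indices]  # sqlite INTEGER PRIMARY KEY ids are 1-based
    placeholders = ','.join('?' * len(ids))
    rows = dict(conn.execute(
        f'SELECT id, data FROM store WHERE id IN ({placeholders})', ids))
    return [rows[i] for i in ids]  # reorder to match the requested indices


conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE store (id INTEGER PRIMARY KEY, data TEXT)')
conn.executemany('INSERT INTO store (data) VALUES (?)',
                 [(f'graph-{i}',) for i in range(5)])
print(multi_get(conn, [3, 0, 2]))  # ['graph-3', 'graph-0', 'graph-2']
```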

len() → int[source]

Returns the number of data objects stored in the dataset.

Return type:

int

download() → None

Downloads the dataset to the self.raw_dir folder.

Return type:

None

get_summary() → Any

Collects summary statistics of the dataset.

Return type:

Any

property has_download: bool

Checks whether the dataset defines a download() method.

Return type:

bool

property has_process: bool

Checks whether the dataset defines a process() method.

Return type:

bool

index_select(idx: Union[slice, Tensor, ndarray, Sequence]) → Dataset

Creates a subset of the dataset from the specified indices idx. Indices idx can be a slicing object, e.g., [2:5], a list, a tuple, or a torch.Tensor or np.ndarray of type long or bool.

Return type:

Dataset
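
The accepted index types can all be reduced to a plain list of positions. A stdlib sketch of that dispatch for the slice, boolean-mask, and integer cases (the helper is hypothetical, and the torch.Tensor/np.ndarray branches are omitted):

```python
from typing import Sequence, Union


def normalize_indices(idx: Union[slice, Sequence], length: int) -> list:
    """Turn a slice, a boolean mask, or an integer sequence into positions."""
    if isinstance(idx, slice):
        return list(range(*idx.indices(length)))
    if idx and all(isinstance(i, bool) for i in idx):
        return [pos for pos, keep in enumerate(idx) if keep]  # boolean mask
    return [i % length for i in idx]  # integers, negatives wrap around


print(normalize_indices(slice(2, 5), 10))         # [2, 3, 4]
print(normalize_indices([True, False, True], 3))  # [0, 2]
print(normalize_indices([-1, 0], 5))              # [4, 0]
```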

property num_classes: int

Returns the number of classes in the dataset.

Return type:

int

property num_edge_features: int

Returns the number of features per edge in the dataset.

Return type:

int

property num_features: int

Returns the number of features per node in the dataset. Alias for num_node_features.

Return type:

int

property num_node_features: int

Returns the number of features per node in the dataset.

Return type:

int

print_summary(fmt: str = 'psql') → None

Prints summary statistics of the dataset to the console.

Parameters:

fmt (str, optional) – Summary table format; available formats are those supported by the tabulate package. (default: "psql")

Return type:

None

process() → None

Processes the dataset to the self.processed_dir folder.

Return type:

None

property processed_paths: List[str]

The absolute filepaths that must be present in order to skip processing.

Return type:

List[str]

property raw_file_names: Union[str, List[str], Tuple[str, ...]]

The name of the files in the self.raw_dir folder that must be present in order to skip downloading.

Return type:

Union[str, List[str], Tuple[str, ...]]

property raw_paths: List[str]

The absolute filepaths that must be present in order to skip downloading.

Return type:

List[str]

shuffle(return_perm: bool = False) → Union[Dataset, Tuple[Dataset, Tensor]]

Randomly shuffles the examples in the dataset.

Parameters:

return_perm (bool, optional) – If set to True, will also return the random permutation used to shuffle the dataset. (default: False)

Return type:

Union[Dataset, Tuple[Dataset, Tensor]]
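
The return_perm behaviour can be sketched in plain Python: draw a random permutation, reorder by it, and optionally hand the permutation back so the caller can trace each shuffled item to its original position. The function below is a hypothetical stdlib illustration, not the torch_geometric implementation:

```python
import random


def shuffle(items, return_perm: bool = False):
    """Return a shuffled copy; optionally also the permutation used."""
    perm = list(range(len(items)))
    random.shuffle(perm)
    shuffled = [items[i] for i in perm]
    return (shuffled, perm) if return_perm else shuffled


data, perm = shuffle(['a', 'b', 'c', 'd'], return_perm=True)
# perm[i] is the original position of data[i], so the shuffle can be undone
```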

to_datapipe() → Any

Converts the dataset into a torch.utils.data.DataPipe.

The returned instance can then be used with built-in DataPipes for batching graphs as follows:

from torch_geometric.datasets import QM9

dp = QM9(root='./data/QM9/').to_datapipe()
dp = dp.batch_graphs(batch_size=2, drop_last=True)

for batch in dp:
    pass

See the PyTorch tutorial for further background on DataPipes.

Return type:

Any