OnDiskDataset for Homogeneous Graph
This tutorial shows how to create an OnDiskDataset for a homogeneous graph that can be used by the GraphBolt framework.
By the end of this tutorial, you will be able to:
- organize graph structure data.
- organize feature data.
- organize training/validation/test sets for specific tasks.
To create an OnDiskDataset object, you need to organize all the data, including the graph structure, feature data, and tasks, into a single directory. The directory should contain a metadata.yaml file that describes the metadata of the dataset.
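For orientation, by the end of the tutorial the directory will look roughly like the sketch below (file names match the ones generated in the cells that follow; the preprocessed directory is created automatically by GraphBolt on first load):
ondisk_dataset_homograph/
├── metadata.yaml
├── edges.csv
├── node-feat-0.npy
├── node-feat-1.pt
├── edge-feat-0.npy
├── edge-feat-1.pt
├── nc-train-ids.npy        # plus the other nc-* set files
├── lp-train-seeds.npy      # plus the other lp-* set files
└── preprocessed/           # generated by GraphBolt on first load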
Now let's generate the various data step by step, organize them together, and finally instantiate OnDiskDataset.
Install DGL package
[1]:
# Install required packages.
import os
import torch
import numpy as np
os.environ['TORCH'] = torch.__version__
os.environ['DGLBACKEND'] = "pytorch"
# Install the CPU version.
device = torch.device("cpu")
!pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html
try:
import dgl
import dgl.graphbolt as gb
installed = True
except ImportError as error:
installed = False
print(error)
print("DGL installed!" if installed else "DGL not found!")
Looking in links: https://data.dgl.ai/wheels-test/repo.html
DGL installed!
Data preparation
In order to demonstrate how to organize various data, let's create a base directory first.
[2]:
base_dir = './ondisk_dataset_homograph'
os.makedirs(base_dir, exist_ok=True)
print(f"Created base directory: {base_dir}")
Created base directory: ./ondisk_dataset_homograph
Generate graph structure data
For a homogeneous graph, we just need to save the edges (namely seeds) into a Numpy or CSV file.
Note:
- When saving to Numpy, the array is required to be of shape (2, N). This format is recommended, as constructing the graph from it is much faster than from a CSV file.
- When saving to a CSV file, do not save the index and header.
[3]:
import numpy as np
import pandas as pd
num_nodes = 1000
num_edges = 10 * num_nodes
edges_path = os.path.join(base_dir, "edges.csv")
edges = np.random.randint(0, num_nodes, size=(num_edges, 2))
print(f"Part of edges: {edges[:5, :]}")
df = pd.DataFrame(edges)
df.to_csv(edges_path, index=False, header=False)
print(f"Edges are saved into {edges_path}")
Part of edges: [[855 775]
[850 798]
[336 200]
[261 46]
[443 806]]
Edges are saved into ./ondisk_dataset_homograph/edges.csv
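The note above recommends the (2, N) Numpy layout over CSV. A minimal sketch of that alternative, using a hypothetical edges-numpy.npy path (the rest of this tutorial keeps using the CSV file):
# Hypothetical alternative to the CSV file: save the same edges as a (2, N)
# Numpy array, which GraphBolt can build a graph from much faster.
edges_np_path = os.path.join(base_dir, "edges-numpy.npy")  # hypothetical path
np.save(edges_np_path, edges.T)  # `edges` is (N, 2); transpose to (2, N)
The edges entry in metadata.yaml would then use format: numpy together with this path.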
Generate feature data for graph
For feature data, numpy arrays and torch tensors are supported for now.
[4]:
# Generate node feature in numpy array.
node_feat_0_path = os.path.join(base_dir, "node-feat-0.npy")
node_feat_0 = np.random.rand(num_nodes, 5)
print(f"Part of node feature [feat_0]: {node_feat_0[:3, :]}")
np.save(node_feat_0_path, node_feat_0)
print(f"Node feature [feat_0] is saved to {node_feat_0_path}\n")
# Generate another node feature in torch tensor
node_feat_1_path = os.path.join(base_dir, "node-feat-1.pt")
node_feat_1 = torch.rand(num_nodes, 5)
print(f"Part of node feature [feat_1]: {node_feat_1[:3, :]}")
torch.save(node_feat_1, node_feat_1_path)
print(f"Node feature [feat_1] is saved to {node_feat_1_path}\n")
# Generate edge feature in numpy array.
edge_feat_0_path = os.path.join(base_dir, "edge-feat-0.npy")
edge_feat_0 = np.random.rand(num_edges, 5)
print(f"Part of edge feature [feat_0]: {edge_feat_0[:3, :]}")
np.save(edge_feat_0_path, edge_feat_0)
print(f"Edge feature [feat_0] is saved to {edge_feat_0_path}\n")
# Generate another edge feature in torch tensor
edge_feat_1_path = os.path.join(base_dir, "edge-feat-1.pt")
edge_feat_1 = torch.rand(num_edges, 5)
print(f"Part of edge feature [feat_1]: {edge_feat_1[:3, :]}")
torch.save(edge_feat_1, edge_feat_1_path)
print(f"Edge feature [feat_1] is saved to {edge_feat_1_path}\n")
Part of node feature [feat_0]: [[0.66418935 0.48835371 0.48818793 0.39986083 0.06194864]
[0.70411792 0.23833721 0.40806056 0.44118287 0.6980818 ]
[0.2674444 0.52477223 0.14542247 0.82756121 0.59281003]]
Node feature [feat_0] is saved to ./ondisk_dataset_homograph/node-feat-0.npy
Part of node feature [feat_1]: tensor([[0.0832, 0.0468, 0.6916, 0.8953, 0.2208],
[0.6560, 0.8490, 0.4157, 0.3341, 0.3320],
[0.8025, 0.3192, 0.2389, 0.8955, 0.4619]])
Node feature [feat_1] is saved to ./ondisk_dataset_homograph/node-feat-1.pt
Part of edge feature [feat_0]: [[0.85202101 0.37442228 0.43294893 0.11479708 0.16078714]
[0.09799214 0.48974035 0.48913703 0.96732298 0.38535437]
[0.40573755 0.04555319 0.57415337 0.39332099 0.10864744]]
Edge feature [feat_0] is saved to ./ondisk_dataset_homograph/edge-feat-0.npy
Part of edge feature [feat_1]: tensor([[0.9977, 0.6291, 0.8746, 0.1186, 0.6706],
[0.4452, 0.6584, 0.8161, 0.8932, 0.5067],
[0.0245, 0.8730, 0.6280, 0.7643, 0.1814]])
Edge feature [feat_1] is saved to ./ondisk_dataset_homograph/edge-feat-1.pt
Generate tasks
OnDiskDataset supports multiple tasks. For each task, we need to prepare its own training/validation/test sets, which usually vary between tasks. In this tutorial, let's create a Node Classification task and a Link Prediction task.
Node Classification Task
For the node classification task, we need node IDs and corresponding labels for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets.
[5]:
num_trains = int(num_nodes * 0.6)
num_vals = int(num_nodes * 0.2)
num_tests = num_nodes - num_trains - num_vals
ids = np.arange(num_nodes)
np.random.shuffle(ids)
nc_train_ids_path = os.path.join(base_dir, "nc-train-ids.npy")
nc_train_ids = ids[:num_trains]
print(f"Part of train ids for node classification: {nc_train_ids[:3]}")
np.save(nc_train_ids_path, nc_train_ids)
print(f"NC train ids are saved to {nc_train_ids_path}\n")
nc_train_labels_path = os.path.join(base_dir, "nc-train-labels.pt")
nc_train_labels = torch.randint(0, 10, (num_trains,))
print(f"Part of train labels for node classification: {nc_train_labels[:3]}")
torch.save(nc_train_labels, nc_train_labels_path)
print(f"NC train labels are saved to {nc_train_labels_path}\n")
nc_val_ids_path = os.path.join(base_dir, "nc-val-ids.npy")
nc_val_ids = ids[num_trains:num_trains+num_vals]
print(f"Part of val ids for node classification: {nc_val_ids[:3]}")
np.save(nc_val_ids_path, nc_val_ids)
print(f"NC val ids are saved to {nc_val_ids_path}\n")
nc_val_labels_path = os.path.join(base_dir, "nc-val-labels.pt")
nc_val_labels = torch.randint(0, 10, (num_vals,))
print(f"Part of val labels for node classification: {nc_val_labels[:3]}")
torch.save(nc_val_labels, nc_val_labels_path)
print(f"NC val labels are saved to {nc_val_labels_path}\n")
nc_test_ids_path = os.path.join(base_dir, "nc-test-ids.npy")
nc_test_ids = ids[-num_tests:]
print(f"Part of test ids for node classification: {nc_test_ids[:3]}")
np.save(nc_test_ids_path, nc_test_ids)
print(f"NC test ids are saved to {nc_test_ids_path}\n")
nc_test_labels_path = os.path.join(base_dir, "nc-test-labels.pt")
nc_test_labels = torch.randint(0, 10, (num_tests,))
print(f"Part of test labels for node classification: {nc_test_labels[:3]}")
torch.save(nc_test_labels, nc_test_labels_path)
print(f"NC test labels are saved to {nc_test_labels_path}\n")
Part of train ids for node classification: [284 651 844]
NC train ids are saved to ./ondisk_dataset_homograph/nc-train-ids.npy
Part of train labels for node classification: tensor([3, 7, 3])
NC train labels are saved to ./ondisk_dataset_homograph/nc-train-labels.pt
Part of val ids for node classification: [923 240 934]
NC val ids are saved to ./ondisk_dataset_homograph/nc-val-ids.npy
Part of val labels for node classification: tensor([7, 3, 1])
NC val labels are saved to ./ondisk_dataset_homograph/nc-val-labels.pt
Part of test ids for node classification: [107 664 870]
NC test ids are saved to ./ondisk_dataset_homograph/nc-test-ids.npy
Part of test labels for node classification: tensor([6, 9, 4])
NC test labels are saved to ./ondisk_dataset_homograph/nc-test-labels.pt
Link Prediction Task
For the link prediction task, we need seeds for each training/validation/test set, and optionally the corresponding labels and indexes, which mark each seed as positive or negative and group the negatives with their positive seed. Like feature data, numpy arrays and torch tensors are supported for these sets.
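To make the layout concrete, here is a tiny worked example of what the cell below constructs, with two positive seeds and two sampled negative destinations per positive (the real cell uses 10 negatives per positive):
# Positives come first, then all negatives; each negative reuses the source
# node of its positive seed and resamples the destination.
seeds   = np.array([[5, 7], [2, 9],     # two positive seeds
                    [5, 1], [5, 4],     # negatives for positive seed 0
                    [2, 8], [2, 3]])    # negatives for positive seed 1
labels  = np.array([1, 1, 0, 0, 0, 0])  # 1 = positive, 0 = negative
indexes = np.array([0, 1, 0, 0, 1, 1])  # maps each row to its positive seed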
[6]:
num_trains = int(num_edges * 0.6)
num_vals = int(num_edges * 0.2)
num_tests = num_edges - num_trains - num_vals
lp_train_seeds_path = os.path.join(base_dir, "lp-train-seeds.npy")
lp_train_seeds = edges[:num_trains, :]
print(f"Part of train seeds for link prediction: {lp_train_seeds[:3]}")
np.save(lp_train_seeds_path, lp_train_seeds)
print(f"LP train seeds are saved to {lp_train_seeds_path}\n")
lp_val_seeds_path = os.path.join(base_dir, "lp-val-seeds.npy")
lp_val_seeds = edges[num_trains:num_trains+num_vals, :]
lp_val_neg_dsts = np.random.randint(0, num_nodes, (num_vals, 10)).reshape(-1)
lp_val_neg_srcs = np.repeat(lp_val_seeds[:,0], 10)
lp_val_neg_seeds = np.concatenate((lp_val_neg_srcs, lp_val_neg_dsts)).reshape(2,-1).T
lp_val_seeds = np.concatenate((lp_val_seeds, lp_val_neg_seeds))
print(f"Part of val seeds for link prediction: {lp_val_seeds[:3]}")
np.save(lp_val_seeds_path, lp_val_seeds)
print(f"LP val seeds are saved to {lp_val_seeds_path}\n")
lp_val_labels_path = os.path.join(base_dir, "lp-val-labels.npy")
lp_val_labels = np.empty(num_vals * (10 + 1))
lp_val_labels[:num_vals] = 1
lp_val_labels[num_vals:] = 0
print(f"Part of val labels for link prediction: {lp_val_labels[:3]}")
np.save(lp_val_labels_path, lp_val_labels)
print(f"LP val labels are saved to {lp_val_labels_path}\n")
lp_val_indexes_path = os.path.join(base_dir, "lp-val-indexes.npy")
lp_val_indexes = np.arange(0, num_vals)
lp_val_neg_indexes = np.repeat(lp_val_indexes, 10)
lp_val_indexes = np.concatenate([lp_val_indexes, lp_val_neg_indexes])
print(f"Part of val indexes for link prediction: {lp_val_indexes[:3]}")
np.save(lp_val_indexes_path, lp_val_indexes)
print(f"LP val indexes are saved to {lp_val_indexes_path}\n")
lp_test_seeds_path = os.path.join(base_dir, "lp-test-seeds.npy")
lp_test_seeds = edges[-num_tests:, :]
lp_test_neg_dsts = np.random.randint(0, num_nodes, (num_tests, 10)).reshape(-1)
lp_test_neg_srcs = np.repeat(lp_test_seeds[:,0], 10)
lp_test_neg_seeds = np.concatenate((lp_test_neg_srcs, lp_test_neg_dsts)).reshape(2,-1).T
lp_test_seeds = np.concatenate((lp_test_seeds, lp_test_neg_seeds))
print(f"Part of test seeds for link prediction: {lp_test_seeds[:3]}")
np.save(lp_test_seeds_path, lp_test_seeds)
print(f"LP test seeds are saved to {lp_test_seeds_path}\n")
lp_test_labels_path = os.path.join(base_dir, "lp-test-labels.npy")
lp_test_labels = np.empty(num_tests * (10 + 1))
lp_test_labels[:num_tests] = 1
lp_test_labels[num_tests:] = 0
print(f"Part of val labels for link prediction: {lp_test_labels[:3]}")
np.save(lp_test_labels_path, lp_test_labels)
print(f"LP test labels are saved to {lp_test_labels_path}\n")
lp_test_indexes_path = os.path.join(base_dir, "lp-test-indexes.npy")
lp_test_indexes = np.arange(0, num_tests)
lp_test_neg_indexes = np.repeat(lp_test_indexes, 10)
lp_test_indexes = np.concatenate([lp_test_indexes, lp_test_neg_indexes])
print(f"Part of test indexes for link prediction: {lp_test_indexes[:3]}")
np.save(lp_test_indexes_path, lp_test_indexes)
print(f"LP test indexes are saved to {lp_test_indexes_path}\n")
Part of train seeds for link prediction: [[855 775]
[850 798]
[336 200]]
LP train seeds are saved to ./ondisk_dataset_homograph/lp-train-seeds.npy
Part of val seeds for link prediction: [[956 201]
[899 528]
[209 380]]
LP val seeds are saved to ./ondisk_dataset_homograph/lp-val-seeds.npy
Part of val labels for link prediction: [1. 1. 1.]
LP val labels are saved to ./ondisk_dataset_homograph/lp-val-labels.npy
Part of val indexes for link prediction: [0 1 2]
LP val indexes are saved to ./ondisk_dataset_homograph/lp-val-indexes.npy
Part of test seeds for link prediction: [[ 40 984]
[622 853]
[589 241]]
LP test seeds are saved to ./ondisk_dataset_homograph/lp-test-seeds.npy
Part of test labels for link prediction: [1. 1. 1.]
LP test labels are saved to ./ondisk_dataset_homograph/lp-test-labels.npy
Part of test indexes for link prediction: [0 1 2]
LP test indexes are saved to ./ondisk_dataset_homograph/lp-test-indexes.npy
Organize Data into YAML File
Now we need to create a metadata.yaml file which contains the paths and data formats of the graph structure, feature data, and training/validation/test sets.
Note:
- All paths should be relative to metadata.yaml.
- The following field is optional and not specified in the example below:
  - in_memory: indicates whether to load the data into memory or mmap it. Default is True.
Please refer to the YAML specification for more details.
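For instance, to mmap a large feature instead of loading it into memory, a feature entry could set the optional field explicitly. This is a sketch only and is not part of the file written below:
- domain: node
  name: feat_0
  format: numpy
  in_memory: false
  path: node-feat-0.npy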
[7]:
yaml_content = f"""
dataset_name: homogeneous_graph_nc_lp
graph:
nodes:
- num: {num_nodes}
edges:
- format: csv
path: {os.path.basename(edges_path)}
feature_data:
- domain: node
name: feat_0
format: numpy
path: {os.path.basename(node_feat_0_path)}
- domain: node
name: feat_1
format: torch
path: {os.path.basename(node_feat_1_path)}
- domain: edge
name: feat_0
format: numpy
path: {os.path.basename(edge_feat_0_path)}
- domain: edge
name: feat_1
format: torch
path: {os.path.basename(edge_feat_1_path)}
tasks:
- name: node_classification
num_classes: 10
train_set:
- data:
- name: seeds
format: numpy
path: {os.path.basename(nc_train_ids_path)}
- name: labels
format: torch
path: {os.path.basename(nc_train_labels_path)}
validation_set:
- data:
- name: seeds
format: numpy
path: {os.path.basename(nc_val_ids_path)}
- name: labels
format: torch
path: {os.path.basename(nc_val_labels_path)}
test_set:
- data:
- name: seeds
format: numpy
path: {os.path.basename(nc_test_ids_path)}
- name: labels
format: torch
path: {os.path.basename(nc_test_labels_path)}
- name: link_prediction
num_classes: 10
train_set:
- data:
- name: seeds
format: numpy
path: {os.path.basename(lp_train_seeds_path)}
validation_set:
- data:
- name: seeds
format: numpy
path: {os.path.basename(lp_val_seeds_path)}
- name: labels
format: numpy
path: {os.path.basename(lp_val_labels_path)}
- name: indexes
format: numpy
path: {os.path.basename(lp_val_indexes_path)}
test_set:
- data:
- name: seeds
format: numpy
path: {os.path.basename(lp_test_seeds_path)}
- name: labels
format: numpy
path: {os.path.basename(lp_test_labels_path)}
- name: indexes
format: numpy
path: {os.path.basename(lp_test_indexes_path)}
"""
metadata_path = os.path.join(base_dir, "metadata.yaml")
with open(metadata_path, "w") as f:
f.write(yaml_content)
Instantiate OnDiskDataset
Now we're ready to load the dataset via dgl.graphbolt.OnDiskDataset. When instantiating, we just pass in the base directory where the metadata.yaml file resides.
During the first instantiation, GraphBolt preprocesses the raw data, for example building the FusedCSCSamplingGraph from the edges. All the data, including the graph, feature data, and training/validation/test sets, are placed into the preprocessed directory after preprocessing. Any subsequent dataset loading will skip the preprocess stage.
After preprocessing, load() is required to be called explicitly in order to load the graph, feature data, and tasks.
[8]:
dataset = gb.OnDiskDataset(base_dir).load()
graph = dataset.graph
print(f"Loaded graph: {graph}\n")
feature = dataset.feature
print(f"Loaded feature store: {feature}\n")
tasks = dataset.tasks
nc_task = tasks[0]
print(f"Loaded node classification task: {nc_task}\n")
lp_task = tasks[1]
print(f"Loaded link prediction task: {lp_task}\n")
Start to preprocess the on-disk dataset.
Finish preprocessing the on-disk dataset.
Loaded graph: FusedCSCSamplingGraph(csc_indptr=tensor([ 0, 10, 21, ..., 9980, 9995, 10000], dtype=torch.int32),
indices=tensor([181, 410, 596, ..., 745, 997, 122], dtype=torch.int32),
total_num_nodes=1000, num_edges=10000,)
Loaded feature store: TorchBasedFeatureStore(
{(<OnDiskFeatureDataDomain.NODE: 'node'>, None, 'feat_0'): TorchBasedFeature(
feature=tensor([[0.6642, 0.4884, 0.4882, 0.3999, 0.0619],
[0.7041, 0.2383, 0.4081, 0.4412, 0.6981],
[0.2674, 0.5248, 0.1454, 0.8276, 0.5928],
...,
[0.6519, 0.2808, 0.8133, 0.0882, 0.6630],
[0.2974, 0.7740, 0.1138, 0.4471, 0.2879],
[0.9448, 0.7801, 0.3655, 0.6760, 0.9172]], dtype=torch.float64),
metadata={},
), (<OnDiskFeatureDataDomain.NODE: 'node'>, None, 'feat_1'): TorchBasedFeature(
feature=tensor([[0.0832, 0.0468, 0.6916, 0.8953, 0.2208],
[0.6560, 0.8490, 0.4157, 0.3341, 0.3320],
[0.8025, 0.3192, 0.2389, 0.8955, 0.4619],
...,
[0.6520, 0.4619, 0.5146, 0.6097, 0.2386],
[0.7205, 0.9640, 0.6694, 0.1268, 0.5149],
[0.5294, 0.7750, 0.7792, 0.0633, 0.3548]]),
metadata={},
), (<OnDiskFeatureDataDomain.EDGE: 'edge'>, None, 'feat_0'): TorchBasedFeature(
feature=tensor([[0.8520, 0.3744, 0.4329, 0.1148, 0.1608],
[0.0980, 0.4897, 0.4891, 0.9673, 0.3854],
[0.4057, 0.0456, 0.5742, 0.3933, 0.1086],
...,
[0.0465, 0.2654, 0.7799, 0.4601, 0.6528],
[0.0966, 0.4897, 0.2502, 0.4873, 0.8816],
[0.6565, 0.7743, 0.6417, 0.2032, 0.5695]], dtype=torch.float64),
metadata={},
), (<OnDiskFeatureDataDomain.EDGE: 'edge'>, None, 'feat_1'): TorchBasedFeature(
feature=tensor([[0.9977, 0.6291, 0.8746, 0.1186, 0.6706],
[0.4452, 0.6584, 0.8161, 0.8932, 0.5067],
[0.0245, 0.8730, 0.6280, 0.7643, 0.1814],
...,
[0.4550, 0.8023, 0.3347, 0.6111, 0.3855],
[0.9893, 0.4175, 0.5815, 0.0469, 0.0468],
[0.9241, 0.3524, 0.5294, 0.5368, 0.7772]]),
metadata={},
)}
)
Loaded node classification task: OnDiskTask(validation_set=ItemSet(
items=(tensor([923, 240, 934, 846, 20, 775, 142, 617, 109, 974, 383, 228, 69, 178,
403, 189, 973, 312, 276, 829, 794, 447, 900, 88, 724, 281, 715, 814,
953, 696, 34, 633, 637, 275, 315, 164, 888, 412, 405, 567, 855, 595,
434, 333, 640, 384, 477, 140, 739, 778, 282, 769, 624, 262, 491, 662,
484, 459, 969, 221, 935, 569, 575, 892, 758, 757, 55, 138, 130, 344,
322, 767, 653, 279, 903, 191, 820, 655, 476, 267, 26, 157, 742, 802,
465, 623, 737, 962, 450, 199, 631, 510, 527, 719, 848, 348, 347, 321,
796, 291, 795, 915, 682, 202, 101, 196, 183, 878, 206, 147, 584, 52,
270, 673, 123, 621, 690, 730, 319, 83, 692, 716, 817, 390, 679, 732,
66, 306, 821, 580, 978, 840, 528, 260, 327, 740, 210, 931, 539, 151,
288, 113, 955, 610, 374, 389, 168, 363, 718, 889, 295, 534, 46, 530,
224, 593, 810, 201, 169, 585, 899, 521, 776, 675, 421, 801, 245, 886,
124, 606, 331, 950, 856, 371, 646, 880, 782, 746, 627, 929, 490, 907,
893, 70, 436, 761, 104, 96, 117, 590, 512, 803, 377, 250, 970, 203,
438, 813, 238, 455], dtype=torch.int32), tensor([7, 3, 1, 1, 8, 3, 5, 0, 3, 4, 0, 8, 6, 6, 1, 5, 8, 6, 9, 5, 8, 9, 6, 0,
0, 3, 3, 4, 4, 5, 9, 6, 2, 0, 2, 7, 3, 3, 8, 9, 8, 8, 5, 5, 0, 6, 1, 0,
5, 5, 9, 4, 8, 3, 8, 2, 0, 1, 0, 1, 3, 4, 5, 9, 7, 1, 1, 3, 2, 6, 1, 6,
5, 4, 5, 6, 7, 0, 8, 0, 6, 9, 9, 2, 1, 0, 7, 3, 9, 9, 1, 6, 5, 2, 5, 3,
0, 0, 3, 2, 3, 9, 7, 7, 9, 9, 7, 3, 4, 7, 7, 4, 3, 5, 6, 0, 7, 0, 5, 5,
4, 8, 2, 2, 9, 3, 0, 4, 9, 0, 7, 2, 6, 1, 2, 3, 1, 8, 6, 9, 2, 1, 1, 6,
8, 3, 4, 3, 8, 7, 1, 8, 8, 2, 3, 2, 1, 3, 9, 7, 1, 2, 5, 6, 7, 1, 0, 3,
0, 1, 2, 0, 4, 2, 8, 8, 9, 0, 4, 3, 3, 8, 8, 9, 2, 5, 2, 3, 4, 1, 0, 4,
4, 7, 0, 0, 2, 3, 1, 1])),
names=('seeds', 'labels'),
),
train_set=ItemSet(
items=(tensor([284, 651, 844, 33, 924, 367, 582, 222, 964, 972, 223, 263, 522, 639,
150, 400, 213, 255, 97, 159, 722, 661, 917, 704, 218, 672, 330, 976,
948, 843, 543, 466, 997, 960, 193, 328, 143, 446, 342, 309, 533, 965,
707, 127, 688, 2, 957, 986, 693, 904, 932, 765, 431, 6, 667, 712,
362, 586, 398, 72, 643, 591, 299, 141, 161, 204, 609, 485, 244, 239,
302, 755, 884, 551, 670, 708, 536, 336, 703, 800, 149, 86, 462, 784,
5, 537, 133, 861, 920, 112, 186, 603, 200, 1, 493, 869, 676, 642,
717, 956, 608, 445, 148, 501, 334, 979, 681, 561, 175, 668, 947, 304,
453, 187, 967, 293, 944, 12, 918, 355, 94, 77, 167, 387, 454, 991,
36, 762, 842, 984, 341, 660, 474, 118, 269, 316, 674, 126, 583, 589,
747, 296, 397, 789, 17, 949, 264, 432, 205, 435, 894, 79, 122, 975,
871, 81, 496, 40, 274, 473, 994, 546, 864, 981, 227, 392, 544, 427,
75, 963, 11, 709, 738, 56, 556, 559, 461, 728, 857, 500, 626, 839,
910, 233, 684, 190, 62, 895, 602, 669, 868, 804, 39, 43, 375, 90,
700, 58, 860, 926, 982, 139, 613, 80, 809, 729, 781, 579, 560, 422,
826, 832, 475, 547, 9, 404, 253, 604, 54, 366, 32, 562, 63, 495,
990, 10, 246, 798, 230, 507, 47, 361, 697, 320, 21, 600, 155, 339,
508, 261, 25, 908, 463, 735, 694, 152, 691, 989, 851, 231, 714, 845,
616, 87, 78, 764, 710, 834, 825, 108, 369, 983, 326, 305, 942, 91,
396, 873, 0, 605, 777, 875, 259, 928, 61, 685, 351, 129, 242, 859,
828, 176, 797, 581, 511, 850, 952, 343, 51, 919, 531, 555, 184, 225,
3, 665, 283, 346, 324, 451, 612, 28, 656, 587, 268, 699, 265, 357,
553, 812, 611, 503, 441, 273, 905, 13, 570, 173, 557, 966, 439, 376,
552, 874, 858, 550, 488, 658, 632, 992, 913, 513, 216, 487, 174, 921,
212, 830, 517, 413, 911, 256, 564, 819, 425, 902, 549, 849, 160, 415,
594, 993, 44, 486, 440, 538, 388, 943, 35, 793, 313, 483, 131, 408,
323, 247, 763, 468, 482, 574, 303, 128, 82, 566, 563, 618, 266, 67,
548, 38, 448, 518, 678, 16, 68, 23, 614, 254, 497, 677, 601, 572,
219, 181, 414, 945, 30, 815, 237, 783, 540, 958, 525, 998, 588, 774,
308, 879, 232, 515, 598, 896, 31, 136, 657, 723, 479, 318, 654, 298,
381, 936, 89, 663, 689, 170, 125, 165, 406, 426, 901, 423, 290, 695,
85, 368, 95, 419, 792, 49, 287, 74, 380, 541, 744, 838, 243, 208,
759, 285, 27, 464, 307, 941, 770, 504, 771, 329, 876, 578, 433, 754,
753, 177, 345, 145, 401, 429, 41, 930, 999, 98, 499, 705, 885, 277,
883, 215, 650, 198, 332, 300, 565, 114, 987, 45, 18, 768, 297, 881,
980, 671, 573, 615, 365, 721, 382, 647, 469, 897, 619, 399, 402, 898,
866, 946, 542, 545, 683, 395, 725, 909, 172, 217, 317, 354, 135, 492,
760, 635, 120, 442, 641, 71, 648, 119, 750, 831, 890, 73, 121, 4,
701, 937, 498, 529, 634, 818, 478, 748, 596, 780, 506, 115, 711, 659,
458, 420, 234, 280, 686, 311, 116, 629, 514, 638, 520, 607, 411, 743,
480, 571, 137, 833, 636, 195, 337, 241, 372, 847, 153, 257, 325, 824,
378, 197, 470, 211, 863, 14, 741, 460, 791, 535, 15, 182, 862, 912,
229, 734, 977, 424, 853, 64, 887, 822, 959, 444, 457, 106],
dtype=torch.int32), tensor([3, 7, 3, 0, 8, 5, 1, 4, 1, 4, 7, 3, 9, 2, 0, 5, 4, 4, 1, 4, 7, 4, 6, 2,
0, 9, 3, 7, 3, 7, 4, 8, 8, 8, 0, 5, 0, 0, 9, 4, 9, 1, 5, 9, 5, 6, 1, 0,
3, 7, 5, 1, 1, 6, 0, 7, 4, 4, 9, 1, 3, 9, 5, 7, 1, 4, 3, 2, 6, 9, 5, 8,
0, 7, 4, 9, 9, 6, 8, 5, 6, 3, 4, 4, 9, 1, 7, 7, 8, 7, 6, 4, 0, 3, 7, 5,
9, 8, 9, 9, 1, 5, 6, 8, 2, 0, 4, 0, 0, 3, 4, 4, 6, 6, 4, 9, 6, 4, 5, 1,
1, 2, 4, 6, 4, 4, 8, 5, 8, 4, 0, 4, 3, 1, 6, 2, 3, 0, 6, 0, 4, 0, 8, 4,
6, 0, 1, 1, 0, 7, 6, 1, 1, 7, 3, 0, 6, 6, 8, 3, 0, 8, 3, 6, 4, 5, 2, 6,
9, 4, 9, 3, 9, 2, 0, 3, 9, 2, 1, 4, 4, 9, 9, 1, 6, 0, 1, 4, 7, 5, 9, 4,
4, 4, 6, 4, 1, 8, 7, 8, 0, 9, 2, 5, 7, 5, 8, 1, 3, 6, 2, 8, 6, 2, 7, 7,
4, 0, 3, 9, 7, 0, 7, 5, 8, 0, 2, 7, 3, 5, 2, 9, 5, 2, 2, 8, 6, 8, 4, 4,
7, 5, 4, 9, 0, 3, 7, 7, 7, 4, 4, 3, 4, 2, 5, 7, 7, 5, 5, 7, 0, 2, 4, 7,
1, 1, 1, 9, 0, 6, 9, 6, 2, 4, 3, 5, 0, 7, 2, 6, 8, 8, 2, 6, 3, 1, 1, 7,
4, 6, 8, 3, 8, 0, 9, 7, 6, 4, 4, 2, 6, 0, 9, 4, 1, 7, 9, 6, 0, 0, 0, 8,
9, 9, 5, 4, 1, 1, 7, 8, 3, 1, 4, 8, 9, 5, 5, 3, 1, 6, 9, 3, 9, 6, 6, 4,
8, 1, 9, 5, 8, 3, 5, 4, 9, 8, 6, 5, 3, 5, 9, 6, 3, 4, 1, 2, 4, 8, 2, 1,
3, 0, 0, 3, 0, 1, 7, 6, 7, 7, 7, 2, 6, 9, 2, 5, 9, 5, 2, 0, 4, 1, 0, 6,
1, 8, 6, 4, 7, 5, 5, 9, 1, 2, 6, 0, 5, 9, 2, 2, 9, 3, 2, 3, 6, 3, 3, 7,
2, 3, 3, 8, 5, 6, 7, 0, 9, 2, 0, 5, 2, 4, 2, 8, 4, 9, 5, 3, 2, 4, 0, 5,
0, 4, 7, 7, 6, 0, 8, 5, 5, 9, 1, 9, 6, 4, 6, 5, 2, 2, 0, 3, 6, 4, 7, 3,
8, 5, 9, 3, 8, 9, 2, 4, 4, 8, 8, 2, 2, 7, 7, 6, 5, 6, 4, 7, 1, 1, 0, 2,
8, 3, 6, 5, 2, 7, 0, 2, 5, 1, 5, 0, 9, 5, 1, 9, 8, 6, 1, 1, 4, 6, 8, 9,
5, 9, 8, 5, 5, 0, 0, 4, 0, 0, 3, 2, 9, 3, 2, 7, 2, 3, 9, 2, 6, 8, 9, 1,
3, 3, 3, 1, 4, 3, 0, 0, 0, 4, 0, 8, 6, 6, 8, 4, 2, 9, 5, 0, 5, 3, 1, 9,
5, 6, 8, 7, 3, 4, 1, 7, 0, 4, 7, 2, 8, 5, 6, 1, 1, 0, 2, 9, 7, 5, 0, 0,
8, 3, 7, 5, 8, 7, 9, 8, 6, 1, 1, 3, 3, 4, 2, 5, 9, 5, 6, 7, 4, 5, 0, 9])),
names=('seeds', 'labels'),
),
test_set=ItemSet(
items=(tensor([107, 664, 870, 939, 207, 162, 509, 353, 756, 103, 745, 100, 8, 294,
698, 171, 494, 971, 988, 301, 766, 338, 630, 467, 272, 19, 22, 790,
940, 995, 37, 865, 996, 358, 925, 352, 726, 379, 576, 417, 481, 644,
65, 502, 471, 214, 916, 524, 99, 951, 180, 156, 568, 166, 472, 57,
356, 985, 961, 179, 852, 271, 622, 854, 42, 836, 773, 340, 437, 418,
50, 394, 841, 687, 105, 416, 409, 807, 827, 713, 872, 144, 554, 526,
628, 489, 53, 720, 59, 505, 706, 779, 625, 386, 235, 731, 523, 236,
649, 373, 158, 922, 226, 93, 787, 452, 84, 811, 391, 927, 24, 891,
350, 592, 314, 76, 194, 385, 134, 733, 335, 251, 816, 577, 154, 933,
188, 102, 752, 430, 597, 248, 808, 620, 258, 449, 785, 289, 938, 558,
110, 751, 29, 443, 799, 349, 364, 914, 727, 772, 192, 60, 805, 360,
877, 702, 393, 516, 645, 837, 407, 310, 163, 92, 428, 666, 519, 835,
532, 7, 968, 249, 209, 806, 359, 788, 132, 252, 906, 185, 749, 410,
48, 736, 680, 599, 456, 286, 220, 954, 370, 111, 786, 652, 292, 867,
882, 146, 823, 278], dtype=torch.int32), tensor([6, 9, 4, 0, 4, 5, 3, 3, 3, 6, 6, 1, 0, 7, 9, 3, 1, 4, 2, 0, 9, 4, 0, 8,
6, 9, 8, 1, 1, 9, 6, 3, 8, 2, 9, 6, 3, 3, 0, 1, 7, 7, 5, 1, 3, 7, 2, 9,
5, 3, 2, 9, 4, 7, 5, 0, 6, 6, 3, 7, 3, 1, 8, 0, 1, 0, 2, 8, 0, 4, 9, 0,
1, 3, 8, 8, 1, 5, 1, 7, 0, 3, 8, 3, 0, 5, 2, 0, 3, 7, 5, 0, 6, 1, 6, 6,
9, 9, 0, 7, 2, 5, 8, 9, 9, 0, 0, 4, 7, 8, 4, 9, 2, 1, 1, 0, 3, 1, 9, 2,
1, 8, 6, 4, 0, 0, 4, 6, 4, 5, 7, 1, 7, 6, 0, 5, 8, 8, 0, 5, 6, 4, 0, 0,
8, 5, 8, 9, 2, 0, 2, 5, 1, 2, 9, 8, 2, 4, 7, 8, 7, 1, 0, 4, 5, 8, 1, 5,
3, 3, 6, 0, 1, 6, 8, 1, 5, 5, 9, 5, 1, 2, 7, 7, 9, 1, 3, 5, 6, 6, 5, 4,
8, 5, 8, 7, 1, 3, 6, 1])),
names=('seeds', 'labels'),
),
metadata={'name': 'node_classification', 'num_classes': 10},)
Loaded link prediction task: OnDiskTask(validation_set=ItemSet(
items=(tensor([[956, 201],
[899, 528],
[209, 380],
...,
[693, 506],
[693, 911],
[693, 979]], dtype=torch.int32), tensor([1., 1., 1., ..., 0., 0., 0.], dtype=torch.float64), tensor([ 0, 1, 2, ..., 1999, 1999, 1999])),
names=('seeds', 'labels', 'indexes'),
),
train_set=ItemSet(
items=(tensor([[855, 775],
[850, 798],
[336, 200],
...,
[819, 807],
[758, 628],
[324, 175]], dtype=torch.int32),),
names=('seeds',),
),
test_set=ItemSet(
items=(tensor([[ 40, 984],
[622, 853],
[589, 241],
...,
[339, 387],
[339, 570],
[339, 443]], dtype=torch.int32), tensor([1., 1., 1., ..., 0., 0., 0.], dtype=torch.float64), tensor([ 0, 1, 2, ..., 1999, 1999, 1999])),
names=('seeds', 'labels', 'indexes'),
),
metadata={'name': 'link_prediction', 'num_classes': 10},)
/home/ubuntu/prod-doc/readthedocs.org/user_builds/dgl/envs/latest/lib/python3.8/site-packages/dgl-2.3-py3.8-linux-x86_64.egg/dgl/graphbolt/impl/ondisk_dataset.py:460: DGLWarning: Edge feature is stored, but edge IDs are not saved.
dgl_warning("Edge feature is stored, but edge IDs are not saved.")