OnDiskDataset for Homogeneous Graph
This tutorial shows how to create an OnDiskDataset for a homogeneous graph that can be used by the GraphBolt framework.
By the end of this tutorial, you will be able to:
- organize graph structure data.
- organize feature data.
- organize training/validation/test sets for specific tasks.
To create an OnDiskDataset object, you need to organize all the data, including the graph structure, feature data, and tasks, into a single directory. The directory should contain a metadata.yaml file that describes the metadata of the dataset.
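For orientation, by the end of the tutorial the directory will look roughly like the sketch below (file names match the ones generated in the cells that follow; the preprocessed directory is created automatically by GraphBolt on first load):
ondisk_dataset_homograph/
├── metadata.yaml
├── edges.csv
├── node-feat-0.npy
├── node-feat-1.pt
├── edge-feat-0.npy
├── edge-feat-1.pt
├── nc-train-ids.npy        # plus the other nc-* set files
├── lp-train-seeds.npy      # plus the other lp-* set files
└── preprocessed/           # generated by GraphBolt on first load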
Now let's generate the various data step by step, organize them together, and finally instantiate OnDiskDataset.
Install DGL package
[1]:
# Install required packages.
import os
import torch
import numpy as np
os.environ['TORCH'] = torch.__version__
os.environ['DGLBACKEND'] = "pytorch"
# Install the CPU version.
device = torch.device("cpu")
!pip install --pre dgl -f https://data.dgl.ai/wheels-test/repo.html
try:
import dgl
import dgl.graphbolt as gb
installed = True
except ImportError as error:
installed = False
print(error)
print("DGL installed!" if installed else "DGL not found!")
Looking in links: https://data.dgl.ai/wheels-test/repo.html
DGL installed!
Data preparation
In order to demonstrate how to organize various data, let's create a base directory first.
[2]:
base_dir = './ondisk_dataset_homograph'
os.makedirs(base_dir, exist_ok=True)
print(f"Created base directory: {base_dir}")
Created base directory: ./ondisk_dataset_homograph
Generate graph structure data
For a homogeneous graph, we just need to save the edges (namely seeds) into a Numpy or CSV file.
Note:
- When saving to Numpy, the array is required to be of shape (2, N). This format is recommended, as constructing the graph from it is much faster than from a CSV file.
- When saving to a CSV file, do not save the index and header.
[3]:
import numpy as np
import pandas as pd
num_nodes = 1000
num_edges = 10 * num_nodes
edges_path = os.path.join(base_dir, "edges.csv")
edges = np.random.randint(0, num_nodes, size=(num_edges, 2))
print(f"Part of edges: {edges[:5, :]}")
df = pd.DataFrame(edges)
df.to_csv(edges_path, index=False, header=False)
print(f"Edges are saved into {edges_path}")
Part of edges: [[855 775]
[850 798]
[336 200]
[261 46]
[443 806]]
Edges are saved into ./ondisk_dataset_homograph/edges.csv
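The note above recommends the (2, N) Numpy layout over CSV. A minimal sketch of that alternative, using a hypothetical edges-numpy.npy path (the rest of this tutorial keeps using the CSV file):
# Hypothetical alternative to the CSV file: save the same edges as a (2, N)
# Numpy array, which GraphBolt can build a graph from much faster.
edges_np_path = os.path.join(base_dir, "edges-numpy.npy")  # hypothetical path
np.save(edges_np_path, edges.T)  # `edges` is (N, 2); transpose to (2, N)
The edges entry in metadata.yaml would then use format: numpy together with this path.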
Generate feature data for graph
For feature data, numpy arrays and torch tensors are supported for now.
[4]:
# Generate node feature in numpy array.
node_feat_0_path = os.path.join(base_dir, "node-feat-0.npy")
node_feat_0 = np.random.rand(num_nodes, 5)
print(f"Part of node feature [feat_0]: {node_feat_0[:3, :]}")
np.save(node_feat_0_path, node_feat_0)
print(f"Node feature [feat_0] is saved to {node_feat_0_path}\n")
# Generate another node feature in torch tensor
node_feat_1_path = os.path.join(base_dir, "node-feat-1.pt")
node_feat_1 = torch.rand(num_nodes, 5)
print(f"Part of node feature [feat_1]: {node_feat_1[:3, :]}")
torch.save(node_feat_1, node_feat_1_path)
print(f"Node feature [feat_1] is saved to {node_feat_1_path}\n")
# Generate edge feature in numpy array.
edge_feat_0_path = os.path.join(base_dir, "edge-feat-0.npy")
edge_feat_0 = np.random.rand(num_edges, 5)
print(f"Part of edge feature [feat_0]: {edge_feat_0[:3, :]}")
np.save(edge_feat_0_path, edge_feat_0)
print(f"Edge feature [feat_0] is saved to {edge_feat_0_path}\n")
# Generate another edge feature in torch tensor
edge_feat_1_path = os.path.join(base_dir, "edge-feat-1.pt")
edge_feat_1 = torch.rand(num_edges, 5)
print(f"Part of edge feature [feat_1]: {edge_feat_1[:3, :]}")
torch.save(edge_feat_1, edge_feat_1_path)
print(f"Edge feature [feat_1] is saved to {edge_feat_1_path}\n")
Part of node feature [feat_0]: [[0.66418935 0.48835371 0.48818793 0.39986083 0.06194864]
[0.70411792 0.23833721 0.40806056 0.44118287 0.6980818 ]
[0.2674444 0.52477223 0.14542247 0.82756121 0.59281003]]
Node feature [feat_0] is saved to ./ondisk_dataset_homograph/node-feat-0.npy
Part of node feature [feat_1]: tensor([[0.0832, 0.0468, 0.6916, 0.8953, 0.2208],
[0.6560, 0.8490, 0.4157, 0.3341, 0.3320],
[0.8025, 0.3192, 0.2389, 0.8955, 0.4619]])
Node feature [feat_1] is saved to ./ondisk_dataset_homograph/node-feat-1.pt
Part of edge feature [feat_0]: [[0.85202101 0.37442228 0.43294893 0.11479708 0.16078714]
[0.09799214 0.48974035 0.48913703 0.96732298 0.38535437]
[0.40573755 0.04555319 0.57415337 0.39332099 0.10864744]]
Edge feature [feat_0] is saved to ./ondisk_dataset_homograph/edge-feat-0.npy
Part of edge feature [feat_1]: tensor([[0.9977, 0.6291, 0.8746, 0.1186, 0.6706],
[0.4452, 0.6584, 0.8161, 0.8932, 0.5067],
[0.0245, 0.8730, 0.6280, 0.7643, 0.1814]])
Edge feature [feat_1] is saved to ./ondisk_dataset_homograph/edge-feat-1.pt
Generate tasks
OnDiskDataset supports multiple tasks. For each task, we need to prepare its own training/validation/test sets, which usually vary between tasks. In this tutorial, let's create a Node Classification task and a Link Prediction task.
Node Classification Task
For the node classification task, we need node IDs and corresponding labels for each training/validation/test set. Like feature data, numpy arrays and torch tensors are supported for these sets.
[5]:
num_trains = int(num_nodes * 0.6)
num_vals = int(num_nodes * 0.2)
num_tests = num_nodes - num_trains - num_vals
ids = np.arange(num_nodes)
np.random.shuffle(ids)
nc_train_ids_path = os.path.join(base_dir, "nc-train-ids.npy")
nc_train_ids = ids[:num_trains]
print(f"Part of train ids for node classification: {nc_train_ids[:3]}")
np.save(nc_train_ids_path, nc_train_ids)
print(f"NC train ids are saved to {nc_train_ids_path}\n")
nc_train_labels_path = os.path.join(base_dir, "nc-train-labels.pt")
nc_train_labels = torch.randint(0, 10, (num_trains,))
print(f"Part of train labels for node classification: {nc_train_labels[:3]}")
torch.save(nc_train_labels, nc_train_labels_path)
print(f"NC train labels are saved to {nc_train_labels_path}\n")
nc_val_ids_path = os.path.join(base_dir, "nc-val-ids.npy")
nc_val_ids = ids[num_trains:num_trains+num_vals]
print(f"Part of val ids for node classification: {nc_val_ids[:3]}")
np.save(nc_val_ids_path, nc_val_ids)
print(f"NC val ids are saved to {nc_val_ids_path}\n")
nc_val_labels_path = os.path.join(base_dir, "nc-val-labels.pt")
nc_val_labels = torch.randint(0, 10, (num_vals,))
print(f"Part of val labels for node classification: {nc_val_labels[:3]}")
torch.save(nc_val_labels, nc_val_labels_path)
print(f"NC val labels are saved to {nc_val_labels_path}\n")
nc_test_ids_path = os.path.join(base_dir, "nc-test-ids.npy")
nc_test_ids = ids[-num_tests:]
print(f"Part of test ids for node classification: {nc_test_ids[:3]}")
np.save(nc_test_ids_path, nc_test_ids)
print(f"NC test ids are saved to {nc_test_ids_path}\n")
nc_test_labels_path = os.path.join(base_dir, "nc-test-labels.pt")
nc_test_labels = torch.randint(0, 10, (num_tests,))
print(f"Part of test labels for node classification: {nc_test_labels[:3]}")
torch.save(nc_test_labels, nc_test_labels_path)
print(f"NC test labels are saved to {nc_test_labels_path}\n")
Part of train ids for node classification: [284 651 844]
NC train ids are saved to ./ondisk_dataset_homograph/nc-train-ids.npy
Part of train labels for node classification: tensor([3, 7, 3])
NC train labels are saved to ./ondisk_dataset_homograph/nc-train-labels.pt
Part of val ids for node classification: [923 240 934]
NC val ids are saved to ./ondisk_dataset_homograph/nc-val-ids.npy
Part of val labels for node classification: tensor([7, 3, 1])
NC val labels are saved to ./ondisk_dataset_homograph/nc-val-labels.pt
Part of test ids for node classification: [107 664 870]
NC test ids are saved to ./ondisk_dataset_homograph/nc-test-ids.npy
Part of test labels for node classification: tensor([6, 9, 4])
NC test labels are saved to ./ondisk_dataset_homograph/nc-test-labels.pt
Link Prediction Task
For the link prediction task, we need seeds for each training/validation/test set, and optionally the corresponding labels and indexes, which mark each seed as positive or negative and group the negatives with their positive seed. Like feature data, numpy arrays and torch tensors are supported for these sets.
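To make the layout concrete, here is a tiny worked example of what the cell below constructs, with two positive seeds and two sampled negative destinations per positive (the real cell uses 10 negatives per positive):
# Positives come first, then all negatives; each negative reuses the source
# node of its positive seed and resamples the destination.
seeds   = np.array([[5, 7], [2, 9],     # two positive seeds
                    [5, 1], [5, 4],     # negatives for positive seed 0
                    [2, 8], [2, 3]])    # negatives for positive seed 1
labels  = np.array([1, 1, 0, 0, 0, 0])  # 1 = positive, 0 = negative
indexes = np.array([0, 1, 0, 0, 1, 1])  # maps each row to its positive seed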
[6]:
num_trains = int(num_edges * 0.6)
num_vals = int(num_edges * 0.2)
num_tests = num_edges - num_trains - num_vals
lp_train_seeds_path = os.path.join(base_dir, "lp-train-seeds.npy")
lp_train_seeds = edges[:num_trains, :]
print(f"Part of train seeds for link prediction: {lp_train_seeds[:3]}")
np.save(lp_train_seeds_path, lp_train_seeds)
print(f"LP train seeds are saved to {lp_train_seeds_path}\n")
lp_val_seeds_path = os.path.join(base_dir, "lp-val-seeds.npy")
lp_val_seeds = edges[num_trains:num_trains+num_vals, :]
lp_val_neg_dsts = np.random.randint(0, num_nodes, (num_vals, 10)).reshape(-1)
lp_val_neg_srcs = np.repeat(lp_val_seeds[:,0], 10)
lp_val_neg_seeds = np.concatenate((lp_val_neg_srcs, lp_val_neg_dsts)).reshape(2,-1).T
lp_val_seeds = np.concatenate((lp_val_seeds, lp_val_neg_seeds))
print(f"Part of val seeds for link prediction: {lp_val_seeds[:3]}")
np.save(lp_val_seeds_path, lp_val_seeds)
print(f"LP val seeds are saved to {lp_val_seeds_path}\n")
lp_val_labels_path = os.path.join(base_dir, "lp-val-labels.npy")
lp_val_labels = np.empty(num_vals * (10 + 1))
lp_val_labels[:num_vals] = 1
lp_val_labels[num_vals:] = 0
print(f"Part of val labels for link prediction: {lp_val_labels[:3]}")
np.save(lp_val_labels_path, lp_val_labels)
print(f"LP val labels are saved to {lp_val_labels_path}\n")
lp_val_indexes_path = os.path.join(base_dir, "lp-val-indexes.npy")
lp_val_indexes = np.arange(0, num_vals)
lp_val_neg_indexes = np.repeat(lp_val_indexes, 10)
lp_val_indexes = np.concatenate([lp_val_indexes, lp_val_neg_indexes])
print(f"Part of val indexes for link prediction: {lp_val_indexes[:3]}")
np.save(lp_val_indexes_path, lp_val_indexes)
print(f"LP val indexes are saved to {lp_val_indexes_path}\n")
lp_test_seeds_path = os.path.join(base_dir, "lp-test-seeds.npy")
lp_test_seeds = edges[-num_tests:, :]
lp_test_neg_dsts = np.random.randint(0, num_nodes, (num_tests, 10)).reshape(-1)
lp_test_neg_srcs = np.repeat(lp_test_seeds[:,0], 10)
lp_test_neg_seeds = np.concatenate((lp_test_neg_srcs, lp_test_neg_dsts)).reshape(2,-1).T
lp_test_seeds = np.concatenate((lp_test_seeds, lp_test_neg_seeds))
print(f"Part of test seeds for link prediction: {lp_test_seeds[:3]}")
np.save(lp_test_seeds_path, lp_test_seeds)
print(f"LP test seeds are saved to {lp_test_seeds_path}\n")
lp_test_labels_path = os.path.join(base_dir, "lp-test-labels.npy")
lp_test_labels = np.empty(num_tests * (10 + 1))
lp_test_labels[:num_tests] = 1
lp_test_labels[num_tests:] = 0
print(f"Part of val labels for link prediction: {lp_test_labels[:3]}")
np.save(lp_test_labels_path, lp_test_labels)
print(f"LP test labels are saved to {lp_test_labels_path}\n")
lp_test_indexes_path = os.path.join(base_dir, "lp-test-indexes.npy")
lp_test_indexes = np.arange(0, num_tests)
lp_test_neg_indexes = np.repeat(lp_test_indexes, 10)
lp_test_indexes = np.concatenate([lp_test_indexes, lp_test_neg_indexes])
print(f"Part of test indexes for link prediction: {lp_test_indexes[:3]}")
np.save(lp_test_indexes_path, lp_test_indexes)
print(f"LP test indexes are saved to {lp_test_indexes_path}\n")
Part of train seeds for link prediction: [[855 775]
[850 798]
[336 200]]
LP train seeds are saved to ./ondisk_dataset_homograph/lp-train-seeds.npy
Part of val seeds for link prediction: [[956 201]
[899 528]
[209 380]]
LP val seeds are saved to ./ondisk_dataset_homograph/lp-val-seeds.npy
Part of val labels for link prediction: [1. 1. 1.]
LP val labels are saved to ./ondisk_dataset_homograph/lp-val-labels.npy
Part of val indexes for link prediction: [0 1 2]
LP val indexes are saved to ./ondisk_dataset_homograph/lp-val-indexes.npy
Part of test seeds for link prediction: [[ 40 984]
[622 853]
[589 241]]
LP test seeds are saved to ./ondisk_dataset_homograph/lp-test-seeds.npy
Part of test labels for link prediction: [1. 1. 1.]
LP test labels are saved to ./ondisk_dataset_homograph/lp-test-labels.npy
Part of test indexes for link prediction: [0 1 2]
LP test indexes are saved to ./ondisk_dataset_homograph/lp-test-indexes.npy
Organize Data into YAML File
Now we need to create a metadata.yaml file which contains the paths and data formats of the graph structure, feature data, and training/validation/test sets.
Note:
- All paths should be relative to metadata.yaml.
- The following field is optional and not specified in the example below:
  - in_memory: indicates whether to load the data into memory or mmap it. Default is True.
Please refer to the YAML specification for more details.
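For instance, to mmap a large feature instead of loading it into memory, a feature entry could set the optional field explicitly. This is a sketch only and is not part of the file written below:
- domain: node
  name: feat_0
  format: numpy
  in_memory: false
  path: node-feat-0.npy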
[7]:
yaml_content = f"""
dataset_name: homogeneous_graph_nc_lp
graph:
nodes:
- num: {num_nodes}
edges:
- format: csv
path: {os.path.basename(edges_path)}
feature_data:
- domain: node
name: feat_0
format: numpy
path: {os.path.basename(node_feat_0_path)}
- domain: node
name: feat_1
format: torch
path: {os.path.basename(node_feat_1_path)}
- domain: edge
name: feat_0
format: numpy
path: {os.path.basename(edge_feat_0_path)}
- domain: edge
name: feat_1
format: torch
path: {os.path.basename(edge_feat_1_path)}
tasks:
- name: node_classification
num_classes: 10
train_set:
- data:
- name: seeds
format: numpy
path: {os.path.basename(nc_train_ids_path)}
- name: labels
format: torch
path: {os.path.basename(nc_train_labels_path)}
validation_set:
- data:
- name: seeds
format: numpy
path: {os.path.basename(nc_val_ids_path)}
- name: labels
format: torch
path: {os.path.basename(nc_val_labels_path)}
test_set:
- data:
- name: seeds
format: numpy
path: {os.path.basename(nc_test_ids_path)}
- name: labels
format: torch
path: {os.path.basename(nc_test_labels_path)}
- name: link_prediction
num_classes: 10
train_set:
- data:
- name: seeds
format: numpy
path: {os.path.basename(lp_train_seeds_path)}
validation_set:
- data:
- name: seeds
format: numpy
path: {os.path.basename(lp_val_seeds_path)}
- name: labels
format: numpy
path: {os.path.basename(lp_val_labels_path)}
- name: indexes
format: numpy
path: {os.path.basename(lp_val_indexes_path)}
test_set:
- data:
- name: seeds
format: numpy
path: {os.path.basename(lp_test_seeds_path)}
- name: labels
format: numpy
path: {os.path.basename(lp_test_labels_path)}
- name: indexes
format: numpy
path: {os.path.basename(lp_test_indexes_path)}
"""
metadata_path = os.path.join(base_dir, "metadata.yaml")
with open(metadata_path, "w") as f:
f.write(yaml_content)
Instantiate OnDiskDataset
Now we're ready to load the dataset via dgl.graphbolt.OnDiskDataset. When instantiating, we just pass in the base directory where the metadata.yaml file resides.
During the first instantiation, GraphBolt preprocesses the raw data, for example building the FusedCSCSamplingGraph from the edges. All the data, including the graph, feature data, and training/validation/test sets, are placed into the preprocessed directory after preprocessing. Any subsequent dataset loading will skip the preprocess stage.
After preprocessing, load() is required to be called explicitly in order to load the graph, feature data, and tasks.
[8]:
dataset = gb.OnDiskDataset(base_dir).load()
graph = dataset.graph
print(f"Loaded graph: {graph}\n")
feature = dataset.feature
print(f"Loaded feature store: {feature}\n")
tasks = dataset.tasks
nc_task = tasks[0]
print(f"Loaded node classification task: {nc_task}\n")
lp_task = tasks[1]
print(f"Loaded link prediction task: {lp_task}\n")
Start to preprocess the on-disk dataset.
Finish preprocessing the on-disk dataset.
Loaded graph: FusedCSCSamplingGraph(csc_indptr=tensor([ 0, 10, 21, ..., 9980, 9995, 10000], dtype=torch.int32),
indices=tensor([181, 410, 596, ..., 745, 997, 122], dtype=torch.int32),
total_num_nodes=1000, num_edges=10000,)
Loaded feature store: TorchBasedFeatureStore(
{(<OnDiskFeatureDataDomain.NODE: 'node'>, None, 'feat_0'): TorchBasedFeature(
feature=tensor([[0.6642, 0.4884, 0.4882, 0.3999, 0.0619],
[0.7041, 0.2383, 0.4081, 0.4412, 0.6981],
[0.2674, 0.5248, 0.1454, 0.8276, 0.5928],
...,
[0.6519, 0.2808, 0.8133, 0.0882, 0.6630],
[0.2974, 0.7740, 0.1138, 0.4471, 0.2879],
[0.9448, 0.7801, 0.3655, 0.6760, 0.9172]], dtype=torch.float64),
metadata={},
), (<OnDiskFeatureDataDomain.NODE: 'node'>, None, 'feat_1'): TorchBasedFeature(
feature=tensor([[0.0832, 0.0468, 0.6916, 0.8953, 0.2208],
[0.6560, 0.8490, 0.4157, 0.3341, 0.3320],
[0.8025, 0.3192, 0.2389, 0.8955, 0.4619],
...,
[0.6520, 0.4619, 0.5146, 0.6097, 0.2386],
[0.7205, 0.9640, 0.6694, 0.1268, 0.5149],
[0.5294, 0.7750, 0.7792, 0.0633, 0.3548]]),
metadata={},
), (<OnDiskFeatureDataDomain.EDGE: 'edge'>, None, 'feat_0'): TorchBasedFeature(
feature=tensor([[0.8520, 0.3744, 0.4329, 0.1148, 0.1608],
[0.0980, 0.4897, 0.4891, 0.9673, 0.3854],
[0.4057, 0.0456, 0.5742, 0.3933, 0.1086],
...,
[0.0465, 0.2654, 0.7799, 0.4601, 0.6528],
[0.0966, 0.4897, 0.2502, 0.4873, 0.8816],
[0.6565, 0.7743, 0.6417, 0.2032, 0.5695]], dtype=torch.float64),
metadata={},
), (<OnDiskFeatureDataDomain.EDGE: 'edge'>, None, 'feat_1'): TorchBasedFeature(
feature=tensor([[0.9977, 0.6291, 0.8746, 0.1186, 0.6706],
[0.4452, 0.6584, 0.8161, 0.8932, 0.5067],
[0.0245, 0.8730, 0.6280, 0.7643, 0.1814],
...,
[0.4550, 0.8023, 0.3347, 0.6111, 0.3855],
[0.9893, 0.4175, 0.5815, 0.0469, 0.0468],
[0.9241, 0.3524, 0.5294, 0.5368, 0.7772]]),
metadata={},
)}
)
Loaded node classification task: OnDiskTask(validation_set=ItemSet(
items=(tensor([923, 240, 934, 846, 20, 775, 142, 617, 109, 974, 383, 228, 69, 178,
403, 189, 973, 312, 276, 829, 794, 447, 900, 88, 724, 281, 715, 814,
953, 696, 34, 633, 637, 275, 315, 164, 888, 412, 405, 567, 855, 595,
434, 333, 640, 384, 477, 140, 739, 778, 282, 769, 624, 262, 491, 662,
484, 459, 969, 221, 935, 569, 575, 892, 758, 757, 55, 138, 130, 344,
322, 767, 653, 279, 903, 191, 820, 655, 476, 267, 26, 157, 742, 802,
465, 623, 737, 962, 450, 199, 631, 510, 527, 719, 848, 348, 347, 321,
796, 291, 795, 915, 682, 202, 101, 196, 183, 878, 206, 147, 584, 52,
270, 673, 123, 621, 690, 730, 319, 83, 692, 716, 817, 390, 679, 732,
66, 306, 821, 580, 978, 840, 528, 260, 327, 740, 210, 931, 539, 151,
288, 113, 955, 610, 374, 389, 168, 363, 718, 889, 295, 534, 46, 530,
224, 593, 810, 201, 169, 585, 899, 521, 776, 675, 421, 801, 245, 886,
124, 606, 331, 950, 856, 371, 646, 880, 782, 746, 627, 929, 490, 907,
893, 70, 436, 761, 104, 96, 117, 590, 512, 803, 377, 250, 970, 203,
438, 813, 238, 455], dtype=torch.int32), tensor([7, 3, 1, 1, 8, 3, 5, 0, 3, 4, 0, 8, 6, 6, 1, 5, 8, 6, 9, 5, 8, 9, 6, 0,
0, 3, 3, 4, 4, 5, 9, 6, 2, 0, 2, 7, 3, 3, 8, 9, 8, 8, 5, 5, 0, 6, 1, 0,
5, 5, 9, 4, 8, 3, 8, 2, 0, 1, 0, 1, 3, 4, 5, 9, 7, 1, 1, 3, 2, 6, 1, 6,
5, 4, 5, 6, 7, 0, 8, 0, 6, 9, 9, 2, 1, 0, 7, 3, 9, 9, 1, 6, 5, 2, 5, 3,
0, 0, 3, 2, 3, 9, 7, 7, 9, 9, 7, 3, 4, 7, 7, 4, 3, 5, 6, 0, 7, 0, 5, 5,
4, 8, 2, 2, 9, 3, 0, 4, 9, 0, 7, 2, 6, 1, 2, 3, 1, 8, 6, 9, 2, 1, 1, 6,
8, 3, 4, 3, 8, 7, 1, 8, 8, 2, 3, 2, 1, 3, 9, 7, 1, 2, 5, 6, 7, 1, 0, 3,
0, 1, 2, 0, 4, 2, 8, 8, 9, 0, 4, 3, 3, 8, 8, 9, 2, 5, 2, 3, 4, 1, 0, 4,
4, 7, 0, 0, 2, 3, 1, 1])),
names=('seeds', 'labels'),
),
train_set=ItemSet(
items=(tensor([284, 651, 844, 33, 924, 367, 582, 222, 964, 972, 223, 263, 522, 639,
150, 400, 213, 255, 97, 159, 722, 661, 917, 704, 218, 672, 330, 976,
948, 843, 543, 466, 997, 960, 193, 328, 143, 446, 342, 309, 533, 965,
707, 127, 688, 2, 957, 986, 693, 904, 932, 765, 431, 6, 667, 712,
362, 586, 398, 72, 643, 591, 299, 141, 161, 204, 609, 485, 244, 239,
302, 755, 884, 551, 670, 708, 536, 336, 703, 800, 149, 86, 462, 784,
5, 537, 133, 861, 920, 112, 186, 603, 200, 1, 493, 869, 676, 642,
717, 956, 608, 445, 148, 501, 334, 979, 681, 561, 175, 668, 947, 304,
453, 187, 967, 293, 944, 12, 918, 355, 94, 77, 167, 387, 454, 991,
36, 762, 842, 984, 341, 660, 474, 118, 269, 316, 674, 126, 583, 589,
747, 296, 397, 789, 17, 949, 264, 432, 205, 435, 894, 79, 122, 975,
871, 81, 496, 40, 274, 473, 994, 546, 864, 981, 227, 392, 544, 427,
75, 963, 11, 709, 738, 56, 556, 559, 461, 728, 857, 500, 626, 839,
910, 233, 684, 190, 62, 895, 602, 669, 868, 804, 39, 43, 375, 90,
700, 58, 860, 926, 982, 139, 613, 80, 809, 729, 781, 579, 560, 422,
826, 832, 475, 547, 9, 404, 253, 604, 54, 366, 32, 562, 63, 495,
990, 10, 246, 798, 230, 507, 47, 361, 697, 320, 21, 600, 155, 339,
508, 261, 25, 908, 463, 735, 694, 152, 691, 989, 851, 231, 714, 845,
616, 87, 78, 764, 710, 834, 825, 108, 369, 983, 326, 305, 942, 91,
396, 873, 0, 605, 777, 875, 259, 928, 61, 685, 351, 129, 242, 859,
828, 176, 797, 581, 511, 850, 952, 343, 51, 919, 531, 555, 184, 225,
3, 665, 283, 346, 324, 451, 612, 28, 656, 587, 268, 699, 265, 357,
553, 812, 611, 503, 441, 273, 905, 13, 570, 173, 557, 966, 439, 376,
552, 874, 858, 550, 488, 658, 632, 992, 913, 513, 216, 487, 174, 921,
212, 830, 517, 413, 911, 256, 564, 819, 425, 902, 549, 849, 160, 415,
594, 993, 44, 486, 440, 538, 388, 943, 35, 793, 313, 483, 131, 408,
323, 247, 763, 468, 482, 574, 303, 128, 82, 566, 563, 618, 266, 67,
548, 38, 448, 518, 678, 16, 68, 23, 614, 254, 497, 677, 601, 572,
219, 181, 414, 945, 30, 815, 237, 783, 540, 958, 525, 998, 588, 774,
308, 879, 232, 515, 598, 896, 31, 136, 657, 723, 479, 318, 654, 298,
381, 936, 89, 663, 689, 170, 125, 165, 406, 426, 901, 423, 290, 695,
85, 368, 95, 419, 792, 49, 287, 74, 380, 541, 744, 838, 243, 208,
759, 285, 27, 464, 307, 941, 770, 504, 771, 329, 876, 578, 433, 754,
753, 177, 345, 145, 401, 429, 41, 930, 999, 98, 499, 705, 885, 277,
883, 215, 650, 198, 332, 300, 565, 114, 987, 45, 18, 768, 297, 881,
980, 671, 573, 615, 365, 721, 382, 647, 469, 897, 619, 399, 402, 898,
866, 946, 542, 545, 683, 395, 725, 909, 172, 217, 317, 354, 135, 492,
760, 635, 120, 442, 641, 71, 648, 119, 750, 831, 890, 73, 121, 4,
701, 937, 498, 529, 634, 818, 478, 748, 596, 780, 506, 115, 711, 659,
458, 420, 234, 280, 686, 311, 116, 629, 514, 638, 520, 607, 411, 743,
480, 571, 137, 833, 636, 195, 337, 241, 372, 847, 153, 257, 325, 824,
378, 197, 470, 211, 863, 14, 741, 460, 791, 535, 15, 182, 862, 912,
229, 734, 977, 424, 853, 64, 887, 822, 959, 444, 457, 106],
dtype=torch.int32), tensor([3, 7, 3, 0, 8, 5, 1, 4, 1, 4, 7, 3, 9, 2, 0, 5, 4, 4, 1, 4, 7, 4, 6, 2,
0, 9, 3, 7, 3, 7, 4, 8, 8, 8, 0, 5, 0, 0, 9, 4, 9, 1, 5, 9, 5, 6, 1, 0,
3, 7, 5, 1, 1, 6, 0, 7, 4, 4, 9, 1, 3, 9, 5, 7, 1, 4, 3, 2, 6, 9, 5, 8,
0, 7, 4, 9, 9, 6, 8, 5, 6, 3, 4, 4, 9, 1, 7, 7, 8, 7, 6, 4, 0, 3, 7, 5,
9, 8, 9, 9, 1, 5, 6, 8, 2, 0, 4, 0, 0, 3, 4, 4, 6, 6, 4, 9, 6, 4, 5, 1,
1, 2, 4, 6, 4, 4, 8, 5, 8, 4, 0, 4, 3, 1, 6, 2, 3, 0, 6, 0, 4, 0, 8, 4,
6, 0, 1, 1, 0, 7, 6, 1, 1, 7, 3, 0, 6, 6, 8, 3, 0, 8, 3, 6, 4, 5, 2, 6,
9, 4, 9, 3, 9, 2, 0, 3, 9, 2, 1, 4, 4, 9, 9, 1, 6, 0, 1, 4, 7, 5, 9, 4,
4, 4, 6, 4, 1, 8, 7, 8, 0, 9, 2, 5, 7, 5, 8, 1, 3, 6, 2, 8, 6, 2, 7, 7,
4, 0, 3, 9, 7, 0, 7, 5, 8, 0, 2, 7, 3, 5, 2, 9, 5, 2, 2, 8, 6, 8, 4, 4,
7, 5, 4, 9, 0, 3, 7, 7, 7, 4, 4, 3, 4, 2, 5, 7, 7, 5, 5, 7, 0, 2, 4, 7,
1, 1, 1, 9, 0, 6, 9, 6, 2, 4, 3, 5, 0, 7, 2, 6, 8, 8, 2, 6, 3, 1, 1, 7,
4, 6, 8, 3, 8, 0, 9, 7, 6, 4, 4, 2, 6, 0, 9, 4, 1, 7, 9, 6, 0, 0, 0, 8,
9, 9, 5, 4, 1, 1, 7, 8, 3, 1, 4, 8, 9, 5, 5, 3, 1, 6, 9, 3, 9, 6, 6, 4,
8, 1, 9, 5, 8, 3, 5, 4, 9, 8, 6, 5, 3, 5, 9, 6, 3, 4, 1, 2, 4, 8, 2, 1,
3, 0, 0, 3, 0, 1, 7, 6, 7, 7, 7, 2, 6, 9, 2, 5, 9, 5, 2, 0, 4, 1, 0, 6,
1, 8, 6, 4, 7, 5, 5, 9, 1, 2, 6, 0, 5, 9, 2, 2, 9, 3, 2, 3, 6, 3, 3, 7,
2, 3, 3, 8, 5, 6, 7, 0, 9, 2, 0, 5, 2, 4, 2, 8, 4, 9, 5, 3, 2, 4, 0, 5,
0, 4, 7, 7, 6, 0, 8, 5, 5, 9, 1, 9, 6, 4, 6, 5, 2, 2, 0, 3, 6, 4, 7, 3,
8, 5, 9, 3, 8, 9, 2, 4, 4, 8, 8, 2, 2, 7, 7, 6, 5, 6, 4, 7, 1, 1, 0, 2,
8, 3, 6, 5, 2, 7, 0, 2, 5, 1, 5, 0, 9, 5, 1, 9, 8, 6, 1, 1, 4, 6, 8, 9,
5, 9, 8, 5, 5, 0, 0, 4, 0, 0, 3, 2, 9, 3, 2, 7, 2, 3, 9, 2, 6, 8, 9, 1,
3, 3, 3, 1, 4, 3, 0, 0, 0, 4, 0, 8, 6, 6, 8, 4, 2, 9, 5, 0, 5, 3, 1, 9,
5, 6, 8, 7, 3, 4, 1, 7, 0, 4, 7, 2, 8, 5, 6, 1, 1, 0, 2, 9, 7, 5, 0, 0,
8, 3, 7, 5, 8, 7, 9, 8, 6, 1, 1, 3, 3, 4, 2, 5, 9, 5, 6, 7, 4, 5, 0, 9])),
names=('seeds', 'labels'),
),
test_set=ItemSet(
items=(tensor([107, 664, 870, 939, 207, 162, 509, 353, 756, 103, 745, 100, 8, 294,
698, 171, 494, 971, 988, 301, 766, 338, 630, 467, 272, 19, 22, 790,
940, 995, 37, 865, 996, 358, 925, 352, 726, 379, 576, 417, 481, 644,
65, 502, 471, 214, 916, 524, 99, 951, 180, 156, 568, 166, 472, 57,
356, 985, 961, 179, 852, 271, 622, 854, 42, 836, 773, 340, 437, 418,
50, 394, 841, 687, 105, 416, 409, 807, 827, 713, 872, 144, 554, 526,
628, 489, 53, 720, 59, 505, 706, 779, 625, 386, 235, 731, 523, 236,
649, 373, 158, 922, 226, 93, 787, 452, 84, 811, 391, 927, 24, 891,
350, 592, 314, 76, 194, 385, 134, 733, 335, 251, 816, 577, 154, 933,
188, 102, 752, 430, 597, 248, 808, 620, 258, 449, 785, 289, 938, 558,
110, 751, 29, 443, 799, 349, 364, 914, 727, 772, 192, 60, 805, 360,
877, 702, 393, 516, 645, 837, 407, 310, 163, 92, 428, 666, 519, 835,
532, 7, 968, 249, 209, 806, 359, 788, 132, 252, 906, 185, 749, 410,
48, 736, 680, 599, 456, 286, 220, 954, 370, 111, 786, 652, 292, 867,
882, 146, 823, 278], dtype=torch.int32), tensor([6, 9, 4, 0, 4, 5, 3, 3, 3, 6, 6, 1, 0, 7, 9, 3, 1, 4, 2, 0, 9, 4, 0, 8,
6, 9, 8, 1, 1, 9, 6, 3, 8, 2, 9, 6, 3, 3, 0, 1, 7, 7, 5, 1, 3, 7, 2, 9,
5, 3, 2, 9, 4, 7, 5, 0, 6, 6, 3, 7, 3, 1, 8, 0, 1, 0, 2, 8, 0, 4, 9, 0,
1, 3, 8, 8, 1, 5, 1, 7, 0, 3, 8, 3, 0, 5, 2, 0, 3, 7, 5, 0, 6, 1, 6, 6,
9, 9, 0, 7, 2, 5, 8, 9, 9, 0, 0, 4, 7, 8, 4, 9, 2, 1, 1, 0, 3, 1, 9, 2,
1, 8, 6, 4, 0, 0, 4, 6, 4, 5, 7, 1, 7, 6, 0, 5, 8, 8, 0, 5, 6, 4, 0, 0,
8, 5, 8, 9, 2, 0, 2, 5, 1, 2, 9, 8, 2, 4, 7, 8, 7, 1, 0, 4, 5, 8, 1, 5,
3, 3, 6, 0, 1, 6, 8, 1, 5, 5, 9, 5, 1, 2, 7, 7, 9, 1, 3, 5, 6, 6, 5, 4,
8, 5, 8, 7, 1, 3, 6, 1])),
names=('seeds', 'labels'),
),
metadata={'name': 'node_classification', 'num_classes': 10},)
Loaded link prediction task: OnDiskTask(validation_set=ItemSet(
items=(tensor([[956, 201],
[899, 528],
[209, 380],
...,
[693, 506],
[693, 911],
[693, 979]], dtype=torch.int32), tensor([1., 1., 1., ..., 0., 0., 0.], dtype=torch.float64), tensor([ 0, 1, 2, ..., 1999, 1999, 1999])),
names=('seeds', 'labels', 'indexes'),
),
train_set=ItemSet(
items=(tensor([[855, 775],
[850, 798],
[336, 200],
...,
[819, 807],
[758, 628],
[324, 175]], dtype=torch.int32),),
names=('seeds',),
),
test_set=ItemSet(
items=(tensor([[ 40, 984],
[622, 853],
[589, 241],
...,
[339, 387],
[339, 570],
[339, 443]], dtype=torch.int32), tensor([1., 1., 1., ..., 0., 0., 0.], dtype=torch.float64), tensor([ 0, 1, 2, ..., 1999, 1999, 1999])),
names=('seeds', 'labels', 'indexes'),
),
metadata={'name': 'link_prediction', 'num_classes': 10},)
/home/ubuntu/prod-doc/readthedocs.org/user_builds/dgl/envs/latest/lib/python3.8/site-packages/dgl-2.3-py3.8-linux-x86_64.egg/dgl/graphbolt/impl/ondisk_dataset.py:460: DGLWarning: Edge feature is stored, but edge IDs are not saved.
dgl_warning("Edge feature is stored, but edge IDs are not saved.")