Tolokers Dataset

class dgl.data.TolokersDataset(raw_dir=None, force_reload=False, verbose=True, transform=None)[source]

Bases: HeterophilousGraphDataset

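Tolokers dataset from the paper "A Critical Look at the Evaluation of GNNs under Heterophily: Are We Really Making Progress?" <https://arxiv.org/abs/2302.11640>.

This dataset is based on data from the Toloka crowdsourcing platform. Nodes represent tolokers (workers). An edge connects two tolokers if they have worked on the same task. The goal is to predict which tolokers have been banned in one of the projects. Node features are based on each worker's profile information and task performance statistics.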

Statistics:

Nodes: 11758
Edges: 1038000
Classes: 2
Node features: 10
10 train/val/test splits
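
Because the dataset ships with 10 train/val/test splits, one common way to report a result is the mean and standard deviation of a metric across the splits. The following is a minimal sketch of that aggregation only. The helper name `summarize_splits` and the example scores are hypothetical and not part of DGL, and the choice of metric is left to the caller.

```python
import statistics
from typing import Sequence, Tuple


def summarize_splits(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of per-split scores.

    `scores` is assumed to hold one evaluation result per data split,
    for example 10 values for the 10 predefined splits.
    """
    if len(scores) < 2:
        raise ValueError("need at least two per-split scores")
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical per-split scores, used only to illustrate the call.
mean_score, std_score = summarize_splits([0.5, 0.7])
print(f"{mean_score:.3f} +/- {std_score:.3f}")
```

`statistics.stdev` computes the sample standard deviation (dividing by n - 1); use `statistics.pstdev` instead if the population form is preferred.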

Parameters:

raw_dir (str, optional) – Raw file directory to store the processed data. Default: ~/.dgl/
force_reload (bool, optional) – Whether to re-download the data source. Default: False
verbose (bool, optional) – Whether to print progress information. Default: True
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access. Default: None

num_classes

Number of node classes

Type: int

Examples

>>> from dgl.data import TolokersDataset
>>> dataset = TolokersDataset()
>>> g = dataset[0]
>>> num_classes = dataset.num_classes

>>> # get node features
>>> feat = g.ndata["feat"]

>>> # get the first data split
>>> train_mask = g.ndata["train_mask"][:, 0]
>>> val_mask = g.ndata["val_mask"][:, 0]
>>> test_mask = g.ndata["test_mask"][:, 0]

>>> # get labels
>>> label = g.ndata['label']
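
The example above selects only the first split (column 0 of each mask). A minimal sketch for walking over all splits is shown below. It assumes each mask is stored as a 2-D array of shape (number of nodes, number of splits), as the `[:, 0]` indexing above suggests. The helper `iter_splits` is hypothetical and not part of DGL.

```python
from typing import Any, Iterator, Mapping, Tuple


def iter_splits(
    ndata: Mapping[str, Any], num_splits: int = 10
) -> Iterator[Tuple[Any, Any, Any]]:
    """Yield (train_mask, val_mask, test_mask) for each predefined split.

    Assumes ndata["train_mask"], ndata["val_mask"] and ndata["test_mask"]
    support 2-D indexing of the form mask[:, i], where i is the split index.
    """
    for i in range(num_splits):
        yield (
            ndata["train_mask"][:, i],
            ndata["val_mask"][:, i],
            ndata["test_mask"][:, i],
        )
```

With the graph from the example, this would be used as `for train_mask, val_mask, test_mask in iter_splits(g.ndata): ...`, training and evaluating once per split.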

__getitem__(idx): Gets the data object at the given index.

__len__(): The number of examples in the dataset.