FraudYelpDataset
- class dgl.data.FraudYelpDataset(raw_dir=None, random_seed=717, train_size=0.7, val_size=0.1, force_reload=False, verbose=True, transform=None)
Bases: FraudDataset
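Fraud Yelp Dataset
The Yelp dataset includes hotel and restaurant reviews that Yelp has either filtered (spam) or recommended (legitimate). It supports spam review detection, which is a binary classification task. The 32 handcrafted features from <http://dx.doi.org/10.1145/2783258.2783370> are used as the raw node features. Reviews are the nodes of the graph, and the three relations are:
R-U-R: connects reviews posted by the same user
R-S-R: connects reviews of the same product that have the same star rating (1-5 stars)
R-T-R: connects two reviews of the same product posted in the same month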
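To make the three relation definitions concrete, here is a minimal sketch that derives them from a handful of made-up review records. The records, the `link_by` helper, and the undirected-pair representation are all assumptions for illustration; this is not how the dataset itself was built or how DGL stores its edges.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical toy reviews: (review_id, user, product, stars, month)
reviews = [
    (0, "alice", "cafe", 5, "2014-01"),
    (1, "alice", "hotel", 2, "2014-03"),
    (2, "bob", "cafe", 5, "2014-01"),
    (3, "carol", "cafe", 1, "2014-01"),
]


def link_by(key):
    """Connect every pair of reviews that share the same grouping key."""
    groups = defaultdict(list)
    for review in reviews:
        groups[key(review)].append(review[0])
    edges = set()
    for ids in groups.values():
        edges.update(combinations(sorted(ids), 2))
    return edges


r_u_r = link_by(lambda r: r[1])          # same user
r_s_r = link_by(lambda r: (r[2], r[3]))  # same product, same star rating
r_t_r = link_by(lambda r: (r[2], r[4]))  # same product, same month

print(r_u_r)          # {(0, 1)}
print(r_s_r)          # {(0, 2)}
print(sorted(r_t_r))  # [(0, 2), (0, 3), (2, 3)]
```

In this toy data the three "cafe" reviews from January are all linked under R-T-R, while only the two five-star ones are linked under R-S-R. This matches the real statistics, where R-U-R has far fewer edges than the two product-based relations.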
Statistics:
Nodes: 45,954
Edges:
R-U-R: 98,630
R-T-R: 1,147,232
R-S-R: 6,805,486
Classes:
Positive (spam): 6,677
Negative (legitimate): 39,277
Positive-Negative ratio: 1 : 5.9
Node feature size: 32
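The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers listed in the statistics; the variable names are illustrative.

```python
nodes = 45_954
edges = {"R-U-R": 98_630, "R-T-R": 1_147_232, "R-S-R": 6_805_486}
positive, negative = 6_677, 39_277

# The two classes together account for every node.
assert positive + negative == nodes

ratio = negative / positive        # negatives per positive
spam_fraction = positive / nodes   # share of spam reviews
total_edges = sum(edges.values())  # edges across all three relations

print(round(ratio, 1))          # 5.9
print(round(spam_fraction, 3))  # 0.145
print(total_edges)              # 8051348
```

Only about 14.5% of the reviews are spam, so plain accuracy can be misleading on this dataset. A classifier that always predicts "legitimate" would already score roughly 85%.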
- Parameters:
raw_dir (str) – Directory that will store the downloaded data, or that already stores the input data. Default: ~/.dgl/
random_seed (int) – Random seed used when splitting the dataset. Default: 717
train_size (float) – Fraction of the dataset used for the training set. Default: 0.7
val_size (float) – Fraction of the dataset used for the validation set. The testing set takes the remaining fraction, (1 - train_size - val_size). Default: 0.1
force_reload (bool) – Whether to reload the dataset. Default: False
verbose (bool) – Whether to print out progress information. Default: True
transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
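The interplay of `random_seed`, `train_size`, and `val_size` can be pictured with a small pure-Python sketch. The `random_split_masks` helper is hypothetical and is not DGL's internal implementation, so the actual node assignment in DGL may differ. It only shows how a seeded shuffle and two fractions produce three disjoint boolean masks, with the test set taking whatever remains.

```python
import random


def random_split_masks(num_nodes, train_size=0.7, val_size=0.1, random_seed=717):
    """Hypothetical helper: shuffle node ids, then cut them into three masks."""
    index = list(range(num_nodes))
    random.Random(random_seed).shuffle(index)  # seeded, so the split is repeatable

    n_train = int(train_size * num_nodes)
    n_val = int(val_size * num_nodes)

    train_mask = [False] * num_nodes
    val_mask = [False] * num_nodes
    test_mask = [False] * num_nodes
    for i in index[:n_train]:
        train_mask[i] = True
    for i in index[n_train:n_train + n_val]:
        val_mask[i] = True
    for i in index[n_train + n_val:]:  # remainder: 1 - train_size - val_size
        test_mask[i] = True
    return train_mask, val_mask, test_mask


train_mask, val_mask, test_mask = random_split_masks(45_954)
print(sum(train_mask), sum(val_mask), sum(test_mask))  # prints 32167 4595 9192
```

With the default fractions, roughly 70% of the 45,954 reviews land in the training set, 10% in validation, and the remaining 20% in testing. Reusing the same seed reproduces the same masks.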
Examples
>>> from dgl.data import FraudYelpDataset
>>> dataset = FraudYelpDataset()
>>> graph = dataset[0]
>>> num_classes = dataset.num_classes
>>> feat = graph.ndata['feature']
>>> label = graph.ndata['label']
- __getitem__(idx)
Get the graph object.
- Parameters:
idx (int) – Item index
- Returns:
graph structure, node features, node labels and masks
ndata['feature']: node features
ndata['label']: node labels
ndata['train_mask']: mask of training set
ndata['val_mask']: mask of validation set
ndata['test_mask']: mask of testing set
- Return type:
dgl.DGLGraph
- __len__()
Number of data examples.