ItemSampler
- class dgl.graphbolt.ItemSampler(item_set: ~dgl.graphbolt.itemset.ItemSet | ~dgl.graphbolt.itemset.ItemSetDict, batch_size: int, minibatcher: ~typing.Callable | None = <function minibatcher_default>, drop_last: bool | None = False, shuffle: bool | None = False, use_indexing: bool | None = True, buffer_size: int | None = -1)[source]
Bases: IterDataPipe
A sampler that iterates over input items and creates subsets (minibatches).
The input items could be node IDs, node pairs with or without labels, node pairs with negative sources/destinations, DGLGraphs, and heterogeneous counterparts.
Note: This class ItemSampler is deliberately not decorated with torchdata.datapipes.functional_datapipe. This indicates it does not support function-like invocation. But any iterable datapipes from torchdata can be further appended.
- Parameters:
item_set (Union[ItemSet, ItemSetDict]) – Data to be sampled.
batch_size (int) – The size of each batch.
minibatcher (Optional[Callable]) – A callable that takes in a list of items and returns a MiniBatch.
drop_last (bool) – Option to drop the last batch if it is not full.
shuffle (bool) – Option to shuffle before sampling.
use_indexing (bool) – Option to use indexing to slice items from the item set. This is an optimization to avoid time-consuming iteration over the item set. If the item set does not support indexing, this option will be disabled automatically. If the item set supports indexing but the user wants to disable it, this option can be set to False. By default, it is set to True.
buffer_size (int) – The size of the buffer used to store sliced items from the ItemSet or ItemSetDict. By default, it is set to -1, which means the buffer size will be set as the total number of items in the item set if indexing is supported. If indexing is not supported, it is set to 10 * batch size. If the item set is too large, it is recommended to set a smaller buffer size to avoid out-of-memory errors. As items are shuffled only within each buffer, a smaller buffer size may incur less randomness, and such less randomness can further affect the training performance, such as convergence speed and accuracy. Therefore, it is recommended to set a larger buffer size if possible.
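As a rough illustration of how batch_size, drop_last, and shuffle interact, the core slicing logic can be sketched in plain Python with torch. This is a simplified sketch, not the actual ItemSampler implementation (which also handles buffering, ItemSetDict, and MiniBatch construction via the minibatcher callable); simple_item_batches is a hypothetical helper name:

```python
import torch

def simple_item_batches(items, batch_size, drop_last=False, shuffle=False):
    # Hypothetical sketch of ItemSampler-style batching over a tensor of items.
    indices = torch.randperm(len(items)) if shuffle else torch.arange(len(items))
    for start in range(0, len(items), batch_size):
        batch_idx = indices[start:start + batch_size]
        if drop_last and len(batch_idx) < batch_size:
            break  # discard the final incomplete batch
        yield items[batch_idx]

# 10 items with batch_size=4 yield batches of sizes 4, 4, 2 (last one kept).
batches = list(simple_item_batches(torch.arange(10), batch_size=4))
# With drop_last=True, the trailing batch of size 2 is discarded.
trimmed = list(simple_item_batches(torch.arange(10), batch_size=4, drop_last=True))
```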
Examples
1. Node IDs.
>>> import torch
>>> from dgl import graphbolt as gb
>>> item_set = gb.ItemSet(torch.arange(0, 10), names="seeds")
>>> item_sampler = gb.ItemSampler(
...     item_set, batch_size=4, shuffle=False, drop_last=False
... )
>>> next(iter(item_sampler))
MiniBatch(seeds=tensor([0, 1, 2, 3]),
          sampled_subgraphs=None,
          node_features=None,
          labels=None,
          input_nodes=None,
          indexes=None,
          edge_features=None,
          compacted_seeds=None,
          blocks=None,)
2. Node pairs.
>>> item_set = gb.ItemSet(torch.arange(0, 20).reshape(-1, 2),
...     names="seeds")
>>> item_sampler = gb.ItemSampler(
...     item_set, batch_size=4, shuffle=False, drop_last=False
... )
>>> next(iter(item_sampler))
MiniBatch(seeds=tensor([[0, 1],
                        [2, 3],
                        [4, 5],
                        [6, 7]]),
          sampled_subgraphs=None,
          node_features=None,
          labels=None,
          input_nodes=None,
          indexes=None,
          edge_features=None,
          compacted_seeds=None,
          blocks=None,)
3. Node pairs and labels.
>>> item_set = gb.ItemSet(
...     (torch.arange(0, 20).reshape(-1, 2), torch.arange(10, 20)),
...     names=("seeds", "labels")
... )
>>> item_sampler = gb.ItemSampler(
...     item_set, batch_size=4, shuffle=False, drop_last=False
... )
>>> next(iter(item_sampler))
MiniBatch(seeds=tensor([[0, 1],
                        [2, 3],
                        [4, 5],
                        [6, 7]]),
          sampled_subgraphs=None,
          node_features=None,
          labels=tensor([10, 11, 12, 13]),
          input_nodes=None,
          indexes=None,
          edge_features=None,
          compacted_seeds=None,
          blocks=None,)
4. Node pairs, labels and indexes.
>>> seeds = torch.arange(0, 20).reshape(-1, 2)
>>> labels = torch.tensor([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
>>> indexes = torch.tensor([0, 1, 0, 0, 0, 0, 1, 1, 1, 1])
>>> item_set = gb.ItemSet((seeds, labels, indexes), names=("seeds",
...     "labels", "indexes"))
>>> item_sampler = gb.ItemSampler(
...     item_set, batch_size=4, shuffle=False, drop_last=False
... )
>>> next(iter(item_sampler))
MiniBatch(seeds=tensor([[0, 1],
                        [2, 3],
                        [4, 5],
                        [6, 7]]),
          sampled_subgraphs=None,
          node_features=None,
          labels=tensor([1, 1, 0, 0]),
          input_nodes=None,
          indexes=tensor([0, 1, 0, 0]),
          edge_features=None,
          compacted_seeds=None,
          blocks=None,)
5. DGLGraphs.
>>> import dgl
>>> graphs = [dgl.rand_graph(10, 20) for _ in range(5)]
>>> item_set = gb.ItemSet(graphs)
>>> item_sampler = gb.ItemSampler(item_set, 3)
>>> list(item_sampler)
[Graph(num_nodes=30, num_edges=60,
       ndata_schemes={}
       edata_schemes={}),
 Graph(num_nodes=20, num_edges=40,
       ndata_schemes={}
       edata_schemes={})]
6. Further process batches with other datapipes such as torchdata.datapipes.iter.Mapper.
>>> item_set = gb.ItemSet(torch.arange(0, 10))
>>> data_pipe = gb.ItemSampler(item_set, 4)
>>> def add_one(batch):
...     return batch + 1
>>> data_pipe = data_pipe.map(add_one)
>>> list(data_pipe)
[tensor([1, 2, 3, 4]), tensor([5, 6, 7, 8]), tensor([ 9, 10])]
7. Heterogeneous node IDs.
>>> ids = {
...     "user": gb.ItemSet(torch.arange(0, 5), names="seeds"),
...     "item": gb.ItemSet(torch.arange(0, 6), names="seeds"),
... }
>>> item_set = gb.ItemSetDict(ids)
>>> item_sampler = gb.ItemSampler(item_set, batch_size=4)
>>> next(iter(item_sampler))
MiniBatch(seeds={'user': tensor([0, 1, 2, 3])},
          sampled_subgraphs=None,
          node_features=None,
          labels=None,
          input_nodes=None,
          indexes=None,
          edge_features=None,
          compacted_seeds=None,
          blocks=None,)
8. Heterogeneous node pairs.
>>> seeds_like = torch.arange(0, 10).reshape(-1, 2)
>>> seeds_follow = torch.arange(10, 20).reshape(-1, 2)
>>> item_set = gb.ItemSetDict({
...     "user:like:item": gb.ItemSet(
...         seeds_like, names="seeds"),
...     "user:follow:user": gb.ItemSet(
...         seeds_follow, names="seeds"),
... })
>>> item_sampler = gb.ItemSampler(item_set, batch_size=4)
>>> next(iter(item_sampler))
MiniBatch(seeds={'user:like:item': tensor([[0, 1],
                                           [2, 3],
                                           [4, 5],
                                           [6, 7]])},
          sampled_subgraphs=None,
          node_features=None,
          labels=None,
          input_nodes=None,
          indexes=None,
          edge_features=None,
          compacted_seeds=None,
          blocks=None,)
9. Heterogeneous node pairs and labels.
>>> seeds_like = torch.arange(0, 10).reshape(-1, 2)
>>> labels_like = torch.arange(0, 5)
>>> seeds_follow = torch.arange(10, 20).reshape(-1, 2)
>>> labels_follow = torch.arange(5, 10)
>>> item_set = gb.ItemSetDict({
...     "user:like:item": gb.ItemSet((seeds_like, labels_like),
...         names=("seeds", "labels")),
...     "user:follow:user": gb.ItemSet((seeds_follow, labels_follow),
...         names=("seeds", "labels")),
... })
>>> item_sampler = gb.ItemSampler(item_set, batch_size=4)
>>> next(iter(item_sampler))
MiniBatch(seeds={'user:like:item': tensor([[0, 1],
                                           [2, 3],
                                           [4, 5],
                                           [6, 7]])},
          sampled_subgraphs=None,
          node_features=None,
          labels={'user:like:item': tensor([0, 1, 2, 3])},
          input_nodes=None,
          indexes=None,
          edge_features=None,
          compacted_seeds=None,
          blocks=None,)
10. Heterogeneous node pairs, labels and indexes.
>>> seeds_like = torch.arange(0, 10).reshape(-1, 2)
>>> labels_like = torch.tensor([1, 1, 0, 0, 0])
>>> indexes_like = torch.tensor([0, 1, 0, 0, 1])
>>> seeds_follow = torch.arange(20, 30).reshape(-1, 2)
>>> labels_follow = torch.tensor([1, 1, 0, 0, 0])
>>> indexes_follow = torch.tensor([0, 1, 0, 0, 1])
>>> item_set = gb.ItemSetDict({
...     "user:like:item": gb.ItemSet((seeds_like, labels_like,
...         indexes_like), names=("seeds", "labels", "indexes")),
...     "user:follow:user": gb.ItemSet((seeds_follow, labels_follow,
...         indexes_follow), names=("seeds", "labels", "indexes")),
... })
>>> item_sampler = gb.ItemSampler(item_set, batch_size=4)
>>> next(iter(item_sampler))
MiniBatch(seeds={'user:like:item': tensor([[0, 1],
                                           [2, 3],
                                           [4, 5],
                                           [6, 7]])},
          sampled_subgraphs=None,
          node_features=None,
          labels={'user:like:item': tensor([1, 1, 0, 0])},
          input_nodes=None,
          indexes={'user:like:item': tensor([0, 1, 0, 0])},
          edge_features=None,
          compacted_seeds=None,
          blocks=None,)