ItemSampler

class dgl.graphbolt.ItemSampler(item_set: ~dgl.graphbolt.itemset.ItemSet | ~dgl.graphbolt.itemset.ItemSetDict, batch_size: int, minibatcher: ~typing.Callable | None = <function minibatcher_default>, drop_last: bool | None = False, shuffle: bool | None = False, use_indexing: bool | None = True, buffer_size: int | None = -1)[source]

Bases: IterDataPipe

A sampler that iterates over the input items and creates subsets.

The input items could be node IDs, node pairs with or without labels, node pairs with negative sources/destinations, DGLGraphs, and heterogeneous counterparts.

Note: This class ItemSampler is deliberately not decorated with torchdata.datapipes.functional_datapipe, indicating that it does not support function-like calls. However, any iterable datapipes from torchdata can be further appended.

Parameters:
  • item_set (Union[ItemSet, ItemSetDict]) – Data to be sampled.

  • batch_size (int) – The size of each batch.

  • minibatcher (Optional[Callable]) – A callable that takes in a list of items and returns a MiniBatch.

  • drop_last (bool) – Option to drop the last batch if it is not full.

  • shuffle (bool) – Option to shuffle before sampling.

  • use_indexing (bool) – Option to use indexing to slice items from the item set. This is an optimization that avoids time-consuming iteration over the item set. If the item set does not support indexing, this option is disabled automatically. If the item set supports indexing but the user wants to disable it, this option can be set to False. By default, it is set to True.

  • buffer_size (int) – The size of the buffer that stores items sliced from the ItemSet or ItemSetDict. By default, it is set to -1, which means the buffer size will be set to the total number of items in the item set if indexing is supported. If indexing is not supported, it is set to 10 * batch_size. If the item set is too large, it is recommended to set a smaller buffer size to avoid out-of-memory errors. As the items are shuffled only within each buffer, a smaller buffer size may incur less randomness, and this reduced randomness can in turn affect training performance such as convergence speed and accuracy. Therefore, it is recommended to set a larger buffer size whenever possible.
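How batch_size, drop_last, shuffle, and buffer_size interact can be illustrated with a small pure-Python sketch. This is only an illustration of the semantics described above, not the actual DGL implementation:

```python
import random

def item_batches(items, batch_size, drop_last=False, shuffle=False,
                 buffer_size=-1):
    """Sketch of ItemSampler's batching semantics (not the real code)."""
    items = list(items)
    if buffer_size == -1:
        buffer_size = len(items)  # buffer everything when indexing works
    stream = []
    for start in range(0, len(items), buffer_size):
        buffer = items[start:start + buffer_size]
        if shuffle:
            random.shuffle(buffer)  # randomness is limited to each buffer
        stream.extend(buffer)
    for b in range(0, len(stream), batch_size):
        batch = stream[b:b + batch_size]
        if drop_last and len(batch) < batch_size:
            continue  # discard the incomplete final batch
        yield batch

# 10 items with batch_size=4: the final batch holds only 2 items.
print(list(item_batches(range(10), 4)))
# [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(list(item_batches(range(10), 4, drop_last=True)))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Note how a small buffer_size limits shuffling: with buffer_size=5, the first 5 items can never be mixed with the last 5, which is the reduced randomness the parameter description warns about.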

Examples

  1. Node IDs.

>>> import torch
>>> from dgl import graphbolt as gb
>>> item_set = gb.ItemSet(torch.arange(0, 10), names="seeds")
>>> item_sampler = gb.ItemSampler(
...     item_set, batch_size=4, shuffle=False, drop_last=False
... )
>>> next(iter(item_sampler))
MiniBatch(seeds=tensor([0, 1, 2, 3]), sampled_subgraphs=None,
    node_features=None, labels=None, input_nodes=None,
    indexes=None, edge_features=None, compacted_seeds=None,
    blocks=None,)
  2. Node pairs.

>>> item_set = gb.ItemSet(torch.arange(0, 20).reshape(-1, 2),
...     names="seeds")
>>> item_sampler = gb.ItemSampler(
...     item_set, batch_size=4, shuffle=False, drop_last=False
... )
>>> next(iter(item_sampler))
MiniBatch(seeds=tensor([[0, 1], [2, 3], [4, 5], [6, 7]]),
    sampled_subgraphs=None, node_features=None, labels=None,
    input_nodes=None, indexes=None, edge_features=None,
    compacted_seeds=None, blocks=None,)
  3. Node pairs and labels.

>>> item_set = gb.ItemSet(
...     (torch.arange(0, 20).reshape(-1, 2), torch.arange(10, 20)),
...     names=("seeds", "labels")
... )
>>> item_sampler = gb.ItemSampler(
...     item_set, batch_size=4, shuffle=False, drop_last=False
... )
>>> next(iter(item_sampler))
MiniBatch(seeds=tensor([[0, 1], [2, 3], [4, 5], [6, 7]]),
    sampled_subgraphs=None, node_features=None,
    labels=tensor([10, 11, 12, 13]), input_nodes=None,
    indexes=None, edge_features=None, compacted_seeds=None,
    blocks=None,)
  4. Node pairs, labels and indexes.

>>> seeds = torch.arange(0, 20).reshape(-1, 2)
>>> labels = torch.tensor([1, 1, 0, 0, 0, 0, 0, 0, 0, 0])
>>> indexes = torch.tensor([0, 1, 0, 0, 0, 0, 1, 1, 1, 1])
>>> item_set = gb.ItemSet((seeds, labels, indexes), names=("seeds",
...     "labels", "indexes"))
>>> item_sampler = gb.ItemSampler(
...     item_set, batch_size=4, shuffle=False, drop_last=False
... )
>>> next(iter(item_sampler))
MiniBatch(seeds=tensor([[0, 1], [2, 3], [4, 5], [6, 7]]),
    sampled_subgraphs=None, node_features=None,
    labels=tensor([1, 1, 0, 0]), input_nodes=None,
    indexes=tensor([0, 1, 0, 0]), edge_features=None,
    compacted_seeds=None, blocks=None,)
  5. DGLGraphs.

>>> import dgl
>>> graphs = [ dgl.rand_graph(10, 20) for _ in range(5) ]
>>> item_set = gb.ItemSet(graphs)
>>> item_sampler = gb.ItemSampler(item_set, 3)
>>> list(item_sampler)
[Graph(num_nodes=30, num_edges=60,
  ndata_schemes={}
  edata_schemes={}),
 Graph(num_nodes=20, num_edges=40,
  ndata_schemes={}
  edata_schemes={})]

  6. Further process batches with other datapipes such as torchdata.datapipes.iter.Mapper.

>>> item_set = gb.ItemSet(torch.arange(0, 10))
>>> data_pipe = gb.ItemSampler(item_set, 4)
>>> def add_one(batch):
...     return batch + 1
>>> data_pipe = data_pipe.map(add_one)
>>> list(data_pipe)
[tensor([1, 2, 3, 4]), tensor([5, 6, 7, 8]), tensor([ 9, 10])]
  7. Heterogeneous node IDs.

>>> ids = {
...     "user": gb.ItemSet(torch.arange(0, 5), names="seeds"),
...     "item": gb.ItemSet(torch.arange(0, 6), names="seeds"),
... }
>>> item_set = gb.ItemSetDict(ids)
>>> item_sampler = gb.ItemSampler(item_set, batch_size=4)
>>> next(iter(item_sampler))
MiniBatch(seeds={'user': tensor([0, 1, 2, 3])}, sampled_subgraphs=None,
    node_features=None, labels=None, input_nodes=None, indexes=None,
    edge_features=None, compacted_seeds=None, blocks=None,)
  8. Heterogeneous node pairs.

>>> seeds_like = torch.arange(0, 10).reshape(-1, 2)
>>> seeds_follow = torch.arange(10, 20).reshape(-1, 2)
>>> item_set = gb.ItemSetDict({
...     "user:like:item": gb.ItemSet(
...         seeds_like, names="seeds"),
...     "user:follow:user": gb.ItemSet(
...         seeds_follow, names="seeds"),
... })
>>> item_sampler = gb.ItemSampler(item_set, batch_size=4)
>>> next(iter(item_sampler))
MiniBatch(seeds={'user:like:item':
    tensor([[0, 1], [2, 3], [4, 5], [6, 7]])}, sampled_subgraphs=None,
    node_features=None, labels=None, input_nodes=None, indexes=None,
    edge_features=None, compacted_seeds=None, blocks=None,)
  9. Heterogeneous node pairs and labels.

>>> seeds_like = torch.arange(0, 10).reshape(-1, 2)
>>> labels_like = torch.arange(0, 5)
>>> seeds_follow = torch.arange(10, 20).reshape(-1, 2)
>>> labels_follow = torch.arange(5, 10)
>>> item_set = gb.ItemSetDict({
...     "user:like:item": gb.ItemSet((seeds_like, labels_like),
...         names=("seeds", "labels")),
...     "user:follow:user": gb.ItemSet((seeds_follow, labels_follow),
...         names=("seeds", "labels")),
... })
>>> item_sampler = gb.ItemSampler(item_set, batch_size=4)
>>> next(iter(item_sampler))
MiniBatch(seeds={'user:like:item':
    tensor([[0, 1], [2, 3], [4, 5], [6, 7]])}, sampled_subgraphs=None,
    node_features=None, labels={'user:like:item': tensor([0, 1, 2, 3])},
    input_nodes=None, indexes=None, edge_features=None,
    compacted_seeds=None, blocks=None,)
  10. Heterogeneous node pairs, labels and indexes.

>>> seeds_like = torch.arange(0, 10).reshape(-1, 2)
>>> labels_like = torch.tensor([1, 1, 0, 0, 0])
>>> indexes_like = torch.tensor([0, 1, 0, 0, 1])
>>> seeds_follow = torch.arange(20, 30).reshape(-1, 2)
>>> labels_follow = torch.tensor([1, 1, 0, 0, 0])
>>> indexes_follow = torch.tensor([0, 1, 0, 0, 1])
>>> item_set = gb.ItemSetDict({
...     "user:like:item": gb.ItemSet((seeds_like, labels_like,
...         indexes_like), names=("seeds", "labels", "indexes")),
...     "user:follow:user": gb.ItemSet((seeds_follow, labels_follow,
...         indexes_follow), names=("seeds", "labels", "indexes")),
... })
>>> item_sampler = gb.ItemSampler(item_set, batch_size=4)
>>> next(iter(item_sampler))
MiniBatch(seeds={'user:like:item':
    tensor([[0, 1], [2, 3], [4, 5], [6, 7]])}, sampled_subgraphs=None,
    node_features=None, labels={'user:like:item': tensor([1, 1, 0, 0])},
    input_nodes=None, indexes={'user:like:item': tensor([0, 1, 0, 0])},
    edge_features=None, compacted_seeds=None, blocks=None,)