dgl.sampling.sample_neighbors_biased

dgl.sampling.sample_neighbors_biased(g, nodes, fanout, bias, edge_dir='in', tag_offset_name='_TAG_OFFSET', replace=False, copy_ndata=True, copy_edata=True, output_device=None)

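Sample neighboring edges of the given nodes and return the induced subgraph, where each neighbor's probability of being picked is determined by its tag.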

For each node, a number of inbound (or outbound when edge_dir == 'out') edges will be randomly chosen. The graph returned will then contain all the nodes in the original graph, but only the sampled edges.

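This version of neighbor sampling supports the scenario where adjacent nodes of different types have different sampling probabilities. Each node is assigned an integer (called a tag) that represents its type. Tags are the analogue of node types in the framework of homogeneous graphs. Nodes with the same tag share the same probability.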

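For example, assume a node has \(N+M\) neighbors, \(N\) of which have tag 0 and \(M\) of which have tag 1. Assume a node with tag 0 has an unnormalized probability \(p\) of being picked, and a node with tag 1 has an unnormalized probability \(q\). This function first chooses a tag according to the unnormalized probability distribution \(\frac{P(tag=0)}{P(tag=1)}=\frac{Np}{Mq}\), and then samples uniformly among the nodes with the chosen tag.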
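
The two-stage procedure described above can be sketched in plain Python, independently of DGL. This is an illustrative sketch rather than DGL's implementation: the helper names `tag_probabilities` and `pick_neighbor` are hypothetical, and `tags` is assumed to be a sequence indexed by node ID.

```python
import random
from collections import defaultdict


def tag_probabilities(tag_counts, bias):
    """Normalized probability of choosing each tag for one node.

    tag_counts[t] is the number of neighbors carrying tag t and
    bias[t] is the unnormalized per-neighbor weight of tag t.
    """
    weights = [n * b for n, b in zip(tag_counts, bias)]
    total = sum(weights)
    return [w / total for w in weights]


def pick_neighbor(neighbors, tags, bias, rng=random):
    """Two-stage draw: choose a tag, then a neighbor uniformly within it."""
    by_tag = defaultdict(list)
    for v in neighbors:
        by_tag[tags[v]].append(v)
    present = sorted(by_tag)
    # Stage 1: the weight of a tag is (number of neighbors with it) * bias.
    weights = [len(by_tag[t]) * bias[t] for t in present]
    chosen_tag = rng.choices(present, weights=weights, k=1)[0]
    # Stage 2: uniform choice among the neighbors with the chosen tag.
    return rng.choice(by_tag[chosen_tag])
```

For N = 3, M = 2, p = 1.0 and q = 0.5, `tag_probabilities([3, 2], [1.0, 0.5])` gives `[0.75, 0.25]`, matching the ratio Np : Mq = 3 : 1.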

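To make sampling more efficient, the CSC matrix (or the CSR matrix if edge_dir='out') of the input graph must be sorted by tag. The APIs sort_csc_by_tag() and sort_csr_by_tag() are designed for this purpose: they internally reorder the neighbors by tag so that neighbors with the same tag are stored in a consecutive range. The two APIs also store the offsets of these ranges in a node feature named tag_offset_name.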
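
To make the notion of tag offsets concrete, the following plain-Python sketch sorts one node's neighbor list by tag and computes the offsets of each tag's range. It only illustrates the layout and is not how DGL implements the sort; `sort_neighbors_by_tag` is a hypothetical helper. The edge list and tags are those of the graph used in the Examples section of this page.

```python
def sort_neighbors_by_tag(neighbors, tags, num_tags):
    """Return (sorted_neighbors, offsets) for one node's neighbor list.

    offsets has num_tags + 1 entries; the neighbors with tag t occupy
    sorted_neighbors[offsets[t]:offsets[t + 1]].
    """
    ordered = sorted(neighbors, key=lambda v: tags[v])  # stable sort by tag
    offsets = [0] * (num_tags + 1)
    for v in neighbors:
        offsets[tags[v] + 1] += 1  # count neighbors per tag
    for t in range(num_tags):
        offsets[t + 1] += offsets[t]  # prefix sum turns counts into offsets
    return ordered, offsets


# Edges and tags of the example graph (3 nodes, 2 distinct tags).
src = [0, 0, 1, 1, 2, 2]
dst = [1, 2, 0, 1, 2, 0]
tags = [0, 0, 1]

# Out-neighbors per node, i.e. the rows of the CSR matrix.
out_neighbors = {u: [] for u in range(3)}
for u, v in zip(src, dst):
    out_neighbors[u].append(v)

tag_offsets = [sort_neighbors_by_tag(out_neighbors[u], tags, 2)[1] for u in range(3)]
print(tag_offsets)  # [[0, 1, 2], [0, 2, 2], [0, 1, 2]]
```

Each row has one more entry than there are tags. Node 1, for instance, has two out-neighbors with tag 0 and none with tag 1, hence the row `[0, 2, 2]`.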

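Please make sure that the CSR (or CSC) matrix of the graph has been sorted before calling this function. The function itself does not check whether the input graph is sorted. Note that the input tag_offset_name should be consistent with the one used in the sorting function.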

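Only homogeneous or bipartite graphs are supported. For bipartite graphs, the tag offsets of the source nodes when edge_dir='in' (or the destination nodes when edge_dir='out') will be used in sampling.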

Node/edge features are not preserved. The original IDs of the sampled edges are stored as the dgl.EID feature in the returned graph.

Parameters:

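g (DGLGraph) – The graph. Must be homogeneous or bipartite (only one edge type). Must be on CPU.
nodes (tensor or list) – Node IDs to sample neighbors from.
fanout (int) –
The number of edges to be sampled for each node on each edge type.

If -1 is given, all the neighboring edges with non-zero probability will be selected.
bias (tensor or list) –
The (unnormalized) probabilities associated with each tag. Its length should be equal to the number of tags.

Entries of this array must be non-negative floats. Otherwise, the result will be undefined.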
edge_dir (str, optional) –
Determines whether to sample inbound or outbound edges.

Can take either 'in' for inbound edges or 'out' for outbound edges.
tag_offset_name (str, optional) –
The name of the node feature that stores the tag offsets.

(Default: "_TAG_OFFSET")
replace (bool, optional) – If True, sample with replacement.
copy_ndata (bool, optional) –
If True, the node features of the new graph are copied from the original graph. If False, the new graph will not have any node features.

(Default: True)
copy_edata (bool, optional) –
If True, the edge features of the new graph are copied from the original graph. If False, the new graph will not have any edge features.

(Default: True)
output_device (Framework-specific device context object, optional) – The output device. Default is the same as the input graph.

Returns:

A sampled subgraph containing only the sampled neighboring edges. It is on CPU.

Return type:

DGLGraph

Notes

If copy_ndata or copy_edata is True, the same tensors are used as the node or edge features of both the original graph and the new graph. As a result, users should avoid performing in-place operations on the node or edge features of the new graph, as doing so would corrupt the features of the original graph.

See also

dgl.sort_csc_by_tag, dgl.sort_csr_by_tag

Examples

Assume that you have the following graph

>>> g = dgl.graph(([0, 0, 1, 1, 2, 2], [1, 2, 0, 1, 2, 0]))

And the tags

>>> tag = torch.IntTensor([0, 0, 1])

Sort the graph (necessary!)

>>> g_sorted = dgl.transforms.sort_csr_by_tag(g, tag)
>>> g_sorted.ndata['_TAG_OFFSET']
tensor([[0, 1, 2],
        [0, 2, 2],
        [0, 1, 2]])

Set the probability of each tag:

>>> bias = torch.tensor([1.0, 0.001])
>>> # node 2 is almost impossible to be sampled because it has tag 1.

Sample one outbound edge for node 0 and node 2:

>>> sg = dgl.sampling.sample_neighbors_biased(g_sorted, [0, 2], 1, bias, edge_dir='out')
>>> sg.edges(order='eid')
(tensor([0, 2]), tensor([1, 0]))
>>> sg.edata[dgl.EID]
tensor([0, 5])
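
The result shown above is the most likely outcome, not a guaranteed one. As a worked check under the two-stage rule, node 0 has two out-neighbors, node 1 (tag 0) and node 2 (tag 1), so each tag occurs once and the edge 0 → 1 is picked with probability

\[
P(0 \to 1) = \frac{1 \cdot 1.0}{1 \cdot 1.0 + 1 \cdot 0.001} = \frac{1}{1.001} \approx 0.999
\]

Node 2 likewise has one out-neighbor with tag 0 (node 0) and one with tag 1 (node 2 itself), so the edge 2 → 0 is picked with the same probability.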

With fanout greater than the number of actual neighbors and without replacement, DGL will take all neighbors instead:

>>> sg = dgl.sampling.sample_neighbors_biased(g_sorted, [0, 2], 3, bias, edge_dir='out')
>>> sg.edges(order='eid')
(tensor([0, 0, 2, 2]), tensor([1, 2, 0, 2]))