Representations

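In PyKEEN, a pykeen.nn.representation.Representation maps integer indices to numeric representations. A simple example is the pykeen.nn.representation.Embedding class, where the mapping is a plain lookup. However, more advanced representation modules are available as well.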
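To make the index-to-vector mapping concrete, here is a minimal sketch in plain NumPy rather than PyKEEN. The names `table` and `lookup` are illustrative assumptions and not part of the PyKEEN API. It only shows that an embedding-style representation amounts to selecting rows of a (trainable) matrix.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# One row per entity: 5 entities, each with a 3-dimensional vector.
table = rng.normal(size=(5, 3))


def lookup(indices: np.ndarray) -> np.ndarray:
    """Return the rows of ``table`` selected by integer ``indices``."""
    return table[indices]


# Indices may repeat and may have any batch shape.
batch = np.array([[0, 2], [2, 4]])
vectors = lookup(batch)
print(vectors.shape)  # (2, 2, 3)
```

The output keeps the batch shape of the indices and appends the representation shape, so the same index always yields the same vector.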

Message Passing

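Message passing representation modules enrich the representation of an entity by aggregating information from its graph neighborhood. Example implementations in PyKEEN include pykeen.nn.representation.RGCNRepresentation, which uses RGCN layers for enrichment, and pykeen.nn.representation.SingleCompGCNRepresentation, which enriches via CompGCN layers.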

Another way to utilize message passing is via the modules provided in pykeen.nn.pyg, which allow using message passing layers from PyTorch Geometric to enrich base representations.
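The idea of enriching a representation from its graph neighborhood can be sketched as follows. This is not RGCN or CompGCN, and it does not use PyKEEN or PyTorch Geometric: it is a single, parameter-free mean aggregation over incoming edges in NumPy, and it ignores relation types entirely. The function name `enrich` and the toy data are assumptions made for illustration.

```python
import numpy as np

# Base representations for 4 entities (one row each, dimension 2).
base = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]])

# Directed edges (source, target); relation types are ignored in this sketch.
edges = np.array([[0, 1], [2, 1], [3, 0]])


def enrich(x: np.ndarray, edges: np.ndarray) -> np.ndarray:
    """Add the mean of incoming neighbor vectors to each entity's own vector."""
    source, target = edges[:, 0], edges[:, 1]
    summed = np.zeros_like(x)
    # Unbuffered scatter-add: accumulate each source vector at its target row.
    np.add.at(summed, target, x[source])
    in_degree = np.bincount(target, minlength=len(x))
    # Entities without incoming edges receive a zero message.
    mean = summed / np.maximum(in_degree, 1)[:, None]
    return x + mean


enriched = enrich(base, edges)
print(enriched)
```

Entity 1 receives messages from entities 0 and 2, entity 0 from entity 3, and the remaining entities keep their base vectors. Learned layers replace the plain mean with trainable, relation-aware transformations.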

Decomposition

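Since knowledge graphs may contain a large number of entities, having an independent trainable embedding for each of them may result in an excessive number of trainable parameters. Therefore, methods have been developed that do not learn independent representations, but instead keep a set of base representations and create individual representations by combining them.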

Low-Rank Factorization

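A simple method to reduce the number of parameters is to use a low-rank factorization of the embedding matrix, as implemented in pykeen.nn.representation.LowRankEmbeddingRepresentation. Here, each representation is a linear combination of shared base representations. Typically, the number of bases is chosen smaller than the dimension of each base representation.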
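The parameter saving can be illustrated with a small NumPy sketch. It is not the PyKEEN implementation: the names `bases`, `weights`, and `low_rank_lookup`, as well as the sizes, are assumptions chosen for illustration. Each entity vector is computed as a weighted sum of a few shared base vectors.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

num_entities, dim, num_bases = 1000, 64, 8

# Shared base representations: one row per base.
bases = rng.normal(size=(num_bases, dim))
# Per-entity combination weights: one row per entity.
weights = rng.normal(size=(num_entities, num_bases))


def low_rank_lookup(indices: np.ndarray) -> np.ndarray:
    """Linearly combine the shared bases with the weights of the requested entities."""
    return weights[indices] @ bases


vectors = low_rank_lookup(np.array([0, 10, 999]))

# Parameter counts: full embedding matrix vs. weights plus bases.
full_params = num_entities * dim
low_rank_params = num_entities * num_bases + num_bases * dim
print(vectors.shape, full_params, low_rank_params)  # (3, 64) 64000 8512
```

With these sizes, the factorized form stores 8512 values instead of 64000, at the price of restricting all entity vectors to the 8-dimensional subspace spanned by the bases.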

NodePiece

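Another example is NodePiece, which takes inspiration from tokenization as encountered in, e.g., NLP, and represents each entity as a set of tokens. The implementation in PyKEEN, pykeen.nn.representation.NodePieceRepresentation, implements a simple yet effective variant, which uses a set of randomly chosen incident relations (including inverse relations) as tokens.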
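Relation-based tokenization can be sketched in pure Python as below. This is an illustration under stated assumptions, not PyKEEN's tokenizer: the toy triples, the `_inverse` suffix, the `<pad>` token, and the `tokenize` helper are all hypothetical. In a full model, the selected tokens would then be embedded and aggregated into one vector per entity.

```python
import random
from collections import defaultdict

# Toy triples (head, relation, tail); all names are made up for illustration.
triples = [
    ("a", "likes", "b"),
    ("a", "knows", "c"),
    ("b", "knows", "c"),
    ("c", "likes", "a"),
]

# Collect incident relations per entity; incoming edges yield inverse relations.
incident = defaultdict(set)
for head, relation, tail in triples:
    incident[head].add(relation)
    incident[tail].add(f"{relation}_inverse")


def tokenize(entity: str, num_tokens: int, rng: random.Random) -> list[str]:
    """Pick up to ``num_tokens`` incident relations at random, padding if too few."""
    candidates = sorted(incident[entity])
    chosen = rng.sample(candidates, k=min(num_tokens, len(candidates)))
    return chosen + ["<pad>"] * (num_tokens - len(chosen))


rng = random.Random(0)
tokens = {entity: tokenize(entity, num_tokens=2, rng=rng) for entity in sorted(incident)}
print(tokens)
```

Because the token vocabulary is the set of relations (plus inverses), its size is independent of the number of entities, which is where the parameter saving comes from.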

See also

https://towardsdatascience.com/nodepiece-tokenizing-knowledge-graphs-6dd2b91847aa

Text-based

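Text-based representations use the labels of entities (or relations) to derive representations. To this end, pykeen.nn.representation.TextRepresentation uses a (pre-trained) transformer model from the transformers library to encode the labels. Since transformer models have been trained on large text corpora, their text encodings often contain semantic information, i.e., labels with similar meaning get similar representations. While we can also benefit from these strong features by initializing a pykeen.nn.representation.Embedding with pykeen.nn.init.LabelBasedInitializer, pykeen.nn.representation.TextRepresentation keeps the transformer model as part of the KGE model, and thus allows fine-tuning the language model for the KGE task. This is beneficial, e.g., because it offers a simple way to obtain an inductive model, which can make predictions for entities not seen during training.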

from pykeen.pipeline import pipeline
from pykeen.datasets import get_dataset
from pykeen.nn import TextRepresentation
from pykeen.models import ERModel

dataset = get_dataset(dataset="nations")
# Build label-based entity representations from the dataset's entity labels.
entity_representations = TextRepresentation.from_dataset(
    dataset=dataset,
    encoder="transformer",
)
result = pipeline(
    dataset=dataset,
    model=ERModel,
    model_kwargs=dict(
        interaction="ermlpe",
        interaction_kwargs=dict(
            embedding_dim=entity_representations.shape[0],
        ),
        entity_representations=entity_representations,
        relation_representations_kwargs=dict(
            shape=entity_representations.shape,
        ),
    ),
    training_kwargs=dict(
        num_epochs=1,
    ),
)
model = result.model

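We can use the label-encoder part to generate representations for unknown entities with labels. For instance, "uk" is an entity in nations, but we can also put in "united kingdom" and get a roughly equivalent vector representation.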

entity_representation = model.entity_representations[0]
label_encoder = entity_representation.encoder
uk, united_kingdom = label_encoder(labels=["uk", "united kingdom"])

Thus, if we feed the resulting representations into the interaction function, we obtain similar scores:

import torch

# true triple from train: ['brazil', 'exports3', 'uk']
relation_representation = model.relation_representations[0]
h_repr = entity_representation.get_in_more_canonical_shape(
    dim="h",
    indices=torch.as_tensor(dataset.entity_to_id["brazil"]).view(1),
)
r_repr = relation_representation.get_in_more_canonical_shape(
    dim="r",
    indices=torch.as_tensor(dataset.relation_to_id["exports3"]).view(1),
)
scores = model.interaction(
    h=h_repr,
    r=r_repr,
    t=torch.stack([uk, united_kingdom]),
)
print(scores)

As a downside, this will usually substantially increase the computational cost of computing triple scores.

Biomedical Entities

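If your dataset labels biomedical entities such as chemicals, proteins, diseases, and pathways with compact uniform resource identifiers (CURIEs), then pykeen.nn.representation.BiomedicalCURIERepresentation can use pyobo to look up their names via the pyobo.get_name() function and then encode them with the text encoder.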
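The CURIE-to-name step can be sketched without any external dependency. The hard-coded `NAMES` table below is a hypothetical stand-in for the ontology lookup that pyobo performs, and the helpers `parse_curie` and `curie_to_label`, as well as the fallback to the raw CURIE for unknown identifiers, are assumptions made for this example only.

```python
# Hypothetical stand-in for an ontology name lookup; a real setup would query
# a resource such as pyobo instead of this hard-coded table.
NAMES = {
    ("chebi", "15377"): "water",
    ("chebi", "16236"): "ethanol",
}


def parse_curie(curie: str) -> tuple[str, str]:
    """Split a CURIE of the form ``prefix:identifier`` at the first colon."""
    prefix, separator, identifier = curie.partition(":")
    if not separator or not prefix or not identifier:
        raise ValueError(f"not a CURIE: {curie!r}")
    # Normalize the prefix to lowercase so that "CHEBI" and "chebi" match.
    return prefix.lower(), identifier


def curie_to_label(curie: str) -> str:
    """Return the name for a CURIE, falling back to the CURIE itself if unknown."""
    return NAMES.get(parse_curie(curie), curie)


labels = [curie_to_label(c) for c in ["chebi:15377", "CHEBI:16236", "chebi:0"]]
print(labels)  # ['water', 'ethanol', 'chebi:0']
```

The resulting names are what a text encoder would receive in place of the opaque identifiers.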

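Unfortunately, at the time this representation was added, none of the biomedical knowledge graphs in PyKEEN used CURIEs to reference biomedical entities. We hope this will change in the future.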

To learn more about CURIEs, please take a look at the Bioregistry and this blog post on CURIEs.