Bring Your Own Data

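As an alternative to using a pre-packaged dataset, the training and testing sets can be set explicitly, either by file path or with instances of pykeen.triples.TriplesFactory. Throughout this tutorial, the paths to the training, testing, and validation sets of the built-in pykeen.datasets.Nations dataset are used as examples.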

Pre-stratified Dataset

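You already have a training file and a testing file, both 3-column TSV files, ready to go. You are sure that no entity or relation appears in the testing set that does not also appear in the training set. Load them into the pipeline like this: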

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> result = pipeline(
...     training=NATIONS_TRAIN_PATH,
...     testing=NATIONS_TEST_PATH,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')
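
The assumption that the testing set contains no unseen entities or relations can be checked before training. The following is a minimal sketch in plain Python, not part of PyKEEN: the helper names read_triples and find_unseen are hypothetical, and small in-memory strings stand in for real TSV files.

```python
import csv
import io


def read_triples(handle):
    """Read 3-column TSV rows as (head, relation, tail) tuples."""
    return [tuple(row) for row in csv.reader(handle, delimiter="\t") if row]


def find_unseen(training, testing):
    """Return the entities and relations that occur only in the test triples."""
    train_entities = {e for h, _, t in training for e in (h, t)}
    train_relations = {r for _, r, _ in training}
    unseen_entities = {e for h, _, t in testing for e in (h, t)} - train_entities
    unseen_relations = {r for _, r, _ in testing} - train_relations
    return unseen_entities, unseen_relations


# Toy in-memory files stand in for real TSV paths (hypothetical data).
train_file = io.StringIO("brazil\texports\tchina\nchina\ttreaties\tindia\n")
test_file = io.StringIO("india\texports\tbrazil\nusa\ttreaties\tchina\n")

unseen_entities, unseen_relations = find_unseen(
    read_triples(train_file), read_triples(test_file)
)
print(unseen_entities)   # entities that appear in testing but not in training
print(unseen_relations)  # relations that appear in testing but not in training
```

If either set is non-empty, the two files are not safely pre-stratified and the offending triples need to be handled before they are passed to the pipeline.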

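PyKEEN takes care of mapping the entities from their labels to appropriate integer indices (technically, 0-dimensional torch.LongTensor) and of making sure that the different sets of triples share the same mapping.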
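
To illustrate the idea of a shared label-to-index mapping, here is a minimal sketch in plain Python. It is not PyKEEN's implementation: the helpers build_mapping and encode are hypothetical, and sorting the labels is just one possible way to assign IDs.

```python
def build_mapping(labels):
    """Assign consecutive integer IDs to the sorted, de-duplicated labels."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def encode(triples, entity_to_id, relation_to_id):
    """Translate labeled triples into ID triples using the given mappings."""
    return [(entity_to_id[h], relation_to_id[r], entity_to_id[t]) for h, r, t in triples]


training = [("brazil", "exports", "china"), ("china", "treaties", "india")]
testing = [("india", "exports", "brazil")]

# The mappings are derived once from the training triples and then reused,
# so the same label always resolves to the same integer in both sets.
entity_to_id = build_mapping(e for h, _, t in training for e in (h, t))
relation_to_id = build_mapping(r for _, r, _ in training)

training_ids = encode(training, entity_to_id, relation_to_id)
testing_ids = encode(testing, entity_to_id, relation_to_id)
```

Because both sets are encoded with the same dictionaries, an index learned during training refers to the same entity at evaluation time.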

The same applies to pykeen.hpo.hpo_pipeline(), which has a similar interface to pykeen.pipeline.pipeline(), as in:

>>> from pykeen.hpo import hpo_pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH, NATIONS_VALIDATE_PATH
>>> result = hpo_pipeline(
...     n_trials=3,  # you probably want more than this
...     training=NATIONS_TRAIN_PATH,
...     testing=NATIONS_TEST_PATH,
...     validation=NATIONS_VALIDATE_PATH,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_hpo_pre_stratified_transe')

The remaining examples use pykeen.pipeline.pipeline(), but they apply equally to pykeen.hpo.hpo_pipeline().

To add dataset-wide arguments, pass the dataset_kwargs argument to pykeen.pipeline.pipeline(), for example to enable options such as create_inverse_triples=True.

>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> result = pipeline(
...     training=NATIONS_TRAIN_PATH,
...     testing=NATIONS_TEST_PATH,
...     dataset_kwargs={'create_inverse_triples': True},
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')
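
For readers unfamiliar with inverse triples, the following plain-Python sketch shows the general idea: every triple (h, r, t) gains a reversed counterpart (t, r_inverse, h). This is a conceptual illustration only. The helper add_inverse_triples and the "_inverse" suffix are assumptions made for this example and do not describe how PyKEEN represents inverse relations internally.

```python
def add_inverse_triples(triples, suffix="_inverse"):
    """Return the triples plus one reversed copy of each, with a renamed relation."""
    inverse = [(t, r + suffix, h) for h, r, t in triples]
    return list(triples) + inverse


triples = [("brazil", "exports", "china")]
augmented = add_inverse_triples(triples)
print(augmented)
```

The augmented set is twice the size of the original one and contains twice as many distinct relations.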

If you want finer control over how the triples are created, for example because they do not all come from TSV files, you can use the pykeen.triples.TriplesFactory interface.

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> training = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> testing = TriplesFactory.from_path(
...     NATIONS_TEST_PATH,
...     entity_to_id=training.entity_to_id,
...     relation_to_id=training.relation_to_id,
... )
>>> result = pipeline(
...     training=training,
...     testing=testing,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')

Warning

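When instantiating the testing factory, we used the entity_to_id and relation_to_id keyword arguments. This is because PyKEEN automatically assigns numeric identifiers to all entities and relations of each triples factory. However, we want the identifiers in the testing set to be exactly the same as those in the training set, so we simply reuse the training mappings. Without matching identifiers, the testing set would be interpreted with the wrong identifiers from the training set during evaluation, and the results would be meaningless.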
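
The failure mode described in this warning can be reproduced with a toy example. The sketch below is plain Python with a hypothetical build_mapping helper, and it assumes IDs are assigned by sorted label order, which may differ from what PyKEEN does. It shows how two independently built mappings disagree as soon as the testing set lacks one of the training entities.

```python
def build_mapping(labels):
    """Assign consecutive integer IDs to the sorted, de-duplicated labels."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


training_entities = ["brazil", "china", "india"]
testing_entities = ["china", "india"]  # "brazil" does not occur in testing

training_ids = build_mapping(training_entities)
independent_test_ids = build_mapping(testing_entities)

# Collect every entity whose ID differs between the two mappings.
conflicts = {
    label: (training_ids[label], independent_test_ids[label])
    for label in testing_entities
    if training_ids[label] != independent_test_ids[label]
}
print(conflicts)
```

In this toy case, index 0 means "brazil" to the trained model but "china" in the independently mapped testing set, so every evaluation score would be computed against the wrong entity.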

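The dataset_kwargs argument is ignored when you pass your own pykeen.triples.TriplesFactory, so if inverse triples are the desired behavior, be sure to include create_inverse_triples=True when instantiating those classes, as in: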

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> training = TriplesFactory.from_path(
...     NATIONS_TRAIN_PATH,
...     create_inverse_triples=True,
... )
>>> testing = TriplesFactory.from_path(
...     NATIONS_TEST_PATH,
...     entity_to_id=training.entity_to_id,
...     relation_to_id=training.relation_to_id,
...     create_inverse_triples=True,
... )
>>> result = pipeline(
...     training=training,
...     testing=testing,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')

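If you have already loaded the triples in a numpy.ndarray, the triples factory can also be instantiated with the triples keyword argument instead of the path argument.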

Unstratified Dataset

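In the real world, your dataset is usually not already stratified into training and testing sets. PyKEEN provides the pykeen.triples.TriplesFactory.split() function, which helps you create a stratified dataset.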

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
>>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> training, testing = tf.split()
>>> result = pipeline(
...     training=training,
...     testing=testing,
...     model='TransE',
...     epochs=5,  # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_unstratified_transe')

By default, this is an 80/20 split. If you want to use early stopping, you also need a validation set, so you should specify the split ratios:

>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
>>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> training, testing, validation = tf.split([.8, .1, .1])
>>> result = pipeline(
...     training=training,
...     testing=testing,
...     validation=validation,
...     model='TransE',
...     stopper='early',
...     epochs=5,  # short epochs for testing - you should go
...                # higher, especially with early stopper enabled
... )
>>> result.save_to_directory('doctests/test_unstratified_stopped_transe')
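
To make the meaning of the ratio list concrete, here is a naive ratio-based split in plain Python. It is a sketch only and not the algorithm behind pykeen.triples.TriplesFactory.split(): the helper split_triples is hypothetical, and unlike a careful splitter it makes no attempt to guarantee that every entity and relation ends up in the training part.

```python
import random


def split_triples(triples, ratios, seed=0):
    """Shuffle triples and cut them into consecutive parts sized by ``ratios``."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    parts, start = [], 0
    for ratio in ratios[:-1]:
        end = start + round(ratio * len(shuffled))
        parts.append(shuffled[start:end])
        start = end
    parts.append(shuffled[start:])  # the last part takes whatever remains
    return parts


# Twenty hypothetical triples forming a simple chain e0 -> e1 -> ... -> e20.
triples = [(f"e{i}", "r", f"e{i + 1}") for i in range(20)]
training, testing, validation = split_triples(triples, [0.8, 0.1, 0.1])
print(len(training), len(testing), len(validation))
```

The three parts are disjoint and together cover all input triples, which is the property any split has to preserve regardless of how it chooses the members of each part.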

Bring Your Own Data with Checkpoints

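For a tutorial on how to use your own data together with checkpoints, see Checkpoints When Bringing Your Own Data and Loading Models Manually.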