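Bring Your Own Data
As an alternative to using a pre-packaged dataset, the training and testing sets can be set explicitly, either by file path or with instances of pykeen.triples.TriplesFactory. Throughout this tutorial, the paths to the training, testing, and validation sets of the built-in pykeen.datasets.Nations dataset are used as examples.
Pre-stratified Dataset
You've got a training file and a testing file, both 3-column TSV files, all ready to go. You're sure that the testing set contains no entities or relations that are absent from the training set. Load them in the pipeline like this: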
>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> result = pipeline(
... training=NATIONS_TRAIN_PATH,
... testing=NATIONS_TEST_PATH,
... model='TransE',
... epochs=5, # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')
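PyKEEN takes care of mapping entities and relations from their labels to appropriate integer indices (stored in a torch.LongTensor) and of making sure that the different sets of triples share the same mapping.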
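The assumption that the testing set introduces no new entities or relations can be checked before training. The following is a minimal sketch using only the standard library, not part of PyKEEN; the helper names `read_triples` and `unseen_labels` and the toy data are hypothetical, and the in-memory strings stand in for the contents of two TSV files.

```python
import csv
import io


def read_triples(handle):
    """Read (head, relation, tail) rows from a 3-column TSV file handle."""
    return [tuple(row) for row in csv.reader(handle, delimiter="\t") if row]


def unseen_labels(training, testing):
    """Return the entities and relations that occur in testing but not in training."""
    train_entities = {h for h, _, _ in training} | {t for _, _, t in training}
    train_relations = {r for _, r, _ in training}
    test_entities = {h for h, _, _ in testing} | {t for _, _, t in testing}
    test_relations = {r for _, r, _ in testing}
    return test_entities - train_entities, test_relations - train_relations


# Hypothetical toy data standing in for the contents of two TSV files.
train_tsv = "brazil\texports\tchina\nchina\texports\tusa\n"
test_tsv = "usa\texports\tbrazil\nusa\timports\tindia\n"

training = read_triples(io.StringIO(train_tsv))
testing = read_triples(io.StringIO(test_tsv))
missing_entities, missing_relations = unseen_labels(training, testing)
```

With real files, the `io.StringIO` objects would be replaced by `open(path, newline="")`. Two empty sets mean the testing file is safe to use alongside the training file.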
The same applies to pykeen.hpo.hpo_pipeline(), which has a similar interface to pykeen.pipeline.pipeline(), as in:
>>> from pykeen.hpo import hpo_pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH, NATIONS_VALIDATE_PATH
>>> result = hpo_pipeline(
... n_trials=3, # you probably want more than this
... training=NATIONS_TRAIN_PATH,
... testing=NATIONS_TEST_PATH,
... validation=NATIONS_VALIDATE_PATH,
... model='TransE',
... epochs=5, # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_hpo_pre_stratified_transe')
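The remaining examples use pykeen.pipeline.pipeline(), but they work the same way with pykeen.hpo.hpo_pipeline().
If you want to add dataset-wide arguments, you can pass the dataset_kwargs argument to pykeen.pipeline.pipeline() to enable options like create_inverse_triples=True.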
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> result = pipeline(
... training=NATIONS_TRAIN_PATH,
... testing=NATIONS_TEST_PATH,
... dataset_kwargs={'create_inverse_triples': True},
... model='TransE',
... epochs=5, # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')
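To make the effect of create_inverse_triples=True more concrete, here is a minimal sketch of the idea on label triples: every triple (h, r, t) gains a reversed counterpart (t, r_inverse, h). The function name `add_inverse_triples` and the `_inverse` suffix are assumptions for illustration; PyKEEN handles inverse relations internally and its naming and indexing may differ.

```python
def add_inverse_triples(triples, suffix="_inverse"):
    """Return the original triples plus one reversed triple per original triple.

    The suffix used to name inverse relations is a hypothetical choice.
    """
    inverse = [(t, r + suffix, h) for h, r, t in triples]
    return list(triples) + inverse


triples = [("brazil", "exports", "china"), ("china", "exports", "usa")]
augmented = add_inverse_triples(triples)
```

The augmented set has twice as many triples and twice as many relation types, which is why the option has to be applied consistently to every set of triples used in one run.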
If you want finer control over how the triples are created, for example because they do not all come from TSV files, you can use the pykeen.triples.TriplesFactory interface.
>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> training = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> testing = TriplesFactory.from_path(
... NATIONS_TEST_PATH,
... entity_to_id=training.entity_to_id,
... relation_to_id=training.relation_to_id,
... )
>>> result = pipeline(
... training=training,
... testing=testing,
... model='TransE',
... epochs=5, # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')
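Warning
When instantiating the testing factory, we passed the entity_to_id and relation_to_id keyword arguments.
This is because PyKEEN automatically assigns numeric identifiers to all entities and relations of each triples factory. However, the identifiers in the testing set must be exactly the same as those in the training set, so we simply reuse the training mappings. If the identifiers differed, the testing triples would be matched against the wrong entities and relations of the training set during evaluation, and the results would be meaningless.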
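The mismatch described in the warning can be reproduced with plain dictionaries. This is a minimal sketch, assuming identifiers are assigned by sorting the unique labels; that rule and the helper names `build_mapping` and `entities_of` are illustrative assumptions, not a description of PyKEEN's internals.

```python
def build_mapping(labels):
    """Assign consecutive integer ids to the sorted unique labels."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


def entities_of(triples):
    """Collect every head and tail label from a list of triples."""
    return [entity for h, _, t in triples for entity in (h, t)]


training = [("brazil", "exports", "china"), ("china", "exports", "usa")]
testing = [("usa", "exports", "china")]

# Each set mapped on its own: the same label ends up with different ids.
train_ids = build_mapping(entities_of(training))
independent_test_ids = build_mapping(entities_of(testing))

# Reusing the training mapping keeps the ids aligned.
shared_test_triples = [(train_ids[h], r, train_ids[t]) for h, r, t in testing]
```

Here "usa" has id 2 under the training mapping but id 1 under the independently built testing mapping, and id 1 means "china" to a model trained with the training mapping.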
The dataset_kwargs argument is ignored when you pass your own pykeen.triples.TriplesFactory, so if you want inverse triples, be sure to include create_inverse_triples=True when instantiating those classes, as in:
>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH, NATIONS_TEST_PATH
>>> training = TriplesFactory.from_path(
... NATIONS_TRAIN_PATH,
... create_inverse_triples=True,
... )
>>> testing = TriplesFactory.from_path(
... NATIONS_TEST_PATH,
... entity_to_id=training.entity_to_id,
... relation_to_id=training.relation_to_id,
... create_inverse_triples=True,
... )
>>> result = pipeline(
... training=training,
... testing=testing,
... model='TransE',
... epochs=5, # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_pre_stratified_transe')
If you already have triples loaded in a numpy.ndarray, triples factories can also be instantiated with the triples keyword argument instead of the path argument.
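As a sketch of what such an array might look like, the block below builds an array of label triples with one (head, relation, tail) row per triple and runs a quick shape check. The toy labels are hypothetical. The factory call itself is left out because the exact entry point depends on the installed PyKEEN version; to my knowledge, recent releases expose a TriplesFactory.from_labeled_triples class method for this, which should be verified against the version in use.

```python
import numpy as np

# Hypothetical label triples; each row is (head, relation, tail).
triples = np.array(
    [
        ["brazil", "exports", "china"],
        ["china", "exports", "usa"],
        ["usa", "exports", "brazil"],
    ],
    dtype=str,
)

# Sanity check before handing the array to a triples factory:
# it must be two-dimensional with exactly three columns.
assert triples.ndim == 2 and triples.shape[1] == 3

# The distinct labels that a factory would have to map to integer ids.
entities = np.unique(triples[:, [0, 2]])
relations = np.unique(triples[:, 1])
```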
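Unstratified Dataset
In practice, a real-world dataset is usually not already stratified into training and testing sets.
PyKEEN provides pykeen.triples.TriplesFactory.split(), which helps you create a stratified dataset.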
>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
>>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> training, testing = tf.split()
>>> result = pipeline(
... training=training,
... testing=testing,
... model='TransE',
... epochs=5, # short epochs for testing - you should go higher
... )
>>> result.save_to_directory('doctests/test_unstratified_transe')
By default, this is an 80/20 split. If you want to use early stopping, you also need a validation set, so you should specify the split ratios:
>>> from pykeen.triples import TriplesFactory
>>> from pykeen.pipeline import pipeline
>>> from pykeen.datasets.nations import NATIONS_TRAIN_PATH
>>> tf = TriplesFactory.from_path(NATIONS_TRAIN_PATH)
>>> training, testing, validation = tf.split([.8, .1, .1])
>>> result = pipeline(
... training=training,
... testing=testing,
... validation=validation,
... model='TransE',
... stopper='early',
... epochs=5, # short epochs for testing - you should go
... # higher, especially with early stopper enabled
... )
>>> result.save_to_directory('doctests/test_unstratified_stopped_transe')
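To illustrate what splitting by ratios means, here is a deliberately simplified sketch: shuffle the triples and cut them into consecutive groups. The function `split_by_ratios` is hypothetical and is not how pykeen.triples.TriplesFactory.split() is implemented; in particular, as far as I know, PyKEEN also takes care that entities and relations remain covered by the training set, which this sketch does not attempt.

```python
import random


def split_by_ratios(triples, ratios, seed=0):
    """Shuffle the triples and cut them into consecutive groups sized by ratios."""
    if abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must sum to 1")
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    groups, start = [], 0
    for ratio in ratios[:-1]:
        end = start + round(ratio * len(shuffled))
        groups.append(shuffled[start:end])
        start = end
    groups.append(shuffled[start:])  # the last group takes the remainder
    return groups


# Hypothetical toy triples forming a chain e0 -> e1 -> ... -> e20.
triples = [(f"e{i}", "r", f"e{i + 1}") for i in range(20)]
training, testing, validation = split_by_ratios(triples, [0.8, 0.1, 0.1])
```

A naive split like this can leave an entity that appears only in the testing or validation group, which is exactly the situation the pre-stratified section warns about, so the library function is the better choice for real data.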