Frequently Asked Questions
You will find here the FAQ, as well as some other use-case examples that are not part of the User Guide.
How to get the top-N recommendations for each user
Here is an example where we retrieve the top-10 items with the highest rating prediction for each user in the MovieLens-100k dataset. We first train an SVD algorithm on the whole dataset, then predict the ratings for all the (user, item) pairs that are not in the training set. We then retrieve the top-10 predictions for each user.
examples/top_n_recommendations.py
from collections import defaultdict

from surprise import Dataset, SVD


def get_top_n(predictions, n=10):
    """Return the top-N recommendation for each user from a set of predictions.

    Args:
        predictions(list of Prediction objects): The list of predictions, as
            returned by the test method of an algorithm.
        n(int): The number of recommendations to output for each user. Default
            is 10.

    Returns:
        A dict where keys are user (raw) ids and values are lists of tuples:
        [(raw item id, rating estimation), ...] of size n.
    """

    # First map the predictions to each user.
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))

    # Then sort the predictions for each user and retrieve the n highest ones.
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]

    return top_n


# First train an SVD algorithm on the movielens dataset.
data = Dataset.load_builtin("ml-100k")
trainset = data.build_full_trainset()
algo = SVD()
algo.fit(trainset)

# Then predict ratings for all pairs (u, i) that are NOT in the training set.
testset = trainset.build_anti_testset()
predictions = algo.test(testset)

top_n = get_top_n(predictions, n=10)

# Print the recommended items for each user
for uid, user_ratings in top_n.items():
    print(uid, [iid for (iid, _) in user_ratings])
How to compute precision@k and recall@k
Here is an example where we compute Precision@k and Recall@k for each user:

\(\text{Precision@k} = \dfrac{|\{\text{Recommended items that are relevant}\}|}{|\{\text{Recommended items}\}|}\)

\(\text{Recall@k} = \dfrac{|\{\text{Relevant items that are recommended}\}|}{|\{\text{Relevant items}\}|}\)

An item is considered relevant if its true rating \(r_{ui}\) is greater than a given threshold. An item is considered recommended if its estimated rating \(\hat{r}_{ui}\) is greater than the threshold, and if it is among the k highest estimated ratings.
Note that in the edge cases where division by zero occurs, Precision@k and Recall@k values are undefined. As a convention, we set their values to 0 in such cases.
examples/precision_recall_at_k.py
from collections import defaultdict

from surprise import Dataset, SVD
from surprise.model_selection import KFold


def precision_recall_at_k(predictions, k=10, threshold=3.5):
    """Return precision and recall at k metrics for each user"""

    # First map the predictions to each user.
    user_est_true = defaultdict(list)
    for uid, _, true_r, est, _ in predictions:
        user_est_true[uid].append((est, true_r))

    precisions = dict()
    recalls = dict()
    for uid, user_ratings in user_est_true.items():

        # Sort user ratings by estimated value
        user_ratings.sort(key=lambda x: x[0], reverse=True)

        # Number of relevant items
        n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)

        # Number of recommended items in top k
        n_rec_k = sum((est >= threshold) for (est, _) in user_ratings[:k])

        # Number of relevant and recommended items in top k
        n_rel_and_rec_k = sum(
            ((true_r >= threshold) and (est >= threshold))
            for (est, true_r) in user_ratings[:k]
        )

        # Precision@K: Proportion of recommended items that are relevant
        # When n_rec_k is 0, Precision is undefined. We here set it to 0.
        precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0

        # Recall@K: Proportion of relevant items that are recommended
        # When n_rel is 0, Recall is undefined. We here set it to 0.
        recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0

    return precisions, recalls


data = Dataset.load_builtin("ml-100k")
kf = KFold(n_splits=5)
algo = SVD()

for trainset, testset in kf.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    precisions, recalls = precision_recall_at_k(predictions, k=5, threshold=4)

    # Precision and recall can then be averaged over all users
    print(sum(prec for prec in precisions.values()) / len(precisions))
    print(sum(rec for rec in recalls.values()) / len(recalls))
How to get the k nearest neighbors of a user (or item)
You can use the get_neighbors() method of the algorithm object. This is only relevant for algorithms that use a similarity measure, such as the k-NN algorithms.
Here is an example where we retrieve the 10 nearest neighbors of the movie Toy Story from the MovieLens-100k dataset. The output is:
The 10 nearest neighbors of Toy Story are:
Beauty and the Beast (1991)
Raiders of the Lost Ark (1981)
That Thing You Do! (1996)
Lion King, The (1994)
Craft, The (1996)
Liar Liar (1997)
Aladdin (1992)
Cool Hand Luke (1967)
Winnie the Pooh and the Blustery Day (1968)
Indiana Jones and the Last Crusade (1989)
There's a lot of boilerplate because of the conversions between movie names and their raw/inner ids (see this note), but it all boils down to the use of get_neighbors():
examples/k_nearest_neighbors.py
import io  # noqa

from surprise import Dataset, get_dataset_dir, KNNBaseline


def read_item_names():
    """Read the u.item file from MovieLens 100-k dataset and return two
    mappings to convert raw ids into movie names and movie names into raw ids.
    """

    file_name = get_dataset_dir() + "/ml-100k/ml-100k/u.item"
    rid_to_name = {}
    name_to_rid = {}
    with open(file_name, encoding="ISO-8859-1") as f:
        for line in f:
            line = line.split("|")
            rid_to_name[line[0]] = line[1]
            name_to_rid[line[1]] = line[0]

    return rid_to_name, name_to_rid


# First, train the algorithm to compute the similarities between items
data = Dataset.load_builtin("ml-100k")
trainset = data.build_full_trainset()
sim_options = {"name": "pearson_baseline", "user_based": False}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

# Read the mappings raw id <-> movie name
rid_to_name, name_to_rid = read_item_names()

# Retrieve inner id of the movie Toy Story
toy_story_raw_id = name_to_rid["Toy Story (1995)"]
toy_story_inner_id = algo.trainset.to_inner_iid(toy_story_raw_id)

# Retrieve inner ids of the nearest neighbors of Toy Story.
toy_story_neighbors = algo.get_neighbors(toy_story_inner_id, k=10)

# Convert inner ids of the neighbors into names.
toy_story_neighbors = (
    algo.trainset.to_raw_iid(inner_id) for inner_id in toy_story_neighbors
)
toy_story_neighbors = (rid_to_name[rid] for rid in toy_story_neighbors)

print()
print("The 10 nearest neighbors of Toy Story are:")
for movie in toy_story_neighbors:
    print(movie)
Naturally, the same can be done for users with minor modifications, as sketched below.
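For instance, a minimal sketch (reusing the trainset from the example above; the raw user id "196" is just an example id from ml-100k, not anything special) could look like this:

# Compute similarities between users instead of items.
sim_options = {"name": "pearson_baseline", "user_based": True}
algo = KNNBaseline(sim_options=sim_options)
algo.fit(trainset)

# "196" is an arbitrary example id; raw user ids read from a file are strings.
user_inner_id = algo.trainset.to_inner_uid("196")
user_neighbors = algo.get_neighbors(user_inner_id, k=10)
user_neighbors = (algo.trainset.to_raw_uid(inner_id) for inner_id in user_neighbors)

print("The 10 nearest neighbors of user 196 are:")
for uid in user_neighbors:
    print(uid)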
How to serialize an algorithm
Prediction algorithms can be serialized and loaded back using the dump() and load() functions. Here is a small example where the SVD algorithm is trained on a dataset and serialized. It is then reloaded and can be used again for making predictions:
examples/serialize_algorithm.py
import os

from surprise import Dataset, dump, SVD


data = Dataset.load_builtin("ml-100k")
trainset = data.build_full_trainset()

algo = SVD()
algo.fit(trainset)

# Compute predictions of the 'original' algorithm.
predictions = algo.test(trainset.build_testset())

# Dump algorithm and reload it.
file_name = os.path.expanduser("~/dump_file")
dump.dump(file_name, algo=algo)
_, loaded_algo = dump.load(file_name)

# We now ensure that the algo is still the same by checking the predictions.
predictions_loaded_algo = loaded_algo.test(trainset.build_testset())
assert predictions == predictions_loaded_algo
print("Predictions are the same")
Algorithms can be serialized along with their predictions, so that they can be further analyzed or compared with other algorithms, using pandas dataframes. Some examples are given in the two following notebooks:
Dumping and analysis of the KNNBasic algorithm
Comparison of two algorithms
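As a rough sketch of that kind of analysis (not taken from the notebooks; it assumes the predictions were dumped along with the algorithm), the loaded predictions can be put into a dataframe. The column names below are purely illustrative:

import pandas as pd

# Assumes the dump was made with
# dump.dump(file_name, predictions=predictions, algo=algo).
predictions, loaded_algo = dump.load(file_name)

# Each Prediction is a named tuple (uid, iid, r_ui, est, details);
# the column names here are illustrative, not imposed by Surprise.
df = pd.DataFrame(predictions, columns=["uid", "iid", "rui", "est", "details"])
df["err"] = abs(df["est"] - df["rui"])
print(df.sort_values(by="err").head())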
How to build my own prediction algorithm
There's a whole guide dedicated to this here.
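As a bare-bones illustration of what that guide covers, a custom algorithm derives from AlgoBase and implements an estimate() method; the minimal sketch below simply predicts the global mean rating:

from surprise import AlgoBase, Dataset


class MyOwnAlgorithm(AlgoBase):
    def __init__(self):
        AlgoBase.__init__(self)

    def fit(self, trainset):
        AlgoBase.fit(self, trainset)
        # Remember the mean rating of the trainset; estimate() returns it.
        self.the_mean = trainset.global_mean
        return self

    def estimate(self, u, i):
        return self.the_mean


data = Dataset.load_builtin("ml-100k")
algo = MyOwnAlgorithm()
algo.fit(data.build_full_trainset())
print(algo.predict("196", "302").est)  # the global mean, for any pair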
What are raw and inner ids
Users and items each have a raw id and an inner id. Some methods will use/return a raw id (e.g. the predict() method), while others will use/return an inner id.
Raw ids are ids as defined in the rating file or in the pandas dataframe. They can be strings or numbers. Note, though, that if the ratings were read from a file (which is the standard scenario), they are represented as strings. This is important to know if you're using e.g. predict() or other methods that accept raw ids as parameters.
On trainset creation, each raw id is mapped to a unique integer called an inner id, which is a lot more suitable for Surprise to manipulate. Conversions between raw and inner ids can be done using the to_inner_uid(), to_inner_iid(), to_raw_uid() and to_raw_iid() methods of the trainset.
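For instance, assuming a trainset built from the ml-100k ratings (where raw ids are read from a file and are therefore strings; "196" and "302" are just example ids):

# Raw ids read from a file are strings; inner ids are 0-based integers.
inner_uid = trainset.to_inner_uid("196")  # raw user id -> inner user id
inner_iid = trainset.to_inner_iid("302")  # raw item id -> inner item id
print(trainset.to_raw_uid(inner_uid))  # back to "196"
print(trainset.to_raw_iid(inner_iid))  # back to "302"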
Can I use my own dataset with Surprise, and can it be a pandas dataframe
Yes, and yes. See the user guide.
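As a minimal sketch (the toy dataframe and its column names below are made up for illustration), a pandas dataframe is loaded with Dataset.load_from_df() and a Reader that specifies the rating scale:

import pandas as pd

from surprise import Dataset, Reader

# A toy dataframe; the column names are arbitrary, but the columns
# must be passed in this order: user ids, item ids, ratings.
ratings_dict = {
    "userID": [1, 1, 2, 2, 3],
    "itemID": ["A", "B", "A", "C", "B"],
    "rating": [3, 2, 4, 5, 1],
}
df = pd.DataFrame(ratings_dict)

reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(df[["userID", "itemID", "rating"]], reader)
trainset = data.build_full_trainset()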
How to tune an algorithm's parameters
You can tune the parameters of an algorithm with the GridSearchCV class, as described here. After tuning, you may want to get an unbiased estimate of your algorithm's performance.
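A minimal sketch of such a tuning run (with an arbitrary, small parameter grid) could look like this:

from surprise import Dataset, SVD
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin("ml-100k")

# The grid values here are arbitrary examples.
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005]}
grid_search = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3)
grid_search.fit(data)

print(grid_search.best_score["rmse"])   # best RMSE over the grid
print(grid_search.best_params["rmse"])  # parameters that achieved it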
How to get accuracy measures on the training set
You can use the build_testset() method of the trainset object to build a testset that can then be used with the test() method:
examples/evaluate_on_trainset.py
from surprise import accuracy, Dataset, SVD


data = Dataset.load_builtin("ml-100k")

algo = SVD()

trainset = data.build_full_trainset()
algo.fit(trainset)

testset = trainset.build_testset()
predictions = algo.test(testset)

# RMSE should be low as we are biased
accuracy.rmse(predictions, verbose=True)  # ~ 0.68 (which is low)
Check out the example file for more usage examples.
How to save some data for unbiased accuracy estimation
If your goal is to tune the parameters of an algorithm, you may want to spare some data to have an unbiased estimate of its performance. For instance, you may want to split your data into two sets A and B: A is used for parameter tuning using grid search, and B is used for unbiased estimation. This can be done as follows:
examples/split_data_for_unbiased_estimation.py
import random

from surprise import accuracy, Dataset, SVD
from surprise.model_selection import GridSearchCV


# Load the full dataset.
data = Dataset.load_builtin("ml-100k")
raw_ratings = data.raw_ratings

# shuffle ratings if you want
random.shuffle(raw_ratings)

# A = 90% of the data, B = 10% of the data
threshold = int(0.9 * len(raw_ratings))
A_raw_ratings = raw_ratings[:threshold]
B_raw_ratings = raw_ratings[threshold:]

data.raw_ratings = A_raw_ratings  # data is now the set A

# Select your best algo with grid search.
print("Grid Search...")
param_grid = {"n_epochs": [5, 10], "lr_all": [0.002, 0.005]}
grid_search = GridSearchCV(SVD, param_grid, measures=["rmse"], cv=3)
grid_search.fit(data)

algo = grid_search.best_estimator["rmse"]

# retrain on the whole set A
trainset = data.build_full_trainset()
algo.fit(trainset)

# Compute biased accuracy on A
predictions = algo.test(trainset.build_testset())
print("Biased accuracy on A,", end=" ")
accuracy.rmse(predictions)

# Compute unbiased accuracy on B
testset = data.construct_testset(B_raw_ratings)  # testset is now the set B
predictions = algo.test(testset)
print("Unbiased accuracy on B,", end=" ")
accuracy.rmse(predictions)
How to have reproducible experiments
Some algorithms randomly initialize their parameters (sometimes with numpy), and the cross-validation folds are also randomly generated. If you need to reproduce your experiments multiple times, you just have to set the seed of the RNG at the beginning of your program:
import random
import numpy as np
my_seed = 0
random.seed(my_seed)
np.random.seed(my_seed)
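The cross-validation iterators of surprise.model_selection also accept a random_state parameter, so, assuming you only need the folds themselves to be reproducible, you can seed them directly:

from surprise.model_selection import KFold

# The folds generated by this iterator are the same across runs.
kf = KFold(n_splits=5, random_state=0)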
Where are datasets stored and how can I change that?
By default, datasets downloaded by Surprise will be saved in the '~/.surprise_data' directory. This is also where dump files will be stored. You can change the default directory by setting the 'SURPRISE_DATA_FOLDER' environment variable.
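For instance (the path below is just an example), the variable can be set from Python before loading a dataset:

import os

# Set the variable before Surprise looks the folder up.
os.environ["SURPRISE_DATA_FOLDER"] = os.path.expanduser("~/my_surprise_data")

from surprise import Dataset

data = Dataset.load_builtin("ml-100k")  # downloads into the new folder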
Does Surprise support content-based data or implicit ratings?
No: this is out of scope for Surprise. Surprise was designed for explicit ratings.