Reproducibility

One of the desirable properties of a package that makes several "random" choices is the ability to reproduce its results.

The usual strategy is to fix the random seed that starts the generation of pseudo-random numbers. Unfortunately, DEAP, the package on which all the evolutionary algorithms mainly rely, does not provide an explicit parameter to fix that seed.

There is, however, a workaround that appears to reproduce these results. It works as follows:

  • Set the random seed of the numpy and random packages, which are the underlying random number generators

  • Use the random_state parameter in every scikit-learn and sklearn-genetic-opt object that supports it

In the following example, random_state is set for train_test_split, the cross-validation generator, each hyperparameter in the param_grid, the RandomForestClassifier, and at the file level.

Example:

import numpy as np
import random
from sklearn_genetic import GASearchCV
from sklearn_genetic.space import Continuous, Categorical, Integer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score


# Random Seed at file level
random_seed = 54

np.random.seed(random_seed)
random.seed(random_seed)


# Load the digits dataset and flatten each 8x8 image into a 64-feature vector
data = load_digits()
n_samples = len(data.images)
X = data.images.reshape((n_samples, -1))
y = data['target']

# Seed the split so the train/test partition is reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=random_seed)

# The estimator itself is seeded through its own random_state
clf = RandomForestClassifier(random_state=random_seed)

# Each search-space dimension also receives the fixed random_state
param_grid = {'min_weight_fraction_leaf': Continuous(0.01, 0.5, distribution='log-uniform',
                                                     random_state=random_seed),
              'bootstrap': Categorical([True, False], random_state=random_seed),
              'max_depth': Integer(2, 30, random_state=random_seed),
              'max_leaf_nodes': Integer(2, 35, random_state=random_seed),
              'n_estimators': Integer(100, 300, random_state=random_seed)}

# Shuffling inside the CV splitter is also made deterministic
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=random_seed)

# GASearchCV exposes no random_state of its own; the DEAP-driven evolution
# relies on the numpy and random seeds fixed at the top of the file
evolved_estimator = GASearchCV(estimator=clf,
                               cv=cv,
                               scoring='accuracy',
                               population_size=8,
                               generations=5,
                               param_grid=param_grid,
                               n_jobs=-1,
                               verbose=True,
                               keep_top_k=4)

# Train and optimize the estimator
evolved_estimator.fit(X_train, y_train)
# Best parameters found
print(evolved_estimator.best_params_)
# Use the model fitted with the best parameters
y_predict_ga = evolved_estimator.predict(X_test)
print(accuracy_score(y_test, y_predict_ga))

# Saved metadata for further analysis
print("Stats achieved in each generation: ", evolved_estimator.history)
print("Best k solutions: ", evolved_estimator.hof)