Ensembles - RDD-based API

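Gradient-Boosted Trees vs. Random Forests
Random Forests
- Basic algorithm
  - Training
  - Prediction
- Usage tips
- Examples
  - Classification
  - Regression
Gradient-Boosted Trees (GBTs)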

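An ensemble method is a learning algorithm which creates a model composed of a set of other base models. spark.mllib supports two major ensemble algorithms: GradientBoostedTrees and RandomForest. Both use decision trees as their base models.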

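Gradient-Boosted Trees vs. Random Forests

Both Gradient-Boosted Trees (GBTs) and Random Forests are algorithms for learning ensembles of trees, but their training processes differ. There are several practical trade-offs:

- GBTs train one tree at a time, so they can take longer to train than Random Forests. Random Forests can train multiple trees in parallel.
  - On the other hand, it is often reasonable to use smaller (shallower) trees with GBTs than with Random Forests, and training smaller trees takes less time.
- Random Forests can be less prone to overfitting. Training more trees in a Random Forest reduces the likelihood of overfitting, but training more trees with GBTs increases the likelihood of overfitting. (In statistical language, Random Forests reduce variance by using more trees, whereas GBTs reduce bias by using more trees.)
- Random Forests can be easier to tune, since performance improves monotonically with the number of trees (whereas performance can start to decrease for GBTs if the number of trees grows too large).

In short, both algorithms can be effective, and the choice should be based on the particular dataset.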

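Random Forests

Random forests are ensembles of decision trees. They are among the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.

spark.mllib supports random forests for binary and multiclass classification and for regression, using both continuous and categorical features. spark.mllib implements random forests using the existing decision tree implementation. Please see the decision tree guide for more information on trees.

Basic algorithm

Random forests train a set of decision trees separately, so the training can be done in parallel. The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions from each tree reduces the variance of the predictions, improving the performance on test data.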

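Training

The randomness injected into the training process includes:

- Subsampling the original dataset on each iteration to get a different training set (a.k.a. bootstrapping).
- Considering different random subsets of features to split on at each tree node.

Apart from these randomizations, each decision tree is trained in the same way as an individual decision tree.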
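
The two sources of randomness above can be illustrated with a minimal plain-Python sketch. It is not how spark.mllib implements them; the helper names `bootstrap_sample` and `random_feature_subset` are hypothetical and exist only for this illustration.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (a bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(num_features, k, rng):
    """Pick k distinct feature indices to consider at one tree node."""
    return sorted(rng.sample(range(num_features), k))

rng = random.Random(42)
rows = list(range(10))  # stand-in for 10 training instances

# Each tree would get its own resampled training set ...
sample = bootstrap_sample(rows, rng)
# ... and each node would consider only a random subset of the features.
features = random_feature_subset(num_features=8, k=3, rng=rng)

print(sample)
print(features)
```

Because rows are drawn with replacement, some instances typically appear several times in a sample and others not at all, which is what makes the trees differ from one another.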

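Prediction

To make a prediction on a new instance, a random forest must aggregate the predictions from its set of decision trees. This aggregation is done differently for classification and regression.

Classification: Majority vote. Each tree's prediction is counted as a vote for one class. The label is predicted to be the class which receives the most votes.

Regression: Averaging. Each tree predicts a real value. The label is predicted to be the average of the tree predictions.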
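
As a rough illustration of the two aggregation rules, the sketch below takes a list of per-tree predictions and combines them. The function names are hypothetical, and ties in the vote are broken arbitrarily here (in favor of the class seen first), which may not match spark.mllib.

```python
from collections import Counter

def predict_classification(tree_predictions):
    """Majority vote: return the class predicted by the most trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

def predict_regression(tree_predictions):
    """Averaging: return the mean of the trees' real-valued predictions."""
    return sum(tree_predictions) / len(tree_predictions)

# Three trees vote 1.0, 0.0, 1.0: class 1.0 wins with two votes.
print(predict_classification([1.0, 0.0, 1.0]))  # 1.0
# Three trees predict 2.0, 3.0, 4.0: the forest predicts their mean.
print(predict_regression([2.0, 3.0, 4.0]))      # 3.0
```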

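Usage tips

We include a few guidelines for using random forests by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide.

The first two parameters we mention are the most important, and tuning them can often improve performance:

- numTrees: Number of trees in the forest.
  - Increasing the number of trees will decrease the variance in predictions, improving the model's test-time accuracy.
  - Training time increases roughly linearly in the number of trees.
- maxDepth: Maximum depth of each tree in the forest.
  - Increasing the depth makes the model more expressive and powerful. However, deep trees take longer to train and are also more prone to overfitting.
  - In general, it is acceptable to train deeper trees when using random forests than when using a single decision tree. One tree is more likely to overfit than a random forest (because of the variance reduction from averaging multiple trees in the forest).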
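
The variance-reduction effect of numTrees can be seen in a small simulation. This is only a sketch under a strong simplifying assumption: each "tree" is modeled as the true value plus independent Gaussian noise (the function `noisy_tree_prediction` is a hypothetical stand-in). Real trees in a forest are correlated, so the reduction in practice is smaller than what this shows.

```python
import random
import statistics

def noisy_tree_prediction(rng, truth=10.0, noise=2.0):
    """Hypothetical stand-in for one tree: the truth plus independent noise."""
    return truth + rng.gauss(0.0, noise)

def forest_prediction(rng, num_trees):
    """Average the predictions of num_trees noisy trees."""
    return statistics.fmean([noisy_tree_prediction(rng) for _ in range(num_trees)])

def prediction_variance(num_trees, trials=2000, seed=0):
    """Empirical variance of the forest prediction over many simulated forests."""
    rng = random.Random(seed)
    return statistics.variance([forest_prediction(rng, num_trees) for _ in range(trials)])

var_1 = prediction_variance(1)    # a single tree
var_25 = prediction_variance(25)  # a forest of 25 trees
print(var_1, var_25)
```

Under the independence assumption the variance of the average shrinks roughly in proportion to 1 / numTrees, so the 25-tree forest should show a variance far below that of the single tree.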

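The next two parameters generally do not require tuning. However, they can be tuned to speed up training.

- subsamplingRate: This parameter specifies the size of the dataset used for training each tree in the forest, as a fraction of the size of the original dataset. The default (1.0) is recommended, but decreasing this fraction can speed up training.
- featureSubsetStrategy: Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low.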
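
To make "a function of the total number of features" concrete, here is a hedged sketch of how a named strategy could be turned into a per-node feature count. The option names ("auto", "all", "sqrt", "log2", "onethird") and the "auto" rule are the ones listed in the RandomForest API docs; the rounding (ceiling) and the helper name `features_per_node` are assumptions of this sketch, not a statement about the library's internals.

```python
import math

def features_per_node(strategy, num_features, num_trees, is_classification):
    """Resolve a featureSubsetStrategy name to a feature count per node (sketch)."""
    if strategy == "auto":
        # "auto": all features for a single tree; otherwise sqrt for
        # classification and one third for regression.
        if num_trees == 1:
            strategy = "all"
        else:
            strategy = "sqrt" if is_classification else "onethird"
    if strategy == "all":
        return num_features
    if strategy == "sqrt":
        return math.ceil(math.sqrt(num_features))      # rounding is an assumption
    if strategy == "log2":
        return max(1, math.ceil(math.log2(num_features)))
    if strategy == "onethird":
        return math.ceil(num_features / 3.0)
    raise ValueError("unsupported strategy: " + strategy)

# With 100 features and a 3-tree forest:
print(features_per_node("auto", 100, num_trees=3, is_classification=True))
print(features_per_node("auto", 100, num_trees=3, is_classification=False))
```

Fewer candidate features per node means fewer splits to evaluate, which is why lowering this number speeds up training.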

Examples

Classification

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform classification using a Random Forest. The test error is calculated to measure the algorithm's accuracy.

Refer to the RandomForest Python docs and RandomForestModel Python docs for more details on the API.

from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

# Assumes an existing SparkContext named sc.
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainClassifier(trainingData, numClasses=2, categoricalFeaturesInfo={},
                                     numTrees=3, featureSubsetStrategy="auto",
                                     impurity='gini', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification forest model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "target/tmp/myRandomForestClassificationModel")
sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")

Find full example code at "examples/src/main/python/mllib/random_forest_classification_example.py" in the Spark repo.

Refer to the RandomForest Scala docs and RandomForestModel Scala docs for details on the API.

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Assumes an existing SparkContext named sc.
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "gini"
val maxDepth = 4
val maxBins = 32
val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println(s"Test Error = $testErr")
println(s"Learned classification forest model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myRandomForestClassificationModel")
val sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestClassificationModel")

Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/RandomForestClassificationExample.scala" in the Spark repo.

Refer to the RandomForest Java docs and RandomForestModel Java docs for details on the API.

import java.util.HashMap;
import java.util.Map;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.mllib.util.MLUtils;

SparkConf sparkConf = new SparkConf().setAppName("JavaRandomForestClassificationExample");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
// Load and parse the data file.
String datapath = "data/mllib/sample_libsvm_data.txt";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), datapath).toJavaRDD();
// Split the data into training and test sets (30% held out for testing)
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];

// Train a RandomForest model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
int numClasses = 2;
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
int numTrees = 3; // Use more in practice.
String featureSubsetStrategy = "auto"; // Let the algorithm choose.
String impurity = "gini";
int maxDepth = 5;
int maxBins = 32;
int seed = 12345;
RandomForestModel model = RandomForest.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins,
  seed);

// Evaluate model on test instances and compute test error
JavaPairRDD<Double, Double> predictionAndLabel =
  testData.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
double testErr =
  predictionAndLabel.filter(pl -> !pl._1().equals(pl._2())).count() / (double) testData.count();
System.out.println("Test Error: " + testErr);
System.out.println("Learned classification forest model:\n" + model.toDebugString());

// Save and load model
model.save(jsc.sc(), "target/tmp/myRandomForestClassificationModel");
RandomForestModel sameModel = RandomForestModel.load(jsc.sc(),
  "target/tmp/myRandomForestClassificationModel");

Find full example code at "examples/src/main/java/org/apache/spark/examples/mllib/JavaRandomForestClassificationExample.java" in the Spark repo.

Regression

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform regression using a Random Forest. The Mean Squared Error (MSE) is computed at the end to evaluate goodness of fit.

Refer to the RandomForest Python docs and RandomForestModel Python docs for more details on the API.

from pyspark.mllib.tree import RandomForest, RandomForestModel
from pyspark.mllib.util import MLUtils

# Assumes an existing SparkContext named sc.
# Load and parse the data file into an RDD of LabeledPoint.
data = MLUtils.loadLibSVMFile(sc, 'data/mllib/sample_libsvm_data.txt')
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a RandomForest model.
#  Empty categoricalFeaturesInfo indicates all features are continuous.
#  Note: Use larger numTrees in practice.
#  Setting featureSubsetStrategy="auto" lets the algorithm choose.
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=3, featureSubsetStrategy="auto",
                                    impurity='variance', maxDepth=4, maxBins=32)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testMSE = labelsAndPredictions.map(lambda lp: (lp[0] - lp[1]) * (lp[0] - lp[1])).sum() /\
    float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression forest model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "target/tmp/myRandomForestRegressionModel")
sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestRegressionModel")

Find full example code at "examples/src/main/python/mllib/random_forest_regression_example.py" in the Spark repo.

Refer to the RandomForest Scala docs and RandomForestModel Scala docs for details on the API.

import org.apache.spark.mllib.tree.RandomForest
import org.apache.spark.mllib.tree.model.RandomForestModel
import org.apache.spark.mllib.util.MLUtils

// Assumes an existing SparkContext named sc.
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a RandomForest model.
// Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val numTrees = 3 // Use more in practice.
val featureSubsetStrategy = "auto" // Let the algorithm choose.
val impurity = "variance"
val maxDepth = 4
val maxBins = 32
val model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins)

// Evaluate model on test instances and compute test error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map{ case (v, p) => math.pow((v - p), 2) }.mean()
println(s"Test Mean Squared Error = $testMSE")
println(s"Learned regression forest model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myRandomForestRegressionModel")
val sameModel = RandomForestModel.load(sc, "target/tmp/myRandomForestRegressionModel")

Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/RandomForestRegressionExample.scala" in the Spark repo.

Refer to the RandomForest Java docs and RandomForestModel Java docs for details on the API.

import java.util.HashMap;
import java.util.Map;
import scala.Tuple2;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.RandomForest;
import org.apache.spark.mllib.tree.model.RandomForestModel;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.SparkConf;

SparkConf sparkConf = new SparkConf().setAppName("JavaRandomForestRegressionExample");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
// Load and parse the data file.
String datapath = "data/mllib/sample_libsvm_data.txt";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), datapath).toJavaRDD();
// Split the data into training and test sets (30% held out for testing)
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];

// Set parameters.
// Empty categoricalFeaturesInfo indicates all features are continuous.
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
int numTrees = 3; // Use more in practice.
String featureSubsetStrategy = "auto"; // Let the algorithm choose.
String impurity = "variance";
int maxDepth = 4;
int maxBins = 32;
int seed = 12345;
// Train a RandomForest model.
RandomForestModel model = RandomForest.trainRegressor(trainingData,
  categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed);

// Evaluate model on test instances and compute test error
JavaPairRDD<Double, Double> predictionAndLabel =
  testData.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
double testMSE = predictionAndLabel.mapToDouble(pl -> {
  double diff = pl._1() - pl._2();
  return diff * diff;
}).mean();
System.out.println("Test Mean Squared Error: " + testMSE);
System.out.println("Learned regression forest model:\n" + model.toDebugString());

// Save and load model
model.save(jsc.sc(), "target/tmp/myRandomForestRegressionModel");
RandomForestModel sameModel = RandomForestModel.load(jsc.sc(),
  "target/tmp/myRandomForestRegressionModel");

Find full example code at "examples/src/main/java/org/apache/spark/examples/mllib/JavaRandomForestRegressionExample.java" in the Spark repo.

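Gradient-Boosted Trees (GBTs)

Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. Like decision trees, GBTs handle categorical features, do not require feature scaling, and are able to capture non-linearities and feature interactions. Unlike decision trees, the spark.mllib implementation does not yet extend to the multiclass classification setting (see the note below).

spark.mllib supports GBTs for binary classification and for regression, using both continuous and categorical features. spark.mllib implements GBTs using the existing decision tree implementation. Please see the decision tree guide for more information on trees.

Note: GBTs do not yet support multiclass classification. For multiclass problems, please use decision trees or Random Forests.

Basic algorithm

Gradient boosting iteratively trains a sequence of decision trees. On each iteration, the algorithm uses the current ensemble to predict the label of each training instance and then compares the prediction with the true label. The dataset is re-labeled to put more emphasis on training instances with poor predictions. Thus, in the next iteration, the decision tree will help correct for previous mistakes.

The specific mechanism for re-labeling instances is defined by a loss function (discussed below). With each iteration, GBTs further reduce this loss function on the training data.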
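
The loop described above can be sketched in plain Python for the squared-error case, where the "re-labeled" targets are simply the residuals (which are proportional to the negative gradient of the loss). This is a toy illustration, not the spark.mllib implementation: it uses a single feature and depth-1 trees, and the names `fit_stump` and `boost` are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (stump) on one feature by least squares."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, num_iterations, learning_rate):
    """Return the fitted stumps and the training loss after each iteration."""
    predictions = [0.0] * len(xs)
    stumps, losses = [], []
    for _ in range(num_iterations):
        # Re-label: for squared error, the new targets are the residuals y - F(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        # Add the new tree to the ensemble, scaled by the learning rate.
        predictions = [p + learning_rate * stump(x) for p, x in zip(predictions, xs)]
        losses.append(sum((y - p) ** 2 for y, p in zip(ys, predictions)))
    return stumps, losses

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [1.2, 1.9, 3.1, 3.9, 5.2, 5.8]
stumps, losses = boost(xs, ys, num_iterations=5, learning_rate=0.5)
print(losses)  # training loss after each iteration; it should not increase
```

Each stump is fit to what the current ensemble still gets wrong, so instances with large residuals dominate the next fit, and the training loss shrinks from one iteration to the next.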

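Losses

The table below lists the losses currently supported by GBTs in spark.mllib. Note that each loss is applicable to one of classification or regression, not both.

Notation: $N$ = number of instances. $y_i$ = label of instance $i$. $x_i$ = features of instance $i$. $F(x_i)$ = model's predicted label for instance $i$.

Loss	Task	Formula	Description
Log Loss	Classification	$2 \sum_{i=1}^{N} \log(1+\exp(-2 y_i F(x_i)))$	Twice binomial negative log likelihood.
Squared Error	Regression	$\sum_{i=1}^{N} (y_i - F(x_i))^2$	Also called L2 loss. Default loss for regression tasks.
Absolute Error	Regression	$\sum_{i=1}^{N} |y_i - F(x_i)|$	Also called L1 loss. Can be more robust to outliers than Squared Error.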
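
The three formulas in the table translate directly into code. The sketch below is a plain-Python transcription for illustration only; the function names are hypothetical, and the log loss assumes labels encoded as -1 and +1, which is what the form of the formula suggests.

```python
import math

def log_loss(labels, predictions):
    """2 * sum(log(1 + exp(-2 * y * F(x)))), assuming labels y in {-1, +1}."""
    return 2.0 * sum(math.log1p(math.exp(-2.0 * y * f))
                     for y, f in zip(labels, predictions))

def squared_error(labels, predictions):
    """sum((y - F(x))^2), the L2 loss."""
    return sum((y - f) ** 2 for y, f in zip(labels, predictions))

def absolute_error(labels, predictions):
    """sum(|y - F(x)|), the L1 loss."""
    return sum(abs(y - f) for y, f in zip(labels, predictions))

# One badly predicted instance (3.0 predicted as 13.0) among good ones:
labels = [1.0, 2.0, 3.0]
predictions = [1.0, 2.0, 13.0]
print(squared_error(labels, predictions))   # 100.0
print(absolute_error(labels, predictions))  # 10.0
```

The single outlier contributes 100 to the squared error but only 10 to the absolute error, which illustrates why the L1 loss can be more robust to outliers.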

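Usage tips

We include a few guidelines for using GBTs by discussing the various parameters. We omit some decision tree parameters since those are covered in the decision tree guide.

- loss: See the section above for information on losses and their applicability to tasks (classification vs. regression). Different losses can give significantly different results, depending on the dataset.
- numIterations: This sets the number of trees in the ensemble. Each iteration produces one tree. Increasing this number makes the model more expressive, improving training data accuracy. However, test-time accuracy may suffer if this is too large.
- learningRate: This parameter should not need to be tuned. If the algorithm's behavior seems unstable, decreasing this value may improve stability.
- algo: The algorithm or task (classification vs. regression) is set using the tree [Strategy] parameter.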

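Validation while training

Gradient boosting can overfit when trained with more trees. In order to prevent overfitting, it is useful to validate while training. The method runWithValidation has been provided to make use of this option. It takes a pair of RDDs as arguments, the first one being the training dataset and the second being the validation dataset.

Training is stopped when the improvement in the validation error is not more than a certain tolerance (supplied by the validationTol argument in BoostingStrategy). In practice, the validation error decreases initially and later increases. There may be cases in which the validation error does not change monotonically; in that case the user is advised to set a large enough negative tolerance and examine the validation curve using evaluateEachIteration (which gives the error or loss per iteration) to tune the number of iterations.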
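
The stopping rule and the reason for a negative tolerance can be illustrated with a simplified sketch. It applies the rule "stop when the improvement is not more than the tolerance" to a made-up list of per-iteration validation errors; the function name `trees_kept` and the error values are hypothetical, and the exact criterion used inside spark.mllib may differ (for example, by scaling the tolerance).

```python
def trees_kept(validation_errors, validation_tol):
    """Number of trees kept under a simplified early-stopping rule.

    validation_errors[k] is the validation error after k + 1 trees.
    """
    best = validation_errors[0]
    for i in range(1, len(validation_errors)):
        improvement = best - validation_errors[i]
        if improvement <= validation_tol:
            return i  # tree i + 1 did not help enough: keep the first i trees
        best = min(best, validation_errors[i])
    return len(validation_errors)

# A non-monotonic validation curve: it stalls, gets worse, then improves again.
errors = [0.30, 0.22, 0.18, 0.17, 0.19, 0.16]

stopped_early = trees_kept(errors, validation_tol=0.02)
ran_to_end = trees_kept(errors, validation_tol=-1.0)
best_iteration = min(range(len(errors)), key=errors.__getitem__) + 1

print(stopped_early)   # stops at the stall, before the later improvement
print(ran_to_end)      # a large negative tolerance never triggers the stop
print(best_iteration)  # chosen afterwards by inspecting the whole curve
```

With a positive tolerance the rule stops at the first stall and misses the lower error reached later; with a large negative tolerance training runs for all iterations, and the number of iterations can then be chosen from the full per-iteration curve.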

Examples

Classification

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform classification using Gradient-Boosted Trees with log loss. The test error is calculated to measure the algorithm's accuracy.

Refer to the GradientBoostedTrees Python docs and GradientBoostedTreesModel Python docs for more details on the API.

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.mllib.util import MLUtils

# Assumes an existing SparkContext named sc.
# Load and parse the data file.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GradientBoostedTrees model.
#  Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
#         (b) Use more iterations in practice.
model = GradientBoostedTrees.trainClassifier(trainingData,
                                             categoricalFeaturesInfo={}, numIterations=3)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testErr = labelsAndPredictions.filter(
    lambda lp: lp[0] != lp[1]).count() / float(testData.count())
print('Test Error = ' + str(testErr))
print('Learned classification GBT model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "target/tmp/myGradientBoostingClassificationModel")
sameModel = GradientBoostedTreesModel.load(sc,
                                           "target/tmp/myGradientBoostingClassificationModel")

Find full example code at "examples/src/main/python/mllib/gradient_boosting_classification_example.py" in the Spark repo.

Refer to the GradientBoostedTrees Scala docs and GradientBoostedTreesModel Scala docs for details on the API.

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

// Assumes an existing SparkContext named sc.
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
val boostingStrategy = BoostingStrategy.defaultParams("Classification")
boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.
boostingStrategy.treeStrategy.numClasses = 2
boostingStrategy.treeStrategy.maxDepth = 5
// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()
val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count.toDouble / testData.count()
println(s"Test Error = $testErr")
println(s"Learned classification GBT model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myGradientBoostingClassificationModel")
val sameModel = GradientBoostedTreesModel.load(sc,
  "target/tmp/myGradientBoostingClassificationModel")

Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostingClassificationExample.scala" in the Spark repo.

Refer to the GradientBoostedTrees Java docs and GradientBoostedTreesModel Java docs for details on the API.

import java.util.HashMap;
import java.util.Map;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.GradientBoostedTrees;
import org.apache.spark.mllib.tree.configuration.BoostingStrategy;
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel;
import org.apache.spark.mllib.util.MLUtils;

SparkConf sparkConf = new SparkConf()
  .setAppName("JavaGradientBoostedTreesClassificationExample");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
// Load and parse the data file.
String datapath = "data/mllib/sample_libsvm_data.txt";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), datapath).toJavaRDD();
// Split the data into training and test sets (30% held out for testing)
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];

// Train a GradientBoostedTrees model.
// The defaultParams for Classification use LogLoss by default.
BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Classification");
boostingStrategy.setNumIterations(3); // Note: Use more iterations in practice.
boostingStrategy.getTreeStrategy().setNumClasses(2);
boostingStrategy.getTreeStrategy().setMaxDepth(5);
// Empty categoricalFeaturesInfo indicates all features are continuous.
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);
GradientBoostedTreesModel model = GradientBoostedTrees.train(trainingData, boostingStrategy);

// Evaluate model on test instances and compute test error
JavaPairRDD<Double, Double> predictionAndLabel =
  testData.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
double testErr =
  predictionAndLabel.filter(pl -> !pl._1().equals(pl._2())).count() / (double) testData.count();
System.out.println("Test Error: " + testErr);
System.out.println("Learned classification GBT model:\n" + model.toDebugString());

// Save and load model
model.save(jsc.sc(), "target/tmp/myGradientBoostingClassificationModel");
GradientBoostedTreesModel sameModel = GradientBoostedTreesModel.load(jsc.sc(),
  "target/tmp/myGradientBoostingClassificationModel");

Find full example code at "examples/src/main/java/org/apache/spark/examples/mllib/JavaGradientBoostingClassificationExample.java" in the Spark repo.

Regression

The example below demonstrates how to load a LIBSVM data file, parse it as an RDD of LabeledPoint, and then perform regression using Gradient-Boosted Trees with Squared Error as the loss. The Mean Squared Error (MSE) is computed at the end to evaluate goodness of fit.

Refer to the GradientBoostedTrees Python docs and GradientBoostedTreesModel Python docs for more details on the API.

from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel
from pyspark.mllib.util import MLUtils

# Assumes an existing SparkContext named sc.
# Load and parse the data file.
data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
# Split the data into training and test sets (30% held out for testing)
(trainingData, testData) = data.randomSplit([0.7, 0.3])

# Train a GradientBoostedTrees model.
#  Notes: (a) Empty categoricalFeaturesInfo indicates all features are continuous.
#         (b) Use more iterations in practice.
model = GradientBoostedTrees.trainRegressor(trainingData,
                                            categoricalFeaturesInfo={}, numIterations=3)

# Evaluate model on test instances and compute test error
predictions = model.predict(testData.map(lambda x: x.features))
labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)
testMSE = labelsAndPredictions.map(lambda lp: (lp[0] - lp[1]) * (lp[0] - lp[1])).sum() /\
    float(testData.count())
print('Test Mean Squared Error = ' + str(testMSE))
print('Learned regression GBT model:')
print(model.toDebugString())

# Save and load model
model.save(sc, "target/tmp/myGradientBoostingRegressionModel")
sameModel = GradientBoostedTreesModel.load(sc, "target/tmp/myGradientBoostingRegressionModel")

Find full example code at "examples/src/main/python/mllib/gradient_boosting_regression_example.py" in the Spark repo.

Refer to the GradientBoostedTrees Scala docs and GradientBoostedTreesModel Scala docs for details on the API.

import org.apache.spark.mllib.tree.GradientBoostedTrees
import org.apache.spark.mllib.tree.configuration.BoostingStrategy
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel
import org.apache.spark.mllib.util.MLUtils

// Assumes an existing SparkContext named sc.
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

// Train a GradientBoostedTrees model.
// The defaultParams for Regression use SquaredError by default.
val boostingStrategy = BoostingStrategy.defaultParams("Regression")
boostingStrategy.numIterations = 3 // Note: Use more iterations in practice.
boostingStrategy.treeStrategy.maxDepth = 5
// Empty categoricalFeaturesInfo indicates all features are continuous.
boostingStrategy.treeStrategy.categoricalFeaturesInfo = Map[Int, Int]()
val model = GradientBoostedTrees.train(trainingData, boostingStrategy)

// Evaluate model on test instances and compute test error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map{ case (v, p) => math.pow((v - p), 2) }.mean()
println(s"Test Mean Squared Error = $testMSE")
println(s"Learned regression GBT model:\n ${model.toDebugString}")

// Save and load model
model.save(sc, "target/tmp/myGradientBoostingRegressionModel")
val sameModel = GradientBoostedTreesModel.load(sc,
  "target/tmp/myGradientBoostingRegressionModel")

Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/GradientBoostingRegressionExample.scala" in the Spark repo.

Refer to the GradientBoostedTrees Java docs and GradientBoostedTreesModel Java docs for details on the API.

import java.util.HashMap;
import java.util.Map;
import scala.Tuple2;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.tree.GradientBoostedTrees;
import org.apache.spark.mllib.tree.configuration.BoostingStrategy;
import org.apache.spark.mllib.tree.model.GradientBoostedTreesModel;
import org.apache.spark.mllib.util.MLUtils;

SparkConf sparkConf = new SparkConf()
  .setAppName("JavaGradientBoostedTreesRegressionExample");
JavaSparkContext jsc = new JavaSparkContext(sparkConf);
// Load and parse the data file.
String datapath = "data/mllib/sample_libsvm_data.txt";
JavaRDD<LabeledPoint> data = MLUtils.loadLibSVMFile(jsc.sc(), datapath).toJavaRDD();
// Split the data into training and test sets (30% held out for testing)
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[]{0.7, 0.3});
JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];

// Train a GradientBoostedTrees model.
// The defaultParams for Regression use SquaredError by default.
BoostingStrategy boostingStrategy = BoostingStrategy.defaultParams("Regression");
boostingStrategy.setNumIterations(3); // Note: Use more iterations in practice.
boostingStrategy.getTreeStrategy().setMaxDepth(5);
// Empty categoricalFeaturesInfo indicates all features are continuous.
Map<Integer, Integer> categoricalFeaturesInfo = new HashMap<>();
boostingStrategy.treeStrategy().setCategoricalFeaturesInfo(categoricalFeaturesInfo);
GradientBoostedTreesModel model = GradientBoostedTrees.train(trainingData, boostingStrategy);

// Evaluate model on test instances and compute test error
JavaPairRDD<Double, Double> predictionAndLabel =
  testData.mapToPair(p -> new Tuple2<>(model.predict(p.features()), p.label()));
double testMSE = predictionAndLabel.mapToDouble(pl -> {
  double diff = pl._1() - pl._2();
  return diff * diff;
}).mean();
System.out.println("Test Mean Squared Error: " + testMSE);
System.out.println("Learned regression GBT model:\n" + model.toDebugString());

// Save and load model
model.save(jsc.sc(), "target/tmp/myGradientBoostingRegressionModel");
GradientBoostedTreesModel sameModel = GradientBoostedTreesModel.load(jsc.sc(),
  "target/tmp/myGradientBoostingRegressionModel");

Find full example code at "examples/src/main/java/org/apache/spark/examples/mllib/JavaGradientBoostingRegressionExample.java" in the Spark repo.
