Classification¶
Setup¶
In [ ]:
pip install ydf -U
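What is classification?¶
Classification is the task of predicting a categorical value, such as an enum, type, or class, from a finite set of possible values. For instance, predicting a color from the set of possible colors (e.g., red, blue, green) is a classification task. The output of a classification model is a probability distribution over the possible classes. The predicted class is the one with the highest probability.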
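As a minimal sketch of the last two sentences, the snippet below picks the predicted class from a probability distribution. The class names and probabilities are made-up illustrative values, not the output of a YDF model.

```python
# Hypothetical probability distribution over three classes (illustrative values).
probabilities = {"red": 0.2, "blue": 0.7, "green": 0.1}

# A valid distribution sums to 1 (up to floating-point error).
assert abs(sum(probabilities.values()) - 1.0) < 1e-9

# The predicted class is the one with the highest probability.
predicted_class = max(probabilities, key=probabilities.get)
print(predicted_class)  # blue
```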
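When there are only two classes, the task is called binary classification. In that case, the model returns a single probability, since the probability of the other class is its complement.
Classification labels can be strings, integers, or boolean values.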
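In the binary case, a single probability is enough to recover a class. The sketch below assumes that the probability refers to the second ("positive") class and that a 0.5 threshold is used. The helper name, class names, and values are hypothetical and only serve as an illustration.

```python
def to_class(p_positive, classes=("negative", "positive"), threshold=0.5):
    """Maps the probability of the positive class to a class name.

    Assumes `classes` is ordered as (negative, positive).
    """
    return classes[1] if p_positive >= threshold else classes[0]

# The probability of the negative class is the complement of the positive one.
p_positive = 0.8
p_negative = 1.0 - p_positive

print(to_class(p_positive))  # positive
print(to_class(0.3))         # negative
```

A different threshold can be passed when the two kinds of errors do not have the same cost.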
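Training a classification model¶
The task of a model (e.g., classification, regression) is determined by the task learner argument. The default value of this argument is ydf.Task.CLASSIFICATION, so YDF trains classification models by default.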
In [2]:
# Load libraries
import ydf  # Yggdrasil Decision Forests
import pandas as pd  # We use Pandas to load small datasets.
# Download a classification dataset and load it as Pandas DataFrames.
ds_path = "https://raw.githubusercontent.com/google/yggdrasil-decision-forests/main/yggdrasil_decision_forests/test_data/dataset"
train_ds = pd.read_csv(f"{ds_path}/adult_train.csv")
test_ds = pd.read_csv(f"{ds_path}/adult_test.csv")
# Print the first 5 training examples
train_ds.head(5)
Out[2]:
| | age | workclass | fnlwgt | education | education_num | marital_status | occupation | relationship | race | sex | capital_gain | capital_loss | hours_per_week | native_country | income |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 44 | Private | 228057 | 7th-8th | 4 | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | 0 | 0 | 40 | Dominican-Republic | <=50K |
| 1 | 20 | Private | 299047 | Some-college | 10 | Never-married | Other-service | Not-in-family | White | Female | 0 | 0 | 20 | United-States | <=50K |
| 2 | 40 | Private | 342164 | HS-grad | 9 | Separated | Adm-clerical | Unmarried | White | Female | 0 | 0 | 37 | United-States | <=50K |
| 3 | 30 | Private | 361742 | Some-college | 10 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 50 | United-States | <=50K |
| 4 | 67 | Self-emp-inc | 171564 | HS-grad | 9 | Married-civ-spouse | Prof-specialty | Wife | White | Female | 20051 | 0 | 30 | England | >50K |
The label column is:
In [3]:
train_ds["income"]
Out[3]:
0 <=50K
1 <=50K
2 <=50K
3 <=50K
4 >50K
...
22787 <=50K
22788 >50K
22789 <=50K
22790 <=50K
22791 <=50K
Name: income, Length: 22792, dtype: object
We can train a classification model:
In [4]:
model = ydf.RandomForestLearner(
    label="income",
    task=ydf.Task.CLASSIFICATION,
).train(train_ds)
# Note: ydf.Task.CLASSIFICATION is the default value of "task".
assert model.task() == ydf.Task.CLASSIFICATION
Train model on 22792 examples
Model trained in 0:00:01.179527
Classification models are evaluated using accuracy, the confusion matrix, ROC-AUC, and PR-AUC.
In [5]:
evaluation = model.evaluate(test_ds)
print(evaluation)
accuracy: 0.866005
confusion matrix:
label (row) \ prediction (col)
+-------+-------+-------+
| | <=50K | >50K |
+-------+-------+-------+
| <=50K | 6976 | 873 |
+-------+-------+-------+
| >50K | 436 | 1484 |
+-------+-------+-------+
characteristics:
name: '>50K' vs others
ROC AUC: 0.908676
PR AUC: 0.790029
Num thresholds: 302
loss: 0.394958
num examples: 9769
num examples (weighted): 9769
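The reported accuracy can be checked by hand from the confusion matrix above: it is the sum of the diagonal (the correct predictions) divided by the total number of examples. Since only the diagonal and the total are involved, the result does not depend on which axis holds the labels. The counts below are copied from the printed report; the variable names are illustrative.

```python
# Counts copied from the confusion matrix printed above.
confusion = [
    [6976, 873],
    [436, 1484],
]

correct = confusion[0][0] + confusion[1][1]  # diagonal entries
total = sum(sum(row) for row in confusion)   # all evaluated examples

accuracy = correct / total
print(total)               # 9769
print(round(accuracy, 6))  # 0.866005
```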
In a notebook, displaying the evaluation object renders a rich report that includes ROC and PR plots.
In [6]:
evaluation
Out[6]:
accuracy: 0.866005
AUC: '>50K' vs others: 0.908676
PR-AUC: '>50K' vs others: 0.790029
loss: 0.394958
num examples: 9769
num examples (weighted): 9769
| Label \ Pred | <=50K | >50K |
|---|---|---|
| <=50K | 6976 | 436 |
| >50K | 873 | 1484 |
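
The ROC AUC reported above for '>50K' vs others can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below applies this pairwise definition to toy data. It is an illustration only: the function name, labels, and scores are hypothetical, and YDF's own computation may differ (the report mentions a number of thresholds).

```python
def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half. Quadratic in the number of examples, so only
    suitable for small illustrations.
    """
    positives = [s for y, s in zip(labels, scores) if y]
    negatives = [s for y, s in zip(labels, scores) if not y]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data: True marks the positive class.
labels = [True, True, False, True, False, False]
scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]

# 8 of the 9 (positive, negative) pairs are ranked correctly.
auc = roc_auc(labels, scores)
print(auc)
```

A value of 1.0 means that every positive example is scored above every negative one, while 0.5 corresponds to a random ranking.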