Parallel Training
Larger datasets require more time for training. While by default the models in HiClass are trained using a single core, it is possible to train each local classifier in parallel by leveraging the Ray library [1]. If Ray is not installed, parallelism defaults to Joblib.

In this example, we demonstrate how to train a hierarchical classifier in parallel by setting the parameter n_jobs to use all available cores. Training is performed on a mock dataset from Kaggle [2].
import sys
from os import cpu_count
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from hiclass import LocalClassifierPerParentNode
from hiclass.datasets import load_hierarchical_text_classification
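# Optional check (a sketch, not part of the original example): as noted
# above, HiClass parallelizes with Ray when it is installed and falls
# back to Joblib otherwise, so we can report which backend will be used
try:
    import ray  # noqa: F401
    print("Ray detected: local classifiers will be trained with Ray")
except ImportError:
    print("Ray not installed: parallelism will fall back to Joblib")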
# Load train and test splits
X_train, X_test, Y_train, Y_test = load_hierarchical_text_classification()
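# Quick peek at the data (a sketch; exact contents depend on the
# downloaded Kaggle dataset): X holds the text samples and Y the
# corresponding hierarchical category paths
print(X_train[:3])
print(Y_train[:3])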
# We will use logistic regression classifiers for every parent node
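# Setting n_jobs=cpu_count() below makes HiClass fit the local
# classifiers in parallel, using every available CPU core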
lr = LogisticRegression(max_iter=1000)
pipeline = Pipeline(
    [
        ("count", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        (
            "lcppn",
            LocalClassifierPerParentNode(local_classifier=lr, n_jobs=cpu_count()),
        ),
    ]
)
# Fixes bug AttributeError: '_LoggingTee' object has no attribute 'fileno'
# This only happens when building the documentation
# Hence, you don't actually need it for your code to work
sys.stdout.fileno = lambda: False
# Now, let's train the local classifier per parent node
pipeline.fit(X_train, Y_train)
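# Evaluation sketch (not part of the original example): predictions use
# the standard scikit-learn API, and hiclass.metrics provides
# hierarchical versions of precision, recall, and F-score; the f1
# import below assumes a recent HiClass version
from hiclass.metrics import f1

predictions = pipeline.predict(X_test)
print("Hierarchical F1:", f1(y_true=Y_test, y_pred=predictions))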
Total running time of the script: (1 minutes 8.821 seconds)