Parallel Training

Larger datasets require more time for training. While by default the models in HiClass are trained using a single core, it is possible to train each local classifier in parallel by leveraging the Ray library [1]. If Ray is not installed, parallelism defaults to Joblib. In this example, we demonstrate how to train a hierarchical classifier in parallel by setting the parameter n_jobs to use all available cores. Training is performed on a mock dataset from Kaggle [2].

[1] https://www.ray.io/

[2] https://www.kaggle.com/datasets/kashnitsky/hierarchical-text-classification
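
The only change needed to enable parallelism is the n_jobs parameter of the local classifier. The snippet below is a minimal sketch that isolates this single knob, separate from the full text-classification pipeline shown afterwards:

from os import cpu_count

from sklearn.linear_model import LogisticRegression
from hiclass import LocalClassifierPerParentNode

# n_jobs=1 (the default) trains every local classifier sequentially;
# n_jobs=cpu_count() trains the local classifiers in parallel,
# using Ray if it is installed and falling back to Joblib otherwise.
parallel_lcppn = LocalClassifierPerParentNode(
    local_classifier=LogisticRegression(max_iter=1000),
    n_jobs=cpu_count(),
)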



import sys
from os import cpu_count
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

from hiclass import LocalClassifierPerParentNode
from hiclass.datasets import load_hierarchical_text_classification

# Load train and test splits
X_train, X_test, Y_train, Y_test = load_hierarchical_text_classification()

# We will use logistic regression classifiers for every parent node
lr = LogisticRegression(max_iter=1000)

pipeline = Pipeline(
    [
        ("count", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        (
            "lcppn",
            LocalClassifierPerParentNode(local_classifier=lr, n_jobs=cpu_count()),
        ),
    ]
)

# Fixes bug AttributeError: '_LoggingTee' object has no attribute 'fileno'
# This only happens when building the documentation
# Hence, you don't actually need it for your code to work
sys.stdout.fileno = lambda: False

# Now, let's train the local classifier per parent node
pipeline.fit(X_train, Y_train)

Out:

Pipeline(steps=[('count', CountVectorizer()), ('tfidf', TfidfTransformer()),
                ('lcppn',
                 LocalClassifierPerParentNode(local_classifier=LogisticRegression(max_iter=1000),
                                              n_jobs=2))])
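
Once fitted, the pipeline behaves like any other scikit-learn estimator. As a sketch of how the result could be checked, assuming the hierarchical f1 metric exposed by hiclass.metrics, the held-out split could be evaluated as follows:

from hiclass.metrics import f1  # hierarchical F1; assumed to be available in hiclass.metrics

# Predict the full label hierarchy for the test split
Y_pred = pipeline.predict(X_test)

# Compare the predicted label paths against the true paths
print("Hierarchical F1:", f1(Y_test, Y_pred))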

Total running time of the script: (1 minute 8.821 seconds)

Gallery generated by Sphinx-Gallery