.. DO NOT EDIT. .. THIS FILE WAS AUTOMATICALLY GENERATED BY SPHINX-GALLERY. .. TO MAKE CHANGES, EDIT THE SOURCE PYTHON FILE: .. "auto_examples/linear_model/plot_sparse_logistic_regression_20newsgroups.py" .. LINE NUMBERS ARE GIVEN BELOW. .. only:: html .. note:: :class: sphx-glr-download-link-note :ref:`Go to the end ` to download the full example code. or to run this example in your browser via Binder .. rst-class:: sphx-glr-example-title .. _sphx_glr_auto_examples_linear_model_plot_sparse_logistic_regression_20newsgroups.py: ==================================================== 20类新闻组数据集上的多分类稀疏逻辑回归 ==================================================== 对比多项逻辑回归L1与一对多L1逻辑回归在newgroups20数据集上对文档进行分类的效果。多项逻辑回归在大规模数据集上训练速度更快且结果更准确。 在这里,我们使用L1稀疏性来将不具信息量的特征权重修剪为零。如果目标是提取每个类别的强区分性词汇,这是有益的。如果目标是获得最佳预测准确性,则最好使用不引入稀疏性的L2惩罚。 一种更传统(且可能更好)的在稀疏输入特征子集上进行预测的方法是使用单变量特征选择,然后使用传统的(L2惩罚的)逻辑回归模型。 .. GENERATED FROM PYTHON SOURCE LINES 13-123 .. image-sg:: /auto_examples/linear_model/images/sphx_glr_plot_sparse_logistic_regression_20newsgroups_001.png :alt: Multinomial vs One-vs-Rest Logistic L1 Dataset 20newsgroups :srcset: /auto_examples/linear_model/images/sphx_glr_plot_sparse_logistic_regression_20newsgroups_001.png :class: sphx-glr-single-img .. rst-class:: sphx-glr-script-out .. code-block:: none Dataset 20newsgroup, train_samples=4500, n_features=130107, n_classes=20 [model=One versus Rest, solver=saga] Number of epochs: 1 [model=One versus Rest, solver=saga] Number of epochs: 2 [model=One versus Rest, solver=saga] Number of epochs: 3 Test accuracy for model ovr: 0.5960 % non-zero coefficients for model ovr, per class: [0.26593496 0.43348936 0.26362917 0.31973683 0.37815029 0.2928359 0.27054655 0.62717609 0.19522393 0.30897646 0.34586917 0.28207552 0.34125758 0.29898468 0.34279478 0.59489497 0.38353048 0.35278655 0.19829832 0.14603365] Run time (3 epochs) for model ovr:2.80 [model=Multinomial, solver=saga] Number of epochs: 1 [model=Multinomial, solver=saga] Number of epochs: 2 [model=Multinomial, solver=saga] Number of epochs: 5 Test accuracy for model multinomial: 0.6440 % non-zero coefficients for model multinomial, per class: [0.36047253 0.1268187 0.10606655 0.17985197 0.5395559 0.07993421 0.06686804 0.21443888 0.11528972 0.2075215 0.10914094 0.11144673 0.13988486 0.09684337 0.26286057 0.11682692 0.55800226 0.17370318 0.11452112 0.14603365] Run time (5 epochs) for model multinomial:2.83 Example run in 11.676 s | .. code-block:: Python # Author: Arthur Mensch import timeit import warnings import matplotlib.pyplot as plt import numpy as np from sklearn.datasets import fetch_20newsgroups_vectorized from sklearn.exceptions import ConvergenceWarning from sklearn.linear_model import LogisticRegression from sklearn.model_selection import train_test_split from sklearn.multiclass import OneVsRestClassifier warnings.filterwarnings("ignore", category=ConvergenceWarning, module="sklearn") t0 = timeit.default_timer() # 我们使用SAGA求解器 solver = "saga" # 降低以加快运行时间 n_samples = 5000 X, y = fetch_20newsgroups_vectorized(subset="all", return_X_y=True) X = X[:n_samples] y = y[:n_samples] X_train, X_test, y_train, y_test = train_test_split( X, y, random_state=42, stratify=y, test_size=0.1 ) train_samples, n_features = X_train.shape n_classes = np.unique(y).shape[0] print( "Dataset 20newsgroup, train_samples=%i, n_features=%i, n_classes=%i" % (train_samples, n_features, n_classes) ) models = { "ovr": {"name": "One versus Rest", "iters": [1, 2, 3]}, "multinomial": {"name": "Multinomial", "iters": [1, 2, 5]}, } for model in models: # 添加初始机会水平值以用于绘图目的 accuracies = [1 / n_classes] times = [0] densities = [1] model_params = models[model] # 少量的训练周期以加快运行时间 for this_max_iter in model_params["iters"]: print( "[model=%s, solver=%s] Number of epochs: %s" % (model_params["name"], solver, this_max_iter) ) clf = LogisticRegression( solver=solver, penalty="l1", max_iter=this_max_iter, random_state=42, ) if model == "ovr": clf = OneVsRestClassifier(clf) t1 = timeit.default_timer() clf.fit(X_train, y_train) train_time = timeit.default_timer() - t1 y_pred = clf.predict(X_test) accuracy = np.sum(y_pred == y_test) / y_test.shape[0] if model == "ovr": coef = np.concatenate([est.coef_ for est in clf.estimators_]) else: coef = clf.coef_ density = np.mean(coef != 0, axis=1) * 100 accuracies.append(accuracy) densities.append(density) times.append(train_time) models[model]["times"] = times models[model]["densities"] = densities models[model]["accuracies"] = accuracies print("Test accuracy for model %s: %.4f" % (model, accuracies[-1])) print( "%% non-zero coefficients for model %s, per class:\n %s" % (model, densities[-1]) ) print( "Run time (%i epochs) for model %s:%.2f" % (model_params["iters"][-1], model, times[-1]) ) fig = plt.figure() ax = fig.add_subplot(111) for model in models: name = models[model]["name"] times = models[model]["times"] accuracies = models[model]["accuracies"] ax.plot(times, accuracies, marker="o", label="Model: %s" % name) ax.set_xlabel("Train time (s)") ax.set_ylabel("Test accuracy") ax.legend() fig.suptitle("Multinomial vs One-vs-Rest Logistic L1\nDataset %s" % "20newsgroups") fig.tight_layout() fig.subplots_adjust(top=0.85) run_time = timeit.default_timer() - t0 print("Example run in %.3f s" % run_time) plt.show() .. rst-class:: sphx-glr-timing **Total running time of the script:** (0 minutes 11.717 seconds) .. _sphx_glr_download_auto_examples_linear_model_plot_sparse_logistic_regression_20newsgroups.py: .. only:: html .. container:: sphx-glr-footer sphx-glr-footer-example .. container:: binder-badge .. image:: images/binder_badge_logo.svg :target: https://mybinder.org/v2/gh/scikit-learn/scikit-learn/main?urlpath=lab/tree/notebooks/auto_examples/linear_model/plot_sparse_logistic_regression_20newsgroups.ipynb :alt: Launch binder :width: 150 px .. container:: sphx-glr-download sphx-glr-download-jupyter :download:`Download Jupyter notebook: plot_sparse_logistic_regression_20newsgroups.ipynb ` .. container:: sphx-glr-download sphx-glr-download-python :download:`Download Python source code: plot_sparse_logistic_regression_20newsgroups.py ` .. container:: sphx-glr-download sphx-glr-download-zip :download:`Download zipped: plot_sparse_logistic_regression_20newsgroups.zip ` .. include:: plot_sparse_logistic_regression_20newsgroups.recommendations .. only:: html .. rst-class:: sphx-glr-signature `Gallery generated by Sphinx-Gallery `_