Text classification with Scikit-Learn
This tutorial covers the basics of using a machine learning (ML) backend with Label Studio, demonstrated with a simple text classification model powered by the scikit-learn library.
Follow this tutorial with a text classification project, where the labeling interface uses a control tag and an object tag. Here is an example labeling configuration that you can use:
<View>
  <Text name="news" value="$text"/>
  <Choices name="topic" toName="news">
    <Choice value="Politics"/>
    <Choice value="Technology"/>
    <Choice value="Sport"/>
    <Choice value="Weather"/>
  </Choices>
</View>
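For context, value="$text" on the <Text> tag means every labeling task is expected to provide its article under the text key of its data. A hypothetical task (the wording is made up) would look like this:
# A hypothetical input task: value="$text" in the config means the article
# text must live under the 'text' key of the task's data dictionary
task = {'data': {'text': 'The government announced new elections today.'}}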
Create a model script
If you create an ML backend using Label Studio's ML SDK, make sure your ML backend script does the following (a bare-bones sketch of this contract appears right after the list):
- Inherit the created model class from label_studio_ml.LabelStudioMLBase.
- Override the 2 methods:
  - predict(), which takes input tasks and outputs predictions in the Label Studio JSON format.
  - fit(), which receives an iterable of annotations and returns a dictionary with the created links and resources. This dictionary is later used to load the model through the self.train_output field.
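The snippet below is only a bare-bones illustration of that contract: the class name MyModel, the empty result list, and the placeholder model file path are made up, and the complete working implementation is the model.py script created in the next step.
from label_studio_ml.model import LabelStudioMLBase


class MyModel(LabelStudioMLBase):

    def predict(self, tasks, **kwargs):
        # one prediction per task, in the Label Studio JSON format:
        # a list of {'result': [...], 'score': ...} dictionaries
        return [{'result': [], 'score': 0.0} for _ in tasks]

    def fit(self, completions, workdir=None, **kwargs):
        # describe the produced training artifacts; this dictionary later
        # becomes available to the model as self.train_output
        return {'model_file': 'placeholder/model.pkl'}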
Create a file named model.py with the following content:
import pickle
import os
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from label_studio_ml.model import LabelStudioMLBase
class SimpleTextClassifier(LabelStudioMLBase):

    def __init__(self, **kwargs):
        # don't forget to initialize base class...
        super(SimpleTextClassifier, self).__init__(**kwargs)

        # then collect all keys from config which will be used to extract data from task and to form prediction
        # Parsed label config contains only one output of <Choices> type
        assert len(self.parsed_label_config) == 1
        self.from_name, self.info = list(self.parsed_label_config.items())[0]
        assert self.info['type'] == 'Choices'

        # the model has only one textual input
        assert len(self.info['to_name']) == 1
        assert len(self.info['inputs']) == 1
        assert self.info['inputs'][0]['type'] == 'Text'
        self.to_name = self.info['to_name'][0]
        self.value = self.info['inputs'][0]['value']

        if not self.train_output:
            # if there are no previous training runs, cold-start a simple TF-IDF text classifier
            self.reset_model()
            # This is an array of <Choice> labels
            self.labels = self.info['labels']
            # make some dummy initialization
            self.model.fit(X=self.labels, y=list(range(len(self.labels))))
            print('Initialized with from_name={from_name}, to_name={to_name}, labels={labels}'.format(
                from_name=self.from_name, to_name=self.to_name, labels=str(self.labels)
            ))
        else:
            # otherwise load the model from the latest training results
            self.model_file = self.train_output['model_file']
            with open(self.model_file, mode='rb') as f:
                self.model = pickle.load(f)
            # and use the labels from training outputs
            self.labels = self.train_output['labels']
            print('Loaded from train output with from_name={from_name}, to_name={to_name}, labels={labels}'.format(
                from_name=self.from_name, to_name=self.to_name, labels=str(self.labels)
            ))

    def reset_model(self):
        self.model = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), LogisticRegression(C=10, verbose=True))

    def predict(self, tasks, **kwargs):
        # collect input texts
        input_texts = []
        for task in tasks:
            input_texts.append(task['data'][self.value])

        # get model predictions
        probabilities = self.model.predict_proba(input_texts)
        predicted_label_indices = np.argmax(probabilities, axis=1)
        predicted_scores = probabilities[np.arange(len(predicted_label_indices)), predicted_label_indices]
        predictions = []
        for idx, score in zip(predicted_label_indices, predicted_scores):
            predicted_label = self.labels[idx]
            # prediction result for the single task
            result = [{
                'from_name': self.from_name,
                'to_name': self.to_name,
                'type': 'choices',
                'value': {'choices': [predicted_label]}
            }]

            # expand predictions with their scores for all tasks
            predictions.append({'result': result, 'score': score})
        return predictions

    def fit(self, completions, workdir=None, **kwargs):
        input_texts = []
        output_labels, output_labels_idx = [], []
        label2idx = {l: i for i, l in enumerate(self.labels)}

        for completion in completions:
            # get input text from task data
            print(completion)
            if completion['annotations'][0].get('skipped') or completion['annotations'][0].get('was_cancelled'):
                continue

            input_text = completion['data'][self.value]
            input_texts.append(input_text)

            # get an annotation
            output_label = completion['annotations'][0]['result'][0]['value']['choices'][0]
            output_labels.append(output_label)
            output_label_idx = label2idx[output_label]
            output_labels_idx.append(output_label_idx)

        new_labels = set(output_labels)
        if len(new_labels) != len(self.labels):
            self.labels = list(sorted(new_labels))
            print('Label set has been changed: ' + str(self.labels))
            label2idx = {l: i for i, l in enumerate(self.labels)}
            output_labels_idx = [label2idx[label] for label in output_labels]

        # train the model
        self.reset_model()
        self.model.fit(input_texts, output_labels_idx)

        # save output resources
        model_file = os.path.join(workdir, 'model.pkl')
        with open(model_file, mode='wb') as fout:
            pickle.dump(self.model, fout)
        train_output = {
            'labels': self.labels,
            'model_file': model_file
        }
        return train_output
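If you want to sanity-check model.py locally before wiring it into a backend, a rough smoke test like the one below can help. It is only a sketch: it assumes the installed label_studio_ml version lets you pass label_config and train_output keyword arguments to the constructor (older SDK releases do; adjust for your version), and the example texts and annotations are made up.
# A hypothetical local smoke test, not part of the tutorial.
# Assumes the LabelStudioMLBase constructor accepts label_config/train_output
# keyword arguments; adjust for your label_studio_ml version.
from model import SimpleTextClassifier

label_config = '''
<View>
  <Text name="news" value="$text"/>
  <Choices name="topic" toName="news">
    <Choice value="Politics"/>
    <Choice value="Technology"/>
    <Choice value="Sport"/>
    <Choice value="Weather"/>
  </Choices>
</View>
'''

# made-up annotated examples in the shape that fit() reads above
completions = [
    {'data': {'text': 'The cup final ended 2-0 after extra time.'},
     'annotations': [{'result': [{'value': {'choices': ['Sport']}}]}]},
    {'data': {'text': 'Parliament passed the new budget bill today.'},
     'annotations': [{'result': [{'value': {'choices': ['Politics']}}]}]},
]

# train from scratch and save model.pkl into the current directory
model = SimpleTextClassifier(label_config=label_config)
train_output = model.fit(completions, workdir='.')

# reload the trained model from its training output and request a prediction
model = SimpleTextClassifier(label_config=label_config, train_output=train_output)
print(model.predict([{'data': {'text': 'Heavy rain is expected tomorrow.'}}]))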
Create ML backend configs & scripts
Label Studio can automatically create all the necessary configs and scripts needed to run your ML backend from your newly created model.
Call your ML backend my_backend and use the command line to initialize the ML backend directory ./my_backend:
label-studio-ml init my_backend
The last command takes your script ./model.py and creates a ./my_backend directory at the same level, copying the configs and scripts needed to launch the ML backend in either development or production mode.
Note
You can specify a different location for your model script, for example: label-studio-ml init my_backend --script /path/to/my/script.py
Launch the ML backend server
Development mode
In development mode, training and inference are done in the same process, so the server doesn't respond to incoming prediction requests while the model is being trained.
To launch the ML backend server in Flask development mode, run the following from the command line:
label-studio-ml start my_backend
The server starts at http://localhost:9090 and outputs logs to the console.
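Before connecting the backend to Label Studio, you can quickly check that it is reachable. The command below assumes the default health endpoint exposed by the label_studio_ml server; adjust it if your version differs:
# check that the ML backend is up and responding (assumes the default
# /health endpoint of the label_studio_ml server)
curl http://localhost:9090/health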
Production mode
Production mode is powered by a Redis server and RQ jobs that handle the background training process. This means you can start training your model and continue requesting predictions from the current model state. After the model finishes training, the new model version updates automatically.
For production mode, make sure you have Docker and docker-compose installed on your system. Then run the following from the command line:
cd my_backend/
docker-compose up
You can explore runtime logs in my_backend/logs/uwsgi.log and RQ training logs in my_backend/logs/rq.log.
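For example, to follow the RQ training log in real time while a background training job is running (the path is the one mentioned above):
# follow the background training log in real time
tail -f my_backend/logs/rq.log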
Use the ML backend with Label Studio
Initialize and start a new Label Studio project, connecting it to the running ML backend:
label-studio start my_project --init --ml-backends http://localhost:9090
Get predictions
You should see model predictions in the labeling interface. See Set up machine learning with Label Studio.
Model training
Trigger model training manually by clicking the Start training button on the Machine Learning page of the project settings, or use an API call:
curl -X POST http://localhost:8080/api/models/train