Text classification with Scikit-Learn
This tutorial covers the basics of using a machine learning (ML) backend with Label Studio, demonstrated with a simple text classification model powered by the scikit-learn library.
Follow this tutorial with a text classification project, where the labeling interface uses a control tag and an object tag. Here is an example labeling configuration that you can use:
<View>
  <Text name="news" value="$text"/>
  <Choices name="topic" toName="news">
    <Choice value="Politics"/>
    <Choice value="Technology"/>
    <Choice value="Sport"/>
    <Choice value="Weather"/>
  </Choices>
</View>
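For context, value="$text" on the <Text> tag means every labeling task is expected to provide its article under the text key of its data. A hypothetical task (the wording is made up) would look like this:
# A hypothetical input task: value="$text" in the config means the article
# text must live under the 'text' key of the task's data dictionary
task = {'data': {'text': 'The government announced new elections today.'}}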
Create a model script
If you create an ML backend using Label Studio's ML SDK, make sure your ML backend script does the following (a bare-bones sketch of this contract appears right after the list):
- Inherit the created model class from label_studio_ml.LabelStudioMLBase.
- Override the 2 methods:
  - predict(), which takes input tasks and outputs predictions in the Label Studio JSON format.
  - fit(), which receives an iterable of annotations and returns a dictionary with the created links and resources. This dictionary is later used to load the model through the self.train_output field.
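The snippet below is only a bare-bones illustration of that contract: the class name MyModel, the empty result list, and the placeholder model file path are made up, and the complete working implementation is the model.py script created in the next step.
from label_studio_ml.model import LabelStudioMLBase


class MyModel(LabelStudioMLBase):

    def predict(self, tasks, **kwargs):
        # one prediction per task, in the Label Studio JSON format:
        # a list of {'result': [...], 'score': ...} dictionaries
        return [{'result': [], 'score': 0.0} for _ in tasks]

    def fit(self, completions, workdir=None, **kwargs):
        # describe the produced training artifacts; this dictionary later
        # becomes available to the model as self.train_output
        return {'model_file': 'placeholder/model.pkl'}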
Create a file named model.py with the following content:
import pickle
import os
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from label_studio_ml.model import LabelStudioMLBase
class SimpleTextClassifier(LabelStudioMLBase):

    def __init__(self, **kwargs):
        # don't forget to initialize base class...
        super(SimpleTextClassifier, self).__init__(**kwargs)

        # then collect all keys from config which will be used to extract data from task and to form prediction
        # Parsed label config contains only one output of <Choices> type
        assert len(self.parsed_label_config) == 1
        self.from_name, self.info = list(self.parsed_label_config.items())[0]
        assert self.info['type'] == 'Choices'

        # the model has only one textual input
        assert len(self.info['to_name']) == 1
        assert len(self.info['inputs']) == 1
        assert self.info['inputs'][0]['type'] == 'Text'
        self.to_name = self.info['to_name'][0]
        self.value = self.info['inputs'][0]['value']

        if not self.train_output:
            # if there are no previous training runs, cold-start a simple TF-IDF text classifier
            self.reset_model()
            # This is an array of <Choice> labels
            self.labels = self.info['labels']
            # make some dummy initialization
            self.model.fit(X=self.labels, y=list(range(len(self.labels))))
            print('Initialized with from_name={from_name}, to_name={to_name}, labels={labels}'.format(
                from_name=self.from_name, to_name=self.to_name, labels=str(self.labels)
            ))
        else:
            # otherwise load the model from the latest training results
            self.model_file = self.train_output['model_file']
            with open(self.model_file, mode='rb') as f:
                self.model = pickle.load(f)
            # and use the labels from training outputs
            self.labels = self.train_output['labels']
            print('Loaded from train output with from_name={from_name}, to_name={to_name}, labels={labels}'.format(
                from_name=self.from_name, to_name=self.to_name, labels=str(self.labels)
            ))

    def reset_model(self):
        self.model = make_pipeline(TfidfVectorizer(ngram_range=(1, 3)), LogisticRegression(C=10, verbose=True))

    def predict(self, tasks, **kwargs):
        # collect input texts
        input_texts = []
        for task in tasks:
            input_texts.append(task['data'][self.value])

        # get model predictions
        probabilities = self.model.predict_proba(input_texts)
        predicted_label_indices = np.argmax(probabilities, axis=1)
        predicted_scores = probabilities[np.arange(len(predicted_label_indices)), predicted_label_indices]
        predictions = []
        for idx, score in zip(predicted_label_indices, predicted_scores):
            predicted_label = self.labels[idx]
            # prediction result for the single task
            result = [{
                'from_name': self.from_name,
                'to_name': self.to_name,
                'type': 'choices',
                'value': {'choices': [predicted_label]}
            }]

            # expand predictions with their scores for all tasks
            predictions.append({'result': result, 'score': score})
        return predictions

    def fit(self, completions, workdir=None, **kwargs):
        input_texts = []
        output_labels, output_labels_idx = [], []
        label2idx = {l: i for i, l in enumerate(self.labels)}

        for completion in completions:
            # get input text from task data
            print(completion)
            if completion['annotations'][0].get('skipped') or completion['annotations'][0].get('was_cancelled'):
                continue

            input_text = completion['data'][self.value]
            input_texts.append(input_text)

            # get an annotation
            output_label = completion['annotations'][0]['result'][0]['value']['choices'][0]
            output_labels.append(output_label)
            output_label_idx = label2idx[output_label]
            output_labels_idx.append(output_label_idx)

        new_labels = set(output_labels)
        if len(new_labels) != len(self.labels):
            self.labels = list(sorted(new_labels))
            print('Label set has been changed: ' + str(self.labels))
            label2idx = {l: i for i, l in enumerate(self.labels)}
            output_labels_idx = [label2idx[label] for label in output_labels]

        # train the model
        self.reset_model()
        self.model.fit(input_texts, output_labels_idx)

        # save output resources
        model_file = os.path.join(workdir, 'model.pkl')
        with open(model_file, mode='wb') as fout:
            pickle.dump(self.model, fout)
        train_output = {
            'labels': self.labels,
            'model_file': model_file
        }
        return train_output
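If you want to sanity-check model.py locally before wiring it into a backend, a rough smoke test like the one below can help. It is only a sketch: it assumes the installed label_studio_ml version lets you pass label_config and train_output keyword arguments to the constructor (older SDK releases do; adjust for your version), and the example texts and annotations are made up.
# A hypothetical local smoke test, not part of the tutorial.
# Assumes the LabelStudioMLBase constructor accepts label_config/train_output
# keyword arguments; adjust for your label_studio_ml version.
from model import SimpleTextClassifier

label_config = '''
<View>
  <Text name="news" value="$text"/>
  <Choices name="topic" toName="news">
    <Choice value="Politics"/>
    <Choice value="Technology"/>
    <Choice value="Sport"/>
    <Choice value="Weather"/>
  </Choices>
</View>
'''

# made-up annotated examples in the shape that fit() reads above
completions = [
    {'data': {'text': 'The cup final ended 2-0 after extra time.'},
     'annotations': [{'result': [{'value': {'choices': ['Sport']}}]}]},
    {'data': {'text': 'Parliament passed the new budget bill today.'},
     'annotations': [{'result': [{'value': {'choices': ['Politics']}}]}]},
]

# train from scratch and save model.pkl into the current directory
model = SimpleTextClassifier(label_config=label_config)
train_output = model.fit(completions, workdir='.')

# reload the trained model from its training output and request a prediction
model = SimpleTextClassifier(label_config=label_config, train_output=train_output)
print(model.predict([{'data': {'text': 'Heavy rain is expected tomorrow.'}}]))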
Create ML backend configs & scripts
Label Studio can automatically create all the necessary configs and scripts needed to run your ML backend from your newly created model.
Call your ML backend my_backend and use the command line to initialize the ML backend directory ./my_backend:
label-studio-ml init my_backend
The last command takes your script ./model.py and creates a ./my_backend directory at the same level, copying the configs and scripts needed to launch the ML backend in either development or production mode.
Note
You can specify a different location for your model script, for example: label-studio-ml init my_backend --script /path/to/my/script.py
Launch the ML backend server
Development mode
In development mode, training and inference are done in the same process, so the server doesn't respond to incoming prediction requests while the model is being trained.
To launch the ML backend server in Flask development mode, run the following from the command line:
label-studio-ml start my_backend
The server starts at http://localhost:9090 and outputs logs to the console.
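Before connecting the backend to Label Studio, you can quickly check that it is reachable. The command below assumes the default health endpoint exposed by the label_studio_ml server; adjust it if your version differs:
# check that the ML backend is up and responding (assumes the default
# /health endpoint of the label_studio_ml server)
curl http://localhost:9090/health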
Production mode
Production mode is powered by a Redis server and RQ jobs that handle the background training process. This means you can start training your model and continue requesting predictions from the current model state. After the model finishes training, the new model version updates automatically.
For production mode, make sure you have Docker and docker-compose installed on your system. Then run the following from the command line:
cd my_backend/
docker-compose up
You can explore runtime logs in my_backend/logs/uwsgi.log and RQ training logs in my_backend/logs/rq.log.
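For example, to follow the RQ training log in real time while a background training job is running (the path is the one mentioned above):
# follow the background training log in real time
tail -f my_backend/logs/rq.log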
Use the ML backend with Label Studio
Initialize and start a new Label Studio project, connecting it to the running ML backend:
label-studio start my_project --init --ml-backends http://localhost:9090
Get predictions
You should see model predictions in the labeling interface. See Set up machine learning with Label Studio.
Model training
Trigger model training manually by clicking the Start training button on the Machine Learning page of the project settings, or use an API call:
curl -X POST http://localhost:8080/api/models/train