Text-to-Text Semantic Matching with AutoMM


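Computing the similarity between two sentences or passages is a common task in natural language processing, with many practical applications such as web search, question answering, document deduplication, plagiarism detection, natural language inference, and recommendation engines. In general, a text similarity model takes two sentences or passages as input and converts them into vectors. It then measures how similar or different the two texts are with a score computed from cosine similarity, dot product, or Euclidean distance.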
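To make the three scoring functions concrete, here is a minimal NumPy sketch. The 3-dimensional vectors are hypothetical stand-ins for sentence vectors, and the helper names are illustrative rather than part of AutoMM.

```python
import numpy as np

def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine of the angle between u and v; 1.0 means the same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def dot_product(u: np.ndarray, v: np.ndarray) -> float:
    """Unnormalized similarity; grows with both alignment and vector length."""
    return float(np.dot(u, v))

def euclidean_distance(u: np.ndarray, v: np.ndarray) -> float:
    """Straight-line distance; 0.0 means identical vectors."""
    return float(np.linalg.norm(u - v))

# Hypothetical sentence vectors (real embeddings have hundreds of dimensions).
a = np.array([1.0, 2.0, 2.0])
b = np.array([2.0, 4.0, 4.0])   # same direction as a, twice the length
c = np.array([2.0, -1.0, 0.0])  # orthogonal to a

print(cosine_similarity(a, b), dot_product(a, b), euclidean_distance(a, b))
print(cosine_similarity(a, c), dot_product(a, c), euclidean_distance(a, c))
```

Cosine similarity ignores vector length, so a and b count as a perfect match even though they are a distance of 3 apart. Dot product and Euclidean distance are both sensitive to magnitude.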

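Prepare your Data

In this tutorial, we demonstrate how to use AutoMM for text-to-text semantic matching with the Stanford Natural Language Inference (SNLI) corpus. SNLI contains about 570k human-written sentence pairs labeled entailment, contradiction, or neutral. It is a widely used benchmark for evaluating the representation and inference capabilities of machine learning methods. The following table contains three examples taken from this corpus.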

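| Premise | Hypothesis | Label |
|---|---|---|
| A black race car starts up in front of a crowd of people. | A man is driving down a lonely road. | contradiction |
| An older and younger man smiling. | Two men are smiling and laughing at the cats playing on the floor. | neutral |
| A soccer game with multiple males playing. | Some men are playing a sport. | entailment |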

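Here, we treat sentence pairs labeled entailment as positive pairs (labeled 1) and those labeled contradiction as negative pairs (labeled 0). Sentence pairs with a neutral relationship are discarded. The following code downloads the corpus and loads it into dataframes.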

from autogluon.core.utils.loaders import load_pd
import pandas as pd

snli_train = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/snli/snli_train.csv', delimiter="|")
snli_test = load_pd.load('https://automl-mm-bench.s3.amazonaws.com/snli/snli_test.csv', delimiter="|")
snli_train.head()
   premise                                            hypothesis                                       label
0  A person on a horse jumps over a broken down a...  A person is at a diner , ordering an omelette .  0
1  A person on a horse jumps over a broken down a...  A person is outdoors , on a horse .              1
2  Children smiling and waving at camera              There are children present                       1
3  Children smiling and waving at camera              The kids are frowning                            0
4  A boy is jumping on skateboard in the middle o...  The boy skates down the sidewalk .               0
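The CSV files loaded above already contain 0/1 labels. If you start from data with string labels instead, the mapping described earlier (entailment to 1, contradiction to 0, neutral dropped) could be sketched as follows. The toy frame and the exact label strings are assumptions for illustration.

```python
import pandas as pd

# Toy rows in a raw SNLI-like layout; the string label names are an assumption.
raw = pd.DataFrame({
    "premise": [
        "A black race car starts up in front of a crowd of people.",
        "An older and younger man smiling.",
        "A soccer game with multiple males playing.",
    ],
    "hypothesis": [
        "A man is driving down a lonely road.",
        "Two men are smiling and laughing at the cats playing on the floor.",
        "Some men are playing a sport.",
    ],
    "label": ["contradiction", "neutral", "entailment"],
})

label_map = {"entailment": 1, "contradiction": 0}

# Keep only the pairs whose label is in the map (this drops neutral pairs),
# then convert the remaining labels to 1 (match) or 0 (no match).
pairs = raw[raw["label"].isin(label_map)].copy()
pairs["label"] = pairs["label"].map(label_map)
pairs = pairs.reset_index(drop=True)

print(pairs[["hypothesis", "label"]])
```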

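Train your Model

Ideally, we want a model that returns high scores for positive text pairs and low scores for negative ones. Traditional text similarity methods, such as those based on term-frequency or tf-idf vectors, work only at the lexical level and ignore semantics. With AutoMM, we can easily train a model that captures the semantic relationship between sentences. Basically, it uses BERT to project each sentence into a high-dimensional vector and treats the matching problem as a classification problem, following the design of sentence transformers.

With AutoMM, you only need to specify the query, response, and label column names and fit the model on the training dataset, without worrying about implementation details. Note that the labels should be binary, and we need to specify match_label, the label value that means the two sentences have the same semantic meaning. In practice, your task may use different labels, for example duplicate or not duplicate, so you may need to define match_label according to your specific task.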
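To illustrate why purely lexical methods fall short, the sketch below scores sentence pairs with cosine similarity over raw term-frequency vectors, using only the standard library. It is a simplified baseline written for this illustration, not what AutoMM does internally, and the third sentence is a made-up contradiction.

```python
import math
import re
from collections import Counter

def term_frequencies(text: str) -> Counter:
    """Lowercase the text and count word occurrences."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def lexical_cosine(a: str, b: str) -> float:
    """Cosine similarity between the term-frequency vectors of two texts."""
    ta, tb = term_frequencies(a), term_frequencies(b)
    dot = sum(ta[w] * tb[w] for w in ta.keys() & tb.keys())
    norm = math.sqrt(sum(c * c for c in ta.values())) * math.sqrt(sum(c * c for c in tb.values()))
    return dot / norm if norm else 0.0

premise = "The teacher gave his speech to an empty room."
paraphrase = "There was almost nobody when the professor was talking."
contradiction = "The teacher gave his speech to a full room."  # hypothetical example

print(lexical_cosine(premise, paraphrase))     # low: the texts share only the word "the"
print(lexical_cosine(premise, contradiction))  # high: most words are shared
```

The paraphrase scores far lower than the contradiction because word overlap, not meaning, drives the score. A learned semantic model is meant to close exactly this gap.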

from autogluon.multimodal import MultiModalPredictor

# Initialize the model
predictor = MultiModalPredictor(
    problem_type="text_similarity",
    query="premise",  # the column name of the first sentence
    response="hypothesis",  # the column name of the second sentence
    label="label",  # the label column name
    match_label=1,  # the label indicating that query and response have the same semantic meaning
    eval_metric="auc",  # the evaluation metric
)

# Fit the model
predictor.fit(
    train_data=snli_train,
    time_limit=180,
)
/home/ci/opt/venv/lib/python3.11/site-packages/mmengine/optim/optimizer/zero_optimizer.py:11: DeprecationWarning: `TorchScript` support for functional optimizers is deprecated and will be removed in a future PyTorch release. Consider using the `torch.compile` optimizer instead.
  from torch.distributed.optim import \
No path specified. Models will be saved in: "AutogluonModels/ag-20241127_100648"
=================== System Info ===================
AutoGluon Version:  1.2b20241127
Python Version:     3.11.9
Operating System:   Linux
Platform Machine:   x86_64
Platform Version:   #1 SMP Tue Sep 24 10:00:37 UTC 2024
CPU Count:          8
Pytorch Version:    2.5.1+cu124
CUDA Version:       12.4
Memory Avail:       28.20 GB / 30.95 GB (91.1%)
Disk Space Avail:   171.59 GB / 255.99 GB (67.0%)
===================================================
AutoGluon infers your prediction problem is: 'binary' (because only two unique label-values observed).
2 unique label values:  [0, 1]
If 'binary' is not the correct problem_type, please manually specify the problem_type parameter during Predictor init (You may specify problem_type as one of: ['binary', 'multiclass', 'regression', 'quantile'])
/home/ci/autogluon/multimodal/src/autogluon/multimodal/utils/metric.py:116: UserWarning: Metric auc is not supported as the evaluation metric for binary in matching tasks.The evaluation metric is changed to roc_auc by default.
  warnings.warn(

AutoMM starts to create your model. ✨✨✨

To track the learning progress, you can open a terminal and launch Tensorboard:
    ```shell
    # Assume you have installed tensorboard
    tensorboard --logdir /home/ci/autogluon/docs/tutorials/multimodal/semantic_matching/AutogluonModels/ag-20241127_100648
    ```
Seed set to 0
GPU Count: 1
GPU Count to be Used: 1
GPU 0 Name: Tesla T4
GPU 0 Memory: 0.43GB/15.0GB (Used/Total)
Using 16bit Automatic Mixed Precision (AMP)
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
| Name              | Type                         | Params | Mode 
---------------------------------------------------------------------------
0 | query_model       | HFAutoModelForTextPrediction | 33.4 M | train
1 | response_model    | HFAutoModelForTextPrediction | 33.4 M | train
2 | validation_metric | BinaryAUROC                  | 0      | train
3 | loss_func         | ContrastiveLoss              | 0      | train
4 | miner_func        | PairMarginMiner              | 0      | train
---------------------------------------------------------------------------
33.4 M    Trainable params
0         Non-trainable params
33.4 M    Total params
133.440   Total estimated model params size (MB)
13        Modules in train mode
228       Modules in eval mode
Time limit reached. Elapsed time is 0:03:00. Signaling Trainer to stop.
Epoch 0, global step 196: 'val_roc_auc' reached 0.89803 (best 0.89803), saving model to '/home/ci/autogluon/docs/tutorials/multimodal/semantic_matching/AutogluonModels/ag-20241127_100648/epoch=0-step=196.ckpt' as top 3
Start to fuse 1 checkpoints via the greedy soup algorithm.
/home/ci/autogluon/multimodal/src/autogluon/multimodal/learners/matching.py:1844: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(path, map_location=torch.device("cpu"))["state_dict"]  # nosec B614
/home/ci/autogluon/multimodal/src/autogluon/multimodal/utils/checkpoint.py:63: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  avg_state_dict = torch.load(checkpoint_paths[0], map_location=torch.device("cpu"))["state_dict"]  # nosec B614
AutoMM has created your model. 🎉🎉🎉

To load the model, use the code below:
    ```python
    from autogluon.multimodal import MultiModalPredictor
    predictor = MultiModalPredictor.load("/home/ci/autogluon/docs/tutorials/multimodal/semantic_matching/AutogluonModels/ag-20241127_100648")
    ```

If you are not satisfied with the model, try to increase the training time, 
adjust the hyperparameters (https://auto.gluon.ai/stable/tutorials/multimodal/advanced_topics/customization.html),
or post issues on GitHub (https://github.com/autogluon/autogluon/issues).
<autogluon.multimodal.predictor.MultiModalPredictor at 0x7f74d44a4ed0>

Evaluate on Test Dataset

You can evaluate the matcher on the test dataset to see how it performs in terms of the roc_auc score:

score = predictor.evaluate(snli_test)
print("evaluation score: ", score)
evaluation score:  {'roc_auc': 0.9133081751453099}
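For intuition about what the roc_auc number means: it equals the fraction of (positive, negative) pairs in which the positive example receives the higher score. The following is a minimal pure-Python sketch of that definition with made-up labels and scores. It is not the implementation AutoMM uses, and its quadratic cost makes it suitable only for small inputs.

```python
from itertools import product

def roc_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))

# Hypothetical match labels and model scores for five sentence pairs.
labels = [1, 1, 0, 0, 1]
scores = [0.9, 0.6, 0.7, 0.2, 0.8]

print(roc_auc(labels, scores))  # 5 of the 6 positive/negative pairs are ordered correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, so the metric does not depend on any particular decision threshold.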

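Predict on a New Sentence Pair

We create a new sentence pair with similar meaning (expected to be predicted as 1) and make a prediction with the trained model.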

pred_data = pd.DataFrame.from_dict({"premise":["The teacher gave his speech to an empty room."], 
                                    "hypothesis":["There was almost nobody when the professor was talking."]})

predictions = predictor.predict(pred_data)
print('Predicted label:', predictions[0])
Predicted label: 1

Predict Matching Probabilities

We can also compute the matching probability of a sentence pair.

probabilities = predictor.predict_proba(pred_data)
print(probabilities)
          0         1
0  0.206385  0.793615
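If your application needs a stricter or looser decision rule than the default prediction, you can threshold the match probability yourself. The sketch below assumes a probability frame laid out like the output above, with one column per class label. The extra rows and the `to_labels` helper are hypothetical and not part of AutoMM.

```python
import pandas as pd

# Hypothetical probabilities in the same layout as the predict_proba output:
# one column per class label (0 = no match, 1 = match).
probabilities = pd.DataFrame({0: [0.206385, 0.62, 0.45], 1: [0.793615, 0.38, 0.55]})

def to_labels(proba: pd.DataFrame, match_label=1, threshold: float = 0.5) -> pd.Series:
    """Predict match_label when its probability reaches the threshold, else the other label."""
    other = [c for c in proba.columns if c != match_label][0]
    return proba[match_label].ge(threshold).map({True: match_label, False: other})

print(to_labels(probabilities).tolist())
print(to_labels(probabilities, threshold=0.7).tolist())
```

Raising the threshold trades recall for precision: fewer pairs are declared matches, but those that are carry higher confidence.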

Extract Embeddings

Moreover, we support extracting embeddings separately for the two sentence groups.

embeddings_1 = predictor.extract_embedding({"premise":["The teacher gave his speech to an empty room."]})
print(embeddings_1.shape)
embeddings_2 = predictor.extract_embedding({"hypothesis":["There was almost nobody when the professor was talking."]})
print(embeddings_2.shape)
(1, 384)
(1, 384)
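Once embeddings are extracted, a common next step is to rank a set of candidate sentences against a query by cosine similarity. The sketch below uses random placeholder vectors of dimension 384 in place of real `extract_embedding` output, so it runs without a trained model. The `top_k_matches` helper is an illustrative name, not an AutoMM function.

```python
import numpy as np

def top_k_matches(query_emb: np.ndarray, candidate_embs: np.ndarray, k: int = 2):
    """Return (indices, scores) of the k candidates most cosine-similar to the query."""
    q = query_emb / np.linalg.norm(query_emb)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    scores = c @ q  # cosine similarity of every candidate with the query
    order = np.argsort(-scores)[:k]  # highest scores first
    return order, scores[order]

# Placeholder vectors standing in for extracted embeddings (assumption: dimension 384).
rng = np.random.default_rng(0)
query = rng.normal(size=384)
candidates = rng.normal(size=(5, 384))
candidates[3] = query + 0.1 * rng.normal(size=384)  # make candidate 3 a near-duplicate of the query

indices, scores = top_k_matches(query, candidates, k=2)
print(indices, scores)
```

With real embeddings of shape (1, 384), you would pass a single row such as `embeddings_1[0]` as the query and stack the response embeddings as the candidate matrix.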

Other Examples

You may go to AutoMM Examples to explore other examples about AutoMM.

Customization

To learn how to customize AutoMM, please refer to Customize AutoMM.