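April 10, 2024

Using GPT-4 Vision with CLIP embeddings to improve multimodal RAG

Multimodal RAG integrates additional modalities into traditional text-based RAG. It enhances an LLM's question answering by supplying extra context and grounding textual data, which improves understanding.

Adopting the approach from the Clothing Matchmaker cookbook, we embed images directly for similarity search, bypassing the lossy step of captioning them as text, to improve retrieval accuracy.

Using CLIP-based embeddings also makes it possible to fine-tune on specific data or to update the index with unseen images.

The technique is demonstrated by searching an enterprise knowledge base with user-provided images of technology in order to return relevant information.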

Installations

First let's install the relevant packages.

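#installations
%pip install torch
%pip install pillow
%pip install faiss-cpu
%pip install numpy
%pip install tqdm
%pip install matplotlib
# OpenAI's CLIP is installed from GitHub; the PyPI package named "clip" is an unrelated project
%pip install git+https://github.com/openai/CLIP.git
%pip install openai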

Then let's import all the packages we need.

# model imports
import faiss
import json
import torch
from openai import OpenAI
import torch.nn as nn
from torch.utils.data import DataLoader
import clip
client = OpenAI()

# helper imports
from tqdm import tqdm
import os
import numpy as np
import pickle
from typing import List, Union, Tuple

# visualisation imports
from PIL import Image
import matplotlib.pyplot as plt
import base64

Now let's load the CLIP model.

#load model on device. The device you are running inference/training on is either a CPU or GPU if you have.
device = "cpu"
model, preprocess = clip.load("ViT-B/32",device=device)

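We will now:

Create the image embedding database
Set up a query to the vision model
Perform the semantic search
Pass a user query to the image

Create the image embedding database

Next we will create our image embedding knowledge base from a directory of images. This will be the technology knowledge base that we search to find information relevant to the image a user uploads.

We pass in the directory in which we store our images (as JPEGs) and loop through each file to create the embeddings.

We also have a description.json file. It has one entry for every image in the knowledge base, with two keys: 'image_path' and 'description'. It maps each image to a useful description of that image, which helps in answering the user's question.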
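To make the expected file layout concrete, here is a minimal sketch of reading such a file. It assumes description.json holds one JSON object per line, which is how the loading code later in this notebook reads it. The file names and descriptions below are made up for illustration, and parse_descriptions is a hypothetical helper name.

```python
import json

# Hypothetical contents of description.json: one JSON object per line.
sample_lines = [
    '{"image_path": "image_database/example1.jpeg", "description": "Placeholder description of the first image."}',
    '{"image_path": "image_database/example2.jpeg", "description": "Placeholder description of the second image."}',
]

def parse_descriptions(lines):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]

entries = parse_descriptions(sample_lines)
# Build a path -> description lookup for constant-time access.
description_by_path = {e["image_path"]: e["description"] for e in entries}
print(description_by_path["image_database/example1.jpeg"])
```

A dictionary lookup like this is one alternative to scanning the list on every query, which may matter once the knowledge base grows.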

First we write a function to get all the image paths in a given directory. We then use it to get all the JPEG files from a directory called 'image_database'.

def get_image_paths(directory: str, number: int = None) -> List[str]:
    # Sort the listing so positions are stable across calls; the search step
    # later maps index positions back to files through this ordering.
    image_paths = sorted(
        os.path.join(directory, filename)
        for filename in os.listdir(directory)
        if filename.endswith('.jpeg')
    )
    # When `number` is given, return only the image at that position (as a one-element list)
    if number is not None:
        return [image_paths[number]]
    return image_paths
direc = 'image_database/'
image_paths = get_image_paths(direc)

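Next we write a function that uses the CLIP model to get the image embeddings for a given list of paths.

We first preprocess each image using the preprocess function we obtained earlier. This performs several operations to ensure the input to the CLIP model has the right format and dimensions, including resizing, normalization and colour channel adjustment.

We then stack the preprocessed images together so that we can pass them to the model in one go rather than in a loop. Finally we return the model output, which is a tensor of embeddings.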

def get_features_from_image_path(image_paths):
  # Preprocess each image into a tensor of the shape CLIP expects
  images = [preprocess(Image.open(image_path).convert("RGB")) for image_path in image_paths]
  # Stack into a single batch and move it to the same device as the model
  image_input = torch.stack(images).to(device)
  with torch.no_grad():
    image_features = model.encode_image(image_input).float()
  return image_features
image_features = get_features_from_image_path(image_paths)
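Stacking every image into one tensor is fine for a small knowledge base, but it could exhaust memory on a large directory. One option, sketched below, is to split the paths into fixed-size batches and embed each batch in turn. The helper name batched, the batch size and the example paths are assumptions for illustration and are not part of the original notebook.

```python
from typing import Iterator, List

def batched(items: List[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Hypothetical paths, used only to show how the batches come out.
example_paths = [f"image_database/img{i}.jpeg" for i in range(5)]
batches = list(batched(example_paths, batch_size=2))
print([len(b) for b in batches])  # [2, 2, 1]
```

Each batch could then be passed to get_features_from_image_path and the resulting tensors joined with torch.cat before indexing.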

We can now create our vector database.

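# IndexFlatIP scores matches by inner product
index = faiss.IndexFlatIP(image_features.shape[1])
# FAISS expects a float32 NumPy array, so convert the tensor explicitly
index.add(image_features.cpu().numpy())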
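IndexFlatIP ranks by raw inner product, which depends on vector length as well as direction. If cosine similarity is wanted instead, a common approach is to L2-normalize the embeddings before adding them to the index and before searching. The sketch below uses small made-up vectors in NumPy to show the effect. The helper name l2_normalize is an assumption and is not part of the original notebook.

```python
import numpy as np

def l2_normalize(vectors: np.ndarray) -> np.ndarray:
    """Scale each row to unit length (all-zero rows are left unchanged)."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    norms[norms == 0] = 1.0
    return (vectors / norms).astype("float32")

# Toy 2-D "embeddings": the second row is parallel to the query but much shorter.
database = np.array([[10.0, 1.0], [0.3, 0.4]], dtype="float32")
query = np.array([[3.0, 4.0]], dtype="float32")

raw_scores = database @ query.T  # raw inner products
cosine_scores = l2_normalize(database) @ l2_normalize(query).T

print(raw_scores.argmax())     # 0: the long vector wins on raw inner product
print(cosine_scores.argmax())  # 1: the parallel vector wins on cosine similarity
```

FAISS also provides faiss.normalize_L2, which normalizes a float32 array in place and could be used for the same purpose.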

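We also load our JSON file for the image-to-description mapping and build a list of JSON entries. In addition, we create a helper function that searches this list for a given image so that we can retrieve its description.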

data = []
image_path = 'train1.jpeg'
with open('description.json', 'r') as file:
    for line in file:
        data.append(json.loads(line))
def find_entry(data, key, value):
    for entry in data:
        if entry.get(key) == value:
            return entry
    return None

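Let's display an example image, which will be the image the user uploads. It shows a piece of technology that was unveiled at CES 2024: the DELTA Pro Ultra Whole House Battery Generator.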

im = Image.open(image_path)
plt.imshow(im)
plt.show()

[Image: Delta Pro]

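Querying the vision model

Now let's see how GPT-4 Vision, which has not seen this technology before, labels it.

First we need a function that encodes the image in base64, since this is the format we pass to the vision model. Then we create a generic image_query function that lets us query the LLM with an image input.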

def encode_image(image_path):
    with open(image_path, 'rb') as image_file:
        encoded_image = base64.b64encode(image_file.read())
        return encoded_image.decode('utf-8')

def image_query(query, image_path):
    response = client.chat.completions.create(
        model='gpt-4-vision-preview',
        messages=[
            {
            "role": "user",
            "content": [
                {
                "type": "text",
                "text": query,
                },
                {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{encode_image(image_path)}",
                },
                }
            ],
            }
        ],
        max_tokens=300,
    )
    # Extract relevant features from the response
    return response.choices[0].message.content
image_query('Write a short label of what is shown in this image?', image_path)

'Autonomous Delivery Robot'

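As we can see, the model does its best with what it was trained on, but it gets the label wrong because it has not seen anything similar in its training data. The image is ambiguous, which makes it hard to extrapolate and deduce what it shows.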
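The image_query helper above hardcodes a JPEG data URL. If the knowledge base might contain other formats, a variant could infer the MIME type from the file extension. This is a sketch only: to_data_url is a hypothetical helper name, and it is demonstrated on a throwaway temporary file rather than a real image.

```python
import base64
import mimetypes
import os
import tempfile

def to_data_url(path: str) -> str:
    """Return a base64 data URL for the file, guessing the MIME type from its extension."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        mime_type = "application/octet-stream"  # fallback when the extension is unknown
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"

# Demonstrate on a temporary file with placeholder bytes (not a real image).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.png")
    with open(demo_path, "wb") as f:
        f.write(b"placeholder bytes")
    url = to_data_url(demo_path)

print(url.split(",")[0])  # data:image/png;base64
```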

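Performing semantic search

Now let's perform a similarity search to find the two most similar images in our knowledge base. We do this by getting the embedding of the user-supplied image_path and retrieving the indices and scores of the most similar images in the database. Because the index is an IndexFlatIP, the values FAISS returns (named distances in the code) are inner products, so a larger value means a more similar image. We then sort the results in descending order so that the best match comes first.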

image_search_embedding = get_features_from_image_path([image_path])
# FAISS expects a float32 NumPy array; k=2 returns the two most similar images
distances, indices = index.search(image_search_embedding.reshape(1, -1).cpu().numpy(), 2)
distances = distances[0]
indices = indices[0]
indices_distances = list(zip(indices, distances))
# With IndexFlatIP these "distances" are inner products, so sort descending (best match first)
indices_distances.sort(key=lambda x: x[1], reverse=True)

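We need these indices because we use them to look up images in our image directory and select the image at each index position to feed into the vision model for RAG.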
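The mapping from search results back to files can be sketched as follows. It assumes the index was built from the same ordered list of paths, and that unfilled result slots come back as -1, which FAISS does when fewer than k vectors are available. The function name rank_matches and the sample values are illustrative only.

```python
from typing import List, Sequence, Tuple

def rank_matches(indices: Sequence[int], scores: Sequence[float],
                 paths: List[str]) -> List[Tuple[str, float]]:
    """Pair each valid index with its path and score, best score first."""
    matches = [
        (paths[int(i)], float(s))
        for i, s in zip(indices, scores)
        if int(i) >= 0  # -1 marks an empty result slot
    ]
    # Inner-product scores: larger means more similar.
    return sorted(matches, key=lambda m: m[1], reverse=True)

# Made-up values standing in for one row of index.search output.
paths = ["image_database/a.jpeg", "image_database/b.jpeg", "image_database/c.jpeg"]
print(rank_matches([2, 0, -1], [81.5, 93.2, -3.4e38], paths))
# [('image_database/a.jpeg', 93.2), ('image_database/c.jpeg', 81.5)]
```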

Let's see what it returned (we display the images in order of similarity):

#display similar images
for idx, distance in indices_distances:
    print(idx)
    path = get_image_paths(direc, idx)[0]
    im = Image.open(path)
    plt.imshow(im)
    plt.show()

[Image: Delta Pro2]

[Image: Delta Pro3]

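We can see that it returned two images containing the DELTA Pro Ultra Whole House Battery Generator. One of the images also has some background that could be distracting, but the search still found the right image.

User querying the most similar image

Now, for our most similar image, we want to pass it and its description to GPT-4 Vision together with a user query, so that users can ask about the technology they may have bought. This is where the vision model shows its strength: you can ask it general questions it was not explicitly trained for, and it can still respond with high accuracy.

In the example below, we ask about the capacity of the item.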

similar_path = get_image_paths(direc, indices_distances[0][0])[0]
element = find_entry(data, 'image_path', similar_path)

user_query = 'What is the capacity of this item?'
prompt = f"""
Below is a user query, I want you to answer the query using the description and image provided.

user query:
{user_query}

description:
{element['description']}
"""
image_query(prompt, similar_path)

'The portable home battery DELTA Pro has a base capacity of 3.6kWh. This capacity can be expanded up to 25kWh with additional batteries. The image showcases the DELTA Pro, which has an impressive 3600W power capacity for AC output as well.'

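We can see that it is able to answer the question. This was only possible by matching images directly and then gathering the relevant description as context.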
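Since find_entry returns None when no description matches the retrieved path, it may be worth guarding that case before building the prompt. A minimal sketch follows. The helper name build_prompt, the fallback wording and the sample entry are assumptions for illustration.

```python
from typing import Optional

def build_prompt(user_query: str, entry: Optional[dict]) -> str:
    """Assemble the RAG prompt, falling back gracefully if no description was found."""
    description = entry["description"] if entry else "No description available."
    return (
        "Below is a user query, I want you to answer the query using "
        "the description and image provided.\n\n"
        f"user query:\n{user_query}\n\n"
        f"description:\n{description}\n"
    )

# Hypothetical entry, shaped like a record from description.json.
entry = {"image_path": "image_database/example.jpeg", "description": "Placeholder description."}
print(build_prompt("What is the capacity of this item?", entry))
print(build_prompt("What is the capacity of this item?", None))
```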

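Conclusion

In this notebook, we have covered how to use the CLIP model, shown an example of creating an image embedding database with it, performed semantic search, and finally used a user query to answer a question about the retrieved image.

This usage pattern applies across many different domains and can easily be improved further. For example, you could fine-tune CLIP, improve the retrieval process just as you would in RAG, or apply prompt engineering to GPT-4 Vision.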