MobileVLM

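This implementation currently supports two model versions: MobileVLM-1.7B and MobileVLM_V2-1.7B.

It is built on LLaVA (Large Language and Vision Assistant) and is compatible with both LLaVA and MobileVLM. Basic usage is the same as for LLaVA.

Note: MobileVLM and MobileVLM_V2 share exactly the same inference pipeline, but their model conversion steps differ slightly. The steps below use MobileVLM-1.7B as the example and show where the conversion differs.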

Usage

First, build the llama-mtmd-cli binary.

After building, run ./llama-mtmd-cli to see the detailed usage. For example:

./llama-mtmd-cli -m MobileVLM-1.7B/ggml-model-q4_k.gguf \
    --mmproj MobileVLM-1.7B/mmproj-model-f16.gguf \
    --chat-template deepseek

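-m sets the path to the main language model.
--mmproj sets the path to the multimodal projector, the module that adapts image features to the language model's input.
--chat-template selects the chat template.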
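The flags above can also be assembled programmatically, which helps when the same model is run against several images. The following is a minimal Python sketch, not part of the project: the helper name build_mtmd_command is hypothetical, and it only uses flags that appear in this document (-m, --mmproj, --chat-template, -t, --image, -p).

```python
import shlex

def build_mtmd_command(model, mmproj, chat_template=None,
                       threads=None, image=None, prompt=None):
    """Assemble a llama-mtmd-cli argument list (hypothetical helper)."""
    cmd = ["./llama-mtmd-cli", "-m", model, "--mmproj", mmproj]
    if chat_template is not None:
        cmd += ["--chat-template", chat_template]
    if threads is not None:
        cmd += ["-t", str(threads)]
    if image is not None:
        cmd += ["--image", image]
    if prompt is not None:
        cmd += ["-p", prompt]
    return cmd

if __name__ == "__main__":
    cmd = build_mtmd_command(
        "MobileVLM-1.7B/ggml-model-q4_k.gguf",
        "MobileVLM-1.7B/mmproj-model-f16.gguf",
        chat_template="deepseek",
    )
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments safe to pass to subprocess.run without shell quoting issues.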

Model conversion

Step 1: Clone the model files

Clone MobileVLM-1.7B and the clip-vit-large-patch14-336 model required by the vision part:

git clone https://huggingface.co/mtgv/MobileVLM-1.7B

git clone https://huggingface.co/openai/clip-vit-large-patch14-336

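Step 2: Split the model components

Use llava_surgery.py to split the LLaVA model into its LLaMA part (the language model) and the multimodal projector (the module that bridges images and language):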

python ./tools/mtmd/llava_surgery.py -m path/to/MobileVLM-1.7B

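Step 3: Convert the image encoder to GGUF

GGUF is an efficient storage format for model weights.

Use convert_image_encoder_to_gguf.py to convert the CLIP image encoder to GGUF. The V2 model requires a different projector type argument.

V1: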

python ./tools/mtmd/convert_image_encoder_to_gguf.py \
    -m path/to/clip-vit-large-patch14-336 \
    --llava-projector path/to/MobileVLM-1.7B/llava.projector \
    --output-dir path/to/MobileVLM-1.7B \
    --projector-type ldp

V2:

python ./tools/mtmd/convert_image_encoder_to_gguf.py \
    -m path/to/clip-vit-large-patch14-336 \
    --llava-projector path/to/MobileVLM-1.7B_V2/llava.projector \
    --output-dir path/to/MobileVLM-1.7B_V2 \
    --projector-type ldpv2

Note:
--projector-type ldp is for V1 and ldpv2 is for V2. The projector is the module that maps image embeddings into the text embedding space.
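Since the only difference between the V1 and V2 commands is the projector type, the choice can be captured in a small lookup. The following is a minimal Python sketch under that assumption; the helper name image_encoder_args and the "v1"/"v2" keys are hypothetical, while the script path and flags are the ones shown above.

```python
# Projector type per model version, as stated in the note above.
PROJECTOR_TYPES = {"v1": "ldp", "v2": "ldpv2"}

def image_encoder_args(clip_dir, model_dir, version):
    """Build the conversion command for one model version (hypothetical helper)."""
    if version not in PROJECTOR_TYPES:
        raise ValueError(f"unknown MobileVLM version: {version!r}")
    return [
        "python", "./tools/mtmd/convert_image_encoder_to_gguf.py",
        "-m", clip_dir,
        "--llava-projector", f"{model_dir}/llava.projector",
        "--output-dir", model_dir,
        "--projector-type", PROJECTOR_TYPES[version],
    ]

if __name__ == "__main__":
    args = image_encoder_args(
        "path/to/clip-vit-large-patch14-336", "path/to/MobileVLM-1.7B", "v1"
    )
    print(" ".join(args))
```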

Step 4: Convert the LLaMA part to GGUF

Use examples/convert_legacy_llama.py to convert the LLaMA part (the language model) to GGUF:

python ./examples/convert_legacy_llama.py path/to/MobileVLM-1.7B --skip-unknown

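Step 5: Quantize the language model (lower precision for faster inference and lower memory use)

Use llama-quantize to convert the LLaMA part from fp32 (32-bit floating point) to q4_k (a 4-bit mixed quantization, a commonly used lightweight precision format). The command below passes the q4_k_s variant: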

./llama-quantize path/to/MobileVLM-1.7B/ggml-model-F32.gguf path/to/MobileVLM-1.7B/ggml-model-q4_k.gguf q4_k_s

After these steps, the MobileVLM-1.7B directory contains both the language model and the image encoder.
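A quick way to confirm that the conversion finished is to check for the two files that the run commands in this document load. The following is a minimal Python sketch; the helper name missing_outputs is hypothetical, and the file names are taken from the commands above.

```python
from pathlib import Path

# File names taken from the commands in this document.
EXPECTED_FILES = ("ggml-model-q4_k.gguf", "mmproj-model-f16.gguf")

def missing_outputs(model_dir):
    """Return the expected GGUF files that are not present in model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]

if __name__ == "__main__":
    missing = missing_outputs("path/to/MobileVLM-1.7B")
    if missing:
        print("missing:", ", ".join(missing))
    else:
        print("all converted files are present")
```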

Building and running on Android

Build

Refer to the tools/mtmd/android/build_64.sh script:

mkdir tools/mtmd/android/build_64
cd tools/mtmd/android/build_64
../build_64.sh

Running on an Android device

Refer to android/adb_run.sh and modify it to match the paths and names of your resource files.

Some results on Android with a Snapdragon 888 chip

Example 1

Input:

/data/local/tmp/llama-mtmd-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -t 4 \
    --image /data/local/tmp/demo.jpg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:"

-t 4 sets the number of threads; adjust it to the CPU of your device.

Output:

encode_image_with_clip: image encoded in 21148.71 ms by CLIP (  146.87 ms per image patch)
 Susan Wise Bauer
llama_print_timings:        load time =   23574.72 ms
llama_print_timings:      sample time =       1.24 ms /     6 runs   (    0.21 ms per token,  4850.44 tokens per second)
llama_print_timings: prompt eval time =   12460.15 ms /   246 tokens (   50.65 ms per token,    19.74 tokens per second)
llama_print_timings:        eval time =     424.86 ms /     6 runs   (   70.81 ms per token,    14.12 tokens per second)
llama_print_timings:       total time =   34731.93 ms

Example 2

Input:

/data/local/tmp/llama-mtmd-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -t 4 \
    --image /data/local/tmp/cat.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:"

Output:

encode_image_with_clip: image encoded in 21149.51 ms by CLIP (  146.87 ms per image patch)
 The image depicts a cat sitting in the grass near some tall green plants.
llama_print_timings:        load time =   23257.32 ms
llama_print_timings:      sample time =       5.25 ms /    18 runs   (    0.29 ms per token,  3430.53 tokens per second)
llama_print_timings: prompt eval time =   11900.73 ms /   232 tokens (   51.30 ms per token,    19.49 tokens per second)
llama_print_timings:        eval time =    1279.03 ms /    18 runs   (   71.06 ms per token,    14.07 tokens per second)
llama_print_timings:       total time =   34570.79 ms
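The tokens-per-second figures in these logs are simply the token count divided by the elapsed time. As a rough illustration, the Python sketch below parses llama_print_timings lines and recomputes the rate; the regular expression and helper names are assumptions written for the log layout shown in this document, not part of llama.cpp.

```python
import re

# Matches lines such as:
#   llama_print_timings: prompt eval time = 12460.15 ms / 246 tokens (...)
TIMING_RE = re.compile(
    r"llama_print_timings:\s+(?P<name>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms"
    r"(?: /\s+(?P<count>\d+) (?:runs|tokens))?"
)

def parse_timings(log_text):
    """Map each timing name to (milliseconds, count or None)."""
    timings = {}
    for m in TIMING_RE.finditer(log_text):
        count = int(m.group("count")) if m.group("count") else None
        timings[m.group("name")] = (float(m.group("ms")), count)
    return timings

def tokens_per_second(ms, count):
    """Throughput implied by a token count and an elapsed time in ms."""
    return count / (ms / 1000.0)

SAMPLE = """\
llama_print_timings: prompt eval time =   12460.15 ms /   246 tokens (   50.65 ms per token,    19.74 tokens per second)
llama_print_timings:        eval time =     424.86 ms /     6 runs   (   70.81 ms per token,    14.12 tokens per second)
"""

if __name__ == "__main__":
    for name, (ms, count) in parse_timings(SAMPLE).items():
        print(f"{name}: {tokens_per_second(ms, count):.2f} tokens per second")
```

For Example 1, 246 prompt tokens in 12460.15 ms works out to about 19.74 tokens per second, which matches the value printed in the log.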

Results on a Snapdragon 778G chip

MobileVLM-1.7B test

Using mtmd-cli release-b2005

Input:

/data/local/tmp/llama-mtmd-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -t 4 \
    --image /data/local/tmp/many_llamas.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:"

Output:

encode_image_with_clip: image encoded in 18728.52 ms by CLIP (  130.06 ms per image patch)
system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that? ASSISTANT:

 A group of llamas are standing in a green pasture.

llama_print_timings:        load time =   20357.33 ms
llama_print_timings:      sample time =       2.96 ms /    14 runs   (    0.21 ms per token,  4734.53 tokens per second)
llama_print_timings: prompt eval time =    8119.49 ms /   191 tokens (   42.51 ms per token,    23.52 tokens per second)
llama_print_timings:        eval time =    1005.75 ms /    14 runs   (   71.84 ms per token,    13.92 tokens per second)
llama_print_timings:       total time =   28038.34 ms /   205 tokens

Using the latest mtmd-cli

Input: same command as above.

Output (inference is noticeably slower):

encode_image_with_clip: image embedding created: 144 tokens

encode_image_with_clip: image encoded in 288268.88 ms by CLIP ( 2001.87 ms per image patch)
system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that? ASSISTANT:

 It is a group of sheep standing together in a grass field.

llama_print_timings:        load time =  818120.91 ms
llama_print_timings:      sample time =       3.44 ms /    14 runs   (    0.25 ms per token,  4067.40 tokens per second)
llama_print_timings: prompt eval time =  529274.69 ms /   191 tokens ( 2771.07 ms per token,     0.36 tokens per second)
llama_print_timings:        eval time =   43894.02 ms /    13 runs   ( 3376.46 ms per token,     0.30 tokens per second)
llama_print_timings:       total time =  865441.76 ms /   204 tokens
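To put a number on the slowdown, the per-stage times of the two runs can be divided. The Python sketch below copies the millisecond values from the two logs above; the dictionary keys and the helper name slowdown are hypothetical labels chosen for illustration.

```python
# Timings in milliseconds copied from the two MobileVLM-1.7B runs above.
RELEASE_B2005 = {"image encode": 18728.52, "prompt eval": 8119.49, "total": 28038.34}
LATEST = {"image encode": 288268.88, "prompt eval": 529274.69, "total": 865441.76}

def slowdown(old_ms, new_ms):
    """Ratio of new time to old time for each stage."""
    return {stage: new_ms[stage] / old_ms[stage] for stage in old_ms}

if __name__ == "__main__":
    for stage, factor in slowdown(RELEASE_B2005, LATEST).items():
        print(f"{stage}: {factor:.1f}x slower")
```

On these numbers, image encoding is roughly 15 times slower, prompt evaluation roughly 65 times slower, and the whole run roughly 31 times slower.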

MobileVLM_V2-1.7B test

Using mtmd-cli release-b2005

Input: same command as above.

Output:

encode_image_with_clip: image encoded in 20609.61 ms by CLIP (  143.12 ms per image patch)
system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that? ASSISTANT:

 This image captures a lively scene of 20 llamas in motion on an expansive, grassy field. The llama is scattered across the landscape with some standing and others sitting down as if taking rest or observing their surroundings from different vantage points within this verdant setting.

The background offers glimpses into a picturesque town nestled amidst hills under an overcast sky, adding depth to the scene while also emphasizing that distance between these llama and human-made structures like houses or roads in which they roam freely without any barriers around them. The image is framed by text at both right angles on white backgrounds against a contrasting blue backdrop with green foliage, further drawing attention to the llamas amidst their natural habitat while also inviting viewers into this picturesque landscape within town limits of Alta Llama

llama_print_timings:        load time =   22406.77 ms
llama_print_timings:      sample time =      49.26 ms /   186 runs   (    0.26 ms per token,  3776.27 tokens per second)
llama_print_timings: prompt eval time =    9044.54 ms /   191 tokens (   47.35 ms per token,    21.12 tokens per second)
llama_print_timings:        eval time =   14497.49 ms /   186 runs   (   77.94 ms per token,    12.83 tokens per second)
llama_print_timings:       total time =   44411.01 ms /   377 tokens

Building and running on Orin

Build

make GGML_CUDA=1 CUDA_DOCKER_ARCH=sm_87 GGML_CUDA_F16=1 -j 32

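GGML_CUDA=1 enables CUDA support (NVIDIA GPU acceleration).
CUDA_DOCKER_ARCH=sm_87 selects the GPU architecture; set it to match your GPU.
GGML_CUDA_F16=1 enables the 16-bit half-precision floating point format.

Running on Orin

Example 1

Input: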

./llama-mtmd-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    --image /data/local/tmp/demo.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWho is the author of this book? \nAnswer the question using a single word or phrase. ASSISTANT:" \
    --n-gpu-layers 999

--n-gpu-layers sets how many model layers are accelerated on the GPU; a value such as 999 puts all of them on the GPU.

Output:

encode_image_with_clip: image encoded in   296.62 ms by CLIP (    2.06 ms per image patch)

 Susan Wise Bauer

llama_print_timings:        load time =    1067.64 ms
llama_print_timings:      sample time =       1.53 ms /     6 runs   (    0.25 ms per token,  3934.43 tokens per second)
llama_print_timings: prompt eval time =     306.84 ms /   246 tokens (    1.25 ms per token,   801.72 tokens per second)
llama_print_timings:        eval time =      91.50 ms /     6 runs   (   15.25 ms per token,    65.58 tokens per second)
llama_print_timings:       total time =    1352.63 ms /   252 tokens

Example 2

Input:

./llama-mtmd-cli \
    -m /data/local/tmp/ggml-model-q4_k.gguf \
    --mmproj /data/local/tmp/mmproj-model-f16.gguf \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat is in the image? ASSISTANT:" \
    --n-gpu-layers 999

Output:

encode_image_with_clip: image encoded in   302.15 ms by CLIP (    2.10 ms per image patch)

 The image features a cat lying in the grass.

llama_print_timings:        load time =    1057.07 ms
llama_print_timings:      sample time =       3.27 ms /    11 runs   (    0.30 ms per token,  3360.83 tokens per second)
llama_print_timings: prompt eval time =     213.60 ms /   232 tokens (    0.92 ms per token,  1086.14 tokens per second)
llama_print_timings:        eval time =     166.65 ms /    11 runs   (   15.15 ms per token,    66.01 tokens per second)
llama_print_timings:       total time =    1365.47 ms /   243 tokens

Running on Intel(R) Core(TM) i7-10750H

Operating system

Ubuntu 22.04

Build

make -j32

MobileVLM-1.7B example

Input:

./llama-mtmd-cli \
    -m /path/to/ggml-model-q4_k.gguf \
    --mmproj /path/to/mmproj-model-f16.gguf \
    --image /path/to/many_llamas.jpeg \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:"

Output:

encode_image_with_clip: image embedding created: 144 tokens

encode_image_with_clip: image encoded in  2730.94 ms by CLIP (   18.96 ms per image patch)
system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that?ASSISTANT:

 A group of llamas are walking together in a field.

llama_print_timings:        load time =    5506.60 ms
llama_print_timings:      sample time =       0.44 ms /    13 runs   (    0.03 ms per token, 29545.45 tokens per second)
llama_print_timings: prompt eval time =    2031.58 ms /   190 tokens (   10.69 ms per token,    93.52 tokens per second)
llama_print_timings:        eval time =     438.92 ms /    12 runs   (   36.58 ms per token,    27.34 tokens per second)
llama_print_timings:       total time =    5990.25 ms /   202 tokens

MobileVLM_V2-1.7B example

Input: same command as above.

Output:

encode_image_with_clip: image embedding created: 144 tokens

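encode_image_with_clip: image encoded in  3223.89 ms by CLIP (   22.39 ms per image patch)
system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that?ASSISTANT:

 The image captures a tranquil scene in a park, where approximately 20 llamas are gathered. The llamas, a mix of white and black, are lined up in a row, their black and white patterns contrasting sharply with the lush green grass of the park. The llamas are arranged in order, suggesting a certain social order.

The park itself is lush, with a few trees scattered in the background. A sign reading "Llamas Tico Ana" is also visible in the image, possibly indicating the name of the place or the breed of the llamas. The image was taken from a distance, showing the whole scene and its surroundings.

The positions of the llamas relative to each other, together with the sign and the trees, form a harmonious and pleasing composition. The image contains no discernible text. The overall scene conveys peace and natural beauty, with the llamas in their natural state, surrounded by the vivid colors and dense vegetation of the park.

llama_print_timings:        load time =    6642.61 ms
llama_print_timings:      sample time =       8.15 ms /   223 runs   (    0.04 ms per token, 27358.61 tokens per second)
llama_print_timings: prompt eval time =    2475.07 ms /   190 tokens (   13.03 ms per token,    76.77 tokens per second)
llama_print_timings:        eval time =    8760.60 ms /   222 runs   (   39.46 ms per token,    25.34 tokens per second)
llama_print_timings:       total time =   15513.95 ms /   412 tokens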

Running on Intel(R) Core(TM) Ultra7 115H

Operating system

Windows11

Build

make -j32

MobileVLM-1.7B example

Input:

./llama-mtmd-cli \
    -m /path/to/ggml-model-q4_k.gguf \
    --mmproj /path/to/tmp/mmproj-model-f16.gguf \
    -p "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: <image>\nWhat's that? ASSISTANT:"

Output:

encode_image_with_clip: image encoded in  4902.81 ms by CLIP (   34.05 ms per image patch)
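system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that? ASSISTANT:

 The image shows a group of brown and white llamas standing in a grassy field.

llama_print_timings:        load time =    7441.06 ms
llama_print_timings:      sample time =       0.72 ms /    19 runs   (    0.04 ms per token, 26279.39 tokens per second)
llama_print_timings: prompt eval time =    2090.71 ms /   191 tokens (   10.95 ms per token,    91.36 tokens per second)
llama_print_timings:        eval time =     512.35 ms /    18 runs   (   28.46 ms per token,    35.13 tokens per second)
llama_print_timings:       total time =    7987.23 ms /   209 tokens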

MobileVLM_V2-1.7B example

Input: same command as above.

Output:

encode_image_with_clip: image encoded in  4682.44 ms by CLIP (   32.52 ms per image patch)
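system_prompt: A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER:
user_prompt: \nWhat's that? ASSISTANT:

 This image captures a lively scene of 14 llamas in a grassy field. They have distinctive black and white markings and are standing and walking in a line, as if engaged in some social activity. The one at the very front has its back to the camera and seems to be observing something in the distance.

 The llama at the front of the line stands out because of its black and white coat, a pattern that is relatively rare among llamas. It faces the camera, giving a sense of engagement with the viewer, and appears more alert.

 The image is taken from the side, giving a clear view of the llama at the front of the line and its companions. Notably, the llama in front shows no sign of limping, which indicates that this is not the focus of the photo.

 The background is a grassy field, with a fence and a tree visible in the distance. The tree has no leaves, suggesting a dormant or leaf-shedding season.

llama_print_timings:        load time =    7015.35 ms
llama_print_timings:      sample time =      10.61 ms /   256 runs   (    0.04 ms per token, 24119.09 tokens per second)
llama_print_timings: prompt eval time =    2052.45 ms /   191 tokens (   10.75 ms per token,    93.06 tokens per second)
llama_print_timings:        eval time =    7259.43 ms /   255 runs   (   28.47 ms per token,    35.13 tokens per second)
llama_print_timings:       total time =   14371.19 ms /   446 tokens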

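TODO

Support non-CPU backends for the new operators, such as depthwise (depthwise convolution), hardswish, and hardsigmoid (hard variants of the swish and sigmoid activations)
Optimize LDP projector performance
- Optimize the structure definition to avoid unnecessary memory rearrangement and reduce the use of ggml_permute_cpy;
- Optimize operator implementations (ARM CPU/NVIDIA GPU), such as depthwise convolution, hardswish, and hardsigmoid;
Run MobileVLM on Jetson Orin
Support more model variants, such as MobileVLM-3B.

Contributors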

zhangjidong05, yangyang260, huyiming03, chenxiaotao03, ZiangWu-77

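Note:
Adjust all run parameters and examples to match your device and file paths.
Beginners may want to start with simple commands and smaller models, then gradually get familiar with what each parameter does and how the models behave.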
