MLX-VLM: a toolkit for running vision language models on Apple Silicon

flora · 2026-02-10 11:20:01

MLX-VLM

MLX-VLM is a package for inference and fine-tuning of Vision Language Models (VLMs) and Omni Models (VLMs with audio and video support) on a Mac using MLX.

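Model-specific documentation

Some models have detailed documentation covering prompt formats, examples, and best practices:

Model	Documentation
DeepSeek-OCR	Docs
DeepSeek-OCR-2	Docs
DOTS-OCR	Docs
GLM-OCR	Docs
Phi-4 Reasoning Vision	Docs
MiniCPM-o	Docs
Phi-4 Multimodal	Docs
MolmoPoint	Docs
Moondream3	Docs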

Installation

The easiest way to get started is to install the mlx-vlm package with pip:

pip install -U mlx-vlm

Usage

Command Line Interface (CLI)

Generate output from a model using the CLI:

# Text generation
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Hello, how are you?"

# Generation from an image
mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --temperature 0.0 --image http://images.cocodataset.org/val2017/000000039769.jpg

# Generation from audio (new)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you hear" --audio /path/to/audio.wav

# Multimodal generation (image + audio)
mlx_vlm.generate --model mlx-community/gemma-3n-E2B-it-4bit --max-tokens 100 --prompt "Describe what you see and hear" --image /path/to/image.jpg --audio /path/to/audio.wav
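
When these commands are driven from a script, it can help to assemble the argument list in one place. The sketch below is only an illustration: build_generate_command is a hypothetical helper, not part of mlx-vlm, and it uses only the flags that appear in the commands above.

```python
import shlex
from typing import List, Optional, Sequence


def build_generate_command(
    model: str,
    prompt: Optional[str] = None,
    images: Sequence[str] = (),
    audio: Optional[str] = None,
    max_tokens: int = 100,
    temperature: Optional[float] = None,
) -> List[str]:
    """Assemble an argv list for mlx_vlm.generate (hypothetical helper)."""
    cmd = ["mlx_vlm.generate", "--model", model, "--max-tokens", str(max_tokens)]
    if temperature is not None:
        cmd += ["--temperature", str(temperature)]
    if prompt is not None:
        cmd += ["--prompt", prompt]
    if images:
        # --image accepts one or more paths or URLs
        cmd += ["--image", *images]
    if audio is not None:
        cmd += ["--audio", audio]
    return cmd


cmd = build_generate_command(
    "mlx-community/gemma-3n-E2B-it-4bit",
    prompt="Describe what you see and hear",
    images=["/path/to/image.jpg"],
    audio="/path/to/audio.wav",
)
# Print a shell-quoted version for inspection.
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Passing a list to subprocess.run avoids shell quoting problems with prompts that contain spaces or quotes.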

Thinking budget

For models that support thinking (for example Qwen3.5), you can cap the number of tokens spent inside the thinking block:

mlx_vlm.generate --model mlx-community/Qwen3.5-2B-4bit \
  --thinking-budget 50 \
  --thinking-start-token "<think>" \
  --thinking-end-token "</think>" \
  --enable-thinking \
  --prompt "Solve 2+2"

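Flag	Description
`--enable-thinking`	Activate thinking mode in the chat template
`--thinking-budget`	Maximum number of tokens allowed inside the thinking block
`--thinking-start-token`	Token that opens the thinking block (default: `<think>`)
`--thinking-end-token`	Token that closes the thinking block (default: `</think>`)

When the budget is exceeded, the model is forced to emit \n</think> and move on to the answer. If --enable-thinking is passed but the model's chat template does not support it, the budget applies only if the model generates the start token on its own.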
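
The budget rule can be pictured with a toy generation loop. This is a minimal sketch of the behaviour described above, not the mlx-vlm implementation: generate_with_thinking_budget and toy_model are hypothetical names, tokens are plain strings, and None stands in for end of sequence.

```python
from typing import Callable, List, Optional


def generate_with_thinking_budget(
    next_token: Callable[[List[str]], Optional[str]],
    budget: int,
    start_token: str = "<think>",
    end_token: str = "</think>",
    max_tokens: int = 100,
) -> List[str]:
    """Generate tokens, forcing the thinking block closed once the budget is spent."""
    out: List[str] = []
    in_thinking = False
    used = 0
    while len(out) < max_tokens:
        if in_thinking and used >= budget:
            # Budget exhausted: inject the closing token instead of sampling.
            out.append("\n" + end_token)
            in_thinking = False
            continue
        tok = next_token(out)
        if tok is None:  # toy end-of-sequence signal
            break
        out.append(tok)
        if tok == start_token:
            # The budget only starts counting once a start token appears.
            in_thinking, used = True, 0
        elif tok == end_token:
            in_thinking = False
        elif in_thinking:
            used += 1
    return out


def toy_model(context: List[str]) -> Optional[str]:
    """Stand-in model: thinks until the block is closed, then answers once."""
    if not context:
        return "<think>"
    if context[-1].endswith("</think>"):
        return "4"
    if context[-1] == "4":
        return None
    return "hmm"


print(generate_with_thinking_budget(toy_model, budget=3))
# ['<think>', 'hmm', 'hmm', 'hmm', '\n</think>', '4']
```

Because the forced token is appended to the context, the model's next step sees a closed thinking block and continues with the answer.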

Chat UI with Gradio

Launch a chat interface using Gradio:

mlx_vlm.chat_ui --model mlx-community/Qwen2-VL-2B-Instruct-4bit

Python script

Here is an example of using MLX-VLM in a Python script:

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare the input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
# PIL.Image.Image objects also work: image = [Image.open("...")]
prompt = "Describe this image."

# Apply the chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate the output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)

Audio example

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

# Load a model with audio support
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# Prepare the audio input
audio = ["/path/to/audio1.wav", "/path/to/audio2.mp3"]
prompt = "Describe what you hear in these audio files."

# Apply the chat template with audio
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_audios=len(audio)
)

# Generate the output from audio
output = generate(model, processor, formatted_prompt, audio=audio, verbose=False)
print(output)

Multimodal example (image + audio)

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template

# Load a multimodal model
model_path = "mlx-community/gemma-3n-E2B-it-4bit"
model, processor = load(model_path)
config = model.config

# Prepare the input
image = ["/path/to/image.jpg"]
audio = ["/path/to/audio.wav"]
prompt = ""

# Apply the chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt,
    num_images=len(image),
    num_audios=len(audio)
)

# Generate the output
output = generate(model, processor, formatted_prompt, image, audio=audio, verbose=False)
print(output)

Server (FastAPI)

Start the server:

mlx_vlm.server --port 8080

# Preload a model at startup (Hugging Face repo or local path)
mlx_vlm.server --model <hf_repo_or_local_path>

# Preload a model with an adapter
mlx_vlm.server --model <hf_repo_or_local_path> --adapter-path <adapter_path>

# Trust remote code (required by some models)
mlx_vlm.server --trust-remote-code

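Server options

--model: preload a model at server startup; accepts a Hugging Face repo ID or a local path (optional; if omitted, the model is loaded lazily on the first request)
--adapter-path: path to adapter weights to use with the preloaded model
--host: host address (default: 0.0.0.0)
--port: port number (default: 8080)
--trust-remote-code: trust remote code when loading models from the Hugging Face Hub

You can also enable trust for remote code through an environment variable:

MLX_TRUST_REMOTE_CODE=true mlx_vlm.server

The server exposes several endpoints for different use cases and supports dynamic model loading and unloading with caching (one model at a time).

Available endpoints

/models and /v1/models - list the models available locally
/chat/completions and /v1/chat/completions - OpenAI-compatible chat endpoint with support for images, audio, and text
/responses and /v1/responses - OpenAI-compatible responses endpoint
/health - check server status
/unload - unload the current model from memory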

Usage examples

List available models

curl "http://localhost:8080/models"

Text input

curl -X POST "http://localhost:8080/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": "Hello, how are you"
      }
    ],
    "stream": true,
    "max_tokens": 100
  }'
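
With "stream": true the reply arrives in pieces rather than as one JSON body. The sketch below shows one way a client might reassemble the text. It assumes the server follows the usual OpenAI streaming convention (server-sent event lines of the form "data: {json}", text under choices[0].delta.content, and a final "data: [DONE]"); that format is an assumption, not something stated in this document, and iter_stream_text is a hypothetical helper.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style server-sent event lines (assumed format)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # conventional end-of-stream marker
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


# Hand-written sample lines standing in for a real response body.
sample = [
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
# Hello, world
```

In a real client the lines would come from the decoded HTTP response instead of the sample list.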

Image input

curl -X POST "http://localhost:8080/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-VL-32B-Instruct-8bit",
    "messages":
    [
      {
        "role": "system",
        "content": "You are a helpful assistant."
      },
      {
        "role": "user",
        "content": [
          {
            "type": "text",
            "text": "This is today's chart for energy demand in California. Can you provide an analysis of the chart and comment on the implications for renewable energy in California?"
          },
          {
            "type": "input_image",
            "image_url": "/path/to/repo/examples/images/renewables_california.png"
          }
        ]
      }
    ],
    "stream": true,
    "max_tokens": 1000
  }'

Audio support (new)

curl -X POST "http://localhost:8080/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-3n-E2B-it-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "Describe what you hear in these audio files" },
          { "type": "input_audio", "input_audio": "/path/to/audio1.wav" },
          { "type": "input_audio", "input_audio": "https://example.com/audio2.mp3" }
        ]
      }
    ],
    "stream": true,
    "max_tokens": 500
  }'

Multimodal (image + audio)

curl -X POST "http://localhost:8080/chat/completions" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/gemma-3n-E2B-it-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "input_image", "image_url": "/path/to/image.jpg"},
          {"type": "input_audio", "input_audio": "/path/to/audio.wav"}
        ]
      }
    ],
    "max_tokens": 100
  }'
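
The message bodies in the image, audio, and multimodal requests above share one shape: a user message whose content is a list of typed parts. A small sketch of building that list in Python follows; build_user_message is a hypothetical helper, and the field names are copied from the curl examples above.

```python
from typing import Any, Dict, List, Optional, Sequence


def build_user_message(
    text: Optional[str] = None,
    images: Sequence[str] = (),
    audio: Sequence[str] = (),
) -> Dict[str, Any]:
    """Build a user message with text, image, and audio parts (hypothetical helper)."""
    content: List[Dict[str, str]] = []
    if text:
        content.append({"type": "text", "text": text})
    for path_or_url in images:
        content.append({"type": "input_image", "image_url": path_or_url})
    for path_or_url in audio:
        content.append({"type": "input_audio", "input_audio": path_or_url})
    return {"role": "user", "content": content}


message = build_user_message(
    images=["/path/to/image.jpg"],
    audio=["/path/to/audio.wav"],
)
print(message)
```

The resulting dictionary goes into the "messages" list of the request payload.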

Responses endpoint

curl -X POST "http://localhost:8080/responses" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "input_text", "text": "What is in this image?"},
          {"type": "input_image", "image_url": "/path/to/image.jpg"}
        ]
      }
    ],
    "max_tokens": 100
  }'

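Request parameters

model: model identifier (required)
messages: chat messages for the chat/OpenAI endpoints
max_tokens: maximum number of tokens to generate
temperature: sampling temperature
top_p: top-p sampling parameter
top_k: top-k sampling cutoff
min_p: min-p sampling threshold
repetition_penalty: penalty applied to repeated tokens
stream: enable streaming responses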
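
The same request can be prepared from Python with the standard library. The sketch below builds a POST to /chat/completions without sending it, accepts only the parameters listed above, and drops any left as None. build_chat_request is a hypothetical helper, and http://localhost:8080 is taken from the curl examples.

```python
import json
import urllib.request
from typing import Any, Dict, List

ALLOWED_PARAMS = {
    "max_tokens", "temperature", "top_p", "top_k",
    "min_p", "repetition_penalty", "stream",
}


def build_chat_request(
    model: str,
    messages: List[Dict[str, Any]],
    base_url: str = "http://localhost:8080",
    **params: Any,
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /chat/completions."""
    unknown = set(params) - ALLOWED_PARAMS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    payload: Dict[str, Any] = {"model": model, "messages": messages}
    # Only include parameters the caller actually set.
    payload.update({k: v for k, v in params.items() if v is not None})
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_chat_request(
    "mlx-community/Qwen2-VL-2B-Instruct-4bit",
    [{"role": "user", "content": "Hello, how are you"}],
    max_tokens=100,
    temperature=0.0,
    top_p=None,
)
print(req.full_url)
print(req.data.decode("utf-8"))
# To send it against a running server: urllib.request.urlopen(req)
```

Rejecting unknown keys early catches typos such as max_token before the request reaches the server.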

Activation quantization (CUDA)

When running on NVIDIA GPUs with MLX CUDA, models quantized with the mxfp8 or nvfp4 modes need activation quantization to work correctly. This converts QuantizedLinear layers into QQLinear layers, which quantize both weights and activations.

Command line

Use the -qa or --quantize-activations flag:

mlx_vlm.generate --model /path/to/mxfp8-model --prompt "Describe this image" --image /path/to/image.jpg -qa

Python API

Pass quantize_activations=True to the load function:

from mlx_vlm import load, generate

# Enable activation quantization at load time
model, processor = load(
    "path/to/mxfp8-quantized-model",
    quantize_activations=True
)

# Generate as usual
output = generate(model, processor, "Describe this image", image=["image.jpg"])

Supported quantization modes

mxfp8 - 8-bit MX floating point
nvfp4 - 4-bit NVIDIA floating point

Note: this feature is required for mxfp/nvfp quantized models on CUDA. On Apple Silicon (Metal), these models work without the flag.
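
For scripts that run on both kinds of hardware, the rule in the note can be written down as a small check. This is only a sketch: needs_activation_quantization is a hypothetical helper, and the backend labels "cuda" and "metal" are illustrative strings chosen here, not values reported by MLX.

```python
def needs_activation_quantization(mode: str, backend: str) -> bool:
    """Return True when quantize_activations=True (or -qa) is required per the note above."""
    return backend.lower() == "cuda" and mode.lower() in {"mxfp8", "nvfp4"}


# Hypothetical labels: "cuda" for MLX CUDA on NVIDIA GPUs, "metal" for Apple Silicon.
load_kwargs = (
    {"quantize_activations": True}
    if needs_activation_quantization("mxfp8", "cuda")
    else {}
)
print(load_kwargs)
# {'quantize_activations': True}
```

The resulting dictionary could then be passed as load(model_path, **load_kwargs).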

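Multi-image chat support

MLX-VLM supports analyzing multiple images at once with select models. This enables more complex visual reasoning tasks and comprehensive analysis across several images in a single conversation.

Usage examples

Python script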

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

model_path = "mlx-community/Qwen2-VL-2B-Instruct-4bit"
model, processor = load(model_path)
config = model.config

images = ["path/to/image1.jpg", "path/to/image2.jpg"]
prompt = "Compare these two images."

formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(images)
)

output = generate(model, processor, formatted_prompt, images, verbose=False)
print(output)

Command line

mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Compare these images" --image path/to/image1.jpg path/to/image2.jpg

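Video understanding

MLX-VLM also supports video analysis, such as captioning and summarization, with select models.

Supported models

The following models support video chat:

Qwen2-VL
Qwen2.5-VL
Idefics3
LLaVA

More models are coming soon.

Usage examples

Command line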

mlx_vlm.video_generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit --max-tokens 100 --prompt "Describe this video" --video path/to/video.mp4 --max-pixels 224 224 --fps 1.0

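Together with the multi-image examples above, this shows how to use MLX-VLM for more complex visual reasoning tasks over multiple images and video.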
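
To get a feel for what a frame rate setting such as --fps 1.0 implies, the sketch below picks evenly spaced frame indices from a clip. It assumes that the flag means "sample this many frames per second of video"; that reading, the helper name sample_frame_indices, and the sampling scheme are assumptions made for illustration, not a description of how mlx_vlm.video_generate selects frames.

```python
from typing import List


def sample_frame_indices(total_frames: int, native_fps: float, target_fps: float) -> List[int]:
    """Pick evenly spaced frame indices so roughly target_fps frames are kept per second."""
    if total_frames <= 0 or native_fps <= 0 or target_fps <= 0:
        return []
    # Never step by less than one frame, even if target_fps exceeds native_fps.
    step = max(native_fps / target_fps, 1.0)
    indices: List[int] = []
    i = 0
    while int(i * step) < total_frames:
        indices.append(int(i * step))
        i += 1
    return indices


# A 3-second clip recorded at 30 fps, sampled at 1 frame per second.
print(sample_frame_indices(total_frames=90, native_fps=30.0, target_fps=1.0))
# [0, 30, 60]
```

Under this assumption, a lower frame rate means fewer frames reach the model, which trades temporal detail for less memory and faster generation.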

Fine-tuning

MLX-VLM supports fine-tuning models with LoRA and QLoRA.

LoRA & QLoRA

To learn more about LoRA, see the LoRA.md file.

Project page: https://github.com/Blaizzy/mlx-vlm
