CogAgent: A Multimodal Agent Model for GUI Understanding and Operation

cosmic · 2026-02-10 16:51:58 · 59 views · 0 comments

CogAgent: An Open-Source VLM-Based GUI Agent


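🔥 🆕 December 2024: We open-sourced the latest version of the model, CogAgent-9B-20241220. Compared with earlier CogAgent versions, CogAgent-9B-20241220 is significantly improved in GUI perception, reasoning accuracy, action space completeness, task universality, and generalization. It supports bilingual (Chinese and English) interaction through screenshots and natural language.
🏆 June 2024: CogAgent was accepted by CVPR 2024 and selected as a conference Highlight paper (top 3%).
December 2023: We open-sourced the first GUI agent, CogAgent (the original repository is available under CogVLM), and published the accompanying paper: 📖 CogAgent Paper.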

About the Model

Model	Model Download Links	Technical Documentation	Online Demo
cogagent-9b-20241220	🤗 HuggingFace 🤖 ModelScope 🟣 WiseModel 🧩 Modelers (Ascend)	📄 Official Technical Blog 📘 Practical Guide (Chinese)	🤗 HuggingFace Space 🤖 ModelScope Space 🧩 Modelers Space (Ascend)

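Model Overview

The CogAgent-9B-20241220 model is based on GLM-4V-9B, a bilingual open-source VLM base model. Through data collection and optimization, multi-stage training, and strategy improvements, CogAgent-9B-20241220 achieves significant advances in GUI perception, reasoning and prediction accuracy, action space completeness, and cross-task generalization. The model supports bilingual (Chinese and English) interaction through screenshots and language input. This version of CogAgent is already used in Zhipu AI's GLM-PC product. We hope this release helps researchers and developers advance the research and application of GUI agents based on vision-language models.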

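Capability Demonstrations

The CogAgent-9b-20241220 model achieves leading results across multiple platforms and categories in GUI agent tasks and GUI grounding benchmarks. In the CogAgent-9b-20241220 technical blog, we compare it with API-based commercial models (GPT-4o-20240806, Claude-3.5-Sonnet), commercial APIs combined with GUI grounding models (GPT-4o + UGround, GPT-4o + OS-ATLAS), and open-source GUI agent models (Qwen2-VL, ShowUI, SeeClick). The results show that CogAgent leads in GUI grounding (Screenspot), single-step operations (OmniAct), the Chinese step-wise in-house benchmark (CogAgentBench-basic-cn), and multi-step operations (OSWorld). Only on OSWorld does it fall slightly behind Claude-3.5-Sonnet, which specializes in computer use, and GPT-4o combined with an external GUI grounding model.

CogAgent wishes you a Merry Christmas! Let the large model automatically send Christmas greetings to your friends.

Want to open an issue? Let CogAgent send the email for you.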

CogAgent

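Inference and Fine-tuning Costs

At BF16 precision, inference requires at least 29GB of VRAM. INT4 inference is not recommended because of the significant performance loss. INT4 inference uses about 8GB of VRAM, and INT8 inference about 15GB. The two lines that enable these modes are commented out in inference/cli_demo.py; uncomment them to use INT4 or INT8 inference. This option is supported only on NVIDIA devices.
All GPU figures above refer to A100 or H100 GPUs. For other devices, you need to work out the required GPU/CPU memory accordingly.
During SFT (supervised fine-tuning), this codebase freezes the Vision Encoder, uses a batch size of 1, and trains on 8 * A100 GPUs. The total input length is 2048 tokens, of which the image accounts for 1600. This codebase cannot run SFT fine-tuning without freezing the Vision Encoder.
For LoRA fine-tuning, the Vision Encoder is not frozen; the batch size is 1 on 1 * A100 GPU. The total input length is again 2048 tokens, including 1600 image tokens. With these settings, SFT fine-tuning needs at least 60GB of GPU memory per GPU (across 8 GPUs), while LoRA fine-tuning needs at least 70GB on a single GPU (it cannot be split across GPUs).
SFT fine-tuning has not been tested on Ascend devices. We have only tested on an Atlas800 training server cluster. You need to modify the inference code according to the loading mechanism described at the Ascend device download link.
The online demo link does not support controlling a computer; it only lets you view the model's inference results. We recommend deploying the model locally.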
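
As a rough aid for reading the figures above, the sketch below picks the highest precision whose quoted VRAM requirement fits a given card, and computes the text-token budget left after the image tokens. The names `VRAM_REQUIRED_GB`, `pick_precision`, and `TEXT_TOKEN_BUDGET` are illustrative and do not come from the repository; the numbers are the A100/H100 reference figures quoted in this section, so treat the result as an estimate on other hardware.

```python
from typing import Optional

# Approximate inference VRAM needs quoted in this section (A100/H100 reference), in GB.
VRAM_REQUIRED_GB = {"bf16": 29, "int8": 15, "int4": 8}


def pick_precision(available_vram_gb: float) -> Optional[str]:
    """Return the highest precision whose quoted VRAM need fits, or None if none fits.

    Note: the section advises against INT4 because of its performance loss,
    so "int4" is a last resort rather than a recommendation.
    """
    for precision in ("bf16", "int8", "int4"):  # ordered from best to worst quality
        if available_vram_gb >= VRAM_REQUIRED_GB[precision]:
            return precision
    return None


# Fine-tuning input budget: 2048 tokens in total, 1600 of them taken by the image.
TOTAL_INPUT_TOKENS = 2048
IMAGE_TOKENS = 1600
TEXT_TOKEN_BUDGET = TOTAL_INPUT_TOKENS - IMAGE_TOKENS  # tokens left for task, history, etc.
```

For example, `pick_precision(24)` selects `"int8"` under these figures, and the text side of a fine-tuning sample has 448 tokens available.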

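Model Inputs and Outputs

cogagent-9b-20241220 is an agent-type execution model rather than a conversational model. It does not support continuous dialogue, but it does support a continuous execution history. (In other words, each step must start a new conversation session, and the past history must be supplied to the model.) The workflow of CogAgent is illustrated in the figure below:

To achieve the best GUI agent performance, we adopt a strict input and output format.
The following explains how to format the input, pass it to the model, and parse the model's response.

User Input

You can refer to app/client.py#L115 to see how the user input prompt is constructed. A minimal example of the input concatenation code follows: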


# identify_os, task, history_grounded_op_funcs, and history_actions are supplied by the caller.
current_platform = identify_os()  # "Mac", "WIN", or "Mobile". The value is case-sensitive.
platform_str = f"(Platform: {current_platform})\n"
format_str = "(Answer in Action-Operation-Sensitive format.)\n"  # Replace "Action-Operation-Sensitive" to request another format.

history_str = "\nHistory steps: "
for index, (grounded_op_func, action) in enumerate(zip(history_grounded_op_funcs, history_actions)):
    history_str += f"\n{index}. {grounded_op_func}\t{action}"  # Indexing starts at 0.

query = f"Task: {task}{history_str}\n{platform_str}{format_str}"  # Mind the \n separators.

An example of the concatenated Python string (here requesting the Action-Operation format):

"Task: Search for doors, click doors on sale and filter by brands \"Mastercraft\".\nHistory steps: \n0. CLICK(box=[[352,102,786,139]], element_info='Search')\tLeft click on the search box located in the middle top of the screen next to the Menards logo.\n1. TYPE(box=[[352,102,786,139]], text='doors', element_info='Search')\tIn the search input box at the top, type 'doors'.\n2. CLICK(box=[[787,102,809,139]], element_info='SEARCH')\tLeft click on the magnifying glass icon next to the search bar to perform the search.\n3. SCROLL_DOWN(box=[[0,209,998,952]], step_count=5, element_info='[None]')\tScroll down the page to see the available doors.\n4. CLICK(box=[[280,708,710,809]], element_info='Doors on Sale')\tClick the \"Doors On Sale\" button in the middle of the page to view the doors that are currently on sale.\n(Platform: WIN)\n(Answer in Action-Operation format.)\n"

Printed prompt:

Task: Search for doors, click doors on sale and filter by brands "Mastercraft".
History steps: 
0. CLICK(box=[[352,102,786,139]], element_info='Search')	Left click on the search box located in the middle top of the screen next to the Menards logo.
1. TYPE(box=[[352,102,786,139]], text='doors', element_info='Search')	In the search input box at the top, type 'doors'.
2. CLICK(box=[[787,102,809,139]], element_info='SEARCH')	Left click on the magnifying glass icon next to the search bar to perform the search.
3. SCROLL_DOWN(box=[[0,209,998,952]], step_count=5, element_info='[None]')	Scroll down the page to see the available doors.
4. CLICK(box=[[280,708,710,809]], element_info='Doors on Sale')	Click the "Doors On Sale" button in the middle of the page to view the doors that are currently on sale.
(Platform: WIN)
(Answer in Action-Operation format.)

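To learn more about the meaning and representation of each field, read on or refer to the "Prompt Concatenation" section of the Practical Guide (in Chinese).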

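task field
The user's task description, written as prompt-like text. This input tells the cogagent-9b-20241220 model how to carry out the user's request. Keep it concise and clear.
platform field
cogagent-9b-20241220 supports agent operations on several platforms with graphical interfaces. Three systems are currently supported:
- Windows 10, 11: use the WIN field.
- macOS 14, 15: use the Mac field.
- Android 13, 14, 15 (and other Android UI variants with similar GUI operations): use the Mobile field.
If your system is not among these, results may be suboptimal. You can try Mobile for mobile devices, WIN for Windows, or Mac for Mac.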
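
The minimal example earlier calls an `identify_os()` helper without showing it. One possible sketch, assuming detection through the standard-library `platform.system()` call, is given below. The optional `system_name` parameter and the mapping table are illustrative, not taken from the repository. Android is not detected here: the assumption is that the caller passes "Mobile" directly when driving a phone.

```python
import platform
from typing import Optional

# platform.system() values mapped to the case-sensitive strings the model expects.
_SYSTEM_TO_PLATFORM = {"Darwin": "Mac", "Windows": "WIN"}


def identify_os(system_name: Optional[str] = None) -> str:
    """Return "Mac" or "WIN" for the current (or given) system name.

    Raises ValueError for anything else, because the model was only
    described as supporting Windows, macOS, and Android ("Mobile").
    """
    name = platform.system() if system_name is None else system_name
    if name not in _SYSTEM_TO_PLATFORM:
        raise ValueError(
            f"Unsupported system {name!r}; use 'Mobile' explicitly for Android devices"
        )
    return _SYSTEM_TO_PLATFORM[name]
```

Raising on unknown systems, rather than silently falling back, keeps an unsupported desktop from being reported to the model as Windows or macOS.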
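format field
The format in which the user wants cogagent-9b-20241220 to return data. Several options are available:
- Answer in Action-Operation-Sensitive format.: the default return type of the demo in this repository. Returns the model's action, the corresponding operation, and the sensitivity level.
- Answer in Status-Plan-Action-Operation format.: returns the model's status, plan, and the corresponding operation.
- Answer in Status-Action-Operation-Sensitive format.: returns the model's status, action, the corresponding operation, and the sensitivity level.
- Answer in Status-Action-Operation format.: returns the model's status, action, and the corresponding operation.
- Answer in Action-Operation format.: returns the model's action and the corresponding operation.
history field
The fields should be concatenated in the following order:
query = f'{task}{history}{platform}{format}'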
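
The field rules above can be collected into one helper that validates the platform and format values before concatenating. This is a sketch built from the strings documented in this section; the function name `build_query` and its signature are illustrative and are not the repository's API.

```python
ALLOWED_PLATFORMS = {"Mac", "WIN", "Mobile"}
ALLOWED_FORMATS = {
    "Action-Operation-Sensitive",
    "Status-Plan-Action-Operation",
    "Status-Action-Operation-Sensitive",
    "Status-Action-Operation",
    "Action-Operation",
}


def build_query(task, history, platform, answer_format="Action-Operation-Sensitive"):
    """Concatenate task, history, platform, and format in the documented order.

    `history` is a list of (grounded_operation, action) string pairs, oldest first.
    """
    if platform not in ALLOWED_PLATFORMS:
        raise ValueError(f"platform must be one of {sorted(ALLOWED_PLATFORMS)}")
    if answer_format not in ALLOWED_FORMATS:
        raise ValueError(f"answer_format must be one of {sorted(ALLOWED_FORMATS)}")

    # "History steps: " is always present, even when the history is empty.
    history_str = "\nHistory steps: "
    for index, (grounded_operation, action) in enumerate(history):
        history_str += f"\n{index}. {grounded_operation}\t{action}"  # 0-based index

    return (
        f"Task: {task}{history_str}\n"
        f"(Platform: {platform})\n"
        f"(Answer in {answer_format} format.)\n"
    )
```

Calling `build_query("Please mark all my emails as read", [], "Mac")` yields a prompt with an empty history section followed by the platform and format lines.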
Continue field
CogAgent lets users ask the model to continue its answer. To do so, append the [Continue]\n field after {task}. The concatenation order and result then look like this:
query = f'{task}[Continue]\n{history}{platform}{format}'

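Model Output

Sensitive operations: either <<敏感操作>> ("sensitive operation") or <<一般操作>> ("general operation"). Returned only when you request a Sensitive format.
Plan, Status, Action fields: describe the model's behavior and operations. Each is returned only when you request it. For example, if the format includes Action, the model returns the Action field.
General answer section: a summary that appears before the formatted answer.
Grounded Operation field:
Describes the model's concrete operation, including its location, type, and details. The box attribute gives the coordinate region of the operation, element_type gives the element type, and element_info describes the element. These details are wrapped in an "operation instruction". For the definition of the action space, see the project repository.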
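
A response in these formats can be split into fields with a few regular expressions. The sketch below is one way to do it, assuming each field sits on its own line as in the examples in this document. The function names are illustrative, and the argument parser only handles single-quoted strings without embedded apostrophes and plain integers.

```python
import re

# Literal tags emitted by the model, mapped to English labels.
SENSITIVITY_TAGS = {"<<敏感操作>>": "sensitive", "<<一般操作>>": "general"}
FIELD_RE = re.compile(r"^(Status|Plan|Action|Grounded Operation):\s*(.*)$")
OPERATION_RE = re.compile(r"^(\w+)\((.*)\)$", re.S)
BOX_RE = re.compile(r"box=\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]")
KWARG_RE = re.compile(r"(\w+)=(?:'([^']*)'|(\d+))")


def parse_response(text):
    """Collect the named fields and the sensitivity tag; other lines are ignored."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if line in SENSITIVITY_TAGS:
            result["sensitivity"] = SENSITIVITY_TAGS[line]
            continue
        match = FIELD_RE.match(line)
        if match:
            result[match.group(1)] = match.group(2)
    return result


def parse_operation(operation):
    """Split e.g. CLICK(box=[[...]], element_info='...') into name, box, and args."""
    match = OPERATION_RE.match(operation.strip())
    if match is None:
        raise ValueError(f"Not a recognizable operation: {operation!r}")
    name, arg_text = match.groups()
    box_match = BOX_RE.search(arg_text)
    box = tuple(int(v) for v in box_match.groups()) if box_match else None
    args = {}
    for kwarg in KWARG_RE.finditer(arg_text):
        key, string_value, int_value = kwarg.groups()
        args[key] = string_value if string_value is not None else int(int_value)
    return {"name": name, "box": box, "args": args}
```

Keeping `parse_response` and `parse_operation` separate lets a caller log the raw Grounded Operation string for the next prompt's history while executing the structured form.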

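Example

Suppose the user wants to mark all emails as read. The user is on a Mac and wants the model to answer in Action-Operation-Sensitive format. The correctly concatenated prompt is: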

Task: Please mark all my emails as read
History steps:
(Platform: Mac)
(Answer in Action-Operation-Sensitive format.)

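Note: even when there are no history steps, "History steps:" must still be included in the prompt. Example outputs for the different format requirements follow: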

Answer in Action-Operation-Sensitive format

Action: Click the 'Mark all as read' button in the top toolbar of the page to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable text', element_info='Mark all emails as read')
<<一般操作>>
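
To act on a box such as the one above, a client has to turn it into screen coordinates. The sketch below assumes that a box is `[x1, y1, x2, y2]` on a grid normalized to 1000 units per axis. The example values in this document (such as 998) are consistent with that, but this section does not state it, so verify against the action space definition before relying on it. The function name and the `grid` parameter are illustrative.

```python
def box_center_pixels(box, screen_width, screen_height, grid=1000):
    """Map the center of a normalized box to integer pixel coordinates.

    Assumption: box is (x1, y1, x2, y2) in units of 1/grid of the screen size.
    """
    x1, y1, x2, y2 = box
    center_x = (x1 + x2) / 2 / grid * screen_width
    center_y = (y1 + y2) / 2 / grid * screen_height
    return round(center_x), round(center_y)
```

Under that assumption, the box `[[219,186,311,207]]` on a 1920x1080 screen would place the click near pixel (509, 212).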

Answer in Status-Plan-Action-Operation format

Status: Currently in the email interface [[0, 2, 998, 905]], with the email categories on the left [[1, 216, 144, 570]], and the inbox in the center [[144, 216, 998, 903]]. The "Mark all as read" button has been clicked [[223, 178, 311, 210]].
Plan: Future tasks: 1. Click the 'Mark all as read' button; 2. Task complete.
Action: Click the "Mark all as read" button at the top center of the inbox page to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable text', element_info='Mark all emails as read')

Answer in Status-Action-Operation-Sensitive format

Status: Currently in the email interface [[0, 2, 998, 905]], with the email categories on the left [[1, 216, 144, 570]], and the inbox in the center [[144, 216, 998, 903]]. The "Mark all as read" button has been clicked [[223, 178, 311, 210]].
Action: Click the "Mark all as read" button at the top center of the inbox page to mark all emails as read.
Grounded Operation: CLICK(box=[[219,186,311,207]], element_type='Clickable text', element_info='Mark all emails as read')
<<一般操作>>
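
Because the model keeps no dialogue state, a client has to carry the execution history itself and rebuild the prompt for every step. The loop below sketches that bookkeeping. `step_fn` is a hypothetical stand-in for whatever runs one inference in a fresh session (taking the screenshot and executing the operation are omitted), and stopping when no Grounded Operation is returned is an assumption of this sketch, not documented behavior.

```python
def extract_field(response, name):
    """Return the text after 'name: ' on the first line that starts with it, else None."""
    prefix = f"{name}: "
    for line in response.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


def run_task(task, platform, step_fn, max_steps=10):
    """Run up to max_steps steps, feeding each step's result into the next prompt.

    step_fn(query) -> response text (hypothetical inference callable).
    Returns the accumulated history and the list of prompts that were sent.
    """
    history = []  # (grounded_operation, action) pairs, oldest first
    queries = []
    for _ in range(max_steps):
        history_str = "\nHistory steps: "
        for index, (operation, action) in enumerate(history):
            history_str += f"\n{index}. {operation}\t{action}"
        query = (
            f"Task: {task}{history_str}\n"
            f"(Platform: {platform})\n"
            "(Answer in Action-Operation-Sensitive format.)\n"
        )
        queries.append(query)

        response = step_fn(query)  # each call is a new session; only `query` carries state
        operation = extract_field(response, "Grounded Operation")
        action = extract_field(response, "Action")
        if operation is None or action is None:
            break  # assumption: nothing left to execute
        history.append((operation, action))
    return history, queries
```

The history entries are stored as the raw strings returned by the model, so they can be written back into the prompt unchanged.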

Answer in Status-Action-Operation format

Status: Currently in the email interface [[0, 2, 998, 905]], with the email categories on the left [[1, 216,

Project URL: https://github.com/THUDM/CogAgent
