LLaMA-Omni2: an open-source research project for end-to-end multimodal interaction

cipher · 2026-05-23 11:00:21

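🦙🎧 LLaMA-Omni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis

Authors: Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng*

LLaMA-Omni 2 is a series of speech-language models built on the Qwen2.5-0.5B/1.5B/3B/7B/14B/32B-Instruct models. Like LLaMA-Omni, it generates text and speech responses simultaneously, enabling high-quality, low-latency speech interaction. With its newly introduced streaming autoregressive speech decoder, LLaMA-Omni 2 achieves significantly higher speech quality than LLaMA-Omni.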

🔥 News

[25/05] LLaMA-Omni 2 has been accepted to the ACL 2025 main conference!

Install

Clone this repository.

git clone https://github.com/ictnlp/LLaMA-Omni2
cd LLaMA-Omni2

Install the dependencies.

conda create -n llama-omni2 python=3.10
conda activate llama-omni2
pip install -e .

Quick Start

Download the Whisper-large-v3 model.

import whisper
model = whisper.load_model("large-v3", download_root="models/speech_encoder/")

Download the flow-matching model and vocoder of CosyVoice 2.

huggingface-cli download --resume-download ICTNLP/cosy2_decoder --local-dir models/cosy2_decoder

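Tip: If your connection to Hugging Face is unstable from mainland China, you can try setting the following environment variable:

export HF_ENDPOINT=https://hf-mirror.com

Download the LLaMA-Omni2 series models from Hugging Face. LLaMA-Omni2-0.5B/1.5B/3B/7B/14B support English only, while LLaMA-Omni2-0.5B/1.5B/3B/7B/14B/32B-Bilingual support both English and Chinese.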

model_name=LLaMA-Omni2-7B-Bilingual
huggingface-cli download --resume-download ICTNLP/$model_name --local-dir models/$model_name

LLaMA-Omni2	LLaMA-Omni2-Bilingual
🤗 LLaMA-Omni2-0.5B	🤗 LLaMA-Omni2-0.5B-Bilingual
🤗 LLaMA-Omni2-1.5B	🤗 LLaMA-Omni2-1.5B-Bilingual
🤗 LLaMA-Omni2-3B	🤗 LLaMA-Omni2-3B-Bilingual
🤗 LLaMA-Omni2-7B	🤗 LLaMA-Omni2-7B-Bilingual
🤗 LLaMA-Omni2-14B	🤗 LLaMA-Omni2-14B-Bilingual
-	🤗 LLaMA-Omni2-32B-Bilingual
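
The naming rule in the table above can be captured in a small helper. The following is a minimal sketch and is not part of the repository: `model_name` and `download` are hypothetical names, and the sketch assumes the `huggingface_hub` package (the one that provides `huggingface-cli`) is installed. It uses `snapshot_download` as a Python counterpart to the CLI command shown earlier.

```python
import os

# Sizes listed in the table above; 32B is only listed as a Bilingual checkpoint.
ENGLISH_SIZES = ("0.5B", "1.5B", "3B", "7B", "14B")
BILINGUAL_SIZES = ENGLISH_SIZES + ("32B",)


def model_name(size: str, bilingual: bool = False) -> str:
    """Return the checkpoint name for a size, or raise if it is not listed."""
    sizes = BILINGUAL_SIZES if bilingual else ENGLISH_SIZES
    if size not in sizes:
        kind = "Bilingual" if bilingual else "English-only"
        raise ValueError(f"No {kind} checkpoint listed for size {size!r}")
    suffix = "-Bilingual" if bilingual else ""
    return f"LLaMA-Omni2-{size}{suffix}"


def download(size: str, bilingual: bool = False, root: str = "models") -> str:
    """Download one checkpoint into root/<model_name> and return that path."""
    # Imported lazily so the name helper works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    name = model_name(size, bilingual)
    local_dir = os.path.join(root, name)
    snapshot_download(repo_id=f"ICTNLP/{name}", local_dir=local_dir)
    return local_dir
```

For example, `download("7B", bilingual=True)` would fetch `ICTNLP/LLaMA-Omni2-7B-Bilingual` into `models/LLaMA-Omni2-7B-Bilingual`, while `model_name("32B")` raises a `ValueError` because no English-only 32B checkpoint is listed.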

Gradio Demo

Launch a controller.

python -m llama_omni2.serve.controller --host 0.0.0.0 --port 10000

Launch a Gradio web server.

python -m llama_omni2.serve.gradio_web_server --controller http://localhost:10000 --port 8000 --vocoder-dir models/cosy2_decoder

Launch a model worker.

python -m llama_omni2.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path models/$model_name --model-name $model_name

Visit http://localhost:8000/ and interact with LLaMA-Omni2!
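
The three demo processes share the controller URL and ports, so it can help to generate their commands from one place. The sketch below is only an illustration under assumptions: `demo_commands` and `launch` are hypothetical helpers that are not in the repository, and the flags are copied from the commands above. `launch` starts all three processes at once and does not wait for the controller to become ready, so a short delay or retry may be needed in practice.

```python
import subprocess
import sys


def demo_commands(model_name, controller_port=10000, web_port=8000, worker_port=40000):
    """Return the three launch commands, in start order, as argv lists."""
    controller_url = f"http://localhost:{controller_port}"
    py = [sys.executable, "-m"]
    controller = py + ["llama_omni2.serve.controller",
                       "--host", "0.0.0.0", "--port", str(controller_port)]
    web = py + ["llama_omni2.serve.gradio_web_server",
                "--controller", controller_url, "--port", str(web_port),
                "--vocoder-dir", "models/cosy2_decoder"]
    worker = py + ["llama_omni2.serve.model_worker",
                   "--host", "0.0.0.0", "--controller", controller_url,
                   "--port", str(worker_port),
                   "--worker", f"http://localhost:{worker_port}",
                   "--model-path", f"models/{model_name}",
                   "--model-name", model_name]
    return [controller, web, worker]


def launch(model_name):
    """Start all three processes and wait for them; Ctrl+C stops them."""
    procs = [subprocess.Popen(cmd) for cmd in demo_commands(model_name)]
    try:
        for p in procs:
            p.wait()
    except KeyboardInterrupt:
        pass
    finally:
        # Make sure no process is left running when this script exits.
        for p in procs:
            p.terminate()
```

Calling `launch("LLaMA-Omni2-7B-Bilingual")` from the repository root would start the controller, the web server, and the worker with the same ports as in the commands above.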

Local Inference

output_dir=examples/$model_name
mkdir -p $output_dir

python llama_omni2/inference/run_llama_omni2.py \
    --model_path models/$model_name \
    --question_file examples/questions.json \
    --answer_file $output_dir/answers.jsonl \
    --temperature 0 \
    --s2s

python llama_omni2/inference/run_cosy2_decoder.py \
    --input-path $output_dir/answers.jsonl \
    --output-dir $output_dir/wav \
    --lang en
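
Before running the second step, it can be useful to check what the first step wrote to `answers.jsonl`. The field names of that file are not documented here, so the sketch below makes no assumption about them: it only parses the file as JSON Lines and reports the record count and the top-level keys it finds. `load_jsonl` and `summarize` are hypothetical helper names.

```python
import json


def load_jsonl(path):
    """Parse a JSON Lines file into a list of objects, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as err:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({err})") from err
    return records


def summarize(records):
    """Return the record count and the sorted union of top-level keys."""
    keys = set()
    for rec in records:
        if isinstance(rec, dict):
            keys.update(rec)
    return len(records), sorted(keys)
```

For example, `summarize(load_jsonl("examples/LLaMA-Omni2-7B-Bilingual/answers.jsonl"))` returns how many answers were generated and which fields each record carries.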

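License

Our code is released under the Apache-2.0 License. The model is intended for academic research purposes only and may not be used for commercial purposes.

You are free to use, modify, and distribute this model in academic settings, provided that the following conditions are met:

Non-commercial use: The model may not be used for any commercial purposes.
Citation: If you use this model in your research, please cite the original work.

Commercial Use Restriction

For commercial use inquiries or to obtain a commercial license, please contact fengyang@ict.ac.cn.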

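Acknowledgements

CosyVoice 2: We use the pretrained speech tokenizer, flow-matching model, and vocoder of CosyVoice 2.
SLAM-LLM: We borrow some code for the speech encoder and speech adaptor.

Citation

If you have any questions, feel free to submit an issue or contact fangqingkai21b@ict.ac.cn.

If our work is useful to you, please cite it as follows: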

@inproceedings{
  fang2025llamaomni2,
  title={{LL}a{MA}-{O}mni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis},
  author={Fang, Qingkai and Zhou, Yan and Guo, Shoutao and Zhang, Shaolei and Feng, Yang},
  booktitle = {Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}

@inproceedings{
  fang2025llamaomni,
  title={{LL}a{MA}-{O}mni: Seamless Speech Interaction with Large Language Models},
  author={Qingkai Fang and Shoutao Guo and Yan Zhou and Zhengrui Ma and Shaolei Zhang and Yang Feng},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=PYmrUQmMEw}
}

Project page: https://github.com/ictnlp/LLaMA-Omni2
