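Authors: Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, Yang Feng*
LLaMA-Omni 2 is a series of speech-language models built on the Qwen2.5-0.5B/1.5B/3B/7B/14B/32B-Instruct models. Like LLaMA-Omni, it generates text and speech responses simultaneously, enabling high-quality, low-latency speech interaction. With a newly introduced streaming autoregressive speech decoder, LLaMA-Omni 2 achieves significantly higher speech quality than LLaMA-Omni.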

```shell
git clone https://github.com/ictnlp/LLaMA-Omni2
cd LLaMA-Omni2
conda create -n llama-omni2 python=3.10
conda activate llama-omni2
pip install -e .
```
Download the `Whisper-large-v3` model.
```python
import whisper
model = whisper.load_model("large-v3", download_root="models/speech_encoder/")
```
Download the flow-matching model and vocoder of `CosyVoice 2`.
```shell
huggingface-cli download --resume-download ICTNLP/cosy2_decoder --local-dir models/cosy2_decoder
```
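> [!Tip]
> If your connection to Hugging Face is unstable from within China, you can try setting the following environment variable:
> ```shell
> export HF_ENDPOINT=https://hf-mirror.com
> ```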
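The same setting can also be applied from Python, provided it happens before `huggingface_hub` is imported, since the library reads `HF_ENDPOINT` when it is first loaded. The helper name `use_hf_mirror` is hypothetical; this is only a minimal sketch.

```python
import os


def use_hf_mirror(endpoint: str = "https://hf-mirror.com") -> str:
    """Point Hugging Face downloads at a mirror unless HF_ENDPOINT is already set.

    Call this before importing huggingface_hub so the value is picked up.
    Returns the endpoint that is in effect afterwards.
    """
    # setdefault keeps any value the user has already exported in the shell.
    return os.environ.setdefault("HF_ENDPOINT", endpoint)
```

Calling `use_hf_mirror()` at the very top of a download script has the same effect as the `export` line, and it leaves an existing `HF_ENDPOINT` untouched.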
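Download the LLaMA-Omni2 series models from Hugging Face. `LLaMA-Omni2-0.5B/1.5B/3B/7B/14B` support English only, while `LLaMA-Omni2-0.5B/1.5B/3B/7B/14B/32B-Bilingual` support both English and Chinese.
```shell
model_name=LLaMA-Omni2-7B-Bilingual
huggingface-cli download --resume-download ICTNLP/$model_name --local-dir models/$model_name
```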
| LLaMA-Omni2 | LLaMA-Omni2-Bilingual |
|---|---|
| 🤗 LLaMA-Omni2-0.5B | 🤗 LLaMA-Omni2-0.5B-Bilingual |
| 🤗 LLaMA-Omni2-1.5B | 🤗 LLaMA-Omni2-1.5B-Bilingual |
| 🤗 LLaMA-Omni2-3B | 🤗 LLaMA-Omni2-3B-Bilingual |
| 🤗 LLaMA-Omni2-7B | 🤗 LLaMA-Omni2-7B-Bilingual |
| 🤗 LLaMA-Omni2-14B | 🤗 LLaMA-Omni2-14B-Bilingual |
| - | 🤗 LLaMA-Omni2-32B-Bilingual |
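As a rough sketch, the model names in the table can be checked before a download is started, which catches typos such as requesting a 32B English-only model. The helpers `available_models`, `resolve`, and `download` are hypothetical names introduced for illustration. `snapshot_download` is the `huggingface_hub` function for fetching a whole repository, and the `ICTNLP/` prefix and `models/` directory follow the download command used in this README.

```python
from __future__ import annotations

from pathlib import Path

# Sizes taken from the table: 32B exists only in the bilingual series.
SIZES_ENGLISH = ("0.5B", "1.5B", "3B", "7B", "14B")
SIZES_BILINGUAL = SIZES_ENGLISH + ("32B",)


def available_models() -> list[str]:
    """List every model name that appears in the table."""
    english = [f"LLaMA-Omni2-{size}" for size in SIZES_ENGLISH]
    bilingual = [f"LLaMA-Omni2-{size}-Bilingual" for size in SIZES_BILINGUAL]
    return english + bilingual


def resolve(model_name: str, root: str = "models") -> tuple[str, Path]:
    """Map a model name to (Hugging Face repo id, local directory)."""
    if model_name not in available_models():
        raise ValueError(f"unknown model: {model_name!r}")
    return f"ICTNLP/{model_name}", Path(root) / model_name


def download(model_name: str, root: str = "models") -> str:
    """Download the model; needs network access and huggingface_hub installed."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    repo_id, local_dir = resolve(model_name, root)
    return snapshot_download(repo_id=repo_id, local_dir=str(local_dir))
```

For example, `download("LLaMA-Omni2-7B-Bilingual")` would place the files under `models/LLaMA-Omni2-7B-Bilingual`, the path the later commands expect.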
Launch the controller:
```shell
python -m llama_omni2.serve.controller --host 0.0.0.0 --port 10000
```
Launch the Gradio web server:
```shell
python -m llama_omni2.serve.gradio_web_server --controller http://localhost:10000 --port 8000 --vocoder-dir models/cosy2_decoder
```
Launch the model worker:
```shell
python -m llama_omni2.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path models/$model_name --model-name $model_name
```
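For convenience, the three processes can be started from a single Python script instead of three terminals. The sketch below only rebuilds the command lines shown in this section. The function names and the `startup_delay` pause between launches are assumptions made for illustration, not part of the project.

```python
import subprocess
import sys
import time
from typing import List

CONTROLLER_URL = "http://localhost:10000"
WORKER_URL = "http://localhost:40000"


def controller_cmd() -> List[str]:
    """Command line for the controller (port 10000)."""
    return [sys.executable, "-m", "llama_omni2.serve.controller",
            "--host", "0.0.0.0", "--port", "10000"]


def web_server_cmd(vocoder_dir: str = "models/cosy2_decoder") -> List[str]:
    """Command line for the Gradio web server (port 8000)."""
    return [sys.executable, "-m", "llama_omni2.serve.gradio_web_server",
            "--controller", CONTROLLER_URL, "--port", "8000",
            "--vocoder-dir", vocoder_dir]


def worker_cmd(model_name: str) -> List[str]:
    """Command line for the model worker (port 40000)."""
    return [sys.executable, "-m", "llama_omni2.serve.model_worker",
            "--host", "0.0.0.0", "--controller", CONTROLLER_URL,
            "--port", "40000", "--worker", WORKER_URL,
            "--model-path", f"models/{model_name}",
            "--model-name", model_name]


def launch_all(model_name: str, startup_delay: float = 5.0) -> List[subprocess.Popen]:
    """Start controller, web server, and worker in that order.

    startup_delay is a guess at how long each service needs before the
    next one starts; adjust it for your machine.
    """
    processes = []
    for cmd in (controller_cmd(), web_server_cmd(), worker_cmd(model_name)):
        processes.append(subprocess.Popen(cmd))
        time.sleep(startup_delay)
    return processes
```

Calling `launch_all("LLaMA-Omni2-7B-Bilingual")` returns the `Popen` handles, so the caller can later `terminate()` each service.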
```shell
output_dir=examples/$model_name
mkdir -p $output_dir

# Generate responses for the questions in examples/questions.json
python llama_omni2/inference/run_llama_omni2.py \
    --model_path models/$model_name \
    --question_file examples/questions.json \
    --answer_file $output_dir/answers.jsonl \
    --temperature 0 \
    --s2s

# Synthesize waveforms from the generated answers with the CosyVoice 2 decoder
python llama_omni2/inference/run_cosy2_decoder.py \
    --input-path $output_dir/answers.jsonl \
    --output-dir $output_dir/wav \
    --lang en
```
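Before running the decoder, it can help to confirm that the answer file was written correctly. The sketch below assumes only that `answers.jsonl` is a JSON Lines file, as its extension suggests. The field names are not documented here, so the hypothetical helper `summarize_jsonl` simply reports whichever top-level keys it finds.

```python
import json
from collections import Counter
from pathlib import Path
from typing import Tuple


def summarize_jsonl(path: str) -> Tuple[int, Counter]:
    """Return (number of records, counts of top-level keys) for a JSON Lines file."""
    count = 0
    keys: Counter = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: invalid JSON ({exc.msg})") from exc
            count += 1
            if isinstance(record, dict):
                keys.update(record.keys())
    return count, keys
```

A record count that matches the number of questions, with the same keys on every line, is a quick sign that inference finished without truncation.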
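Our code is released under the Apache-2.0 License. Our model is intended for academic research purposes only and may not be used for commercial purposes.
You are free to use, modify, and distribute this model in academic settings, provided that the following conditions are met:
For any commercial use inquiries or to obtain a commercial license, please contact fengyang@ict.ac.cn.
If you have any questions, please feel free to submit an issue or contact fangqingkai21b@ict.ac.cn.
If our work is useful for you, please cite as: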
```bibtex
@inproceedings{fang2025llamaomni2,
  title={{LL}a{MA}-{O}mni 2: LLM-based Real-time Spoken Chatbot with Autoregressive Streaming Speech Synthesis},
  author={Fang, Qingkai and Zhou, Yan and Guo, Shoutao and Zhang, Shaolei and Feng, Yang},
  booktitle={Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics},
  year={2025}
}

@inproceedings{fang2025llamaomni,
  title={{LL}a{MA}-{O}mni: Seamless Speech Interaction with Large Language Models},
  author={Qingkai Fang and Shoutao Guo and Yan Zhou and Zhengrui Ma and Shaolei Zhang and Yang Feng},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
  url={https://openreview.net/forum?id=PYmrUQmMEw}
}
```