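🦙🎧 LLaMA-Omni: Seamless Speech Interaction with Large Language Models

Authors: Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, Yang Feng*

LLaMA-Omni is a speech-language model built on Llama-3.1-8B-Instruct. It supports low-latency, high-quality speech interaction and generates text and speech responses simultaneously from speech instructions.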

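🔥 News

[25/05] LLaMA-Omni 2 was accepted to the ACL 2025 main conference!
[25/05] An improved version of InstructS2S-200K has been publicly released. We extended it to multi-turn conversations and increased the diversity of input speech timbres. Sorry for the wait!
[25/04] We released LLaMA-Omni2, a series of speech-language models ranging from 0.5B to 32B parameters, with improved response quality and speech generation quality.
[25/01] LLaMA-Omni was accepted to ICLR 2025! See you in Singapore!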

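💡 Highlights

💪 Built on Llama-3.1-8B-Instruct, ensuring high-quality responses.
🚀 Low-latency speech interaction, with latency as low as 226 ms.
🎧 Simultaneous generation of text and speech responses.
♻️ Trained in less than 3 days on only 4 GPUs.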

https://github.com/user-attachments/assets/2b097af8-47d7-494f-b3b3-6be17ca0247a

Installation

Clone this repository.

git clone https://github.com/ictnlp/LLaMA-Omni
cd LLaMA-Omni

Install the dependencies.

conda create -n llama-omni python=3.10
conda activate llama-omni
pip install pip==24.0
pip install -e .

Install fairseq.

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install -e . --no-build-isolation

Install flash-attention.

pip install flash-attn --no-build-isolation
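
Once the install steps have finished, it can be useful to confirm that the packages are visible to the active Python environment before moving on. The following is a minimal sketch, not part of the repository: the helper `missing_modules` is hypothetical, and the import names in `REQUIRED_MODULES` are assumptions (`flash_attn` is the import name of the `flash-attn` package, and `whisper` and `omni_speech` are the names used elsewhere in this README).

```python
import importlib.util

# Assumed import names for the packages installed above.
REQUIRED_MODULES = ["omni_speech", "fairseq", "flash_attn", "whisper"]


def missing_modules(names):
    """Return the module names that cannot be located in this environment."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec can raise for broken or partially installed packages.
            found = False
        if not found:
            missing.append(name)
    return missing
```

Calling `missing_modules(REQUIRED_MODULES)` inside the `llama-omni` conda environment should return an empty list if every package can be located. Note that this only checks that the modules are discoverable, not that they import and run correctly.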

Quick Start

Download the Llama-3.1-8B-Omni model from 🤗Huggingface.
Download the Whisper-large-v3 model.

import whisper
model = whisper.load_model("large-v3", download_root="models/speech_encoder/")

Download the unit-based HiFi-GAN vocoder.

wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/g_00500000 -P vocoder/
wget https://dl.fbaipublicfiles.com/fairseq/speech_to_speech/vocoder/code_hifigan/mhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj/config.json -P vocoder/
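
Before launching anything, a quick check that the downloads landed where the later commands expect them can save a failed start. The sketch below is illustrative only: `missing_assets` is a hypothetical helper, the paths are copied from the commands in this README, and the `Llama-3.1-8B-Omni` entry assumes the Hugging Face model was downloaded into a directory of that name in the repository root (matching the `--model-path` value used in the Gradio demo).

```python
from pathlib import Path

# Paths taken from the commands in this README. The first entry is an
# assumption about where the Hugging Face model was downloaded.
EXPECTED_PATHS = [
    "Llama-3.1-8B-Omni",
    "models/speech_encoder",
    "vocoder/g_00500000",
    "vocoder/config.json",
]


def missing_assets(root=".", expected=EXPECTED_PATHS):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in expected if not (root / p).exists()]
```

Running `missing_assets()` from the repository root returns the entries that still need to be downloaded; an empty list means all four paths exist. It does not verify file contents or checksums.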

Gradio Demo

Launch the controller.

python -m omni_speech.serve.controller --host 0.0.0.0 --port 10000

Launch the Gradio web server.

python -m omni_speech.serve.gradio_web_server --controller http://localhost:10000 --port 8000 --model-list-mode reload --vocoder vocoder/g_00500000 --vocoder-cfg vocoder/config.json

Launch a model worker.

python -m omni_speech.serve.model_worker --host 0.0.0.0 --controller http://localhost:10000 --port 40000 --worker http://localhost:40000 --model-path Llama-3.1-8B-Omni --model-name Llama-3.1-8B-Omni --s2s

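Visit http://localhost:8000/ and interact with LLaMA-3.1-8B-Omni!

Note: Because streaming audio playback in Gradio is unstable, we have implemented only streaming audio synthesis, without autoplay. If you have a good solution, feel free to submit a PR. Thanks!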
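
The three launch commands in this section share the controller address and port numbers, so it can help to build them from one place. The following is a rough sketch rather than anything shipped with the repository: `build_commands` and `launch_all` are hypothetical helpers, the flags and default values are copied from the commands above, `sys.executable` stands in for `python`, and the fixed delay between launches is an assumption (a crude way to give the controller time to start before the other processes contact it).

```python
import subprocess
import sys
import time


def build_commands(controller_port=10000, web_port=8000, worker_port=40000,
                   model_path="Llama-3.1-8B-Omni",
                   model_name="Llama-3.1-8B-Omni"):
    """Return the controller, web server, and worker commands as argv lists."""
    controller_url = f"http://localhost:{controller_port}"
    controller = [sys.executable, "-m", "omni_speech.serve.controller",
                  "--host", "0.0.0.0", "--port", str(controller_port)]
    web_server = [sys.executable, "-m", "omni_speech.serve.gradio_web_server",
                  "--controller", controller_url, "--port", str(web_port),
                  "--model-list-mode", "reload",
                  "--vocoder", "vocoder/g_00500000",
                  "--vocoder-cfg", "vocoder/config.json"]
    worker = [sys.executable, "-m", "omni_speech.serve.model_worker",
              "--host", "0.0.0.0", "--controller", controller_url,
              "--port", str(worker_port),
              "--worker", f"http://localhost:{worker_port}",
              "--model-path", model_path, "--model-name", model_name,
              "--s2s"]
    return [controller, web_server, worker]


def launch_all(commands, delay_seconds=10):
    """Start each command in order, sleeping between launches (assumed delay)."""
    processes = []
    for argv in commands:
        processes.append(subprocess.Popen(argv))
        time.sleep(delay_seconds)
    return processes
```

For example, `launch_all(build_commands())` starts the three processes in the same order as the manual steps and returns the `Popen` handles so they can be terminated later. Running the commands in separate terminals, as described above, remains the simplest way to see each process's logs.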

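Local Inference

To run inference locally, organize your speech instruction files following the format in the omni_speech/infer/examples directory, then refer to the following script.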

bash omni_speech/infer/run.sh omni_speech/infer/examples
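
When preparing your own input directory, one way to spot obvious layout differences is to compare it against the bundled examples by file type. This is only a sketch under stated assumptions: `summarize_directory` and `compare_layouts` are hypothetical helpers, they inspect file extensions only, and they do not validate file contents, so the examples directory itself remains the reference for the expected format.

```python
from collections import Counter
from pathlib import Path


def summarize_directory(directory):
    """Count the files under `directory` (recursively) by extension."""
    counts = Counter()
    for path in Path(directory).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return dict(counts)


def compare_layouts(reference_dir, candidate_dir):
    """Return extensions found in the reference but missing from the candidate."""
    reference = summarize_directory(reference_dir)
    candidate = summarize_directory(candidate_dir)
    return sorted(ext for ext in reference if ext not in candidate)
```

For instance, `compare_layouts("omni_speech/infer/examples", "my_inputs")` (where `my_inputs` is a placeholder for your own directory) lists file types that appear in the examples but not in your directory.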

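License

Our code is released under the Apache-2.0 License. Our model is intended for academic research only and may not be used for commercial purposes.

You are free to use, modify, and distribute this model in academic settings, provided that the following conditions are met:

Non-commercial use: The model may not be used for any commercial purpose.
Citation: If you use this model in your research, please cite the original work.

Commercial Use Restriction

For commercial use inquiries or to obtain a commercial license, please contact fengyang@ict.ac.cn.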

Acknowledgements

LLaVA: The codebase we built upon.
SLAM-LLM: We borrowed some code for the speech encoder and speech adaptor.

Citation

If you have any questions, please feel free to submit an issue or contact fangqingkai21b@ict.ac.cn.

If our work is useful to you, please cite it as:

@article{fang-etal-2024-llama-omni,
  title={LLaMA-Omni: Seamless Speech Interaction with Large Language Models},
  author={Fang, Qingkai and Guo, Shoutao and Zhou, Yan and Ma, Zhengrui and Zhang, Shaolei and Feng, Yang},
  journal={arXiv preprint arXiv:2409.06666},
  year={2024}
}

Project repository: https://github.com/ictnlp/LLaMA-Omni
