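Fun-CosyVoice 3.0: Demos; Paper; ModelScope; HuggingFace; CV3-Eval
CosyVoice 2.0: Demos; Paper; ModelScope; HuggingFace
CosyVoice 1.0: Demos; Paper; ModelScope; HuggingFace
Fun-CosyVoice 3.0 is an advanced text-to-speech (TTS) system based on large language models (LLMs). It surpasses its predecessor, CosyVoice 2.0, in content consistency, speaker similarity, and prosody naturalness, and it is designed for zero-shot multilingual speech synthesis in the wild.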
[x] 2025/12
[x] 2025/08
[x] 2025/07
[x] 2025/05
[x] 2024/12
[x] 2024/09
[x] 2024/08
[x] 2024/07
WeTextProcessing support when ttsfrd is not available

| Model | Open-Source | Model Size | test-zh CER (%) ↓ | test-zh SS (%) ↑ | test-en WER (%) ↓ | test-en SS (%) ↑ | test-hard CER (%) ↓ | test-hard SS (%) ↑ |
|---|---|---|---|---|---|---|---|---|
| Human | - | - | 1.26 | 75.5 | 2.14 | 73.4 | - | - |
| Seed-TTS | ❌ | - | 1.12 | 79.6 | 2.25 | 76.2 | 7.59 | 77.6 |
| MiniMax-Speech | ❌ | - | 0.83 | 78.3 | 1.65 | 69.2 | - | - |
| F5-TTS | ✅ | 0.3B | 1.52 | 74.1 | 2.00 | 64.7 | 8.67 | 71.3 |
| Spark TTS | ✅ | 0.5B | 1.20 | 66.0 | 1.98 | 57.3 | - | - |
| CosyVoice2 | ✅ | 0.5B | 1.45 | 75.7 | 2.57 | 65.9 | 6.83 | 72.4 |
| FireRedTTS2 | ✅ | 1.5B | 1.14 | 73.2 | 1.95 | 66.5 | - | - |
| Index-TTS2 | ✅ | 1.5B | 1.03 | 76.5 | 2.23 | 70.6 | 7.12 | 75.5 |
| VibeVoice-1.5B | ✅ | 1.5B | 1.16 | 74.4 | 3.04 | 68.9 | - | - |
| VibeVoice-Realtime | ✅ | 0.5B | - | - | 2.05 | 63.3 | - | - |
| HiggsAudio-v2 | ✅ | 3B | 1.50 | 74.0 | 2.44 | 67.7 | - | - |
| VoxCPM | ✅ | 0.5B | 0.93 | 77.2 | 1.85 | 72.9 | 8.87 | 73.0 |
| GLM-TTS | ✅ | 1.5B | 1.03 | 76.1 | - | - | - | - |
| GLM-TTS RL | ✅ | 1.5B | 0.89 | 76.4 | - | - | - | - |
| Fun-CosyVoice3-0.5B-2512 | ✅ | 0.5B | 1.21 | 78.0 | 2.24 | 71.8 | 6.71 | 75.8 |
| Fun-CosyVoice3-0.5B-2512_RL | ✅ | 0.5B | 0.81 | 77.4 | 1.68 | 69.5 | 5.44 | 75.0 |
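
To put the table in perspective, the relative error reduction of Fun-CosyVoice3-0.5B-2512_RL over CosyVoice2 can be computed directly from the two rows above. The snippet below is a small illustrative sketch; the helper names `relative_reduction` and `summarize` are ours and are not part of the CosyVoice codebase.

```python
# Error rates (%) copied from the CosyVoice2 and
# Fun-CosyVoice3-0.5B-2512_RL rows of the table above.
BASELINE = "CosyVoice2"
CANDIDATE = "Fun-CosyVoice3-0.5B-2512_RL"

ERROR_RATES = {
    "test-zh CER": {BASELINE: 1.45, CANDIDATE: 0.81},
    "test-en WER": {BASELINE: 2.57, CANDIDATE: 1.68},
    "test-hard CER": {BASELINE: 6.83, CANDIDATE: 5.44},
}


def relative_reduction(baseline: float, new: float) -> float:
    """Return how much lower `new` is than `baseline`, in percent."""
    return (baseline - new) / baseline * 100.0


def summarize(rates=ERROR_RATES) -> dict:
    """Map each metric to the relative reduction, rounded to one decimal."""
    return {
        metric: round(relative_reduction(row[BASELINE], row[CANDIDATE]), 1)
        for metric, row in rates.items()
    }


if __name__ == "__main__":
    for metric, pct in summarize().items():
        print(f"{metric}: {pct}% lower than {BASELINE}")
```

On these figures, the RL model's error rates are roughly 44%, 35%, and 20% lower than CosyVoice2 on test-zh, test-en, and test-hard respectively.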
Clone the repo:
``` sh
git clone --recursive https://github.com/FunAudioLLM/CosyVoice.git
# If cloning the submodules fails due to network issues, rerun the following commands until they succeed
cd CosyVoice
git submodule update --init --recursive
```
Install Conda: see https://docs.conda.io/en/latest/miniconda.html
Create a Conda environment:
``` sh
conda create -n cosyvoice -y python=3.10
conda activate cosyvoice
pip install -r requirements.txt -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# Install sox with the command that matches your distribution
# Ubuntu/Debian
sudo apt-get install sox libsox-dev
# CentOS/RHEL
sudo yum install sox sox-devel
```
We strongly recommend downloading our pretrained Fun-CosyVoice3-0.5B, CosyVoice2-0.5B, CosyVoice-300M, CosyVoice-300M-SFT, and CosyVoice-300M-Instruct models, as well as the CosyVoice-ttsfrd resource.
``` python
# Download the models with the ModelScope SDK
from modelscope import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('iic/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('iic/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('iic/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('iic/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
snapshot_download('iic/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```
``` python
# Users outside China can download the models with the Hugging Face SDK instead
from huggingface_hub import snapshot_download
snapshot_download('FunAudioLLM/Fun-CosyVoice3-0.5B-2512', local_dir='pretrained_models/Fun-CosyVoice3-0.5B')
snapshot_download('FunAudioLLM/CosyVoice2-0.5B', local_dir='pretrained_models/CosyVoice2-0.5B')
snapshot_download('FunAudioLLM/CosyVoice-300M', local_dir='pretrained_models/CosyVoice-300M')
snapshot_download('FunAudioLLM/CosyVoice-300M-SFT', local_dir='pretrained_models/CosyVoice-300M-SFT')
snapshot_download('FunAudioLLM/CosyVoice-300M-Instruct', local_dir='pretrained_models/CosyVoice-300M-Instruct')
snapshot_download('FunAudioLLM/CosyVoice-ttsfrd', local_dir='pretrained_models/CosyVoice-ttsfrd')
```
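
Before running inference, it can help to confirm which of the downloads above actually landed on disk. The following is a minimal sketch that assumes the `pretrained_models/` layout used by the download commands; `missing_models` is a hypothetical helper, not something shipped with CosyVoice.

```python
from pathlib import Path

# Local directory names used as `local_dir` in the download commands above.
MODEL_DIRS = [
    "Fun-CosyVoice3-0.5B",
    "CosyVoice2-0.5B",
    "CosyVoice-300M",
    "CosyVoice-300M-SFT",
    "CosyVoice-300M-Instruct",
    "CosyVoice-ttsfrd",
]


def missing_models(root="pretrained_models", names=MODEL_DIRS):
    """Return the names whose directory is absent or empty under `root`."""
    root_path = Path(root)
    missing = []
    for name in names:
        model_dir = root_path / name
        # An empty directory usually means the download never completed.
        if not model_dir.is_dir() or not any(model_dir.iterdir()):
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_models()
    print("Missing models:", ", ".join(absent) if absent else "none")
```

The check only looks for non-empty directories; it does not verify file integrity, so a partially downloaded model would still pass.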
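Optionally, you can unzip the ttsfrd resource and install the ttsfrd package for better text normalization performance.
Note that this step is not required. If you do not install the ttsfrd package, wetext is used by default.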
``` sh
cd pretrained_models/CosyVoice-ttsfrd/
unzip resource.zip -d .
pip install ttsfrd_dependency-0.1-py3-none-any.whl
pip install ttsfrd-0.4.2-cp310-cp310-linux_x86_64.whl
```
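
The optional-dependency behavior described above (prefer ttsfrd, fall back to wetext) can be sketched as follows. This is not CosyVoice's actual frontend code: the function `pick_text_frontend` is hypothetical, and the import names `ttsfrd` and `wetext` are assumed to match the package names mentioned in the text.

```python
import importlib.util


def pick_text_frontend(candidates=("ttsfrd", "wetext")):
    """Return the first importable top-level module name, or None.

    The default order mirrors the preference described above:
    ttsfrd when it is installed, otherwise wetext.
    """
    for name in candidates:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is not None:
            return name
    return None


if __name__ == "__main__":
    frontend = pick_text_frontend()
    print("Text normalization backend:", frontend or "none installed")
```

Checking with `find_spec` avoids importing either package just to learn whether it is available.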
We strongly recommend using Fun-CosyVoice3-0.5B for better performance.
See the code in example.py for detailed usage of each model.
``` sh
python example.py
```
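CosyVoice2/3 now supports vLLM 0.11.x+ (V1 engine) and vLLM 0.9.0 (legacy).
Older vLLM versions (<0.9.0) do not support CosyVoice inference, and intermediate versions (e.g. 0.10.x) have not been tested.
Note that vLLM has many specific requirements. To avoid corrupting your existing environment in case your hardware does not support vLLM, you can create a new environment.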
``` sh
conda create -n cosyvoice_vllm --clone cosyvoice
conda activate cosyvoice_vllm
# For vllm==0.9.0
pip install vllm==v0.9.0 transformers==4.51.3 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
# For vllm>=0.11.0
pip install vllm==v0.11.0 transformers==4.57.1 numpy==1.26.4 -i https://mirrors.aliyun.com/pypi/simple/ --trusted-host=mirrors.aliyun.com
python vllm_example.py
```
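
The version rules above can be expressed as a small check to run before launching vLLM inference. This is a sketch under stated assumptions: `classify_vllm_version` is a hypothetical helper, it parses only the leading numeric `major.minor.patch` part of the version string, and it treats 0.9.x patch releases other than 0.9.0 as untested because the text names only 0.9.0.

```python
import re
from importlib import metadata


def classify_vllm_version(version: str) -> str:
    """Classify a vLLM version string according to the support notes above."""
    match = re.match(r"v?(\d+)\.(\d+)(?:\.(\d+))?", version.strip())
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # A missing patch component is treated as 0.
    parsed = tuple(int(g) if g is not None else 0 for g in match.groups())
    if parsed >= (0, 11, 0):
        return "supported (V1 engine)"
    if parsed == (0, 9, 0):
        return "supported (legacy)"
    if parsed < (0, 9, 0):
        return "unsupported"
    return "untested"  # anything strictly between 0.9.0 and 0.11.0


if __name__ == "__main__":
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError:
        print("vllm is not installed in this environment")
    else:
        print(f"vllm {installed}: {classify_vllm_version(installed)}")
```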
You can use our web demo page to get familiar with CosyVoice quickly.
See the demo website for details.
``` sh
# Change the model to iic/CosyVoice-300M-SFT for sft inference, or to iic/CosyVoice-300M-Instruct for instruct inference
python3 webui.py --port 50000 --model_dir pretrained_models/CosyVoice-300M
```
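For advanced users, we provide training and inference scripts in examples/libritts.
Optionally, if you need to deploy CosyVoice as a service, you can run the following steps.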
``` sh
cd runtime/python
docker build -t cosyvoice:v1.0 .
# To use instruct inference, replace iic/CosyVoice-300M with iic/CosyVoice-300M-Instruct
# Using grpc
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/grpc && python3 server.py --port 50000 --max_conc 4 --model_dir iic/CosyVoice-300M && sleep infinity"
cd grpc && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
# Using fastapi
docker run -d --runtime=nvidia -p 50000:50000 cosyvoice:v1.0 /bin/bash -c "cd /opt/CosyVoice/CosyVoice/runtime/python/fastapi && python3 server.py --port 50000 --model_dir iic/CosyVoice-300M && sleep infinity"
cd fastapi && python3 client.py --port 50000 --mode <sft|zero_shot|cross_lingual|instruct>
```
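
Because the containers above are started in detached mode, the client may be launched before the server is listening. One way to handle this is to poll the published port first. The sketch below uses only the Python standard library; `wait_for_port` is a hypothetical helper, and the default port 50000 is taken from the commands above.

```python
import socket
import time


def wait_for_port(host="127.0.0.1", port=50000, timeout=60.0, interval=1.0):
    """Poll until a TCP connection to (host, port) succeeds.

    Returns True once the port accepts a connection, or False if
    `timeout` seconds pass without a successful connection.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Connection refused or timed out: retry until the deadline.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

For example, calling `wait_for_port(port=50000, timeout=120)` before running `client.py` blocks until the port is reachable. A successful connection shows only that something is accepting TCP connections on that port; it does not guarantee that the model has finished loading.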
Using TensorRT-LLM to accelerate the CosyVoice2 LLM gives up to a 4x speedup over the HuggingFace Transformers implementation.
Quick start:
``` sh
cd runtime/triton_trtllm
docker compose up -d
```
For more details, see here.
You can discuss directly on GitHub Issues.
You can also scan the QR code to join our official DingTalk chat group.

``` bibtex
@article{du2024cosyvoice,
title={Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens},
author={Du, Zhihao and Chen, Qian and Zhang, Shiliang and Hu, Kai and Lu, Heng and Yang, Yexin and Hu, Hangrui and Zheng, Siqi and Gu, Yue and Ma, Ziyang and others},
journal={arXiv preprint arXiv:2407.05407},
year={2024}
}
@article{du2024cosyvoice2,
title={Cosyvoice 2: Scalable streaming speech synthesis with large language models},
author={Du, Zhihao and Wang, Yuxuan and Chen, Qian and Shi, Xian and Lv, Xiang and Zhao, Tianyu and Gao, Zhifu and Yang, Yexin and Gao, Changfeng and Wang, Hui and others},
journal={arXiv preprint arXiv:2412.10117},
year={2024}
}
@article{du2025cosyvoice,
title={CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training},
author={Du, Zhihao and Gao, Changfeng and Wang, Yuxuan and Yu, Fan and Zhao, Tianyu and Wang, Hao and Lv, Xiang and Wang, Hui