F5-TTS: an open-source speech synthesis project that balances naturalness and controllability

felicity · 2026-04-22 11:00:26

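F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

F5-TTS: Diffusion Transformer with ConvNeXt V2, faster to train and faster at inference.

E2 TTS: Flat-UNet Transformer, the closest reproduction of the original paper.

Sway Sampling: an inference-time strategy for sampling flow steps that significantly improves performance.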
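
As a rough illustration of what sampling flow steps means, the sketch below applies the sway mapping t = u + s·(cos(πu/2) - 1 + u) from the F5-TTS paper to uniformly spaced steps. The function names and the choice s = -1 are assumptions made here for illustration; the actual implementation lives in the repository and may differ in detail.

```python
import math


def sway_sample(u: float, s: float = -1.0) -> float:
    """Map a uniform flow step u in [0, 1] to a swayed step t in [0, 1].

    s is the sway coefficient (illustrative default; s = 0 leaves u unchanged).
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def sway_schedule(nfe: int, s: float = -1.0) -> list[float]:
    """Return nfe + 1 time points from 0 to 1 after applying the sway mapping."""
    return [sway_sample(i / nfe, s) for i in range(nfe + 1)]


# Compare a uniform schedule with a swayed one for a small step count.
print([round(i / 4, 3) for i in range(5)])
print([round(t, 3) for t in sway_schedule(4)])
```

With a negative s the time points cluster near t = 0, so more of the step budget is spent on the early part of the flow; the endpoints 0 and 1 stay fixed.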

Thanks to all the contributors!

Installation

Create a separate environment if needed

# Create a conda env with python_version>=3.10 (you could also use virtualenv)
conda create -n f5-tts python=3.11
conda activate f5-tts

# Install FFmpeg if you haven't yet
conda install ffmpeg

Install PyTorch matching your device

NVIDIA GPU

> ```bash
> # Install pytorch matching your CUDA version, e.g.
> pip install torch==2.8.0+cu128 torchaudio==2.8.0+cu128 --extra-index-url https://download.pytorch.org/whl/cu128
>
> # Earlier versions may also work, e.g.
> pip install torch==2.4.0+cu124 torchaudio==2.4.0+cu124 --extra-index-url https://download.pytorch.org/whl/cu124
> # etc.
> ```

AMD GPU

> ```bash
> # Install pytorch matching your ROCm version (Linux only), e.g.
> pip install torch==2.5.1+rocm6.2 torchaudio==2.5.1+rocm6.2 --extra-index-url https://download.pytorch.org/whl/rocm6.2
> ```

Intel GPU

> ```bash
> # Install pytorch matching your XPU version, e.g.
> # Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit must be installed
> pip install torch torchaudio --index-url https://download.pytorch.org/whl/test/xpu
>
> # Intel GPUs are also supported through IPEX (Intel® Extension for PyTorch)
> # IPEX does not require Intel® Deep Learning Essentials or Intel® oneAPI Base Toolkit
> # See: https://pytorch-extension.intel.com/installation?request=platform
> ```

Apple Silicon

> ```bash
> # Install stable pytorch, e.g.
> pip install torch torchaudio
> ```
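
Once PyTorch is installed, it can help to confirm which backend it actually sees before going further. The sketch below is a minimal check and not part of F5-TTS: `pick_device` is a hypothetical helper name, and the `torch.xpu` and `torch.backends.mps` lookups are guarded because older PyTorch builds may not expose them. ROCm builds report through `torch.cuda`, so an AMD GPU shows up as `cuda`.

```python
import importlib.util
from typing import Optional


def pick_device() -> Optional[str]:
    """Return the first available backend name, or None if torch is missing."""
    if importlib.util.find_spec("torch") is None:
        return None  # PyTorch is not installed in this environment

    import torch

    # NVIDIA (CUDA) and AMD (ROCm) builds both answer through torch.cuda.
    if torch.cuda.is_available():
        return "cuda"

    # Intel GPUs: torch.xpu only exists in newer PyTorch builds.
    xpu = getattr(torch, "xpu", None)
    if xpu is not None and xpu.is_available():
        return "xpu"

    # Apple Silicon: Metal Performance Shaders backend.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


print(pick_device())
```

If this prints `cpu` on a machine with a GPU, the installed wheel probably does not match the driver or toolkit version.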

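Then you can choose one of the following options:

1. As a pip package (if only for inference)

```bash
pip install f5-tts
```

2. Local editable install (if you also want to train or finetune)

```bash
git clone https://github.com/SWivid/F5-TTS.git
cd F5-TTS

git submodule update --init --recursive  # (optional, if using bigvgan as the vocoder)

pip install -e .
```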

Docker is also supported

# Build from the Dockerfile
docker build -t f5tts:v1 .

# Run from GitHub Container Registry
docker container run --rm -it --gpus=all --mount 'type=volume,source=f5-tts,target=/root/.cache/huggingface/hub/' -p 7860:7860 ghcr.io/swivid/f5-tts:main

# Quickstart if you only want to run the web interface (not the CLI)
docker container run --rm -it --gpus=all --mount 'type=volume,source=f5-tts,target=/root/.cache/huggingface/hub/' -p 7860:7860 ghcr.io/swivid/f5-tts:main f5-tts_infer-gradio --host 0.0.0.0

Runtime

A deployment solution built on Triton and TensorRT-LLM.

Benchmark Results

Decoding on a single L20 GPU, using 26 different prompt_audio and target_text pairs, 16 NFE.

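| Model | Concurrency | Avg Latency | RTF | Mode |
|---|---|---|---|---|
| F5-TTS Base (Vocos) | 2 | 253 ms | 0.0394 | Client-Server |
| F5-TTS Base (Vocos) | 1 (Batch_size) | - | 0.0402 | Offline TRT-LLM |
| F5-TTS Base (Vocos) | 1 (Batch_size) | - | 0.1467 | Offline Pytorch |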

See the detailed instructions for more information.
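
To read the RTF column, assume the conventional definition of real-time factor: processing time divided by the duration of the audio produced, so lower is faster and values below 1 are faster than real time. The sketch below works through the figures in the table; the helper names are illustrative and not part of F5-TTS.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: time spent synthesizing / duration of audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def synthesis_seconds(rtf: float, audio_seconds: float) -> float:
    """Estimated processing time for a clip of the given length."""
    return rtf * audio_seconds


# RTF values copied from the benchmark table.
RTF = {
    "client-server": 0.0394,
    "offline TRT-LLM": 0.0402,
    "offline PyTorch": 0.1467,
}

# Ratio of the two offline RTFs gives the relative speedup.
speedup = RTF["offline PyTorch"] / RTF["offline TRT-LLM"]
print(f"Offline TRT-LLM vs offline PyTorch: {speedup:.2f}x")
print(f"10 s of audio, offline TRT-LLM: {synthesis_seconds(RTF['offline TRT-LLM'], 10.0):.3f} s")
```

On these figures the offline TensorRT-LLM path is roughly 3.6 times faster than offline PyTorch on the same GPU.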

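Inference

To get the desired performance, take a moment to read the detailed guidance.
Issues are very helpful when you search them with the right keywords for the problem you hit.

1. Gradio App

Currently supported features:

- Basic TTS with chunk inference
- Multi-style / multi-speaker generation
- Voice chat powered by Qwen2.5-3B-Instruct
- Custom inference with support for more languages

# Launch a Gradio app (web interface)
f5-tts_infer-gradio

# Specify the port/host
f5-tts_infer-gradio --port 7860 --host 0.0.0.0

# Launch a share link
f5-tts_infer-gradio --share

Example docker compose file for NVIDIA devices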

services:
  f5-tts:
    image: ghcr.io/swivid/f5-tts:main
    ports:
      - "7860:7860"
    environment:
      GRADIO_SERVER_PORT: 7860
    entrypoint: ["f5-tts_infer-gradio", "--port", "7860", "--host", "0.0.0.0"]
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  f5-tts:
    driver: local

2. CLI Inference

# Run with flags
# Setting --ref_text to "" makes an ASR model transcribe the reference audio (uses extra GPU memory)
f5-tts_infer-cli --model F5TTS_v1_Base \
--ref_audio "provide_prompt_wav_path_here.wav" \
--ref_text "The content, subtitle or transcription of the reference audio." \
--gen_text "Some text you want the TTS model to generate for you."

# Run with default settings: src/f5_tts/infer/examples/basic/basic.toml
f5-tts_infer-cli
# Or with your own .toml file
f5-tts_infer-cli -c custom.toml

# Multi voice. See src/f5_tts/infer/README.md
f5-tts_infer-cli -c src/f5_tts/infer/examples/multi/story.toml
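
For several sentences with the same reference voice, the flag-based call can be generated from a script. The sketch below only builds and prints the commands, using the four flags shown in this section; `build_infer_command` is a hypothetical helper, and how output files are named per run is not covered here (see src/f5_tts/infer/README.md).

```python
import shlex


def build_infer_command(
    ref_audio: str,
    ref_text: str,
    gen_text: str,
    model: str = "F5TTS_v1_Base",
) -> list[str]:
    """Build an argument list for f5-tts_infer-cli from the flags shown above."""
    return [
        "f5-tts_infer-cli",
        "--model", model,
        "--ref_audio", ref_audio,
        "--ref_text", ref_text,  # pass "" to let the ASR model transcribe
        "--gen_text", gen_text,
    ]


# Placeholder inputs; replace with your own reference clip and text.
sentences = ["First sentence to synthesize.", "Second sentence to synthesize."]
commands = [
    build_infer_command(
        "provide_prompt_wav_path_here.wav",
        "The content, subtitle or transcription of the reference audio.",
        sentence,
    )
    for sentence in sentences
]

# Print shell-safe command lines for review before running anything.
for cmd in commands:
    print(shlex.join(cmd))
```

Passing the list form to `subprocess.run(cmd, check=True)` avoids shell quoting problems when the text contains quotes or spaces.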

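Training

1. With Hugging Face Accelerate

Refer to the training & finetuning guidance for best practice.

2. With Gradio App

# Quick start with the Gradio web interface
f5-tts_finetune-gradio

Read the training & finetuning guidance for more instructions.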

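Evaluation

Development

Use pre-commit to ensure code quality (it runs linters and formatters automatically):

pip install pre-commit
pre-commit install

When preparing a pull request, run this before each commit:

pre-commit run --all-files

Note: Some model components have linting exceptions for E722 to accommodate tensor notation.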

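Acknowledgements

- E2-TTS: brilliant work, simple and effective
- Emilia, WenetSpeech4TTS, LibriTTS, LJSpeech: valuable datasets
- lucidrains for the initial CFM structure, and bfs18 for discussion
- SD3 and Hugging Face diffusers for the DiT and MMDiT code structure
- torchdiffeq as the ODE solver; Vocos and BigVGAN as vocoders
- FunASR, faster-whisper, UniSpeech, SpeechMOS for evaluation tools
- ctc-forced-aligner for speech edit tests
- mrfakename for the Hugging Face Space demo
- f5-tts-mlx: an implementation on the MLX framework by Lucas Newman
- F5-TTS-ONNX: an ONNX Runtime version by DakeQQ
- Yuekai Zhang for Triton and TensorRT-LLM support

Citation

If our work and codebase are useful to you, please cite: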

@article{chen-etal-2024-f5tts,
      title={F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching}, 
      author={Yushen Chen and Zhikang Niu and Ziyang Ma and Keqi Deng and Chunhui Wang and Jian Zhao and Kai Yu and Xie Chen},
      journal={arXiv preprint arXiv:2410.06885},
      year={2024},
}

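License

Our code is released under the MIT License. The pre-trained models are licensed under CC-BY-NC because the training data, Emilia, is an in-the-wild dataset. We apologize for any inconvenience this may cause.

Project repository: https://github.com/SWivid/F5-TTS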
