Megatron-LM: NVIDIA's Large-Model Training Framework

Megatron-LM and Megatron Core
=============================

A GPU-optimized library for training Transformer models at scale

[![Documentation](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://docs.nvidia.com/megatron-core/developer-guide/latest/index.html) [![version](https://img.shields.io/badge/release-0.15.0-green)](./CHANGELOG.md) [![license](https://img.shields.io/badge/license-Apache-blue)](./LICENSE)

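## About

This repository contains two components: **Megatron-LM** and **Megatron Core**.

**Megatron-LM** is a reference example that bundles Megatron Core with pre-configured training scripts. It is best suited to research teams, to learning distributed training, and to quick experimentation.

**Megatron Core** is a composable library of GPU-optimized building blocks for custom training frameworks. It provides Transformer building blocks, advanced parallelism strategies (TP, PP, DP, EP, CP), mixed-precision support (FP16, BF16, FP8, FP4), and model architectures. It is best suited to framework developers and machine learning engineers who build custom training pipelines.

**[Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)** provides bidirectional Hugging Face ↔ Megatron checkpoint conversion, along with production-ready recipes.

## Quick Start

Install Megatron Core with pip:

1. Install Megatron Core and its required dependencies:

    ```bash
    pip install --no-build-isolation megatron-core[mlm,dev]
    ```

2. Clone the repository to get the examples:

    ```bash
    git clone https://github.com/NVIDIA/Megatron-LM.git
    cd Megatron-LM
    pip install --no-build-isolation .[mlm,dev]
    ```

## Latest News

- **[January 2026]** **[Dynamic Context Parallelism](https://developer.nvidia.com/blog/speeding-up-variable-length-training-with-dynamic-context-parallelism-and-nvidia-megatron-core/)** - Up to 1.48x speedup for variable-length sequence training through adaptive CP sizing.
- **[December 2025]** **Megatron Core development has moved to GitHub!** All development and CI now happen in the open. Community contributions are welcome.
- **[October 2025]** **[Megatron dev branch](https://github.com/NVIDIA/Megatron-LM/tree/dev)** - An early-access branch with experimental features.
- **[October 2025]** **[Megatron Bridge](https://github.com/NVIDIA-NeMo/Megatron-Bridge)** - A bidirectional converter for interoperability between Hugging Face and Megatron checkpoints, with production-ready recipes for popular models.
- **[August 2025]** **[MoE Q3-Q4 2025 Roadmap](https://github.com/NVIDIA/Megatron-LM/issues/1729)** - A comprehensive roadmap for MoE features, including DeepSeek-V3, Qwen3, advanced parallelism strategies, FP8 optimizations, and Blackwell performance enhancements.
- **[August 2025]** **[GPT-OSS model](https://github.com/NVIDIA/Megatron-LM/issues/1739)** - Advanced features such as YaRN RoPE scaling, attention sinks, and custom activation functions are being integrated into Megatron Core.
- **[June 2025]** **[Megatron MoE Model Zoo](https://github.com/yanring/Megatron-MoE-ModelZoo)** - Best practices and optimized configurations for training DeepSeek-V3, Mixtral, and Qwen3 MoE models, with performance benchmarks and checkpoint conversion tools.
- **[May 2025]** Megatron Core v0.11.0 brings new capabilities for LLM training across data centers ([blog](https://developer.nvidia.com/blog/turbocharge-llm-training-across-long-haul-data-center-networks-with-nvidia-nemo-framework/)).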

Previous News

- **[July 2024]** Megatron Core v0.7 improves scalability and training resiliency and adds support for multimodal training ([blog](https://developer.nvidia.com/blog/train-generative-ai-models-more-efficiently-with-new-nvidia-Megatron-Core-functionalities/)).
- **[June 2024]** Megatron Core added support for Mamba-based models. See our paper [An Empirical Study of Mamba-based Language Models](https://arxiv.org/pdf/2406.07887) and the [code example](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba).
- **[January 2024 announcement]** NVIDIA has released the core capabilities of **Megatron-LM** into [**Megatron Core**](https://github.com/NVIDIA/Megatron-LM/tree/main/megatron/core) in this repository. Megatron Core builds on Megatron-LM's GPU-optimized techniques, adds further system-level optimizations, and exposes composable, modular APIs.

## Project Structure

Megatron-LM/
├── megatron/
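│   ├── core/                    # Megatron Core (kernels, parallelism, building blocks)
│   │   ├── models/              # Transformer models
│   │   ├── transformer/         # Transformer building blocks
│   │   ├── tensor_parallel/     # Tensor parallelism
│   │   ├── pipeline_parallel/   # Pipeline parallelism
│   │   ├── distributed/         # Distributed training (FSDP, DDP)
│   │   ├── optimizer/           # Optimizers
│   │   ├── datasets/            # Dataset loaders
│   │   ├── inference/           # Inference engines and server
│   │   └── export/              # Model export (e.g. TensorRT-LLM)
│   ├── training/                # Training scripts
│   ├── legacy/                  # Legacy components
│   ├── post_training/           # Post-training (quantization, distillation, pruning, etc.)
│   └── rl/                      # Reinforcement learning (RLHF, etc.)
├── examples/                    # Ready-to-use training examples
├── tools/                       # Utility tools
├── tests/                       # Comprehensive test suite
└── docs/                        # Documentation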
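
The tree above keeps tensor, pipeline, and data-parallel code in separate packages. As a rough illustration of how those degrees relate to a GPU count, the sketch below derives the data-parallel size left over once TP, PP, and CP are fixed. `ParallelLayout` is a hypothetical helper written for this example, not a Megatron Core class, and the 96-GPU split is made up for illustration rather than taken from any benchmark. EP is left out to keep the sketch small.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelLayout:
    """Hypothetical helper for illustration only; not part of Megatron Core."""

    world_size: int  # total number of GPUs in the job
    tp: int = 1      # tensor-parallel degree
    pp: int = 1      # pipeline-parallel degree
    cp: int = 1      # context-parallel degree

    @property
    def model_parallel_size(self) -> int:
        # Number of GPUs that together hold and process one model replica.
        return self.tp * self.pp * self.cp

    @property
    def dp(self) -> int:
        # Data-parallel degree: how many replicas fit into the job.
        if self.world_size % self.model_parallel_size != 0:
            raise ValueError(
                f"world_size={self.world_size} is not divisible by "
                f"tp*pp*cp={self.model_parallel_size}"
            )
        return self.world_size // self.model_parallel_size


# Illustrative numbers (an assumption, not a documented configuration).
layout = ParallelLayout(world_size=96, tp=8, pp=4)
print(f"GPUs per replica: {layout.model_parallel_size}, replicas: {layout.dp}")
```

The divisibility check mirrors the practical constraint that a job's GPU count must be a multiple of the GPUs needed for one replica.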

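## Performance Benchmarks

For our latest performance benchmark results, see the [NVIDIA Megatron Bridge performance summary](https://docs.nvidia.com/nemo/megatron-bridge/latest/performance-summary.html).

Our codebase efficiently trains models from 2 billion to 462 billion parameters across thousands of GPUs, reaching up to **47% Model FLOP Utilization (MFU)** on H100 clusters.

![Model table](images/model_table.png)

**Benchmark configuration:**

- **Vocabulary size**: 131,072 tokens
- **Sequence length**: 4096 tokens
- **Model scaling**: hidden size, number of attention heads, and number of layers are varied to reach the target parameter count
- **Communication optimizations**: fine-grained overlap with DP (`--overlap-grad-reduce`, `--overlap-param-gather`), TP (`--tp-comm-overlap`), and PP (enabled by default)

**Key results:**

- **6144 H100 GPUs**: successfully benchmarked training of a 462-billion-parameter model
- **Superlinear scaling**: MFU rises from 41% to 47-48% as model size grows
- **End-to-end measurement**: throughput includes all operations (data loading, optimizer steps, communication, logging)
- **Production ready**: a full training pipeline with checkpointing and fault tolerance
- *Note: performance results were measured without training to convergence*

### Weak Scaling Results

Our weak-scaling results show superlinear scaling: MFU is 41% for the smallest model and 47-48% for the largest. Larger GEMMs have higher arithmetic intensity and therefore execute more efficiently.

![Weak scaling](images/weak_scaling.png)

### Strong Scaling Results

We also strong-scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters because of the larger vocabulary) from 96 H100 GPUs to 4608 GPUs, using the same batch size of 1152 sequences throughout. At larger scale, communication overhead becomes more exposed, and MFU drops from 47% to 42%.

![Strong scaling](images/strong_scaling.png)

## Roadmap

- **[MoE Roadmap](https://github.com/NVIDIA/Megatron-LM/issues/1729)** - DeepSeek-V3, Qwen3, advanced parallelism, FP8 optimizations, and Blackwell enhancements

## Resources

### Getting Help

- 📖 **[Documentation](https://docs.nvidia.com/megatron-core/developer-guide/latest/index.html)** - Official documentation
- 🐛 **[Issues](https://github.com/NVIDIA/Megatron-LM/issues)** - Bug reports and feature requests

### Contributing

We ❤️ contributions! Ways to get involved:

- 🐛 **Report bugs** - Help us improve reliability
- 💡 **Suggest features** - Shape the future of Megatron Core
- 📝 **Improve the docs** - Make Megatron Core easier to use
- 🔧 **Submit PRs** - Contribute code improvements

**→ [Contributing Guide](https://docs.nvidia.com/megatron-core/developer-guide/latest/developer/contribute.html)**

### Citation

If you use Megatron in your research or project, we would appreciate a citation using the following entry: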

@article{megatron-lm,
  title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
  author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:1909.08053},
  year={2019}
}

Project repository: https://github.com/NVIDIA/Megatron-LM
