Megatron-LM in Practice: Activation Checkpoint Recomputation and Selective CPU/GPU Offloading

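When training Transformer models with hundreds of millions of parameters or more, activations often account for over 50% of GPU memory, and long sequences or large micro-batches can easily trigger out-of-memory (OOM) errors. Megatron-LM mitigates this with activation recomputation and selective offloading: the former trades extra compute for memory, while the latter moves activations to host memory over PCIe and, when the transfers overlap with compute, frees GPU memory at close to zero overhead. This article focuses on engineering practice, covering parameter configuration, threshold selection, and a rollout checklist for training larger models on hardware such as the H100.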

Activation Checkpoint Recomputation: Choosing Between selective and full

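The core idea of activation recomputation is to save only a few checkpoint activations during the forward pass and to rerun parts of the forward computation during the backward pass to rebuild the activations that were discarded. Megatron-LM (built on Megatron Core) supports two granularities: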

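selective (recommended starting point): recomputes only memory-intensive modules such as self-attention (core_attn) and the MLP. It typically saves roughly 60-80% of activation memory at under 5% compute overhead and suits most scenarios. Example configuration (TransformerConfig or GPTModelProvider):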
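```
recompute_granularity="selective"
recompute_modules=["core_attn", "mlp", "layernorm"]  # can be extended with "moe", "moe_act"
# Shards saved activations across TP ranks. Megatron-LM's argument checks appear to
# accept this only with recompute_granularity="full", so verify it on your version.
distribute_saved_activations=True
```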
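When Flash Attention is used, the fused kernel already avoids materializing the full attention-score matrix, so much of the core_attn saving comes from the kernel itself.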

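full: recomputes entire layers. Memory savings are the highest (over 90%), but compute overhead is around 30%. Use it under extreme memory pressure. Example configuration: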

recompute_granularity="full"
recompute_method="uniform"  # or "block"
recompute_num_layers=4  # layers per recompute chunk within a PP stage; with virtual PP, counted per virtual stage
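
To make the mechanism concrete, the following is a minimal pure-Python sketch of uniform-style recomputation on a toy chain of scalar layers. It is not Megatron code: the layer function, the helper names, and the `chunk` parameter (standing in for recompute_num_layers) are illustrative assumptions. It shows that keeping only one input per chunk and rebuilding the rest during the backward pass gives the same gradient as storing every layer input.

```python
import math


def forward_layer(w, x):
    """One toy 'layer': y = tanh(w * x)."""
    return math.tanh(w * x)


def backward_layer(w, x, grad_y):
    """Gradient with respect to the layer input x, given the output gradient."""
    y = math.tanh(w * x)
    return grad_y * (1.0 - y * y) * w


def grad_full_storage(weights, x0):
    """Baseline: keep every layer input, then backpropagate."""
    inputs = []
    x = x0
    for w in weights:
        inputs.append(x)
        x = forward_layer(w, x)
    grad = 1.0
    for w, x_in in zip(reversed(weights), reversed(inputs)):
        grad = backward_layer(w, x_in, grad)
    return grad, len(inputs)


def grad_uniform_recompute(weights, x0, chunk):
    """Keep only the input of each chunk; rebuild the other inputs on demand."""
    starts = list(range(0, len(weights), chunk))
    checkpoints = []
    x = x0
    for start in starts:
        checkpoints.append(x)  # the only activation saved for this chunk
        for w in weights[start:start + chunk]:
            x = forward_layer(w, x)
    grad = 1.0
    for start, x_ckpt in zip(reversed(starts), reversed(checkpoints)):
        chunk_weights = weights[start:start + chunk]
        # Rerun the chunk's forward pass from its checkpoint.
        inputs = []
        x = x_ckpt
        for w in chunk_weights:
            inputs.append(x)
            x = forward_layer(w, x)
        for w, x_in in zip(reversed(chunk_weights), reversed(inputs)):
            grad = backward_layer(w, x_in, grad)
    return grad, len(checkpoints)
```

With 8 layers and chunk=4, the sketch keeps 2 checkpoints instead of 8 layer inputs, at the cost of running each layer's forward a second time during backward. One chunk's worth of rebuilt inputs is still live at a time during backward, so smaller chunks lower that transient peak but store more checkpoints.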

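Threshold selection: first measure peak memory with nvidia-smi, and use an Nsight Systems profile to attribute it to activations. If peak usage exceeds 80% of total GPU memory, enable selective; if you still hit OOM, move to full step by step, starting with recompute_num_layers at one quarter of the total layer count and tuning from there. Expected MFU drop: under 3% for selective, 10-20% for full.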
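
The escalation rule above can be written down as a small helper. This is only a sketch of the decision logic in this section: the function name and its inputs are hypothetical, and the returned keys simply mirror the configuration names used earlier.

```python
def choose_recompute_plan(peak_mem_fraction, still_oom_with_selective, total_layers):
    """Map the section's thresholds to a recompute configuration (hypothetical helper)."""
    if peak_mem_fraction <= 0.80:
        # Below the 80% threshold: no recomputation suggested.
        return {}
    if not still_oom_with_selective:
        return {"recompute_granularity": "selective"}
    # Still OOM with selective: go to full, starting at a quarter of the layers.
    return {
        "recompute_granularity": "full",
        "recompute_method": "uniform",
        "recompute_num_layers": max(1, total_layers // 4),
    }
```

For a 32-layer model that still runs out of memory with selective recomputation, the helper suggests full recomputation with recompute_num_layers=8 as the first value to try.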

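The Megatron Bridge documentation notes that selective mode targets the O(S^2) operations in attention, such as the QK^T attention scores and softmax, which are cheap to recompute but expensive to store. [1]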
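
A rough sense of why the O(S^2) terms dominate comes from the widely cited per-layer activation estimate for 16-bit training without tensor or sequence parallelism, s·b·h·(34 + 5·a·s/h) bytes, where s is sequence length, b micro-batch size, h hidden size, and a the number of attention heads. The sketch below assumes that estimate and assumes selective recomputation removes exactly the 5·a·s²·b attention term; the function name is illustrative, and fused attention kernels will change the real numbers.

```python
def activation_bytes_per_layer(seq_len, micro_batch, hidden, heads, selective=False):
    """Approximate per-layer activation memory in bytes (16-bit, no TP/SP)."""
    # Terms that scale linearly with sequence length: 34 * s * b * h.
    linear_part = 34 * seq_len * micro_batch * hidden
    # Attention scores, softmax, and dropout: 5 * a * s^2 * b.
    attention_part = 5 * heads * seq_len * seq_len * micro_batch
    if selective:
        return linear_part
    return linear_part + attention_part
```

For a GPT-3-like layer (s=2048, b=1, h=12288, a=96), the attention term is 80/114 of the total, so dropping it cuts per-layer activation memory by about 70%, which falls inside the 60-80% range quoted earlier.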

Fine-grained Activation Offloading: Module-Level Transfer to CPU

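For modules that are expensive to recompute (such as linear projections), Megatron Core provides fine-grained offloading to CPU. It uses double buffering to overlap transfers with compute, so the overhead is close to zero.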

Flags to enable it:

--fine-grained-activation-offloading
--offload-modules "attn_norm,core_attn,attn_proj,mlp_norm,expert_fc1,moe_act"

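Strategy:

Low-overhead modules (layernorm, moe_act): prefer recompute.
High-memory modules (attn_proj, expert_fc1): offload, and make sure the offload and reload transfers overlap with GEMMs.
MoE-specific: offload expert_fc1 and combine it with recomputing moe_act.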
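
The split described in this strategy can be sketched as a small planning helper. The two module sets are taken directly from the list above; the helper names are hypothetical, and modules the strategy does not classify are returned as undecided rather than guessed. The flag spelling follows the example shown earlier in this section.

```python
# Classification taken from the strategy list; anything else is left undecided.
PREFER_RECOMPUTE = {"layernorm", "moe_act"}
PREFER_OFFLOAD = {"attn_proj", "expert_fc1"}


def plan_modules(candidates):
    """Split candidate modules into recompute, offload, and undecided lists."""
    classified = PREFER_RECOMPUTE | PREFER_OFFLOAD
    recompute = [m for m in candidates if m in PREFER_RECOMPUTE]
    offload = [m for m in candidates if m in PREFER_OFFLOAD]
    undecided = [m for m in candidates if m not in classified]
    return recompute, offload, undecided


def offload_flags(offload):
    """Build the command-line arguments for the offloaded modules."""
    if not offload:
        return []
    return [
        "--fine-grained-activation-offloading",
        "--offload-modules",
        ",".join(offload),
    ]
```

Modules that land in the undecided list (core_attn, for example) still need a decision based on profiling.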

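It is compatible with FP8, MTP, and CUDA Graphs (for now, keep offloaded modules outside the captured graph region) and works under PP and interleaved PP without extra changes.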

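Thresholds: monitor PCIe Rx/Tx bandwidth. If the link saturates while measured throughput is still below 50 GB/s, reduce the number of offloaded modules. On an H100, offloading 20% of activations typically frees 10-15 GB of memory.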
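
Whether an offload stays hidden depends on the transfer time fitting inside the compute it overlaps with. The arithmetic can be sketched as follows; the function names are hypothetical, and the bandwidth and compute-window values are inputs you would measure yourself, not figures from Megatron.

```python
def transfer_ms(num_bytes, bandwidth_gb_s):
    """Time in milliseconds to move num_bytes at the given bandwidth (GB/s, 1 GB = 1e9 bytes)."""
    return num_bytes / (bandwidth_gb_s * 1e9) * 1e3


def offload_is_hidden(num_bytes, bandwidth_gb_s, overlap_compute_ms):
    """True if the transfer fits inside the compute window it overlaps with."""
    return transfer_ms(num_bytes, bandwidth_gb_s) <= overlap_compute_ms
```

As an example, moving 1 GB at a sustained 50 GB/s takes about 20 ms, so it is hidden behind a 25 ms compute window but not behind a 10 ms one.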

CPU Offloading: Layer-Level Transfer of Activations and Weights

When fine-grained offloading is not enough, use layer-level CPU offloading, aimed at the deeper layers where peak memory is high:

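cpu_offloading=True
cpu_offloading_num_layers=16  # number of layers to offload, counted from layer 0; e.g. 16 of 32 total layers
cpu_offloading_activations=True
cpu_offloading_weights=False  # offload activations first; switch to True for weights only when they sit idle
cpu_offloading_double_buffering=True  # overlaps transfers across layers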

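Rollout parameters:

num_layers: use profiling to find the layers in the top 30% by peak activation memory, usually those in the deeper second half. Because cpu_offloading_num_layers is a count rather than a list of layer indices, use the profile to decide how many layers to offload.
Double buffering: once enabled, transfer latency drops by about 50%.
Combining with recompute: use offload together with selective recompute, and the memory savings stack.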
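
The benefit of double buffering can be illustrated with a simple timeline model. This is an idealized sketch, not a measurement of Megatron: it assumes every layer has the same compute time and the same transfer time, and that the transfer for the next layer runs fully in parallel with the current layer's compute. The function names are hypothetical.

```python
def serial_time(num_layers, compute_ms, transfer_ms):
    """No overlap: every layer waits for its transfer, then computes."""
    return num_layers * (compute_ms + transfer_ms)


def double_buffered_time(num_layers, compute_ms, transfer_ms):
    """Prefetch the next layer's data while the current layer computes."""
    if num_layers == 0:
        return 0.0
    # First transfer is exposed, the middle steps cost max(compute, transfer),
    # and the last layer's compute has no transfer left to overlap with.
    return transfer_ms + (num_layers - 1) * max(compute_ms, transfer_ms) + compute_ms
```

In this model, 16 layers with 10 ms of compute and 8 ms of transfer each take 288 ms serially and 168 ms with double buffering. The gain shrinks once transfer time exceeds compute time, because the transfers can no longer hide behind compute.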

On an H100 80GB, when training a 70B model with seq=4096 and micro-batch=4, offloading 16 layers can allow roughly a 2x larger batch.

Complete Rollout Checklist

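Environment: pip install "megatron-core[mlm,dev]"; git clone https://github.com/NVIDIA/Megatron-LM.git.
Baseline: run 1 epoch without any optimization and record MFU and the memory level just before OOM.
Enable incrementally:
- selective recompute + distribute_saved_activations (confirm your version accepts this combination).
- Add --fine-grained-activation-offloading --offload-modules "attn_proj,mlp_norm".
- If still OOM, use full recompute with num_layers = total layers / 8, plus cpu_offloading with num_layers = total layers / 2.
Script changes: add the flags to examples/pretrain_gpt.sh; when integrating with DeepSpeed, use set_deepspeed_activation_checkpointing.
Verification: use Nsight Compute to check that the FLOPs increase from recomputation stays below 15%; log MFU to wandb and confirm it stays above 40%.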
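
For the MFU check, a common back-of-the-envelope estimate counts about 6 FLOPs per parameter per token for a dense model (forward plus backward, ignoring attention FLOPs). The sketch below uses that approximation; the function name is hypothetical and the peak FLOPs value is whatever your hardware's spec sheet states for the precision you train in. Recomputed FLOPs are not counted as model FLOPs, which is why recomputation shows up as an MFU drop.

```python
def estimate_mfu(tokens_per_second, n_params, num_gpus, peak_flops_per_gpu):
    """Model FLOPs utilization using the ~6 * N FLOPs-per-token approximation."""
    model_flops_per_second = 6 * n_params * tokens_per_second
    return model_flops_per_second / (num_gpus * peak_flops_per_gpu)
```

As an illustration, a 70B-parameter model processing 1,000 tokens per second per GPU, against an assumed peak of 989e12 FLOP/s per GPU, comes out at roughly 42% MFU.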

Key metrics to monitor:

Metric	Threshold	Action when out of range
Peak GPU Mem	<90%	Increase offload_layers
MFU Drop	<10%	Reduce full recompute
PCIe BW	<60 GB/s	Tune double buffering / offloaded modules
Backward Time	+20%	Adjust selective modules
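
The table can be turned into a simple automated check. The sketch below is one reading of it, and the interpretation is an assumption: each threshold is treated as the healthy range, so a metric triggers its action once it reaches or crosses the listed value. The function and argument names are hypothetical.

```python
def check_metrics(peak_mem_frac, mfu_drop, pcie_gb_s, backward_time_increase):
    """Return the suggested actions for every metric outside its threshold."""
    actions = []
    if peak_mem_frac >= 0.90:
        actions.append("increase offloaded layers")
    if mfu_drop >= 0.10:
        actions.append("reduce full recompute")
    if pcie_gb_s >= 60.0:
        actions.append("tune double buffering or offloaded modules")
    if backward_time_increase > 0.20:
        actions.append("adjust selective recompute modules")
    return actions
```

A run at 95% peak memory with a 25% longer backward pass would return the first and last actions, while a run inside all four thresholds returns an empty list.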

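Risks and Rollback

Overhead exceeds budget: if full recompute lowers MFU by more than 20%, roll back to selective.
Compatibility: with MoE or FP8, test on a small model first; keep offloading out of CUDA Graphs for now.
Bandwidth bottleneck: when multiple GPUs share PCIe, prefer clusters with NVLink.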

Reported results for this stack reach 47% MFU when training a 462B-parameter model on 6,144 H100 GPUs. [2]

Sources: [1] Megatron Bridge, Activation Recomputation. [2] Megatron Core, Fine-grained Offloading; NVIDIA/Megatron-LM on GitHub.
