Foreword

That's right, this is a new series.

Why explore DeepGEMM?

A senior labmate's MoE training was getting too expensive, so we had no choice but to borrow DeepSeek's FP8 training strategy. And that is how I ended up studying DeepGEMM…

Compute platform

A supercomputing center (H20 nodes, Slurm)

Environment setup

According to DeepGEMM's GitHub page, the official runtime requirements are:

```
Requirements
- Hopper architecture GPUs, sm_90a must be supported
- Python 3.8 or above
- CUDA 12.3 or above
  - But we highly recommend 12.8 or above for the best performance
- PyTorch 2.1 or above
- CUTLASS 3.6 or above (could be cloned by Git submodule)
```

Except for CUTLASS, all the other components can be found in the supercomputing center's module list. The configuration I used is: gcc_compiler 10.3.0, cuda-12.4, cmake-3.24.0rc4, and binutils-2.30 (I forget why I installed that last one; to be safe I kept it).
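For reference, loading these through the cluster's environment-modules system looks roughly like the sketch below; the exact module names come from my site's module list and are assumptions for any other cluster.

```bash
# Load the toolchain via environment modules (module names are site-specific)
module load gcc_compiler/10.3.0
module load cuda/12.4
module load cmake/3.24.0rc4
module load binutils/2.30
module list   # confirm everything is loaded
```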

Cloning CUTLASS

Reference: the NVIDIA CUTLASS documentation.

The GPU architectures and features supported by each CUTLASS version are described in the release notes for that version.

1. Download cutlass-3.8.0 and unpack it to the directory {$CUTLASS_PATH}.

2. Create a build directory inside the cutlass-3.8.0 folder and run cmake:

```bash
cd {$CUTLASS_PATH}
mkdir build && cd build

export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc

cmake .. -DCUTLASS_NVCC_ARCHS=90a   # compiles for NVIDIA Hopper GPU architecture
cmake .. -DCUTLASS_NVCC_ARCHS=100a  # compiles for NVIDIA Blackwell SM100 GPU architecture
```

Since all the GPUs on my platform are H-series (Hopper) cards, I used cmake .. -DCUTLASS_NVCC_ARCHS=90a.

The console output:

```
(base) [sc100116@psn001 build]$ cmake .. -DCUTLASS_NVCC_ARCHS=90a
-- CMake Version: 3.24.0-rc4
-- CUTLASS 3.8.0
-- The CUDA compiler identification is NVIDIA 12.4.131
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /online1/public/support/intel/cuda/12.4/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /online1/public/support/intel/cuda/12.4/include (found version "12.4.131")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- CUDART: /online1/public/support/intel/cuda/12.4/lib64/libcudart.so
-- CUDA Driver: /online1/public/support/intel/cuda/12.4/lib64/stubs/libcuda.so
-- NVRTC: /online1/public/support/intel/cuda/12.4/lib64/libnvrtc.so
-- Default Install Location: /usr/local
-- Found Python3: /home/export/base/sc100116/sc100116/online1/miniconda3/bin/python3.12 (found suitable version "3.12.9", minimum required is "3.5") found components: Interpreter
-- CUDA Compilation Architectures: 90a
-- Enable caching of reference results in conv unit tests
-- Enable rigorous conv problem sizes in conv unit tests
-- Using the following NVCC flags:
--expt-relaxed-constexpr
-DCUTLASS_TEST_LEVEL=0
-DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1
-DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1
-DCUTLASS_DEBUG_TRACE_LEVEL=0
-Xcompiler=-Wconversion
-Xcompiler=-fno-strict-aliasing
-- CUTLASS Revision: 4499c4c
-- The C compiler identification is GNU 10.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /online1/public/support/intel/gcc_compiler/10.3.0/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found Python3: /home/export/base/sc100116/sc100116/online1/miniconda3/bin/python3.12 (found version "3.12.9") found components: Interpreter
-- Configuring cublas ...
-- cuBLAS Disabled.
-- Configuring cuBLAS ... done.
-- Completed generation of library instances. See /home/export/base/sc100116/sc100116/online1/projects/DeepGEMM/cutlass-3.8.0/build/tools/library/library_instance_generation.log for more information.
-- Found Python3: /home/export/base/sc100116/sc100116/online1/miniconda3/bin/python3.12 (found suitable version "3.12.9", minimum required is "3.5") found components: Interpreter
-- Enable device reference verification in conv unit tests
-- Configuring done
-- Generating done
-- Build files have been written to: /home/export/base/sc100116/sc100116/online1/projects/DeepGEMM/cutlass-3.8.0/build
```

After cmake finishes, a Makefile is generated in the build directory, and compilation can begin. Strictly speaking, this step is optional; I only ran it to verify that my C++ environment was configured correctly. The CUTLASS documentation also points out that users can selectively compile only the CUTLASS GEMM and convolution kernels they need, to keep compilation time to a minimum (see the sketch below).
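For illustration, that selective compilation is driven by the CUTLASS_LIBRARY_KERNELS CMake flag; the wildcard filter below is an example in the spirit of the CUTLASS quickstart, not something this build requires.

```bash
# Instantiate only kernels whose names match the filter, instead of the full
# library, to cut compile time (example: FP16 tensor-op GEMMs, NT layout, alignment 8)
cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
```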

The documentation also explicitly states that cuBLAS and cuDNN can be excluded as dependencies with the following CMake flags:

```bash
-DCUTLASS_ENABLE_CUBLAS=OFF
-DCUTLASS_ENABLE_CUDNN=OFF
```

If cuBLAS and cuDNN are not installed on the machine, there is no need to intervene here: the earlier cmake run excludes both dependencies automatically (note the "cuBLAS Disabled." line in the output above).

3. [Optional] Build and run the CUTLASS Profiler

```bash
make cutlass_profiler -j36
```

Then comes a long compilation…

The CUTLASS Profiler is a tool for profiling the performance of CUTLASS GEMM kernels and benchmarking the GPU. See the official guide and the wiki for details; a quick example follows.
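As a smoke test, the profiler can be pointed at a single kernel family and problem size; the shape below is arbitrary and follows the pattern from the CUTLASS quickstart.

```bash
# Run from the build directory: profile SGEMM kernels on one example problem size
./tools/profiler/cutlass_profiler --kernels=sgemm --m=4096 --n=4096 --k=4096
```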

Cloning DeepGEMM

Clone DeepGEMM (use --recursive so the cutlass submodule is cloned as well):

```bash
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
```

Enter the DeepGEMM directory:

```bash
cd DeepGEMM
```

The official introduction to DeepGEMM reads:

```
DeepGEMM is a library designed for clean and efficient FP8 General Matrix Multiplications (GEMMs) with fine-grained scaling, as proposed in DeepSeek-V3. It supports both normal and Mix-of-Experts (MoE) grouped GEMMs. Written in CUDA, the library has no compilation need during installation, by compiling all kernels at runtime using a lightweight Just-In-Time (JIT) module.
```
```
Currently, DeepGEMM exclusively supports NVIDIA Hopper tensor cores. To address the imprecise FP8 tensor core accumulation, it employs CUDA-core two-level accumulation (promotion).
```

Verification & installation

```bash
# Make symbolic links for third-party (CUTLASS and CuTe) include directories
python setup.py develop

# Test JIT compilation
python tests/test_jit.py

# Test all GEMM implements (normal, contiguous-grouped and masked-grouped)
python tests/test_core.py
```

Once the tests pass, install:

```bash
python setup.py install
```

After that, you can import deep_gemm in your Python projects.
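A minimal sanity check that the package is importable and sees the GPU (nothing DeepGEMM-specific beyond the import itself):

```python
import torch
import deep_gemm

# If installation worked, this prints the installed module path and the visible GPU
print(deep_gemm.__file__)
print(torch.cuda.get_device_name())
```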

Rewriting the DeepseekV3 code for FP8 training

1. DeepseekV3MLP

The original DeepseekV3MLP code is as follows:

```python
class DeepseekV3MLP(nn.Module):
    def __init__(self, config, hidden_size=None, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
        self.intermediate_size = (
            config.intermediate_size if intermediate_size is None else intermediate_size
        )

        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj
```

The core computation here lives in the three Linear layers: gate_proj, up_proj, and down_proj. Below, we try to replace them with an FP8 Linear implemented on top of DeepGEMM (a sketch is given at the end of this section).

DeepGEMM's core test code ships a few ready-made helpers, and they map directly onto the description of FP8 training in the DeepSeek-V3 tech report.

The original wording on page 15:

```
To address this challenge and effectively extend the dynamic range of the FP8 format, we introduce a fine-grained quantization strategy: tile-wise grouping with 1 × N_c elements or block-wise grouping with N_c × N_c elements...
```

To make FP8 training feasible, the team proposes fine-grained quantization strategies for computation during training: 1. tile-wise grouped quantization (strip by strip, i.e. token by token: each token is embedded as a vector, so the grouping runs along that vector dimension), and 2. block-wise grouped quantization (one group per matrix block).

These two strategies correspond exactly to per_token_cast_to_fp8 and per_block_cast_to_fp8 in the core test code, as sketched below.
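For reference, the two helpers look roughly like the sketch below, reconstructed from tests/test_core.py (check the repository for the authoritative version). Here 448 is the maximum representable value of FP8 E4M3, so each group is rescaled to fit that range and the per-group scale is returned alongside the FP8 tensor.

```python
import torch


def ceil_div(x: int, y: int) -> int:
    return (x + y - 1) // y


def per_token_cast_to_fp8(x: torch.Tensor):
    # Tile-wise (1 x 128): one scale per 128 contiguous elements of each token vector
    assert x.dim() == 2 and x.size(1) % 128 == 0
    m, n = x.shape
    x_view = x.view(m, -1, 128)
    x_amax = x_view.abs().float().amax(dim=2).view(m, -1).clamp(1e-4)
    return (x_view * (448.0 / x_amax.unsqueeze(2))).to(torch.float8_e4m3fn).view(m, n), (x_amax / 448.0).view(m, -1)


def per_block_cast_to_fp8(x: torch.Tensor):
    # Block-wise (128 x 128): one scale per 128 x 128 block of the (padded) matrix
    assert x.dim() == 2
    m, n = x.shape
    x_padded = torch.zeros((ceil_div(m, 128) * 128, ceil_div(n, 128) * 128), dtype=x.dtype, device=x.device)
    x_padded[:m, :n] = x
    x_view = x_padded.view(-1, 128, x_padded.size(1) // 128, 128)
    x_amax = x_view.abs().float().amax(dim=(1, 3), keepdim=True).clamp(1e-4)
    x_scaled = (x_view * (448.0 / x_amax)).to(torch.float8_e4m3fn)
    return x_scaled.view_as(x_padded)[:m, :n].contiguous(), (x_amax / 448.0).view(x_view.size(0), x_view.size(2))
```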

The report also notes that, to strike a balance between training stability and efficiency, the embedding module, the output head, the MoE gating modules, normalization ops, and attention ops are all kept in full precision (BF16/FP32).
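Putting the pieces together, here is a minimal sketch of replacing one projection's forward pass. The wiring and the fp8_linear name are my own (hypothetical, forward pass only); it assumes the two cast helpers above plus DeepGEMM's gemm_fp8_fp8_bf16_nt and get_col_major_tma_aligned_tensor APIs as described in the README, so verify the scale layouts against tests/test_core.py before trusting it.

```python
import torch
import deep_gemm


def fp8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # x: [m, k] BF16 activations; weight: [n, k] BF16 parameters (nn.Linear layout)
    x_fp8, x_scales = per_token_cast_to_fp8(x)        # 1 x 128 tile-wise scales
    w_fp8, w_scales = per_block_cast_to_fp8(weight)   # 128 x 128 block-wise scales
    # DeepGEMM expects the LHS scales in a TMA-aligned, column-major layout
    x_scales = deep_gemm.get_col_major_tma_aligned_tensor(x_scales)
    out = torch.empty((x.shape[0], weight.shape[0]), dtype=torch.bfloat16, device=x.device)
    deep_gemm.gemm_fp8_fp8_bf16_nt((x_fp8, x_scales), (w_fp8, w_scales), out)
    return out
```

With this in place, each of self.gate_proj(x), self.up_proj(x), and self.down_proj(...) in DeepseekV3MLP.forward would become an fp8_linear(x, proj.weight) call.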