Foreword

That's right, this is a new series.

Why explore DeepGEMM?

A senior labmate's MoE training was getting too expensive, so we had no choice but to borrow DeepSeek's FP8 training strategy. And that is how I ended up studying DeepGEMM…

Compute platform

A supercomputing center (H20 nodes, Slurm)

Environment setup

According to DeepGEMM's GitHub page, the official runtime requirements are:

```
Requirements
- Hopper architecture GPUs, sm_90a must be supported
- Python 3.8 or above
- CUDA 12.3 or above
  - But we highly recommend 12.8 or above for the best performance
- PyTorch 2.1 or above
- CUTLASS 3.6 or above (could be cloned by Git submodule)
```

Except for CUTLASS, all the other components can be found in the supercomputing center's module list. The configuration I used is: gcc_compiler 10.3.0, cuda-12.4, cmake-3.24.0rc4, and binutils-2.30 (I forget why I installed that last one; to be safe I kept it).
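For reference, loading these through the cluster's environment-modules system looks roughly like the sketch below; the exact module names come from my site's module list and are assumptions for any other cluster.

```bash
# Load the toolchain via environment modules (module names are site-specific)
module load gcc_compiler/10.3.0
module load cuda/12.4
module load cmake/3.24.0rc4
module load binutils/2.30
module list   # confirm everything is loaded
```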

Cloning CUTLASS

Reference: the NVIDIA CUTLASS documentation.

The GPU architectures and features supported by each CUTLASS version are described in the release notes for that version.

1. Download cutlass-3.8.0 and unpack it to the directory {$CUTLASS_PATH}.

2. Create a build directory inside the cutlass-3.8.0 folder and run cmake:

```bash
cd {$CUTLASS_PATH}
mkdir build && cd build

export CUDACXX=${CUDA_INSTALL_PATH}/bin/nvcc

cmake .. -DCUTLASS_NVCC_ARCHS=90a   # compiles for NVIDIA Hopper GPU architecture
cmake .. -DCUTLASS_NVCC_ARCHS=100a  # compiles for NVIDIA Blackwell SM100 GPU architecture
```

Since all the GPUs on my platform are H-series (Hopper) cards, I used cmake .. -DCUTLASS_NVCC_ARCHS=90a.

The console output:

```
(base) [sc100116@psn001 build]$ cmake .. -DCUTLASS_NVCC_ARCHS=90a
-- CMake Version: 3.24.0-rc4
-- CUTLASS 3.8.0
-- The CUDA compiler identification is NVIDIA 12.4.131
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /online1/public/support/intel/cuda/12.4/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /online1/public/support/intel/cuda/12.4/include (found version "12.4.131")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- CUDART: /online1/public/support/intel/cuda/12.4/lib64/libcudart.so
-- CUDA Driver: /online1/public/support/intel/cuda/12.4/lib64/stubs/libcuda.so
-- NVRTC: /online1/public/support/intel/cuda/12.4/lib64/libnvrtc.so
-- Default Install Location: /usr/local
-- Found Python3: /home/export/base/sc100116/sc100116/online1/miniconda3/bin/python3.12 (found suitable version "3.12.9", minimum required is "3.5") found components: Interpreter
-- CUDA Compilation Architectures: 90a
-- Enable caching of reference results in conv unit tests
-- Enable rigorous conv problem sizes in conv unit tests
-- Using the following NVCC flags:
--expt-relaxed-constexpr
-DCUTLASS_TEST_LEVEL=0
-DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1
-DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1
-DCUTLASS_DEBUG_TRACE_LEVEL=0
-Xcompiler=-Wconversion
-Xcompiler=-fno-strict-aliasing
-- CUTLASS Revision: 4499c4c
-- The C compiler identification is GNU 10.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /online1/public/support/intel/gcc_compiler/10.3.0/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found Python3: /home/export/base/sc100116/sc100116/online1/miniconda3/bin/python3.12 (found version "3.12.9") found components: Interpreter
-- Configuring cublas ...
-- cuBLAS Disabled.
-- Configuring cuBLAS ... done.
-- Completed generation of library instances. See /home/export/base/sc100116/sc100116/online1/projects/DeepGEMM/cutlass-3.8.0/build/tools/library/library_instance_generation.log for more information.
-- Found Python3: /home/export/base/sc100116/sc100116/online1/miniconda3/bin/python3.12 (found suitable version "3.12.9", minimum required is "3.5") found components: Interpreter
-- Enable device reference verification in conv unit tests
-- Configuring done
-- Generating done
-- Build files have been written to: /home/export/base/sc100116/sc100116/online1/projects/DeepGEMM/cutlass-3.8.0/build
```

After cmake finishes, a Makefile is generated in the build directory, and compilation can begin. Strictly speaking, this step is optional; I only ran it to verify that my C++ environment was configured correctly. The CUTLASS documentation also points out that users can selectively compile only the CUTLASS GEMM and convolution kernels they need, to keep compilation time to a minimum (see the sketch below).
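For illustration, that selective compilation is driven by the CUTLASS_LIBRARY_KERNELS CMake flag; the wildcard filter below is an example in the spirit of the CUTLASS quickstart, not something this build requires.

```bash
# Instantiate only kernels whose names match the filter, instead of the full
# library, to cut compile time (example: FP16 tensor-op GEMMs, NT layout, alignment 8)
cmake .. -DCUTLASS_NVCC_ARCHS=90a -DCUTLASS_LIBRARY_KERNELS=cutlass_tensorop_s*gemm_f16_*_nt_align8
```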

The documentation also explicitly states that cuBLAS and cuDNN can be excluded as dependencies with the following CMake flags:

```bash
-DCUTLASS_ENABLE_CUBLAS=OFF
-DCUTLASS_ENABLE_CUDNN=OFF
```

If cuBLAS and cuDNN are not installed on the machine, there is no need to intervene here: the earlier cmake run excludes both dependencies automatically (note the "cuBLAS Disabled." line in the output above).

3. [Optional] Build and run the CUTLASS Profiler

```bash
make cutlass_profiler -j36
```

Then comes a long compilation…

The CUTLASS Profiler is a tool for profiling the performance of CUTLASS GEMM kernels and benchmarking the GPU. See the official guide and the wiki for details; a quick example follows.
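As a smoke test, the profiler can be pointed at a single kernel family and problem size; the shape below is arbitrary and follows the pattern from the CUTLASS quickstart.

```bash
# Run from the build directory: profile SGEMM kernels on one example problem size
./tools/profiler/cutlass_profiler --kernels=sgemm --m=4096 --n=4096 --k=4096
```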

Cloning DeepGEMM

Clone DeepGEMM (use --recursive so the cutlass submodule is cloned as well):

```bash
git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
```

Enter the DeepGEMM directory:

```bash
cd DeepGEMM
```

The official introduction to DeepGEMM reads:

```
DeepGEMM is a library designed for clean and efficient FP8 General Matrix Multiplications (GEMMs) with fine-grained scaling, as proposed in DeepSeek-V3. It supports both normal and Mix-of-Experts (MoE) grouped GEMMs. Written in CUDA, the library has no compilation need during installation, by compiling all kernels at runtime using a lightweight Just-In-Time (JIT) module.
```
```
Currently, DeepGEMM exclusively supports NVIDIA Hopper tensor cores. To address the imprecise FP8 tensor core accumulation, it employs CUDA-core two-level accumulation (promotion).
```

Verification & installation

```bash
# Make symbolic links for third-party (CUTLASS and CuTe) include directories
python setup.py develop

# Test JIT compilation
python tests/test_jit.py

# Test all GEMM implements (normal, contiguous-grouped and masked-grouped)
python tests/test_core.py
```

Once the tests pass, install:

```bash
python setup.py install
```

After that, you can import deep_gemm in your Python projects.
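A minimal sanity check that the package is importable and sees the GPU (nothing DeepGEMM-specific beyond the import itself):

```python
import torch
import deep_gemm

# If installation worked, this prints the installed module path and the visible GPU
print(deep_gemm.__file__)
print(torch.cuda.get_device_name())
```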

Rewriting the DeepseekV3 code for FP8 training

1. DeepseekV3MLP

The original DeepseekV3MLP code is as follows:

```python
class DeepseekV3MLP(nn.Module):
    def __init__(self, config, hidden_size=None, intermediate_size=None):
        super().__init__()
        self.config = config
        self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
        self.intermediate_size = (
            config.intermediate_size if intermediate_size is None else intermediate_size
        )

        self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
        self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
        self.act_fn = ACT2FN[config.hidden_act]

    def forward(self, x):
        down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
        return down_proj
```

The core computation here lives in the three Linear layers: gate_proj, up_proj, and down_proj. Below, we try to replace them with an FP8 Linear implemented on top of DeepGEMM (a sketch is given at the end of this section).

DeepGEMM's core test code ships a few ready-made helpers, and they map directly onto the description of FP8 training in the DeepSeek-V3 tech report.

The original wording on page 15:

```
To address this challenge and effectively extend the dynamic range of the FP8 format, we introduce a fine-grained quantization strategy: tile-wise grouping with 1 × N_c elements or block-wise grouping with N_c × N_c elements...
```

To make FP8 training feasible, the team proposes fine-grained quantization strategies for computation during training: 1. tile-wise grouped quantization (strip by strip, i.e. token by token: each token is embedded as a vector, so the grouping runs along that vector dimension), and 2. block-wise grouped quantization (one group per matrix block).

These two strategies correspond exactly to per_token_cast_to_fp8 and per_block_cast_to_fp8 in the core test code, as sketched below.
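For reference, the two helpers look roughly like the sketch below, reconstructed from tests/test_core.py (check the repository for the authoritative version). Here 448 is the maximum representable value of FP8 E4M3, so each group is rescaled to fit that range and the per-group scale is returned alongside the FP8 tensor.

```python
import torch


def ceil_div(x: int, y: int) -> int:
    return (x + y - 1) // y


def per_token_cast_to_fp8(x: torch.Tensor):
    # Tile-wise (1 x 128): one scale per 128 contiguous elements of each token vector
    assert x.dim() == 2 and x.size(1) % 128 == 0
    m, n = x.shape
    x_view = x.view(m, -1, 128)
    x_amax = x_view.abs().float().amax(dim=2).view(m, -1).clamp(1e-4)
    return (x_view * (448.0 / x_amax.unsqueeze(2))).to(torch.float8_e4m3fn).view(m, n), (x_amax / 448.0).view(m, -1)


def per_block_cast_to_fp8(x: torch.Tensor):
    # Block-wise (128 x 128): one scale per 128 x 128 block of the (padded) matrix
    assert x.dim() == 2
    m, n = x.shape
    x_padded = torch.zeros((ceil_div(m, 128) * 128, ceil_div(n, 128) * 128), dtype=x.dtype, device=x.device)
    x_padded[:m, :n] = x
    x_view = x_padded.view(-1, 128, x_padded.size(1) // 128, 128)
    x_amax = x_view.abs().float().amax(dim=(1, 3), keepdim=True).clamp(1e-4)
    x_scaled = (x_view * (448.0 / x_amax)).to(torch.float8_e4m3fn)
    return x_scaled.view_as(x_padded)[:m, :n].contiguous(), (x_amax / 448.0).view(x_view.size(0), x_view.size(2))
```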

The report also notes that, to strike a balance between training stability and efficiency, the embedding module, the output head, the MoE gating modules, normalization ops, and attention ops are all kept in full precision (BF16/FP32).
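Putting the pieces together, here is a minimal sketch of replacing one projection's forward pass. The wiring and the fp8_linear name are my own (hypothetical, forward pass only); it assumes the two cast helpers above plus DeepGEMM's gemm_fp8_fp8_bf16_nt and get_col_major_tma_aligned_tensor APIs as described in the README, so verify the scale layouts against tests/test_core.py before trusting it.

```python
import torch
import deep_gemm


def fp8_linear(x: torch.Tensor, weight: torch.Tensor) -> torch.Tensor:
    # x: [m, k] BF16 activations; weight: [n, k] BF16 parameters (nn.Linear layout)
    x_fp8, x_scales = per_token_cast_to_fp8(x)        # 1 x 128 tile-wise scales
    w_fp8, w_scales = per_block_cast_to_fp8(weight)   # 128 x 128 block-wise scales
    # DeepGEMM expects the LHS scales in a TMA-aligned, column-major layout
    x_scales = deep_gemm.get_col_major_tma_aligned_tensor(x_scales)
    out = torch.empty((x.shape[0], weight.shape[0]), dtype=torch.bfloat16, device=x.device)
    deep_gemm.gemm_fp8_fp8_bf16_nt((x_fp8, x_scales), (w_fp8, w_scales), out)
    return out
```

With this in place, each of self.gate_proj(x), self.up_proj(x), and self.down_proj(...) in DeepseekV3MLP.forward would become an fp8_linear(x, proj.weight) call.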