Requirements

- Hopper architecture GPUs (`sm_90a` must be supported)
- Python 3.8 or above
- CUDA 12.3 or above
  - CUDA 12.8 or above is highly recommended for the best performance
- PyTorch 2.1 or above
- CUTLASS 3.6 or above (can be cloned as a Git submodule)
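A quick sanity check of the Python-side requirements can be scripted. This is a convenience sketch, not part of DeepGEMM; the `torch` import is guarded because PyTorch may not be installed in every environment:

```python
import sys

# DeepGEMM requires Python 3.8+; fail early with a clear message.
assert sys.version_info >= (3, 8), f"Python 3.8+ required, found {sys.version}"

# PyTorch (2.1+) and its CUDA build (12.3+) are reported only if torch is present.
try:
    import torch
    print("torch:", torch.__version__, "built against CUDA:", torch.version.cuda)
except ImportError:
    print("torch not installed; install PyTorch 2.1 or above")
```

The GPU-side requirement (`sm_90a`) can only be confirmed on the target machine, e.g. via `nvidia-smi` or `torch.cuda.get_device_capability()`.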
```
(base) [sc100116@psn001 build]$ cmake .. -DCUTLASS_NVCC_ARCHS=90a
-- CMake Version: 3.24.0-rc4
-- CUTLASS 3.8.0
-- The CUDA compiler identification is NVIDIA 12.4.131
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Check for working CUDA compiler: /online1/public/support/intel/cuda/12.4/bin/nvcc - skipped
-- Detecting CUDA compile features
-- Detecting CUDA compile features - done
-- Found CUDAToolkit: /online1/public/support/intel/cuda/12.4/include (found version "12.4.131")
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE
-- CUDART: /online1/public/support/intel/cuda/12.4/lib64/libcudart.so
-- CUDA Driver: /online1/public/support/intel/cuda/12.4/lib64/stubs/libcuda.so
-- NVRTC: /online1/public/support/intel/cuda/12.4/lib64/libnvrtc.so
-- Default Install Location: /usr/local
-- Found Python3: /home/export/base/sc100116/sc100116/online1/miniconda3/bin/python3.12 (found suitable version "3.12.9", minimum required is "3.5") found components: Interpreter
-- CUDA Compilation Architectures: 90a
-- Enable caching of reference results in conv unit tests
-- Enable rigorous conv problem sizes in conv unit tests
-- Using the following NVCC flags: --expt-relaxed-constexpr -DCUTLASS_TEST_LEVEL=0 -DCUTLASS_TEST_ENABLE_CACHED_RESULTS=1 -DCUTLASS_CONV_UNIT_TEST_RIGOROUS_SIZE_ENABLED=1 -DCUTLASS_DEBUG_TRACE_LEVEL=0 -Xcompiler=-Wconversion -Xcompiler=-fno-strict-aliasing
-- CUTLASS Revision: 4499c4c
-- The C compiler identification is GNU 10.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /online1/public/support/intel/gcc_compiler/10.3.0/bin/gcc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Found Python3: /home/export/base/sc100116/sc100116/online1/miniconda3/bin/python3.12 (found version "3.12.9") found components: Interpreter
-- Configuring cublas ...
-- cuBLAS Disabled.
-- Configuring cuBLAS ... done.
-- Completed generation of library instances. See /home/export/base/sc100116/sc100116/online1/projects/DeepGEMM/cutlass-3.8.0/build/tools/library/library_instance_generation.log for more information.
-- Found Python3: /home/export/base/sc100116/sc100116/online1/miniconda3/bin/python3.12 (found suitable version "3.12.9", minimum required is "3.5") found components: Interpreter
-- Enable device reference verification in conv unit tests
-- Configuring done
-- Generating done
-- Build files have been written to: /home/export/base/sc100116/sc100116/online1/projects/DeepGEMM/cutlass-3.8.0/build
```
DeepGEMM is a library designed for clean and efficient FP8 General Matrix Multiplications (GEMMs) with fine-grained scaling, as proposed in DeepSeek-V3. It supports both normal and Mixture-of-Experts (MoE) grouped GEMMs. Written in CUDA, the library requires no compilation during installation: all kernels are compiled at runtime by a lightweight Just-In-Time (JIT) module.
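The runtime-compilation idea can be illustrated with a toy cache: a kernel source specialized for a given shape is "compiled" once, keyed by a hash of the source, and reused on later requests. This is only a sketch of the caching pattern; DeepGEMM's actual JIT module generates CUDA source and invokes the CUDA compiler at runtime, which the stub below does not do:

```python
import hashlib

_cache = {}  # source hash -> "compiled" kernel (here: just a callable)

def render_source(m, n, k):
    # A real JIT would render CUDA source with tile sizes baked in as constants.
    return f"// gemm kernel specialized for M={m}, N={n}, K={k}"

def jit_compile(m, n, k):
    src = render_source(m, n, k)
    key = hashlib.sha256(src.encode()).hexdigest()
    if key not in _cache:
        # Stub "compilation": wrap the source in a callable. A real JIT would
        # shell out to the CUDA compiler here and load the resulting binary.
        _cache[key] = lambda: src
    return _cache[key]

k1 = jit_compile(4096, 4096, 128)
k2 = jit_compile(4096, 4096, 128)  # second request hits the cache
assert k1 is k2 and len(_cache) == 1
```

Because compilation cost is paid once per unique specialization, steady-state calls pay only a dictionary lookup.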
To address this challenge and effectively extend the dynamic range of the FP8 format, we introduce a fine-grained quantization strategy: tile-wise grouping with 1 × N_c elements or block-wise grouping with N_c × N_c elements...
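The tile-wise variant of this scheme can be sketched numerically. The NumPy simulation below assumes N_c = 128 (the value used in DeepSeek-V3) and uses 448 as the largest finite magnitude of FP8 E4M3; it computes one scale per 1 × N_c group and does not model FP8 rounding, so it is an illustration of the grouping, not the library's CUDA implementation:

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3
N_C = 128             # group size; assumed to match DeepSeek-V3's choice

def quantize_tilewise(x):
    """Simulate tile-wise (1 x N_c) quantization: one scale per contiguous
    group of N_c elements along the last axis. Block-wise (N_c x N_c)
    grouping works the same way over 2-D blocks."""
    m, k = x.shape
    assert k % N_C == 0
    groups = x.reshape(m, k // N_C, N_C)
    # Per-group scale maps the group's absolute maximum onto the FP8 range,
    # so one outlier only shrinks the dynamic range of its own group.
    scales = np.abs(groups).max(axis=-1, keepdims=True) / FP8_E4M3_MAX
    scales = np.maximum(scales, 1e-12)  # avoid division by zero
    q = np.clip(groups / scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q.reshape(m, k), scales.squeeze(-1)

x = np.random.randn(4, 256).astype(np.float32)
q, s = quantize_tilewise(x)
# Dequantizing with the per-tile scales recovers x (up to float rounding).
x_rec = (q.reshape(4, 256 // N_C, N_C) * s[..., None]).reshape(4, 256)
assert np.allclose(x, x_rec, atol=1e-5)
```

The key design point is that scales are stored per group rather than per tensor, so a single large outlier cannot collapse the usable range of the whole matrix.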