WMMA guide for AMD RDNA 4 architecture GPUs - part 3

In-register matrix transpose for RDNA 4 architecture GPUs

In-register matrix transpose is an important optimization technique in FFT and Neural Texture Compression. This article explores practical implementations using WMMA on AMD RDNA™ 4 architecture graphics cards.

1. Problem description

Matrix transpose loading is critical in modern GPGPU computing. However, RDNA 4 architecture GPUs lack both shared-memory transpose loading and in-register matrix transpose capabilities, making it difficult for programmers to achieve peak performance. Consequently, an efficient in-register matrix transpose solution is essential for RDNA 4.

2. Transpose by warp shuffle instruction

A straightforward approach uses warp shuffle instructions to exchange data between threads in the warp. However, as the RDNA 4 WMMA layout and shuffle instructions only support constant index of a value, requiring redundant data exchanges that limit performance. Here is a sample code:

#define WMMA_DATA_WIDTH 8
typedef _Float16 frag_type __attribute__((ext_vector_type(WMMA_DATA_WIDTH)));
__device__ __forceinline__ frag_type shfl_movmatrix(const frag_type& t) {
    auto lane_id = __lane_id();
    auto v_src = lane_id % 8;
    auto TLayout = [] (auto lane_id) {
        constexpr unsigned shape[] = {8, 2, 2};
        constexpr unsigned stride[] = {0, 16, 8};
        for(int i = 0; i < sizeof(shape) / sizeof(shape[0]); ++i) {
            result += lane_id % shape[i] * stride[i];
    auto t_trans = TLayout(lane_id);
    for(int v = 0; v < 8; ++v) {
        auto t_src = t_trans + v;
        uint32_t* in_reg = (uint32_t*)(&t);
        uint32_t* out_reg = (uint32_t*)(&reg);
        static_assert(sizeof(reg) % sizeof(*in_reg) == 0, "frag_type must be dividend by uint32_t evenly");
        for(int tv = 0; tv < sizeof(reg) / sizeof(*in_reg); ++tv) {
            out_reg[tv] = __shfl(in_reg[tv], t_src);

3. Transpose by WMMA

To implement in-register matrix transpose effectively, one must first understand the Wide Matrix Multiply Accumulate (WMMA) layout in RDNA 4. Using the Matrix Cores of AMD RDNA 4 architecture GPUs provides a comprehensive introduction to this topic.

The following illustrates the WMMA layout in RDNA 4. All data types—including FP16, INT8, and INT4—utilize this unified layout:

Both matrix A and B are K-major, with each thread holding 8 contiguous elements. This layout enables efficient 128-bit vectorized loads. Matrix D is M-major. So, matrix D is the transposed version of matrix A.

Leveraging the RDNA 4 architecture WMMA layout, we can construct an identity matrix in register B while loading the source matrix into register A. A single WMMA operation then performs the transpose entirely in-register—no additional memory operations required.

4. Sample code

The following code demonstrates the in-register matrix transpose implementation on RDNA 4.

#define WMMA_DATA_WIDTH 8
typedef _Float16 frag_type __attribute__((ext_vector_type(WMMA_DATA_WIDTH)));
__device__ __forceinline__ void make_identity(frag_type& m) {
    const int lIdx = __lane_id();
    const int row = lIdx / 16;
    const int col = lIdx % 16 / 8;
    const __half num = row == col;
    const int idx = lIdx % 16 % 8;
__global__ void wmma_movmatrix(__half* a, __half* c) {
    const int gIdx = blockIdx.x * blockDim.x + threadIdx.x;
    const int lIdx = threadIdx.x;
    const int lane = lIdx % 16;
    const int laneGroup = lIdx / 16;
    for(int ele = 0; ele < WMMA_DATA_WIDTH; ++ele) {
  a_frag[ele] = a[16 * lane + ele+laneGroup * WMMA_DATA_WIDTH];
    c_frag = __builtin_amdgcn_wmma_f16_16x16x16_f16_w32_gfx12(a_frag, b_frag, c_frag);
    for( int ele = 0; ele < WMMA_DATA_WIDTH; ++ele ) {
  c[16 * lane + ele+laneGroup * WMMA_DATA_WIDTH] = c_frag[ele];
int main(int argc, char* argv[]) {
    thrust::host_vector<__half> h_a(16*16);
    thrust::host_vector<__half> h_c(16*16);
    for(int i = 0; i < h_a.size(); ++i) {
    thrust::device_vector<__half> d_a = h_a;
    thrust::device_vector<__half> d_c = h_c;
    wmma_movmatrix<<<dim3(1), dim3(32, 1, 1), 0, 0>>>(d_a.data().get(), d_c.data().get());
    for (int i = 0; i < 16; ++i) {
        for (int j = 0; j < 16; ++j) {
            printf("%3i ", (int)h_a[i*16 + j]);
    for (int i = 0; i < 16; ++i) {
        for (int j = 0; j < 16; ++j) {
            printf("%3i ", (int)h_c[i*16 + j]);

5. Conclusion

Using WMMA for in-register matrix transpose on AMD RDNA 4 architecture GPUs provides a lightweight alternative to CUDA’s ldmatrix.trans and movmatrix instructions, particularly when matrix core utilization is low.

This technique has been deployed in Llama.cpp to implement ldmatrix_trans in Flash Attention on RDNA 4, serving as a real-world validation of the approach.

Footnotes

Links to third party sites are provided for convenience and unless explicitly stated, AMD is not responsible for the contents of such linked sites and no endorsement is implied. GD-97.

Source link