04-GPU Programming 101

Last updated: June 30, 2025

This is a lecture note for the course CSE 234 - Data Systems for ML - LE [A00].

1 Basic Concepts in GPU

To calculate the FLOPs (Floating Point Operations) of the matrix multiplication $C = A \times B$, where $A \in \mathbb{R}^{m\times n}$, $B \in \mathbb{R}^{n\times p}$, $C \in \mathbb{R}^{m\times p}$: each of the $mp$ output entries requires $n$ multiplications and $n$ additions, so the total FLOPs is

$$2mnp$$
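As a quick sanity check, the following minimal C++ snippet (the matrix sizes here are made up purely for illustration) just plugs concrete numbers into the $2mnp$ formula:

#include <cstdio>

int main() {
    // Hypothetical sizes: A is m x n, B is n x p, C is m x p
    long long m = 4096, n = 4096, p = 4096;
    double flops = 2.0 * m * n * p;          // 2mnp
    printf("Matmul FLOPs: %.3e\n", flops);   // ~1.37e+11 for these sizes
    return 0;
}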

In GPU programming, there are a few basic concepts:

  • Threads: The smallest units to process a chunk of data
  • Blocks: A group of threads that share memory
  • Grid: A collection of blocks that execute the same kernel
  • Kernel: CUDA program (Just a fancy name)

A thread runs on a CUDA core, which is essentially one ALU in the GPU. A block maps to a streaming multiprocessor (SM), and each SM contains many CUDA cores. A GPU contains many SMs. The grid maps to the GPU itself.

To build a more powerful GPU, we can add more SMs, put more cores inside each SM, or make each core more powerful.

For instance, the Nvidia H100 has up to 144 SMs (on the full GH100 die), and each SM can hold up to 2048 resident threads.
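These numbers can be queried at runtime. Below is a minimal sketch using cudaGetDeviceProperties; the printed values will of course depend on the GPU you run it on:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("SMs: %d\n", prop.multiProcessorCount);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}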

2 CUDA Programming

2.1 Execution Model

In CUDA, there is an indexing system. For instance, a 2D $4\times 3$ thread index layout looks like this

$$\begin{bmatrix} (0,0)&(1,0)&(2,0)&(3,0)\\ (0,1)&(1,1)&(2,1)&(3,1)\\ (0,2)&(1,2)&(2,2)&(3,2) \end{bmatrix}$$

As you can see, inside this 2D block, moving along a row changes the x coordinate of the index, which you can access through the built-in variable threadIdx.x. Moving along a column changes the y coordinate, accessed through threadIdx.y. This is different from tensor indexing, where the first coordinate selects the row and the second selects the column.
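To make this concrete, here is a tiny throwaway kernel (not part of the lecture code) that prints each thread's 2D index for the $4\times 3$ layout above; x moves horizontally and y vertically:

#include <cstdio>

__global__ void printIndex() {
    // each thread prints its own (x, y) coordinates
    printf("(x=%d, y=%d)\n", threadIdx.x, threadIdx.y);
}

int main() {
    dim3 threadsPerBlock(4, 3, 1);          // the 4x3 layout shown above
    printIndex<<<1, threadsPerBlock>>>();   // one block, 12 threads
    cudaDeviceSynchronize();                // wait so the output gets flushed
    return 0;
}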

Writing a CUDA program requires two parts of code: one that runs on the CPU and one that runs on the GPU. For example

const int Nx = 12;
const int Ny = 6;

// 12 threads per block; the grid is 3*2 = 6 blocks, 72 threads in total
dim3 threadsPerBlock(4, 3, 1);
dim3 numBlocks(Nx/threadsPerBlock.x, Ny/threadsPerBlock.y, 1);

// Assume A, B, C are Nx * Ny float arrays
matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);

The code above runs on the CPU; only the matrixAdd kernel launched on the last line executes on the GPU.

When programming CUDA kernels, we have some built-in variables to use (a small indexing sketch follows this list):

  • gridDim: The dimensions of the grid (how many blocks). We can access gridDim.x and gridDim.y; again, x is the horizontal size (#columns of blocks) and y is the vertical size (#rows of blocks).
  • blockIdx: The index of the block inside the grid, with blockIdx.x and blockIdx.y; x changes when moving along the horizontal dimension and y when moving along the vertical dimension.
  • blockDim: Similar to gridDim, but gives the dimensions of a block (how many threads), accessed via blockDim.x and blockDim.y.
  • threadIdx: The index of the thread inside its block, with threadIdx.x and threadIdx.y; the indexing follows the same convention as above.
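Putting these together, the usual pattern (a minimal sketch, assuming a 2D grid of 2D blocks laid over an Ny-row by Nx-column matrix) is to turn the block and thread indices into a global position:

int col = blockIdx.x * blockDim.x + threadIdx.x;    // global horizontal position
int row = blockIdx.y * blockDim.y + threadIdx.y;    // global vertical position
int offset = row * (gridDim.x * blockDim.x) + col;  // row-major 1D offset, if needed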

Now let’s see the complete example of a CUDA program: calculate A + 2*B

// CPU part ----------------
const int Nx = 12;
const int Ny = 6;

// 12 threads per block; the grid is 3*2 = 6 blocks, 72 threads in total
dim3 threadsPerBlock(4, 3, 1);
dim3 numBlocks(Nx/threadsPerBlock.x, Ny/threadsPerBlock.y, 1);

// Assume A, B, C are Nx * Ny float arrays
matrixAddDoubleB<<<numBlocks, threadsPerBlock>>>(A, B, C);

// ------------------------

// GPU part
__device__ float doubleValue(float x)
{
    return 2 * x;
}

__global__ void matrixAddDoubleB(
    float A[Ny][Nx],
    float B[Ny][Nx],
    float C[Ny][Nx]
)
{
    /* Convert the block and thread indices into indices on the
       columns and rows of the matrices. Similar to a strided layout:
       when moving horizontally, one block index skips blockDim.x threads. */
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;

    C[j][i] = A[j][i] + doubleValue(B[j][i]);
}

The __global__ qualifier tells the compiler that this function is a CUDA kernel. In the kernel above, each thread uses blockIdx, blockDim, and threadIdx to index its own piece of data and then executes the computation.

So we basically launch lots of threads, define how these threads map onto our data, and each thread executes the addition and doubleValue independently and simultaneously.

Also, we need to avoid having multiple threads write to the same location in an array.

In the code above, each thread indexes the data according to its threadIdx (and blockIdx), so every thread reads a unique value from the matrices and writes to the corresponding position of C. Since these threads run independently and simultaneously, the addition happens in parallel.

The CPU part is called host code and runs serially. The GPU part is called device code; it is SIMD-style code that runs in parallel.

One important thing is that the kernel launch matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C) does not block the CPU: the CPU moves on to the next line right after launching the kernel. So if you want to wait for the kernel to finish, you must call a CUDA synchronization function.
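For example (a minimal sketch reusing the launch above; hostC and bytes are placeholder names), the host typically synchronizes before reading the results back:

matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);  // returns immediately
// ... the CPU is free to do other work here ...
cudaDeviceSynchronize();                             // block until all launched GPU work finishes
cudaMemcpy(hostC, C, bytes, cudaMemcpyDeviceToHost); // now it is safe to read the results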

Based on the above example, writing a CUDA kernel involves:

  • Writing both CPU (host) and GPU (device) code
  • Statically declaring blockDim and the data shapes
  • Mapping the data to blocks and threads

One big thing here is that a CUDA kernel is quite static: first, the launch configuration (numBlocks, threadsPerBlock) and the shapes used in the kernel are fixed at launch time rather than varying during execution; second, using if/else inside the kernel will slow it down.

For example, consider code like this

__global__ void f(float A[N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = A[i];
    if (x > 0)
    {
        x = 2.0f * x;
    } else {
        x = expf(x);
    }
    A[i] = x;
}

The code above asks different threads to do different operations: if the value read by a thread is greater than 0, we double it; otherwise we take its exponential.

When the GPU executes this code, it first runs the threads that found x > 0 while leaving the rest idle; after those threads finish, it runs the threads that found x <= 0.

The formal terms for this behavior are coherent execution and divergent execution.

  • Coherent execution: The same instructions apply to all data.
  • Divergent execution: Different instructions run on different pieces of data (illustrated in the sketch below). This should be minimized in CUDA.
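As a rough illustration (these two kernels are not from the lecture), a branch whose condition differs between neighbouring threads diverges inside a group of threads that execute together (a warp, introduced later), while a condition that is uniform for the whole block stays coherent:

// Divergent: neighbouring threads take different branches.
__global__ void divergentKernel(float* a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0) {
        a[i] = 2.0f * a[i];
    } else {
        a[i] = expf(a[i]);
    }
}

// Coherent: every thread in a block takes the same branch.
__global__ void coherentKernel(float* a)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (blockIdx.x % 2 == 0) {
        a[i] = 2.0f * a[i];
    } else {
        a[i] = expf(a[i]);
    }
}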

2.2 Memory

In a GPU, we have a special memory called HBM (High Bandwidth Memory); it is much faster than the DRAM used by the CPU. CPU memory is also called host memory, and GPU memory is called device memory. By default, CPU code cannot access device memory, and GPU code cannot access host memory.

So, in order to move data between the two memories, we use some APIs from CUDA

float* A = new float[N];
// initialize the elements on the host
for (int i=0; i<N; i++)
{
    A[i] = (float)i;
}

int bytes = sizeof(float) * N;
float* deviceA;
// allocate `bytes` bytes of device memory
cudaMalloc(&deviceA, bytes);
// populate deviceA with the contents of A (host -> device copy)
cudaMemcpy(deviceA, A, bytes, cudaMemcpyHostToDevice);
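After the kernel has produced its results, the same APIs move the data back and release the memory (a short sketch continuing with the variables above):

// copy the results back from device to host
cudaMemcpy(A, deviceA, bytes, cudaMemcpyDeviceToHost);
// free device and host memory
cudaFree(deviceA);
delete[] A;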

Another useful concept is pinned memory. Pinned memory is a region of host memory that is optimized for data transfer between CPU and GPU: it is not pageable by the OS because it is reserved for CUDA, and certain CUDA APIs (such as truly asynchronous cudaMemcpyAsync transfers) only work with pinned memory.
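A minimal sketch of how this looks with cudaMallocHost (deviceA and N are reused from the previous example):

float* pinnedA;
// allocate page-locked (pinned) host memory
cudaMallocHost(&pinnedA, sizeof(float) * N);
// ... fill pinnedA on the host ...
// asynchronous copies can only overlap with other work when the host buffer is pinned
cudaMemcpyAsync(deviceA, pinnedA, sizeof(float) * N, cudaMemcpyHostToDevice);
// make sure the copy has finished before the buffer is released
cudaDeviceSynchronize();
cudaFreeHost(pinnedA);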

Besides HBM, each thread has its own private memory (faster than HBM), and each block has its own shared memory (also much faster than HBM), which is shared by all threads in that block.

But why do we need memory shared across threads? Let's look at an example. Suppose we want to compute a moving average over an array with window size 3, which means

for i in range(len(input)-2):
    output[i] = (input[i]+input[i+1]+input[i+2]) / 3

One way to write this in CUDA is

// A large array
int N = 1024 * 1024;
// Allocate arrays in device memory.
/* Here we consider the boundary condition where we
   need to read two more elements at the end of the array. */
cudaMalloc(&devInput, sizeof(float)*(N+2));
cudaMalloc(&devOutput, sizeof(float)*N);

// some memory copy goes here

#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output)
{
    // suppose 1D threads and 1D blocks
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    float result = 0.0;
    // reduce three consecutive elements
    for (int i=0; i<3; i++)
    {
        result += input[index + i];
    }
    output[index] = result / 3.0;
}

We ask each thread to read and reduce its three consecutive elements. There are N threads and each reads 3 elements, which is $3N$ reads in total. But we know there is overlap between neighbouring threads.

The solution is shared memory.

#define THREADS_PER_BLK 128

__global__ void convolve(int N, float* input, float* output)
{
    // suppose 1D threads and 1D blocks
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // allocate a shared memory buffer for this block
    __shared__ float support[THREADS_PER_BLK+2];
    // copy the part of the input belonging to this block into shared memory
    support[threadIdx.x] = input[index];
    /* handle the boundary condition; note that this branch causes
       a slight efficiency degradation */
    if (threadIdx.x < 2)
    {
        // the first two threads also copy elements 128 and 129 of the window
        support[THREADS_PER_BLK+threadIdx.x] = input[index+THREADS_PER_BLK];
    }
    // barrier: wait until all threads in the block have finished copying
    __syncthreads();

    float result = 0.0;
    for (int i=0; i<3; i++)
    {
        /* read from the shared buffer of this block; only threads in
           this block can see it, so we index with threadIdx only */
        result += support[threadIdx.x+i];
    }
    output[index] = result / 3.0;
}

Now we only have to read $128+2$ times per block, but the previous code needs $128\times 3$ times.

Synchronization APIs

  • __syncthreads(): Waits for all threads in a block to arrive at this line of code.
  • cudaDeviceSynchronize(): Synchronizes between host and device; the host blocks until all previously launched device work has finished.

2.3 Scheduler

The user may ask for many blocks, but a GPU can only run a limited number of blocks at a time, and different GPUs can run different numbers of blocks.

In a CUDA program, we assume that thread blocks can be executed in any order. This is reasonable because blocks run independently of each other. The GPU maps thread blocks to SMs using a dynamic scheduling policy that respects resource requirements.

Assume we ask for 8K blocks, each block requiring 520 bytes of shared memory and 128 CUDA threads, but we only have a GPU with 2 SMs, where each SM supports 384 CUDA threads and 1.5KB of shared memory. How do we schedule them?

  1. SM-0 gets Block 0; it occupies threads 0-127 and takes 520 bytes of shared memory.
  2. SM-1 gets Block 1; it occupies threads 0-127 and takes 520 bytes of shared memory.
  3. SM-0 gets Block 2; it occupies threads 128-255 and takes another 520 bytes of shared memory.
  4. SM-1 gets Block 3; it occupies threads 128-255 and takes another 520 bytes of shared memory.

Now we cannot allocate another block to either SM because the shared memory is not enough: a third block on an SM would need 3 × 520 = 1560 bytes, more than the 1.5KB (1536 bytes) available. The remaining blocks are put in a queue and wait until the running blocks finish their work.

From the example above we also see that one SM can hold multiple blocks as long as it has enough resources for them.

In practice, we usually want to oversubscribe the resources (request more blocks than can run at once) so that GPU utilization stays high.
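If you want to know how many blocks of a given kernel actually fit on one SM, the runtime can tell you. Here is a minimal sketch (using the convolve kernel and THREADS_PER_BLK from above) based on cudaOccupancyMaxActiveBlocksPerMultiprocessor:

int blocksPerSM = 0;
// how many 128-thread blocks of convolve can be resident on one SM,
// given the kernel's register and shared memory usage
cudaOccupancyMaxActiveBlocksPerMultiprocessor(
    &blocksPerSM, convolve, /*blockSize=*/THREADS_PER_BLK, /*dynamicSMemSize=*/0);
printf("Resident blocks per SM: %d\n", blocksPerSM);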

CUDA cores and CUDA threads: the number of CUDA cores is not equal to the number of CUDA threads.

If you read GPU specs, you may notice this. A CUDA core is a hardware execution unit, while a CUDA thread is a software-level execution thread; we can launch thousands of them.

Threads are then scheduled in groups of 32 called warps, which issue the same instruction across multiple data elements in lock-step (Single Instruction Multiple Threads, SIMT).

GPUs support launching far more threads than there are CUDA cores (often 10–30× more). This is intentional: while one warp waits for memory or other latency, another warp can run, keeping the cores busy.

