04-GPU Programming 101
This article was last updated on June 30, 2025.
This is a lecture note for the course CSE 234 - Data Systems for ML - LE [A00].
- From UC San Diego
- Prof. Zhang Hao
- Winter, 2025
- Link: https://podcast.ucsd.edu/watch/wi25/cse234_a00/1
1 Basic Concepts in GPU
To calculate the FLOPs (Floating Point Operations) of matrix multiplication $C = AB$, where $A \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times k}$, and $C \in \mathbb{R}^{m \times k}$, the FLOPs is $2mnk$: each of the $mk$ output entries needs $n$ multiplications and $n$ additions.
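As a quick worked example (the sizes are chosen only for illustration): multiplying two $1024 \times 1024$ matrices costs $2 \cdot 1024 \cdot 1024 \cdot 1024 \approx 2.1 \times 10^{9}$ FLOPs.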
In GPU programming, there are a few basic concepts:
- Threads: The smallest units to process a chunk of data
- Blocks: A group of threads that share memory
- Grid: A collection of blocks that execute the same kernel
- Kernel: CUDA program (Just a fancy name)
A thread runs on a CUDA core, which is one ALU in the GPU. A block maps to a streaming multiprocessor (SM), which contains many CUDA cores, and a GPU contains many SMs. The grid maps to the GPU itself.
To build a more powerful GPU, we can build more SMs, more cores inside the SM or more powerful cores.
For instance, the Nvidia H100 (the full GH100 die) has 144 SMs, and each SM can hold up to 2048 threads.
2 CUDA Programming
2.1 Execution Model
In CUDA, there is an indexing system. For instance, consider a 2D thread index within a block: moving along a row changes the x coordinate of the index, which you can access through the built-in variable threadIdx.x; moving along a column changes the y coordinate, accessed through threadIdx.y. This is different from tensor indexing, where we access rows by the x coordinate and columns by the y coordinate.
Writing a CUDA program requires two parts of code, one running on the CPU and one on the GPU. For example:
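A minimal host-side sketch (the matrix size N, the matrixAdd kernel signature, and the 16×16 block shape are assumptions for illustration):

```
__global__ void matrixAdd(const float *A, const float *B, float *C);  // kernel defined separately

int main() {
    int N = 1024;
    size_t bytes = N * N * sizeof(float);

    float *A, *B, *C;
    cudaMalloc(&A, bytes);                  // device (GPU) allocations
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    // ... copy the input matrices into A and B with cudaMemcpy ...

    dim3 threadsPerBlock(16, 16);           // 256 threads per block
    dim3 numBlocks(N / 16, N / 16);         // enough blocks to cover the whole matrix
    matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);   // launch on the GPU
}
```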
The above code runs on the CPU; only the last line, the matrixAdd launch, executes on the GPU.
When programming CUDA kernels, we have several built-in variables to use:
- gridDim: the dimensions of the grid (how many blocks); we can access gridDim.x and gridDim.y, where x is the horizontal (#columns) size and y is the vertical (#rows) size.
- blockIdx: the index of the block inside the grid; it also has blockIdx.x and blockIdx.y, where x changes when moving along the horizontal dimension and y when moving along the vertical dimension.
- blockDim: similar to gridDim, it gives the dimensions of the block (how many threads), accessed by blockDim.x and blockDim.y.
- threadIdx: the index of the thread inside the block; it also has threadIdx.x and threadIdx.y, following the same indexing convention as above.
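A common pattern is to combine these built-ins to compute each thread's global 2D position (a sketch; the kernel name and the width/height parameters are assumptions for illustration):

```
__global__ void whereAmI(int *out, int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // global x (column)
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // global y (row)
    if (row < height && col < width) {
        out[row * width + col] = row * width + col;    // row-major flattened index
    }
}
```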
Now let's see a complete example of a CUDA program that calculates A + 2*B.
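A sketch of what the full program can look like (the doubleValue helper, the fixed N, and the 16×16 block shape are assumptions consistent with the surrounding text):

```
#include <cuda_runtime.h>

#define N 1024                                     // matrix side length

__device__ float doubleValue(float x) {            // runs on the GPU, called from the kernel
    return 2.0f * x;
}

// Each thread computes one element of C = A + 2*B.
__global__ void matrixAdd(const float *A, const float *B, float *C) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N) {
        int i = row * N + col;                     // every thread touches a unique element
        C[i] = A[i] + doubleValue(B[i]);
    }
}

int main() {
    size_t bytes = N * N * sizeof(float);
    float *A, *B, *C;
    cudaMalloc(&A, bytes);
    cudaMalloc(&B, bytes);
    cudaMalloc(&C, bytes);
    // ... fill A and B (e.g., cudaMemcpy from host arrays) ...

    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / 16, N / 16);
    matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    cudaDeviceSynchronize();                       // wait for the kernel to finish
}
```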
The __global__ qualifier tells the compiler that this function is a CUDA kernel. In the kernel above, each thread indexes its data using blockIdx, blockDim, and threadIdx, and then executes the computation.
So we basically launch lots of threads, define how these threads map onto our data, and each thread executes the addition and doubleValue independently and simultaneously.
Also, we need to avoid having multiple threads write to the same location in an array.
In the code above, we ask each thread to index the data according to its threadIdx, so every thread accesses a unique element in the matrix: it reads a unique value and writes to the corresponding position of C. Since these threads run independently and simultaneously, the addition is done in parallel.
The CPU part is called host code and runs serially. The GPU part is called device code; it is SIMD-style code that runs in parallel.
One important thing is that the launch matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C) does not block the CPU: the CPU executes the next line right after launching the CUDA kernel. So if you want to wait for the kernel to finish, you must use a CUDA synchronization function.
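For example (a minimal sketch; numBlocks, threadsPerBlock, and the pointers are assumed to be set up as in the example above):

```
matrixAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);  // returns immediately, kernel runs asynchronously
// ... the CPU can do other work here while the GPU computes ...
cudaDeviceSynchronize();   // blocks the CPU until all launched GPU work has finished
```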
Based on the above example, writing a CUDA kernel involves:
- Writing both CPU and GPU code
- Statically declaring blockDim and the grid/block shapes
- Mapping data to blocks and threads
One big thing here is that a CUDA kernel is static: first, numBlocks and threadsPerBlock are fixed at launch and cannot vary during execution; second, using if/else branches will slow it down.
For example, consider a kernel like this:
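A sketch of such a kernel (the name and exact operations are assumptions consistent with the description that follows):

```
__global__ void branchy(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] > 0) {
            x[i] = 2.0f * x[i];     // some threads in a warp take this path...
        } else {
            x[i] = expf(x[i]);      // ...while others take this one (divergence)
        }
    }
}
```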
The code above asks different threads to do different operations: if the value a thread accesses is greater than 0, we double it; otherwise we take its exponential.
When the GPU executes this code, it first executes the threads for which x > 0 while leaving the rest idle; after those threads finish, it executes the threads for which x <= 0.
The formal terms for this behavior are coherent execution and divergent execution.
- Coherent execution: the same instructions apply to all data.
- Divergent execution: different instructions apply to some data; this should be minimized in CUDA.
2.2 Memory
A GPU has a special memory called HBM (High Bandwidth Memory), which is much faster than the DRAM used by the CPU. CPU memory is also called host memory, and GPU memory is called device memory. Moreover, CPU code cannot access device memory, and GPU code cannot access host memory.
So, in order to move data between the two memories, we use a few CUDA APIs:
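A sketch of the typical allocate/copy pattern (the names and sizes are assumptions):

```
#include <cstdlib>

int main() {
    int N = 1024;
    size_t bytes = N * sizeof(float);

    float *h_x = (float *)malloc(bytes);   // host (CPU) memory
    float *d_x;
    cudaMalloc(&d_x, bytes);               // device (GPU) memory

    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);   // host -> device
    // ... launch kernels that read and write d_x ...
    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);   // device -> host

    cudaFree(d_x);
    free(h_x);
}
```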
Another useful concept is pinned memory. Pinned memory is a region of host memory optimized for data transfer between CPU and GPU. It is not pageable by the OS; it is reserved for CUDA. Certain CUDA APIs only work on pinned memory.
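For instance, pinned host memory can be allocated with cudaMallocHost (a sketch; the size and pointer names are assumptions):

```
int main() {
    size_t bytes = 1024 * sizeof(float);
    float *d_x;
    cudaMalloc(&d_x, bytes);

    float *h_pinned;
    cudaMallocHost(&h_pinned, bytes);      // pinned (page-locked) host memory
    cudaMemcpyAsync(d_x, h_pinned, bytes, cudaMemcpyHostToDevice);  // async copies want pinned memory
    cudaDeviceSynchronize();

    cudaFreeHost(h_pinned);
    cudaFree(d_x);
}
```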
Besides HBM, each thread has its own private memory (faster), and each block also has its own memory (still faster than HBM), which is shared by all threads in that block.
But why do we need memory shared across threads? Let's look at an example. Suppose we want to compute a moving average over an array with a window size of 3, which means:
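Something like the following (a sketch of the idea; the variable names and boundary convention are assumptions):

```
// Window size 3: each output element averages three consecutive input elements.
output[i] = (input[i] + input[i + 1] + input[i + 2]) / 3.0f;
```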
One way to write the CUDA kernel is:
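A sketch of this naive version (the kernel name is an assumption; input is assumed to have N + 2 elements so the window never runs past the end):

```
__global__ void movingAvgNaive(const float *input, float *output, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        // Each thread reads its own three elements directly from HBM.
        output[i] = (input[i] + input[i + 1] + input[i + 2]) / 3.0f;
    }
}
```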
We ask each thread to read and reduce its three consecutive elements. There are N threads and each reads 3 elements, which is 3N reads in total. But we know there is overlap: neighboring threads read mostly the same elements.
The solution is shared memory.
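A sketch of the shared-memory version (the 128-thread block size and kernel name are assumptions; input again has the two extra tail elements, and N is assumed to be a multiple of the block size):

```
#define THREADS_PER_BLOCK 128

__global__ void movingAvgShared(const float *input, float *output, int N) {
    __shared__ float support[THREADS_PER_BLOCK + 2];   // the block's slice of the input

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Cooperative load: each thread loads one element; two threads load the tail.
    support[threadIdx.x] = input[i];
    if (threadIdx.x < 2) {
        support[THREADS_PER_BLOCK + threadIdx.x] = input[i + THREADS_PER_BLOCK];
    }
    __syncthreads();                                   // wait until the slice is fully loaded

    if (i < N) {
        output[i] = (support[threadIdx.x] + support[threadIdx.x + 1] +
                     support[threadIdx.x + 2]) / 3.0f;
    }
}
```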
Now each block only reads about blockDim.x + 2 elements from HBM (each element once, plus the two-element tail), whereas the previous code needs 3 × blockDim.x reads per block.
Synchronization APIs:
- __syncthreads(): waits for all threads in a block to arrive at this line of code.
- cudaDeviceSynchronize(): synchronizes between host and device; the host blocks until all previously launched GPU work has finished.
2.3 Scheduler
The user may request many blocks, but a GPU can only run a limited number of blocks at a time, and different GPUs support different numbers of concurrently running blocks.
In a CUDA program, we assume that thread blocks can be executed in any order; this is reasonable because blocks run independently. The GPU maps thread blocks to SMs using a dynamic scheduling policy that respects resource requirements.
Assume we request 8K blocks, each requiring 520 bytes of shared memory and 128 CUDA threads, but we only have a GPU with 2 SMs, where each SM supports 384 CUDA threads and 1.5 KB of shared memory. How do we schedule them?
- SM-0 gets Block 0, it occupies (0-127) threads, and takes 520 bytes shared memory.
- SM-1 gets Block 1, it occupies (0-127) threads, and takes 520 bytes shared memory.
- SM-0 gets Block 2, it occupies (128-255) threads, and takes another 520 bytes shared memory.
- SM-1 gets Block 3, it occupies (128-255) threads, and takes another 520 bytes shared memory.
Now we can't allocate another block, because the shared memory is no longer sufficient (a third block per SM would need 3 × 520 = 1560 bytes, but each SM only has 1536 bytes). The remaining blocks are put in a queue and wait until the allocated blocks finish their work.
From the above example we also see that one SM can hold multiple blocks at once, as long as it has enough resources.
In practice, we usually want to oversubscribe the resources so that GPU utilization stays high.
CUDA cores and CUDA threads: the number of CUDA cores is not equal to the number of CUDA threads. If you read GPU specs, you may notice this. A CUDA thread is a software-level execution thread, and we can launch thousands of them.
Threads are then scheduled in groups of 32 called warps, which issue the same instruction across multiple data elements in lock-step (Single Instruction, Multiple Threads, SIMT).
GPUs support launching far more threads than there are CUDA cores (often 10–30× more). This is intentional: while one warp waits for memory or other latency, another warp can run, keeping the cores busy.