Welcome to Lesson 2 of the CUDA Programming Tutorial Series! In Lesson 1, we explored why GPUs outperform CPUs and how CUDA unlocks that power. Now it's time to get our hands dirty and write our first CUDA program.
🧑‍💻 What You'll Learn
- How to write a CUDA kernel
- How to launch threads on the GPU
- How to compile and run CUDA code
- How basic thread indexing works
🔧 Step 1: Set Up Your Environment
Before we code, make sure you've installed:
- CUDA Toolkit from NVIDIA
- A C++ compiler (e.g., g++)
- An NVIDIA GPU with CUDA support

Verify the installation with:

```bash
nvcc --version
```

✍️ Step 2: Write Your First CUDA Program
Let's create a simple program that prints a message from multiple GPU threads.
📄 hello.cu
```cpp
#include <stdio.h>

// Kernel function that runs on the GPU
__global__ void helloCUDA() {
    printf("Hello from block %d, thread %d!\n", blockIdx.x, threadIdx.x);
}

int main() {
    // Launch kernel with 2 blocks and 4 threads per block
    helloCUDA<<<2, 4>>>();

    // Wait for the GPU to finish
    cudaDeviceSynchronize();
    return 0;
}
```

🧠 What's Happening Here?
- `__global__` marks a function as a GPU kernel.
- `<<<2, 4>>>` launches the kernel with 2 blocks and 4 threads per block.
- `blockIdx.x` and `threadIdx.x` are built-in variables that identify each block and thread.
- `cudaDeviceSynchronize()` ensures the CPU waits for the GPU to finish.
🧪 Step 3: Compile and Run
Use the NVIDIA compiler:
```bash
nvcc hello.cu -o hello
./hello
```

🖨️ Output:
```text
Hello from block 0, thread 0!
Hello from block 0, thread 1!
Hello from block 0, thread 2!
Hello from block 0, thread 3!
Hello from block 1, thread 0!
Hello from block 1, thread 1!
Hello from block 1, thread 2!
Hello from block 1, thread 3!
```

Boom! You just launched 8 parallel threads on your GPU. (Note: blocks execute independently, so the order of the lines may differ from run to run.)
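One habit worth forming now: kernel launches fail silently — the `<<< >>>` syntax returns nothing, so a bad launch just produces no output. A minimal sketch of how you might add error checks to the same program, using the CUDA runtime's `cudaGetLastError()` and `cudaGetErrorString()` (this requires an NVIDIA GPU to actually run):

```cpp
#include <stdio.h>

__global__ void helloCUDA() {
    printf("Hello from block %d, thread %d!\n", blockIdx.x, threadIdx.x);
}

int main() {
    helloCUDA<<<2, 4>>>();

    // The launch itself returns no error; query the last error instead.
    cudaError_t launchErr = cudaGetLastError();
    if (launchErr != cudaSuccess) {
        fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(launchErr));
        return 1;
    }

    // cudaDeviceSynchronize() reports errors that occur during execution.
    cudaError_t syncErr = cudaDeviceSynchronize();
    if (syncErr != cudaSuccess) {
        fprintf(stderr, "Kernel error: %s\n", cudaGetErrorString(syncErr));
        return 1;
    }
    return 0;
}
```

Try it yourself: change the launch to something invalid like `<<<2, 100000>>>` (far more threads per block than any GPU allows) and you'll see an error message instead of silent, empty output.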
🧭 What's Next?
In Lesson 3, we'll dive deeper into CUDA's execution hierarchy: threads, blocks, and grids. You'll learn how to scale your programs and control parallelism like a pro.