Welcome to Lesson 2 of the CUDA Programming Tutorial Series! In Lesson 1, we explored why GPUs outperform CPUs and how CUDA unlocks that power. Now it's time to get our hands dirty and write our first CUDA program.

🧑‍💻 What You'll Learn

  • How to write a CUDA kernel
  • How to launch threads on the GPU
  • How to compile and run CUDA code
  • The basics of thread indexing

🔧 Step 1: Set Up Your Environment

Before we code, make sure you've installed:

  • CUDA Toolkit from NVIDIA
  • A C++ compiler (e.g., g++)
  • An NVIDIA GPU with CUDA support

Verify the installation with:

nvcc --version

✍️ Step 2: Write Your First CUDA Program

Let's create a simple program that prints a message from multiple GPU threads.

📄 hello.cu

#include <stdio.h>

// Kernel function that runs on the GPU
__global__ void helloCUDA() {
    printf("Hello from block %d, thread %d!\n", blockIdx.x, threadIdx.x);
}

int main() {
    // Launch kernel with 2 blocks and 4 threads per block
    helloCUDA<<<2, 4>>>();

    // Wait for the GPU to finish before the program exits
    cudaDeviceSynchronize();
    return 0;
}

🧠 What's Happening Here?

  • __global__ marks a function as a GPU kernel.
  • <<<2, 4>>> launches the kernel with 2 blocks and 4 threads per block.
  • blockIdx.x and threadIdx.x are built-in variables giving each thread's block index and its index within that block.
  • cudaDeviceSynchronize() ensures the CPU waits for the GPU to finish.
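Blocks and threads combine into a single unique ID per thread, which is how real kernels decide which piece of data each thread works on. As a sketch (using a hypothetical helloGlobal kernel, not part of the example above), the built-in blockDim.x gives the number of threads per block:

```cuda
#include <stdio.h>

// Each thread derives a globally unique index from its coordinates:
// globalId = blockIdx.x * blockDim.x + threadIdx.x
__global__ void helloGlobal() {
    int globalId = blockIdx.x * blockDim.x + threadIdx.x;
    printf("Hello from global thread %d!\n", globalId);
}

int main() {
    // 2 blocks * 4 threads per block = global IDs 0 through 7
    helloGlobal<<<2, 4>>>();
    cudaDeviceSynchronize();
    return 0;
}
```

This indexing pattern is the workhorse of CUDA: in later lessons, globalId typically becomes the array element a thread reads or writes.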

🧪 Step 3: Compile and Run

Use the NVIDIA compiler:

nvcc hello.cu -o hello
./hello

🖨️ Sample output (the GPU schedules blocks and threads concurrently, so the order of lines may differ on your machine):

Hello from block 0, thread 0!
Hello from block 0, thread 1!
Hello from block 0, thread 2!
Hello from block 0, thread 3!
Hello from block 1, thread 0!
Hello from block 1, thread 1!
Hello from block 1, thread 2!
Hello from block 1, thread 3!

Boom! You just launched 8 parallel threads on your GPU.
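One habit worth picking up right away: kernel launches don't return an error status themselves, so a bad launch fails silently unless you check. A minimal sketch of the same program with error checking, using the runtime API's cudaGetLastError and cudaGetErrorString:

```cuda
#include <stdio.h>

__global__ void helloCUDA() {
    printf("Hello from block %d, thread %d!\n", blockIdx.x, threadIdx.x);
}

int main() {
    helloCUDA<<<2, 4>>>();

    // Catch launch-time errors (bad configuration, no device, etc.)
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "Launch failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // cudaDeviceSynchronize() also returns a status, covering errors
    // that occur while the kernel is actually running.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "Kernel execution failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

With only 8 threads there is little to go wrong, but this pattern will save you hours of debugging once kernels get more complex.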

🧭 What's Next?

In Lesson 3, we'll dive deeper into CUDA's execution hierarchy: threads, blocks, and grids. You'll learn how to scale your programs and control parallelism like a pro.