Dr. Partha Majumder is a Gen-AI product engineer and research scientist with a Ph.D. from IIT Bombay. He specializes in building end-to-end, production-ready Gen-AI applications using Python, FastAPI, LangChain/LangGraph, and Next.js.
This part is intentionally practical. Instead of staying in theory, we will focus on complete, runnable CUDA C++ examples and use them to build a clear mental model of kernels, indexing, and host-device execution.
The examples are organized to increase in complexity:
Array summation (1D indexing)
Matrix summation (2D indexing)
Matrix transpose (an indexing-heavy pattern that appears often in real workloads)
Each example uses the same basic structure—allocate memory, copy inputs to the GPU, launch a kernel, copy results back—so the differences stand out for the right reasons.
Example 1: Summation of Two Arrays Using CUDA
Array addition is a clean starting point for CUDA programming because it makes the mapping obvious: one GPU thread can compute one array element. The kernel in this example calculates a global index and uses that index to read from a and b and write the sum into c.
#include <iostream>

#define N 1000
#define BLOCK_DIM 256

using namespace std;

// CUDA kernel to add two arrays element-wise
__global__ void arraySum(int *a, int *b, int *c)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx < N)
    {
        c[idx] = a[idx] + b[idx];
    }
}

int main()
{
    // Host arrays for input, output, and a CPU reference result
    int *h_a = (int*)malloc(N * sizeof(int));
    int *h_b = (int*)malloc(N * sizeof(int));
    int *h_c = (int*)malloc(N * sizeof(int));
    int *h_sum = (int*)malloc(N * sizeof(int));

    // Initialize host arrays and compute the reference sum on the CPU
    for (int i = 0; i < N; i++)
    {
        h_a[i] = i;
        h_b[i] = i;
        h_sum[i] = h_a[i] + h_b[i];
    }

    // Device arrays for input and output
    int *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, N * sizeof(int));
    cudaMalloc((void**)&d_b, N * sizeof(int));
    cudaMalloc((void**)&d_c, N * sizeof(int));

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice);

    // Calculate block size and grid size for the CUDA kernel
    dim3 dimBlock(BLOCK_DIM);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x);

    // Launch the CUDA kernel
    arraySum<<<dimGrid, dimBlock>>>(d_a, d_b, d_c);

    // Copy the result from device to host
    cudaMemcpy(h_c, d_c, N * sizeof(int), cudaMemcpyDeviceToHost);

    // Print the result and check it against the CPU reference
    bool ok = true;
    for (int i = 0; i < N; i++)
    {
        cout << h_c[i] << " ";
        if (h_c[i] != h_sum[i]) ok = false;
    }
    cout << endl << (ok ? "Result matches CPU reference" : "Result mismatch") << endl;

    // Free device and host memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    free(h_a);
    free(h_b);
    free(h_c);
    free(h_sum);
    return 0;
}
This example follows the standard CUDA workflow and highlights the core idea of 1D indexing.
Allocates input/output arrays on the host, then allocates matching buffers on the device using cudaMalloc.
Copies input arrays from host to device using cudaMemcpy.
Computes launch configuration (dimGrid, dimBlock) so there are enough threads to cover N elements.
In the kernel, each thread computes a global index idx and guards with if (idx < N) to avoid out-of-bounds access.
Copies results back to the host and prints them, then releases device memory with cudaFree.
Example 2: Summation of Two Matrices Using CUDA
Once the 1D case feels straightforward, the natural next step is 2D data. Matrix addition uses the same idea as arrays—one thread per element—but the indexing expands into two dimensions (x for columns and y for rows).
// Add two matrices
#include <iostream>

#define N 10
#define M 20
#define BLOCK_DIM 16

using namespace std;

// CUDA kernel to add two matrices element-wise
__global__ void matrixAdd(int *a, int *b, int *c)
{
    int ix = blockIdx.x * blockDim.x + threadIdx.x;
    int iy = blockIdx.y * blockDim.y + threadIdx.y;
    // Linearize the 2D coordinate; a, b, and c share this layout
    int index = ix + iy * N;
    if (ix < N && iy < M)
    {
        c[index] = a[index] + b[index];
    }
}

int main()
{
    // Host matrices and device pointers
    int h_a[N][M], h_b[N][M], h_c[N][M];
    int *d_a, *d_b, *d_c;
    int size = N * M * sizeof(int);

    // Initialize host matrices h_a and h_b
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < M; j++)
        {
            h_a[i][j] = 1;
            h_b[i][j] = 2;
        }
    }

    // Allocate device memory for the matrices
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, size, cudaMemcpyHostToDevice);

    // Define the grid and block dimensions
    dim3 dimBlock(BLOCK_DIM, BLOCK_DIM);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (M + dimBlock.y - 1) / dimBlock.y);

    // Launch the kernel to add the matrices
    matrixAdd<<<dimGrid, dimBlock>>>(d_a, d_b, d_c);
    cudaDeviceSynchronize();

    // Copy the result from device to host
    cudaMemcpy(h_c, d_c, size, cudaMemcpyDeviceToHost);

    // Output the result
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < M; j++)
        {
            cout << h_c[i][j] << " ";
        }
        cout << endl;
    }

    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return 0;
}
This example demonstrates 2D indexing and how a 2D grid maps naturally onto a matrix.
The kernel computes (ix, iy) using blockIdx, blockDim, and threadIdx, then converts that 2D coordinate into a 1D linear index for memory access.
The if (ix < N && iy < M) check ensures that threads outside the matrix bounds do not read or write out-of-bounds memory; because the grid is rounded up, some threads in edge blocks have no element to process.
The host code allocates matrices, copies them to the GPU, launches the kernel, and copies the result back.
Example 3: Matrix Transpose Using CUDA
Matrix transpose is a useful example because it forces careful thinking about indexing: each thread reads one location in the input and writes to a different location in the output. This pattern shows up in many performance-sensitive kernels, especially when data layout matters.
// Transpose of a matrix
#include <iostream>

#define N 4
#define M 7
#define BLOCK_DIM 32

using namespace std;

// CUDA kernel to transpose a matrix
__global__ void matrixTranspose(int *a, int *b)
{
    int ix = threadIdx.x + blockIdx.x * blockDim.x;
    int iy = threadIdx.y + blockIdx.y * blockDim.y;
    int index = ix + iy * N;            // write position in b (M x N)
    int transposed_index = iy + ix * M; // read position in a (N x M)

    // Check boundaries to avoid out-of-bounds memory access
    if (ix < N && iy < M)
    {
        b[index] = a[transposed_index];
    }
}

int main()
{
    int h_a[N][M], h_b[M][N];
    int *d_a, *d_b;
    int size = N * M * sizeof(int);

    // Initialize the host matrix h_a
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < M; j++)
        {
            h_a[i][j] = i * j;
        }
    }

    // Allocate device memory for the matrices
    cudaMalloc((void**)&d_a, size);
    cudaMalloc((void**)&d_b, size);

    // Copy data from host to device
    cudaMemcpy(d_a, h_a, size, cudaMemcpyHostToDevice);

    // Define grid and block dimensions
    dim3 dimBlock(BLOCK_DIM, BLOCK_DIM);
    dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
                 (M + dimBlock.y - 1) / dimBlock.y);

    // Launch the kernel to transpose the matrix
    matrixTranspose<<<dimGrid, dimBlock>>>(d_a, d_b);
    cudaDeviceSynchronize();

    // Copy the result from device to host
    cudaMemcpy(h_b, d_b, size, cudaMemcpyDeviceToHost);

    cout << "Matrix=>" << endl;
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < M; j++)
        {
            cout << h_a[i][j] << " ";
        }
        cout << endl;
    }

    cout << "Transposed Matrix=>" << endl;
    for (int i = 0; i < M; i++)
    {
        for (int j = 0; j < N; j++)
        {
            cout << h_b[i][j] << " ";
        }
        cout << endl;
    }

    // Free device memory
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
In this CUDA C++ tutorial, we worked through three practical GPU programming examples and reinforced the core patterns that show up in most CUDA applications.
CUDA programs commonly follow a repeatable flow: allocate device memory, copy inputs to the GPU, launch a kernel, and copy results back.
For 1D data (arrays), a single global thread index is often enough to map threads to elements.
For 2D data (matrices), it is common to use blockIdx.{x,y} and threadIdx.{x,y} to compute (row, column) positions.
Indexing is not just a detail; it determines correctness, and it strongly influences performance.
What’s Next
The next step in the series focuses on how CUDA actually executes work on the GPU (warps, scheduling, and the execution model) and how that maps to GPU architecture. Those ideas make it easier to reason about performance and explain why certain kernel configurations are faster than others.