Tiling is a performance technique in which a computation is reorganized into fixed-size blocks ("tiles") so that data loaded into fast memory can be reused many times before it is evicted. It is fundamental to high-performance matrix multiplication on CPUs and GPUs; on NVIDIA GPUs, tiles map naturally onto thread blocks (indexed via `blockIdx` and `threadIdx` in CUDA).

Below is a clean, from-first-principles mathematical explanation of tiling for matrix multiplication, independent of CUDA or programming language. It is suitable for MD docs, design notes, or performance papers.
Let

$$A \in \mathbb{R}^{M \times K}, \qquad B \in \mathbb{R}^{K \times N}, \qquad C = AB \in \mathbb{R}^{M \times N}.$$

Matrix multiplication is defined element-wise as

$$C_{ij} = \sum_{k=0}^{K-1} A_{ik} \, B_{kj}.$$

This formulation is mathematically correct but does not exploit data reuse.
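The definition above can be written as a direct triple loop. This sketch (with arbitrary small sizes) is the naive formulation: every operand is fetched once per use, with no reuse.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, N = 8, 8, 8
A = rng.standard_normal((M, K))
B = rng.standard_normal((K, N))

# C_ij = sum_k A_ik * B_kj, computed one element at a time
C = np.zeros((M, N))
for i in range(M):
    for j in range(N):
        for k in range(K):
            C[i, j] += A[i, k] * B[k, j]

assert np.allclose(C, A @ B)
```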
Observe that each $A_{ik}$ appears in the computation of $C_{ij}$ for every column $j$ (all $N$ of them), and each $B_{kj}$ appears for every row $i$ (all $M$ of them). Therefore, many operands are reused across the summation.
Tiling reorganizes the summation to make this reuse explicit.
Choose a tile size $T$ (for simplicity, assume $T$ divides $K$). Rewrite the reduction index as

$$k = tT + r,$$

where $t = 0, \dots, K/T - 1$ indexes the tile and $r = 0, \dots, T - 1$ indexes the position within the tile. Substituting into the original definition yields

$$C_{ij} = \sum_{t=0}^{K/T - 1} \sum_{r=0}^{T-1} A_{i,\,tT+r} \, B_{tT+r,\,j}.$$
This expression is algebraically identical to the naive formulation of matrix
multiplication. The difference lies only in how the computation is grouped.
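The regrouping can be checked numerically. This sketch (with hypothetical sizes; any $K$ divisible by $T$ works) splits a single dot product by $k = tT + r$ and confirms the sum is unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)
K, T = 64, 16
a = rng.standard_normal(K)
b = rng.standard_normal(K)

naive = sum(a[k] * b[k] for k in range(K))

tiled = 0.0
for t in range(K // T):      # outer loop over tiles
    for r in range(T):       # inner loop within a tile
        k = t * T + r
        tiled += a[k] * b[k]

assert np.isclose(naive, tiled)
```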
Partition the output indices as

$$i = IT + i', \qquad j = JT + j',$$

where $I, J$ index the output tile and $i', j' \in \{0, \dots, T-1\}$ index positions within it. Each output tile satisfies

$$C^{(I,J)} \in \mathbb{R}^{T \times T}.$$
To express tiling mathematically, we define submatrices (tiles) of the original matrices $A$ and $B$. These submatrices correspond exactly to the blocks processed together.

The tile of matrix $A$ is defined as

$$A^{(I,t)}_{i'r} = A_{IT+i',\; tT+r},$$

and the tile of matrix $B$ is defined as

$$B^{(t,J)}_{rj'} = B_{tT+r,\; JT+j'}.$$

Each submatrix is a contiguous block of size $T \times T$. Using these definitions, each output tile is computed as

$$C^{(I,J)} = \sum_{t=0}^{K/T - 1} A^{(I,t)} \, B^{(t,J)}.$$

This equation is the fundamental mathematical statement of tiling.
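The tile equation can be realized directly with submatrix slices. This sketch (hypothetical sizes; $M$, $K$, $N$ divisible by $T$) accumulates each output tile as a sum of tile products and checks it against the full product:

```python
import numpy as np

rng = np.random.default_rng(1)
M = K = N = 64
T = 16
A = rng.standard_normal((M, K))
B = rng.standard_normal((K, N))

C = np.zeros((M, N))
for I in range(M // T):
    for J in range(N // T):
        acc = np.zeros((T, T))
        for t in range(K // T):
            A_tile = A[I*T:(I+1)*T, t*T:(t+1)*T]   # A^(I,t)
            B_tile = B[t*T:(t+1)*T, J*T:(J+1)*T]   # B^(t,J)
            acc += A_tile @ B_tile                  # sum over t
        C[I*T:(I+1)*T, J*T:(J+1)*T] = acc

assert np.allclose(C, A @ B)
```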
For an element within an output tile,

$$C^{(I,J)}_{i'j'} = \sum_{t=0}^{K/T-1} \sum_{r=0}^{T-1} A^{(I,t)}_{i'r} \, B^{(t,J)}_{rj'}.$$

This shows that once a tile of $A$ or $B$ is loaded, each of its elements participates in $T$ multiply-accumulate operations before the tile is discarded.

Without tiling, each input element is fetched from slow memory once per use. With tiling, each input element is fetched once per tile and then reused from fast memory. The reuse factor is therefore approximately

$$\text{reuse} \approx T.$$
Floating-point operations per tile product: $2T^3$ (a multiply and an add for each of the $T^3$ index triples). Global memory loads per tile product: $2T^2$ elements (one tile of $A$ and one tile of $B$). Arithmetic intensity becomes

$$\frac{2T^3}{2T^2} = T \ \text{FLOPs per element loaded}.$$
This explains why performance improves with larger tiles until resource limits
(such as shared memory or register capacity) are reached.
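The bookkeeping above can be sketched as a small calculation (the counts follow the model in the text: $2T^3$ operations against $2T^2$ elements loaded per tile product):

```python
def intensity(T: int) -> float:
    flops = 2 * T**3   # T^3 multiply-add pairs per tile product
    loads = 2 * T**2   # one tile of A plus one tile of B
    return flops / loads  # equals T FLOPs per element loaded

for T in (8, 16, 32):
    print(T, intensity(T))  # intensity grows linearly with T
```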
NVIDIA Tensor Cores evaluate fixed-size tile products:

$$C^{16 \times 16} = A^{16 \times 16} \cdot B^{16 \times 16}.$$

Element-wise, this corresponds to

$$C_{i'j'} = \sum_{r=0}^{15} A_{i'r} \, B_{rj'}.$$

This is exactly the tiled formulation with $T = 16$, implemented directly in hardware.
Tiling rewrites matrix multiplication by partitioning both the summation index and the output indices into fixed-size blocks. Each output tile is computed as a sum of products of submatrices $A^{(I,t)}$ and $B^{(t,J)}$. This transformation preserves mathematical correctness while increasing operand reuse, improving arithmetic intensity and enabling efficient execution on modern GPU architectures.
This document explains matrix multiplication tiling in a way that is precise but approachable: every equation is stated explicitly, and every step is motivated in plain English. You can read it top-to-bottom like an instruction manual.
Matrix multiplication combines two matrices to produce a third one. It is used everywhere: neural networks, computer graphics, scientific simulation, and signal processing.

The challenge is performance. Modern GPUs are extremely fast at computation but comparatively slow at accessing memory. Tiling exists to reduce this memory cost.
Let

$$A \in \mathbb{R}^{M \times K}, \qquad B \in \mathbb{R}^{K \times N}, \qquad C = AB.$$

Each element of $C$ is computed as

$$C_{ij} = \sum_{k=0}^{K-1} A_{ik} \, B_{kj}.$$

Plain-English meaning: to produce one output element, walk along row $i$ of $A$ and column $j$ of $B$, multiply the matching entries, and add everything up.

This formula is correct, but it is inefficient for large matrices on real hardware.
In the formula above, each $A_{ik}$ is needed for all $N$ output columns, and each $B_{kj}$ is needed for all $M$ output rows. However, the naive approach reloads these values from memory repeatedly.

Key insight: the math already has reuse built in; the computation order simply does not take advantage of it.
Instead of working on the entire matrix at once, we:

1. load a small block (tile) of each input into fast memory,
2. perform all the computation that uses those tiles,
3. move on to the next tiles.

Analogy: instead of walking to the fridge for every ingredient, bring several ingredients to the table and cook efficiently.
Choose a tile size $T$ (for example, $16$ or $32$). We will process matrices in chunks of size $T \times T$.
Rewrite the summation index as

$$k = tT + r,$$

where $t$ selects a tile and $r$ selects a position inside that tile. Substituting this into the original equation gives

$$C_{ij} = \sum_{t} \sum_{r} A_{i,\,tT+r} \, B_{tT+r,\,j}.$$
Nothing has changed mathematically — only the grouping.
We also split the output indices:

$$i = IT + i', \qquad j = JT + j'.$$

Here, $(I, J)$ names an output tile and $(i', j')$ names a position inside it. Each output tile satisfies

$$C^{(I,J)} \in \mathbb{R}^{T \times T}.$$
We now define submatrices of $A$ and $B$:

$$A^{(I,t)}_{i'r} = A_{IT+i',\; tT+r}, \qquad B^{(t,J)}_{rj'} = B_{tT+r,\; JT+j'}.$$
These are exactly the tiles we load into fast memory.
Using these definitions, each output tile is computed as

$$C^{(I,J)} = \sum_{t} A^{(I,t)} \, B^{(t,J)}.$$
Plain‑English meaning:
To compute one output tile, multiply pairs of input tiles and accumulate the results.
This is the central mathematical idea behind tiling.
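The same tile-by-tile accumulation can be written with explicit scalar loops, mirroring the structure of the GPU kernel later in this document (sizes here are hypothetical, with $N$ divisible by $T$):

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 32, 8
A = rng.standard_normal((N, N))
B = rng.standard_normal((N, N))
C = np.zeros((N, N))

for I in range(N // T):                 # output tile row
    for J in range(N // T):             # output tile column
        for t in range(N // T):         # pair of input tiles
            for i in range(T):
                for j in range(T):
                    for r in range(T):  # reduction within the tile
                        C[I*T + i, J*T + j] += (
                            A[I*T + i, t*T + r] * B[t*T + r, J*T + j]
                        )

assert np.allclose(C, A @ B)
```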
For an individual element inside a tile:

$$C^{(I,J)}_{i'j'} = \sum_{t} \sum_{r} A^{(I,t)}_{i'r} \, B^{(t,J)}_{rj'}.$$
This shows explicitly how values are reused many times after being loaded once.
Without tiling: every use of an input value triggers a load from slow global memory.

With tiling: each input tile is loaded into fast memory once, and each of its values is then reused $T$ times.

The reuse factor is approximately

$$\text{reuse} \approx T.$$
More reuse means fewer memory accesses and higher performance.
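Counting the loads for a single $T \times T$ output tile (with reduction length $K$, following the model above) makes the reuse factor concrete:

```python
K, T = 64, 16  # hypothetical sizes

# Naive: each of the T*T output elements reloads K values from A and K from B.
naive_loads = 2 * K * T * T

# Tiled: each of the K // T input-tile pairs is loaded exactly once.
tiled_loads = 2 * (K // T) * T * T

print(naive_loads // tiled_loads)  # reuse factor equals T
```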
For a single tile computation: about $2T^3$ floating-point operations against $2T^2$ loaded elements. Arithmetic intensity becomes

$$\frac{2T^3}{2T^2} = T \ \text{FLOPs per element loaded}.$$

As $T$ grows, computation increases faster than memory traffic — ideal for GPUs.
NVIDIA Tensor Cores operate on fixed tiles:

$$C^{16 \times 16} = A^{16 \times 16} \cdot B^{16 \times 16}.$$

Element-wise:

$$C_{i'j'} = \sum_{r=0}^{15} A_{i'r} \, B_{rj'}.$$

This is exactly the tiled formulation with $T = 16$, implemented directly in hardware.
Tiling reorganizes matrix multiplication to match how hardware really works. It keeps data close, maximizes reuse, and transforms a correct but inefficient formula into one that runs at peak performance on modern GPUs.
This document shows how CUDA tiling (shared‑memory block tiling) is expressed and used across different languages.
The key rule to remember is:
CUDA tiling always happens inside CUDA kernels.
Other languages only launch those kernels.
#define TILE 16

extern "C" __global__
void matmul_tiled(float* A, float* B, float* C, int N)
{
    // Shared-memory staging buffers: one tile of A and one tile of B.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;

    float sum = 0.0f;

    // Walk across the reduction dimension one tile at a time.
    // (This kernel assumes N is divisible by TILE.)
    for (int t = 0; t < N / TILE; t++) {
        // Each thread loads one element of each input tile into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();  // wait until both tiles are fully loaded

        // Multiply-accumulate within the tile, reading from shared memory.
        for (int k = 0; k < TILE; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // make sure the tile is done before overwriting it
    }

    C[row * N + col] = sum;
}
This is the reference implementation.
All other languages ultimately rely on kernels like this.
from numba import cuda, float32

TILE = 16

@cuda.jit
def matmul_tiled(A, B, C):
    # Shared-memory tiles, mirroring As/Bs in the CUDA C++ kernel
    As = cuda.shared.array((TILE, TILE), float32)
    Bs = cuda.shared.array((TILE, TILE), float32)

    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    row = cuda.blockIdx.y * TILE + ty
    col = cuda.blockIdx.x * TILE + tx

    tmp = 0.0
    for t in range(A.shape[0] // TILE):
        # Cooperative load of one tile of A and one tile of B
        As[ty, tx] = A[row, t * TILE + tx]
        Bs[ty, tx] = B[t * TILE + ty, col]
        cuda.syncthreads()

        for k in range(TILE):
            tmp += As[ty, k] * Bs[k, tx]
        cuda.syncthreads()

    C[row, col] = tmp
This corresponds line‑for‑line with the CUDA C++ kernel.
import cupy as cp
A = cp.random.rand(4096, 4096).astype(cp.float32)
B = cp.random.rand(4096, 4096).astype(cp.float32)
C = A @ B
Tiling is handled internally by cuBLAS and Tensor Core kernels.
Rust (via the `cust` crate) does not express tiling directly. The tiling lives in the CUDA kernel.
extern "C" __global__
void matmul_tiled(float* A, float* B, float* C, int N)
{
// Same implementation as the C++ kernel above
}
Compile with:
nvcc -ptx matmul.cu -o matmul.ptx
use cust::prelude::*;

fn launch_matmul(
    ptx: &str,
    d_a: DevicePointer<f32>,
    d_b: DevicePointer<f32>,
    d_c: DevicePointer<f32>,
    n: i32,
) -> CudaResult<()> {
    cust::init(CudaFlags::empty())?;
    let device = Device::get_device(0)?;
    let _ctx = Context::create_and_push(
        ContextFlags::MAP_HOST | ContextFlags::SCHED_AUTO,
        device,
    )?;

    let module = Module::from_ptx(ptx, &[])?;
    let func = module.get_function("matmul_tiled")?;

    // One 16x16 thread block per output tile.
    let block = Dim3::new(16, 16, 1);
    let grid = Dim3::new(
        (n as u32 + 15) / 16,
        (n as u32 + 15) / 16,
        1,
    );

    let stream = Stream::new(StreamFlags::DEFAULT, None)?;

    unsafe {
        launch!(
            func<<<grid, block, 0, stream>>>(
                d_a,
                d_b,
                d_c,
                n
            )
        )?;
    }

    stream.synchronize()?;
    Ok(())
}
VBA cannot run CUDA kernels directly.
The only workable approach is to call a CUDA DLL.
extern "C" __declspec(dllexport)
void matmul_cuda(float* A, float* B, float* C, int N)
{
// Internally launches CUDA kernel with tiling
}
' Module-level declaration (64-bit Office)
Declare PtrSafe Sub matmul_cuda _
    Lib "cuda_matmul.dll" _
    (ByRef A As Single, _
     ByRef B As Single, _
     ByRef C As Single, _
     ByVal N As Long)

Sub RunCudaMatmul()
    Dim N As Long
    N = 1024

    ' Flat Single arrays holding N*N elements each
    Dim A() As Single, B() As Single, C() As Single
    ReDim A(N * N - 1)
    ReDim B(N * N - 1)
    ReDim C(N * N - 1)

    ' Pass references to the first elements
    Call matmul_cuda(A(0), B(0), C(0), N)
End Sub
CUDA tiling always happens in CUDA kernels.
C++ expresses tiling directly.
Python and Rust launch tiled CUDA kernels.
VBA can only call into compiled CUDA code.