A graphics processing unit (GPU) is similar to a set of vector processors sharing hardware: the multiple SIMD processors in a GPU act as independent MIMD cores, just as a vector computer has multiple vector processors. The main difference is multithreading, which is fundamental to GPUs and missing from most vector processors.
Set of vector processors
Multiple SIMD processors
Act like independent MIMD cores
Multithreading
Programming for the GPU
Compute Unified Device Architecture (CUDA)
It is a C-like programming language developed by NVIDIA to program its GPUs. CUDA produces C/C++ code for the system processor and a C/C++ dialect for the GPU; in this setup, the system processor is known as the “host” and the GPU as the “device”.
Characteristics
Developed by NVIDIA
C-like programming language
Setup
Host
System processor
C/C++ code
Device
GPU
C/C++ dialect
CUDA thread
Lowest level of parallelism
Single Instruction, Multiple Threads (SIMT)
Thread block
Threads are executed
together in blocks
Multithreaded SIMD processor
It is the hardware that executes a whole block of threads.
Modifiers
Function modifiers
CUDA functions can carry the modifiers __device__, __global__, or __host__ (see the sketch after this list).
__device__
Executed on the device, launched from the device.
__global__
Executed on the device, launched from the host.
__host__
Executed on the host, launched from the host.
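A minimal sketch of how the three modifiers combine in one source file (function names such as scale, scale_kernel, and run_scale are illustrative, not from the original notes):

    // __device__: executed on the device, callable only from device code.
    __device__ float scale(float a, float x) {
        return a * x;
    }

    // __global__: executed on the device, launched from the host (a kernel).
    __global__ void scale_kernel(float a, float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = scale(a, x[i]);
    }

    // __host__: executed on the host, called from the host (the default).
    __host__ void run_scale(float a, float *d_x, int n) {
        int threadsPerBlock = 256;
        int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
        scale_kernel<<<blocks, threadsPerBlock>>>(a, d_x, n);
    }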
Variable modifiers
CUDA variables can also carry modifiers, such as __device__ (see the sketch below).
__device__
A variable declared with this modifier is allocated in GPU memory and is accessible by all multithreaded SIMD processors.
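A short sketch of a __device__ variable, assuming the host sets it through cudaMemcpyToSymbol (the names d_alpha and scale_by_alpha are illustrative):

    #include <cuda_runtime.h>

    // __device__ variable: allocated in GPU memory, visible to every thread
    // on every multithreaded SIMD processor.
    __device__ float d_alpha;

    __global__ void scale_by_alpha(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= d_alpha;   // all threads read the same device variable
    }

    // Host-side setup: copy a value from host memory into the device symbol.
    void set_alpha(float alpha) {
        cudaMemcpyToSymbol(d_alpha, &alpha, sizeof(float));
    }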
CUDA-specific terms
Code examples
Ex. Y = a*X + Y
Conventional C code
CUDA corresponding version
This code (sketched below) launches n threads, one per vector element, with 256 threads per thread block on a multithreaded SIMD processor. The GPU function begins by computing the element index i from the block ID, the number of threads per block, and the thread ID. The multiply-and-add is performed only if the index i falls within the array.
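The code referred to above is not reproduced in these notes; the sketch below shows what the two versions typically look like for Y = a*X + Y, assuming a function named daxpy operating on double-precision vectors:

    // Conventional C code: a sequential loop over all n elements.
    void daxpy(int n, double a, double *x, double *y) {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }
    // Invoke: daxpy(n, 2.0, x, y);

    // CUDA corresponding version: one thread per element.
    __global__ void daxpy_gpu(int n, double a, double *x, double *y) {
        // Element index from the block ID, threads per block, and thread ID.
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];   // only within the array bounds
    }
    // Invoke with 256 threads per thread block:
    //   int nblocks = (n + 255) / 256;
    //   daxpy_gpu<<<nblocks, 256>>>(n, 2.0, d_x, d_y);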
Ex. A = B * C
Multiply 2 vectors with
8192 elements each
Grid (Vectorized loop)
The GPU code that performs the whole 8192-element multiply is called a grid.
A grid is composed of thread blocks (the bodies of the vectorized loop);
in this case each thread block handles up to 512 elements (16 SIMD threads/block × 32 elements/SIMD thread).
Each SIMD instruction executes 32 elements at a time.
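A sketch of how the 8192-element multiply maps onto a grid, assuming an element-wise kernel named vec_mul and device pointers d_a, d_b, d_c:

    // Element-wise vector multiply A = B * C: one thread per element.
    __global__ void vec_mul(double *a, double *b, double *c) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        a[i] = b[i] * c[i];
    }

    void launch_vec_mul(double *d_a, double *d_b, double *d_c) {
        // 8192 elements / 512 elements per thread block = 16 thread blocks (the grid).
        // Within each block, the 512 threads run as 16 SIMD threads (warps)
        // of 32 elements, so each SIMD instruction covers 32 elements at a time.
        vec_mul<<<16, 512>>>(d_a, d_b, d_c);
    }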
Open Computing Language (OpenCL)
The Open Computing Language (OpenCL) is a programming language roughly similar to CUDA. Several companies are developing OpenCL to offer a vendor-independent language for multiple platforms, in contrast to CUDA.
Vendor independent
Multiple Platforms
Extended function call
Components
dimGrid
Specifies the dimensions of the grid, in terms of thread blocks.
dimBlock
Specifies the dimensions of a block, in terms of threads.
Parameter list
Built-in identifiers
blockIdx
It is the identifier/index of the block within the grid.
threadIdx
It is the identifier/index of the current thread within its block.
blockDim
It is the number of threads per block, which comes from the dimBlock parameter.
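Putting the components together, a sketch of the extended call syntax name<<<dimGrid, dimBlock>>>(parameter list) and the built-in identifiers (the kernel fill and its arguments are illustrative):

    __global__ void fill(int *out, int value) {
        // blockIdx  : index of this thread's block within the grid
        // blockDim  : number of threads per block (comes from dimBlock)
        // threadIdx : index of this thread within its block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        out[i] = value;
    }

    void launch_fill(int *d_out) {
        dim3 dimGrid(32);     // grid dimensions: 32 thread blocks
        dim3 dimBlock(256);   // block dimensions: 256 threads per block
        fill<<<dimGrid, dimBlock>>>(d_out, 7);   // extended function call
    }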