This is the fourth post in the CUDA Refresher series, which has the goal of refreshing key concepts in CUDA, tools, and optimization for beginning or intermediate developers.

The CUDA programming model provides an abstraction of GPU architecture that acts as a bridge between an application and its possible implementation on GPU hardware. This post outlines the main concepts of the CUDA programming model by showing how they are exposed in general-purpose programming languages like C/C++.

Let me introduce two keywords widely used in the CUDA programming model: host and device. The host is the CPU available in the system. The system memory associated with the CPU is called host memory. The GPU is called a device, and GPU memory is likewise called device memory.

To execute any CUDA program, there are three main steps:

1. Copy the input data from host memory to device memory, also known as host-to-device transfer.
2. Load the GPU program and execute, caching data on-chip for performance.
3. Copy the results from device memory to host memory, also called device-to-host transfer.

Figure 1 shows that a CUDA kernel is a function that gets executed on the GPU. The parallel portion of your application is executed K times in parallel by K different CUDA threads, as opposed to only one time like regular C/C++ functions.

CUDA defines built-in 3D variables for threads and blocks. Threads are indexed using the built-in 3D variable threadIdx. Three-dimensional indexing provides a natural way to index elements in vectors, matrices, and volumes, and makes CUDA programming easier. Similarly, blocks are indexed using the built-in 3D variable called blockIdx. A few other points to note:

- The CUDA architecture limits the number of threads per block (1,024 threads per block limit).
- The dimension of the thread block is accessible within the kernel through the built-in blockDim variable.
- All threads within a block can be synchronized using the intrinsic function __syncthreads. With __syncthreads, all threads in the block must wait before any can proceed.
- The number of threads per block and the number of blocks per grid specified in the <<<...>>> syntax can be of type int or dim3. These triple angle brackets mark a call from host code to device code.

The CUDA program for adding two matrices below shows multi-dimensional blockIdx and threadIdx and other variables like blockDim. In the example below, a 2D block is chosen for ease of indexing, and each block has 256 threads: 16 each in the x- and y-direction. The total number of blocks is computed by dividing the data size by the size of each block.

Kernel - Adding two matrices MatA and MatB:

__global__ void MatAdd(float MatA[N][N], float MatB[N][N], float MatC[N][N])

Memory hierarchy

CUDA-capable GPUs have a memory hierarchy as depicted in Figure 4. The following memories are exposed by the GPU architecture:

- Registers - These are private to each thread, which means that registers assigned to a thread are not visible to other threads. The compiler makes decisions about register utilization.
- L1/Shared memory (SMEM) - Every SM has a fast, on-chip scratchpad memory that can be used as L1 cache and shared memory. All threads in a CUDA block can share shared memory, and all CUDA blocks running on a given SM can share the physical memory resource provided by the SM.
- Read-only memory - Each SM has an instruction cache, constant memory, texture memory, and RO cache, which is read-only to kernel code.
- L2 cache - The L2 cache is shared across all SMs, so every thread in every CUDA block can access this memory. The NVIDIA A100 GPU has increased the L2 cache size to 40 MB, compared to 6 MB in V100 GPUs.
- Global memory - This is the DRAM sitting in the GPU; its capacity is what is reported as the framebuffer size of the GPU.

The NVIDIA CUDA compiler does a good job of optimizing memory resources, but an expert CUDA developer can choose to use this memory hierarchy efficiently to optimize CUDA programs as needed.

The compute capability of a GPU determines its general specifications and the features supported by the GPU hardware.
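The MatAdd kernel discussed above can be fleshed out into a complete, compilable program that also walks through the three execution steps (host-to-device copy, kernel launch, device-to-host copy). The matrix dimension N, the input values, and the printf at the end are illustrative choices, not part of the original post:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define N 1024  // illustrative matrix dimension (an assumption)

// Kernel - adding two matrices MatA and MatB, one element per thread
__global__ void MatAdd(float MatA[N][N], float MatB[N][N], float MatC[N][N])
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)
        MatC[i][j] = MatA[i][j] + MatB[i][j];
}

int main()
{
    size_t bytes = (size_t)N * N * sizeof(float);
    float (*hA)[N] = (float (*)[N])malloc(bytes);
    float (*hB)[N] = (float (*)[N])malloc(bytes);
    float (*hC)[N] = (float (*)[N])malloc(bytes);
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i) { hA[j][i] = 1.0f; hB[j][i] = 2.0f; }

    float (*dA)[N], (*dB)[N], (*dC)[N];
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);

    // Step 1: copy the input from host memory to device memory (host-to-device)
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Step 2: launch the kernel - 2D blocks of 16 x 16 = 256 threads,
    // with enough blocks to cover the whole N x N matrix
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks((N + threadsPerBlock.x - 1) / threadsPerBlock.x,
                   (N + threadsPerBlock.y - 1) / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(dA, dB, dC);

    // Step 3: copy the result back to host memory (device-to-host)
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);

    printf("MatC[0][0] = %f\n", hC[0][0]);  // 1.0 + 2.0 = 3.0

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

The rounding-up division in numBlocks ensures the grid covers the matrix even when N is not a multiple of the block size, which is why the kernel guards its store with the bounds check.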
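As a small illustration of the shared memory and __syncthreads facilities mentioned above, the sketch below (a standard array-reversal example, not taken from this post) stages data in on-chip shared memory and synchronizes before any thread reads an element written by another thread:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE 64  // block size; 64 is an illustrative choice

// Reverse a TILE-element array in place using on-chip shared memory.
__global__ void staticReverse(int *d)
{
    __shared__ int s[TILE];   // visible to all threads in this block
    int t = threadIdx.x;
    s[t] = d[t];              // each thread stages one element
    __syncthreads();          // all writes must finish before any reads
    d[t] = s[TILE - 1 - t];   // read an element staged by another thread
}

int main()
{
    int h[TILE], *d;
    for (int i = 0; i < TILE; ++i) h[i] = i;

    cudaMalloc(&d, TILE * sizeof(int));
    cudaMemcpy(d, h, TILE * sizeof(int), cudaMemcpyHostToDevice);
    staticReverse<<<1, TILE>>>(d);   // one block of TILE threads
    cudaMemcpy(h, d, TILE * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    printf("h[0] = %d\n", h[0]);  // 63: the array has been reversed
    return 0;
}
```

Without the __syncthreads barrier, a thread could read s[TILE - 1 - t] before the thread responsible for that element had written it, which is exactly the hazard the intrinsic exists to prevent.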
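Several of the limits discussed above (the 1,024-thread block limit, register and shared memory budgets, L2 cache size, compute capability) can be queried at run time through the CUDA runtime API's cudaGetDeviceProperties; a minimal query for device 0 looks like this:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    printf("Compute capability: %d.%d\n", prop.major, prop.minor);
    printf("L2 cache size:      %d bytes\n", prop.l2CacheSize);
    printf("Shared mem/block:   %zu bytes\n", prop.sharedMemPerBlock);
    printf("Registers/block:    %d\n", prop.regsPerBlock);
    printf("Max threads/block:  %d\n", prop.maxThreadsPerBlock);
    return 0;
}
```

The printed values are hardware-dependent; on an A100, for example, the L2 cache size field reflects the 40 MB cache noted above.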