
GPU Architecture and CUDA Programming Model, Summaries of Architecture

An overview of the Graphics Processing Unit (GPU) architecture, its initial development as a graphics accelerator, and its evolution into a dense compute engine for non-graphics workloads. It also explains the CUDA programming model for scientific applications and the SAXPY computation example. The GPU architecture is based on the SIMT (single instruction, multiple threads) model, with many SIMT cores, thread blocks, warps, and in-order pipelines. The CUDA programming model includes steps for substituting library calls, managing data locality, transferring data between CPU and GPU, and allocating memory.

What you will learn

  • What is the GPU architecture and how does it work?
  • What is the CUDA programming model and how is it used for scientific applications?
  • How is data locality managed in CUDA programming model?
  • What are the steps for substituting library calls in CUDA programming model?
  • What is the SAXPY computation example in CUDA programming model?

Typology: Summaries

2021/2022

Uploaded on 09/27/2022


GRAPHICS PROCESSING UNIT
CS/ECE 6810: Computer Architecture
Mahdi Nazm Bojnordi
Assistant Professor
School of Computing
University of Utah


Overview

  • Announcement
      ◦ Homework 6 will be available tonight (due on 04/18)
  • This lecture
      ◦ Classification of parallel computers
      ◦ Graphics processing
      ◦ GPU architecture
      ◦ CUDA programming model

Graphics Processing Unit

  • Initially developed as a graphics accelerator
      ◦ It receives geometry information from the CPU as an input and provides a picture as an output.

[Figure: GPU graphics pipeline — host interface → vertex processing → triangle setup → pixel processing → memory interface]

Host Interface

  • The host interface is the communication bridge between the CPU and the GPU.
  • It receives commands from the CPU and also pulls geometry information from system memory.
  • It outputs a stream of vertices in object space with all their associated information.

Pixel Processing

  • Rasterizes triangles into pixels.
  • Each fragment produced by triangle setup is fed into fragment processing as a set of attributes (position, normal, texture coordinates, etc.), which are used to compute the final color for that pixel.
  • The computations taking place here include texture mapping and math operations.

Programming GPUs

  • The programmer can write programs that are executed for every vertex as well as for every fragment.
  • This allows fully customizable geometry and shading effects that go well beyond the generic look and feel of older 3D applications.

Z-Buffer

  • Example of three objects [figure omitted]

Graphics Processing Unit

  • Initially developed as graphics accelerators
      ◦ now one of the densest compute engines available
  • Many efforts to run non-graphics workloads on GPUs
      ◦ general-purpose GPUs (GPGPUs)
  • C/C++ based programming platforms
      ◦ CUDA from NVIDIA and OpenCL from an industry consortium
  • A heterogeneous system
      ◦ a regular host CPU
      ◦ a GPU that handles CUDA (may be on the same chip as the CPU)

The GPU Architecture

  • SIMT – single instruction, multiple threads
      ◦ a GPU has many SIMT cores
  • Application → many thread blocks (1 per SIMT core)
  • Thread block → many warps (1 warp per SIMT core)
  • Warp → many in-order pipelines (SIMD lanes)

Why GPU Computing?

[Figure omitted. Source: NVIDIA]

GPU Computing

¨ Low latency or high throughput?


Example: SAXPY Code

int N = 1 << 20;

// Perform SAXPY on 1M elements: y[] = a*x[] + y[]
saxpy(N, 2.0, x, 1, y, 1);

Example: CUDA Lib Calls

int N = 1 << 20;

// Perform SAXPY on 1M elements: d_y[] = a*d_x[] + d_y[]
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1);
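Here `d_x` and `d_y` are device (GPU) copies of `x` and `y`. The slide shows only the substituted library call; the surrounding steps the intro describes — allocating device memory, transferring data between CPU and GPU, and freeing it — typically look like the following pseudocode sketch (using the real CUDA runtime calls `cudaMalloc`, `cudaMemcpy`, and `cudaFree`, but with arguments abbreviated):

```
// 1. Allocate device memory for N floats each
cudaMalloc(d_x), cudaMalloc(d_y)

// 2. Transfer inputs host -> device
cudaMemcpy(d_x <- x), cudaMemcpy(d_y <- y)

// 3. Substitute the library call, operating on device data
cublasSaxpy(N, 2.0, d_x, 1, d_y, 1)

// 4. Transfer the result device -> host
cudaMemcpy(y <- d_y)

// 5. Free device memory
cudaFree(d_x), cudaFree(d_y)
```

Keeping `d_x` and `d_y` on the device across many library calls is what "managing data locality" means here: the host-device transfers in steps 2 and 4 are expensive, so they should bracket as much device-side work as possible.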