



GP-GPU

## **GPU** architecture

#### Stéphane Vialle







Stephane.Vialle@centralesupelec.fr http://www.metz.supelec.fr/~vialle




# **GPU** architecture

- 1 From vector to GPU based architectures
- 2 NVIDIA products
- 3 CUDA cores architecture
- 4 Recent architecture issues
















From vector to GPU based architectures

GPUs in the 2023 TOP500

Group of TOP500 machines with NVIDIA GPUs

[Figure: chart of TOP500 systems; legible labels include Supercomputer Fugaku, Sunway, and HPE Cray EX235a (AMD Optimized)]






NVIDIA products

## GeForce / Tesla families

#### 2 NVIDIA product families, 2 strategies:

| | GeForce | Tesla |
| --- | --- | --- |
| Available cores | Generic CUDA Cores, Tensor Cores, Ray Tracing Cores | Generic CUDA Cores, Tensor Cores, Ray Tracing Cores |
| Single-precision floating point | many units, high perf | many units, high perf |
| Double-precision floating point | very few units, low perf | many units, high perf |
| Computing capabilities | NOT certified | certified |
| Insertion into clusters | forbidden! | authorized/recommended |












CUDA cores architecture

## Stream Multiprocessors architecture

A GPU is a set of N Stream Multiprocessors (SM):

- N independent « SIMT » machines
- Sharing the GPU board memory

**Before Volta & Turing architectures**, one SM included:

- 1 instruction decoder/unit
- 32 strongly synchronized *hardware* threads, running warps of 32 threads
- 32K-128K registers distributed among all hardware threads (and not shared)
- A fast memory shared between all running threads (of the SM)
- A scheduler of warps of threads, among a larger block of threads
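The warp and register figures above can be made concrete with a little arithmetic. The following is a minimal sketch, not a model of any particular GPU: the helper names, the block sizes, the 64K register file and the 1024 resident threads are hypothetical values chosen for illustration (only the warp size of 32 and the 32K-128K register range come from the slide).

```python
import math

WARP_SIZE = 32  # threads per warp, as stated on the slide


def warps_per_block(threads_per_block: int) -> int:
    """Number of warps the SM scheduler must run for one block of threads."""
    return math.ceil(threads_per_block / WARP_SIZE)


def registers_per_thread(registers_per_sm: int, resident_threads: int) -> int:
    """Registers each thread owns if the SM register file is split evenly."""
    return registers_per_sm // resident_threads


# Hypothetical block of 256 threads: 8 warps take turns on the 32 hardware threads.
print(warps_per_block(256))  # 8

# A partially filled last warp still counts as a whole warp.
print(warps_per_block(100))  # 4

# Hypothetical SM: 64K registers (inside the 32K-128K range), 1024 resident threads.
print(registers_per_thread(64 * 1024, 1024))  # 64
```

Because registers are distributed and not shared, a kernel that needs more registers per thread leaves room for fewer resident threads on the SM.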


Source: NVIDIA document











# Turing memory hierarchy

- All data accesses go through the generic L2 cache
- Both the generic L1 cache and the specialized texture cache read from the generic L2 cache
- Since the Turing architecture, relying only on the L2-L1 cache mechanism is no longer a big disadvantage
- But using the *Shared Memory* remains efficient (though more complex to program)
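The lookup order described above can be illustrated with a toy model. This is only a sketch under strong assumptions: the `ToyGpuMemory` class is hypothetical, caches have unlimited capacity, there are no cache lines or eviction, and the texture cache and shared memory are left out. It only shows that an L1 miss is served through the single L2 cache shared by all SMs.

```python
class ToyGpuMemory:
    """Toy model: one L1 cache per SM, one L2 cache shared by all SMs."""

    def __init__(self, num_sms: int):
        self.l1 = [set() for _ in range(num_sms)]  # per-SM L1 contents
        self.l2 = set()                            # shared L2 contents

    def read(self, sm: int, address: int) -> str:
        """Return the level that served the read, filling caches on the way."""
        if address in self.l1[sm]:
            return "L1 hit"
        self.l1[sm].add(address)       # the data ends up in this SM's L1
        if address in self.l2:
            return "L2 hit"
        self.l2.add(address)           # the data passed through L2
        return "board memory"


mem = ToyGpuMemory(num_sms=2)
print(mem.read(0, 0x100))  # board memory: first access, nothing is cached
print(mem.read(0, 0x100))  # L1 hit: SM 0 already holds the data
print(mem.read(1, 0x100))  # L2 hit: SM 1 misses in its own L1 but finds it in L2
```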













Recent architecture issues

### New GeForce GPU

#### **Turing architecture:**

Ex: TU102 GPU of the RTX 2080 Ti board (full-chip figures; the RTX 2080 Ti itself ships with 68 of the 72 SMs enabled)

- 72 Stream Multiprocessors (SM)
- 64 CUDA cores / SM
  - → 4608 CUDA cores
- Double-precision computing is possible but slow
- 8 Tensor cores / SM
  - → 576 Tensor cores
- 1 Ray Tracing core / SM
  - → 72 RT cores
- More efficient cache memory
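The totals in this list are simply the per-SM counts multiplied by the number of SMs. A minimal sketch of that arithmetic, using the 72-SM figures quoted above (the variable names are illustrative only):

```python
NUM_SMS = 72  # number of Stream Multiprocessors quoted on the slide

# Units per SM, as listed above.
PER_SM = {"CUDA": 64, "Tensor": 8, "RT": 1}

# Total units on the chip = units per SM x number of SMs.
totals = {kind: NUM_SMS * count for kind, count in PER_SM.items()}

print(totals)  # {'CUDA': 4608, 'Tensor': 576, 'RT': 72}
```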



[Figure: 1 Stream Multiprocessor]





# Motivation to design RT cores

#### **Ray Tracing cores:**

- Final objective: « real time ray tracing for video »
- Currently: GPUs are not powerful enough
  - → Real Time RT on a subset of rays + interpolation with Tensor Cores



# Video games remain the main market for NVIDIA

→ GPU architecture evolutions must be useful for the video game market

[Figure: SOL MAN from the NVIDIA SOL ray tracing demo, running in real time on a Turing TU102 GPU with NVIDIA RTX technology]








# Cache memory improvement

#### New fast memory:

- 96 or 128 KBytes per SM
- Shared by the L1 cache, the shared memory and the texture cache

Note: the « shared memory » is an unmanaged L1 cache memory. The application developer has to design and implement a management strategy adapted to the application's computations!

- If the shared memory is unused, the 96 KBytes are automatically used for the L1 cache
- A new and more efficient cache management strategy has been implemented

**Objectives** of this new fast memory architecture and management:

- To decrease the performance loss when shared memory is not used...
- ... because many users have declined to design and implement their own cache management strategy (too difficult).
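The sharing of the fast memory between L1 cache and shared memory can be sketched as a simple subtraction. This is a simplified illustration, not the hardware's actual policy: the function `l1_cache_kb` is hypothetical, the texture cache is ignored, and arbitrary splits are assumed although real GPUs only offer a few fixed configurations.

```python
def l1_cache_kb(fast_memory_kb: int, shared_memory_kb: int) -> int:
    """L1 cache size left once the shared memory has been carved out.

    Simplification: any split is accepted and the texture cache is ignored.
    """
    if not 0 <= shared_memory_kb <= fast_memory_kb:
        raise ValueError("shared memory must fit inside the fast memory")
    return fast_memory_kb - shared_memory_kb


# Shared memory unused: the whole 96 KBytes serve as L1 cache.
print(l1_cache_kb(96, 0))   # 96

# Hypothetical kernel reserving 64 KBytes of shared memory.
print(l1_cache_kb(96, 64))  # 32
```

A kernel that does not use shared memory therefore benefits from the largest possible L1 cache without any extra programming effort.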






