






















































## CUDA cores architecture: Streaming Multiprocessors architecture

#### A GPU is a set of N Streaming Multiprocessors (SMs):

- N independent « SIMT » machines
- Sharing the GPU board memory

#### Before the Volta and Turing architectures, one SM included:

- 1 instruction decoder/unit
- 32 strongly synchronized hardware *threads*, running *warps* of 32 threads
- 32K-128K registers distributed among all *hardware threads* (and not shared)
- A fast memory shared between all running threads (of the SM)
- A scheduler of warps of threads, picked among a larger block of threads
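
To make the relation between a block of threads and its warps concrete, here is a minimal Python sketch (a host-side illustration, not GPU code). It assumes that the threads of a block are grouped into warps by consecutive linear thread index; the helper names `warps_in_block` and `warp_and_lane` are hypothetical and chosen for this example only.

```python
WARP_SIZE = 32  # threads per warp, as stated in the list above


def warps_in_block(block_size):
    """Number of warps the scheduler has to handle for one block of threads.

    A partially filled last warp still counts as a full warp.
    """
    return (block_size + WARP_SIZE - 1) // WARP_SIZE


def warp_and_lane(thread_index):
    """Map a linear thread index inside a block to (warp id, lane id).

    Assumption: warps are built from consecutive thread indices.
    """
    return thread_index // WARP_SIZE, thread_index % WARP_SIZE


if __name__ == "__main__":
    print(warps_in_block(256))   # a block of 256 threads
    print(warp_and_lane(70))     # thread 70 of the block
```

With these assumptions, a block of 100 threads occupies 4 warps, the last one being only partially filled.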



































# Motivation to design RT cores

#### **Ray Tracing cores:**

- Final objective: « real time ray tracing for video »
- Currently: GPUs are not powerful enough
  - Real-time ray tracing on a subset of rays
  - + interpolation with Tensor Cores



## Video games remain the main market for NVIDIA

→ GPU architecture evolutions must be useful for the video game market

Figure: SOL MAN, from the NVIDIA SOL ray tracing demo, running in real time on a Turing TU102 GPU with NVIDIA RTX technology.






## Recent architecture issues: Tensor Core features

#### **Tensor cores:**

- A Tensor Core performs a stream of multiply-add operations on a stream of 4x4 matrices:
  - $D = A \cdot B$: produces a stream of output matrices $D$
  - $D = A \cdot B + C$: accumulates the stream of $A \cdot B$ products into the matrix $C$
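
The operation described above can be sketched in plain Python. This is only a software model of the arithmetic, assuming 4x4 matrices stored as lists of rows and ordinary Python floats in place of the hardware number formats; the function names `tensor_core_mma` and `accumulate_stream` are hypothetical.

```python
N = 4  # a Tensor Core operates on 4x4 matrices


def tensor_core_mma(a, b, c):
    """Return D = A.B + C for 4x4 matrices given as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(N)) + c[i][j] for j in range(N)]
        for i in range(N)
    ]


def accumulate_stream(pairs):
    """Feed a stream of (A, B) pairs and accumulate every product A.B into C."""
    c = [[0.0] * N for _ in range(N)]
    for a, b in pairs:
        c = tensor_core_mma(a, b, c)
    return c


if __name__ == "__main__":
    identity = [[1.0 if i == j else 0.0 for j in range(N)] for i in range(N)]
    # Accumulating I.I three times yields 3 on the diagonal.
    for row in accumulate_stream([(identity, identity)] * 3):
        print(row)
```

Each call performs 4 x 4 x 4 = 64 multiply-adds, which is the unit of work one Tensor Core operation represents in this model.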

#### A Tensor core:

- is a hardware implementation of a matrix operator,
- is a very useful operator for modern applications, including graphics applications (main GPU market).

$\rightarrow$ A math operator that deserves to occupy part of the chip!




## Recent architecture issues: Cache memory improvement

#### New fast memory:


- 96 KBytes per SM
- Used for all of: L1 cache, shared memory, texture cache
- Rmk: The « shared memory » is an unmanaged L1 cache memory. The application developer has to design and implement a strategy adapted to their computations!
- If the shared memory is unused, the 96 KBytes will be automatically used for L1 cache
- A new and more efficient cache management strategy has been implemented

### Objectives of this new fast memory architecture and management:

- To decrease the performance loss when shared memory is not used: many users have declined to design and implement their own cache management strategy (too difficult).

## Recent architecture issues: Evolution of the GPU features

Feature support by compute capability (unlisted features are supported for all compute capabilities). Source: https://en.wikipedia.org/wiki/CUDA

| Feature | Supported from compute capability |
|---|---|
| Integer atomic functions operating on 32-bit words in global memory | 1.1 |
| `atomicExch()` operating on 32-bit floating point values in global memory | 1.1 |
| Integer atomic functions operating on 32-bit words in shared memory | 1.2 |
| `atomicExch()` operating on 32-bit floating point values in shared memory | 1.2 |
| Integer atomic functions operating on 64-bit words in global memory | 1.2 |
| Warp vote functions | 1.2 |
| Double-precision floating-point operations | 1.3 |
| Atomic functions operating on 64-bit integer values in shared memory | 2.x |
| Floating-point atomic addition operating on 32-bit words in global and shared memory | 2.x |
| `__ballot()` | 2.x |
| `__threadfence_system()` | 2.x |
| `__syncthreads_count()`, `__syncthreads_and()`, `__syncthreads_or()` | 2.x |
| Surface functions | 2.x |
| 3D grid of thread block | 2.x |
| Warp shuffle functions | 3.0 |
| Funnel shift | 3.2 |
| Dynamic parallelism | 3.5 |
| Half-precision floating-point operations: addition, subtraction, multiplication, comparison, warp shuffle functions, conversion | 5.3 |
| Atomic addition operating on 64-bit floating point values in global memory and shared memory | 6.x |
| Tensor core | 7.x |
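
A table of this kind is typically used to decide whether a given GPU can run a given code path. The following Python sketch illustrates that lookup for a handful of features; the dictionary `MIN_COMPUTE_CAPABILITY`, its keys, and the function `supports` are hypothetical names, and the version numbers are assumptions based on the public CUDA feature table rather than values queried from a device.

```python
# Minimum compute capability (major, minor) assumed for a few features.
MIN_COMPUTE_CAPABILITY = {
    "double precision": (1, 3),
    "warp shuffle": (3, 0),
    "dynamic parallelism": (3, 5),
    "half precision": (5, 3),
    "fp64 atomic add": (6, 0),
    "tensor core": (7, 0),
}


def supports(feature, major, minor):
    """True if a GPU of compute capability major.minor offers the feature."""
    # Tuples compare lexicographically: major first, then minor.
    return (major, minor) >= MIN_COMPUTE_CAPABILITY[feature]


if __name__ == "__main__":
    # Compare a compute capability 6.0 device with a 7.0 device.
    for cc in [(6, 0), (7, 0)]:
        print(cc, supports("tensor core", *cc))
```

Comparing `(major, minor)` tuples avoids the pitfalls of comparing version numbers as floats.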



## Evolution of the Tesla GPUs

| Tesla Product                   | Tesla K40      | Tesla M40       | Tesla P100     | Tesla V100    |
|---------------------------------|----------------|-----------------|----------------|---------------|
| GPU                             | GK180 (Kepler) | GM200 (Maxwell) | GP100 (Pascal) | GV100 (Volta) |
| SMs                             | 15             | 24              | 56             | 80            |
| TPCs                            | 15             | 24              | 28             | 40            |
| FP32 Cores / SM                 | 192            | 128             | 64             | 64            |
| FP32 Cores / GPU                | 2880           | 3072            | 3584           | 5120          |
| FP64 Cores / SM                 | 64             | 4               | 32             | 32            |
| FP64 Cores / GPU                | 960            | 96              | 1792           | 2560          |
| Tensor Cores / SM               | NA             | NA              | NA             | 8             |
| Tensor Cores / GPU              | NA             | NA              | NA             | 640           |
| GPU Boost Clock                 | 810/875 MHz    | 1114 MHz        | 1480 MHz       | 1530 MHz      |
| Peak FP32 TFLOPS <sup>1</sup>   | 5              | 6.8             | 10.6           | 15.7          |
| Peak FP64 TFLOPS <sup>1</sup>   | 1.7            | 0.21            | 5.3            | 7.8           |
| Peak Tensor TFLOPS <sup>1</sup> | NA             | NA              | NA             | 125           |
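
The peak TFLOPS rows can be cross-checked from the core counts and boost clocks of the same table. The sketch below assumes that each FP32 or FP64 core completes one fused multiply-add per clock (counted as 2 floating-point operations) and that each Tensor Core completes one 4x4x4 matrix multiply-add per clock (64 multiply-adds, so 128 operations); the function name `peak_tflops` is hypothetical.

```python
def peak_tflops(units, clock_mhz, flops_per_unit_per_cycle=2):
    """Peak rate in TFLOPS = units x operations per cycle x clock.

    clock_mhz is in MHz, so the product is in MFLOPS; dividing by 1e6
    converts MFLOPS to TFLOPS.
    """
    return units * flops_per_unit_per_cycle * clock_mhz / 1e6


if __name__ == "__main__":
    # Tesla V100 figures taken from the table above.
    print(round(peak_tflops(5120, 1530), 1))       # FP32
    print(round(peak_tflops(2560, 1530), 1))       # FP64
    print(round(peak_tflops(640, 1530, 128)))      # Tensor Cores
```

Under these assumptions the V100 values come out at about 15.7, 7.8 and 125 TFLOPS, which matches the last column of the table.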



