

# Software Optimization & Code Modernization: Unleash Si Performance

### Rama Malladi

Intel, Bangalore



### Legal Disclaimer & Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2017, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

#### **Optimization Notice**

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision #20110804

### Vectorize & Thread or Performance Dies

Threaded + Vectorized can be much faster than either one alone
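To illustrate the combination, here is a minimal sketch, assuming an OpenMP 4.0 (or later) compiler. The kernel and the name `scale_add` are hypothetical and not taken from the slides; the point is only that one directive can request both threading and vectorization for the same loop.

```c
/* Hypothetical kernel (not from the slides): y[i] = a * x[i] + y[i].
 * "parallel for" splits the iterations across threads; "simd" asks the
 * compiler to vectorize each thread's chunk. If OpenMP is not enabled,
 * the pragma is ignored and the loop runs serially with the same result. */
void scale_add(float a, const float *x, float *y, long n)
{
    #pragma omp parallel for simd
    for (long i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Whether the combined version actually beats threading or vectorization alone depends on the loop's trip count and memory behavior, so it should be measured rather than assumed.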




# Mandelbrot: Speedup on Xeon Phi™




# **Optimized & Modernized Implementation**

- ✓ Loop Unrolling (#pragma unroll)
  - Short loop hurts instruction scheduling.
- ✓ Threading (#pragma omp parallel)
  - Embarrassingly parallel.
  - No write conflicts and small working set.
- ✓ Vectorization (#pragma omp simd)
  - v0/v1 must be reduced.
  - max() call introduces control divergence.
  - m\_r[p] should be aligned.
- ✓ Arithmetic
  - Use native exp2() call on coprocessor.

```
#pragma omp target device(0)
#pragma omp parallel for
for (int o = 0; o < nopt; o++) {
  const REAL_T v_rt_tLN2 = sqrt(T[o]) * vol / M_LN2;
  const REAL_T mu_tLN2   = T[o] * mu / M_LN2;
  REAL_T v0 = 0, v1 = 0, res;
  /* v0 and v1 are reduced across SIMD lanes; m_r is 64-byte aligned */
  #pragma omp simd reduction(+:v0,v1) aligned(m_r:64)
  #pragma unroll(4)
  for (int p = 0; p < npath; ++p) {
    res = max(0, S[o] * exp2(v_rt_tLN2 * m_r[p] + mu_tLN2) - X[o]);
    v0 += res;
    v1 += res * res;
  }
  result[o]     += v0;   /* each option writes its own slot: no conflicts */
  confidence[o] += v1;
}
```

# Availability of Tools?


### Create Fast Code Faster with Intel® Parallel Studio XE

#### Build high performance, scalable applications for HPC, enterprise and cloud solutions running on Intel<sup>®</sup> platforms.

- Take full advantage of Intel hardware and performance capabilities.
- Deliver consistent programming using Intel<sup>®</sup> AVX-512 for Intel<sup>®</sup> Xeon<sup>®</sup> and Intel<sup>®</sup> Xeon Phi<sup>™</sup> processors.
- Simplify developing and modernizing code with the latest techniques in vectorization, multi-threading, multi-node, and memory optimization.
- Use industry-leading compilers, numerical libraries, performance profilers, and code analyzers to confidently optimize software for modern hardware.

Applicable for **C, C++, Fortran** and **Python\*** software developers. **Use standards-driven parallel models**: OpenMP\*, MPI, and Intel<sup>®</sup> Threading Building Blocks.






### What's Inside Intel® Parallel Studio XE

Accelerate HPC, Enterprise & Cloud Applications

| PROFESSIONAL EDITION: ANALYZE (Analysis Tools)                    | CLUSTER EDITION: SCALE (Cluster Tools)                     |
|-------------------------------------------------------------------|------------------------------------------------------------|
| Intel® VTune™ Amplifier: Performance Profiler                     | Intel® MPI Library: Message Passing Interface Library      |
| Intel® Inspector: Memory & Thread Debugger                        | Intel® Trace Analyzer & Collector: MPI Tuning & Analysis   |
| Intel® Advisor: Vectorization Optimization & Thread Prototyping   | Intel® Cluster Checker: Cluster Diagnostic Expert System   |

### Boost Application Performance on Linux\* Using Intel® Compiler (higher is better)



Configuration: Linux hardware: 2x Intel® Xeon® Gold 6148 CPU @ 2.40GHz, 192 GB RAM, HyperThreading is on. Software: Intel compilers 18.0, GCC 7.1.0, PGI 15.10, Clang/LLVM 4.0. Linux OS: Red Hat Enterprise Linux Server release 7.2 (Maipo), kernel 3.10.0-514.el7.x86\_64. SPEC\* Benchmark (www.spec.org). SmartHeap 10 was used for CXX tests when measuring SPECint® benchmarks. SPECfp® tests measure C/C++ code performance only.

SPECint®\_rate\_base\_2006 compiler switches: Intel C/C++ compiler 18.0: -m32 -xCORE-AVX512 -ipo -O3 -no-prec-div -qopt-prefetch -qopt-mem-layout-trans=3; C++ code adds option -static. GCC 7.1.0: -m32 -Ofast -flto -march=core-avx2 -mfpmath=sse -funroll-loops. Clang 4.0: -m32 -Ofast -march=core-avx2 -flto -mfpmath=sse -funroll-loops; C++ code adds option -fno-fast-math.

SPECfp®\_rate\_base\_2006 compiler switches: Intel C/C++ compiler 18.0: -m64 -xCORE-AVX512 -ipo -O3 -no-prec-div -qopt-mem-layout-trans=3 -auto-p32; C code adds option -static. Intel Fortran 18.0: -m64 -xCORE-AVX512 -ipo -O3 -no-prec-div -qopt-mem-layout-trans=3 -static. GCC 7.1.0: -m64 -Ofast -flto -march=core-avx2 -mfpmath=sse -funroll-loops. Clang 4.0: -m64 -Ofast -march=core-avx2 -flto -mfpmath=sse -funroll-loops.


### Faster Python\* with Intel® Distribution for Python\*

### Advance Performance Closer to Native Code

- Accelerated NumPy, SciPy, scikit-learn for scientific computing, machine learning & data analytics
- Drop-in replacement for existing Python; no code changes required
- Highly optimized for the latest Intel processors

#### What's New in the 2018 edition

- Updated to support Python 3.6
- Optimized scikit-learn for machine learning speedups
- Conda build recipes for custom infrastructure

#### Intel<sup>®</sup> Distribution for Python\* Performance Speedups for Select Math Functions on Intel<sup>®</sup> Xeon<sup>™</sup> Processors



Configuration: Hardware: Intel<sup>®</sup> Xeon<sup>®</sup> CPU E5-2699 v4 @ 2.20GHz (2 sockets, 22 cores per socket, 1 thread per core – HT is off), 256GB DDR4 @ 2400MHz. Software: Stock: CentOS Linux\* release 7.3.1611 (Core), python 3.6.2, pip 9.0.1, numpy 1.13.1, scipy 0.19.1, scikit-learn 0.19.0. Intel<sup>®</sup> Distribution for Python\* 2018 Gold: mkl 2018.0.0 intel\_4, daal 2018.0.0.20170814, numpy 1.13.1 py36\_intel\_15, openmp 2018.0.0 intel\_7, scipy 0.19.1 np113py36\_intel\_11, scikit-learn 0.18.2 np113py36\_intel\_3


Learn More: software.intel.com/distribution-for-python

Benchmark Source: Intel Corporation.

#### Intel<sup>®</sup> Data Analytics Acceleration Library (Intel<sup>®</sup> DAAL) 2018 vs. Apache Spark\* MLlib Performance



**Configuration**: 2x Intel® Xeon® E5-2660 CPU @ 2.60GHz, 128 GB, Intel® DAAL 2018.

- Alternating Least Squares: Users=1M, Products=1M, Ratings=10M, Factors=100, Iterations=1; MLlib time=165.9 sec, DAAL time=40.5 sec, Gain=4.1x
- Correlation: N=1M, P=2000, size=37 GB; MLlib time=169.2 sec, DAAL time=12.9 sec, Gain=13.1x
- PCA: n=10M, p=1000, Partitions=360, Size=75 GB; MLlib time=246.6 sec, DAAL (seq) time=17.4 sec, Gain=14.2x

For more complete information visit www.intel.com/benchmarks. Source: Intel Corporation, performance measured in Intel labs by Intel employees.



### Intel<sup>®</sup> Parallel Studio XE: High Performance, Scalable Software across Multiple Industries

| Industry           | Customers and reported gains (as legible on the slide)                                                                      |
|--------------------|-----------------------------------------------------------------------------------------------------------------------------|
| Energy             | Schlumberger: 10X                                                                                                           |
| EDA                | (logo illegible): 11X                                                                                                       |
| Science & Research | Walker Molecular Dynamics Lab, Novosibirsk State University, Kyoto University: 8X (attribution unclear); NERSC: 35%         |
| Manufacturing      | Altair: 1.4X; (logo illegible): 4X                                                                                          |
| Government         | AWE: 25X                                                                                                                    |
| Computer Software  | Fixstars: 2.5X; (logo illegible): 1.25X; (logo illegible): 1.3X                                                             |
| IT                 | NEC: 5X (partly legible); Cascade: 2X                                                                                       |
| Healthcare         | Massachusetts General Hospital: 20X                                                                                         |
| Digital Media      | (logos illegible)                                                                                                           |
| Telecommunications | (logo illegible): 2.5X                                                                                                      |




# Extracting Si Performance

### Roofline Analysis?

How to determine if we got the best / peak performance?

- Run GEMM?
- LINPACK?
- ...
- STREAM bandwidth tests?
- Latency benchmarks?
- Get theoretical peak possible for the code?

Roofline Analysis: a "paper-pen" exercise

```
float *A, *B, *C, d;
for (i = 0; i < n; i++)
{
    A[i] = B[i] + d * C[i];
}
```

The above code on Intel Xeon Phi is bound by:

- Compute?
- Bandwidth?

### What is specfem3D\_globe?



The software package SPECFEM3D\_GLOBE simulates three-dimensional global and regional seismic wave propagation based upon the spectral-element method (SEM).

> A time-stepping algorithm that simulates the propagation of seismic waves, given the initial conditions and the mesh coordinates/details of the Earth's crust.

Performance of the code is measured by the time taken to simulate "n" time-steps for a given mesh volume. The focus is typically on whole-Earth simulations rather than partial crust or regional simulations.

#### More details:

http://geodynamics.org/cig/software/specfem3d\_globe/

## **Basic Performance Analysis**

Understand application behavior out-of-the-box

#### **Thread Scaling**



Good thread scaling (performance relative to 1 thread)

#### Instruction Set



30% gain with KNL AVX-512

#### MCDRAM vs. DRAM - KNL



~1.3x gain using high bandwidth memory

### VTune: General Exploration

#### Memory Latency Issues

#### Unfilled Pipeline Slots (Stalls)

Identify slots where no uOps are delivered due to a lack of required resources for accepting more uOps in the back-end of the pipeline. Back-end metrics describe a portion of the pipeline where the out-of-order scheduler dispatches ready uOps into their respective execution units, and, once completed, these uOps get retired according to program order. Stalls due to data-cache misses or stalls due to the overloaded divider unit are examples of back-end bound issues.

- Memory Bound: 0.400. This metric shows how memory subsystem issues affect the performance. Memory Bound measures a fraction of cycles where the pipeline could be stalled due to demand load or store instructions. This accounts mainly for incomplete in-flight memory demand loads that coincide with execution starvation, in addition to less common cases where stores could imply back-pressure on the pipeline.
  - L1 Bound: 0.065
  - DRAM Bound: 0.204. This metric shows how often the CPU was stalled on the main memory (DRAM). Caching typically improves the latency and increases performance.
    - Memory Bandwidth: 0.047
    - Memory Latency: 0.336. This metric shows how often the CPU could be stalled due to the latency of the main memory (DRAM). Consider optimizing data layout or using software prefetches (through the compiler).
      - Local DRAM: 0.099
      - Remote DRAM: 0.000
      - Remote… (label truncated): 0.000
  - Store Bound: 0.109
- Core Bound: 0.188
  - Divider: 0.022
  - Port Utilization: 0.166


### Top hotspots are Memory bound

Unfilled Pipeline Slots (Stalls), Back-End Bound metrics per function. Columns from L1 Bound to Store Bound fall under Memory Bound; Divider and the port-utilization columns fall under Core Bound.

| Function / Call Stack           | Clockticks | CPI Rate | L1 Bound | L3 Bound | Mem. Bandwidth | Mem. Latency | Local DRAM | Remote DRAM | Remote… | Store Bound | Divider | Cycles of 0 Ports Utilized | 1 Port | 2 Ports | 3+ Ports |
|---------------------------------|------------|----------|----------|----------|----------------|--------------|------------|-------------|---------|-------------|---------|----------------------------|--------|---------|----------|
| compute_element_tiso            | 18.4%      | 0.975    | 0.132    | 0.027    | 0.051          | 0.321        | 0.189      | 0.0         | 0.0     | 0.124       | 0.062   | 0.300                      | 0.192  | 0.186   | 0.248    |
| svml_sincos4_e9                 | 14.2%      | 0.908    | 0.186    | 0.000    | 0.004          | 0.350        | 0.10       | 0.0         | 0.0     | 0.000       | 0.000   | 0.263                      | 0.256  | 0.214   | 0.210    |
| compute_element_iso             | 13.6%      | 0.862    | 0.000    | 0.046    | 0.033          | 0.530        | 0.073      | 0.0         | 0.0     | 0.000       | 0.076   | 0.254                      | 0.080  | 0.269   | 0.233    |
| compute_forces_crust_mantle_dev | 11.4%      | 0.705    | 0.091    | 0.072    | 0.039          | 0.351        | 0.130      | 0.0         | 0.0     | 0.182       | 0.000   | 0.221                      | 0.091  | 0.230   | 0.343    |
| svml_cosf8_e9                   | 5.5%       | 0.728    | 0.072    | 0.000    | 0.000          | 0.027        | 0.000      | 0.0         | 0.0     | 0.296       | 0.000   | 0.144                      | 0.350  | 0.359   | 0.251    |
| update_displ_elastic            | 5.0%       | 5.678    | 0.000    | 1.000    | 0.207          | 0.793        | 0.099      | 0.0         | 0.0     | 0.484       | 0.000   | 0.642                      | 0.079  | 0.020   | 0.059    |
| compute_forces_crust_mantle_de  | 4.1%       | 0.490    | 0.000    | 0.000    | 0.000          | 0.000        | 0.000      | 0.0         | 0.0     | 0.012       | 0.000   | 0.012                      | 0.000  | 0.120   | 0.840    |
| svml_sincosf8_e9                | 3.7%       | 0.639    | 0.108    | 0.000    | 0.000          | 0.135        | 0.13       | 0.0         | 0.0     | 0.081       | 0.000   | 0.202                      | 0.216  | 0.148   | 0.337    |
| update_veloc_elastic            | 3.4%       | 9.438    | 0.000    | 0.970    | 0.367          | 0.514        | 0.147      | 0.0         | 0.0     | 0.015       | 0.000   | 0.573                      | 0.000  | 0.029   | 0.015    |
| multiply_accel_elastic          | 3.2%       | 2.119    | 0.203    | 0.000    | 0.000          | 0.783        | 0.000      | 0.0         | 0.0     | 0.000       | 0.000   | 0.783                      | 0.329  | 0.031   | 0.078    |
| mxm5_3comp_singlea              | 2.0%       | 0.400    | 0.000    | 0.000    | 0.000          | 0.000        | 0.000      | 0.0         | 0.0     | 0.000       | 0.000   | 0.025                      | 0.099  | 0.124   | 0.642    |
| mxm5_3comp_singleb              | 1.7%       | 0.743    | 0.147    | 0.000    | 0.000          | 0.000        | 0.000      | 0.0         | 0.0     | 0.000       | 0.000   | 0.059                      | 0.029  | 0.412   | 0.382    |

| Line | Source                            | Clockticks  | Instructions Retired | CPI Rate |     |     |       |       |
|------|-----------------------------------|-------------|----------------------|----------|-----|-----|-------|-------|
| 143  | xiyl = xiy(INDEX_IJK,ispec)       | 94,000,141  | 188…                 | 0.500    | 0.6 | 0.0 | 1.000 | 0.809 |
| 144  | xizl = xiz(INDEX_IJK,ispec)       | 74,000,111  | 146…                 | 0.507    | 0.8 | 0.0 | 0.514 | 1.000 |
| 145  | etaxl = etax(INDEX_IJK,ispec)     | 1,200,001…  | 444…                 | 2.703    | 0.0 | 0.0 | 0.697 | 0.063 |
| 146  | etayl = etay(INDEX_IJK,ispec)     | 1,056,001…  | 368…                 | 2.870    | 0.1 | 0.0 | 0.648 | 0.000 |
| 147  | etazl = etaz(INDEX_IJK,ispec)     | 300,000,4…  | 132…                 | 2.273    | 0.0 | 0.0 | 1.000 | 0.000 |
| 148  | gammaxl = gammax(INDEX_IJK,ispec) | 70,000,105  | 92,…                 | 0.761    | 0.5 | 0.0 | 0.000 | 0.543 |

Indirect access to arrays with "ispec" as index.


### **Random Access Latency**




### Vector Advisor



#### Hotspot loops are Vectorized

| Function Call Sites and Loops              | Self Time | Total Time | Loop Type       | Why No Vectorization? | Vector ISA | Efficiency | Gain | Vector Length |
|--------------------------------------------|-----------|------------|-----------------|-----------------------|------------|------------|------|---------------|
| [loop at compute_element.F90:54…]          | 5.959s    | 11.929s    | Expand          | Exp.                  | AVX        |            | 7.66 | 4; 8          |
| [loop at compute_element.F90:142 in co…]   | 3.540s    | 5.160s     | Vectorized (B…) |                       | AVX        |            | 8.48 | 4; 8          |


### AVX-512 Designed for HPC

- Promotions of many AVX and AVX2 instructions to AVX-512
  - 32-bit and 64-bit floating-point instructions from AVX
    - Scalar and 512-bit
  - 32-bit and 64-bit integer instructions from AVX2
- Many new instructions to speed up HPC workloads



### AVX-512 features (I): More & Bigger Registers

#### AVX: VADDPS YMM0, YMM3, [mem]

- Up to 16 AVX registers
  - 8 in 32-bit mode
- 256-bit width
  - 8 x FP32
  - 4 x FP64

#### AVX-512: VADDPS ZMM0, ZMM24, [mem]

- Up to 32 AVX-512 registers
  - 8 in 32-bit mode
- 512-bit width
  - 16 x FP32
  - 8 x FP64

But you need many more features to use all that real estate effectively...

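| AVX (8 x FP32 per instruction)                                                          | AVX-512 (16 x FP32 per instruction)                                                      |
|-----------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------|
| <pre>float32 A[N], B[N];<br>for(i=0; i&lt;8; i++)<br>{ A[i] = A[i] + B[i]; }</pre>      | <pre>float32 A[N], B[N];<br>for(i=0; i&lt;16; i++)<br>{ A[i] = A[i] + B[i]; }</pre>      |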


### **AVX-512 Mask Registers**

#### 8 mask registers, each 64 bits wide

- k1-k7 can be used for predication
- k0 can be used as a destination or source for mask manipulation operations
- k0 cannot be used as input mask for vector operations
  - k0 encoding treated as "no mask"

#### 4 different mask granularities. For instance, at 512b:

- Packed Integer Byte use mask bits [63:0]
  - VPADDB zmm1 {k1}, zmm2, zmm3
- Packed Integer Word use mask bits [31:0]
  - VPADDW zmm1 {k1}, zmm2, zmm3
- Packed IEEE FP32 and Integer Dword use mask bits [15:0]
  - VADDPS zmm1 {k1}, zmm2, zmm3
- Packed IEEE FP64 and Integer Qword use mask bits [7:0]
  - VADDPD zmm1 {k1}, zmm2, zmm3



Mask bits used, by element size and vector length:

| Element size | 128-bit | 256-bit | 512-bit |
|--------------|---------|---------|---------|
| Byte         | 16      | 32      | 64      |
| Word         | 8       | 16      | 32      |
| Dword/SP     | 4       | 8       | 16      |
| Qword/DP     | 2       | 4       | 8       |
# **Gather & Scatter**

- D/Q/SP/DP element types
- D/Q indices
- Instruction can partially execute
- k-reg mask used as completion mask

Example sequence:

- VMOVDQU64 zmm1, Q[rsi]
- VMOVDQU64 zmm2, R[rsi]
- VGATHERQQ zmm0 {k2}, [rax+zmm1\*8]
- VSCATTERQQ [rax+zmm2\*8] {k3}, zmm0
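A scalar emulation can make the completion-mask behavior concrete. This is a sketch for 8 quadword lanes, not the hardware implementation: each lane whose mask bit is set is processed and its bit is cleared, so a returned mask of 0 means every requested lane completed. The names `gather_q` and `scatter_q` are hypothetical.

```c
#include <stdint.h>

/* Scalar emulation of VGATHERQQ zmm0 {k}, [base + idx*8]:
 * for each lane whose mask bit is set, load base[idx[lane]] into dst
 * and clear that bit. Returns the remaining (completion) mask. */
uint8_t gather_q(int64_t dst[8], uint8_t k,
                 const int64_t *base, const int64_t idx[8])
{
    for (int lane = 0; lane < 8; lane++) {
        if ((k >> lane) & 1u) {
            dst[lane] = base[idx[lane]];
            k &= (uint8_t)~(1u << lane);
        }
    }
    return k;
}

/* Scalar emulation of VSCATTERQQ [base + idx*8] {k}, zmm0:
 * for each lane whose mask bit is set, store src[lane] to base[idx[lane]]
 * and clear that bit. Returns the remaining (completion) mask. */
uint8_t scatter_q(int64_t *base, uint8_t k,
                  const int64_t idx[8], const int64_t src[8])
{
    for (int lane = 0; lane < 8; lane++) {
        if ((k >> lane) & 1u) {
            base[idx[lane]] = src[lane];
            k &= (uint8_t)~(1u << lane);
        }
    }
    return k;
}
```

On real hardware a gather interrupted by a fault leaves the mask partially cleared, so the instruction can be restarted and only the unfinished lanes are redone; the emulation above always runs to completion.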



### Bandwidth Analysis





MCDRAM latency is higher than DDR at low loads but much lower at high loads.

### **Optimization Next Steps**

- specfem3D\_globe performance is limited by memory latency issues: try s/w prefetch or change the data layout.
- Analyze usage of mixed data types in a loop (float & double), if any.
- Fat loops cause register pressure and the compiler can't vectorize them: explore loop splitting.
- Analyze compiler vectorization efficiency.
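The mixed data-type item can be illustrated with a small C sketch. The functions are hypothetical and not taken from the solver; they only show how a double constant in a float loop forces conversions, and how keeping one type avoids them.

```c
/* Mixed types: 0.5 is a double constant, so each x[i] is converted to
 * double, multiplied in double precision, and converted back to float. */
void halve_mixed(float *x, int n)
{
    for (int i = 0; i < n; i++) {
        x[i] = x[i] * 0.5;
    }
}

/* Single type: 0.5f keeps the whole expression in float, so a vector
 * register holds twice as many elements and no conversions are needed. */
void halve_float(float *x, int n)
{
    for (int i = 0; i < n; i++) {
        x[i] = x[i] * 0.5f;
    }
}
```

A compiler may be able to simplify a trivial case like this one on its own, so the optimization report is the place to check for conversions in a real loop.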

**Mitigate memory access latency issues:** An indirect (random) access was transformed into a unit-stride access. Mesh data in the SPECFEM3D\_GLOBE solver is invariant over time/solver steps. Hence, it is a valid transformation to copy data and make it a linear access.

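```
! Baseline: indirect access through ispec
xixl = xix(ijk,1,1,ispec)

! Changed: copy once into an array indexed by element number,
! then read it with unit stride
ia_xix(ijk,1,1,ele_num) = xix(ijk,1,1,ispec)
xixl = ia_xix(ijk,1,1,ele_num)
```

#### Gain: ~1.40x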
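The same transformation can be sketched in C, assuming the source data is invariant across time steps as stated above. The names `pack_by_index`, `sum_indirect`, and `sum_packed` are hypothetical, and the summation stands in for whatever the real loop body computes.

```c
#include <stddef.h>

/* One-time setup: copy the elements selected by the index list into a
 * packed array so that later passes read memory with unit stride.
 * Valid only because the mesh data does not change between time steps. */
void pack_by_index(float *packed, const float *mesh,
                   const int *ispec, size_t n)
{
    for (size_t e = 0; e < n; e++) {
        packed[e] = mesh[ispec[e]];
    }
}

/* Baseline: every time step pays for the indirect (random) access. */
float sum_indirect(const float *mesh, const int *ispec, size_t n)
{
    float s = 0.0f;
    for (size_t e = 0; e < n; e++) {
        s += mesh[ispec[e]];
    }
    return s;
}

/* Changed: unit-stride read of the packed copy. */
float sum_packed(const float *packed, size_t n)
{
    float s = 0.0f;
    for (size_t e = 0; e < n; e++) {
        s += packed[e];
    }
    return s;
}
```

The copy costs extra memory and one pass over the data, which pays off only when the packed array is reused over many time steps.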

**Compiler Vectorization – Loop Fission:** The compute loops 'iso' and 'tiso' are huge. The compiler is unable to vectorize these loops. So, a manual loop fission was done. A similar effect can be realized by using '!DIR\$ DISTRIBUTE POINT' syntax supported by Intel compilers for loop distribution/ fission.

| Baseline:       | Changed:     |
|-----------------|--------------|
| do k=1,NGLLZ    | do k=1,NGLLZ |
| do j=1,NGLLY    | do j=1,NGLLY |
| do i=1,NGLLX    | do i=1,NGLLX |
| Loop Body 1     | Loop Body 1  |
| Loop Body 2     | enddo        |
| enddo           | ...          |
| ...             | do k=1,NGLLZ |
|                 | do j=1,NGLLY |
|                 | do i=1,NGLLX |
|                 | Loop Body 2  |
|                 | enddo        |
|                 | ...          |

#### Gain: ~1.69x
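Loop fission can be sketched in C as follows. The two loop bodies here are hypothetical stand-ins for the solver's 'iso' and 'tiso' bodies; the split is legal in this sketch because the second body does not read anything the first one writes, and the output arrays are assumed not to overlap the inputs.

```c
/* Baseline: one "fat" loop with two independent bodies. */
void fused(const float *a, const float *b, float *x, float *y, int n)
{
    for (int i = 0; i < n; i++) {
        x[i] = a[i] * a[i] + 1.0f;   /* loop body 1 */
        y[i] = b[i] * 0.5f - a[i];   /* loop body 2 */
    }
}

/* Changed: the same work split into two smaller loops, each with fewer
 * live values, which can make vectorization easier for the compiler. */
void split(const float *a, const float *b, float *x, float *y, int n)
{
    for (int i = 0; i < n; i++) {
        x[i] = a[i] * a[i] + 1.0f;   /* loop body 1 */
    }
    for (int i = 0; i < n; i++) {
        y[i] = b[i] * 0.5f - a[i];   /* loop body 2 */
    }
}
```

Fission trades one pass over `a` for two, so it helps mainly when the fused loop was not vectorizing or was spilling registers.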

**IVDEP, SIMD directives:** Some hotspots in the solver are nested loops with trip counts  $5 \times 5$  and  $5 \times 25$ . These are 'm x m' loops, matrix-matrix multiplication. The compiler optimization reports (use -qopt-report flag) indicated that not all these loops were vectorized. Using IVDEP or SIMD directives helped the compiler to generate vector code for these loops.

#### **Baseline:**

```
do k=1,NGLLZ
do j=1,NGLLY
do i=1,NGLLX
do l=1,5
Loop Body
enddo
```

#### **Changed:**

```
do k=1,NGLLZ
!$OMP SIMD PRIVATE(j,l)
  do i=1,NGLLX
    do j=1,NGLLY
      do l=1,5
        Loop Body
      enddo
```

#### Gain: ~2.09x
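A C counterpart of the directive usage might look like the sketch below, assuming an OpenMP 4.0 (or later) compiler. The function `mxm5` is a hypothetical 5 x 5 product and is not the solver's routine of a similar name.

```c
#define NGLL 5

/* Hypothetical 5 x 5 product c = a * b, stored row-major.
 * The simd directive asserts that iterations of the j loop are
 * independent, so the compiler may vectorize it without proving that.
 * The caller must ensure c does not overlap a or b. */
void mxm5(const float *a, const float *b, float *c)
{
    for (int i = 0; i < NGLL; i++) {
        #pragma omp simd
        for (int j = 0; j < NGLL; j++) {
            float sum = 0.0f;   /* declared inside the loop, so private per lane */
            for (int l = 0; l < NGLL; l++) {
                sum += a[i * NGLL + l] * b[l * NGLL + j];
            }
            c[i * NGLL + j] = sum;
        }
    }
}
```

Because the directive overrides the compiler's own dependence analysis, it should be used only on loops whose iterations are known to be independent.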

### Writing "low-level" Intrinsic functions

A simple example of intrinsic usage is shown below:

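```
#include <immintrin.h>

/* a, b, c must be 64-byte aligned and N a multiple of 8.
   _mm512_pow_pd is an SVML intrinsic provided by the Intel compiler. */
for(j=0; j<N; j+=8){
    __m512d vecA = _mm512_load_pd(&a[j]);
    __m512d vecB = _mm512_load_pd(&b[j]);
    __m512d vecC = _mm512_pow_pd(vecA,vecB);
    _mm512_store_pd(&c[j],vecC);
}
```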

The performance can be quite good compared to the serial code below.

```
for(j=0; j<N; j++) c[j]=pow(a[j],b[j]);
```

http://www.prace-ri.eu/best-practice-guide-knights-landing-january-2017

# **Optimization Summary**

- Use compiler optimization options...
- Do analysis of your code to find hotspots.
- Optimize by code changes which include:
  - ✓ Data transformation for mitigating access latencies
  - ✓ Compiler vectorization and loop optimizations
  - ✓ Data alignment and padding for arrays
  - ✓ Redundant compute elimination
  - ✓ IVDEP, SIMD directives usage

and more...
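As one illustration of the alignment and padding item, here is a minimal sketch using C11 `aligned_alloc`. The helper names `round_up` and `alloc_floats_64` are hypothetical; the 64-byte boundary is chosen to match the 512-bit vector width discussed earlier.

```c
#include <stdlib.h>

/* Round n up to the next multiple of m. */
size_t round_up(size_t n, size_t m)
{
    return (n + m - 1) / m * m;
}

/* Allocate room for n floats on a 64-byte boundary. The size is padded
 * to a multiple of 64 bytes because C11 aligned_alloc expects the size
 * to be a multiple of the alignment. The caller releases it with free(). */
float *alloc_floats_64(size_t n)
{
    size_t bytes = round_up(n * sizeof(float), 64);
    return aligned_alloc(64, bytes);
}
```

The padding also means a vectorized loop can safely process the final partial vector without reading past the allocation.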



Intel Confidential




### Thank you!
