Introduction to HPC Applications
The Need of Scalable Architectures and more

Carlos Jaime Barrios Hernández, PhD
@carlosjaimebh
#SCCAMP2021
The (Big) Questions: What and How?
Why?

- Large Data Sets
- Complex Mathematics
- Complex Models
- Real Time
- Interaction and Confrontation
- Large Scale Visualization
- High Resolution
- High Performance and Capacity
  - VR Needs
  - Big Data and Deep Learning
Big Problems, Smart Solutions
Challenges

Infrastructure
- Post Moore Era Architectures
  • Parallel Balancing, I/O, Memory Challenges
- Dark Sillicio
- Exascale
  • Computer Efficiency (Processing/Energy Consumption)
- Hybrid Platforms (CISC+RISC+Others)
  • TPU, ARM...
- Data Management
- Advanced Networks
- Fog/Edge
- HPC@Pocket
- ... Quantum Computing

Platform
- Programmability
  • New Languages and Compilers
- Computing Efficiency
- Data Movement and Processing (In Situ, In Transit, Workflows)
- HPC as a Service
  • Science Gateways, Containers
- Viz as a Service (In Situ)
- Protocols
- IA and Deep Learning Frameworks
- Quantum Computing

Applications
- IA and Deep Learning
- Algorithms Implementation
- Use of Interpretators (as Python)
- Community versions
- Open Algorithms, Open Data
- Ultra Scale Applications
- ...and more!
About Parallelism

- **Concurrency** is a property of systems in which several computations are executing simultaneously, and potentially interacting with each other.

- **Implicit parallelism** is a characteristic of a programming language that allows a compiler or interpreter to automatically exploit the parallelism inherent to the computations expressed by some of the language's constructs.

- **Explicit parallelism** is the representation of concurrent computations by means of primitives in the form of special-purpose directives or function calls.

- We need two (mixed) approach in Architecture: Applications and Hardware (system).
Elements of Parallelism

1. Computing Problems
   • Numerical (Intensive Computing, Large Data Sets)
   • Logical (AI Problems)
2. Parallel Algorithms and Data Structures
   + Special Algorithms (Numerical, Symbolic)
   + Data Structures (Dependency Analysis)
   + Interdisciplinary Action (Due to the Computing Problems)
3. System Software Support
   + High Level Languages (HLL)
   + Assemblers, Linkers, Loaders
   + Models Programming
   + Portable Parallel Programming Directives and Libraries
   + User Interfaces and Tools
4. Compiler Support
   + Implicit Parallelism Approach
     + Parallelizing Compiler
     + Source Codes
   + Explicit parallelism Approach
     + Programmer Explicitly
       + Sequential Compilers, Low Level Libraries
       + Concurrent Compilers (HLL)
     + Concurrency Preserving Compiler
5. Parallel Hardware Architecture
   + Processors
   + Memory
   + Network and I/O
   + Storage
Pervasive and Thinking Parallelism

- It is not a question of « Parallel Universes » (Almost)
- Data Sources
- Processing and Treatment
- Resources (Available and Desire)
- Energy Consumption
- Natural “thinking” (Natural Compute?)
Thinking in Parallel (computing) – The Typical Visions

Concurrent: 2 queues, 1 vending machine

Parallel: 2 queues, 2 vending machines
Thinking in Parallel (computing) – an OPL hierarchy
Any Parallel System is concurrent: Simultaneous Processing, Parallel but limited resources.

From J. Armstrong Notes: http://joearms.github.io/2013/04/05/concurrent-and-parallel-programming.html
Serial vs Concurrent/Parallel Approach

Reduction in Execution Time (However, overhead problem)
Instructions to Multithreading (To exploit Parallelism)
Synchronzation (with all derivated concerns...)
Concurrent vs Concurrency/Parallelism Behavior

Time

<table>
<thead>
<tr>
<th>Concurrency (Without Parallelism)</th>
<th>Thread 1</th>
<th>Thread 2</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>fork</td>
<td></td>
</tr>
</tbody>
</table>

Non Shared Processing Resources (However the Memory...)
Switching
Parallel Threads (Multitasking, Multithreading)

Shared Processing Resources
Switching
Non Parallel Threads (Non Multitasking, Yes Multithreading)
Concurrency vs Concurrency/Parallelism Example

Single System
- Multiple Threads in Runtime
- Almost Synchronization Strategies
- Memory Allocation

Dual System
- Multiple Parallel Threads in Runtime
- Strategies to Parallellism following models (PRAM, LogP, etc) addressed to exploit memory and overhead reduction
Sequential Processing

- All of the algorithms we’ve seen so far are sequential:
  - They have one “thread” of execution
  - One step follows another in sequence
  - One processor is all that is needed to run the algorithm
Concurrent Systems

- A system in which:
  - Multiple tasks can be executed at the same time
  - The tasks may be duplicates of each other, or distinct tasks
  - The overall time to perform the series of tasks is reduced
Advantages of Concurrency

- Concurrent processes can reduce duplication in code.
- The overall runtime of the algorithm can be significantly reduced.
- More real-world problems can be solved than with sequential algorithms alone.
- Redundancy can make systems more reliable.
Disadvantages of Concurrency

- Runtime is not always reduced, so careful planning is required.
- Concurrent algorithms can be more complex than sequential algorithms.
- Shared data can be corrupted.
- Communications between tasks is needed.
Parallel Computing

- Parallel Computing exploit
  Concurrency
  - In “system” terms, concurrency exists when a problem can be decomposed in sub problems that can safely executed at same time (in other words, concurrently)

https://ignorelist.files.wordpress.com/2012/01/the-art-of-concurrency.pdf
How to Exploit (Better) Concurrency

- (Remember) Mixed Approach
  (Algorithms/Applications - Hardware/System.
- Good Techniques from Software Engineering
- Good Problem knowledge from scientific (domain) expertise
- Confrontation and Performance Evaluation
The Hardware/System Approach
Shared, Distributed and Hybrid Memory Architectures

- Memory Exploitation involves Memory Hierarchy
- Models as PRAM, BSP, etc..
- All modern architectures to HPC allows different memory models
  - Shared Memory (Inside Nodes)
  - Distributed Memory (Among Nodes)
  - Hybrid Memory
    - Using Accelerators (GPUs, MICs)
    - Interaction Nodes/Processors
Flynn’s Taxonomy*

*S Proposed by M. Flynn in 1966
Moore’s Law – The number of transistors on integrated circuit chips (1971-2016)

Moore's law describes the empirical regularity that the number of transistors on integrated circuits doubles approximately every two years. This advancement is important as other aspects of technological progress – such as processing speed or the price of electronic products – are strongly linked to Moore’s law.
The (Post) Moore Era

After 120 years...
The Moore’s Law is Dead

Jack Dongarra
Parallel Computing Everywhere

It is more than a publicity!
Parallel Computing Evolution
(From the LLNL Vision by Rob Neely)

Advancements in (High Performance) Computing Have Occurred in Several Distinct “Eras”

Truth is: we don’t want to call this next era. It’s currently defined by its inability to be defined.

Each of these eras define not so much a common hardware architecture, but a common programming model.

Rob Neely
Configurable Architectures

- Dual Cores (Symmetric Multithreading)
- MultiCore Arrays
- Scalar + Many Cores (Highly threaded workloads)
- Manycore arrays

Large Scale Cores
(High Single Thread Performance)
Further Taxonomy
(Derivate from MIMD for distributed memory programming)

- **SPMD (Single Program, Multiple Data Streams or Single Process, Multiple Data)**
  - Multiple autonomous processors simultaneously executing the same program on different data.
  - It is the most common taxonomy.

- **MPMD (Multiple Program Multiple Data)**
  - Multiple autonomous processors simultaneously operating at least 2 independent programs.
The Distributed Shared Memory Access

- Main Memory in Parallel Machines is a hybrid between shared memory and distributed memory.
- Uniform Memory Access (UMA) is proposed for End-User Systems (Generally)
- Distributed memory systems have Non-Uniform Memory Access (NUMA) architecture
Multicores

- Multicore Computer:
  - Multicore Processor includes multiple execution units (cores)
  - Minimal (Physic) Processing unit is the core. The Core support processing of threads (almost one thread)
  - Cache memory is important to threads exchange between cores and memory.

Dual CPU example

Dual CPU Core Chip

- CPU Core and L1 Caches
- CPU Core and L1 Caches
- Bus Interface and L2 Caches
Symmetric Multiprocessing

- Symmetric multiprocessors:
  - A symmetric multiprocessor (SMP) is a computer system connected to a main shared memory.
  - Intel’s Xeon is the most known SMP system.
  - Sun Microsystems UltraSPARC was the first multiprocessing system.
Massive Parallel Processing (MPP)

- Computer system with many independent arithmetic units or entire microprocessors, that run in parallel.
- MPPA is a MIMD (Multiple Instruction streams, Multiple Data) architecture, with distributed memory accessed locally, not shared globally.
- GPGPU Computing exploit MPP.
More of Parallel Computers

- Reconfigurable Computing with Field-Programmable Gate Arrays (FPGA).
- General-Purpose Computing on Graphics Processing Units (GPGPU).
  - Programmin with CUDA and OpenCL (i.e.)
- Application-Specific Integrated Circuits (ASIC).
- Vector Processors. (SIMD)
Distributed Computers

• Distributed computing:
  
  • Distributed Computing is a Distributed Memory Multiprocessor System connected by a network.
  
  • Distributed computers are **highly scalable**!
  
  • Cases of Distributed Computing:
    
    • Cluster computing (Parallel Distributed Computing)
    
    • Grid Computing
Large Scale Architectures

• Large Scale Architecture (LSA) allows to trait large scale problems.
  • LSAs need Large Scale Software
  • LSAs are distributed systems.
    • Cluster Computing Platform
    • Grid Computing Infrastructure
  • The Fault tolerance is a critical problem in LSA systems.
An Spain Exemple: BSC-CNS Marenostrum

www.bsc.es

- 11.15 Petaflops
- 384.75 TB Memory
- 3,465 Computing Nodes
  - 2x Intel Xeon Plantium 8160 24C / 2.1Ghz
  - 216 Nodes with 12x32GB DDR4 2667 DIMMS (8GB Cores)
  - 3240 Nodes with 12x8GB DDR4-2667 DIMMS (2GB Cores)
- Network
  - 100Gb Intel Omni-Path Full Fat Tree
  - 100Gb Ethernet
- Operating System
  - Suse Linux Enterprise Server 12SP2
Grid Computing

- Grid Computing implies technology, technics and methodology to support Parallel*/Distributed Computing.
- Grid Computing needs Grid Computing Infrastructure and dedicated and high disponibility networks or interconexion.
- Different Types or Possibilities:
  - Experimental Testbeds
  - Production Grids
  - Lightweight Grids
  - Desktop Grid Computing (May be Lightweight too)
- Grid Computing is in the back of Cloud Computing Systems (from Infrastructure Point of View)
Grid Computing Features

- Grid Computing Features:

- Infrastructure
  - High Availability
  - High Performance
  - Heterogeneity
  - Pervasive
  - Scalability

- Methodology
  - Different User Levels
  - Multi Administration

- Politics
  - Security
  - Use
  - Privacy

Grid Computing Architecture
(Typical Diagram)

[*]From http://gridcafe.web.cern.ch
Grid Computing Architecture
(remember the Cluster Architecture)

Sequential Applications

Parallel Applications

Parallel Programming Environment

Middleware
(Single System Image and Availability Infrastructure)

PC/Workstation/Cluster/Devices/Sensors
Communications Software
Network Interface Hardware

PC/Workstation/Cluster/Devices/Sensors
Communications Software
Network Interface Hardware

PC/Workstation/Cluster/Devices/Sensors
Communications Software
Network Interface Hardware

PC/Workstation/Cluster/Devices/Sensors
Communications Software
Network Interface Hardware

Interconnection Network/Switch
An Example: The French Aladdin Grid5000 (G5K)

- G5K has 5000 processors distributed in 9 sites France wide, for research in Grid Computing, eScience and Cyber-infrastructures
- G5K project aims at building a highly reconfigurable, controllable and monitorable experimental Grid platform
- All clusters will be connected to Renater with a 10Gb/s link (or at least 1 Gb/s, when 10Gb/s is not available yet).
  - IntraCluster
    - Myrinet
    - GigaEthernet / Infiniband
  - Grid
    - Giga Ethernet (Best case 10GB/s, Nate case: 1GB/s)
  - Inter-Grid
    - Ethernet (~1GB/s)

www.grid5000.org
Volunteer Computing

- Volunteer computing is a type of distributed computing in which computer owners donate their computing resources (such as processing power and storage) to one or more "projects".
  - BOINC (Seti@home)
  - Xgrid
  - GridMP
- Associated with P2P
- Can be associated with High Throughput Computing (HTC) or High Performance Computing (HCP)
An Example: The BOINC Architecture
Cloud Computing

- Internet-based computing, whereby shared resources, software, and information are provided to computers and other devices on demand.
- Cloud computing describes a new supplement, consumption, and delivery model for IT services based on the Internet, and it typically involves over-the-Internet-provision of dynamically scalable and often virtualized resources.
Cloud Computing Visibility

Cloud Computing Deployment Types

- Private Cloud
- Public Cloud
  - Resources are dynamically provisioned on a fine-grained, self-service basis over the Internet, via web applications/web services
- Community Cloud
  - Established where several organizations have similar requirements and seek to share infrastructure
- Hybrid Cloud
- InterCloud
  - Cloud of Clouds

Different very known examples: AWS, MS Azure, Google Cloud..
UltraScale Systems

Mesh network of micro data centers that process or store critical data locally

Extends Cloud computing and services to the edge of the network

« Ultrascale systems are envisioned as large-scale complex systems joining parallel and distributed computing systems that will be two to three orders of magnitude larger that today’s systems » (Carretero et al.)
HPC Hybrid Systems (HPC@Pocket)

- High Performance Capabilities
  - Multiple Cores (i.e. more than 192 cores in Jetson)
- Co-Design Architecture
  - Allowing multiple networks and protocols
  - Software Implementation Mechanisms (Now, very known, i.e. CUDA, OpenCL… same Python)
- Low Power
- Low Cost
  - Depending of the device… (≈1 € per core)
- However, Integration/interaction demands efficiency

NVidia® Jetson TK1/TX1
An Example: NVIDIA Jetson Nano

JETSON NANO SPECIFICATIONS

- **GPU**: 128 Core Maxwell, 472 GFLOPs (FP16)
- **CPU**: 4 core ARM A57 @ 1.43 GHz
- **Memory**: 4 GB 64 bit LPDDR4 25.6 GB/s
- **Storage**: 16 GB eMMC
- **Video Encode**: 4K @ 30 | 4x 1080p @ 30 | 8x 720p @ 30 (H.264/H.265)
- **Video Decode**: 4K @ 60 | 2x 4K @ 30 | 8x 1080p @ 30 | 16x 720p @ 30 (H.264/H.265)
- **Camera**: 12 (3x4 or 4x2) MIPI CSI-2 DPHY 1.1 lanes (1.5 Gbps)
- **Display**: HDMI 2.0 or DP1.2 | eDP 1.4 | DSI (1 x2) 2 simultaneous
- **UPI/H**: 1 x1/2/4 PCIe
- **1 USB 3.0**
- **SDIO/MIPI/SysI/Os/GPIOs/I2C**: 1x SDIO / 2x SPI / 5x SysI/O / 13x GPIOs / 6x I2C
Integration and Interaction

- Typical Protocols
  - TCP/UDP
  - Interaction with Large Scale Systems
- Redundancy
  - Availability
  - Easy Performance Monitoring
  - Fault Tolerance
- Embedded O.S. and Package Management
  - Scheduling Resources and Uses (i.e., NIX/NIX OS)
  - Containers
    - Fluidity (Data and Applications)
    - Micro-Architectures
    - Usability
  - Interaction with Large Scale Systems with Big Heterogeneous Application OS / Sched

Currently is the MSc These of Carlos Gomez (Co-Advising with O. Richard)
How Exploit HPC Architectures with Cloud Visibility Models?

HPC as A Service Model

* Red Components are (most) concerned at Viz As A Service

- User/Scientist Oriented Services
- Developer Oriented Services
- Infrastructure Oriented Services
GPGPU Accelerate Computing Architecture

Latency Processor + Throughput processor

More Detailed and Explained in the Thursday Session
NVIDIA Tensor Cores Implementation Architecture

\[ D = \begin{pmatrix} A_{0,0} & A_{0,1} & A_{0,2} & A_{0,3} \\ A_{1,0} & A_{1,1} & A_{1,2} & A_{1,3} \\ A_{2,0} & A_{2,1} & A_{2,2} & A_{2,3} \\ A_{3,0} & A_{3,1} & A_{3,2} & A_{3,3} \end{pmatrix} \begin{pmatrix} B_{0,0} & B_{0,1} & B_{0,2} & B_{0,3} \\ B_{1,0} & B_{1,1} & B_{1,2} & B_{1,3} \\ B_{2,0} & B_{2,1} & B_{2,2} & B_{2,3} \\ B_{3,0} & B_{3,1} & B_{3,2} & B_{3,3} \end{pmatrix} + \begin{pmatrix} C_{0,0} & C_{0,1} & C_{0,2} & C_{0,3} \\ C_{1,0} & C_{1,1} & C_{1,2} & C_{1,3} \\ C_{2,0} & C_{2,1} & C_{2,2} & C_{2,3} \\ C_{3,0} & C_{3,1} & C_{3,2} & C_{3,3} \end{pmatrix} \]

\[ D = AB + C \]
VOLTA TENSOR OPERATION

FP16 storage/input

Full precision product

Sum with FP32 accumulator

Convert to FP32 result

Also supports FP16 accumulator mode for inferencing
Deep Learning Accomplished Promises

V100 measured on pre-production hardware.

9.3x faster

cuBLAS Mixed Precision (FP16 input, FP32 compute)
Matrix Multiply (M=N=K=2048)
And... Quantum computing?

(Advanced Computing Point of view)

- A quantum computer is a machine that performs calculations based on the laws of quantum mechanics, which is the behavior of particles at the sub-atomic level.
A bit of data is represented by a single atom that is in one of two states denoted by $|0\rangle$ and $|1\rangle$. A single bit of this form is known as a qubit.

A physical implementation of a qubit could use the two energy levels of an atom. An excited state representing $|1\rangle$ and a ground state representing $|0\rangle$. 

![Diagram of qubit states](image)
Consider a 3 bit qubit register. An equally weighted superposition of all possible states would be denoted by:

\[ |\psi\rangle = \frac{1}{\sqrt{8}} |000\rangle + \frac{1}{\sqrt{8}} |001\rangle + \ldots + \frac{1}{\sqrt{8}} |111\rangle \]
Due to the nature of quantum physics, the destruction of information in a gate will cause heat to be evolved which can destroy the superposition of qubits.

Ex.

The AND Gate

<table>
<thead>
<tr>
<th>Input</th>
<th>Output</th>
</tr>
</thead>
<tbody>
<tr>
<td>A</td>
<td>B</td>
</tr>
<tr>
<td>0</td>
<td>0</td>
</tr>
<tr>
<td>0</td>
<td>1</td>
</tr>
<tr>
<td>1</td>
<td>0</td>
</tr>
<tr>
<td>1</td>
<td>1</td>
</tr>
</tbody>
</table>

In these 3 cases, information is being destroyed.

This type of gate cannot be used. We must use **Quantum Gates**.
Quantum Computers Today

- Enterprises produce Quantum Computing Laboratory Infrastructure (Non for production)
  - (Real) Quantum Computing (D-Wave, IBM)
  - Quantum Computing Simulators (Atos)
Quantum Computing Tomorrow

(a) Local Machine
Public Network
QPU

(b) Interconnect
Node CPU QPU
Node CPU QPU
Node CPU QPU
Node CPU QPU

(c) Quantum Interconnect
Node CPU QPU
Node CPU QPU
Node CPU QPU
Node CPU QPU

FIGURE 2. Three macroarchitectures for integrating quantum computing with conventional computing. (a) A local machine remotely accesses a QPU through public cloud network. (b) A network of quantum-accelerated nodes communicate through a common interconnect. (c) A network of quantum-accelerated nodes communicate through both conventional and quantum networks.

FIGURE 3. A component diagram representing the microarchitecture of a HPC-QC node with a common interconnect as depicted in Figure 2(b). The diagram shows the major components needed for the operation of a QPU within the HPC node infrastructure. Individual components are grouped into so-called out-of-band and in-band scopes and are placed on the left-hand and right-hand side of the figure, respectively. The QPU, which contains the qubits and is capable of processing quantum information, is depicted at the lower part, whereas classical information processing components are shown in the upper part of the figure. Several hardware (HW) devices control the QPU environment, which has a direct effect on qubit properties and thus the quality of execution of instructions.
About Q... Algorithms and Code...

Consortiums propose Quantum Frameworks
- IBM, Microsoft, ATOS
- Quantum Computing Community...

- However without a good use for real Quantum mathematical abstraction.

Q# Interface
FIGURE 4. Decomposition of the software architecture required for quantum-HPC integration into a series of workflow steps, each exposing a unique set of service interfaces. This architecture maximizes the flexibility, modularity, and extensibility of the suggested integration strategy. Here we show the workflow decomposed into programming, compilation, IR lowering, and execution components. Each component exposes a series of service interfaces intended for the implementation of concrete use cases.
Then..... The challenges (Are now for computing people not for physicists)

- A « Real » Abstraction of Quantum Architecture
  - Quantum Memory (Optimized « in-Memory » System)
  - Software/Application Elements
    - Definition of a Language with the « Assembly » possibilities
    - The Concept of a Operating System
- The « Production » Applications
  - Quantum Algorithms
  - Quantum Application Design and Code Structure
  - The Concept of Optimization (and compilation) in Quantum Computing
The Challenges in Detail: Post Moore Era Architectures

- Sustainable-Hybrid Technology
  - RISC/CISC
    - GPUs, Hybrid ARM/FPGAs, Accelerators, CPUS…
    - I/O’s and Memory Management
- The “Data Treatment” Goal
  - Large Scale Data Sets (Supported by the Architecture
  - However scale capabilities changes
  - In-Situ and In-Transit Problem
- Very Known Schedulers, O.S. and Package Management
  - However, it is important to observe the architecture
- Exascale constrains
  - Computer Efficiency (Energy Consumption / Energy Aware)
The Software/Applications Approach
About High Performance Computing

- HPC is useful to being faster, more precise overall, to solve large problems and to treat, intrinsically, parallelism in essence.
- However allows
  - Technological Advantage
  - Technological Independency
  - Competitively
  - Energy Savings

- But, HPC is expensive
What & Why

• What is high performance computing (HPC) from Parallel Programming Approach?
  • The use of the most efficient algorithms on computers capable of the highest performance to solve the most demanding problems.

• Why HPC?
  • Large problems – spatially/temporally
    • 10,000 x 10,000 x 10,000 grid → 10^12 grid points → 4x10^12 double variables → 32x10^12 bytes = 32 Tera-Bytes.
    • Usually need to simulate tens of millions of time steps.
    • On-demand/urgent computing; real-time computing;
  • Weather forecasting; protein folding; turbulence simulations/CFD; aerospace structures; Full-body simulation/ Digital human …
  • And Remember the slides 2 and 3…
HPC Examples

Earthquake simulation
Surface velocity 75 sec after earthquake

Flu pandemic simulation
300 million people tracked

Density of infected population, 45 days after breakout
HPC Examples: Blood Flow in Human Vascular Network

• Cardiovascular disease accounts for about 50% of deaths in western world;
• Formation of arterial disease strongly correlated to blood flow patterns;

In one minute, the heart pumps the entire blood supply of 5 quarts through 60,000 miles of vessels, that is a quarter of the distance between the moon and the earth

Computational challenges: Enormous problem size
HPC Example: Homogeneous Turbulence

Direct Numerical Simulation of Homogeneous Turbulence: $4096^3$
How HPC fits into Scientific Computing

Air flow around an airplane
Navier-stokes equations
Algorithms, BCs, solvers, Application codes, supercomputers
Viz software

Physical Processes
Mathematical Models
Numerical Solutions
Data Visualization, Validation, Physical insight

HPC
Advantages of Parallelization

• **Cheaper**, in terms of Price/Performance Ratio
• **Faster** than equivalently expensive uniprocessor machines
• **Handle** **bigger** problems
• **More scalable**: the performance of a particular program may be improved by execution on a large machine
• **More reliable**: In theory if processors fail we can simply use others
How to Parallelize?: Traditional Way

Actually applied for current well-known applications with sequential implementations.

Addressed (mainly) for distributed memory applications

It’s good as first approach of scientific computing algorithm for (alone) scientists programmers.

However this is not a traditional course...

Designing and Building Parallel Programs, by Ian Foster in http://www.mcs.anl.gov/~itf/dbpp/
Design Spaces of Parallel Programming*

- FC: Finding Concurrency (Structuring Problem to expose exploitable concurrency)
- AS: Algorithm Structure (Structure Algorithm to take advantage of Concurrency)
- SS: Supporting Structures (Interfaces between Algorithms and Environments)
- IM: Implementation Mechanisms (Define Programming Environments)

*Patterns for Parallel Programming, Timoty Mattson, Beverly A. Sanders and Berna L. Massingill, Software Pattern Series, Addison-Wesley 2004
Concurrent Programming General Steps

1. Analysis
   - Identify Possible Concurrency
     - Hotspot: Any partition of the code that has a significant amount of activity
     - Time spent, Independence of the code...

2. Design and Implementation
   - Threading the algorithm

3. Tests of Correctness
   - Detecting and Fixing Threading Errors

4. Tune of Performance
   - Removing Performance Bottlenecks
     - Logical errors, contention, synchronization errors, imbalance, excessive overhead
     - Tuning Performance Problems in the code (tuning cycles)

• From: Patterns for Parallel Programming., by T. Mattson., B. Sanders and B. MassinGill (Ed. Addison Weslley, 2009) Web Site: http://www.cise.ufl.edu/research/ParallelPatterns/
# Distributed Vs. Shared Memory Programming
(Remember Architecture Features)

## Common Features
- Redundant Work
- Dividing Work
- Sharing Data (Different Methods)
- Dynamic / Static Allocation of Work
  - Depending on the nature of the serial algorithm, the resulting concurrent version, number of threads / processors

## Only to Shared Memory
- Local Declarations and Thread-Local Storage
- Memory Effects:
  - False Sharing
- Communication in Memory
- Mutual Exclusion
- Producer / Consumer Model
- Reader / Writer Locks (In Distributed Memory is Boss / Worker)
Decomposition

**Task Parallelism**
- Task A
- Task B
- Task C
- Task D
- Task E
- Task F
- Task G
- Task H
- Task I

**Data Parallelism**
- Elem 1
- Elem 2
- Elem 3
- Elem 4
- Elem 5
- Elem 6
- Elem 7
- Elem 8
- Elem 9

Tasks Decomposition: Task Parallelism
Data Decomposition: Data Parallelism / Geometric Parallelism
Task Parallelism: What are the tasks and how are they defined?

- There should be at least as many tasks as there will be threads (or cores)
  - It is almost always better to have (many) more tasks than threads.

- **Granularity** must be large enough to offset the overhead that will be needed to manage the tasks and threads
  - More computation: higher granularity (coarse-grained)
  - Less computation: lower granularity (fine-grained)

Granularity is the amount of computation done before synchronization is needed.
Task Granularity

Core 0
- overhead
- task
- overhead
- task

Core 1
- overhead
- task
- overhead
- task

Core 2
- overhead
- task
- overhead
- task

Core 0
- overhead
- task

Core 1
- overhead
- task

Core 3
- overhead
- task

Fine-grained decomposition

Coarse-grained decomposition
Granularity in Implementations

Coarse grid: Higher Performance
             Lower Accuracy
             (Using Nodes)

Fine grid:  Lower Performance
             Higher Accuracy
             (Using Processors)

Dynamic grid: Target performance where
              accuracy is required
              (Using Processors and
               Nodes)
Task Decomposition Considerations

- What are the tasks and how are defined?
- What are the dependencies between task and how can they be satisfied?
- How are the task assigned to threads?

Tasks must be assigned to threads for execution
Task Dependencies

Order Dependency

Data Dependency

Enchantingly Parallel Code: Code without dependencies
Data Decomposition
Considerations
(Geometric Decomposition)

Data Structures must be (commonly) divided in arrays or logical structures.

- How should you divide the data into chunks?
- How should you ensure that the tasks for each chunk have access to all data required for update?
- How are the data chunks assigned to threads?
How are the data chunks (and tasks) assigned to threads?

- Data Chunks are associated with tasks and are assigned to threads statically or dynamically.

- Via Scheduling
  - Static: when the amount of computations within tasks is uniform and predictable.
  - Dynamic: to achieve a good balance due to variability in the computation needed by chunk.

- Require many (more) tasks than threads.
How should you divide data into chunks?

- By individual elements
- By rows
- By groups of columns
- By blocks
Data Decomposition have an additional dimension.
It determines what the neighboring chunks are and how any exchange of data will be handled during the course of the chunk computations.

- Regular shapes: Common Regular data organizations.
- Irregular shapes: may be necessary due to the irregular organizations of the data.
How should you ensure that the tasks for each chunk have access to all data required for update?

- Using Ghost Cells
  - Using ghost cells to hold copied data from a neighboring chunk.
Data Sharing Pattern

- Data decomposition might define some data that must be shared among the tasks.
- Data dependencies can also occur when one task needs access to some portions of the another task’s local data.
  - Read Only
  - Effectively Local (Accessed by one of the tasks)
  - Read Write
    - Accumulative
    - Multiple read / Single Write
Tasks and Domain Decomposition Patterns

- **Task Decomposition Patterns**
  - Understand the computationally intensive parts of the problem.
  - Finding Tasks (as much...)
    - Actions that are carried out to solve the problem
    - Actions are distinct and relatively independent.

- **Data Decomposition Patterns**
  - Data decomposition implied by tasks.
  - Finding Domains:
    - Most computationally intensive part of the problem is organized around the manipulation of large data structure.
    - Similar operators are being applied to different parts of the data structure.
  - In shared memory programming environments, data decomposition will be implied by task decomposition (To see in detail in the OpenMP session).
Concurrent Computation from Serial Codes

Sequential Consistency Property: Getting the same answer as the serial code on the same input data set, comparing sequence of execution in concurrent solutions of the concurrent algorithms.
Not Parallelizable Jobs, Tasks and Algorithms

- Algorithms with state
- Recurrences
- Induction Variables
- Reductions
- Loop-carried Dependencies

Concurrent Design Models Features

- **Efficiency**
  - Concurrent applications must run quickly and make good use of processing resources.

- **Simplicity**
  - Easier to understand, develop, debug, verify and maintain.

- **Portability**
  - In terms of threading portability.

- **Scalability**
  - It should be effective on a wide range of number of threads and cores, and sizes of data sets.
Design Evaluation Pattern

- Production of analysis and decomposition:
  - Task decomposition to identify concurrency
  - Data decomposition to identify data local to each task
  - Group of tasks and order of groups to satisfy temporal constraints
  - Dependencies among tasks

- Design Evaluation
  - Suitability for the target platform
  - Design Quality
  - Preparation for the next phase of the design
Algorithm Structures

- Organizing by Tasks
  - Task Parallelism
  - Divide and Conquer

- Organizing by Data Decomposition
  - Geometric Decomposition
  - Recursive Data

- Organizing by Flow of Data
  - Pipeline
  - Event-Based Coordination
Algorithm Structure Decision Tree
(Major Organizing Principle)

Start

Organize By Tasks
- Linear
- Recursive
  - Task Parallelism
  - Divide and Conquer

Organize By Data Decomposition
- Linear
- Recursive
  - Geometric Decomposition

Organize By Flow of Data
- Linear
- Recursive
  - Recursive Data
  - Pipeline
  - Event-Based Coordination
Divide and Conquer Strategy

Problem

Subproblem

Solve

Subsolution

Split

Subproblem

Solve

Subsolution

Split

Subproblem

Solve

Subsolution

Split

Subproblem

Solve

Subsolution

Solution
Divide and Conquer Parallel Strategy
Recursive Data Strategy

- Involves an operation on a recursive data structure that appears to require sequential processing:
  - Lists
  - Trees
  - Graphs

- Recursive Data structure is completely decomposed into individual elements.

- Structure in the form of a loop (top-level structure)

- Simultaneously updating all elements of the data structure (Synchronization)

- Examples:
  - Partial sums of a linked list.

- Uses:
  - Widely used on SIMD platforms (HPF77)
  - Combinatorial optimization Problems.
  - Partial sums
  - List ranking
  - Euler tours and ear decomposition
  - Finding roots of trees in a forest of rooted directed trees.
Pipeline Strategy

- Involves performing a calculation on many sets of data, where the calculation can be viewed in terms of data flowing through a sequence of stages
- Instruction pipeline in modern CPUs
- Vector Processing (Loop-level pipelining)
- Algorithm-level Pipelining
- Signal Processing
- Graphics
- Shell Programs in Unix
Event-Based Coordination Strategy

+ Application decomposed into groups of semi-independent tasks interacting in an irregular fashion.
+ Interaction determined by a flow of data between the groups, implying ordering constraints between the tasks.
Some Conclusions

+ High Performance Computing allows science and mathematics dreams and implementations... i.e. Artificial Intelligence Implementation, data analytics, blockchain and more...

+ Computer systems involve different technologies and hybrid architectures, demanding sustainability, dynamicity and they need support changes in the scale of data and processing.... And all processing in parallel.
  + Of course, observing requirements of the applications and large scale behavior (i.e. IoT platforms)

+ Power consumption, energy aware and computational efficiency reach sustainability. It is proposed from the design of the architecture and it must be dynamic.
  + Exascale challenges : Co-Design

+ Big and little (embedded) HPC Architectures with the same challenges (memory contention, stable speed – up, parallel coherence) follows same kind of solutions, but with different scale of treatment observing the data level.
  + Involving Software Engineering, Computer Architecture, Data Analytics and Performance Evaluation.

+ HPC is expensive (but It is more expensive to not have HPC Knowledge and Resources)

+ Parallel Computing is not a tendency. (From 2015 is mandatory in all universities and colleges in USA parallel computing, scientific computing and advanced computing courses in science and engineering programs (programming computing is mandatory also in high school from 2009).
Recommended Lectures

• Patterns for Parallel Programming., by T. Mattson., B. Sanders and B. MassinGill (Ed. Addison Weslley, 2009) Web Site: http://www.cise.ufl.edu/research/ParallelPatterns/
• Designing and Building Parallel Programs, by Ian Foster in http://www.mcs.anl.gov/~itf/dbpp/
• Lectures in the site: www.sc-camp.org
Class work

- Revision of Chapter 2 of Designing and Building Parallel Programs, by Ian Foster in http://www.mcs.anl.gov/~itf/dbpp/
- Solve in the Exercises Section the 1 and 2 numerals.
- Imagine a solution for a real-world high complex problem to solve in the campus (conceptually)
- Read http://www.cs.wisc.edu/multifacet/papers/ieeecomputer08_amdahl_multicore.pdf
Questions?
Follow us: @SuperCCamp
Or visit: www.sc-camp.org

#SCCAMP2021