Parallel Computing - An Introduction

Parallel Computing is a form of computation in which many calculations are carried out simultaneously,
operating on the principle that large problems can often be divided
into smaller ones, which are then solved concurrently ("in parallel").
The core motivation for parallelization is to speed up computing tasks.

1. Types of Parallelism

There are various forms of parallel processing:
- Bit-Level Parallelism: all bits of a word are processed at the same time
- Instruction-Level Parallelism: multiple instructions are executed simultaneously
- Data Parallelism (a.k.a. loop-level parallelism): focuses on distributing data across different parallel processing units
- Task Parallelism (a.k.a. functional parallelism): focuses on distributing execution tasks (code + data) across different parallel processing units

2. Types of Parallel Computer

Using Flynn's Taxonomy, computer architecture can be divided into:
- Single Instruction, Single Data stream (SISD)
- Single Instruction, Multiple Data streams (SIMD)
- Multiple Instruction, Single Data stream (MISD)
- Multiple Instruction, Multiple Data streams (MIMD)

Today's parallel computers are essentially all of the MIMD type. At a more coarse-grained level, parallel computers can be further divided into:
- SPMD (Single Program, Multiple Data)
- MPMD (Multiple Program, Multiple Data)

According to their memory architecture, parallel computers can be divided into:

- Shared Memory Computer

In this kind of computer, all processing nodes share the same global memory address space. Programming these computers can be as easy as programming a multicore workstation.

Shared memory computers are easy to program, but since a bus is shared among all processing nodes, the scale is limited (usually to a few tens of nodes): bus contention becomes the bottleneck as the scale grows.

Shared memory computers can be further divided into two kinds:
- UMA/cc-UMA: all processing nodes share the same physical memory device through a bus
- NUMA/cc-NUMA: each processing node has its own local physical memory, which is also accessible by other nodes, but the access time depends on the memory location relative to a node

Uniform Memory Access (from [1])

Non-Uniform Memory Access (from [1])

- Distributed Memory Computer

In this kind of computer, each node has its own private memory address space and cannot access another node's memory directly. Processing nodes are usually connected by some kind of interconnection network.

Distributed memory computers can scale to very large sizes since no bus contention occurs, but writing programs for this kind of computer is more complicated.

Distributed Memory (from [1])

- Distributed Shared Memory Computer

In this kind of computer, the hardware architecture is usually the same as that of a distributed memory system, but the interface presented to application developers is the same as that of a shared memory system.

DSM is usually implemented as a software extension to the OS, which incurs some performance penalty.

3. Parallel Programming Model

With these powerful computers in hand, how do we program them?

3.1 Conceptually, there are two models for writing parallel programs.

Threads Model

There are two well-known API interfaces for multi-threading:
- OpenMP (Open Multi-Processing), which uses compiler directives, environment variables, and a runtime library to provide threading support
- PThreads (POSIX Threads), which supports threads by means of a library only

A process with two threads (from Wikipedia)
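To make the threads model concrete, here is a minimal OpenMP sketch (an assumed example, not taken from [1]): a single compiler directive splits the loop iterations among threads, and the reduction clause combines each thread's partial sum.

    /* Minimal OpenMP sketch (assumed example): parallelize a loop and
       reduce the per-thread partial sums into one total. */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N];
        double sum = 0.0;

        /* The directive distributes iterations among threads;
           reduction(+:sum) merges the per-thread sums at the end. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = i * 0.5;
            sum += a[i];
        }

        printf("sum = %f (max threads = %d)\n", sum, omp_get_max_threads());
        return 0;
    }

Building it requires an OpenMP-aware compiler, e.g. gcc -fopenmp.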

Message Passing Model

In this model, a set of tasks use their own local memory for computation; communication among the tasks is carried out by sending and receiving network messages.

There are two standards:
- MPI (Message Passing Interface)
- PVM (Parallel Virtual Machine)

Typical message passing patterns are shown below:

Collective communication examples / Message Passing Patterns (from [1])
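As a concrete example of the point-to-point pattern, here is a minimal MPI sketch (an assumed example, not taken from [1]): rank 0 sends one integer to rank 1, which receives and prints it. It needs at least two processes, e.g. mpirun -np 2 ./a.out.

    /* Minimal MPI point-to-point sketch (assumed example). */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* Send one int to rank 1 with message tag 0. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Receive one int from rank 0 with message tag 0. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }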

3.2 Other Factors to Consider When Designing Parallel Programs

a. Problem Decomposition/Partitioning
- Data Partitioning (see the sketch after this list)
- Functional Partitioning
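A small sketch of data partitioning (an assumed example; the partition() helper is illustrative, not from any library): n elements are divided among nprocs workers as evenly as possible, and worker rank gets the half-open range [start, end).

    /* Data partitioning sketch (assumed example): split n elements
       among nprocs workers; the first (n % nprocs) workers get one
       extra element. */
    #include <stdio.h>

    void partition(int n, int nprocs, int rank, int *start, int *end) {
        int base = n / nprocs;   /* minimum chunk size              */
        int rem  = n % nprocs;   /* workers 0..rem-1 get one extra  */
        *start = rank * base + (rank < rem ? rank : rem);
        *end   = *start + base + (rank < rem ? 1 : 0);
    }

    int main(void) {
        int start, end;
        for (int rank = 0; rank < 4; rank++) {
            partition(10, 4, rank, &start, &end);
            printf("worker %d handles [%d, %d)\n", rank, start, end);
        }
        return 0;
    }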

b. Communication Considerations [1]
- Latency: the time it takes to send a minimal (0-byte) message from point A to point B (a simple cost sketch follows this list)
- Bandwidth: the amount of data that can be communicated per unit of time
- Asynchronous vs. synchronous communication
- Point-to-Point: involves two tasks, with one task acting as the sender/producer of data and the other acting as the receiver/consumer
- Collective: involves data sharing among more than two tasks, which are often specified as being members of a common group, or collective
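A rough first-order cost model (an assumption for illustration, not from [1]) treats the time to send a message as latency plus message size divided by bandwidth; the sketch below uses made-up link numbers to show why small messages are latency-bound and large ones bandwidth-bound.

    /* First-order communication cost sketch (assumed example):
       time = latency + message_size / bandwidth.
       The latency and bandwidth values are illustrative only. */
    #include <stdio.h>

    int main(void) {
        double latency_s  = 1e-6;   /* assumed 1 microsecond per message */
        double bandwidth  = 10e9;   /* assumed 10 GB/s link              */
        double sizes[] = { 8.0, 1024.0, 1024.0 * 1024.0 };

        for (int i = 0; i < 3; i++) {
            double t = latency_s + sizes[i] / bandwidth;
            printf("%10.0f bytes -> %8.3f us\n", sizes[i], t * 1e6);
        }
        return 0;
    }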

c. Load Balancing

Load balancing refers to the practice of distributing work among tasks so that all tasks are kept busy all of the time.

How? [1]
- Equally partition the workload among tasks
- Use dynamic work assignment (see the sketch after this list)
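A sketch of dynamic work assignment using OpenMP's schedule(dynamic) clause (an assumed example; the work() function is a made-up stand-in for uneven work): idle threads grab the next iteration, so expensive iterations do not pile up on one thread.

    /* Dynamic work assignment sketch (assumed example). */
    #include <stdio.h>
    #include <omp.h>

    /* Dummy work whose cost grows with i, creating load imbalance. */
    static double work(int i) {
        double x = 0.0;
        for (long k = 0; k < (long)(i + 1) * 1000000L; k++)
            x += 1.0 / (k + 1.0);
        return x;
    }

    int main(void) {
        /* schedule(dynamic, 1): iterations are handed out one at a
           time to whichever thread becomes idle first. */
        #pragma omp parallel for schedule(dynamic, 1)
        for (int i = 0; i < 16; i++) {
            double r = work(i);
            printf("iteration %2d (result %.2f) done by thread %d\n",
                   i, r, omp_get_thread_num());
        }
        return 0;
    }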

d. Granularity [1]

Granularity is measured by the computation/communication ratio, because periods of computation are typically separated from periods of communication by synchronization events.

- Coarse-grain Parallelism: relatively large amounts of computational work are done between communication/synchronization events
- Fine-grain Parallelism: relatively small amounts of computational work are done between communication events

[References]

[1] Parallel Programming Tutorial by LLNL
[2] Parallel Programming Models and Paradigms
[3] Book - Designing and Building Parallel Programs
[4] Book - Introduction to Parallel Computing, 2nd Edition
[5] Book - Parallel and Distributed Programming Using C++
[6] Book - Patterns for Parallel Programming