operating on the principle that large problems can often be divided
into smaller ones, which are then solved concurrently ("in parallel").
The core motivation for parallelizing is to speed up computing tasks.
1. Types of Parallelism
There are various forms of parallel processing:
- Bit-level Parallelism, in which each bit of a word is processed at the same time
- Instruction-level Parallelism, in which multiple instructions execute simultaneously
- Data Parallelism (a.k.a. loop-level parallelism), which focuses on distributing data across different parallel processing units
- Task Parallelism (a.k.a. functional parallelism), which focuses on distributing execution tasks (code + data) across different parallel processing units
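The difference between data and task parallelism can be sketched with Python's standard library (the function names here are illustrative, not from any particular framework):

```python
from concurrent.futures import ThreadPoolExecutor

def data_parallel_sum(data, workers=4):
    # Data parallelism: the same operation (sum) runs on different
    # chunks of the data, one chunk per worker.
    n = max(1, len(data) // workers)
    chunks = [data[i:i + n] for i in range(0, len(data), n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, chunks))

def task_parallel_min_max(data):
    # Task parallelism: different operations (min and max) run
    # concurrently on the same data.
    with ThreadPoolExecutor(max_workers=2) as pool:
        lo = pool.submit(min, data)
        hi = pool.submit(max, data)
        return lo.result(), hi.result()
```

In the first function the same operation is applied to different slices of the data; in the second, different operations run concurrently over the same data.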
2. Types of Parallel Computer
Using Flynn's Taxonomy, computer architecture can be divided into:
- Single Instruction, Single Data stream (SISD)
- Single Instruction, Multiple Data streams (SIMD)
- Multiple Instruction, Single Data stream (MISD)
- Multiple Instruction, Multiple Data streams (MIMD)
Today's parallel computers are almost all of the MIMD type. At a more coarse-grained level, parallel computers can be further divided into:
- SPMD Single Program, Multiple Data
- MPMD Multiple Program, Multiple Data
According to the memory architecture, parallel computers can be divided into:
- Shared Memory Computer
In this kind of computer, every processing node shares the same global
memory address space. Programming these computers can be as easy as
programming a multicore workstation. However, since a bus is shared
among all processing nodes, the scale is limited (usually to several
tens of nodes), as bus contention becomes the bottleneck when the
scale grows.
Shared memory computer can be further divided into two kinds:
- UMA/cc-UMA, all processing nodes share the same physical memory device through a bus
- NUMA/cc-NUMA, each processing node has local physical memory that is also accessible by other nodes, but the access time depends on the memory location relative to the accessing node
- Distributed Memory Computer
In this kind of computer, each node has its own private memory address
space and cannot access other nodes' memory directly. Processing nodes
are usually connected by some kind of interconnection network.
Distributed memory computers can scale to very large sizes, since no
bus contention occurs, but it is more complicated to write programs
for this kind of computer.
- Distributed Shared Memory Computer
The hardware architecture of this kind of computer is usually the same
as a Distributed Memory system, but the interface it presents to
application developers is that of a Shared Memory system.
DSM is usually implemented as a software extension to the OS, which incurs some performance penalty.
3. Parallel Programming Model
With these powerful computers in hand, how do we program them?
3.1 Conceptually, there are two models for writing parallel programs.
Threads Model
In this model, a single process spawns multiple threads that run
concurrently and share the process's address space. There are two
well-known API interfaces for multi-threading:
- POSIX Threads (Pthreads), a library-based standard in which the programmer creates and manages threads explicitly
- OpenMP (Open Multi-Processing), which uses compiler directives, environment variables, and a runtime library to provide threading support
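As a rough sketch of the threads model (in Python rather than Pthreads or OpenMP, purely for illustration), several threads update a counter living in the shared address space, guarded by a lock:

```python
import threading

def shared_counter_demo(n_threads=4, increments=1000):
    # All threads share the process's address space, so they can update
    # the same object directly -- but a lock is needed to avoid races.
    counter = {"value": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            with lock:                  # serialize the read-modify-write
                counter["value"] += 1

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"]
```

No explicit communication is needed: the threads simply see each other's writes because the memory is shared. The lock plays the role that synchronization directives play in OpenMP.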
Message Passing Model
In this model, a set of tasks use their own local memory for
computation; communication among these tasks is conducted by
sending and receiving network messages.
There are two standards:
- MPI (Message Passing Interface)
- PVM (Parallel Virtual Machine)
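A minimal sketch of the message-passing model, using a multiprocessing pipe as a stand-in for MPI-style send/receive (the function names are made up for illustration):

```python
from multiprocessing import Process, Pipe

def _producer(conn, values):
    # Sender: data reaches the other task only via explicit sends.
    for v in values:
        conn.send(v)
    conn.send(None)              # sentinel: no more messages
    conn.close()

def _consumer(conn, result_conn):
    # Receiver: recv() blocks until a message arrives.
    total = 0
    while True:
        v = conn.recv()
        if v is None:
            break
        total += v
    result_conn.send(total)
    result_conn.close()

def message_passing_demo(values):
    left, right = Pipe()                     # channel between the two tasks
    result_recv, result_send = Pipe(duplex=False)
    p = Process(target=_producer, args=(left, values))
    c = Process(target=_consumer, args=(right, result_send))
    p.start()
    c.start()
    total = result_recv.recv()               # collect the consumer's answer
    p.join()
    c.join()
    return total
```

Each process has its own private memory; the only way data moves between them is the explicit send/receive pair, which is the essence of MPI's programming model.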
Typical message-passing patterns include point-to-point send/receive, broadcast, scatter/gather, and reduction.
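For example, the scatter/compute/gather/reduce pattern can be sketched with a process pool (again an illustration, not MPI code):

```python
from multiprocessing import Pool

def _square(x):
    # The computation applied to each scattered piece of data.
    return x * x

def scatter_gather_reduce(data, workers=2):
    with Pool(processes=workers) as pool:
        # "scatter" + compute: map hands elements out across workers;
        # "gather": map collects the per-element results in order.
        partials = pool.map(_square, data)
    return sum(partials)                     # "reduce": combine the results
```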
3.2 Other factors to Consider when Designing Parallel Programs
a. Problem Decomposition/Partitioning
- Data Partitioning
- Functional Partitioning
b. Communication Considerations
- Latency, is the time it takes to send a minimal (0 byte) message from point A to point B.
- Bandwidth, is the amount of data that can be communicated per unit of time.
- Async vs Sync, whether a send blocks until the matching receive completes (synchronous) or returns immediately (asynchronous)
- Point-to-point communication, involves two tasks with one task acting as the sender/producer of data, and the other acting as the receiver/consumer
- Collective communication, involves data sharing between more than two tasks, which are often specified as being members in a common group, or collective
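A simple cost model makes the latency/bandwidth trade-off concrete: one large message pays the fixed latency once, while many small messages pay it on every send (all numbers below are made-up example values):

```python
def transfer_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    # Cost model: fixed per-message latency plus payload over bandwidth.
    return latency_s + message_bytes / bandwidth_bytes_per_s

# Hypothetical link: 1 microsecond latency, 1 GB/s bandwidth.
LATENCY, BANDWIDTH = 1e-6, 1e9

# One 1 MiB message pays the latency once...
one_big = transfer_time(2**20, LATENCY, BANDWIDTH)
# ...while 1024 messages of 1 KiB pay it 1024 times.
many_small = 1024 * transfer_time(2**10, LATENCY, BANDWIDTH)
```

This is why batching many small messages into fewer large ones is a standard optimization in message-passing programs.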
c. Load Balancing
Load balancing refers to the practice of distributing work among tasks so that all tasks are kept busy all of the time. Common approaches:
- Equally partition workload
- Use dynamic work assignment
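Dynamic work assignment can be sketched as a shared work queue from which idle workers pull the next task, so faster workers naturally take on more of the work (the squaring step stands in for a real unit of work):

```python
import queue
import threading

def dynamic_workers(tasks, n_workers=4):
    # A shared work queue: each idle worker pulls the next task,
    # instead of being assigned a fixed share up front.
    work = queue.Queue()
    for t in tasks:
        work.put(t)
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            try:
                t = work.get_nowait()
            except queue.Empty:
                return                  # no work left; this worker exits
            r = t * t                   # stand-in for a real unit of work
            with lock:
                results.append(r)

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return sorted(results)
```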
d. Granularity
Granularity is measured by the computation-to-communication ratio, since periods of computation are typically separated from periods of communication by synchronization events.
- Coarse-grain Parallelism, relatively large amounts of computational work are done between communication/synchronization events
- Fine-grain Parallelism, relatively small amounts of computational work are done between communication events
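A back-of-the-envelope computation/communication ratio shows how chunk size sets the granularity (all numbers here are hypothetical):

```python
def comp_comm_ratio(work_per_item_s, items_per_chunk, comm_overhead_s):
    # Useful computation per chunk divided by the communication/
    # synchronization cost paid once per chunk.
    return (work_per_item_s * items_per_chunk) / comm_overhead_s

# Same per-item work and per-chunk overhead; only the chunk size varies.
coarse = comp_comm_ratio(1e-6, 10_000, 1e-4)   # large chunks -> high ratio
fine = comp_comm_ratio(1e-6, 10, 1e-4)         # small chunks -> low ratio
```

A ratio well above 1 (coarse grain) means the overhead is amortized; a ratio below 1 (fine grain) means the program spends more time communicating than computing.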