Towards Automatic and Adaptive Optimizations of MPI Collective Operations
by Jelena Pjesivac-Grbovic
Abstract
Message passing interface (MPI) and collective operations (CO)
 CO: the subset of the MPI standard that deals with
 synchronization,
 data exchange and
 computation among a group of processes.
 CO can be a performance bottleneck
 CO parameters:
 input parameters (e.g., communicator and message size);
 system characteristics (e.g., interconnect type);
 the application computation and communication pattern;
 internal algorithm parameters (e.g., internal segment size); an algorithm together with its internal parameters is referred to as a “method”
How to improve the performance of MPI collective operations
 In our framework, during a collective call, a system-specific decision function is invoked to select the most appropriate method for the particular collective instance.
 This dissertation focuses on automatic techniques for system-specific decision function generation.
 Our approach takes the following steps:
 we collect method performance information on the system of interest;
 we analyze this information using
 parallel communication models,
 graphical encoding methods, and
 decision trees;
 based on the previous step, we automatically generate the system-specific decision function to be used at runtime. In situations where detailed performance measurement is not feasible, method performance models can be used to supplement the measured method performance information.
Ch.1 Introduction
Limits of single processor systems
$\rightarrow$ Parallel computing
 High-performance computing: the TOP500 project
Difficulty of parallel software programming
 Level of parallelism
 Instruction-level
 Thread-level
 Process-level
Programmer’s view of system memory
 Shared memory
 Every process can access remote data seamlessly (with a performance penalty)
 Amenable to fine-grained parallelism
 Distributed memory
 Explicit message passing
 See also: UMA and NUMA (uniform vs. non-uniform memory access)
Message passing interface (MPI)
 Helps library and application developers create portable, high-performance code more easily
 Allows system vendors to exploit their specialized hardware features
Collective operations (CO)
 An important subset of the MPI standard.
 Operations used to exchange information among a group of processes
 Commonly used, and often a performance bottleneck
 Performance depends on:
 System properties
 Algorithm and its internal parameters (e.g., segment size), together referred to as a method
Performance improvement of MPI COs
 Runtime collective method selection process
 Techniques used to automatically build adaptable, efficient, and fast runtime decision functions:
 Performance models (the thesis contains 105 performance models for 35 different collective algorithms)
 Graphical encoding
 Statistical learning
 Interconnects covered: Fast Ethernet, GigE, Myrinet MX, InfiniBand, and Cray-specific Portals
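
As an illustration of the runtime method selection idea above, here is a minimal C sketch of what a generated, system-specific decision function might look like; the algorithm set, thresholds, and segment sizes are hypothetical placeholders, not values from the thesis:

```c
#include <stddef.h>

/* Hypothetical broadcast method set; a real generated function would
 * enumerate the algorithms available in the MPI implementation. */
typedef enum {
    BCAST_LINEAR, BCAST_BINOMIAL, BCAST_SPLIT_BINARY, BCAST_PIPELINE
} bcast_alg_t;

/* A system-specific decision function maps (communicator size, message
 * size) to a method: an algorithm plus its internal parameters (here,
 * the segment size). All cut-off points below are invented for
 * illustration; the thesis derives them from measured data. */
static bcast_alg_t choose_bcast(int comm_size, size_t msg_size,
                                size_t *segsize)
{
    *segsize = 0;                           /* 0 = no segmentation */
    if (comm_size <= 4)
        return BCAST_LINEAR;
    if (msg_size < 8 * 1024)
        return BCAST_BINOMIAL;
    if (msg_size < 512 * 1024) {
        *segsize = 8 * 1024;
        return BCAST_SPLIT_BINARY;
    }
    *segsize = 64 * 1024;
    return BCAST_PIPELINE;
}
```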
Ch.2 Message Passing Interface
2.1 MPI Standard

MPI-1 standard (MPI Forum, 1995)
• Point-to-point communication
• (Intracommunicator) Collective operations
• Process groups
• Communication contexts (communicators)
• Process topologies
• Bindings for FORTRAN 77 and C
• Environmental management and inquiry
• Profiling interface

MPI-2 standard (MPI Forum, 1997) extends MPI-1.2 with …
• One-sided / explicit shared-memory operations
• I/O functions
• Dynamic process management
• Explicit support for threads
• Bindings for C++
• Intercommunicator collective operations (between two communicators)
2.2 MPI collective operations

Barrier :
 Synchronization routine. Blocks the calling process until all processes have reached this synchronization point

Broadcast :
 Broadcast from a root process to all processes

Scatter :
 Scatter from a root process to all processes
 The $i^{th}$ block of data is sent to the $i^{th}$ process
 Inverse operation of Gather
 Similar to transposing a row vector (all the data held in one process) into a column vector (each process holds the element matching its rank); a usage sketch follows
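
A minimal, runnable sketch of the Scatter semantics above, using the standard MPI API (run with e.g. `mpirun -np 4`):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, recv;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int send[size];                         /* significant only at root */
    for (int i = 0; i < size; i++)
        send[i] = 100 + i;                  /* block i goes to rank i */

    MPI_Scatter(send, 1, MPI_INT, &recv, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d received %d\n", rank, recv);  /* rank i prints 100+i */

    MPI_Finalize();
    return 0;
}
```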

Gather :
 Gathers data, ordered by process rank, to the root process
 Similar to transposing a column vector into a row vector

AllGather :
 Same as Gather, except that the result of the operation is available on all processes
 Gather followed by Broadcast

AlltoAll :
 Total exchange among the processes
 Send to all, receive from all
 e.g.,
 process $r$ sends block $j$ to process $j$
 process $j$ receives it as block $r$
 A Scatter from every process (one per row vector)
 Similar to a matrix transposition; a sketch follows
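
The matrix-transposition analogy as a small runnable sketch: rank $r$ contributes row $r$ of a $P \times P$ matrix and ends up holding column $r$ (standard MPI API):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int row[size], col[size];
    for (int j = 0; j < size; j++)
        row[j] = rank * size + j;           /* A[rank][j] */

    /* Block j of rank r's send buffer becomes block r at rank j. */
    MPI_Alltoall(row, 1, MPI_INT, col, 1, MPI_INT, MPI_COMM_WORLD);

    for (int j = 0; j < size; j++)
        printf("rank %d: col[%d] = %d\n", rank, j, col[j]);  /* A[j][rank] */

    MPI_Finalize();
    return 0;
}
```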

Reduce :
 Combines the elements in the send buffers of all processes using a specified operation
 The result is stored in the $recvbuf$ at the root process

ReduceScatter :
 Reduce followed by Scatter

AllReduce :
 Reduce followed by Broadcast

Scan :
 Prefix reduction across the processes
 The receive buffer at process $k$ holds the reduction of the values in the send buffers of processes with ranks $0$ through $k$

ExScan :
 Exclusive scan: reduction of the values from processes with ranks $0$ through $(k-1)$; a sketch of both follows
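
A short runnable sketch of the inclusive/exclusive prefix semantics using the standard MPI calls:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, val, incl, excl = 0;  /* excl pre-initialized: undefined at rank 0 */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    val = rank + 1;
    /* incl at rank k = sum of val over ranks 0..k */
    MPI_Scan(&val, &incl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    /* excl at rank k = sum of val over ranks 0..k-1 */
    MPI_Exscan(&val, &excl, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: scan = %d, exscan = %d\n", rank, incl, excl);
    MPI_Finalize();
    return 0;
}
```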

AlltoAllw :
 Generalized AlltoAll
 Each process receives messages with different data sizes and datatypes from the others
[Illustrations omitted: diagrams of Broadcast, Scatter, Gather, AllGather, AlltoAll, Reduce, AllReduce, ReduceScatter, Scan, and ExScan]
Ch.3 Literature Review
3.1 MPI implementations and COs

MPICH algorithms

• allreduce: recursive doubling, Rabenseifner’s algorithm, and reduce + broadcast.
• reduce: binomial tree and Rabenseifner’s algorithm.
• allgather: Bruck, recursive doubling, and ring algorithms.
• alltoall: Bruck, linear, and two versions of pairwise exchange (for the power-of-two and non-power-of-two process cases).
• broadcast: binomial tree and scatter + allgather (implemented in a number of ways).
• barrier: Bruck (dissemination) algorithm.

FT-MPI (fault-tolerant MPI)
 In the case of a failure, FT-MPI can
 abort the job (non-FT behavior),
 respawn the dead process,
 shrink/resize the application to remove the missing processes, or
 leave the application as is, creating holes in the MPI_COMM_WORLD communicator.
 In all cases, FT-MPI ensures that in-flight messages are either canceled or received.

Open MPI
 An open-source, peer-reviewed, high-performance, production-quality MPI implementation.
 Three layers built on the Modular Component Architecture (MCA)
 MPI layer: implements MPI semantics
 Runtime environment: provides a resource manager, global data store, messaging layer, and a peer-discovery system
 Portability layer: useful functions and data structures
 Components in the collective framework of Open MPI
 Basic, Self, Tuned, Hierarchical, Shared memory, Non-blocking

COs in hardware
 High-performance networks
 Myrinet’s MX, OpenFabrics (InfiniBand), and Quadrics
 Network interface cards (NICs)
 Offload protocol processing from the host CPU
 Bypass the OS
 Interact directly with MPI processes
 Point-to-point communication
 NIC-based COs
 Better than software implementations based on point-to-point comm (comm = communication)
 More consistent performance than the software-based approach
 Not subject to CPU scheduling
 Worse than some hardware-based collectives
 IBM’s Blue Gene/L
 Multiple networks in a system
 Torus network for point-to-point comm
 Collective network for optimized collectives and comm with I/O nodes
 Global interrupt network
 Details
 Broadcast: mesh algorithm on the torus network
 Uses the multicast capability of the torus network
 Packet-level pipelining
 AlltoAll: linear algorithm
 Major limitation: usable only on a full system partition
3.2 Parallel communication models
 Most commonly used parallel comm models:
 Grama, Ananth, et al. Introduction to parallel computing, 2003.
 Introduced basic collective comm operations
 Message splitting

Hockney :
 To assess the performance of allgather, broadcast, alltoall, reduce-scatter, reduce, and allreduce collectives

LogP/LogGP :
 To find an optimal algorithm and parameters for topology-aware collective operations incorporated in the MagPIe library

PLogP:
 To evaluate the performance of broadcast and scatter operations on intra-cluster communication.
3.3 Algorithm selection and automatic tuning
 Exhaustive testing
 Statistical learning methods
 Support vector methods
 Bayesian decision rule approach
 Markov decision process
 (Non)parametric (geometric) modeling
Ch.4 Parallel Communication Models
4.1 Algorithms for MPI COs
1. Virtual topologies

Classification by data direction
 Onetomany
 Broadcast
 Scatter
 Manytoone
 Reduce
 Gather
 Manytomany
 Barrier
 AlltoAll
 AllReduce
 AllGather
 Expressed as unidirectional data flow:
 receive data from preceding node(s),
 process data, if required,
 send data to succeeding node(s).

Virtual topologies determine the preceding and succeeding nodes in the algorithm
 Goal: Grow balanced trees
 Sometimes less-balanced trees are beneficial
2. Collective Algorithms
This section discusses different algorithms for MPI collective operations.

Barrier
 Double ring (token-ring algorithm)
 2P steps (P: the number of processes in the communicator)
 Fan-in-fan-out (central-server algorithm)
 2P messages
 Zero-byte gather operation followed by a zero-byte broadcast operation
 With tree-based virtual topologies, the number of steps drops from O(P) to O($log_2(P)$)
 Recursive doubling (figure)
 O($log_2(P)$) steps
 At step $k$, rank $r$ exchanges a zero-byte message with rank $(r\ XOR\ 2^k)$.
 After $log_2(P)$ steps, the algorithm guarantees that all processes have entered the barrier, so everyone may leave.
 Needs more steps when P is not an exact power of two.
 Bruck / dissemination algorithm (figure; a point-to-point sketch follows this list)
 $\lceil log_2(P) \rceil$ steps, regardless of P.
 At step $k$, process $r$ sends a message to rank $(r + 2^k)$ and receives a message from rank $(r - 2^k)$, with wrap-around.
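
A minimal sketch of the dissemination barrier built from standard point-to-point calls; it works for any P and takes $\lceil log_2(P) \rceil$ rounds:

```c
#include <mpi.h>

/* Dissemination barrier: in round k, exchange a zero-byte message with
 * the ranks at distance 2^k (with wrap-around). */
void dissemination_barrier(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int dist = 1; dist < size; dist <<= 1) {
        int to   = (rank + dist) % size;
        int from = (rank - dist + size) % size;
        /* Zero-byte payload: pure synchronization, no data. */
        MPI_Sendrecv(NULL, 0, MPI_BYTE, to,   0,
                     NULL, 0, MPI_BYTE, from, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}
```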

Broadcast
 Generalized broadcast with virtual topologies (sketched after this list)
 Implements the broadcast operation as a communication pipeline
 For each message segment, process $r$ receives segment $s$ from its parent $parent(r)$ and forwards it to all of its children $children(r)$.
 Virtual topologies: pipeline, flat tree, binomial tree
 Split-binary tree algorithm
 An optimization of the regular binary-tree broadcast algorithm
 The message is split in half; the left and right halves are sent down the left and right subtrees, respectively.
 Leaf processes obtain the missing half from the process with the opposite rank in the other subtree (or from the root)
 Effect: roughly halves the bandwidth requirement per link, at the cost of one extra step
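
A sketch of the generalized (segmented) broadcast loop; `parent()`, `num_children()`, and `child()` are hypothetical helpers standing in for whichever virtual topology is chosen (pipeline, flat tree, binomial tree, ...):

```c
#include <mpi.h>
#include <stddef.h>

/* Hypothetical virtual-topology helpers (not part of MPI): */
int parent(int rank);            /* parent of rank in the virtual tree */
int num_children(int rank);      /* number of children of rank         */
int child(int rank, int i);      /* i-th child of rank                 */

/* For each segment: receive it from the parent, forward to all children.
 * With a pipeline topology this overlaps segments along the chain.
 * Assumes segsize > 0 and an initialized communicator. */
void generalized_bcast(char *buf, size_t len, size_t segsize,
                       int root, MPI_Comm comm)
{
    int rank;
    MPI_Comm_rank(comm, &rank);
    for (size_t off = 0; off < len; off += segsize) {
        int count = (int)((off + segsize <= len) ? segsize : len - off);
        if (rank != root)
            MPI_Recv(buf + off, count, MPI_BYTE, parent(rank), 0,
                     comm, MPI_STATUS_IGNORE);
        for (int i = 0; i < num_children(rank); i++)
            MPI_Send(buf + off, count, MPI_BYTE, child(rank, i), 0, comm);
    }
}
```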

Scatter(v)
 Linear algorithm
 Binomial algorithm

Gather(v)
 Linear algorithm without synchronization
 Linear algorithm with synchronization
 Binomial algorithm

Allgather(v)
 Gather + Broadcast
 Bruck
 Recursive doubling
 Ring
 Neighbor exchange

Alltoall(v/w)
 Linear without synchronization
 Linear with synchronization
 Pairwise exchange
 Bruck algorithm

Reduce
 Generalized reduce with virtual topologies
 Rabenseifner’s algorithm and its variations

Reducescatter
 Reduce + scatterv
 Recursive halving algorithm
 Ring

Allreduce
 Reduce + Broadcast (written out in the sketch after this list)
 Recursive doubling
 Rabenseifner’s algorithm
 Ring without segmentation
 Ring with segmentation
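
The reduce + broadcast composition listed above, written out with standard MPI calls; a tuned MPI_Allreduce usually beats this naive composition, which is exactly why method selection matters:

```c
#include <mpi.h>

/* Allreduce expressed as reduce-to-root followed by broadcast. */
void allreduce_via_reduce_bcast(const double *in, double *out, int count,
                                MPI_Comm comm)
{
    MPI_Reduce(in, out, count, MPI_DOUBLE, MPI_SUM, 0, comm);
    MPI_Bcast(out, count, MPI_DOUBLE, 0, comm);
}
```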

Scan
 Linear without segmentation
 Linear with segmentation
 Binomial algorithm
 Exscan
To be updated …
4.2 Parallel communication models
1. Modeling pointtopoint communication

Hockney model

LogP/LogGP models

PLogP model
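
For reference, the standard point-to-point cost each model assigns to an $m$-byte message (parameter names follow the literature):

$$
\begin{aligned}
\text{Hockney:}\quad & T(m) = \alpha + \beta m && \alpha:\ \text{latency},\ \beta:\ \text{per-byte transfer time}\\
\text{LogP:}\quad & T = L + 2o && \text{small fixed-size messages; gap } g \text{ bounds injection rate}\\
\text{LogGP:}\quad & T(m) = L + 2o + (m-1)G && G:\ \text{per-byte gap for long messages}\\
\text{PLogP:}\quad & T(m) = L + g(m) && o_s(m),\ o_r(m),\ g(m)\ \text{are functions of } m
\end{aligned}
$$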
2. Modeling computation
4.3 Performance models of MPI collective operations
1. Building a performance model: split-binary broadcast

Hockney model

LogP/LogGP models

PLogP model
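
The outline records no formulas here; as one hedged illustration, under the Hockney model a segmented pipeline broadcast of $m$ bytes split into $n_s$ segments of size $m_s$ over $P$ processes costs about

$$
T_{\text{pipeline}}(m) \approx (P - 2 + n_s)\,(\alpha + m_s\,\beta),
$$

since the pipeline is $P-1$ hops deep and the last segment enters it after $n_s - 1$ steps. The thesis derives the exact split-binary expressions, which differ in the tree depth and the halved per-subtree traffic.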
2. Building a performance model: linear gather with synchronization

Hockney model

LogP/LogGP models

PLogP model
3. Building a performance model: recursive doubling allgather

Hockney model

LogP/LogGP models

PLogP model
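
Under the Hockney model the recursive-doubling allgather has a standard closed form (also derived in Thakur et al., 2005, cited below): with per-process data of $m$ bytes, step $k$ exchanges $2^k m$ bytes, so

$$
T_{\text{rd-allgather}} = \sum_{k=0}^{\log_2 P - 1}\left(\alpha + 2^k\,m\,\beta\right) = \log_2(P)\,\alpha + (P-1)\,m\,\beta .
$$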
4. Performance models of collective algorithms
4.4 Evaluation of MPI collective operation models
1. Model parameters
2. Performance of different collective algorithms

Barrier performance

Reduce performance

Alltoall performance
3. Final comments about parallel computation models
Reference
 Jelena Pjesivac-Grbovic, “Towards Automatic and Adaptive Optimizations of MPI Collective Operations,” PhD thesis, University of Tennessee, 2007.
 Rajeev Thakur, Rolf Rabenseifner, and William Gropp, “Optimization of collective communication operations in MPICH,” The International Journal of High Performance Computing Applications 19.1 (2005): 49–66.