Functions | Variables
genericParallelization.c File Reference

Generic master-worker parallelization with either pthreads or MPI. More...

Functions

void perSiteLogLikelihoodsPthreads (pllInstance *tr, partitionList *pr, double *lhs, int n, int tid)
 
void broadcastAfterRateOpt (pllInstance *tr, pllInstance *localTree, partitionList *pr, int n, int tid)
 broadcast rates after rate optimization. More...
 
void branchLength_parallelReduce (pllInstance *tr, double *dlnLdlz, double *d2lnLdlz2, int numBranches)
 Reduce the first and second derivative of the likelihood function. More...
 
void pllMasterPostBarrier (pllInstance *tr, partitionList *pr, int jobType)
 Cleanup step once the master barrier succeeded. More...
 
char * getJobName (int type)
 helper that yields a string representation of a parallel region. More...
 
void pllLockMPI (pllInstance *tr)
 Lock the MPI slave processes prior to allocating partitions. More...
 
void pllFinalizeMPI (void)
 
void pllInitMPI (int *argc, char **argv[])
 Sets up the MPI environment. More...
 
void pinToCore (int tid)
 Pins a thread to a core (for efficiency). More...
 
void pllStartPthreads (pllInstance *tr, partitionList *pr)
 
void pllStopPthreads (pllInstance *tr)
 
boolean isThisMyPartition (partitionList *localPr, int tid, int model)
 Check if a partition is assigned to a thread/process. More...
 
void pllMasterBarrier (pllInstance *tr, partitionList *pr, int jobType)
 A generic master barrier for executing parallel parts of the code. More...
 

Static functions

static void distributeYVectors (pllInstance *localTree, pllInstance *tr, partitionList *localPr)
 Distribute y-vectors during initialization. More...
 
static void distributeWeights (pllInstance *localTree, pllInstance *tr, partitionList *localPr)
 Distribute the weights in the alignment of slave process/threads. More...
 
static boolean execFunction (pllInstance *tr, pllInstance *localTree, partitionList *pr, partitionList *localPr, int tid, int n)
 Generic entry point for parallel regions (mostly broadcasts traversal descriptor first). More...
 
static void * likelihoodThread (void *tData)
 
static void multiprocessorScheduling (pllInstance *tr, partitionList *pr, int tid)
 Top-level function for the multiprocessor scheduling scheme (assigns full partitions to workers). More...
 
static void computeFraction (partitionList *localPr, int tid, int n)
 Computes partition size for all partitions (for cyclic distribution of sites). More...
 
static void computeFractionMany (partitionList *localPr, int tid)
 Computes partition size for all partitions (in case full partitions are assigned to workers). More...
 
static void initializePartitionsMaster (pllInstance *tr, pllInstance *localTree, partitionList *pr, partitionList *localPr, int tid, int n)
 Initialize the partitioning scheme (master function) in parallel environment. More...
 
static char * addBytes (char *buf, void *toAdd, size_t numBytes)
 Pthreads helper function for adding bytes to communication buffer. More...
 
static char * popBytes (char *buf, void *result, size_t numBytes)
 Pthreads helper function for removing bytes from communication buffer. More...
 
static void defineTraversalInfoMPI (void)
 Create a datastructure for sending the traversal descriptor. More...
 
static boolean pllWorkerTrap (pllInstance *tr, partitionList *pr)
 Traps worker MPI processes. More...
 
static int partCompare (const void *p1, const void *p2)
 Compare partition sizes. More...
 
static int doublesToBuffer (double *buf, double *srcTar, pllInstance *tr, partitionList *pr, int n, int tid, boolean read, boolean countOnly)
 Reads rates from, or writes rates into, a buffer. Returns the number of elements written. More...
 
static void collectDouble (double *dst, double *src, pllInstance *tr, partitionList *pr, int n, int tid)
 Collect doubles from workers to master. More...
 
static void broadCastAlpha (partitionList *localPr, partitionList *pr)
 broadcast a new alpha (for the GAMMA model) More...
 
static void broadCastRates (partitionList *localPr, partitionList *pr)
 Master broadcasts rates. More...
 
static void reduceEvaluateIterative (pllInstance *tr, pllInstance *localTree, partitionList *localPr, int tid, boolean getPerSiteLikelihoods)
 Evaluate the likelihood of this topology (PThreads/MPI implementation) More...
 
static void broadcastTraversalInfo (pllInstance *localTree, pllInstance *tr, partitionList *localPr)
 
static void assignAndInitPart1 (pllInstance *localTree, pllInstance *tr, partitionList *localPr, partitionList *pr, int *tid)
 Initialize structures for slave process/threads. More...
 

Variables

static pthread_t * threads
 
static threadData * tData
 
volatile int jobCycle
 
volatile int threadJob
 
boolean treeIsInitialized
 
double masterTimePerPhase
 
double timeBuffer [NUM_PAR_JOBS]
 
double timePerRegion [NUM_PAR_JOBS]
 
volatile char * barrierBuffer
 
MPI_Datatype TRAVERSAL_MPI
 

Detailed Description

Generic master-worker parallelization with either pthreads or MPI.

Worker threads/processes mostly work on a local tree. Implementation-wise, MPI operations are abstracted as well as possible via defines (that translate to no-ops or memcpy calls in the pthreads version).

Todo:
the code still contains many memory copy operations that could be executed more efficiently in-place

Function Documentation

static char * addBytes (char *buf, void *toAdd, size_t numBytes)

Pthreads helper function for adding bytes to communication buffer.

Copies numBytes bytes from toAdd to buf.

Parameters
buf Where to place the bytes

toAdd Where to copy them from

numBytes How many to copy

Returns
Pointer to the end of placed data in communication buffer (first free slot)
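
The following is a minimal, hypothetical sketch of what such a buffer helper typically does (copy and advance); it is an illustration, not necessarily the library's exact implementation.

    #include <string.h>

    /* Sketch: append numBytes bytes from toAdd to the communication buffer
       and return a pointer to the next free slot. */
    static char * addBytesSketch (char *buf, void *toAdd, size_t numBytes)
    {
      memcpy (buf, toAdd, numBytes);   /* place the data at the current write position */
      return buf + numBytes;           /* first free slot after the data just written  */
    }
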
static void assignAndInitPart1 (pllInstance *localTree, pllInstance *tr, partitionList *localPr, partitionList *pr, int *tid)

Initialize structures for slave process/threads.

Allocate all memory structures required by slave threads/processes

Parameters
tr PLL instance
localTree A local PLL instance for the slave process/thread which is initialized in this function based on tr

pr List of partitions

localPr A local list of partitions for the slave process/thread which will be initialized based on pr

tid The slave process/thread ID

Note
This function should never be called by the master thread, but it is called by the master process in the MPI implementation.

void branchLength_parallelReduce (pllInstance *tr, double *dlnLdlz, double *d2lnLdlz2, int numBranches)

Reduce the first and second derivative of the likelihood function.

We collect the first and second derivatives from the various threads and sum them up. It is similar to what we do in pllEvaluateGeneric(), with the only difference that we have to collect two values (first and second derivative) instead of only one (the log likelihood).

Warning
Operates on the global reduction buffer globalResult
Parameters
tr tree
dlnLdlz first derivative
d2lnLdlz2 second derivative
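
As an illustration of the reduction described above, the sketch below sums per-worker first and second derivatives into the output arrays; the per-worker buffer layout is an assumption and not the actual globalResult layout.

    /* Sketch: sum the partial derivatives contributed by each worker.
       perWorkerD1/perWorkerD2 are assumed to hold numBranches values per worker. */
    static void reduceDerivativesSketch (const double *perWorkerD1, const double *perWorkerD2,
                                         double *dlnLdlz, double *d2lnLdlz2,
                                         int numBranches, int numWorkers)
    {
      int b, w;

      for (b = 0; b < numBranches; b++)
      {
        dlnLdlz[b]   = 0.0;
        d2lnLdlz2[b] = 0.0;
        for (w = 0; w < numWorkers; w++)
        {
          dlnLdlz[b]   += perWorkerD1[w * numBranches + b];
          d2lnLdlz2[b] += perWorkerD2[w * numBranches + b];
        }
      }
    }
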
void broadcastAfterRateOpt (pllInstance *tr, pllInstance *localTree, partitionList *pr, int n, int tid)

broadcast rates after rate optimization.

Parameters
tr Library instance
localTree local library instance
n number of workers
tid worker id
Todo:
mpi_alltoallv/w may be more efficient, but it is a hell to set up

static void broadCastAlpha (partitionList *localPr, partitionList *pr)

broadcast a new alpha (for the GAMMA model)

Parameters
localPr local list of partitions
pr master list of partitions

static void broadCastRates (partitionList *localPr, partitionList *pr)

Master broadcasts rates.

Parameters
localPr local list of partitions
pr master list of partitions

static void collectDouble (double *dst, double *src, pllInstance *tr, partitionList *pr, int n, int tid)

Collect doubles from workers to master.

Parameters
dst destination array
src source array
tr library instance
n number of workers
tid worker id

static void computeFraction (partitionList *localPr, int tid, int n)

Computes partition size for all partitions (for cyclic distribution of sites)

Parameters
localPr the local partitions instance
tid thread id
n number of workers

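Since the sites of a partition are distributed cyclically, the width computation essentially reduces to counting the site indices a worker owns. A generic sketch (not the library's actual code) follows.

    /* Sketch: with cyclic distribution, worker tid owns every n-th site, so its
       local width for global site indices [lower, upper) is the number of
       indices i with i % n == tid. */
    static int cyclicWidth (int lower, int upper, int tid, int n)
    {
      int i, width = 0;

      for (i = lower; i < upper; i++)
        if (i % n == tid)
          width++;

      return width;
    }
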
static void computeFractionMany (partitionList *localPr, int tid)

Computes partition size for all partitions (in case full partitions are assigned to workers).

Parameters
localPr the local partitions instance
tid thread id

static void defineTraversalInfoMPI (void)

Create a datastructure for sending the traversal descriptor.

Note
This seems to be a very safe method for defining your own MPI datatypes (there are often problems with padding), but it is not entirely for the faint of heart...

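The sketch below shows the general "safe" technique the note refers to: taking real member addresses of a dummy instance so that compiler padding is handled automatically. The struct and its fields are purely illustrative, not the actual traversal descriptor.

    #include <mpi.h>

    typedef struct { int pNumber; int qNumber; double qz[8]; } exampleInfo;  /* hypothetical */

    static MPI_Datatype buildExampleType (void)
    {
      exampleInfo   dummy;
      MPI_Datatype  result;
      int           blockLengths[3] = { 1, 1, 8 };
      MPI_Datatype  types[3]        = { MPI_INT, MPI_INT, MPI_DOUBLE };
      MPI_Aint      base, displacements[3];

      /* compute member offsets from real addresses, so padding is accounted for */
      MPI_Get_address (&dummy,         &base);
      MPI_Get_address (&dummy.pNumber, &displacements[0]);
      MPI_Get_address (&dummy.qNumber, &displacements[1]);
      MPI_Get_address (&dummy.qz,      &displacements[2]);

      displacements[0] -= base;
      displacements[1] -= base;
      displacements[2] -= base;

      MPI_Type_create_struct (3, blockLengths, displacements, types, &result);
      MPI_Type_commit (&result);
      return result;
    }
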
static void distributeWeights (pllInstance *localTree, pllInstance *tr, partitionList *localPr)

Distribute the weights in the alignment of slave process/threads.

Allocate space in the local tree structure for the alignment weights. Then copy the weights vector from the master process/thread to the slaves.

Parameters
tr PLL instance
localTree Local library instance for the current process/thread
localPr Local list of partitions for the current process/thread
Todo:
The alignment weights should go to the partitions structure rather than the tree structure

static void distributeYVectors (pllInstance *localTree, pllInstance *tr, partitionList *localPr)

Distribute y-vectors during initialization.

Distribute the alignment data to the slave process/threads. Each slave copies the data (alignment) from its assigned partition to its local partition structure.

Parameters
tr PLL instance
localTree Local library instance for the current thread
localPr Local list of partitions structure for the current thread

static int doublesToBuffer (double *buf, double *srcTar, pllInstance *tr, partitionList *pr, int n, int tid, boolean read, boolean countOnly)

Reads rates from, or writes rates into, a buffer. Returns the number of elements written.

If read is set to PLL_TRUE, then the contents of srcTar are copied to buf. Otherwise, the contents of buf are moved to srcTar.

Parameters
buf Buffer
srcTar Pointer to either source or destination array
tr PLL instance
n number of workers
tid process id
read If read-mode then set to PLL_TRUE
countOnly if PLL_TRUE, simply return the number of elements

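A generic sketch of such a pack/unpack helper under cyclic site distribution is given below; the site loop and ownership test are assumptions made for illustration, not the actual implementation.

    /* Sketch: iterate over the global sites owned by worker tid (cyclic
       distribution); pack into buf, unpack back into srcTar, or just count. */
    static int doublesToBufferSketch (double *buf, double *srcTar, int numSites,
                                      int n, int tid, int read, int countOnly)
    {
      int i, written = 0;

      for (i = 0; i < numSites; i++)
      {
        if (i % n != tid)                      /* site belongs to another worker */
          continue;

        if (!countOnly)
        {
          if (read)
            buf[written] = srcTar[i];          /* pack:   source -> buffer      */
          else
            srcTar[i] = buf[written];          /* unpack: buffer -> destination */
        }
        written++;
      }
      return written;
    }
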
static boolean execFunction (pllInstance *tr, pllInstance *localTree, partitionList *pr, partitionList *localPr, int tid, int n)

Generic entry point for parallel regions (mostly broadcasts traversal descriptor first).

This function handles all parallel regions in the Pthreads version. When we enter this function, pllMasterBarrier() has been called by the master thread from within the sequential part of the program; tr is the library instance (tree) at the master thread, and localTree is the library instance (tree) at the worker threads.

While this is not strictly necessary, the address spaces of the threads are indeed kept separated for an easier transition to a distributed-memory paradigm.

Parameters
tr library instance
localTree local library instance
tid worker id
n number of workers

char * getJobName ( int  type)

helper that yields a string representation of a parallel region.

Parameters
type type of parallel region

static void initializePartitionsMaster (pllInstance *tr, pllInstance *localTree, partitionList *pr, partitionList *localPr, int tid, int n)

Initialize the partitioning scheme (master function) in parallel environment.

Initialize the partition scheme in all processes/threads. This is a wrapper function that calls all necessary functions for allocating the local structures for the slave threads and for distributing all necessary data, such as alignment data and weight vectors, from the master thread.

Parameters
tr PLL instance
localTree Local PLL instance for the slave process/thread
pr List of partitions
localPr Local partition structure for the slave process/thread
tid Process/thread id
n Number of processes/threads

boolean isThisMyPartition (partitionList *localPr, int tid, int model)

Check if a partition is assigned to a thread/process.

Checks whether partition model from partition list localPr is assigned to be processed by process/thread with id tid.

Parameters
localPr Local list of partitions
tid Thread/Process id
model Partition number

static void * likelihoodThread (void *tData)

Target function where the threads/processes are trapped

The threads/processes spend all of their time in this function running operations on the data (computing likelihoods).

Parameters
tData Structure that contains the vital information for the thread/process, i.e. PLL instance, list of partitions and thread ID
Note
The data in tData are different for pthreads and MPI. Expand this section.

static void multiprocessorScheduling (pllInstance *tr, partitionList *pr, int tid)

Top-level function for the multiprocessor scheduling scheme (assigns full partitions to workers).

tr->manyPartitions is set to PLL_TRUE if the user has indicated via -Q that there are substantially more partitions than threads/cores available. In that case we do not distribute sites from each partition in a cyclic fashion to the cores, but distribute entire partitions to cores. Achieving a good balance of alignment sites to cores boils down to the multiprocessor scheduling problem known from theoretical computer science, which is NP-complete. We have implemented very simple "standard" heuristics for solving the multiprocessor scheduling problem that turn out to work very well and are cheap to compute.

Parameters
pr List of partitions
tid Id of current process/thread

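One such cheap heuristic is Longest Processing Time: sort the partitions by decreasing size and repeatedly hand the next partition to the currently least-loaded worker. The sketch below illustrates this idea only; the names and data layout are assumptions, not the library's code.

    #include <stdlib.h>

    /* Sketch of the LPT heuristic: partitionSizes must already be sorted in
       descending order (e.g. with qsort and a comparator like partCompare). */
    static void assignPartitionsLPT (const int *partitionSizes, int *assignment,
                                     int numPartitions, int numWorkers)
    {
      int *load = (int *) calloc ((size_t) numWorkers, sizeof (int));
      int  i, w;

      for (i = 0; i < numPartitions; i++)
      {
        int best = 0;
        for (w = 1; w < numWorkers; w++)       /* find the least loaded worker */
          if (load[w] < load[best])
            best = w;

        assignment[i] = best;                  /* assign the next largest partition */
        load[best]   += partitionSizes[i];
      }
      free (load);
    }
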
static int partCompare (const void *p1, const void *p2)

Compare partition sizes.

Parameters
p1 pointer to a partition
p2 pointer to another partition

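The comparator is the usual qsort-style function ordering partitions by decreasing size, so that a scheduling heuristic sees the largest partitions first. The struct and field names in this sketch are assumptions.

    typedef struct { int partitionNumber; int partitionLength; } partitionSizeEntry;  /* hypothetical */

    static int partCompareSketch (const void *p1, const void *p2)
    {
      int a = ((const partitionSizeEntry *) p1)->partitionLength;
      int b = ((const partitionSizeEntry *) p2)->partitionLength;

      /* larger partitions sort first */
      return (a > b) ? -1 : ((a < b) ? 1 : 0);
    }
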
void perSiteLogLikelihoodsPthreads (pllInstance *tr, partitionList *pr, double *lhs, int n, int tid)

Compute per-site log likelihoods (PThreads version)

Worker threads evaluate the likelihood on their sites

Parameters
tr Tree instance
lhs Likelihood array
n Number of threads
tid Thread id

void pinToCore ( int  tid)

Pins a thread to a core (for efficiency).

This is a non-portable function that works only on some Linux distributions of pthreads. It sets the affinity of each thread to a specific core so that performance is not degraded by thread migration.

Note
It is only called if _PORTABLE_PTHREADS is not defined
Parameters
tid the thread id

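On GNU/Linux the usual way to do this is via the non-portable affinity API, roughly as sketched below (an illustration of the technique, not necessarily the library's exact code).

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static void pinToCoreSketch (int tid)
    {
      cpu_set_t set;

      CPU_ZERO (&set);
      CPU_SET (tid, &set);                     /* allow execution on core tid only */

      /* on failure we simply continue without pinning */
      pthread_setaffinity_np (pthread_self (), sizeof (cpu_set_t), &set);
    }
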
void pllFinalizeMPI ( void  )

Finalize MPI run

Finalizes the MPI run by synchronizing all processes (master + slaves) with a barrier so that all of them free their allocated resources. Then MPI_Finalize() is called.

Todo:
As with the pllLockMPI function, this should not be called by the user, but it is called implicitly at the end of pllDestroyInstance. Probably this function should be removed and inline code be placed in pllDestroyInstance.

void pllInitMPI (int *argc, char **argv[])

Sets up the MPI environment.

Calls the MPI_Init function and makes sure all processes store their process ID and the total number of processes, using a barrier.

Note
this should be the first call that is executed in your main method.
Parameters
argc Address of argc from main
argv Address of argv from main
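
A typical setup matching this description looks roughly as follows; the global variables holding the rank and size are assumptions made for the sketch.

    #include <mpi.h>

    static int processID, numberOfProcesses;   /* hypothetical globals */

    static void initMPISketch (int *argc, char **argv[])
    {
      MPI_Init (argc, argv);
      MPI_Comm_rank (MPI_COMM_WORLD, &processID);           /* each process stores its ID    */
      MPI_Comm_size (MPI_COMM_WORLD, &numberOfProcesses);   /* ... and the process count     */
      MPI_Barrier (MPI_COMM_WORLD);                         /* synchronize before continuing */
    }

In a program, pllInitMPI (&argc, &argv) would accordingly be the first statement of main, as the note above requires.
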
void pllLockMPI (pllInstance *tr)

Lock the MPI slave processes prior to allocating partitions.

MPI slave processes are locked and wait until the master process has read the number of partitions, which it then broadcasts to slaves, effectively unlocking them. The slave processes will then allocate their own data structures and be locked in the likelihood function.

Parameters
tr PLL instance
Todo:
This function should not be called by the user. It is called at pllCreateInstance. Probably this function should be removed and inline code be placed in pllCreateInstance.

void pllMasterBarrier (pllInstance *tr, partitionList *pr, int jobType)

A generic master barrier for executing parallel parts of the code.

A generic master barrier through which the master thread/process controls the execution of work jobs. Through the parameter jobType the master instructs the slaves about what type of work they must conduct.

Parameters
tr PLL instance
pr List of partitions
jobType Type of job to be conducted

void pllMasterPostBarrier (pllInstance *tr, partitionList *pr, int jobType)

Cleanup step once the master barrier succeeded.

This is master-specific code that is executed once the barrier has been passed, for example reduction operations. By executing this here, we can keep the remaining code mostly free from parallel-specific code.

Parameters
tr PLL instance
pr List of partitions
jobType Job that is to be executed

void pllStartPthreads (pllInstance *tr, partitionList *pr)

Start PThreads

Start JOINABLE threads by executing pthread_create. The threads are attached to the pllLikelihoodThread function.

Parameters
tr PLL instance
pr List of partitions
Todo:
This function should never be called by the user. It is called implicitly at pllInitModel. Perhaps we should add a check or inline the code

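A minimal sketch of starting joinable worker threads in this style is shown below; everything except the pthread calls (struct, variable and function names) is an assumption for illustration.

    #include <pthread.h>
    #include <stdlib.h>

    typedef struct { int threadNumber; } exampleThreadData;    /* hypothetical */

    static pthread_t         *exampleThreads;
    static exampleThreadData *exampleData;

    static void * exampleWorker (void *arg)
    {
      /* a real worker would loop here waiting for jobs (cf. likelihoodThread) */
      (void) arg;
      return (void *) NULL;
    }

    static void startPthreadsSketch (int n)
    {
      pthread_attr_t attr;
      int t;

      pthread_attr_init (&attr);
      pthread_attr_setdetachstate (&attr, PTHREAD_CREATE_JOINABLE);  /* joinable threads */

      exampleThreads = (pthread_t *)         malloc ((size_t) n * sizeof (pthread_t));
      exampleData    = (exampleThreadData *) malloc ((size_t) n * sizeof (exampleThreadData));

      for (t = 1; t < n; t++)                  /* thread 0 is the master itself */
      {
        exampleData[t].threadNumber = t;
        if (pthread_create (&exampleThreads[t], &attr, exampleWorker, (void *) &exampleData[t]))
          exit (EXIT_FAILURE);                 /* thread creation failed */
      }
    }
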
void pllStopPthreads (pllInstance *tr)

Stop PThreads

Stop the threads by calling pthread_join.

Parameters
tr PLL instance
Todo:
This function should never be called by the user. It is implicitly called at pllPartitionsDestroy. We should inline the code

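Continuing the sketch from pllStartPthreads above, the matching shutdown joins every worker and releases the bookkeeping arrays (again an illustration, not the library's code).

    static void stopPthreadsSketch (int n)
    {
      int t;

      for (t = 1; t < n; t++)
        pthread_join (exampleThreads[t], (void **) NULL);   /* wait for each worker to exit */

      free (exampleThreads);
      free (exampleData);
    }
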
static boolean pllWorkerTrap (pllInstance *tr, partitionList *pr)

Traps worker MPI processes.

Note
This function should be called immediately after initMPI()
Parameters
tr PLL instance
pr List of partitions
Returns
PLL_FALSE if the callee was the master thread/process, otherwise PLL_TRUE
Note
For the broadcasting, we need to know whether the tree structure has already been initialized.

static char * popBytes (char *buf, void *result, size_t numBytes)

Pthreads helper function for removing bytes from communication buffer.

Copies numBytes bytes from the communication buffer buf to the local buffer result.

Parameters
buf Where to read the bytes from
result Where to store them
numBytes How many to copy
Returns
Pointer to the end of read data in communication buffer (first free slot)
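
As with addBytes, a hypothetical sketch of the helper (copy out and advance) is given below for illustration.

    #include <string.h>

    /* Sketch: remove numBytes bytes from the communication buffer into result
       and return a pointer to the next unread position. */
    static char * popBytesSketch (char *buf, void *result, size_t numBytes)
    {
      memcpy (result, buf, numBytes);  /* copy the next chunk out of the buffer */
      return buf + numBytes;           /* position of the next unread data      */
    }
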
static void reduceEvaluateIterative (pllInstance *tr, pllInstance *localTree, partitionList *localPr, int tid, boolean getPerSiteLikelihoods)

Evaluate the likelihood of this topology (PThreads/MPI implementation)

Evaluate the likelihood of the topology described in the PLL instance. First, every thread calls pllEvaluateIterative, where it computes the log likelihoods for its portion of each assigned partition. The results (for all partitions) are stored as elements of a local buffer array (buf). This is done by all threads. Subsequently, an MPI_Reduce operation sums the corresponding elements of the local buffer arrays into another array (targetBuf), which then holds the log likelihoods of each (complete) partition. Finally, this array is copied to the master thread/process. In addition, if getPerSiteLikelihoods is enabled, the log likelihoods for each site in the (compressed) alignment are stored in the array tr->lhs.

Parameters
tr PLL instance
localTree Local (thread/process) PLL instance
localPr Local (thread/process) list of partitions
tid Thread/Process ID
getPerSiteLikelihoods If set to PLL_TRUE, compute the log likelihood for each site

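The per-partition reduction described above corresponds to a standard element-wise MPI_Reduce; the sketch below shows this step in isolation (variable names follow the description, the code itself is an assumption).

    #include <mpi.h>

    /* Sketch: sum the local per-partition log likelihoods (buf) of all processes
       element-wise into targetBuf on the master (rank 0). */
    static void reducePartitionLikelihoods (double *buf, double *targetBuf,
                                            int numberOfPartitions)
    {
      MPI_Reduce (buf, targetBuf, numberOfPartitions, MPI_DOUBLE,
                  MPI_SUM, 0, MPI_COMM_WORLD);
    }
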
Variable Documentation

volatile int threadJob

current job to be done by worker threads/processes