
Parallel Computing

In Fortran there are several programming models you can use to implement a parallel algorithm. Most models use special compiler options and additional libraries; native compilation does not enable parallel computing by default. Our investigation is a work in progress!

Efficiency vs Performance

Before making your application run in parallel, consider what your goal is. You must see the larger picture to avoid hard work with little gain. Parallelization will improve performance, but it always reduces the overall efficiency of the system, because some effort goes into coordination instead of useful work.

Performance is measured in time saved. This is good in general, but sometimes it may be too expensive. If the server is already busy running many processes, starting a Fortran program that consumes all available resources will slow down the applications already running in the background and may overheat the system, reducing overall performance.

Execution Overhead

Parallel processing is not always a good idea. A parallel program spends extra time splitting a job into smaller parts to be run in parallel, then aggregating the partial results. This overhead can increase the total processing time: after putting effort into coding a faster program, you may end up with one that is slower than the single-threaded version. Amdahl's law gives the limit: if only a fraction p of the work can run in parallel on n threads, the best possible speedup is 1 / ((1 - p) + p/n), so with p = 0.5 no amount of hardware yields more than a 2x gain.

I/O optimization

Most blocking operations in a global process are I/O operations on diverse devices. Mechanical hard disks and optical disks are slow; newer SSD storage devices are much faster. Using better hardware can improve I/O performance significantly.

A good idea is to read data from one source and write data in parallel to different files. Having one output file per process keeps the processes independent and improves performance. The problem is that after processing you must aggregate the partial results back into a single file using a single thread. The sketch below illustrates the idea.
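
Here is a minimal sketch of the one-file-per-process idea. The program name, the id argument, and the part_<id>.dat naming scheme are our assumptions for illustration, not a fixed convention:

! hypothetical worker: writes its partial results to its own file,
! named after an id passed on the command line (e.g. ./worker 3)
program worker
   implicit none
   character(len=32) :: arg, fname
   integer :: id, u, i

   call get_command_argument(1, arg)     ! read the worker id
   read (arg, *) id
   write (fname, '(a,i0,a)') 'part_', id, '.dat'

   open (newunit=u, file=fname, status='replace', action='write')
   do i = 1, 5
      write (u, *) id, i                 ! partial results of this worker
   end do
   close (u)
end program worker

Several such processes can run at the same time without ever touching each other's files; only the final aggregation step is serial.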

Compiler optimization

Fortran compilers can target a specific platform. You can use compiler flags (options) to optimize the generated code and take advantage of special microprocessor features that can improve the performance of your application significantly. That may be a good alternative to try before reaching for parallelization.
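
As an illustration, the array operation below is a typical candidate for automatic vectorization when optimization is enabled. The flags in the comment are gfortran's and serve only as an example; every compiler has its own set:

! compile with optimization, e.g.: gfortran -O3 -march=native saxpy.f90
! -O3 turns on aggressive optimization; -march=native targets the SIMD
! features of the microprocessor the code is compiled on
program saxpy
   implicit none
   integer, parameter :: n = 100000
   real :: x(n), y(n)
   call random_number(x)
   call random_number(y)
   y = y + 2.0*x      ! array expression: a natural candidate for SIMD
   print *, y(1)
end program saxpy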

System parallelization

You can use Bash to run processes in parallel using the operating system's multitasking capability. You can create a Bash script that starts different processes in the background. This is the easiest way to create a parallel process.
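
A minimal sketch in Bash, assuming a worker executable like the one sketched earlier; all file names are placeholders:

#!/usr/bin/env bash
# start two independent workers in the background, wait for both,
# then aggregate their partial files in a single (serial) step
./worker 1 &                              # runs in the background
./worker 2 &                              # runs concurrently
wait                                      # block until both jobs finish
cat part_1.dat part_2.dat > result.dat    # serial aggregation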

[Diagram: the Map-Reduce model]

Map Reduce Model

Advantages

Using this model, parallel processes can have different complexity levels and different durations. For large projects, you can organize processes using a job scheduler. You can learn this technology in a Software Engineering course.

Disadvantages

Interprocess communication is difficult. You may also have to use two programming languages. Data transfer is expensive, and you must write a lot of code to receive the input data, parse it, and create the output data.

Loop parallelization

Loops are computing-intensive and can be run in parallel. You can use the Fortran compiler to parallelize loops using multi-threading. After the loop finishes, the main thread can aggregate the result. Not all loops can be executed in parallel: if a process uses nested loops, usually only the outer loop will be executed in parallel.

Competition for resources

If you are not careful, running system processes simultaneously with loop parallelization will cause competition for resources. This can overwhelm the server and slow down all the processes. If you need more computing power, a better idea may be to use multiple computers connected in a network. This technique of parallelization is called Distributed Computing.

Automatic parallelization

Some older compiler versions had automatic parallelization features. The idea was that you do not need to modify your program: the compiler decides whether your program can be optimized using parallel computing. This model of parallelization seems to have been abandoned in favor of explicit parallelization models.

Explicit parallelization

For better control over parallelization, you can use compiler directives that trigger the parallel code generators and enable specific parts of the application to run in parallel. You need skills to read, create, and debug code designed for parallel execution.

Parallelization methods

Fortran has several methods of parallelization that can be used to increase process performance. Different compilers implement the parallelization standards in different ways; the Fortran specification deliberately leaves implementation details open. Here is a list of methods we have identified:

  1. do concurrent (loops)
  2. openMP (Open Multi Processing)
  3. coarrays (CAF) Co-Array Fortran
  4. MPI (Message Passing Interface)

Disclaimer: the next code snippets are not runnable as-is; they are fragments of code. You must research the specific methods and compiler flags required to generate parallel code. Check the compiler reference books before designing your code for a specific platform.

do concurrent (loops)

The Fortran 2008 specification describes a new kind of loop: the DO loop augmented with the keyword CONCURRENT (uppercase or lowercase). This enables more effective parallel execution of native Fortran code without the use of non-standard directives.

Declaring the loop CONCURRENT enables the compiler to decide whether the loop is suitable for parallel execution. To keep that option open you must follow several restrictions; otherwise the compiler will not enable parallel execution.

Restrictions

  1. Do not use interruption statements that would prevent the loop from executing all its iterations: RETURN, EXIT, GOTO, CYCLE.
  2. Do not use image control statements: STOP, SYNC, LOCK/UNLOCK, EVENT.
  3. Do not ALLOCATE/DEALLOCATE coarrays inside the loop, a nested loop, or any subprogram called from the loop.
  4. Do not call a procedure that is not PURE from inside the loop. A pure procedure has no side effects.
  5. Do not deallocate any polymorphic entity, as that could cause an impure FINAL subroutine to be called.
  6. Do not modify the IEEE floating-point control and status flags.
  7. Do not modify an object in one iteration and expect to read it in another (see the sketch after this list).
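
For instance, restriction 7 rules out any loop with a cross-iteration dependency, like this prefix sum (a hypothetical example):

! INVALID as DO CONCURRENT: iteration i reads a(i-1), which the
! previous iteration has just written (a cross-iteration dependency)
program prefix_sum
   implicit none
   integer :: i
   integer :: a(5) = (/ 1, 1, 1, 1, 1 /)
   do i = 2, 5              ! must remain an ordinary DO loop
      a(i) = a(i) + a(i-1)
   end do
   print *, a               ! prints 1 2 3 4 5
end program prefix_sum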

Advantages

In return for accepting these restrictions, a DO CONCURRENT loop might compile into code that exploits the parallel features of the target machine, running its iterations in parallel without any OpenACC or OpenMP directive.

! fortran fragment (the LOCAL/SHARED locality specifiers are Fortran 2018)
integer, dimension(n) :: j, k
integer :: i, m
m = 10
i = 15
! iterate only where j(i) > 0; m is local to each iteration
do concurrent (i = 1:n, j(i) > 0) local (m) shared (j, k)
   m = mod (k(i), j(i))
   k(i) = k(i) - m       ! make k(i) divisible by j(i)
end do
print *, i, m ! expected 15 10: the outer i and m are untouched

openMP (Open Multi Processing)

OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared-memory multiprocessing. It consists of a set of compiler directives, library routines, and environment variables that enable run-time parallelization.


The fragment below accumulates the Leibniz series, whose sum approaches pi/4; the REDUCTION clause gives every thread a private partial sum, combined when the loop ends. It assumes limit, pi, and the kind parameter rk are declared in the surrounding code, and it must be compiled with OpenMP enabled (for example, gfortran -fopenmp).

! Leibniz series: 1 - 1/3 + 1/5 - ... converges to pi/4
!$OMP PARALLEL DO DEFAULT(NONE) SHARED(limit) PRIVATE(i) REDUCTION(+:pi)
do i = 1, limit
  pi = pi + (-1)**(i+1) / real( 2*i-1, kind=rk )
end do
!$OMP END PARALLEL DO

External References

Wikipedia: OpenMP

coarrays (CAF) Co-Array Fortran

Fortran 2008 contains the coarray parallel programming model. It is the first time that a parallel programming model has been added to the language as a standard feature, portable across all platforms. Compilers supporting the model are available or under development from all the major compiler vendors.

The coarray programming model consists of two new features added to the language: an extension of the normal array syntax to represent data decomposition, plus an extension of the execution model to control parallel work distribution.

Execution Model

The coarray execution model is based on Single Program Multiple Data (SPMD). A CAF program is replicated a number of times; each copy has its own set of data objects and is called an image. All images execute asynchronously.
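
A minimal sketch of the SPMD model: the same program runs on every image, and each copy discovers its own identity at run time. The build flags in the comment are just one possibility (gfortran with the OpenCoarrays library) and vary by compiler:

! minimal coarray program; build e.g. with gfortran -fcoarray=lib plus
! the OpenCoarrays library, or -fcoarray=single for serial testing
program caf_hello
   implicit none
   integer :: me, total
   me    = this_image()   ! index of this copy (image), starting at 1
   total = num_images()   ! how many images were launched
   print *, 'Image', me, 'of', total
   sync all               ! barrier: wait here for every image
end program caf_hello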

! coarray declarations (fragments: n, m, p, q are integer constants and
! the derived type "field" is assumed to be declared elsewhere)
real    :: a(n)[*]                     ! array of n reals on every image
complex :: z[0:*]                      ! scalar coarray, cobounds from 0
integer :: index(n)[*]                 ! integer array coarray
real    :: b(n)[p, *]                  ! two codimensions
real    :: c(n,m)[0:p, -3:q, +3:*]     ! three codimensions, explicit cobounds
real, allocatable :: w(:)[:,:]         ! allocatable array coarray
type(field), allocatable :: max[:,:]   ! allocatable derived-type coarray

External References

Wikipedia: Coarray Fortran

MPI (Message Passing Interface)

Message Passing Interface (MPI) is a communication protocol for parallel programming. MPI is specifically used to allow applications to run in parallel across a number of separate computers connected by a network.

Distributed system

A distributed system consists of a collection of autonomous computers, connected through a network and distribution middleware, which enables computers to coordinate their activities and to share the resources of the system so that users perceive the system as a single, integrated computing facility.

Example

program hello_mpi
   use mpi               ! modern module interface (replaces include 'mpif.h')
   implicit none
   integer :: rank, size, ierror

   call MPI_INIT(ierror)
   call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierror)
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierror)
   print *, 'node', rank, ': Hello world'
   call MPI_FINALIZE(ierror)
end program hello_mpi
! build and run, e.g.: mpif90 hello_mpi.f90 -o hello && mpirun -np 4 ./hello

Open MP vs MPI

Until the standard coarray model is fully supported by all compilers, we can use one of these: OpenMP or MPI. Selecting the right one is an engineering decision. Here are some considerations:


Pros of OpenMP

  1. Easy to adopt: you can parallelize an existing serial program one loop at a time.
  2. Directives are ignored when OpenMP is disabled, so one source serves both serial and parallel builds.
  3. Threads share memory, so no explicit data transfer is needed.

Cons of OpenMP

  1. Limited to a single shared-memory machine; it cannot scale beyond one node.
  2. Shared data makes race conditions easy to introduce and hard to debug.

Pros of MPI

  1. Scales across many computers connected by a network (distributed memory).
  2. Portable: implementations exist for virtually every platform.
  3. Explicit messages give you full control over communication.

Cons of MPI

  1. More code to write: partitioning, communication, and aggregation are all explicit.
  2. Harder to learn and debug than directive-based models.


Go back: Fortran Tutorial