03 March 2021
Being a compiled language designed for scientific and engineering applications, Fortran is well suited for producing very efficient code. However, the process of tuning code requires developers to understand the language as well as the tools available (compilers, performance libraries, etc) in order to generate the highest performance. This section identifies programming patterns to use Fortran and the Intel® Fortran compiler most effectively.
Fortran compilers have multiple levels of optimization available from no optimization to extreme optimization tuned to a particular machine architecture and operating system combination. Timing tests alone are not a good indication for the compiler’s ability to optimize a particular portion of code. Therefore, developers should generate optimization reports to get line-by-line reporting on optimizations such as vectorization, parallelization, memory and cache usage, threading, and others.
https://software.intel.com/content/www/us/en/develop/articles/vectorization-and-optimization-reports.html
Each mathematical operation has an effective “strength”, and some operations can be equivalently represented as a combination of multiple reduced-strength operations that have better performance than the original. As part of the code optimization step, compilers may be able to identify areas where a mathetical operation’s strength can be reduced. Compilers are not able to optimize all expensive operations. For example, cases where a portion of an expensive mathematical operation is a variable contained in a derived data type are frequently skipped. Therefore, it is recommended that expensive subroutines be profiled and searched for possible strength reduction opportunities.
A concrete example of operator strength reduction is in dividing many elements of an array by a constant.
module_type%scale_factor = 10.0
do i = 1
if array(i) < 30.0
array(i) = array(i) / module_type%scale_factor
end if
end do
In this case, a multiplication of real numbers is less expensive than a division of real numbers. The code can be factored so that the inverse of the scale factor is computed outside of the loop and the mathematical operation in the loop is converted to a multiplication.
module_type%scale_factor = 10.0
inverse_scale_factor = 1.0 / module_type%scale_factor
do i = 1
if array(i) < 30.0
array(i) = array(i) * inverse_scale_factor
end if
end do
Intel Coarray tutorial: https://software.intel.com/content/www/us/en/develop/documentation/fortran-compiler-coarray-tutorial/top/modifying-the-program-to-use-coarrays.html
Coarrays are a feature introduced to the Fortran language in the 2008 standard to provide language-level parallelization for array operations in a Single Program, Multiple Data (SPMD) programming paradigm. The Fortran standard leaves the method of parallelization up to the compiler, and the Intel® Fortran compiler uses MPI.
Coarrays are used to split an array operation across multiple copies of the program. The copies are called “images”. Each image has its own local variables, plus a portion of any coarrays as shared variables. A coarray can be thought of as having extra dimensions, referred to as “codimensions”. A single image (typically the 1-th index) controls the I/O and problem setup, and images can read from memory in other images.
For operations on large arrays such as constructing a super-array from many sub-arrays (as in the construction of a Jacobian matrix), the coarray feature of Fortran 08 can parallelize the procedure improving overall computational efficiency.
.. TODO: Add example of coarray implementation in Fortran
Fortran represents arrays in column-major order. This means that a multidimensional array is represented in memory with column elements being adjacent. If a given element in an array is at a location in memory, one element before in memory corresponds to the element above it in its column.
In order to make use of the single instruction, multiple data
features of modern processors, array construction and access
should happen in column-major order. That is, loops should loop
over the left-most index quickest. Slicing should occur with
the :
also on the left-most index when possible.
With this in mind, data should be represented as structures of arrays rather than arrays of structures. Concretely, this means that data types within OpenFAST should contain the underlying arrays and arrays should generally contain only numeric types. Wikipedia page on AoS vs SoA: https://en.wikipedia.org/wiki/AoS_and_SoA
Here’s a pretty good reference on array in C and Fortran (and derivates): https://eli.thegreenplace.net/2015/memory-layout-of-multi-dimensional-arrays
The short program below displays the distance in memory in units of bytes between elements of an array and neighboring elements.
program memloc
implicit none
integer(kind=8), dimension(3, 3) :: r, distance_up, distance_left
! Take the element values as their "ID"
! r(row, column)
r(1,:) = (/ 1, 2, 3 /)
r(2,:) = (/ 4, 5, 6 /)
r(3,:) = (/ 7, 8, 9 /)
print *, "Reference array:"
call pretty_print_array(r)
! Compute the distance between matrix elements. Inputs to the `calculate_distance` function
! are indices for elements in the equation loc(this_element) - loc(other_element)
distance_up(1,:) = (/ calculate_distance( 1,1 , 1,1), calculate_distance( 1,2 , 1,2), calculate_distance( 1,3 , 1,3) /)
distance_up(2,:) = (/ calculate_distance( 2,1 , 1,1), calculate_distance( 2,2 , 1,2), calculate_distance( 2,3 , 1,3) /)
distance_up(3,:) = (/ calculate_distance( 3,1 , 2,1), calculate_distance( 3,2 , 2,2), calculate_distance( 3,3 , 2,3) /)
print *, "Distance in memory (bytes) for between an element and the one above it (top row zeroed):"
call pretty_print_array(distance_up)
distance_left(1,:) = (/ calculate_distance( 1,1 , 1,1), calculate_distance( 1,2 , 1,1), calculate_distance( 1,3 , 1,2) /)
distance_left(2,:) = (/ calculate_distance( 2,1 , 2,1), calculate_distance( 2,2 , 2,1), calculate_distance( 2,3 , 2,2) /)
distance_left(3,:) = (/ calculate_distance( 3,1 , 3,1), calculate_distance( 3,2 , 3,1), calculate_distance( 3,3 , 3,2) /)
print *, "Distance in memory (bytes) for between an element and the one to the its left (first column zeroed):"
call pretty_print_array(distance_left)
contains
integer(8) function calculate_distance(c1, r1, c2, r2)
integer, intent(in) :: c1, r1, c2, r2
calculate_distance = loc(r(c1, r1)) - loc(r(c2, r2))
end function
subroutine pretty_print_array(array)
integer(8), dimension(3,3), intent(in) :: array
print *, "["
print '(I4, I4, I4)', array(1,1), array(1,2), array(1,3)
print '(I4, I4, I4)', array(2,1), array(2,2), array(2,3)
print '(I4, I4, I4)', array(3,1), array(3,2), array(3,3)
print *, "]"
end subroutine
end program