Vectorization¶
Modern CPUs have Vector Processing Units (VPUs) that allow the processor to do the same instruction on multiple data, SIMD per cycle.
System | microarchitecture | Instruction Set | SIMD width |
---|---|---|---|
Perlmutter | Zen-3 (Milan) | AVX2+ | 256 bits |
Auto-vectorization¶
In many cases a compiler is able to transform sequential code into vector operations automatically - a process known as automatic vectorization.
Example
do i = 1, n
c(i) = a(i) + b(i)
end do
Could be transformed by the compiler such that blocks of 4 elements are processed at a time:
do i = 1, n, 4
c(i) = a(i) + b(i)
c(i+1) = a(i+1) + b(i+1)
c(i+2) = a(i+2) + b(i+2)
c(i+3) = a(i+3) + b(i+3)
end do
Vectorization requirements¶
-
The loop trip count must be known at entry to the loop at runtime. Statements that can change the trip count dynamically at runtime (such as Fortran's
EXIT
, computedIF
, etc. or C/C++'sbreak
) must not be present inside the loop. -
Branching in the loop inhibits vectorization. Thus, C/C++'s
switch
statements are not allowed. However,if
statements are allowed as long as they can be implemented as masked assignments. The calculation is done for allif
branches but the results is stored only for those elements for which the mask evaluates to true. -
Only the innermost loop is eligible for vectorization. If the compiler transforms an outer loop into an inner loop as a result of optimization, then the loop may be vectorized.
-
A function call or I/O inside a loop prohibits vectorization. Intrinsic math functions such as
cos
,sin
, etc. are allowed because such library functions are usually vectorized versions. A loop containing a function that is inlined by the compiler can be vectorized because there will be no more function call. -
Data dependencies in the loop could prevent vectorization.
-
Non-contiguous memory access hampers vectorization efficiency. Eight consecutive ints or floats, or four consecutive doubles, may be loaded directly from memory in a single AVX instruction. But if they are not adjacent, they must be loaded separately using multiple instructions, which is considerably less efficient.
Data dependency¶
read-after-write¶
Also known as "flow dependency". Vectorization creates wrong results.
do i=2,n
a(i) = a(i-1) + 1
end do
write-after-read¶
Also known as "anti-dependency" and can be vectorized.
do i=2,n
a(i-1) = a(i) + 1
end do
write-after-write¶
Also known as "output dependency" and cannot be vectorized.
do i=2,n
a(i-1) = x(i)
a(i) = 2.0 * i
end do
Reduction operations¶
Reduction operations can be vectorized.
s=0.0
do i=1,n
s = s + a(i) * b(i)
end do
Memory alignment¶
Data movement instructions are more efficient when operating on data objects that are aligned.
Note
While Fortran does not have extensions in the language itself for data alignment some compilers provide non-portable directives or command line flags: the Cray compiler has the directive !DIR$ VECTOR ALIGNED
and the Intel compiler has the compiler flag -align array64byte
.
Fortran alignment example¶
The following test code examines the effect of memory alignment in a simple-minded matrix-matrix multiplication case. We pad the matrices with extra rows to make them aligned at certain boundaries.
program matmat
implicit none
integer :: n = 31
integer :: itmax = 200000
#ifdef REAL4
real, allocatable :: a(:,:), b(:,:), c(:,:)
#else
real*8, allocatable :: a(:,:), b(:,:), c(:,:)
#endif
#ifdef ALIGN16
!dir$ attributes align : 16 :: a,b,c
#elif defined(ALIGN32)
!dir$ attributes align : 32 :: a,b,c
#elif defined(ALIGN64)
!dir$ attributes align : 64 :: a,b,c
#endif
integer i, j, k, it
integer :: vl, nr
integer*8 c1, c2, cr, cm
real*8 dt
!... Vector length
#ifdef ALIGN16
vl = 16 / (storage_size(a) / 8)
#elif defined(ALIGN32)
vl = 32 / (storage_size(a) / 8)
#elif defined(ALIGN64)
vl = 64 / (storage_size(a) / 8)
#else
vl = 1
#endif
nr = ((n + (vl - 1)) / vl) * vl ! padded row dimension
allocate (a(nr,n), b(nr,n), c(nr,n))
!... Initialization
do j=1,n
#if defined(ALIGN16) || defined(ALIGN32) || defined(ALIGN64)
!dir$ vector aligned
#endif
do i=1,nr
a(i,j) = cos(i * 0.1 + j * 0.2)
b(i,j) = sin(i * 0.1 + j * 0.2)
c(i,j) = 0.
end do
end do
!... Main loop
call system_clock(c1, cr, cm)
do it=1,itmax
do j=1,n
do k=1,n
#if defined(ALIGN16) || defined(ALIGN32) || defined(ALIGN64)
!dir$ vector aligned
#endif
do i=1,nr
c(i,j) = c(i,j) + a(i,k) * b(k,j)
end do
end do
end do
end do
call system_clock(c2, cr, cm)
print *, c(1,1)+c(n,n), dble(c2-c1)/dble(cr)
deallocate(a, b, c)
end
AoS vs SoA¶
Array of Structures vs Structures of Arrays
A data object can become complex with multiple component elements or attributes. Programmers often represent a group of such data objects using an array of Fortran's derived data type or C's struct objects (i.e., an array of structures or AoS). Although an AoS provides a natural way to represent such data, memory reference of any component requires non-unit stride access. Such a situation is illustrated in the following example code. When the main loop is transformed into a vector loop, three components of a 'coords' object will be stored into three separate vector registers, one for each component. With the AoS data layout, loading into such a register will require stride 3 (or more) access, reducing efficiency of the vector load.
A better data structure for vectorization is to separate each component of the objects into its own array, and then form a data object composed of three arrays (i.e., a structure of arrays or SoA). When the main loop is vectorized, each component will be loaded into a separate register but this will be done with unit-stride access. Therefore, vectorization will be more efficient.
program aossoa
implicit none
integer :: n = 1000
integer :: itmax = 10000000
#ifdef SOA
type coords
real, pointer :: x(:), y(:), z(:)
end type
type (coords) :: p
#else
type coords
real :: x, y, z
end type
type (coords), allocatable :: p(:)
#endif
real, allocatable :: dsquared(:)
integer i, it
integer*8 c1, c2, cr, cm
real*8 dt
!... Initialization
#ifdef SOA
allocate(p%x(n), p%y(n), p%z(n), dsquared(n))
do i=1,n
p%x(i) = cos(i + 0.1)
p%y(i) = cos(i + 0.2)
p%z(i) = cos(i + 0.3)
end do
#else
allocate(p(n), dsquared(n))
do i=1,n
p(i)%x = cos(i + 0.1)
p(i)%y = cos(i + 0.2)
p(i)%z = cos(i + 0.3)
end do
#endif
!... Main loop
call system_clock(c1, cr, cm)
do it=1,itmax
#ifdef SOA
do i=1,n
dsquared(i) = p%x(i)**2 + p%y(i)**2 + p%z(i)**2
end do
#else
do i=1,n
dsquared(i) = p(i)%x**2 + p(i)%y**2 + p(i)%z**2
end do
#endif
end do
call system_clock(c2, cr, cm)
dt = dble(c2-c1)/dble(cr)
print *, dsquared(1)+dsquared(n/2)+dsquared(n), dt
#ifdef SOA
deallocate(p%x, p%y, p%z, dsquared)
#else
deallocate(p, dsquared)
#endif
end
Elemental functions¶
Elemental functions are functions that can be also invoked with an array actual argument and return array results of the same shape as the argument array. This convenient feature is quite common in Fortran as it is widely used in many intrinsic functions.
A function call inside a loop generally inhibits vectorization. However, if an elemental function is called within a loop, the loop can be executed in vector mode. In vector mode, the function is called with multiple data packed in a vector register and returns packed data.
Fortran example¶
module fofx
implicit none
contains
elemental function f(x)
real(8) :: f
real(8), intent(in) :: x
f = cos(x * x + 1.0_8) / (x * x + 1.0_8)
end function f
end module fofx
program main
use fofx
implicit none
integer :: n = 1024
integer :: itmax = 1000000
real(8), allocatable :: a(:), x(:)
integer :: i, it
integer(8) :: c1, c2, cr, cm
real(8) :: dt
allocate (a(n), x(n))
!... Initialization
do i=1,n
x(i) = cos(i * 0.1_8) + 0.2_8
end do
!... Main loop
call system_clock(c1, cr, cm)
do it=1,itmax
do i=1,n
a(i) = f(x(i))
end do
x(n) = x(n) + 1.0_8
end do
call system_clock(c2, cr, cm)
dt = real(c2-c1, 8)/real(cr, 8)
write(*,*) n, a(1)+a(n/2)+a(n), dt
deallocate(a, x)
end program main
Pointer aliasing¶
OpenMP¶
The OpenMP standard has the SIMD construct since 4.0 to specify the execution of a loop in vectorization mode (i.e., SIMD operations).
#pragma omp simd [clause...]
!$omp simd [clause...]
where the optional clause could be:
safelen(length)
- maximum length for safe vectorization without incurring data dependencyaligned(list[:alignment])
- list of the variables are aligned to the number of bytes expressed in the optional parameter.reduction(reduction-identifier:list)
- list the variables where a reduction operation (i.e.,+
for summation,min
for minimum,max
for maximum, etc.) result is storedcollapse(n)
- how many levels of the nested loops that immediately follow the OpenMP directive should be collapsed into a single aggregate loop with larger iteration space.
Memory alignment¶
do it=1,itmax
do j=1,n
do k=1,n
#if defined(ALIGN16)
!$omp simd aligned(a,b,c:16)
#elif defined(ALIGN32)
!$omp simd aligned(a,b,c:32)
#elif defined(ALIGN64)
!$omp simd aligned(a,b,c:64)
#endif
do i=1,nr
c(i,j) = c(i,j) + a(i,k) * b(k,j)
end do
end do
end do
end do
Elemental functions¶
It is also possible to declare that a function can be vectorized with OpenMP.
module fofx
implicit none
contains
#ifdef ELEMENTAL
!$omp declare simd (f)
#endif
function f(x)
#ifdef REAL4
real f, x
#else
real*8 f, x
#endif
#ifdef REAL4
f = cos(x * x + 1.e0) / (x * x + 1.e0)
#else
f = cos(x * x + 1.d0) / (x * x + 1.d0)
#endif
end function f
end module fofx
...