Is it possible to achieve in C++ the performance one can get from Fortran?

This is the question I asked myself recently. If you write a scientific code in Fortran, you can expect a huge performance boost compared to the same program in C or C++. Well, unless you use some compiler extensions, in which case you get the same performance, or better.
Let’s try this on a 3D propagation sample, with a 8-points stencil.

Here is the Fortran code:

PROGRAM test2Df
 
  IMPLICIT NONE
 
  REAL, ALLOCATABLE :: P(:,:,:)
  REAL, ALLOCATABLE :: Q(:,:,:)
 
  INTEGER N, iter
  INTEGER i
 
  N = 500
  iter = 100
 
  ALLOCATE(P(N,N,N))
  ALLOCATE(Q(N,N,N))
 
  P(:,:,:) = 1500
  Q(:,:,:) = 1500
 
  do i = 1, iter
    CALL COMPUTE(P, Q, N)
    CALL COMPUTE(Q, P, N)
  enddo
 
  PRINT *, P(N/2+1,N/2+1,N/2+1)
 
  DEALLOCATE(P)
  DEALLOCATE(Q)
 
END PROGRAM
 
SUBROUTINE COMPUTE(P, Q, N)
 
  IMPLICIT NONE
 
  REAL :: P(N,N,N)
  REAL :: Q(N,N,N)
  INTEGER :: N
 
  INTEGER i, j, k
 
  do k = 5, N-4
    do j = 5, N-4
      do i = 5, N-4
        Q(i, j, k) = P(i, j, k) &
          + (P(i-1, j, k) - P(i+1, j, k)) &
          + (P(i-2, j, k) - P(i+2, j, k)) &
          + (P(i-3, j, k) - P(i+3, j, k)) &
          + (P(i-4, j, k) - P(i+4, j, k)) &
          + (P(i, j-1, k) - P(i, j+1, k)) &
          + (P(i, j-2, k) - P(i, j+2, k)) &
          + (P(i, j-3, k) - P(i, j+3, k)) &
          + (P(i, j-4, k) - P(i, j+4, k)) &
          + (P(i, j, k-1) - P(i, j, k+1)) &
          + (P(i, j, k-2) - P(i, j, k+2)) &
          + (P(i, j, k-3) - P(i, j, k+3)) &
          + (P(i, j, k-4) - P(i, j, k+4))
      enddo
    enddo
  enddo
 
END SUBROUTINE

(In fact, this code does nothing, every sample is always at 1500, it’s just meant to be an example.)

Then, the C/C++ code is almost identical (a real C code would not be different than this one, and a C++ code would only add some smart pointers IMHO):

 
#include <iostream>
#include <cstdlib>
 
#define XYZ(x, y, z, N) ((x) + (y) * (N) + (z) * (N) * (N))
 
void compute(float *P, float *Q, int N);
 
int main(int argc, char** argv)
{
  const int N = 100;
  const int iter = 100;
 
  P = new float[N*N*N];
  Q = new float[N*N*N];
 
  for(int i = 0; i < N*N*N; ++i)
  {
    P[i] = 1500;
    Q[i] = 1500;
  }
 
  for(int i = 0; i < iter; ++i)
  {
    compute(P, Q, N);
    compute(Q, P, N);
  }
 
  std::cout << P[XYZ(N/2, N/2, N/2, N)] << std::endl;
 
  delete[] P;
  delete[] Q;
 
  return EXIT_SUCCESS;
}
 
void compute(float *P, float *Q, int N)
{
  for(int k = 4; k < N-4; ++k)
  {
    for(int j = 4; j < N-4; ++j)
    {
      for(int i = 4; i < N-4; ++i)
      {
        Q[XYZ(i, j, k, N)] = P[XYZ(i, j, k, N)]
          + (P[XYZ(i+1, j, k, N)] - P[XYZ(i-1, j, k, N)])
          + (P[XYZ(i+2, j, k, N)] - P[XYZ(i-2, j, k, N)])
          + (P[XYZ(i+3, j, k, N)] - P[XYZ(i-3, j, k, N)])
          + (P[XYZ(i+4, j, k, N)] - P[XYZ(i-4, j, k, N)])
          + (P[XYZ(i, j+1, k, N)] - P[XYZ(i, j-1, k, N)])
          + (P[XYZ(i, j+2, k, N)] - P[XYZ(i, j-2, k, N)])
          + (P[XYZ(i, j+3, k, N)] - P[XYZ(i, j-3, k, N)])
          + (P[XYZ(i, j+4, k, N)] - P[XYZ(i, j-4, k, N)])
          + (P[XYZ(i, j, k+1, N)] - P[XYZ(i, j, k-1, N)])
          + (P[XYZ(i, j, k+2, N)] - P[XYZ(i, j, k-2, N)])
          + (P[XYZ(i, j, k+3, N)] - P[XYZ(i, j, k-3, N)])
          + (P[XYZ(i, j, k+4, N)] - P[XYZ(i, j, k-4, N)]);
      }
    }
  }
}

Now, these two code compiled with ICC 10.1 with -xW (AMD64 platform, Linux) behave differently. With N=100, the Fortran code executes in 1.7s, the C/C++ one in 3.3s! With N=500, the same factor applies, with 338s for Fortran and 610s for C/C++.

Now, the difference is that in the computation loop, Fortran can optimize better than C/C++ because of the Fortran norm. Indeed, the pointers P and Q cannot alias, taht is, P cannot point to an element of Q, and vice-versa. This means that some elements can be reused in the stencil. This is not the case with the C/C++ code.

Some options exist to add the restruction (-fno-fnalias as a compiler option), but it’s not perfect, as this is a program behaviour modification (and brakes the standard). So C99 introduced a special keyword, restrict. But it is only a C99 option, and C0x does not seem to add it to the language. Too bad, because this changes everything. I’ve used an ICC extension that proposes the restrict keyword in C++ as __restrict__:

void compute(float * __restrict__ P, float * __restrict__ Q, int N)

Now, the C/C++ program is as fast as the Fortran one, with 1.7s for N=100 and 325s for N=500.

Note also that you have to pass each pointer to the computation function, in C or in Fortran. If the pointers are passed inside a structure, the compiler can’t tell if the pointers inside the structure are aliased or not, so it uses the safe codepath.

As a conclusion, I will just say that to achieve Fortran-like performance, C99 proposes a keyword (but not every compiler uses it correctly yet…); unfortunately the next C++ standard does not. This is a shame, as it will not a burden on current compilers. Compiler suits already can generate this kind of codepath (they usually provide a Fortran compiler as well as a C++ compiler), so adding it inside the C++ compiler will not be much trouble. In fact, at least two C++ compilers (gcc and icc) provide their own extension, but it’s only an unofficial extension (one cannot rely on them).

Without an inclusion inside the C++ standard, performance cannot be expected from C++ on every platform with every compiler.

8 thoughts on “Is it possible to achieve in C++ the performance one can get from Fortran?

  1. Other implementations in c++ are noticeably faster than yours.
    eg without the multiplications, using 3D matrices.

    (PS. feel free to beautify the following 🙂 )

    #include 
    #include 
     
    const int N = 100;
    void compute(float P[N][N][N], float Q[N][N][N]);
    int main(int argc, char** argv)
    {
     
      const int iter = 100;
      float P[N][N][N];
      float Q[N][N][N];
      for(int i = 0; i &lt; N; ++i)
      for(int j = 0; j &lt; N; ++j)
      for(int k = 0; k &lt; N; ++k)
      {
        P[i][j][k] = 1500.;
        Q[i][j][k] = 1500.;
      }
     
      for(int i = 0; i &lt; iter; ++i)
      {
        compute(P, Q);
        compute(Q, P);
      }
     
      std::cout &lt;&lt; P[N/2][N/2][N/2] &lt;&lt; std::endl;
     
     
      return EXIT_SUCCESS;
    }
     
    void compute(float P[N][N][N], float Q[N][N][N])
    {
      for(int k = 4; k &lt; N-4; ++k)
      {
        for(int j = 4; j &lt; N-4; ++j)
        {
          for(int i = 4; i &lt; N-4; ++i)
          {
            Q[i][ j][ k] = P[i][ j][ k]
              + (P[i+1][ j][ k] - P[i-1][ j][ k])
              + (P[i+2][ j][ k] - P[i-2][ j][ k])
              + (P[i+3][ j][ k] - P[i-3][ j][ k])
              + (P[i+4][ j][ k] - P[i-4][ j][ k])
              + (P[i][ j+1][ k] - P[i][ j-1][ k])
              + (P[i][ j+2][ k] - P[i][ j-2][ k])
              + (P[i][ j+3][ k] - P[i][ j-3][ k])
              + (P[i][ j+4][ k] - P[i][ j-4][ k])
              + (P[i][ j][ k+1] - P[i][ j][ k-1])
              + (P[i][ j][ k+2] - P[i][ j][ k-2])
              + (P[i][ j][ k+3] - P[i][ j][ k-3])
              + (P[i][ j][ k+4] - P[i][ j][ k-4]) ;
          }
        }
      }
    }
  2. I don’t know if this runs any faster, but that fortran subroutine should be written as follows. The “flying v’s” or do loops is so Fortran77. 🙂

    SUBROUTINE COMPUTE(P, Q, N)
     
      IMPLICIT NONE
     
      REAL :: P(N,N,N)
      REAL :: Q(N,N,N)
      INTEGER :: N
     
      Q(5:N-4, 5:N-4, 5:N-4) = P(5:N-4, 5:N-4, 5:N-4) &
           + (P(5:N-4-1, 5:N-4, 5:N-4) - P(5:N-4+1, 5:N-4, 5:N-4)) &
           + (P(5:N-4-2, 5:N-4, 5:N-4) - P(5:N-4+2, 5:N-4, 5:N-4)) &
           + (P(5:N-4-3, 5:N-4, 5:N-4) - P(5:N-4+3, 5:N-4, 5:N-4)) &
           + (P(5:N-4-4, 5:N-4, 5:N-4) - P(5:N-4+4, 5:N-4, 5:N-4)) &
           + (P(5:N-4, 5:N-4-1, 5:N-4) - P(5:N-4, 5:N-4+1, 5:N-4)) &
           + (P(5:N-4, 5:N-4-2, 5:N-4) - P(5:N-4, 5:N-4+2, 5:N-4)) &
           + (P(5:N-4, 5:N-4-3, 5:N-4) - P(5:N-4, 5:N-4+3, 5:N-4)) &
           + (P(5:N-4, 5:N-4-4, 5:N-4) - P(5:N-4, 5:N-4+4, 5:N-4)) &
           + (P(5:N-4, 5:N-4, 5:N-4-1) - P(5:N-4, 5:N-4, 5:N-4+1)) &
           + (P(5:N-4, 5:N-4, 5:N-4-2) - P(5:N-4, 5:N-4, 5:N-4+2)) &
           + (P(5:N-4, 5:N-4, 5:N-4-3) - P(5:N-4, 5:N-4, 5:N-4+3)) &
           + (P(5:N-4, 5:N-4, 5:N-4-4) - P(5:N-4, 5:N-4, 5:N-4+4))
     
    END SUBROUTINE
  3. Of course, it’s better to use the array notation. It should be also more readable this way.
    I have to say that I don’t know much about the speed difference between the two versions. What I can say is that I’ve seen Fortran 90 software with do loops, and the people coding it were really focused on performance (HPC guys). I think that some optimization are more easily done with do loops than with the array notation (cache blocking loops, for instance), but I may be wrong 😉
    Either way, performance seem to reach a maximum when using Fortran, which is not the case in C++ without the restrict extension.

  4. This may be here-say, but I have been told that you should use array notation whenever possible, because the scalarizer (I believe that is the word) in the Fortran compiler is on average better than you are at writing code. One of Fortran’s few strengths is its ability to do strides, slices, and other array addressing goodness.

    Of course you could always just go to assember, if you think you are smarter than the compiler. 🙂

    Fortran is the only language I know where multi-dimensional arrays are first-class citizens.

  5. I’d check out the C++ template library Eigen:
    http://eigen.tuxfamily.org/

    It gets extremely fast performance on a wide range of vector/matrix operations, including SSE enhancements and auto-vectorization.

    Essentially Eigen gives great performance to matrices in C++. Whether it will cover everything you want? I don’t know because I’m not used to reading Fortran that much.

    1. I’m trying to use it, but there seems to be a huge lack: no 3D arrays. And I really need 3D arrays for a 3D propagation loop. I’ve checked inside the code, eigen uses the restrict keyword. So it uses something outside the C++ standard. As I’ve said, if you restrict yourself to the C++ standard, you can’t get the same performance as the one with Fortran. This just confirms that everyone already uses the restrict keyword. It should have been put inside C++0x.

      I’ve used the Fortran 90 loop kris gave, and I get a factor 3, which is very good. Additional loops are vectorized and the subroutine is complitely inlined; So I have to check if the routine is really doing the same as mine (perhaps N and iter are marked as constants, in which case every loop can be unrolled, but this is not the same as the original code).

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.