Lab 2 -- Gary Howell
Experimenting with Communication Times, Collective Communications, and Serial Speedups


  • Editing Files in UNIX

    The vi editor is useful because it is universally available on unix machines and does not require a forwarded xterm (see Lab 1). Perhaps you are not fond of vi and would prefer to think about something other than its commands. There are other editors available. One easy-to-use editor is "nano".

     
    >nano myfile
    

    will edit myfile. Some editor commands should appear at the bottom of the screen. CTRL-X will save your file. Nano doesn't need a GUI and may be more natural to use than vi. One inconvenience: it can be hard to tell in nano where lines break, and new lines can change the meaning of code.

    Using Windows-edited files in linux or unix often fails because the "new line" symbol differs between Windows and linux. A Windows file "foofile" can be converted to linux format by typing

    dos2unix foofile
    

    Many Unix users prefer the emacs editor to vi. It requires a forwarded X display (GUI). If you're logging in via putty from an NCSU lab machine, first click the xwin program. At the prompt

    >emacs myfile
    

    You might also try the
    >emacs -nw myfile
    

    option. This lets you use emacs even if you haven't succeeded in getting a forwarded xterm. Another easy-to-use editor is nedit. It also requires a forwarded xterm.

  • Other resources on these web pages

    These labs are not at all complete. The HowTo page has useful information on how to get started, how to manage storage, commonly encountered errors, etc. More specific information on some commonly used software packages is available from the Software page.

  • Using a Scatter and a Reduce

    Copy ranpak.f from pachec/ppmpi_f/chap03 to pachec/ppmpi_f/chap05. After setting the environment variables with "add pgi64_hydra", create the files monte2.o and ranpak.o (compile with the -c flag).
    mpif90 -c monte2.f ranpak.f 
    

    Then link these to produce an executable.
    mpif90 monte2.o ranpak.o -o  monte2 
    

    There are also mpif77, mpicc, and mpiCC commands available. These correspond to pgf77, pgcc, and pgCC, but the "add pgi64_hydra" command also links in the MPI libraries (actually these are MPI-2 libraries). Similarly, we could have invoked "add intel64_hydra". GNU compilers can also be used, though GNU-compiled code typically runs more slowly.

  • Aside on Compiling with a Make file

    Though the above commands could be put in a script (make a file with the above two lines and make it executable), this would recompile all the code every time you changed one source file. So for big executables that link many .o files, the preferred solution is to use a "make" file. Then only the required files are recompiled.

    If you want to see how a makefile for this simple problem would look, try a file makemonte2 consisting of

    FC = mpif77
    mont:
    	$(FC) -c monte2.f ranpak.f
    	$(FC) -o monte2 monte2.o ranpak.o 
    

    Make sure that the command lines after mont: start with tabs, not blanks!!
    make -f makemonte2
    

    will do the compile. If you google "make tutorial" you can find many tutorials, for example the Colby College Make tutorial. You want to learn, for example, how to specify libraries to link to, how to specify include files, and how to specify compiler options. Of course another good method is to just look at existing make files.

    Traditionally, unix users would port codes by modifying make files. More recently, one runs a "configure script", which automatically reconfigures the make file to match the local compilers, compiler options, and library locations (these are inputs from the user).

    Here's a link to a more complete unix/linux set of tutorials Professor Norm Matloff, CS UC Davis Tutorial Center.

  • Revise the Code

    Try revising the MPI_Reduce call. Can you make the code accumulate the product of the numbers as opposed to the sum? Can you make it give the maximum of the numbers?

    Then use an MPI_ALLREDUCE call to deposit the answer on all the processors. Here's the argument list for MPI_ALLREDUCE with the Fortran declarations; a short sketch using it follows the list.

    MPI_ALLREDUCE(SENDBUF,RECVBUF,COUNT,DATATYPE,OP,COMM,IERROR)
    <type> SENDBUF(*), RECVBUF(*)
    INTEGER COUNT, DATATYPE, OP, COMM, IERROR

    IN sendbuf -- choose a data type
    OUT recvbuf -- typically choose the same data type as in the sendbuf
    IN count integer number of elements in send buffer
    IN datatype MPI data type of elements of send buffer
    IN op operation -- adding, getting max, etc.
    IN comm communicator -- for example MPI_COMM_WORLD
    OUT ierror integer, zero if successful
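
    For example, here is a minimal sketch of an Allreduce in Fortran (a stand-alone toy program; the variable names are illustrative, not taken from monte2.f). The operation argument works the same way in MPI_REDUCE, so swapping MPI_SUM for MPI_PROD or MPI_MAX answers the questions above.

          program allred
          include 'mpif.h'
          integer ierr, myrank
          double precision mysum, total
          call MPI_INIT(ierr)
          call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
    c  each process contributes one value; here just its rank plus one
          mysum = dble(myrank + 1)
    c  every process receives the reduced result in total
          call MPI_ALLREDUCE(mysum, total, 1, MPI_DOUBLE_PRECISION,
         +                   MPI_SUM, MPI_COMM_WORLD, ierr)
          print *, 'rank', myrank, ' total =', total
    c  for the product or the maximum, replace MPI_SUM by MPI_PROD or MPI_MAX
          call MPI_FINALIZE(ierr)
          end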

  • Write a Run Script

    Modify your run script from the first lab. Get it to run. Uncomment the MPI_Barrier and compare the times; a sketch of the usual timing pattern is below.
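
    Here is a minimal sketch of that timing pattern (the variable names are illustrative, not necessarily those in your code): the barrier synchronizes the processes so they all start the clock together, which is why commenting it in or out can change the measured times.

          program timebar
          include 'mpif.h'
          integer ierr, myrank
          double precision t0, t1
          call MPI_INIT(ierr)
          call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
    c  synchronize so all processes enter the timed region together
          call MPI_BARRIER(MPI_COMM_WORLD, ierr)
          t0 = MPI_WTIME()
    c  ... the communication being timed would go here ...
          call MPI_BARRIER(MPI_COMM_WORLD, ierr)
          t1 = MPI_WTIME()
          if (myrank .eq. 0) print *, 'elapsed time =', t1 - t0
          call MPI_FINALIZE(ierr)
          end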

  • Wouldn't it be great if we could write prototype communication codes in matlab?

    Actually, the most commonly used engineering language may not be C or Fortran. It's probably Matlab. Why? Because we can first try out our code and see if it does what we want it to do, and then, when we're happy with it but need it to run faster, convert it to Fortran or C.

    So wouldn't it be great if we could parallelize our Matlab code with MPI calls? Then we could easily try out communication patterns. And actually it turns out we can. We've installed the bcMPI package from the Ohio Supercomputer Center. For help in using bcMPI, see the bcMPI page.

  • In-Cache Floating Point Operations

    Usually, a good introduction to parallel computing should spend a good deal of time on how to get code to run fast. For a short course this is tough, but from here on is some more material introducing the topic. (For more material on topics I think are important, you can look through other lectures from the CSC 783 course I taught a few years ago. These are on-line, under previous courses.)

    The following is mainly based on Chapter 5 of Goedecker and Hoisie and is one of the CSC 783 lectures.

    I sometimes think that the main way to speed up code, serial or parallel, is to keep data local: in fast cache, in main cache, or in RAM, or, for parallel code, already "prefetched" from some other processor's memory.

    But keeping data local doesn't guarantee full-speed performance. It's really very easy to have your data local and still have slow code. As you've now seen, even an in-cache matrix vector multiply may not run at the full processor speed. Having profiled to see what part of the code takes a good bit of time, it's worth concentrating on that part and asking: if the data were in cache, could we make it go fast? Some would say this is "too hard", but usually I've found that if a class keeps playing with something, after a while its speed improves a lot. And the Goedecker and Hoisie tricks are not that hard to play -- and can help a lot.

    A caveat: for scientific computations we tend to count only floating point operations, but in general integer operations are also important. For example, in a sparse matrix vector multiply we store only the nonzero entries of the matrix along with integer index vectors. In the simplest scheme there is an integer vector whose kth entry gives the column of the kth stored element of the matrix, and another integer vector whose kth entry gives its row.

    If the three vectors are afloat, indx, and indy, then the element afloat(k) corresponds to the element A(indx(k),indy(k)) of the original matrix. So the integer computations for indexing become important in a matrix vector multiply; a sketch of such a multiply is below.
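
    Here is a sketch of the resulting matrix vector multiply y = A*x (a hypothetical subroutine written for this handout, using the afloat, indx, indy convention above, with nnz stored nonzeros):

          subroutine coomv(n, nnz, afloat, indx, indy, x, y)
          implicit none
          integer n, nnz, i, k
          integer indx(nnz), indy(nnz)
          real*8 afloat(nnz), x(n), y(n)
    c  clear y, then accumulate one stored nonzero at a time;
    c  note the integer indexing work done on every pass
          do i = 1, n
             y(i) = 0.d0
          end do
          do k = 1, nnz
             y(indx(k)) = y(indx(k)) + afloat(k)*x(indy(k))
          end do
          return
          end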

    But at least let's start with making in-cache flops go fast.

  • Fused Multiply Adds

    Suppose we can do the operation
    >a = b+c*d
    in one clock cycle. Then the sum
    >a = b+ c*d + e*f
    can be rewritten in two lines as
    >a = b+ c*d
    >a = a + e*f
    which could be 2 clock cycles. But if the original sum is instead written as

    >a = e*f + b+ c*d
    Then it's pretty likely that the compiler will execute it as
    >a = e*f
    >a = a + b
    >a = a + c*d

    which looks like 3 clock cycles. But actually there is also a dependency, so there's likely to be an extra delay between each operation. So perhaps the first version takes 3 clock cycles and the second one 5? Often, if we have a loop, we can eliminate the dependency, as in the sketch below.
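
    For instance, a running sum like s = s + x(i) carries a dependency from each iteration to the next. Splitting it into two partial sums gives the processor independent streams to overlap (a sketch with invented names; it assumes n is even, otherwise a clean-up step is needed):

          real*8 function sum2(x, n)
          implicit none
          integer n, i
          real*8 x(n), s0, s1
    c  two independent partial sums instead of one dependent running sum
          s0 = 0.d0
          s1 = 0.d0
          do i = 1, n, 2
             s0 = s0 + x(i)
             s1 = s1 + x(i+1)
          end do
          sum2 = s0 + s1
          return
          end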

  • Getting instruction-level parallelism

    Hopefully, the chip and compiler designers are aware of these same issues and are automating some of the following with thread-level parallelism and the ability to do "out-of-order" computation. The following example suggests the compiler and chip designers haven't quite managed it yet . . .

    Consider a sum computation.

    
          program getsum
    
    C  This is a driver program to test Mflop rate for a sum 
          implicit none
          integer lda, n, i, j,k, idum     
          parameter(lda=2000)
          real*8 a(lda,lda),x(lda)  
          real*8  temp,mflops,sum ,tot ,t 
          real*4 ftime, pretim, entim
          external ftime
          
          print*, 'input n'
          read*, n
    
    c  Initialize matrices a and b
          do j = 1,n
            do i = 1,n
               a(i,j) =  1.d0/(i+2*j-1) 
            end do
          end do 
       
          pretim = ftime()
    c
    c  Write your matrix vector multiply here 
    c  
          do k = 1, 200  
          do j = 1,n
             sum = 0.d0 
             do i=1,n
                t = a(i,j) 
    c           sum = sum + abs( a(i,j) ) 
                sum = sum + t 
             end do
             x(j) = sum 
             tot = x(j) + sum 
          end do  
          end do 
          entim = ftime() - pretim 
          print* , 'sum =', sum 
          print*,' entim =', entim 
          if (entim.lt.1.e-6) then
              print*,' too fast to time'
          else
    c 
    c  The next formula assumes the matrix was square n by n
    c   Possibly, you will have to repeat the matvec
    c   k times to get a timing.  If so you will want 
    c   to multiply the RHS of the equation below by k.  
    c 
           mflops = k* (n)**2/entim/1.d6
            print*,' Mflop rate of', mflops
          endif
          stop
          end
    
    This ran at around 100 Mflops. (One problem: the timer kept returning very small nonzero numbers, so a less-than test was put in.) So, imitating the suggestion of Goedecker and Hoisie, let's "unroll" the loop.
      
          program getsum 
    
    C  This is a driver program to test Mflop rate for a sum 
          implicit none
          integer lda, n, i, j,k, idum     
          parameter(lda=2000)
          real*8 a(lda,lda),x(lda)  
          real*8  temp,mflops,sum ,tot ,t ,sum1,sum2,sum3
          real*4 ftime, pretim, entim
          external ftime
          
          print*, 'input n'
          read*, n
    
    c  Initialize matrices a and b
          do j = 1,n
            do i = 1,n
               a(i,j) =  1.d0/(i+2*j-1) 
            end do
          end do 
       
          pretim = ftime()
    c
    c  Write your matrix vector multiply here 
    c  
          do k = 1, 200  
          do j = 1,n
             sum = 0.d0 
             sum1 = 0.d0
             sum2 = 0.d0 
             sum3 = 0.d0 
             do i=1,n,4
                sum = sum + ( a(i,j) ) 
                sum1 = sum1 + ( a(i+1,j) ) 
                sum2 = sum2 + ( a(i+2,j) ) 
                sum3 = sum3 + ( a(i+3,j) ) 
             end do
             x(j) = sum1+sum2+sum3+sum 
             tot = x(j) + tot 
          end do  
          end do 
          entim = ftime() - pretim 
          print* , 'sum =', sum 
          print*,' entim =', entim 
          if (entim.lt.1.e-6) then
              print*,' too fast to time'
          else
    c 
    c  The next formula assumes the matrix was square n by n
    c   Possibly, you will have to repeat the matvec
    c   k times to get a timing.  If so you will want 
    c   to multiply the RHS of the equation below by k.  
    c 
            mflops = k* (n)**2/entim/1.d6
            print*,' Mflop rate of', mflops
          endif
          stop
          end
    
    These typically ran at 200 Mflops, leading one to believe that the Xeon can handle two independent streams. Some surprises:
  • It made no difference to the speed if I put an abs(a(i,j)) in each computation.
  • I tried these with the -O4 option of the pgi compiler. That should have done the loop unrolling itself, but it made no difference.
  • Next I tried it with the intel ifc (now ifort) compiler. There I got in-cache performance of about 1000 to 1200 Mflops (for matrices of size 100 or so, adding up a thousand times to get a long enough time to clock). The first version without the loop-unrolling was again about half speed.

    Note that only adds are performed. Let's try putting a multiply in also so that the computational line is
    >sum = sum + a(i,j)*a(i,j)
    One might expect that an add and a multiply could then be performed on each step. The flop rate for the ifc compiler was then about 20% faster (1200 to 1500 Mflops). This is faster than the rate at which the data could be sucked from RAM, but still not nearly as fast as the advertised 6 Gflops.

  • Software Pipelining

    In the above program, we're getting only about 1/5 peak speed for the calculation.
                sum = sum +  a(i,j) * a(i,j)  
                sum1 = sum1 +  a(i+1,j)  * a(i+1,j)  
                sum2 = sum2 +  a(i+2,j)  * a(i+2,j)  
                sum3 = sum3 +  a(i+3,j)  * a(i+3,j) 
    

    where presumably the machine is actually executing
                a0  = a(i,j)       
                sum = sum +  a0 * a0  
                a1  = a(i+1,j) 
                sum1 = sum1 +  a1  * a1 
                a2   =  a(i+2,j)  
                sum2 = sum2 +  a2  * a2 
                a3   =  a(i+3,j)  
                sum3 = sum3 +  a3  * a3 
    
    This is for the case that the matrix A is small enough (say 80 by 80, which is 80*80*8 = 51200 bytes) to fit easily within cache memory. So apparently the machine is not able to do flops on every clock cycle. Since there are no "stores" needed, with sum, sum1, sum2, sum3 fitting in registers, it must be that the loads of a(i,j) are not occurring fast enough, or that the load operation has a delay before the multiply-add can occur.

    There are a couple of possible solutions. One is to use a greater depth of loop-unrolling: perhaps instead of 4 we should unroll to a depth of 12? The authors give an example where loop-unrolling of depth 12 worked very well on an Alpha chip.

    Possibly we should make the code more explicit by introducing some new variables and a prefetch, so there is no dependency on the previous step:

                a2  = a(i+2,j) 
                sum = sum +  a0 * a0  
                a3 = a(i+3,j) 
                sum1 = sum1 +  a1  * a1  
                a0 = a(i+4,j) 
                sum2 = sum2 +  a2  * a2 
                a1 = a(i+5,j)  
                sum3 = sum3 + a3 * a3 
    
    where the loop contents are overlapped to avoid the dependency (making the first and last iterations a bit more complicated).

    A solution that seems to work on Pentiums in the middle of a matrix vector multiply is to loop unroll in both the i and j directions. Then the machine can be loading several columns of A at the same time.

  • Ratio of Floating Point Ops to Memory Accesses

    If we write a matrix vector multiply as (Fortran; for C, transpose the discussion)

         do i=1,n
            do j=1,m 
               b(i) = b(i) + a(i,j) * x(j)
            end do
     end do
    

    then on each step of the inner loop we need a new value of x(j), which, since x is stored sequentially, has probably already been fetched (when x(j-1) was fetched). But we are accessing a by rows, while Fortran stores a by columns, so a(i,j) is probably not around. Also we keep reloading b(i). For an in-cache calculation we would be likely to have b(i) in L1 cache and a(i,j) in L2 cache, so fetching b(i) is not so bad. Switching the order of the loops
     do j=1,m
        do i=1,n 
               b(i) = b(i) + a(i,j) * x(j)
            end do
         end do 
    
    then on each step of the inner loop we need both a new value of a(i,j) and a new value of b(i). We already have x(j) in a register. Fortunately, a(i,j) and b(i) were probably fetched along with the previous step's values. But still, to go through one inner loop we have to load an entire column of a and also the entire vector b; i.e., b has to be reloaded m times (presumably only from L1 cache, or for a large enough problem from L2 cache).

    But if we block the code then we can "usually" have all the elements of a, b, and x ready, and we don't have to reload the elements of b all m times.

     do j=1,m,4
        do i=1,n,4
           b(i) = b(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1) +  a(i,j+2)*x(j+2) +
    +                             a(i,j+3)*x(j+3)
           b(i+1) = b(i+1) + a(i+1,j)*x(j)+a(i+1,j+1)*x(j+1)+ a(i+1,j+2)*x(j+2) +
    +                             a(i+1,j+3)*x(j+3)
           b(i+2) = b(i+2) + a(i+2,j)*x(j)+a(i+2,j+1)*x(j+1)+ a(i+2,j+2)*x(j+2) +
    +                             a(i+2,j+3)*x(j+3)
           b(i+3) = b(i+3) + a(i+3,j)*x(j)+a(i+3,j+1)*x(j+1)+ a(i+3,j+2)*x(j+2) +
    +                             a(i+3,j+3)*x(j+3)
        end do
     end do 
    c      And of course we need some clean-up steps when m and n are not 
    c           multiples of 4
    

    Here b is reloaded only m/4 times.

  • Register Spills

    On machines with 32 registers, it made some sense to worry about when loops got too long to fit in the registers and spilled out; e.g., this might matter on the p690. The Xeon has only 8 registers, so we'll "always" spill. But if we can write loops which maximize the amount of data that stays in L1 cache, that can be a big help. Similarly, it's better to keep data in L2 than to have it spill into DRAM ...

  • A bit more on loop unrolling

    Loop-unrolling has been used for fifteen or twenty years. Prominently, Dongarra was able to use the first Crays to full potential by developing vector operations as the BLAS-1 library. Now we need the trick just to get reasonable performance. These days loop-unrolling is performed by many optimizing compilers, and it is often part of the default optimization. But the loops are typically unrolled in only one dimension.

    It's hard to choose optimal loop-unrolling parameters automatically, and many times compilers get it wrong. A priori I tend to think that blocks such as the above work best if they are more or less square, like the 4 by 4 example, but exhaustive searches for the best loops lead to long skinny blocks. (Clint Whaley's ATLAS BLAS automatically chooses the "best" blocking, or nearly the best, actually by timing some choices. There's also a Berkeley competitor, but in my experience it's less robust. Philosophically, the Berkeley BLAS works more from first principles for a given machine, while the ATLAS BLAS is empirical, timing various options and choosing the best.)

    If you unroll by hand, don't forget to work out the "remainder" pieces carefully. It's also a good time to compare various levels of unrolling. As an example of when you might unroll by hand, consider matrix operations that don't quite fit the BLAS routines. For instance, I need to produce the conjugate of a complex vector, and my PGI Fortran program will do only 20 million conjugates a second; a hand-unrolled sketch is below.
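
    Here is a hypothetical sketch (the subroutine name and the unrolling depth are invented for illustration, not the author's code) of that conjugate loop unrolled by 4, with the clean-up loop for the remainder:

    c  conjugate a complex*16 vector in place, unrolled by 4
    c  (dconjg is a widely supported extension for complex*16)
          subroutine conj4(z, n)
          implicit none
          integer n, i, nend
          complex*16 z(n)
          nend = n - mod(n,4)
          do i = 1, nend, 4
             z(i)   = dconjg(z(i))
             z(i+1) = dconjg(z(i+1))
             z(i+2) = dconjg(z(i+2))
             z(i+3) = dconjg(z(i+3))
          end do
    c  clean up the remainder when n is not a multiple of 4
          do i = nend+1, n
             z(i) = dconjg(z(i))
          end do
          return
          end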

  • Aliasing

    Aliasing is when data in a given memory location is referred to under two different names. In C this is allowed. In Fortran, compilers are allowed to assume this does not occur. Traditionally, this has meant that Fortran compilers have more information and hence can perform optimizations that would not be allowable for a C compiler.

    So in fortran, if A is an array, then typically the call

    > call myfoo(a,a,4)

    would not be allowed, because the array A would be referenced under two different names inside

    > subroutine myfoo(a,b,n)

    The non-aliasing rule is not enforced in Fortran, but users who use aliasing may find their code is not portable.

    Because aliasing is not allowed, the Fortran compiler writer has a little better chance to produce efficient code. That being said, C is close to the hardware, and the ATLAS BLAS, for example, is written in C. A small illustration of why aliasing matters is sketched below.
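
    As a hypothetical illustration (the subroutine below is invented for this example): because a Fortran compiler may assume the two array arguments do not overlap, it is free to keep values in registers and reorder loads and stores across the loop; an aliased call breaks that assumption.

          subroutine update(x, y, n)
          implicit none
          integer n, i
          real*8 x(n), y(n)
          do i = 1, n
    c  the compiler may assume the store into y(i) cannot change any x(j)
             y(i) = y(i) + 2.d0*x(i)
          end do
          return
          end

    c  fine:                   call update(a, b, n)
    c  aliased (nonstandard):  call update(a, a, n)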

  • Fortran 90, overloading functions, avoiding interrupts

    There are some nice operations in Fortran 90. For example you can add two vectors by
    > c = a(1:40) + b(1:40)
    which is much nicer for the programmer than writing a loop. But when it comes time to optimize, you will probably convert most of these f90 operations to BLAS calls (at least in frequently used portions of the code), especially for operations repeated many times on short vectors or small matrices; a sketch of such a conversion follows.
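
    For example, a vector update written with Fortran 90 array syntax maps directly onto the BLAS-1 routine daxpy. A sketch (it assumes a BLAS library, such as ATLAS or the vendor BLAS, is linked in):

          program tryaxpy
          implicit none
          integer n, i
          parameter (n=40)
          real*8 alpha, x(n), y(n)
          do i = 1, n
             x(i) = dble(i)
             y(i) = 1.d0
          end do
          alpha = 2.d0
    c  the array-syntax update  y(1:n) = y(1:n) + alpha*x(1:n)  becomes
          call daxpy(n, alpha, x, 1, y, 1)
          print *, 'y(1) =', y(1), '  y(n) =', y(n)
          end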

    Also, Fortran 90 and C++ allow operator overloading. This can make code writing easier, but for performance note that temporary space may be allocated at run time. That's a time-consuming task, perhaps comparable to an interrupt.

    System interrupts, e.g., for floating point exceptions, may be known to be harmless. For example, underflow may just mean zero. It's possible to test for NaNs (not-a-numbers) and take another branch if these occur. When I was trying out the "perfect" code a few sections further on, I found that I could run with NaNs at about 2 million (meaningless) flops per second. But putting in an "if" to test for each NaN would be slower.

    Modern chips cope better with branches than they used to. But even going straight ahead they are usually running at only 5 or 10 percent of peak, so having lots of branches "ready on spec" only means you'll continue along at 5% (which is of course better than the alternative of just stalling while everything for the step after the branch is fetched).

    Does the branch prediction paradigm have ramifications for the C (or Perl) style of putting many subroutine calls inside an "if" error handler?

    Type conversions cost more than flops.

    Conjugations or changing sign cost more than flops?

    Complex arithmetic is good in that it has twice as many flops per byte of memory accessed.

    Special function evaluations are usually costly. Two approaches to saving time are to use interpolation or to find a fast but not quite as accurate library (or to combine the two); an interpolation sketch is below.
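
    As a hypothetical sketch of the interpolation approach (the table size and the choice of exp on [0,1] are just for illustration): build a table once, then replace each special function call by a linear interpolation, trading some accuracy for speed.

          program tabexp
          implicit none
          integer ntab, k
          parameter (ntab=1024)
          real*8 tab(0:ntab), h, x, w, expapp
    c  build the table of exp at ntab+1 equally spaced points on [0,1]
          h = 1.d0/ntab
          do k = 0, ntab
             tab(k) = exp(k*h)
          end do
    c  approximate exp(x) for 0 .le. x .le. 1 by linear interpolation
          x = 0.3d0
          k = int(x/h)
          if (k .ge. ntab) k = ntab - 1
          w = (x - k*h)/h
          expapp = (1.d0 - w)*tab(k) + w*tab(k+1)
          print *, 'approx', expapp, '   exact', exp(x)
          end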

    There is an overhead for calling a subroutine. Hoisie and Goedecker claim it is about the same as the overhead of incrementing a loop, about ten cycles on the machine they tested. Many compilers have flags to in-line designated subroutines. An additional hazard of subroutines and functions in C and Fortran 90 is that many variables may have local scope, so a private copy is made for the local function. A possible advantage is code safety, but it does take time. On some Fortran 90 compilers, subarray notation causes a copy to be made when calling a subroutine.

  • The Perfect Code?

    There are 3 different BLAS packages on the blades that can achieve more than 5 Gflops. The advertised peak speed is 6 Gflops. The following program imitates one given in Goedecker and Hoisie; they were able to get peak speed with it on an IBM Power architecture. So let's try it on the BladeCenter.

    I do timings by first getting an interactive shell by

    >bsub -Is -W 5 csh

    which gives me an interactive shell so that I can then just type the name of the executable. The bsub command gives me a processor of my own to play with, so I can have more faith that I'm getting the whole processor. Also, the burst of time I take doesn't delay someone else on the head node. If the name of the executable file is peak, then I can run it by typing

    >./peak

    The program should run at peak speed? It has lots of nicely paired multiply-adds. There are 10 variables, so it doesn't quite fit in the 8 registers. But no statement depends on the previous one, so surely the compiler and chip should be able to do a couple of flops per clock cycle?

     
    
          program peak 
    
    C  This is a "peak" performance try.  
          implicit none
          integer lda, n, i, j,k, idum     
          parameter(lda=2000)
          real*8 x1,x2,x3,x4,y1,y2,y3,y4,s,t, mflops, mean 
          real*4 ftime, pretim, entim
          external ftime
          
          print*, 'input n'
          read*, n
    
    c  Initialize matrices a and b
          x1 = 1.d-1
          x2 = 5.d-1
          x3 = 4.d-1
          x4 = 1.d-1
          y1 = 9.d-1
          y2 = 2.d0
          y3 = 1.d0
          y4 = 2.d0 
          s  = -.125d-4
          t  = .125d-4 
        
       
          pretim = ftime()
    c
    c  Write the fast flops here 
    c   (lots of fused add, mutliplies, no new data, 
    c        no dependencies) 
    c   But on the Xeon, even this will spill the registers. 
    c  
          do j=1,1000*n
             do i=1,n
                y1 = x1 + s*x2 + t*x3
                y2 = x2 + s*x1 + t*x4
                y3 = x3 + s*x4 + t*x1
                y4 = x4 + s*x3 + t*x2
    
                x1 = y1 + s*y2 + t*y3
                x2 = y2 + s*y1 + t*y4
                x3 = y3 + t*y1 + s*y4
                x4 = y4 + t*y2 + s*y3
             end do 
    c  Some extra stuff to make sure we don't overflow 
             mean = abs(x1)+abs(x2)+abs(x3)+abs(x4)+
         +                     abs(y1)+abs(y2)+abs(y3)+abs(y4)
             mean = 1.d0/mean
             x1 = x1*mean 
             x2 = x2*mean
             x3 = x3*mean 
             x4 = x4*mean
             y1 = y1*mean
             y2 = y2*mean
             y3 = y3*mean
             y4 = y4 *mean 
          end do 
             
          entim = ftime() - pretim 
          print* , 'x1 =', x1
          print*,' entim =', entim 
          if (entim.lt.1.e-4) then
              print*,' too fast to time'
          else
    c 
    c  The next formula assumes the matrix was square n by n
    c   Possibly, you will have to repeat the matvec
    c   k times to get a timing.  If so you will want 
    c   to multiply the RHS of the equation below by k.  
    c 
             mflops = (32.d0*(1.d3*n)*n + 16*(n*1.d3) )/entim/1.d6
             print*,' Mflop rate of', mflops
          endif
          stop
          end
    

    Of course, one needs lots of repeats to get above the one-hundredth-of-a-second resolution of the Linux clock function. And at first the repeats caused x1, x2, etc. to overflow the floating point range, so the initial rate was 2 Mflops (of meaningless flops).

    So the averaging was put in to pull the numbers back toward one every so often. Then the rate was about 1100 Mflops for the Intel ifc compiler. The pgf90 compiler was actually faster, at 1300 Mflops.

    So let's simplify the code and then see what the assembler looks like. Here's a simplified version

     
    
          program peak 
    
    C  This is a "peak" performance try.  
          implicit none
          integer lda, n, i, j,k, idum     
    c     parameter(lda=100)
          real*8 x1,x2,x3,x4,y1,y2,y3,y4,s,t, mflops, mean 
          
    
    c  Initialize matrices a and b
          n = 100 
          x1 = 1.d-1
          x2 = 5.d-1
          x3 = 4.d-1
          x4 = 1.d-1
          y1 = 9.d-1
          y2 = 2.d0
          y3 = 1.d0
          y4 = 2.d0 
          s  = -.125d-4
          t  = .125d-4 
        
       
    c
    c  Write the fast flops here 
    c   (lots of fused add, mutliplies, no new data, 
    c        no dependencies) 
    c   But on the Xeon, even this will spill the registers. 
    c  
          do j=1,1000*n
             do i=1,n
                y1 = x1 + s*x2 + t*x3
                y2 = x2 + s*x1 + t*x4
                y3 = x3 + s*x4 + t*x1
                y4 = x4 + s*x3 + t*x2
    
                x1 = y1 + s*y2 + t*y3
                x2 = y2 + s*y1 + t*y4
                x3 = y3 + t*y1 + s*y4
                x4 = y4 + t*y2 + s*y3
             end do 
    c  Some extra stuff to make sure we don't overflow 
          end do 
             
    c 
    c  The next formula assumes the matrix was square n by n
    c   Possibly, you will have to repeat the matvec
    c   k times to get a timing.  If so you will want 
    c   to multiply the RHS of the equation below by k.  
    c 
          stop
          end
    

    Compiling with

    >pgf77 -S peak3.f

    gives the assembler language code produced by the PGI f77 compiler.

    
    	.file	"peak3.f"
    	.version	"01.01"
    # PGFTN 5.0 -opt 1 -norecursive
    # PGFTN 08/26/2004  12:54:24
    # pgf77 peak3.f -S
    # /usr/local/pgi/linux86/5.0/bin/pgftn
    # peak3.f -opt 1 -terse 1 -inform warn -x 119 0xa10000 -x 119 0x100000 -x 122 0x40
    # -x 123 0x1000 -x 127 4 -x 119 0x40000000 -x 80 0x300 -y 80 0x1000 -x 80 0x40000000
    # -x 51 0x20 -x 124 0x1401 -x 80 0x300 -y 80 0x1000 -x 80 0x40000000 -astype 0
    # -stdinc /usr/local/pgi/linux86/5.0/include:/usr/local/include:/usr/lib/gcc-lib/i386-redhat-linux/2.96/include:/usr/lib/gcc-lib/i386-redhat-linux/2.96/include:/usr/include
    # -def unix -def __unix -def __unix__ -def linux -def __linux -def __linux__
    # -def __inline__= -def i386 -def __i386 -def __i386__ -def __NO_MATH_INLINES
    # -def linux86 -def __THROW= -cmdline +pgf77 peak3.f -S -x 124 1 -asm peak3.s
    # lineno: 1
    
    	.text
    	.align	4
    	.long	..EN1_296-MAIN_+0xc8000000
    	.align	16
    	.globl MAIN_
    MAIN_:
    	pushl	%ebp
    	movl	%esp,%ebp
    	subl	$32,%esp
    	movl	%ebx,12(%esp)
    	movl	%edi,8(%esp)
    ..EN1_296:
    # lineno: 0
    
    	movl	$100,%eax
    	movl	%eax,.BSS1
    	fldl	.C1_316
    	fstl	.BSS1+8
    	fldl	.C1_295
    	fstpl	.BSS1+16
    	fldl	.C1_317
    	fstpl	.BSS1+24
    	fstpl	.BSS1+32
    	fldl	.C1_318
    	fstpl	.BSS1+40
    	fldl	.C1_294
    	fstl	.BSS1+48
    	fld1
    	fstpl	.BSS1+56
    	fstpl	.BSS1+64
    	fldl	.C1_320
    	fstpl	.BSS1+72
    	fldl	.C1_319
    	fstpl	.BSS1+80
    	movl	$100000,%edi
    	movl	$1,%edx
    	movl	%edx,.BSS1+88
    	testl	%edi,%edi
    	jle	.LB1_324
    .LB1_322:
    # lineno: 30
    
    	movl	.BSS1,%ebx
    	movl	$1,%eax
    	movl	%eax,.BSS1+92
    	testl	%ebx,%ebx
    	jle	.LB1_327
    .LB1_325:
    # lineno: 31
    
    	fldl	.BSS1+24
    	fldl	.BSS1+80
    	fld	%st(1)
    	fmul	%st(1),%st
    	fldl	.BSS1+16
    	fldl	.BSS1+72
    	fld	%st(1)
    	fmul	%st(1),%st
    	fldl	.BSS1+8
    	fstl	-8(%ebp)
    	fadd	%st,%st(1)
    	fxch	%st(1)
    	faddp	%st,%st(4)
    	fxch	%st(3)
    	fstl	.BSS1+40
    	fldl	.BSS1+32
    	fstl	-16(%ebp)
    	fld	%st(5)
    	fmul	%st(1),%st
    	fxch	%st(5)
    	fmul	%st(3),%st
    	fadd	%st(4),%st
    	faddp	%st,%st(5)
    	fxch	%st(4)
    	fstl	.BSS1+48
    	fldl	-8(%ebp)
    	fmul	%st(6),%st
    	fxch	%st(5)
    	fmul	%st(3),%st
    	fadd	%st(7),%st
    	faddp	%st,%st(5)
    	fxch	%st(4)
    	fstl	.BSS1+56
    	fxch	%st(3)
    	fmul	%st(5),%st
    	fxch	%st(6)
    	fmul	%st(2),%st
    	fldl	-16(%ebp)
    	faddp	%st,%st(1)
    	faddp	%st,%st(6)
    	fxch	%st(5)
    	fstl	.BSS1+64
    	fld	%st(4)
    	fmul	%st(3),%st
    	fld	%st(2)
    	fmul	%st(5),%st
    	fadd	%st(7),%st
    	faddp	%st,%st(1)
    	fstpl	.BSS1+8
    	fld	%st(4)
    	fmul	%st(1),%st
    	fld	%st(2)
    	fmul	%st(7),%st
    	fadd	%st(5),%st
    	faddp	%st,%st(1)
    	fstpl	.BSS1+16
    	fld	%st(1)
    	fmul	%st(1),%st
    	fxch	%st(6)
    	fmul	%st(5),%st
    	fadd	%st(3),%st
    	faddp	%st,%st(6)
    	fxch	%st(5)
    	fstpl	.BSS1+24
    	fmulp	%st,%st(1)
    	fxch	%st(1)
    	fmulp	%st,%st(2)
    	fxch	%st(1)
    	faddp	%st,%st(2)
    	faddp	%st,%st(1)
    	fstpl	.BSS1+32
    	movl	.BSS1+92,%eax
    	addl	$1,%eax
    	movl	%eax,.BSS1+92
    	subl	$1,%ebx
    	jg	.LB1_325
    .LB1_327:
    # lineno: 41
    
    	movl	.BSS1+88,%eax
    	addl	$1,%eax
    	movl	%eax,.BSS1+88
    	subl	$1,%edi
    	jg	.LB1_322
    .LB1_324:
    # lineno: 43
    
    	xorl	%eax,%eax
    	movl	%eax,4(%esp)
    	movl	%eax,(%esp)
    	call	ftn_stop
    # lineno: 52
    
    	movl	-20(%ebp),%ebx
    	movl	-24(%ebp),%edi
    	movl	%ebp,%esp
    	popl	%ebp
    	ret
    	.type	MAIN_,@function
    	.size	MAIN_,.-MAIN_
    	.data
    	.align	8
    .C1_319:
    	.long	0xeb1c432d, 0x3eea36e2	#  1.25000000000000006E-5
    .C1_320:
    	.long	0xeb1c432d, 0xbeea36e2	# -1.25000000000000006E-5
    .C1_294:
    	.long	0x0, 0x40000000	#  2.00000000000000000E+0
    .C1_318:
    	.long	0xcccccccd, 0x3feccccc	#  9.00000000000000022E-1
    .C1_317:
    	.long	0x9999999a, 0x3fd99999	#  4.00000000000000022E-1
    .C1_295:
    	.long	0x0, 0x3fe00000	#  5.00000000000000000E-1
    .C1_316:
    	.long	0x9999999a, 0x3fb99999	#  1.00000000000000006E-1
    # STATIC VARIABLES
    	.data
    	.local	.BSS1
    	.comm	.BSS1,96,8
    # COMMON BLOCKS
    
    Looking at the assembler, we see a correspondence to the Fortran code line numbers. There are 65 assembler code lines corresponding to the 32 flops inside the block. So if there is a flop every other clock cycle, it looks as if one assembler instruction is being executed every clock cycle. Corresponding to the 16 floating point adds are 6 fadd instructions and 10 faddp; to the 16 floating point multiplies correspond 14 fmul and 2 fmulp instructions. So far my code rearrangements have not helped. Here's a reference to Intel assembler: Pentium Assembler

    Conclusion: it's going to be tough to get these processors to run faster than about 1200 Mflops (20% of the advertised peak). So when we look at code efficiency we may often be satisfied if we can get 600 Mflops (10% of peak) per processor.

    In comparison, I saw an estimate from a German professor of scientific computing that the average code at a high performance computing site runs at 1% of peak efficiency.

  • Possible Further Work

    Rework your matrix vector multiplies. Try loop unrolling in both the row and column directions and see if you can get nearer to the expected in-cache performance of 1200 Mflops. Also, can you find good compiler optimizations?

    Alternatively, you could explain how to make the "perfect" code run at the advertised rate.

    Here's a reference from Intel: Speeding up Matrix Vector Multiplication