Experimenting with Communication Times, Collective Communications, and Serial Speedups

- **Editing Files in UNIX**
- **Other resources on these web pages**
- **Using a Scatter and a Reduce**
- **Compiling with a Make file**
- **Revise the code**
- **Write a Run Script**
- **Prototyping MPI Communication Patterns in Matlab**
- **In-Cache Floating Point Operations**
- **Fused multiply-adds**
- **Getting instruction-level parallelism**
- **Software Pipelining**
- **Ratio of Floating Point Ops to Memory Accesses**
- **A bit more on loop unrolling**
- **A Perfectly Efficient Code?**
- **Further Work**

The vi editor is useful because it is universally available on unix machines and does not require a forwarded xterm (see Lab 1). Perhaps you are not fond of vi and would prefer to think about something other than its commands. Other editors are available. One easy-to-use editor is "nano".

>nano myfile

will edit myfile. Editor commands appear at the bottom of the screen; CTRL-X saves your file and exits. Nano doesn't need a GUI and may be more natural to use than vi. One caveat: nano users sometimes cannot see where lines break, which is inconvenient since new line breaks can change the meaning of code.

Using Windows-edited files in linux or unix often fails because the "new line" symbol differs between Windows and linux. A Windows file "foofile" can be converted to linux format by typing

dos2unix foofile

Many Unix users prefer the emacs editor to vi. It requires forwarding
a GUI to a remote terminal.
If you're logging in via putty from an NCSU lab machine, first start the
xwin program. Then, at the prompt,

>emacs myfile

You might also try the

>emacs -nw myfile

option. This lets you use emacs even if you haven't succeeded in getting a forwarded xterm. Another easy-to-use editor is nedit. It also requires a forwarded xterm.

These labs are not at all complete. The HowTo page at
**HowTo**
has useful information on how to get started, how to manage storage,
commonly encountered errors, etc. More specific information (on
some commonly used software packages) is available from **Software**.

Compile the source files into object files:

mpif90 -c monte2.f ranpak.f

Then link these to produce an executable.

mpif90 monte2.o ranpak.o -o monte2

There are also mpif77, mpicc, and mpiCC commands available. These correspond to pgf77, pgcc, and pgCC, but the "add pgi64_hydra" command also links in the MPI libraries (actually these are MPI-2 libraries). Similarly, we could have invoked "add intel64_hydra". GNU compilers can also be used, though GNU-compiled code typically runs more slowly.

If you want to see how a makefile for this simple problem would look,
try a file makemonte2 consisting of

FC = mpif77

mont:
	$(FC) -c monte2.f ranpak.f
	$(FC) -o monte2 monte2.o ranpak.o

Make sure that the command lines after mont: start with tabs, not blanks!!

make -f makemonte2

will do the compile. If you google on "make + tutorial" you can find many tutorials.

Traditionally, unix users ported codes by modifying make files. More recently, one runs a "configure" script, which automatically adapts the make file to the local compilers, compiler options, and library locations (these are inputs from the user).

Here's a link to a more complete unix/linux set of tutorials
**Professor
Norm Matloff, CS UC Davis Tutorial Center**.

Then use an MPI_ALLREDUCE call to deposit the answer on all the processors. Here's the argument list for an MPI_ALLREDUCE and the Fortran declarations.

MPI_ALLREDUCE(SENDBUF,RECVBUF,COUNT,DATATYPE,OP,COMM,IERROR)

INTEGER COUNT, DATATYPE, OP, COMM, IERROR

IN sendbuf -- choose a data type

OUT recvbuf -- typically choose the same data type as in the sendbuf

IN count integer number of elements in send buffer

IN datatype MPI data type of elements of send buffer

IN op operation -- adding, getting max, etc.

IN comm communicator -- for example MPI_COMM_WORLD

OUT ierror integer, zero if successful
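As a concrete sketch (the variable names mysum, total, and ierr are illustrative, not from the lab), a call that sums each processor's double precision partial result and leaves the total on every rank might look like:

```fortran
      real*8 mysum, total
      integer ierr
c Sum every rank's mysum; all ranks receive the result in total
      call MPI_ALLREDUCE(mysum, total, 1, MPI_DOUBLE_PRECISION,
     +                   MPI_SUM, MPI_COMM_WORLD, ierr)
```

With MPI_SUM replaced by MPI_MAX you would instead get the maximum over the ranks.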

Modify your run script from the first lab. Get it to run. Uncomment the MPI_Barrier. Compare the times.

Actually, the most commonly used engineering language may be neither C nor Fortran; it's probably Matlab. Why? Because we can first try out our code and see if it does what we want it to do, and then, when we're happy with it but need it to run faster, convert it to Fortran or C.

So wouldn't it be great if we could parallelize our Matlab
code with MPI calls? Then we could easily try out communication
patterns. It turns out we can: we've installed
the bcMPI package from the Ohio Supercomputing Center. For help in using bcMPI,
see the page
**bcMPI**.

The following is mainly based on Chapter 5 of Goedecker and Hoisie and is one of the CSC 783 lectures.

I sometimes think that the main way to speed up code, serial or parallel, is to keep data local: in fast cache, or main cache, or RAM, or, in a parallel code, "prefetched" from another processor's memory.

**But keeping data local doesn't guarantee full-speed performance.** It's really very easy to have your data local and still have slow code. As you've now seen,
even an in-cache matrix-vector multiply may not run at the full processor
speed. Having profiled to find the part of the code that takes a good bit of the time,
it's worth concentrating on it and seeing whether, were it in cache, we could make
it go pretty fast. Some would say this is "too hard", but usually I've
found that if a class keeps playing with something, after a while its
speed improves a lot. And the Goedecker and Hoisie tricks are not that
hard to play--and can help a lot.

A caveat.
For scientific computations we tend to count only floating point
operations, but in general integer operations are also important. For
example, in a sparse matrix-vector multiply we store only the nonzero
entries of the matrix, along with integer index vectors. In the simplest
scheme, if the three vectors are afloat, indx, and indy, then the element
afloat(k) corresponds to the element A(indx(k),indy(k)) of the original
matrix. So the integer computations for indexing become important in a
matrix-vector multiply.
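As a sketch, the coordinate-format matrix-vector multiply b = A*x then looks like the following (nnz, the number of stored nonzeros, and the zeroing of b are assumed):

```fortran
c Coordinate-format sparse matvec: afloat(k) holds A(indx(k),indy(k))
      do k = 1, nnz
         b(indx(k)) = b(indx(k)) + afloat(k)*x(indy(k))
      end do
```

Each iteration does one multiply-add but also two integer loads and two indexed memory references, so the integer work is comparable to the floating point work.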

But at least let's start with making in-cache flops go fast. Many chips have a fused multiply-add, so they can compute

>a = b+c*d

in one clock cycle. Then the sum

>a = b+ c*d + e*f

can be rewritten in two lines as

>a = b+ c*d

>a = a + e*f

which could be 2 clock cycles. But if the original sum is instead written as

>a = e*f + b+ c*d

Then it's pretty likely that the compiler will execute it as

>a = e*f

>a = a + b

>a = a + c*d

which looks like 3 clock cycles. But actually, there is also a dependency: each statement needs the result of the previous one, so there's likely to be an extra delay between operations. So perhaps the first version is 3 clock cycles and the second one is 5. Often, if we have a loop, we can eliminate the dependency.
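For instance, a running sum can be split into independent partial sums so that consecutive adds need not wait for one another (a sketch, assuming n is even; s0 and s1 are illustrative names):

```fortran
c Two independent accumulators break the dependency chain
      s0 = 0.d0
      s1 = 0.d0
      do i = 1, n, 2
         s0 = s0 + x(i)
         s1 = s1 + x(i+1)
      end do
      s = s0 + s1
```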

Hopefully, the chip and compiler designers are aware of these same issues and are automating some of the following with instruction-level parallelism and the ability to do "out-of-order" computation. The following example shows the compiler and chip designers haven't entirely managed it yet . . .

Consider a sum computation.

      program getsum
C This is a driver program to test Mflop rate for a sum
      implicit none
      integer lda, n, i, j, k, idum
      parameter(lda=2000)
      real*8 a(lda,lda), x(lda)
      real*8 temp, mflops, sum, tot, t
      real*4 ftime, pretim, entim
      external ftime
      print*, 'input n'
      read*, n
c Initialize matrices a and b
      do j = 1,n
         do i = 1,n
            a(i,j) = 1.d0/(i+2*j-1)
         end do
      end do
      pretim = ftime()
c
c Write your matrix vector multiply here
c
      do k = 1, 200
         do j = 1,n
            sum = 0.d0
            do i = 1,n
               t = a(i,j)
c              sum = sum + abs( a(i,j) )
               sum = sum + t
            end do
            x(j) = sum
            tot = x(j) + sum
         end do
      end do
      entim = ftime() - pretim
      print*, 'sum =', sum
      print*, ' entim =', entim
      if (entim .lt. 1.e-6) then
         print*, ' too fast to time'
      else
c
c The next formula assumes the matrix was square n by n
c Possibly, you will have to repeat the matvec
c k times to get a timing. If so you will want
c to multiply the RHS of the equation below by k.
c
         mflops = k*(n)**2/entim/1.d6
         print*, ' Mflop rate of', mflops
      endif
      stop
      end

This ran at around 100 Mflops. (A problem: the timer kept returning very small nonzero numbers, so a less-than test was put in.) So, imitating the suggestion of Goedecker and Hoisie, let's "unroll" the loop.

      program getsum
C This is a driver program to test Mflop rate for a sum
      implicit none
      integer lda, n, i, j, k, idum
      parameter(lda=2000)
      real*8 a(lda,lda), x(lda)
      real*8 temp, mflops, sum, tot, t, sum1, sum2, sum3
      real*4 ftime, pretim, entim
      external ftime
      print*, 'input n'
      read*, n
c Initialize matrices a and b
      do j = 1,n
         do i = 1,n
            a(i,j) = 1.d0/(i+2*j-1)
         end do
      end do
      pretim = ftime()
c
c Write your matrix vector multiply here
c
      do k = 1, 200
         do j = 1,n
            sum = 0.d0
            sum1 = 0.d0
            sum2 = 0.d0
            sum3 = 0.d0
            do i = 1,n,4
               sum  = sum  + ( a(i,j) )
               sum1 = sum1 + ( a(i+1,j) )
               sum2 = sum2 + ( a(i+2,j) )
               sum3 = sum3 + ( a(i+3,j) )
            end do
            x(j) = sum1 + sum2 + sum3 + sum
            tot = x(j) + tot
         end do
      end do
      entim = ftime() - pretim
      print*, 'sum =', sum
      print*, ' entim =', entim
      if (entim .lt. 1.e-6) then
         print*, ' too fast to time'
      else
c
c The next formula assumes the matrix was square n by n
c Possibly, you will have to repeat the matvec
c k times to get a timing. If so you will want
c to multiply the RHS of the equation below by k.
c
         mflops = k*(n)**2/entim/1.d6
         print*, ' Mflop rate of', mflops
      endif
      stop
      end

These typically ran at 200 Mflops, leading one to believe that the Xeon can handle two independent streams.

**Some surprises**

Note that only adds are performed. Let's try putting a multiply in also
so that the computational line is

>sum = sum + a(i,j)*a(i,j)

One might expect that an add and a multiply could then be performed
on each step.
The flop rate for the ifc compiler was then about 20% faster (from 1200 to 1500
Mflops). This is faster than the rate at which the data could be sucked
from RAM, but still not nearly as fast as the advertised 6 Gflops.

The unrolled loop body is then

      sum  = sum  + a(i,j)   * a(i,j)
      sum1 = sum1 + a(i+1,j) * a(i+1,j)
      sum2 = sum2 + a(i+2,j) * a(i+2,j)
      sum3 = sum3 + a(i+3,j) * a(i+3,j)

where presumably the machine is actually executing

      a0 = a(i,j)
      sum = sum + a0 * a0
      a1 = a(i+1,j)
      sum1 = sum1 + a1 * a1
      a2 = a(i+2,j)
      sum2 = sum2 + a2 * a2
      a3 = a(i+3,j)
      sum3 = sum3 + a3 * a3

This is in the case that the matrix A is small enough (80 by 80, say, which is 80*80*8 = 51200 bytes) to fit easily within cache memory. So apparently the machine is not able to do flops on each clock cycle. Since there are no "stores" needed, with sum, sum1, sum2, sum3 fitting in registers, it must be that the loads of a(i,j) are not occurring fast enough, or that the load operation has a delay before the multiply-add can occur.

There are a couple of possible solutions. One is to use a greater depth of loop-unrolling: perhaps instead of 4, we should unroll to a length of 12. The authors give an example where loop-unrolling of depth 12 worked very well on an Alpha chip.

Possibly we should make the code more explicit by introducing some new variables and a prefetch, so there is no dependency on the previous step:

      a2 = a(i+2,j)
      sum = sum + a0 * a0
      a3 = a(i+3,j)
      sum1 = sum1 + a1 * a1
      a0 = a(i+4,j)
      sum2 = sum2 + a2 * a2
      a1 = a(i+5,j)
      sum3 = sum3 + a3 * a3

where this overlaps the loop iterations to avoid the dependency (making the first and last steps a bit more complicated).

A solution that seems to work on Pentiums, in the middle of a matrix-vector multiply, is to loop-unroll in both the i and j directions. Then the machine can be loading several columns of A at the same time.

If we write a matrix vector multiply as (Fortran, for C transpose the discussion)

      do i = 1,n
         do j = 1,m
            b(i) = b(i) + a(i,j) * x(j)
         end do
      end do

then on each step of the inner loop we need a new value of x(j), which, since x is stored sequentially, has probably already been fetched (when x(j-1) was fetched). But we are accessing a by rows, so a(i,j) is probably not around. Also, we're having to reload b(i). For an in-cache calculation, we would be likely to have b(i) in L1 cache and a(i,j) in L2 cache, so fetching b(i) is not so bad. Switching the order of the loops,

      do j = 1,n
         do i = 1,m
            b(i) = b(i) + a(i,j) * x(j)
         end do
      end do

then on each step of the inner loop we need both a new value of a(i,j) and a new value of b(i). We already have x(j) in a register. Fortunately, a(i,j) and b(i) were probably fetched on the last step. But still, to go through one inner loop we have to load an entire column of a and also the entire vector b; i.e., b has to be reloaded n times (presumably only from L1 cache, or for a large enough problem from L2 cache).

But if we block the code, then we can "usually" have all the needed elements of a, b, and x ready, and we don't have to reload the elements of b n times.

      do j = 1,n,4
         do i = 1,m,4
            b(i)   = b(i)   + a(i,j)*x(j)       + a(i,j+1)*x(j+1)
     &                      + a(i,j+2)*x(j+2)   + a(i,j+3)*x(j+3)
            b(i+1) = b(i+1) + a(i+1,j)*x(j)     + a(i+1,j+1)*x(j+1)
     &                      + a(i+1,j+2)*x(j+2) + a(i+1,j+3)*x(j+3)
            b(i+2) = b(i+2) + a(i+2,j)*x(j)     + a(i+2,j+1)*x(j+1)
     &                      + a(i+2,j+2)*x(j+2) + a(i+2,j+3)*x(j+3)
            b(i+3) = b(i+3) + a(i+3,j)*x(j)     + a(i+3,j+1)*x(j+1)
     &                      + a(i+3,j+2)*x(j+2) + a(i+3,j+3)*x(j+3)
         end do
      end do
c And of course we need some clean-up steps when m, n are not
c multiples of 4

Here b is reloaded only n/4 times.

On 32-register machines, it made some sense to worry about when loops got too long to fit in registers and spilled out; e.g., this might matter on the p690. The Xeon has only 8 registers, so we'll "always" spill. But if we can write loops which maximize the amount of data that stays in L1 cache, that can be a big help. Similarly, it's better to keep data in L2 than have it spill into DRAM ...

Loop-unrolling has been used for fifteen or twenty years; prominently, Dongarra was able to use the first Crays to full potential by developing vector operations as the BLAS-1 library. Now the trick is needed to get reasonable performance, and loop-unrolling is performed by many optimizing compilers, often as part of the default optimization. But the loops are typically unrolled in only one dimension.

It's hard to choose optimal loop-unrolling parameters automatically, and many times compilers get it wrong. A priori, I tend to think that loops such as the above work best when more or less square, like the 4 by 4 example, but exhaustive searches for the best loops lead to long skinny blocks. (Clint Whaley's ATLAS BLAS automatically -- actually by timing some choices -- chooses the "best" or nearly the best. There's also a Berkeley competitor, but in my experience it's less robust. Philosophically, the Berkeley BLAS works more from first principles for a given machine; the ATLAS BLAS is empirical, timing various options and choosing the best.)

If you unroll by hand, don't forget to work out the "remainder" pieces carefully. It's also a good time to compare various levels of unrolling. As an example of when you might unroll by hand, consider matrix operations that don't quite fit the BLAS routines. For example, I need to produce the conjugate of a complex vector, and my PGI Fortran program will do only 20 million conjugates a second.
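A hand-unrolled conjugate loop might be sketched as follows (z and n are hypothetical names; note the remainder loop at the end):

```fortran
c Conjugate a complex*16 vector z of length n, unrolled by 4
      m = n - mod(n,4)
      do i = 1, m, 4
         z(i)   = conjg(z(i))
         z(i+1) = conjg(z(i+1))
         z(i+2) = conjg(z(i+2))
         z(i+3) = conjg(z(i+3))
      end do
c Remainder piece
      do i = m+1, n
         z(i) = conjg(z(i))
      end do
```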

Aliasing is when data in a given memory location is referred to under two different names. In C this is allowed. In Fortran, compilers are allowed to assume this does not occur. Traditionally, this has meant that Fortran compilers have more information and hence can perform optimizations that would not be allowable for a C compiler.

So in fortran, if A is an array, then typically the call

> call myfoo(a,a,4)

would not be allowable because the array A is referenced to have two different meanings inside

> subroutine myfoo(a,b,n)

The non-aliasing rule is not enforced in Fortran, but users who use aliasing may find their code is not portable.

Because aliasing is not allowed, the Fortran compiler writer has a little better chance to produce efficient code. That being said, C is close to the hardware, and the ATLAS BLAS, for example, is written in C.
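A sketch of how the non-aliasing assumption can bite (myfoo here is a made-up shift routine):

```fortran
      subroutine myfoo(a, b, n)
      integer n, i
      real*8 a(n), b(n)
c A one-place shift.  A Fortran compiler may assume a and b are
c distinct and vectorize this loop.  If the caller aliases them,
c as in  call myfoo(a, a, 4),  the strictly sequential loop fills
c b(2:n) with a(1), while the vectorized loop performs a shift --
c two different answers.
      do i = 2, n
         b(i) = a(i-1)
      end do
      return
      end
```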

There are some nice operations in Fortran 90. For example you can add
two vectors by

> c = a(1:40) + b(1:40)

which is much nicer for the programmer than writing a loop. But when
it comes time to optimize, you will probably convert most of these f90
operations to BLAS calls (at least in frequently used portions of the code),
especially for operations repeated many times on short vectors or small matrices.
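For example, an update like y(1:n) = y(1:n) + 2.d0*x(1:n) maps onto the Level-1 BLAS routine daxpy, which computes y := a*x + y (a sketch; x, y, and n are illustrative):

```fortran
c Replace the f90 array expression with a tuned BLAS call
      call daxpy(n, 2.d0, x, 1, y, 1)
```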

Also, Fortran 90 and C++ allow operator overloading. This can make code writing easier. But for performance purposes, note that temporary space is allocated at run time; that's a time-consuming task, perhaps equivalent to an interrupt.

System interrupts, e.g., for floating point exceptions, may be known to be harmless. For example, an underflow may just mean zero. It's possible to test for NaNs (not-a-numbers) and take another branch if these occur. When I was trying out the "perfect" code of a few sections further on, I found that I could run with NaNs at about 2 million (meaningless) flops per second. But putting in an "if" to test for each NaN would be slower.
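One portable way to test for a NaN, without any library support, uses the fact that a NaN is the only value not equal to itself (a sketch; Fortran 2003 also offers ieee_is_nan):

```fortran
c x .ne. x is true only when x is a NaN
      if (x .ne. x) then
         print*, 'x is a NaN -- take the other branch'
      endif
```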

Modern chips cope better with branches than they used to. But even straight-line code is usually running at only 5 or 10 percent of peak, so having lots of branches "ready on spec" only means you'll continue along at 5% (which is of course better than the alternative of stalling while everything for the step after the branch is fetched).

Does the branch prediction paradigm have ramifications for the C (or Perl) style of putting many subroutine calls inside an "if" error handler?

Type conversions cost more than flops.

Conjugations or changing sign cost more than flops?

Complex arithmetic is good in that it has twice as many flops per byte of memory accessed.

Special function evaluations are usually costly. Two approaches to save time: use interpolation, or find a fast but not quite as accurate library (or combine these two).
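A sketch of the interpolation idea for a function f on [0,1]: precompute a table and replace each call with a lookup and a couple of flops (tab, ntab, and h are illustrative names; accuracy depends on the table spacing):

```fortran
c tab(k) = f(k*h) for k = 0..ntab, with h = 1.d0/ntab
      k = int(x/h)
      frac = x/h - dble(k)
      fx = (1.d0 - frac)*tab(k) + frac*tab(k+1)
```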

There is an overhead for calling a subroutine. Goedecker and Hoisie claim this is about the same as the loop overhead per loop increment, about ten cycles on the machine they tested. Many compilers have flags to in-line designated subroutines. An additional hazard of subroutines and functions in C and Fortran 90 is that many variables may have local scope, so a private copy is made for the function. A possible advantage is code safety, but the copy takes time. On some Fortran 90 compilers, subarray notation causes a copy to be made for a subroutine call.

There are 3 different BLAS packages on the blades that can achieve higher than 5 Gflops; the advertised peak speed is 6 Gflops. The following program imitates one given in Goedecker and Hoisie, who were able to reach peak speed with it on an IBM Power architecture. So let's try it on the BladeCenter.

I do timings by first getting an interactive shell by

>bsub -Is -W 5 csh

which gives me an interactive shell so that I can then just type the name of the executable. The bsub command gives me a processor of my own to play with, giving more faith that I'm getting the whole processor; also, the burst of time I take doesn't delay someone else on the head node. If the name of the executable file is peak, then I can run it by typing

>./peak

The program should run at peak speed? It has lots of good paired multiply-adds. There are 10 variables, so it doesn't quite fit in the 8 registers. But no statement depends on the previous one, so surely the compiler and chip should be able to do a couple of flops per clock cycle?

      program peak
C This is a "peak" performance try.
      implicit none
      integer lda, n, i, j, k, idum
      parameter(lda=2000)
      real*8 x1,x2,x3,x4,y1,y2,y3,y4,s,t, mflops, mean
      real*4 ftime, pretim, entim
      external ftime
      print*, 'input n'
      read*, n
c Initialize matrices a and b
      x1 = 1.d-1
      x2 = 5.d-1
      x3 = 4.d-1
      x4 = 1.d-1
      y1 = 9.d-1
      y2 = 2.d0
      y3 = 1.d0
      y4 = 2.d0
      s = -.125d-4
      t = .125d-4
      pretim = ftime()
c
c Write the fast flops here
c (lots of fused add, multiplies, no new data,
c no dependencies)
c But on the Xeon, even this will spill the registers.
c
      do j = 1, 1000*n
         do i = 1, n
            y1 = x1 + s*x2 + t*x3
            y2 = x2 + s*x1 + t*x4
            y3 = x3 + s*x4 + t*x1
            y4 = x4 + s*x3 + t*x2
            x1 = y1 + s*y2 + t*y3
            x2 = y2 + s*y1 + t*y4
            x3 = y3 + t*y1 + s*y4
            x4 = y4 + t*y2 + s*y3
         end do
c Some extra stuff to make sure we don't overflow
         mean = abs(x1)+abs(x2)+abs(x3)+abs(x4)+
     +          abs(y1)+abs(y2)+abs(y3)+abs(y4)
         mean = 1.d0/mean
         x1 = x1*mean
         x2 = x2*mean
         x3 = x3*mean
         x4 = x4*mean
         y1 = y1*mean
         y2 = y2*mean
         y3 = y3*mean
         y4 = y4*mean
      end do
      entim = ftime() - pretim
      print*, 'x1 =', x1
      print*, ' entim =', entim
      if (entim .lt. 1.e-4) then
         print*, ' too fast to time'
      else
c
c The next formula assumes the matrix was square n by n
c Possibly, you will have to repeat the matvec
c k times to get a timing. If so you will want
c to multiply the RHS of the equation below by k.
c
         mflops = (32.d0*(1.d3*n)*n + 16*(n*1.d3))/entim/1.d6
         print*, ' Mflop rate of', mflops
      endif
      stop
      end

Of course, one needs lots of repeats to get above the one-hundredth-of-a-second resolution of the Linux clock function. And the repeats at first caused x1, x2, etc. to overflow the floating point numbers, so the initial rate was 2 Mflops (of meaningless flops).

So the averaging was put in to pull the numbers back toward one every so often. Then the rate was about 1100 Mflops for the Intel ifc compiler. The pgf90 compiler was actually faster, at 1300 Mflops.

So let's simplify the code and then see what the assembler looks like. Here's a simplified version

      program peak
C This is a "peak" performance try.
      implicit none
      integer lda, n, i, j, k, idum
c     parameter(lda=100)
      real*8 x1,x2,x3,x4,y1,y2,y3,y4,s,t, mflops, mean
c Initialize matrices a and b
      n = 100
      x1 = 1.d-1
      x2 = 5.d-1
      x3 = 4.d-1
      x4 = 1.d-1
      y1 = 9.d-1
      y2 = 2.d0
      y3 = 1.d0
      y4 = 2.d0
      s = -.125d-4
      t = .125d-4
c
c Write the fast flops here
c (lots of fused add, multiplies, no new data,
c no dependencies)
c But on the Xeon, even this will spill the registers.
c
      do j = 1, 1000*n
         do i = 1, n
            y1 = x1 + s*x2 + t*x3
            y2 = x2 + s*x1 + t*x4
            y3 = x3 + s*x4 + t*x1
            y4 = x4 + s*x3 + t*x2
            x1 = y1 + s*y2 + t*y3
            x2 = y2 + s*y1 + t*y4
            x3 = y3 + t*y1 + s*y4
            x4 = y4 + t*y2 + s*y3
         end do
c Some extra stuff to make sure we don't overflow
      end do
c
c The next formula assumes the matrix was square n by n
c Possibly, you will have to repeat the matvec
c k times to get a timing. If so you will want
c to multiply the RHS of the equation below by k.
c
      stop
      end

Compiling with

>pgf77 -S peak3.f

gives the assembler language code produced by the PGI f77 compiler.

	.file "peak3.f"
	.version "01.01"
# PGFTN 5.0 -opt 1 -norecursive
# PGFTN 08/26/2004 12:54:24
# pgf77 peak3.f -S
# /usr/local/pgi/linux86/5.0/bin/pgftn
# peak3.f -opt 1 -terse 1 -inform warn -x 119 0xa10000 -x 119 0x100000 -x 122 0x40
# -x 123 0x1000 -x 127 4 -x 119 0x40000000 -x 80 0x300 -y 80 0x1000 -x 80 0x40000000
# -x 51 0x20 -x 124 0x1401 -x 80 0x300 -y 80 0x1000 -x 80 0x40000000 -astype 0
# -stdinc /usr/local/pgi/linux86/5.0/include:/usr/local/include:/usr/lib/gcc-lib/i386-redhat-linux/2.96/include:/usr/lib/gcc-lib/i386-redhat-linux/2.96/include:/usr/include
# -def unix -def __unix -def __unix__ -def linux -def __linux -def __linux__
# -def __inline__= -def i386 -def __i386 -def __i386__ -def __NO_MATH_INLINES
# -def linux86 -def __THROW= -cmdline +pgf77 peak3.f -S -x 124 1 -asm peak3.s
# lineno: 1
	.text
	.align 4
	.long ..EN1_296-MAIN_+0xc8000000
	.align 16
	.globl MAIN_
MAIN_:
	pushl %ebp
	movl %esp,%ebp
	subl $32,%esp
	movl %ebx,12(%esp)
	movl %edi,8(%esp)
..EN1_296:
# lineno: 0
	movl $100,%eax
	movl %eax,.BSS1
	fldl .C1_316
	fstl .BSS1+8
	fldl .C1_295
	fstpl .BSS1+16
	fldl .C1_317
	fstpl .BSS1+24
	fstpl .BSS1+32
	fldl .C1_318
	fstpl .BSS1+40
	fldl .C1_294
	fstl .BSS1+48
	fld1
	fstpl .BSS1+56
	fstpl .BSS1+64
	fldl .C1_320
	fstpl .BSS1+72
	fldl .C1_319
	fstpl .BSS1+80
	movl $100000,%edi
	movl $1,%edx
	movl %edx,.BSS1+88
	testl %edi,%edi
	jle .LB1_324
.LB1_322:
# lineno: 30
	movl .BSS1,%ebx
	movl $1,%eax
	movl %eax,.BSS1+92
	testl %ebx,%ebx
	jle .LB1_327
.LB1_325:
# lineno: 31
	fldl .BSS1+24
	fldl .BSS1+80
	fld %st(1)
	fmul %st(1),%st
	fldl .BSS1+16
	fldl .BSS1+72
	fld %st(1)
	fmul %st(1),%st
	fldl .BSS1+8
	fstl -8(%ebp)
	fadd %st,%st(1)
	fxch %st(1)
	faddp %st,%st(4)
	fxch %st(3)
	fstl .BSS1+40
	fldl .BSS1+32
	fstl -16(%ebp)
	fld %st(5)
	fmul %st(1),%st
	fxch %st(5)
	fmul %st(3),%st
	fadd %st(4),%st
	faddp %st,%st(5)
	fxch %st(4)
	fstl .BSS1+48
	fldl -8(%ebp)
	fmul %st(6),%st
	fxch %st(5)
	fmul %st(3),%st
	fadd %st(7),%st
	faddp %st,%st(5)
	fxch %st(4)
	fstl .BSS1+56
	fxch %st(3)
	fmul %st(5),%st
	fxch %st(6)
	fmul %st(2),%st
	fldl -16(%ebp)
	faddp %st,%st(1)
	faddp %st,%st(6)
	fxch %st(5)
	fstl .BSS1+64
	fld %st(4)
	fmul %st(3),%st
	fld %st(2)
	fmul %st(5),%st
	fadd %st(7),%st
	faddp %st,%st(1)
	fstpl .BSS1+8
	fld %st(4)
	fmul %st(1),%st
	fld %st(2)
	fmul %st(7),%st
	fadd %st(5),%st
	faddp %st,%st(1)
	fstpl .BSS1+16
	fld %st(1)
	fmul %st(1),%st
	fxch %st(6)
	fmul %st(5),%st
	fadd %st(3),%st
	faddp %st,%st(6)
	fxch %st(5)
	fstpl .BSS1+24
	fmulp %st,%st(1)
	fxch %st(1)
	fmulp %st,%st(2)
	fxch %st(1)
	faddp %st,%st(2)
	faddp %st,%st(1)
	fstpl .BSS1+32
	movl .BSS1+92,%eax
	addl $1,%eax
	movl %eax,.BSS1+92
	subl $1,%ebx
	jg .LB1_325
.LB1_327:
# lineno: 41
	movl .BSS1+88,%eax
	addl $1,%eax
	movl %eax,.BSS1+88
	subl $1,%edi
	jg .LB1_322
.LB1_324:
# lineno: 43
	xorl %eax,%eax
	movl %eax,4(%esp)
	movl %eax,(%esp)
	call ftn_stop
# lineno: 52
	movl -20(%ebp),%ebx
	movl -24(%ebp),%edi
	movl %ebp,%esp
	popl %ebp
	ret
	.type MAIN_,@function
	.size MAIN_,.-MAIN_
	.data
	.align 8
.C1_319:
	.long 0xeb1c432d, 0x3eea36e2 # 1.25000000000000006E-5
.C1_320:
	.long 0xeb1c432d, 0xbeea36e2 # -1.25000000000000006E-5
.C1_294:
	.long 0x0, 0x40000000 # 2.00000000000000000E+0
.C1_318:
	.long 0xcccccccd, 0x3feccccc # 9.00000000000000022E-1
.C1_317:
	.long 0x9999999a, 0x3fd99999 # 4.00000000000000022E-1
.C1_295:
	.long 0x0, 0x3fe00000 # 5.00000000000000000E-1
.C1_316:
	.long 0x9999999a, 0x3fb99999 # 1.00000000000000006E-1
# STATIC VARIABLES
	.data
	.local .BSS1
	.comm .BSS1,96,8
# COMMON BLOCKS

Looking at the assembler, we see a correspondence to the Fortran source line numbers. There are 65 assembler instructions corresponding to the 32 flops inside the block, so if there is a flop every other clock cycle, it looks as if one assembler instruction is being executed every clock cycle. Corresponding to the 16 floating point adds are 6 fadd instructions and 10 faddp; to the 16 floating point multiplies correspond 14 fmul and 2 fmulp instructions. Here's a reference to Intel assembler. So far my code rearrangements have not helped.

Conclusion: it's going to be tough to get these processors to run at faster than about 1200 Mflops (20% of advertised peak). So when we look at code efficiency, we may often be satisfied if we can get 600 Mflops (10% of peak) per processor.

In comparison, I saw an estimate from a German professor of scientific computation that the average code at a high performance computing site runs at 1% of peak efficiency.

Rework your matrix-vector multiplies. Try loop unrolling in both the row and column directions and see if you can get nearer to the expected in-cache performance of 1200 Mflops. Also, can you find good compiler optimizations?

Alternately, you could explain how to make the "ideal" code run at the advertised rate.

Here's a reference from Intel
**Speeding up Matrix Vector Multiplication**