Lab 1 -- MPI Short Course fall 2010

  • Logging in First open PuTTY (or another terminal) and log in:
    >ssh -l myname login.hpc.ncsu.edu 
    

    then enter your password.

    Copy /gpfs_share/gwhowell/pachec.tar to your home directory by

    >cp /gpfs_share/gwhowell/pachec.tar . 
    >tar xvf pachec.tar
    

    This last command "untarred the tar ball".
    >cd pachec/ppmpi_f/chap03 
    >ls 
    

    I've compiled some references at Unix and Supercomputing References and Tutorials, so you might look at those if you feel you need some more info on UNIX. Warning: some of these tutorials may differ a bit from the way we do things here.

  • Compiling There is a more detailed web page: http://www.ncsu.edu/itd/hpc/Documents/BladeCenter/GettingStartedbc.php

    For a not very optimized version using the Portland Group compilers:

    >add pgi64_hydra
    >mpif90 pmonte.f -o pmonte
    

    The "add pgi64_hydra" put the pgif90 compiler as the mpif90 compiler, and also automated the linking to the mpich (MPI) libraries. We could do this by hand by copying the mpif.h include file into the current directory, and then
    >pgf90 -c pmonte.f
    >pgf90 -o pmonte pmonte.o /usr/local/apps/mpich2/pgi105x64/1.3a2/lib/libfmpich.a /usr/local/apps/mpich2/pgi105x64/1.3a2/lib/libmpich.a /usr/local/apps/mpich2/pgi105x64/1.3a2/lib/libmpl.a
    

    Or if you prefer C, start with

    cd pachec/pmpi_c/chap03
    

    Then after "add pgi64_hydra", do

    mpicc greetings.c -o greetings
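
    If you want to see the shape of such a program, here is a minimal MPI "greetings" sketch in C (a sketch, not necessarily identical to Pacheco's greetings.c): every process with nonzero rank sends a message to process 0, which prints them.

    #include <stdio.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, source;
        char message[100];
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank != 0) {
            /* each process except 0 sends a greeting to process 0 */
            sprintf(message, "Greetings from process %d!", rank);
            MPI_Send(message, strlen(message) + 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        } else {
            /* process 0 receives and prints the greetings in rank order */
            for (source = 1; source < size; source++) {
                MPI_Recv(message, 100, MPI_CHAR, source, 0, MPI_COMM_WORLD, &status);
                printf("%s\n", message);
            }
        }

        MPI_Finalize();
        return 0;
    }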
    

    A job can be submitted by

    bsub < bgreet 
    

    Linux aside. Linux (an open source form of Unix) is command-line oriented, so you spend time learning commands, which can be complicated. Having learned a command, you can avoid typing it in full by writing a file consisting of the command and then making the file executable.

    Open a file with a UNIX editor such as emacs, vi, or nano (nano is easiest, but there are sometimes mysterious errors with long text lines; vi is most universal; emacs is preferred by many UNIX cognoscenti but requires a sometimes elusive "xterm") and store the command lines in it. With vi, you would type "vi compilepmonte", then "i" for insert, and type the lines; whenever you make a mistake, press the "escape" key, delete with "x", then start inserting again with "i". Save by pressing "escape" to get out of insert mode, then ":wq" to write and quit. A vi tutorial from the University of Washington is How to Use the vi Editor.

    Having produced the file "compilepmonte" with the lines needed to compile, say, pmonte.f, make it executable by

    >chmod +x compilepmonte
    

    and you can then run it to produce the executable pmonte by
    >./compilepmonte
    

    More complicated programs are usually compiled with "make" (see the 2nd lab).


  • Running a Batch Job

    There are several thousand computational CPUs on the blade center, most packaged with 4 or 8 cores per blade. Most blades have 2 GBytes of RAM per core, with all the cores on a blade having access to the blade's RAM. Communication between blades and to the shared file system is by GBit ethernet.

    Users run jobs on the computational nodes by submitting bsub scripts which will put their jobs in an LSF (Load Sharing Facility) queue. Here's a sample bsub script which runs the executable pmonte.

    #! /bin/csh
    #BSUB -W 5
    #BSUB -n 4
    #BSUB -R span[ptile=4]
    #BSUB -o /share/gwhowell/chap03/pmonte.out.%J
    #BSUB -e /share/gwhowell/chap03/pmonte.err.%J
    #BSUB -J pmonte
    source /usr/local/apps/env/pgi64_hydra.csh
    unsetenv MPICH_NO_LOCAL
    mpiexec_hydra ./pmonte
    

    If this file is bstry, then it can be submitted to LSF by
    bsub < bstry 
    

    If in the same window, you have already typed
    add pgi64_hydra 
    

    you can omit the line "source /usr/local/apps/env/pgi64_hydra.csh".

    Some explanation of the bstry script:
    The -W 5 asks for 5 minutes. -n 4 asks for 4 processors. The -o names the standard out file, and the -e the standard err file. The %J appends the LSF job ID integer to the file names so that successive runs don't append to the same files. Note that the #BSUB option lines should come before the first command in the script; LSF reads embedded options only until it reaches the first non-comment line.

    The -R span[ptile=4] puts 4 processors on each blade. Setting the same number of processors on each blade by ptile allows the mpich library to use local memory (RAM) to avoid passing messages through the ethernet interconnect. On the blade center, you can get up to 16 processors on each blade (ptile up to 16). Using processors on the same blade is often much faster than using the interconnect. Using RAM for the mpich messages is only possible if you include the line "unsetenv MPICH_NO_LOCAL".

    Of course, you need to change the gwhowell to your own user name and make sure that the directory you propose to write to exists and that you have write privileges in it.

    Writing these files to /share is convenient in that /share has no space constraint. However, files on /share are not backed up. In fact, they are purged, i.e., files on /share older than a month or so (or two weeks or whatever it takes to keep some space on the disk) are deleted.

    You might look in the .err and .out files. There is only one line in the .out file related to the job submission. Of course, if the submission had failed there would be more. You might try changing some things to make the submission fail. For instance, I got some pretty mysterious errors by omitting the line "real*8 rand" in the pmonte.f file.

    For example, give a bad path to the executable.

    Log out and when you log back in, don't type "add pgi64_hydra". What happens if you compile with pgi, then use "add intel64_hydra"?

    As an exercise, now try compiling and running some other program from chap03, say ring.f.

    Is it actually true that the number of processors has to be even?

    Actually, the program does not (usually) hang for an even number of processors. The sends return (provided adequate buffer space is available), so the processes are then ready to receive. The even-processor-count code is "unsafe" in that it depends on adequate buffer space.
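
    To see what "unsafe" means here, the following minimal C sketch (an illustration, not Pacheco's ring.f) has every process send to its right-hand neighbor and then receive from its left-hand one. It completes only if MPI_Send buffers the message internally; replacing the send/receive pair with MPI_Sendrecv is safe regardless of buffering.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        int rank, size, token, incoming, right, left;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        right = (rank + 1) % size;         /* neighbor to send to      */
        left  = (rank - 1 + size) % size;  /* neighbor to receive from */
        token = rank;

        /* "Unsafe" pattern: everyone sends first, then receives.  This
           relies on MPI_Send buffering the message; with large messages
           or little buffer space it can hang.                          */
        MPI_Send(&token, 1, MPI_INT, right, 0, MPI_COMM_WORLD);
        MPI_Recv(&incoming, 1, MPI_INT, left, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        /* Safe alternative: let MPI pair the send and receive itself.  */
        MPI_Sendrecv(&token, 1, MPI_INT, right, 1,
                     &incoming, 1, MPI_INT, left, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        printf("process %d received %d\n", rank, incoming);
        MPI_Finalize();
        return 0;
    }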

  • Profiling and Timing A main reason to go to parallel processing is to make codes run faster. It's often less trouble to get serial speed-ups. In any case, some of the principles involved, serial and parallel, are the same, so we may as well first learn something about the art of getting good performance in serial. Some of the tricks in Goedecker and Hoisie (SIAM Press, 2003) are a bit arcane, but others are very straightforward and it would be silly not to try them.

    In particular, it makes sense to

  • Profile and time to see which parts of the code are using the time,
  • Try compiler optimization,
  • Check whether someone has already written good optimized libraries, e.g. the BLAS,
  • Make sure we're looping in the right direction (by row or by column, so that data brought into cache is reused),
  • Try blocking for data locality,
  • Know which features of Fortran 90, C, C++, etc. are likely to slow code.
  • Then, if we can find the part of the code that is running slowly, we can try some of the more arcane stuff.

  • Timers

    One way to time codes is by prepending time to the call to the executable, e.g.

    time ./foo.exe
    
    This gives, among other results, the "user" time. "User" time should be taken with a grain of salt for parallel computation; it can, for example, be the total time summed over the various processes launched. Really we are more interested in "wall clock time". One way to get wall clock time is from the start and end times reported by LSF. The example programs show the use of the wall clock timer MPI_Wtime(), which can be called at the start and end of an MPI program. See Timers for a bit more.
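
    For instance, a minimal C sketch of the MPI_Wtime() pattern (the work loop is just a made-up stand-in for real work):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[])
    {
        double start, elapsed, sum = 0.0;
        long i;

        MPI_Init(&argc, &argv);

        start = MPI_Wtime();               /* wall clock time in seconds */

        /* code segment to be timed */
        for (i = 0; i < 100000000L; i++)
            sum += 1.0 / (double)(i + 1);

        elapsed = MPI_Wtime() - start;
        printf("sum = %f, wall clock time = %f seconds\n", sum, elapsed);

        MPI_Finalize();
        return 0;
    }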

  • Profilers Running a code under a profiler produces a table that details how long the code spent in various subroutines. Typically, the profiler is turned on by compiling the code with special compiler-dependent flags, the most common of which is the -pg flag. Then when the code is executed, a log file is produced. This can have various names, but

    >ls -lrt


    will list files in reverse order of time, so the newest files will appear last. So if you've just run a profiled code, you can easily find the log file. For example, running a file compiled with -pg and with the gnu or pgi compilers (pgf90, pgcc, gcc, g77, g++) will produce an output file gmon.out. To see the contents of the gmon.out produced by running an executable foo.exe, try typing

    >gprof foo.exe

    Each line of the file will correspond to a subroutine, and will tell you how often the subroutine was called and the total time spent in that subroutine. Generally, if we want to speed up a code, knowing where the code spends most of its time shows us where to concentrate. Or if a subroutine is called "oodles" of times, maybe it should be inlined. One tutorial for gprof is The GNU gprof.

    The gprof profile samples times in codes compiled with the -pg flag. Time spent in other parts of the code is not reported.
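
    If you want to try this on something small, here is a toy C program (hypothetical, just for the exercise) with one expensive routine and one cheap one. Compiling it with -pg, running it, and then typing gprof on the executable should show nearly all of the time attributed to heavy.

    #include <stdio.h>

    /* expensive routine: should dominate the profile */
    double heavy(int n)
    {
        double s = 0.0;
        int i, j;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++)
                s += 1.0 / (double)(i + j + 1);
        return s;
    }

    /* cheap routine: should barely register */
    double light(int n)
    {
        double s = 0.0;
        int i;
        for (i = 0; i < n; i++)
            s += (double)i;
        return s;
    }

    int main(void)
    {
        printf("heavy = %f, light = %f\n", heavy(20000), light(20000));
        return 0;
    }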

  • Example with the PGI Fortran compiler

    Here's an example run. It was compiled with the pgf77 compiler using the -Mprof=func flag. What do we see? After the executable ran, the file "pgprof.out" was produced. It follows.


    PROF NODALL 0 a.out 1093292124 0
    h blade1-13 23023 0 1
    t 1 7
    p 0
    f zge062704_1.f
    r zgebrd110 1 238 1 49.3378 23.3085
    r zgebd3 1 702 1 7.52057e-05 6.07718e-05
    r zlabr2 1 1342 199 26.0292 1.0356
    r zgeupm 1 2064 1393 22.0503 0.918796
    r zgemver 1 2342 207 2.94329 2.94329
    r zgemvt2 1 3113 1393 21.1315 21.1315
    f zrivbrd2.f
    r zrivbrd 1 1 1 109.268 59.9301
    z

    The driver is zrivbrd (contained in the Fortran file zrivbrd2.f). According to the log file, it takes 59 seconds. Actually, looking at the code, the driver calls some library routines from LAPACK, which has been linked in but not compiled with the profiling flag. Since the LAPACK codes were not compiled with the profiler, the time spent in the library routines is attributed to the driver.

    Another time-consuming routine is zgebrd110, which required 23.3 seconds, most of which was actually spent running the BLAS routine dgemm (matrix-matrix multiply). The next longest time, 21.1 secs, belongs to zgemvt2, which calls the BLAS matrix-vector multiply dgemv. zgemvt2 was called 1393 times (called only from zgeupm, itself called 1393 times). The routine zgemver was called 207 times and required 2.94 seconds (it calls the BLAS dgemv and also dger, a rank-one update). For a prettier display of the results, I typed
    >pgprof pgprof.out
    which brought up a GUI with some better explanations. For instance, the 238 is the line of the file zge062704_1.f on which the routine zgebrd110 starts. It also gives another informative column of times: how long the code spends in a routine and its subroutines. For example, zrivbrd and its subroutines required 109.268 seconds.

    These results (blaming all the time on the BLAS calls) seem to indicate that we should make sure we have a good BLAS library. (The results of that experiment appear below.) For a user manual for the PGPROF profiler, see PGPROF.

  • gprof, Lint, ftnchek, etc. I happen to know the code we're examining pretty well, as I wrote it. But for someone coming in cold, the PGI log file would be a bit confusing. It doesn't tell us, for example, which routine calls which. Generally, when looking to modify a code, you find yourself constructing a call table.

    One way, of course, is by "brute force". Brute force is plausible with some Unix commands. The one indispensable "track it down" command is grep. So, for example, to search all the .f files in a directory for the ones that contain the characters ZLABR2(:

    >grep 'ZLABR2(' *.f

    Or to search a bunch of .a archive files to find which one has the elusive function "foobar":

    >nm -A *.a | grep foobar

    where the | pipes the output of "nm -A *.a", which is a massive list of symbols, to grep. Grep throws away all the lines of the output except those that contain "foobar". The -A flag makes sure that the name of the .a file is included on each line of the nm-generated symbol table.

    Then by tracking down all the subroutines, you'll eventually construct a tree showing which ones call which. Or you could have just used the -pg compile option. Most of the traditional computer vendors, such as IBM, Digital, and HP, have their own utilities which will construct a call table for you. For the open source environments (gcc, g77, etc.), the well-known public domain program GNU gprof (by Jay Fenlason) will do it, provided you have compiled with the -pg flag; it works with C, Fortran, and Pascal. I hope to give you a longer demo, but "info gprof" would get you going. See also the web page GNU gprof or Class notes from Rice University.

    Call tables are also constructed by programs such as lint for C and ftnchek for Fortran 77. These are public domain, but some supported licensed programs can be purchased. Lint and ftnchek also give a good deal of info about possible programming problems such as mis-typed argument lists, non-portable language constructs, etc.

    One problem with Fortran 90 is that the standard open source tools such as lint, ftnchek, and gprof don't yet work with it. So not only do you have to buy a Fortran 90 compiler, you then have to purchase the corresponding development tools. Fortunately, we have the PGI flavors here. Also, on the p690, we have the IBM tools.

  • Timers Of course, for purposes of comparing codes or seeing how long a subroutine takes to run, we can just call a timer. C and Fortran 90 have portable CPU timers; Fortran 77 does not.

    One way to get portability is to use the standard C clock function. Then from Fortran, use a wrapper to call it. Here's the wrapper.

    #include <time.h>

    /* Fortran-callable wrapper: returns CPU time in seconds. */
    float ftime_()
    {
        /* printf("Here we are:\n clock=%12.8f\n", (float)clock()); */
        return (float)clock() / CLOCKS_PER_SEC;
    }

    In some instances, you may need to put an extra underscore after the ftime_. If you wanted to call this from C, you would take away the underscore. Compile it with

    >cc -c timd.c

    and then just include timd.o in the list of object files to be linked into your executable.

    The Fortran or C code has calls to ftime() as follows:

    Having declared pretim and entim as real*4
    pretim = ftime()
    ...
    Code segment to be timed
    ...
    entim = ftime() - pretim

    Then entim is the elapsed CPU time for the "Code segment to be timed". In C, the declarations would be "float" and semicolons would be required at the ends of the lines.
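
    For example, here is a self-contained C sketch of the same pattern, calling clock() directly rather than going through the Fortran wrapper (the work loop is just a stand-in):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        float pretim, entim;
        double sum = 0.0;
        long i;

        pretim = (float)clock() / CLOCKS_PER_SEC;          /* start CPU time */

        /* code segment to be timed */
        for (i = 0; i < 100000000L; i++)
            sum += 1.0 / (double)(i + 1);

        entim = (float)clock() / CLOCKS_PER_SEC - pretim;  /* elapsed CPU time */
        printf("sum = %f, CPU time = %f seconds\n", sum, entim);
        return 0;
    }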

    The clock function is portable in that, as part of the C standard, it exists everywhere C does. A disadvantage of the C clock function (or the Fortran 90 cpu_time) is that the resolution is often pretty low; frequently the smallest nonzero time is 1.e-2 or 1.e-3 seconds. So to time a small bit of code you have to repeat it many times. Then the data stays in cache, so the code runs artificially fast. Occasionally the compiler figures out that a loop is repeated with the same data and is therefore unnecessary; then times can get really short.

    Another standard timing function appropriate for wall clock time is clock_gettime, which fills in a struct whose second component (nanoseconds) typically has a higher resolution than clock(). For wall clock time I often prefer just to use the MPI_Wtime() function. If you have an MPI library available, just link to it; then, between the MPI_Init() and MPI_Finalize() calls, you can use the MPI_Wtime() wall clock function even though your code is really serial. It usually returns an answer with a resolution in microseconds.
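
    Here is a minimal sketch of the clock_gettime approach (using the POSIX CLOCK_MONOTONIC clock; on older Linux systems you may need to link with -lrt):

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        struct timespec t0, t1;
        double sum = 0.0, elapsed;
        long i;

        clock_gettime(CLOCK_MONOTONIC, &t0);   /* start wall clock */

        /* code segment to be timed */
        for (i = 0; i < 100000000L; i++)
            sum += 1.0 / (double)(i + 1);

        clock_gettime(CLOCK_MONOTONIC, &t1);   /* stop wall clock  */
        elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1.0e-9;
        printf("sum = %f, wall clock time = %f seconds\n", sum, elapsed);
        return 0;
    }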

    Finally, if you've isolated a section of code which can run as a stand-alone program ./fooexec, you can simply type

    >time ./fooexec

    or

    >timex ./fooexec

    to get screen output detailing how long the code took. Under the csh or tcsh shell, the time command can give a good deal of other information about the code's run-time performance, e.g., how much memory it used (see the man page).

    A good on-line reference for timers and other performance tools is the LLNL performance_tools tutorial.

  • Speed-ups by BLAS, Compiler Optimization, Intel Compiler Initially, we had a LAPACK time (according to clock) of 52 seconds to bidiagonalize a 1600 by 1600 matrix, corresponding to about 206 Mflops, whereas the trial routine required only 48 seconds. 206 Mflops is not as horrible as it sounds, as these are double precision complex flops, where an add corresponds to two double precision adds and a multiply to 4 double precision multiplies. Assuming an equal number of adds and mults, there are 3 double precision flops to a complex flop, so we have a speed of about 620 Mflops. Still, the peak speed of the processor is around 6 Gflops.

    Getting rid of the profiling and using the -O4 compiler option made little difference in the times.

    Swapping the PGI-supplied BLAS for the ATLAS BLAS dropped the LAPACK time down to 25 seconds, i.e., 430 Mflops complex, equivalently 1320 Mflops double. Theory: using the Goto BLAS with the Intel compiler and flags will push the rate above 500 Mflops. For directions on linking to the ATLAS, Intel, and Goto BLAS libraries on the Blade Center, see BLAS libraries.

    But the alternative bidiagonalization version was reduced in time only from 48 seconds to 34 seconds. A first problem was that the block size had been set to 3 for purposes of debugging. Returning the block size to the usual 16, the time went to about one half second more than the current LAPACK routine. Looking carefully at the profiler data, it turned out that the average zgemver call took longer than the average call to zgemvt2. I then realized that most of the zgemver calls (one per call to zlabr2) were actually zgemvt2 calls with a "zero" update. Fixing these calls so that they do not call a matrix update routine avoided one write of the matrix per call to zlabr2, and reduced the execution time to .7 seconds less than the current LAPACK call. Some remaining functions that take more than one second are the call to zlabr2 and the call to zgeupm. A remaining investigation is whether the conjugation (not expressed as a BLAS call) could be responsible. Perhaps some of the vector operations for which conjugation is done an element at a time need some optimization?

  • Hardware Counters--PAPI In quantum mechanics or anthropology, observations influence the observed. So using profilers or timers entails "interrupts" that distort program execution. A less intrusive approach to observing the code in the wild is to use a hardware counter. It samples: every so many steps -- say every 1000 clock cycles -- it polls, asking what instruction is being issued on this clock cycle. Then after program execution is complete, you can look and see what proportion of the time was spent doing what.

    These counters are typically developed as part of the chip design process, and may or may not be available to the public. For example, the Digital Alpha chip had a very nice counter, but alas the sysadmin never wanted to leave it on, because it would have some "drag" effect on the system.

    Dongarra et al. have proposed a portable counter interface, PAPI (Performance Application Programming Interface). It runs on most processors and is public domain.

    Why? It's interesting to compare and see what bits of code can sustain the most flops per instruction.

  • Looking at the Assembler Code Most compilers have an option to let you look at the assembler code. Look in the compiler man page. Most commonly, it's a -S (as opposed to -c) option. If you've determined that some bit of code could be a problem area, then you can look at the assembler code and try various coding options to decrease the number of load and store instructions, etc. Personally, I don't actually write the assembler, just rewrite the Fortran or C. It's also interesting to see what the compiler optimizer has done to your code.
  • In Parallel The pgi profiler claims to work in parallel. We'll come back and try this again when we have some parallel code working.

  • Conclusions

    We've seen how to time and profile. In an example, we got a big speedup (factor of two) by changing to a different BLAS library. More generally, if we found a part of the code that took a significant amount of time and did not have a corresponding optimized library call, we could try some of the techniques from the next lecture to optimize it ourselves. For example, 10% or so of the time in the example profiled code was not in BLAS routines, so I may have to try to optimize that code by hand.