High Performance Computing
How do I use a Debugger on the HPC Machines?
One way to debug Fortran or C code is to write print statements, recompile, and rerun. For instance, if you have just changed a bit of code and want to make sure that the new code executes as you think, you might print variables to see if the code modifies them in the way you predict. Or if, having added or changed a subroutine, you find that the code fails to execute correctly, you might put print statements at the start of the subroutine to verify that variables are passed correctly.
Using a debugger allows you to accomplish these tasks without repeatedly recompiling. So if you've had to change hundreds of lines of code without good test cases for each few lines, and want to monitor the code behavior line by line, perhaps comparing to a known test case, using a debugger can be helpful. Learning to use a debugger may be useful either for your own future projects or in aiding colleagues.
Stepping through programs. A debugger allows you to step through a Fortran or C program. At each step the program listing is displayed and before going on, you can check current values of program variables. Before starting program execution under the debugger, the user specifies one or more break points. On command, the program runs till the first break point. The user can then go on step by step or can set a new break point and ask the program to continue execution to the next break. All the debuggers described below can be used in this fashion.
Examining core files. When code execution fails, a core file is typically created, usually called "core" or "core.jobnumber". The core file is in binary format, so it is not viewable with an editor. Assuming the code has been compiled with the -g flag, a debugger lets you examine a core file to see which subroutine crashed and at what line, which routine called that subroutine, and so on through the whole stack. You can also print out values of program variables at each level of the stack. dbx on the IBM shared memory machine works well for examining core files. On the blade center, Totalview allows examination of core files; I have not succeeded in examining Fortran core files using pgdbg or gdb, but suspect it may be possible.
The program should be compiled with the -g flag, constructing a symbol table that allows a line by line stepping through the source code. Also turn off the -O2 optimizations and all other optimizations. Compiler optimizations are quite a nice set of tricks, but they usually work by rearranging the order of operations, so they make it hard for the debugger to correlate program lines with code execution.
On the Linux Blade Center, the Portland Group C and Fortran compilers work with the gdb and pgdbg debuggers. Totalview works with Intel and Portland Group compiled codes, as well as with GNU compiled codes. Below we describe a method of using gdb (or ddd) in parallel. It is also possible to use the pgdbg and Totalview debuggers in parallel. On the shared memory IBM machines, the IBM supplied dbx and pdbx debuggers work well with the IBM xlf and xlc compilers. dbx is a good serial debugger and pdbx works well in parallel. dbx works well with core files.
On the Linux blade center, the gdb, pgdbg, and Totalview debuggers are available.
GDB is a classic open source program developed by Richard Stallman. The GUI based interface is called ddd. By making small modifications to code, you can debug parallel MPI jobs. If you learn gdb (or ddd), you can use it on almost any Linux based system. gdb works well with codes compiled with gcc, g++ or gfortran. In the past, it also worked well with PGI compilers, but I have not verified that recently.
gives a complete and fairly easy to follow set of instructions. If X11 forwarding works so that you can pop a GUI,
brings up a ddd session that includes a "help" button.
For debugging purposes, compile with the -g flag and no optimization (optimizing can confuse things by rearranging code execution order). For example,
>gfortran foo.f -g -o foo
compiles foo.f to produce the executable file foo, where the -g preserves the symbol table in such a way that the debugger can step through the source code, listing the current code line. Typically at run time, one sets a break point, lets the code execute to that point, then steps it through a suspect section of code, observing variables to see where they go astray.
>gdb foo
starts a gdb session attached to the executable foo. Similarly
>ddd foo
brings up a ddd GUI based version of gdb, which lets you do more with a mouse, but which also has a window which allows the commands given here to work.
Suppose that you know the code's problem is in SUBROUTINE FOOSUB. At the prompt one can enter,
gdb> break foosub_
gdb> run
which will run the code till it enters SUBROUTINE FOOSUB.
gdb> n
will step through the code to the next executable line. (Actually I've often found that the code misses the break point at foosub_ the first time and has to be run again.) 'n' (short for 'next') steps through an executable a line at a time, stepping past a subroutine or function call in one step. To step into a subroutine, use
gdb> s
(short for 'step'). If ivar is a variable inside foosub
gdb> print ivar
will display the current value of ivar. Suppose that A is a two dimensional matrix
gdb> print a(2,3)@5
would print a(2,3) and the elements that follow it in memory, five in all. In Fortran storage these are consecutive entries from a column. A peculiarity of gdb is that this notation only works in the main program. Inside subroutines, Fortran arrays are stored as a vector starting with position 0, stored by columns. So if A has leading dimension lda and A is being used to store a matrix of m rows and n columns
would print the first column of A.
would print the second column of A.
gdb does not seem to have a good way to print a section of a Fortran matrix row (in C, matrix rows are stored consecutively, so gdb would easily display a matrix row). So a Fortran row would have to be displayed one print statement at a time (whereas in pgdbg you could use Matlab notation to print a matrix row).
Once you're stepping through foosub and want to leap to a breakpoint at line 1142, you can set a new breakpoint
gdb> break 1142
and jump to it by
gdb> c
(provided your code would execute this line). One way to tell where to put the next breakpoint is by opening another xterm with an edit session of the source code. Find the line number you want (in vi, you would park the cursor on the line you want and ascertain its line number by typing :.= ), say 1311, then
gdb> break 1311
would put a break at that line. In ddd, the source file is displayed as part of a split screen in which you can scroll up and down, so the spare xterm is not quite as necessary, though it may still be convenient.
gdb> l 1311
lists lines around 1311 in the command line window.
Of course, this is just a start on how to use a debugger, but you can get a hint that using the debugger can save time on recompiling just to put in print statements.
Displaying a slice of a matrix is a bit easier in pgdbg. While the gdb notation still works, the easier column slice
pgdbg> print a(2:6,3)
and row slice
pgdbg> print a(2,3:5)
notations are also available. Pgdbg has man pages. Help is available from within the debugging sessions by typing
>xlf90 -o foo.exe -g foo.f
You can start a debug session by
>dbx foo.exe
Breakpoints are set by the name and line number of the file containing them.
(dbx) stop at "foo.f":1169
This will set a break at line 1169 of foo.f.
(dbx) print a(1,2)
is valid, but there seems to be no way to show a slice of a matrix. Another problem can be that though scalar variables print quickly, there can be a long delay in printing elements of a matrix.
(dbx) cont
continues execution of the code to the next breakpoint.
One virtue of the dbx debugger is convenience of examining core files.
Suppose that a -g compiled code foo runs and dumps a core
Segmentation fault - core dumped
To investigate the error,
>dbx foo core
Dbx reports the line where the dump occurred. You can examine the stack (what program called the subroutine and what program called that routine, and so on), and can print variables on each level of the stack.
(dbx) up
(dbx) down
move up and down the stack respectively.
The p590 has a long man page for dbx which includes example sessions.
The pgdbg, Totalview, and gdb debuggers should each work in parallel. The description here is of using gdb (or the ddd debugger) on a VCL node.
MPI (Message Passing Interface) codes are typically written for distributed memory machines, i.e., each processor has its own memory and communicates with other processors by sending and receiving messages. For purposes of debugging with gdb, we've implemented a shared memory version of MPI. Here the messages are passed within a single shared memory node. It's possible to start more processes than the number of physical cores.
The mpirun call starts up a process. When an MPI_Init call is encountered, that root process starts up the requested number of new processes. By putting a pause after the MPI_Init, we can identify the new processes and attach each new process to a gdb session. Then we can step through each of the processes individually.
Here's an example. From one of the 32 bit login nodes,
This sets up the mpif77, mpicc, and mpicxx commands to use the gnu32 library. Compile a simple MPI code, e.g., the monte.f code from the MPI short course. An excerpt is as follows.
      real*8 ans(10), ans2(10)
      real*8 startim, entim, sum, sindex
c
c     function
      integer string_len
      iflag = 1
c
      call MPI_INIT(ierr)
      do while (iflag.eq.1)
      end do
c
      call MPI_COMM_SIZE(MPI_COMM_WORLD, p, ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
*     print*,' I am ', my_rank
c
      if (my_rank.eq.0) then
cc       print*,'input random seed'
Notice the "do while" line just after the MPI_INIT. This line stalls the code indefinitely (taking 100% of the cycles on some core). For this reason it's a good idea to debug on a vcl node (go to vcl.ncsu.edu and get an HPC linux image) as opposed to login01 or login02.
>mpirun -np 2 ./monte
starts up the code. Doing "ps -ef | grep monte"
[gwhowell@login02 ~]$ ps -ef | grep monte
gwhowell 25532 25386  0 14:39 pts/64 00:00:00 /bin/sh /home/gwhowell/mpiches/mpich-1.2.7p1/gnu32/ch_shmem/bin/mpirun -np 2 ./monte
gwhowell 25560 25532  6 14:39 pts/64 00:01:42 /home/gwhowell/ppmpi_f/chap03/./monte
gwhowell 25561 25560 10 14:39 pts/64 00:02:40 /home/gwhowell/ppmpi_f/chap03/./monte
shows that 2 monte jobs have started up. In another window,
gdb
gdb> attach 25560
gdb> set iflag = 0
and in yet another window
gdb
gdb> attach 25561
gdb> set iflag = 0
bring up gdb sessions attached to these two processes. Setting iflag to 0 in each session pulls the process out of the infinite loop. Repeatedly typing "n" will step through a process, and we now have two parallel gdb debug sessions attached to the two processes. A similar approach is outlined in Parallel debugging. (The gdb attach process syntax given there does not quite work on the blade center.) The link shows syntax for stalling C codes while the debugger is attached.
You may prefer to start up one or more of the gdb sessions with "ddd". This has a GUI interface which can be useful. For example, when I tried
gdb did not know of any such local variable. By clicking the "display" button in ddd, I was able to find a local variable called my_rank__ and print that.
Totalview debugging can be accomplished in much the same way as outlined here (but so far works only for fortran and C codes, not for C++). For the totalview debugger, try
Last modified: December 18 2008 14:48:48.