NC State
Office of Information Technology
High Performance Computing

  How do I use a Debugger on the HPC Machines?



  • Updated discussion, November 5, 2008. This discussion is meant to give users an idea of how to use debuggers and how to apply them in parallel, but it is by no means complete. One on-line tutorial is Norm Matloff's Debugging Tutorial. You can find many other on-line tutorials, e.g., gdb for C and C++. If you have more questions on how to use debuggers on the NC State blade center, please contact me (gary_howell@ncsu.edu).

  • What Does a Debugger Do?

    One way to debug Fortran or C code is to write print statements and recompile and rerun. For instance, if you have just changed a bit of code and want to make sure that the new code executes as you think, you might print variables to see if the code modifies them in the way you predict. Or if having added or changed a subroutine, you find that the code fails to execute correctly, you might put print statements at the start of the subroutine to verify that variables are passed correctly.

    Using a debugger allows you to accomplish these tasks without repeatedly recompiling. So if you've had to change hundreds of lines of code without good test cases for each few lines, and want to monitor the code behavior line by line, perhaps comparing to a known test case, using a debugger can be helpful. Learning to use a debugger may be useful either for your own future projects or in aiding colleagues.

    Stepping through programs. A debugger allows you to step through a Fortran or C program. At each step the program listing is displayed and before going on, you can check current values of program variables. Before starting program execution under the debugger, the user specifies one or more break points. On command, the program runs till the first break point. The user can then go on step by step or can set a new break point and ask the program to continue execution to the next break. All the debuggers described below can be used in this fashion.

    Examining core files. When code execution fails, a core file is created, typically called "core" or "core.jobnumber". The core file is in binary format, so it is not viewable with an editor. Assuming the code has been compiled with the -g flag, a debugger lets you examine a core file to see which subroutine crashed and at what line, which routine called that subroutine, and so on through the whole call stack. You can also print the values of program variables at each level of the stack. dbx on the IBM shared memory machine works well for examining core files. On the blade center, Totalview allows examination of core files; I have not succeeded in examining Fortran core files using pgdbg or gdb, but suspect it may be possible.

  • Compiling Code So You Can Use a Debugger

    The program should be compiled with the -g flag, constructing a symbol table that allows a line by line stepping through the source code. Also turn off the -O2 optimizations and all other optimizations. Compiler optimizations are quite a nice set of tricks, but they usually work by rearranging the order of operations, so they make it hard for the debugger to correlate program lines with code execution.

  • What Debuggers are Available?

    On the Linux Blade Center, the Portland Group C and Fortran compilers work with the gdb and pgdbg debuggers. Totalview works with codes compiled by the Intel and Portland Group compilers, and with GNU compiled codes. Below we describe a method of using gdb (or its GUI, ddd) in parallel. It is also possible to use the pgdbg and Totalview debuggers in parallel. On the shared memory IBM machines, the IBM supplied dbx and pdbx debuggers work well with the IBM xlf and xlc compilers. dbx is a good serial debugger and also works well with core files; pdbx works well in parallel.

  • Debuggers on the Linux BladeCenter

    On the Linux blade center, the gdb, pgdbg, and Totalview debuggers are available.
  • The GDB Debugger
  • The pgdbg Debugger
  • The Totalview Debugger

  • The GDB Debugger

    GDB is a classic open source program developed by Richard Stallman. Its GUI based interface is called ddd. By making small modifications to code, you can debug parallel MPI jobs. If you learn gdb (or ddd) you can use it on almost any Linux based system. gdb works well with codes compiled with gcc, g++ or gfortran. In the past, it also worked well with PGI compilers, but I have not verified that recently.

    >info gdb

    gives a complete and fairly easy to follow set of instructions. If X11 forwarding works so that you can pop a GUI,

    >ddd

    brings up a ddd session that includes a "help" button.



    For debugging purposes, compile with the -g flag and no optimization (optimizing can confuse things by rearranging code execution order). For example,

    >gfortran foo.f -g -o foo

    compiles foo.f to produce the executable file foo, where the -g preserves the symbol table in such a way that the debugger can step through the source code, listing the current code line. Typically at run time, one sets a break point, lets the code execute to that point, then steps it through a suspect section of code, observing variables to see where they go astray.

    >gdb ./foo

    starts a gdb session attached to the executable foo. Similarly

    >ddd ./foo

    brings up a ddd GUI based version of gdb, which lets you do more with a mouse, but which also has a window which allows the commands given here to work.

    Suppose that you know the code's problem is in SUBROUTINE FOOSUB. At the prompt you can enter

    gdb>break foosub_

    or

    gdb>b foosub_

    Then entering

    gdb>run

    or

    gdb>r

    will run the code till it enters SUBROUTINE FOOSUB.

    gdb>n

    will step through the code to the next executable line. (Actually, I've often found that the code misses the break point at foosub_ the first time and has to be run again.) 'n' (short for 'next') steps through the code a line at a time, stepping past a subroutine or function call in one step. To step into a subroutine, use

    gdb>s

    (short for 'step'). If ivar is a variable inside foosub

    gdb> print ivar

    or

    gdb> p ivar

    will display the current value of ivar. Suppose that A is a two dimensional matrix

    gdb> print a(2,3)@5

    would print five consecutive elements from memory starting at a(2,3), which in Fortran storage are consecutive entries of a column. A peculiarity of gdb is that this notation only works in the main program. Inside subroutines, gdb sees a Fortran array as a vector starting at position 0, stored column by column. So if A has leading dimension lda and A is being used to store a matrix of m rows and n columns,

    gdb>p a(0)@m

    would print the first column of A.

    gdb>p a(lda)@m

    would print the second column of A.

    gdb does not seem to have a good way to print a section of a Fortran matrix row (in C, matrix rows are stored consecutively, so gdb would easily display a matrix row). So a Fortran row has to be displayed one print command at a time (whereas in pgdbg you can use Matlab-style slice notation to print a matrix row).
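    The offset arithmetic behind these print commands is easy to get wrong, so here is a minimal Python sketch of it, assuming the zero-based, column-by-column vector view described above. The function names and the example sizes (lda = 10, m = 8, n = 4) are illustrative assumptions, not part of gdb; the script only builds the command strings, which you would then type or paste at the gdb prompt.

```python
def offset(i, j, lda):
    """Zero-based vector position of Fortran element a(i,j) (1-based i, j)
    in a column-major array with leading dimension lda."""
    return (i - 1) + (j - 1) * lda


def column_command(j, m, lda, name="a"):
    """gdb command printing the first m entries of column j."""
    return "p %s(%d)@%d" % (name, offset(1, j, lda), m)


def row_commands(i, n, lda, name="a"):
    """One gdb print command per entry of row i, for columns 1..n."""
    return ["p %s(%d)" % (name, offset(i, j, lda)) for j in range(1, n + 1)]


if __name__ == "__main__":
    lda, m, n = 10, 8, 4  # example sizes, chosen only for illustration
    print(column_command(1, m, lda))
    print(column_command(2, m, lda))
    for cmd in row_commands(2, n, lda):
        print(cmd)
```

    With lda = 10, column 2 starts at position 10, and the entries of row 2 sit at positions 1, 11, 21 and 31, i.e., lda apart.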

    Once you're stepping through foosub, and want to leap to a breakpoint at line 1142, you can set a new breakpoint.

    gdb>break 1142

    and jump to it by

    gdb> cont

    (provided your code would execute this line). One way to tell where to put the next breakpoint is by opening another xterm with an edit session of the source code. Find the line number you want (in vi, you would park the cursor on the line you want and ascertain its line number by typing :.= ), say 1311, then

    gdb> break 1311

    would put a break at that line. In ddd, the source file is displayed as part of a split screen in which you can scroll up and down, so the spare xterm is not quite as necessary, though it may still be convenient.

    gdb> l 1311

    lists lines around 1311 in the command line window.

    gdb> quit

    exits gdb.

    Of course, this is just a start on how to use a debugger, but it should give a hint of how a debugger can save the time spent recompiling just to put in print statements.
  • The pgdbg Debugger

    The pgdbg debugger uses most of the conventional dbg debugger commands. For some on-line documentation, see The Portland Group user guide. The sample session for gdb will also work for pgdbg, where the session is initiated by

    >pgdbg ./foo

    Displaying a slice of a matrix is a bit easier. While the gdb notation still works, the easier column slice

    pgdbg> print a(2:6,3)

    and row slice

    pgdbg> print a(2,3:5)

    notations are also available. Pgdbg has man pages. Help is available from within the debugging sessions by typing

    pgdbg> help



  • The Totalview Debugger

    We have a license for the Totalview debugger. It works well with Intel ifc compiled codes. A Totalview tutorial is available at Totalview Tutorial.

  • Debuggers on the IBM p590

    To use the dbx debugger, first produce an executable foo.exe by compiling with the IBM Fortran or C compilers and the -g flag.

    >xlf90 -o foo.exe -g foo.f

    You can start a debug session by

    >dbx ./foo.exe

    Breakpoints are set by giving the name of the source file and the line number.

    (dbx) stop at "foo.f":1169

    This will set a break at line 1169 of foo.f.

    The syntax

    (dbx) print a(1,2)

    is valid, but there seems to be no way to show a slice of a matrix. Another problem can be that though scalar variables print quickly, there can be a long delay in printing elements of a matrix.
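    Since dbx accepts only single elements here, a slice has to be spelled out one element at a time. A minimal Python sketch of that expansion follows; the helper name is hypothetical, and the script only produces the command text to type at the (dbx) prompt.

```python
def slice_commands(i1, i2, j1, j2, name="a"):
    """dbx print commands for the elements a(i1:i2, j1:j2),
    listed column by column (Fortran storage order)."""
    return ["print %s(%d,%d)" % (name, i, j)
            for j in range(j1, j2 + 1)
            for i in range(i1, i2 + 1)]


if __name__ == "__main__":
    # Column slice a(2:6,3), one command per element.
    for cmd in slice_commands(2, 6, 3, 3):
        print(cmd)
```

    Given the delay in printing matrix elements noted above, it is worth keeping such slices short.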

    (dbx) cont

    continues execution of the code to the next breakpoint.

    One virtue of the dbx debugger is convenience of examining core files.

    Suppose that a -g compiled code foo runs and dumps a core

    > foo

    Segmentation fault - core dumped

    To investigate the error,

    >dbx foo

    Dbx reports the line where the dump occurred. You can examine the stack (what program called the subroutine and what program called that routine, and so on), and can print variables on each level of the stack.

    (dbx) up

    (dbx) down

    move up and down the stack respectively.

    (dbx) quit

    exits dbx.

    The p590 has a long man page for dbx which includes example sessions.



  • Parallel Debuggers on the Linux BladeCenter

    The pgdbg, Totalview, and gdb debuggers should each work in parallel. The description here covers using gdb (or ddd) on a VCL node.

    MPI (Message Passing Interface) is typically used for distributed memory codes, i.e., each processor has its own memory and communicates with other processors by sending and receiving messages. For purposes of debugging with gdb, we've implemented a shared memory version of MPI. Here the messages are passed within a single shared memory node. It's possible to start more processes than the number of physical cores.

    The mpirun call starts up a process. When an MPI_Init call is encountered, that root process starts up the requested number of new processes. By putting a pause after the MPI_Init, we can identify the new processes and attach each new process to a gdb session. Then we can step through each of the processes individually.

    Here's an example. From one of the 32 bit login nodes,

    source /home/gwhowell/mpiches/mpich-1.2.7p1/gnu32/ch_shmem/gnu32sh.csh

    This sets up the mpif77, mpicc, mpicxx commands to use the gnu32 library. Compile a simple MPI code, e.g., the monte.f code from the MPI short course. An excerpt is as follows.

          real*8 ans(10), ans2(10)
          real*8 startim, entim, sum, sindex
    c
    c   function
          integer string_len
          iflag = 1
    c
          call MPI_INIT(ierr)
          do while (iflag.eq.1)
          end do
    c
          call MPI_COMM_SIZE(MPI_COMM_WORLD, p, ierr)
          call MPI_COMM_RANK(MPI_COMM_WORLD, my_rank, ierr)
    *     print*,' I am ', my_rank 
    c  
          if (my_rank.eq.0) then
    cc         print*,'input random seed'
    

    Notice the "do while" loop just after the MPI_INIT. Since iflag stays equal to 1, this loop stalls the code indefinitely (taking 100% of the cycles on some core). For this reason it's a good idea to debug on a VCL node (go to vcl.ncsu.edu and get an HPC Linux image) as opposed to login01 or login02.
    Typing

    >mpirun -np 2 ./monte
    

    starts up the code. Doing "ps -ef | grep monte"
     
    [gwhowell@login02 ~]$ ps -ef | grep monte
    gwhowell 25532 25386  0 14:39 pts/64   00:00:00 /bin/sh /home/gwhowell/mpiches/mpich-1.2.7p1/gnu32/ch_shmem/bin/mpirun -np 2 ./monte
    gwhowell 25560 25532  6 14:39 pts/64   00:01:42 /home/gwhowell/ppmpi_f/chap03/./monte
    gwhowell 25561 25560 10 14:39 pts/64   00:02:40 /home/gwhowell/ppmpi_f/chap03/./monte
    

    shows that 2 monte jobs have started up. In another window,
    gdb 
    gdb> attach 25560
    gdb>  set iflag = 0 
    

    and in yet another window
    gdb
    gdb> attach 25561
    gdb> set iflag = 0
    

    bring up gdb sessions attached to these two processes. Setting iflag to 0 in each session pulls that process out of the infinite loop. Repeatedly typing "n" will step through a process, and we now have two parallel gdb debug sessions attached to the two processes. A similar approach is outlined in Parallel debugging. (The gdb attach process syntax given there does not quite work on the blade center.) The link shows syntax for stalling C codes while the debugger is attached.
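    With more than two processes, picking the PIDs out of the ps listing by eye gets tedious. As a minimal sketch (in Python; the function name and the filtering rule are assumptions for illustration, not part of MPICH or gdb), the following keeps only the lines whose command is the executable itself, skipping the mpirun wrapper shell, and prints the attach command for each gdb session. It assumes the eight-column "ps -ef" layout shown above.

```python
import os

# The ps listing from the example above, pasted in as sample input.
PS_OUTPUT = """\
gwhowell 25532 25386  0 14:39 pts/64   00:00:00 /bin/sh /home/gwhowell/mpiches/mpich-1.2.7p1/gnu32/ch_shmem/bin/mpirun -np 2 ./monte
gwhowell 25560 25532  6 14:39 pts/64   00:01:42 /home/gwhowell/ppmpi_f/chap03/./monte
gwhowell 25561 25560 10 14:39 pts/64   00:02:40 /home/gwhowell/ppmpi_f/chap03/./monte
"""


def pids_of(executable, ps_text):
    """Return the PIDs of lines whose command (first word of the last
    column) is the named executable."""
    pids = []
    for line in ps_text.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        program = fields[7].split()[0]
        if os.path.basename(program) == executable:
            pids.append(int(fields[1]))
    return pids


if __name__ == "__main__":
    for pid in pids_of("monte", PS_OUTPUT):
        print("attach %d" % pid)
```

    In practice the text would come from running "ps -ef" rather than from a pasted string. Matching on the first word of the command is what excludes the mpirun line, which mentions ./monte only as an argument.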

    You may prefer to start up one or more of the gdb sessions with "ddd". This has a GUI interface which can be useful. For example, when I tried

    gdb>p my_rank
    

    gdb did not know of any such local variable. By clicking the "display" button in ddd, I was able to find a local variable called my_rank__ and print that.

    Totalview debugging can be accomplished in much the same way as outlined here (but so far works only for Fortran and C codes, not for C++). For the Totalview debugger, try

    source /home/gwhowell/mpiches/mpich-1.2.7p1/intel32/ch_shmem/tv32.csh
    

  • Last modified: December 18 2008 14:48:48.