Question about starting MPICH debug session



Robbie
09-18-2007, 08:22 AM
Hi,

Recently I received my enterprise license. But after installing the valid license, I ran into some confusing problems when launching an MPI debug session with the latest TotalView 8.2.0-1. I installed TVD and the license correctly, and TVD works well with a serial program.
My platform is ia64-linux SMP, and I use MPICH2 1.0.5. When MPICH was configured, the totalview location was on the "PATH" environment variable, so the "mtv" module was compiled and linked. The following are the configuration options of my MPICH2 1.0.5:
./configure --prefix=/data2/jiangjie/mpich2-svr --with-device=ch3:ssm --enable-fast --enable-mpe=yes CC=/opt/intel/compiler91/bin/icc CFLAGS=-O3 CXX=/opt/intel/compiler91/bin/icc F77=/opt/intel/compiler91/bin/ifort FFLAGS=-O3

My test program is compiled by mpicc (which calls "icc" internally) with the "-g -O0" options. It works well with "mpirun -np 2 ./testMPI".

When run with "mpirun -tv -np 2 ./testMPI", a popup window appears and asks whether to stop the parallel job. After answering "Yes", there are three processes in the Root window with the following status:
B python(1 active threads)
T python<testMPI>.0(1 active threads)
T python<testMPI>.1(1 active threads)
From this we can see that the MPI job has been stopped and is controlled by TVD.

Then, if I double-click the first process (with status "B"), the Process window shows that it is stopped at function MPIR_Breakpoint in mtv.c, and the stack trace window lists all the stack frames from "start" to "MPIR_Breakpoint".
However, when I double-click the second item in the Root window, no source code (testMPI.c) appears in the source code window as expected, only the assembly instructions of the selected process.

Note that the executable lies in the same location as its source code. Even after specifying the source search path in "File > Search Path...", still no source code appears in the source code window, so I have no way to set a breakpoint. If I press the "Go" button, the parallel job runs to completion without stopping, and no debug operation is possible.

However, I can open the source file using "File > Open Source..." after selecting the second item in the Root window. After this, I can set a breakpoint or a barrier point. Note that the breakpoint or barrier point takes effect only if it is set after "MPI_Init". If it is set before, or even on, MPI_Init, the process will not stop.

Here the main questions are:
1. Why is the source file not opened automatically after selecting the process item in the Root window, so that I have to open it manually?

2. It seems that the parallel job is stopped in the middle of "MPI_Init" (most probably on "MPIR_Breakpoint"). Why not stop it at the entry of "main", or on the first line of the main routine? Is this related to the implementation, configuration, or installation of my MPICH? Or is it because of my TVD setup?

Josh-TotalView-Tech
09-20-2007, 06:03 AM
Hi Robbie,

TotalView will always show you exactly where the process is. If the frame which the process is currently in does not have debug information (the compilation unit was not built with -g) then it will show you the assembler code. Often during MPI startup you will find your processes inside system calls. To get to your source code, you should look at the stack trace in the upper left hand corner of the process window and select the frame containing source. Stack frames with debug information will be indicated with a 'C', 'C++', 'F77', or 'F90' prefixed to the line. In this scenario, you will probably find your code right above the MPI_Init call.

The way TotalView process acquisition works is that we do not attach to the rank processes until MPI_Init. There is no way around this in TotalView versions 8.2 or earlier. However, TotalView 8.3 will feature a new way of launching MPI jobs under TotalView. Part of this is the ability to debug the slave processes before MPI_Init is called. We are currently in the final stages of our beta for 8.3, so if you are interested in signing up please fill in the form at this url: http://www.totalviewtech.com/Support/beta/beta_form.php. We would be very interested in any feedback you have.

The final thing worth stating here is that we do not recommend starting TotalView on an MPICH2 job with the -tv option. You will lose some functionality when you do this; in particular, you will not be able to restart your job under TotalView. We recommend one of two alternatives: either start totalview with no arguments and use the New Program dialog to specify the program to debug and the parallel system used (in this case MPICH2); or start TotalView with 'totalview ./testMPI -a testMPIargs', then go to Process > Startup Parameters, click the Parallel tab, and make sure MPICH2 and the other parallel startup arguments are set as appropriate (subsequent invocations will remember your prior settings). Note that you need to be using MPICH2 version 1.0.5p4 or greater for this to work.

I hope this helps.

Robbie
09-20-2007, 08:34 PM
Hi Josh,

Thanks for your detailed reply.
Now I have managed to start the MPI debug session correctly either by command "mpirun -tv -np 2 ./testMPI" or via the TVD GUI.

Here I have another question, about the "attach" function.
For example, on the console, I typed the following cmd:
%mpirun -np 2 ./testMPI
and
%ps aux|grep python
root 6247 0.0 0.0 14032 10416 ? S 10:54 0:00 python2.4 /usr/local/mpich2-v5/bin/mpd.py --ncpus=1 -e -d
jiangjie 7245 3.7 0.0 14016 10192 pts/7 S+ 10:57 0:00 python2.4 /data2/jiangjie/mpich2-svr/bin/mpirun -np 2 ./testMPI
jiangjie 7258 4.2 0.0 14528 10672 ? S 10:57 0:00 python2.4 /usr/local/mpich2-v5/bin/mpd.py --ncpus=1 -e -d
jiangjie 7259 4.3 0.0 14528 10672 ? S 10:57 0:00 python2.4 /usr/local/mpich2-v5/bin/mpd.py --ncpus=1 -e -d
root 7270 0.0 0.0 3136 1600 pts/3 S+ 10:57 0:00 grep python

%ps aux|grep testMPI
jiangjie 7245 0.4 0.0 14016 10192 pts/7 S+ 10:57 0:00 python2.4 /data2/jiangjie/mpich2-svr/bin/mpirun -np 2 ./testMPI
jiangjie 7260 0.0 0.0 8448 3040 ? S 10:57 0:00 ./testMPI
jiangjie 7261 0.0 0.0 8448 3040 ? S 10:57 0:00 ./testMPI

When I try to attach to the python process (pid=7245) in the "File > New Program" window, the Process window shows that
the selected python process has stopped somewhere, and the MPI job has also stopped. But TVD doesn't ask me whether I would
like to stop all MPI processes, the "Group > Attach Subset" menu item is dimmed, and no program source code is displayed.
Then if I press "Go", the MPI job runs freely, so I have no way to debug it.

Have I attached to the correct process? Should I attach to a different process in order to control all processes of the MPI job?
Or should I do anything more?

Regards,
Jie Jiang

PS: I found that there are many differences between TVD 8.2 and TVD 8.0 or earlier. I didn't encounter so many problems before.

Josh-TotalView-Tech
09-21-2007, 01:29 PM
Excellent question.

To attach to a running MPICH2 job, there are three things you must consider.

1. You must use MPICH2 version 1.0.5p4 or later.

2. You must pass the -tvsu option to mpiexec, mpirun, or python when you start the job. This tells python to load TotalView support for a future attach.

3. You must attach to the python process which started everything else. In your ps output it looks like that is PID 7245. But in general you'll be able to find the right python process by looking for the one that has the -tvsu argument.

You can then use TotalView's File > New Program > Attach to Process dialog and attach to the entire job simply by attaching to that one python process. Note that while TotalView does display the parent-child relationship of the processes in a tree, the right process is not the one that looks like the parent of the other python processes. It is usually displayed without any children, most likely because there is an intermediary process between the startup python and the rest of the job.

I hope this helps.

Robbie
09-21-2007, 08:15 PM
Hi Josh,

After upgrading from MPICH2 1.0.5 to 1.0.5p4 and running the MPI job with "mpirun -tvsu -np 2 ./testMPI",
I can attach to the parallel job now. In the Root window there are three stopped processes.

However, when I select my worker process, no source file is opened in the source code window, and the stack frame window (in the upper left of the Process window) is empty. So I have no way to tell where the process is stopped, and I can't control the execution of the MPI program without source code and a current stack frame.

Note that I did add the source location to the search path. Also, I start the MPI job in the directory where both the executable and the source file reside.

Any suggestions?