View Full Version : Error on startup on struct MPIR_PROCDESC



lhartzman
03-23-2009, 01:07 PM
I'm trying to debug a parallel application using TV8 (8.3) on an RHEL4 cluster. When the debugger starts up to run the application I get an error message box that says: "MPI library contains no suitable type definition for struct MPIR_PROCDESC". This keeps me from being able to do any debugging.

How do I resolve this error? Is there something in the PATH or LD_LIBRARY_PATH that has to be added?

On another cluster (RHEL3) I'm running TV8 (8.4) and don't have a problem. So I would imagine that it is either the specific installation or a setup issue.

All help appreciated.

Thanks.

Les

PeterT-RogueWave
03-24-2009, 10:20 AM
Hello Les,

If you are getting the MPIR_PROCDESC message, you are likely using the 'classic' launch for TotalView, and TV is looking for some information from the MPI binary (mpirun?) about which processes it should attach to on which nodes. This comes from a public API we set up a long time ago. If the MPI implementor sets up the information according to this API, then TotalView should be able to attach to the various nodes and pick up the newly spawned processes automatically. MPIR_PROCDESC is just a typedef for the entries of the MPIR_Proctable, but it gives us the sizes so we don't have alignment problems and start picking up garbage.

I don't know which MPI you are using, but the two most common issues we see are that the MPI binaries are either built with optimization, so MPIR_PROCDESC is lost in the process, or stripped of debug information entirely. We usually see the former as a symptom of the MPI build, and the latter when you install from an rpm package.

If you have the sources, you can usually reconfigure or rebuild the MPI binaries so that procdesc is not lost. With MPICH, the file you want to check for is debugutil.c, which is in ./src/env/. The Makefile in that directory should single out this file and state that it needs to be built with no optimization. But often the build line contains $CFLAGS, and CFLAGS is set (in a previous, top-level Makefile) as containing -O2 or -O3. I just tell people to remove the $CFLAGS entirely. It's a small file that only loses important information under optimization and gains little from the other CFLAGS. We see this less often with MPICH2, but the corresponding files (dbginit.c in particular) are found in src/mpi/debugger.
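A hedged sketch of the kind of Makefile edit Peter describes, assuming an MPICH-style ./src/env/Makefile (the rule and variable names here are illustrative, not copied from any particular MPICH release):

```make
# Build debugutil.c without $(CFLAGS), so no -O2/-O3 leaks in from
# the top-level Makefile; keep only the debug flag.
debugutil.o: debugutil.c
	$(CC) -g -c debugutil.c -o debugutil.o
```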

Many, but not all, of the other MPI implementations are derived from these two and have similar source files, so you can generally look for those and modify as needed.


HOWEVER, not every MPI implementation needs to support this protocol. For those cases we have come up with an 'indirect' method of launching. First start TotalView with just the command

totalview

and then, when the New Program window pops up, enter the name of the program you want to debug. Then switch to the Parallel tab and pick the Parallel System from the drop-down menu. Specify the number of tasks you want to run, and pick the nodes only when you have a current allocation and want to restrict them. If you need to specify a -hostfile or equivalent, or other MPI arguments, add them to the Additional Arguments box. Hit OK, and you should see your program's source. Set breakpoints now, because if this works, it may not stop to ask you what you want to do; that varies with the MPI.

If this route does not trigger the MPIR_PROCDESC message, you should be all set. If you have any problems connecting to the remote node, I'll be posting a message shortly about what to do in that case.

Regards,

lhartzman
03-24-2009, 10:59 AM
Hello Pete,

Thanks for the information. I am using MPI/Pro 1.7 from Verari. The regular cluster I use this on works fine. I don't know whether the cluster where I'm trying (and failing) to use TV has MPI installed the same way or not (different sys admins).

I'm not sure I can use your alternate scenario as I'm going through PBS to schedule the job, hence no control over the nodes used (this is a production cluster with many users).

I'll see if I can find out anything about how 1.7 was installed.

Les

PeterT-RogueWave
03-24-2009, 02:08 PM
Les

>Thanks for the information. I am using MPI/Pro 1.7 from Verari. The regular cluster I use this on works fine. I don't know if the cluster I'm
> trying to use TV on and not succeeding has MPI installed the same or not (different sys admins).

We don't have that one in house, so we haven't really tested with it. The fact that TotalView is coming up indicates they have implemented some support for it, but I suspect we are running into the optimization or stripping problem, and if you don't have the sources to build with, it is all under Verari's control.

> I'm not sure I can use your alternate scenario as I'm going through PBS to schedule the job, hence no control over the nodes used (this is
> a production cluster with many users).

It might still be possible, but not quite in the way I assumed. MPI/Pro is not one of the parallel system choices we ship. However, you can always set up one of your own. The parallel support is in a file called parallel_support.tvd, which is found in the

<install_path>/toolworks/totalview.8.<version>/lib directory.

That file has a lot of comments in it explaining exactly what is going on and which methods are being used. It suggests that you not edit that file itself, but instead set up your own methods for unknown MPIs in your .tvdrc file, in a

dset TV::parallel_configs {

}

section. I don't know enough about MPI/Pro to suggest how to fill that out, but assuming you can figure that out, you might be able to submit a job with something like

totalview -mpi "MPI/Pro" -np 4 ./mympi_prog

And then let the batch system worry about the node allocations.
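Putting the pieces above together, a .tvdrc entry might look something like the sketch below. The field names and the starter line are hypothetical placeholders, not taken from MPI/Pro or any shipped configuration; copy the exact keys and replacement sequences from a working entry in the comments of parallel_support.tvd, since the format varies by TotalView version.

```tcl
# Hypothetical sketch only -- lift the real field names and the
# starter syntax from an existing entry in parallel_support.tvd.
dset TV::parallel_configs {
    name: MPI/Pro;
    description: Verari MPI/Pro 1.7;
    starter: mpirun -np %N %P %A;
}
```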

Regards,

lhartzman
03-24-2009, 04:30 PM
Pete,

I'll see what I can figure out on my own.

Les