License tokens for managers (Scali & Intel MPI)

11-28-2006, 01:05 PM
When using TotalView with a Team license to debug either Scali MPI or Intel MPI jobs, I find that more license tokens are checked out than MPI ranks created (NP + 1 for Scali, NP + 2 for Intel MPI 1.0). It appears that the extra tokens are assigned to the manager processes. For Scali MPI this seems quite clear: selecting "View->Display Managers" in the root window shows the mpimon process.

Is there a way to stop TotalView from assigning a license to the manager process(es), or to detach from it so the extra license token is not in use? We have N team tokens, and cannot run an N-way MPI job within TotalView because of this restriction.


11-30-2006, 07:59 AM
Hi Steve,

TotalView's team licensing provides the user with the ability to attach to a certain number of processes. This is a more general concept than MPI. In the specific situation of debugging an MPI job, TotalView gathers a list of the processes that make up the parallel job from the MPI runtime; the user can always decide which tasks they want to attach to.

TotalView will generally take a process token for every process that it is attached to. When you go over a process token limit, TotalView should prompt you, asking whether you want to kill or detach from the process that put you over the limit. In your case, if you choose to detach, you should be able to debug at least N-1 (Scali MPI) or N-2 (Intel MPI) processes. In this scenario, if you want to restart the job, be aware that TotalView does not have control of the process you detached from, so you may have to kill that process manually.
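To make the token arithmetic above concrete, here is a minimal Python sketch. The manager counts (1 for Scali MPI, 2 for Intel MPI 1.0) are simply the numbers reported in this thread, not documented constants, and the function and dictionary names are made up for illustration.

```python
# Extra (manager) processes that took a token, as observed in this thread.
# These are reported observations, not values taken from TotalView documentation.
MANAGER_PROCESSES = {"scali": 1, "intel-1.0": 2}


def tokens_needed(ranks: int, mpi: str) -> int:
    """Tokens checked out when every rank and every manager is attached."""
    return ranks + MANAGER_PROCESSES[mpi]


def max_attachable_ranks(tokens: int, mpi: str) -> int:
    """Ranks that can be attached while the manager processes hold tokens."""
    return max(tokens - MANAGER_PROCESSES[mpi], 0)


if __name__ == "__main__":
    for mpi in MANAGER_PROCESSES:
        print(mpi, tokens_needed(4, mpi), max_attachable_ranks(4, mpi))
```

With 4 tokens this gives 3 attachable ranks for Scali MPI and 2 for Intel MPI 1.0, which is the N-1 / N-2 limit described above.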

In the scenario I described, TotalView will simply detach from the last process (or two) that it attached to, whichever that may be. If you would like to do this in a more controlled fashion, you could use a powerful feature (particularly useful if you are debugging at a very large scale) called Subset Attach. Go to File > Preferences, open the Parallel tab, and under "When a job goes parallel" (the bottom set of options) check "Ask what to do". Then, when you start a job under TotalView, it will open the Attach Subset dialog box. You can use this dialog box to choose which processes you don't want to be attached to. As before, be aware that TotalView no longer has control of any process you choose to detach from.

11-30-2006, 12:14 PM

Thanks for this information. I tested this with Scali MPI, and see that I can get N-1 processes. It does not seem to be possible to attach and detach from processes during the run; that is, if I start a 4-way Scali job and have only 4 licenses available, I can attach to three of them at the beginning. However, I cannot detach from one and attach to the fourth one; also, once I detach from one I cannot reattach to it. Are these expected behaviors?


11-30-2006, 12:25 PM
Hi Steve,

How are you attempting to attach to the 4th process or reattach to one you have detached from? The best way to do this is to use the Group > Attach Subset dialog. Have you tried that? What errors are you receiving?

12-05-2006, 12:55 PM

Thanks for your reply; sorry for the delay in responding. Here's what I have done (TotalView-8.0.0-2), running a 4-rank MPI job with only 4 Team license tokens available:

* Run TotalView, with no arguments, to get the "New Program" window.

* Enter the executable name in the "Program:" area, leave "Start a new process" chosen, and put the command line arguments in place using the "Arguments" tab. (In this case, because I've set up Scali MPI to be run from within TotalView using mpimon, I add the "--" and the hostname arguments required by mpimon in this location.) Ensure Scali MPI is selected for the Parallel system in the Parallel tab, and click "OK". I should note that all of this works properly if there are enough licenses to cover the manager and the MPI ranks.

* Running lmstat at this time shows 1 license in use.

* When the main window pops up, I click on "go". A couple of "Question" windows pop up related to the loading of libraries, and then the window saying "this is a parallel job, do you want to stop to set breakpoints?" appears. I click on "yes". TotalView works to attach to remote servers, and I am finally given a window which reads: "Unable to acquire a license for 5 processes. The current maximum is 2. You can kill or detach from these processes. Kill them now?" Checking the license server with lmstat shows 2 licenses in use, so I can sort of understand why the current maximum is 2 licenses (that's what's still available). However, it seems odd that TotalView doesn't recognize that the 2 licenses in use are part of this process family (MPI manager and local rank), so only 3 additional licenses should be needed to cover the remote processes. In any event, I click "no", so the processes are not killed. I then open the Group -> Attach Subset window, and pick ranks 0, 1, and 2 to be attached.

* At this point, prior to clicking on "OK", there are processes with IDs 1, 3, 4, and 5 in the control window (the little one titled "Etnus TotalView 8.0.0-2"), which are respectively ranks 0, 1, 2, and 3. I then click on "OK", and processes 3, 4, and 5 disappear, and 6 and 7 take their place. So now I have three processes in this control window, with IDs 1, 6, and 7, which are ranks 0, 1, and 2 of a parallel job. Each has status "T" at the moment. I click the "Out" button a few times to get to an interesting subroutine where I can set a breakpoint, set one in a useful place, and then hit "Go". When I get to the breakpoint, I want to detach from process ID 6 (now rank 1) and attach to rank 3.

* I try to do this by again entering the Group -> Attach Subset window, clicking on rank 1 to detach from it, and then on rank 3 to attach to it. At this point, after clicking "OK", process 8 is created, according to the output on stdout/stderr. I again get the window complaining about a license shortage: I need 5 but only have 4. I again do not kill processes, and now appear to have only the processes with IDs 1 and 7 (that is ranks 0 and 2) in the control window. At this point, when I click on "Go" (I should note that the mode of the control buttons has been "Group (Control)" the entire time), things begin to run again, and the job continues from where it left off--this may be different from what I reported previously. However, I still have no control over process 8, which appears as having been exited if I enable View -> Display Exited Threads in the control window. Furthermore, lmstat still shows me with 4 licenses checked out, even though only three non-exited items are shown in the control window (View -> Display Managers is enabled).

* At this point, if I try to reattach to rank 1, I am successful in doing so, though it now has ID 9 rather than ID 6.

* If I detach from rank 1 and attach to rank 3 in two separate invocations of Group -> Attach Subset, it does in fact work. However, detaching from rank 0 is apparently not a good idea: the license count got messed up when I reattached to rank 0, so now I can have only two active ranks before I get the license shortage error.
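For anyone repeating the lmstat checks in the steps above, the following is a rough Python sketch for pulling the issued/in-use counts out of `lmstat -a` style output. It assumes the usual FLEXlm "Users of <feature>: (Total of N licenses issued; Total of M licenses in use)" summary line, whose exact wording may differ between license manager versions. The feature name in the sample text is a placeholder, not the real TotalView feature name.

```python
import re

# Assumed shape of the per-feature summary line in `lmstat -a` output.
USERS_LINE = re.compile(
    r"Users of (?P<feature>\S+):\s+"
    r"\(Total of (?P<issued>\d+) licenses? issued;"
    r"\s+Total of (?P<in_use>\d+) licenses? in use\)"
)


def parse_lmstat(text):
    """Return {feature: (issued, in_use)} for every summary line found."""
    usage = {}
    for match in USERS_LINE.finditer(text):
        usage[match.group("feature")] = (
            int(match.group("issued")),
            int(match.group("in_use")),
        )
    return usage


# Hypothetical sample output; EXAMPLE_TEAM_FEATURE is a made-up feature name.
SAMPLE = """\
Users of EXAMPLE_TEAM_FEATURE:  (Total of 4 licenses issued;  Total of 2 licenses in use)
"""

if __name__ == "__main__":
    for feature, (issued, in_use) in parse_lmstat(SAMPLE).items():
        print(f"{feature}: {in_use} of {issued} in use, {issued - in_use} free")
```

Running the parser before and after each Attach Subset step makes it easier to see whether a token was actually returned to the server.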

So the problem is that the license isn't freed up in a timely (in my opinion, anyway) manner when trying to simultaneously attach and detach from processes in the Group -> Attach Subset window.
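As a way of picturing the behavior, here is a toy Python model of a token pool. It is purely illustrative: it assumes, without any confirmation from TotalView, that a combined detach-and-attach requests the new token before the old one is returned. The class and function names are invented for this sketch.

```python
class TokenPool:
    """Toy license pool; not a model of TotalView's real licensing code."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.held = set()

    def acquire(self, name):
        if len(self.held) >= self.capacity:
            needed = len(self.held) + 1
            raise RuntimeError(
                f"license shortage: need {needed}, have {self.capacity}"
            )
        self.held.add(name)

    def release(self, name):
        self.held.discard(name)


def swap_combined(pool, detach, attach):
    # Assumed ordering: the new attach asks for its token first,
    # so demand briefly peaks at one more token than before.
    pool.acquire(attach)
    pool.release(detach)


def swap_two_steps(pool, detach, attach):
    # Workaround from this thread: detach first, then attach separately.
    pool.release(detach)
    pool.acquire(attach)


def full_pool():
    pool = TokenPool(4)
    for name in ("mpimon", "rank0", "rank1", "rank2"):
        pool.acquire(name)
    return pool


if __name__ == "__main__":
    try:
        swap_combined(full_pool(), "rank1", "rank3")
    except RuntimeError as err:
        print("combined swap failed:", err)
    pool = full_pool()
    swap_two_steps(pool, "rank1", "rank3")
    print("two-step swap holds:", sorted(pool.held))
```

Under that assumption, the combined swap on a full pool of 4 asks for a fifth token, which lines up with the "need 5 but only have 4" message, while the two-step version never exceeds 4.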

This is an extremely minor nuisance issue, now that I have seen how to go about it. Thanks for driving me to document exactly what I was doing, so I could find the workaround.