How to start parallel debug session using SLURM



Robbie
11-25-2006, 02:43 AM
Hi,
I'd like to debug an MPI job using TotalView. However, our parallel environment uses SLURM to launch jobs.
So I'd like to know how to combine TotalView with SLURM.
I notice that the list of parallel systems in the "Parallel" tab, shown when starting a new program, does not include SLURM.
But I have also heard that TotalView can be used with SLURM.
But how?

Best regards,
Robbie

Josh-TotalView-Tech
11-27-2006, 09:00 AM
Hi Robbie,

There are a couple of ways you can start a parallel job with TotalView. The "Parallel" tab option is relatively new. The parallel systems in this dialog are customizable. If you would like to see SLURM in this list, add the following to your $HOME/.tvdrc file:

dset TV::parallel_configs {

    name: SLURM;
    description: SLURM;
    starter: srun %s %p %a;
    style: manager_process;
    tasks_option: -n;
    nodes_option: -N;
    env: ;
    force_env: false;

}

As an alternative, you can start TotalView on your parallel program by running:

    totalview srun -a [srun args] <program> [program args]

When TotalView starts up, you will at first be debugging srun. Hitting the GO button will execute:

    srun <program> [program args]

under the debugger.

Robbie
11-27-2006, 05:09 PM
Hi Josh,

Thanks a lot for your detailed reply.
As we know, TotalView launches the debug server, tvdsvr, through rsh/ssh on the compute nodes allocated by srun.
This means that the target compute nodes must provide a remote shell service such as rsh or ssh.

Now I can't launch the debug session because my system doesn't provide such services, and rsh reports "connection refused".
It seems that this problem can be solved simply by providing an rsh/ssh service on the compute nodes. Right?

Best regards,
Robbie

mwolfe
11-27-2006, 06:34 PM
Robbie,

An alternative is to use srun to bulk launch the tvdsvr's.
What TV platform do you have SLURM on, linux-x86?
I can give you bulk launch strings to place in either your
global or your local .tvdrc file.

Robbie
11-27-2006, 11:23 PM
Hi Matt,

That's what I wanted. It would be better to have srun bulk launch all the necessary tvdsvr's.

My platform is Linux-IA64. Please tell me how to do it.

Best regards,
Robbie




mwolfe
11-28-2006, 09:51 AM
Robbie,

Include the following lines in either your global or your local .tvdrc file:

<pre>
# Set Local Preferences
# Root window File > Preferences...
# Bulk Launch page
# If Linux IA64, and SLURM is running, use bulk launch
if {[string compare [dset TV::platform] linux-ia64] == 0 && ![catch {exec scontrol ping >& /dev/null}]} {
#
# Enable debug server bulk launch: Checked
dset -set_as_default TV::bulk_launch_enabled true

# Command:
# Beginning with TV 7X.1, TV supports SLURM and %J.
dset -set_as_default TV::bulk_launch_string {srun --jobid=%J -N%N -n%N -w`awk -F. 'BEGIN {ORS=","} {if (NR==%N) ORS=""; print $1}' %t1` -l --input=none %B/tvdsvr%K -callback_host %H -callback_ports %L -set_pws %P -verbosity %V -working_directory %D %F}

# Temp File 1 Prototype:
# Host Lines:
# SLURM NodeNames need to be unadorned hostnames. In case %R returns
# fully qualified hostnames, list the hostnames in %t1 here, and use
# awk in the launch string above to strip away domain name suffixes.
dset -set_as_default TV::bulk_launch_tmpfile1_host_lines {%R}
}
</pre>
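
The awk fragment inside the bulk launch string is the least obvious part, so here is a minimal sketch of what it does when run on its own. It assumes TotalView replaces %N with the host count and %t1 with a temp file holding one hostname per line; the function name `short_hostlist` and the sample hostnames are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: reproduce the awk step from the bulk launch string.
#   $1 = number of hosts (the value substituted for %N)
#   $2 = file with one hostname per line (the file substituted for %t1)
# -F. splits each line on dots, so $1 inside awk is the unadorned hostname.
# ORS="," joins the names with commas; it is cleared on the last line so
# the list has no trailing comma.
short_hostlist() {
    awk -F. -v n="$1" 'BEGIN {ORS=","} {if (NR==n) ORS=""; print $1}' "$2"
}

# Made-up sample data standing in for the temp file TotalView writes.
tmpfile=$(mktemp)
printf '%s\n' node81.example.com node82.example.com node83 > "$tmpfile"

hostlist=$(short_hostlist 3 "$tmpfile")
echo "$hostlist"    # prints: node81,node82,node83
rm -f "$tmpfile"
```

The resulting comma-separated list is what ends up after srun's -w option in the launch string.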

Robbie
11-29-2006, 02:21 AM
Hi Matt,

After copying the lines into my .tvdrc file and running "totalview srun -a -n count app",

the bulk launch failed and produced the following messages:
.....
srun --jobid=704 -N1 -n1 -w`awk -F. 'BEGIN {ORS=","} {if (NR==1) ORS=""; print $1}' /vol5/s611/src/ygxmpi/TVT1I5WIXH` -l --input=none /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr -callback_host server -callback_ports node81:16382 -set_pws 760f1e5b:75c82a89 -verbosity info -working_directory /vol5/s611/src/ygxmpi
0: /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr: line 1: dirname: command not found
srun: error: node81: task0: Exited with exit code 1
0: /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr: line 1: dirname: command not found
0: /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr: line 1: dirname: command not found
0: /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr: line 1: /bin/pwd: No such file or directory
0: /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr: line 1: dirname: command not found
0: /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr: Unable to find installation directory.

But the file /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr does exist, and there is no problem with its access rights.
Note that the tvdsvr in linux-ia64/bin is a symbolic link to bin/tvdsvr. But even if I copy bin/tvdsvr into this directory, the problem still exists.

Any suggestions?

Best regards,
Robbie

mwolfe
11-29-2006, 10:09 AM
Robbie,

Do you have coreutils installed?

<pre>
tla0{mwolfe}44: which dirname
/usr/bin/dirname
tla0{mwolfe}45: rpm -qf /usr/bin/dirname
coreutils-5.2.1-31.2
tla0{mwolfe}46: rpm -qa | grep coreutils
coreutils-5.2.1-31.2
tla0{mwolfe}47:
</pre>
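
If rpm is not available on the compute nodes, a small script along these lines could report which of the commands from the error messages are missing. This is only a sketch: the function name `missing_tools` is made up, and the list of things to check (dirname and /bin/pwd) is taken from the errors quoted above rather than from any official list of what tvdsvr needs.

```shell
#!/bin/sh
# Print every item from the argument list that cannot be found.
# Names containing a slash are treated as paths and tested with [ -x ];
# bare names are looked up with "command -v". Both are shell builtins,
# so the check itself does not depend on coreutils being installed.
missing_tools() {
    for tool in "$@"; do
        case $tool in
            */*) [ -x "$tool" ] || echo "$tool" ;;
            *)   command -v "$tool" >/dev/null 2>&1 || echo "$tool" ;;
        esac
    done
}

# dirname and /bin/pwd are the two items named in the tvdsvr errors.
missing=$(missing_tools dirname /bin/pwd)
if [ -n "$missing" ]; then
    echo "missing:" $missing
else
    echo "all required commands found"
fi
```

To check a compute node rather than the login node, the script could be run through srun, for example "srun -N1 sh ./check_tools.sh" (the file name is hypothetical), provided it sits on a filesystem the node can see.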

Robbie
11-29-2006, 06:00 PM
Hi Matt,

Oh... Our compute nodes don't have coreutils installed, in order to save ramdisk space!
Thanks.

Another problem: after TotalView has launched a parallel debug session using "totalview srun -a -n xxx app",
sometimes the debuggees (parallel processes) are still in the running state (according to "squeue") even after TotalView has exited (Ctrl+Q).
As a result, the allocated nodes are not released, and the available compute nodes will be exhausted sooner or later.
Any ideas?

Best regards,
Robbie

mwolfe
11-29-2006, 09:26 PM
Robbie,

You may want to uncomment the last 3 lines of lib/kill_support.tvd .

Also, we train our users to make a practice of following a TV session
with "squeue -u $USER" and if necessary, an "scancel" to clean up
the allocation in which TV was running. In a perfectly integrated world,
that wouldn't be necessary, but it's a good habit today to prevent wasted
compute resources.
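
A rough sketch of that habit as a script is shown below. It does not cancel anything itself; it only prints one scancel line per job so the list can be reviewed first. The function name `suggest_cleanup` is made up, and the squeue options used (-h for no header, -u for the user, -o with %i, %T and %j for job ID, state and name) are assumed to be available in your SLURM version.

```shell
#!/bin/sh
# Print a "scancel <jobid>" line for each job squeue reports for the user.
# Nothing is cancelled here; the output is meant to be read and then
# run by hand for the jobs that really are leftovers.
suggest_cleanup() {
    user=${1:-${USER:-$(id -un)}}
    # The job name is read last so that names containing spaces stay intact.
    squeue -h -u "$user" -o '%i %T %j' | while read -r jobid state name; do
        echo "scancel $jobid   # $name, $state"
    done
}

# Only query the queue when the SLURM client tools are on this machine.
if command -v squeue >/dev/null 2>&1; then
    suggest_cleanup
else
    echo "squeue not found; run this on a SLURM login node"
fi
```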

Robbie
12-02-2006, 08:58 AM
Hi Matt,

Thanks for your help. I have managed to launch an MPI debug session with SLURM.
I have another question.
As we know, TotalView can attach to an existing process launched by srun.
In this case, stdin, stdout and stderr are all redirected to srun.
When TotalView attaches to this process, how are its stdin, stdout and stderr handled?
Are they still managed by srun, or controlled by TotalView?

Best regards,
Robbie

mwolfe
12-04-2006, 07:28 PM
Hi Robbie,

Still managed by srun, as far as I know.