
Thread: How to start parallel debug session using SLURM

  1. #1
    Junior Member
    Join Date
    Nov 2006
    Posts
    12

    How to start parallel debug session using SLURM

    Hi,
    I'd like to debug an MPI job using TotalView. However, our parallel environment uses SLURM to launch jobs.
    So I'd like to know how to combine TotalView with SLURM.
    I notice that the list of parallel systems in the "Parallel" tab, shown when starting a new program, does not include SLURM.
    But I have also heard that TotalView can be used with SLURM.
    But how?

    Best regards,
    Robbie

  2. #2
    Senior Member
    Join Date
    Jun 2006
    Location
    Natick MA
    Posts
    145

    Re: [Robbie] How to start parallel debug session using SLURM

    Hi Robbie,

    There are a couple of ways you can start up a parallel job with TotalView. The "Parallel tab" option is relatively new. The "parallel systems" in this dialog are customizable. If you would like to see SLURM on this list, add the following to your $HOME/.tvdrc file:

    dset TV::parallel_configs {
        name: SLURM;
        description: SLURM;
        starter: srun %s %p %a;
        style: manager_process;
        tasks_option: -n;
        nodes_option: -N;
        env: ;
        force_env: false;
    }

    As an alternative, you can start TotalView on your parallel program by running:

        totalview srun -a [srun args] <program> [program args]

    When TotalView starts up, you will at first be debugging srun. Pressing the GO button will execute:

        srun <program> [program args]

    under the debugger.
    Josh Carlson

  3. #3
    Junior Member
    Join Date
    Nov 2006
    Posts
    12

    Re: [Josh-Etnus] How to start parallel debug session using SLURM

    Hi Josh,

    Thanks a lot for your detailed reply.
    As we know, TotalView launches the debug server, tvdsvr, through rsh/ssh on the compute nodes allocated by srun.
    This means that the target compute nodes must provide a remote service such as rsh or ssh.

    Right now I can't launch the debug session because my system doesn't provide such services, and rsh reports "connection refused".
    It seems that this problem can be solved just by providing rsh/ssh service on the compute nodes. Right?

    Best regards,
    Robbie

  4. #4
    Junior Member
    Join Date
    Aug 2006
    Posts
    12

    Re: [Robbie] How to start parallel debug session using SLURM

    Robbie,

    An alternative is to use srun to bulk launch the tvdsvr's.
    What TV platform do you have SLURM on, linux-x86?
    I can give you bulk launch strings to place in either your
    global or your local .tvdrc file.

    -Matt Wolfe, LLNL

  5. #5
    Junior Member
    Join Date
    Nov 2006
    Posts
    12

    Re: [mwolfe] How to start parallel debug session using SLURM

    Hi Matt,

    That's what I wanted. It would be better to have srun bulk launch all the necessary tvdsvr's.

    My platform is Linux-IA64. Please tell me how to do it.

    Best regards,
    Robbie


  6. #6
    Junior Member
    Join Date
    Aug 2006
    Posts
    12

    Re: [Robbie] How to start parallel debug session using SLURM

    Robbie,

    Include the following lines in either your global or your local .tvdrc file:

    <pre>
    # Set Local Preferences
    # Root window File > Preferences...
    # Bulk Launch page
    # If Linux IA64, and SLURM is running, use bulk launch
    if {[string compare [dset TV::platform] linux-ia64] == 0 && ![catch {exec scontrol ping >& /dev/null}]} {
    #
    # Enable debug server bulk launch: Checked
    dset -set_as_default TV::bulk_launch_enabled true

    # Command:
    # Beginning with TV 7X.1, TV supports SLURM and %J.
    dset -set_as_default TV::bulk_launch_string {srun --jobid=%J -N%N -n%N -w`awk -F. 'BEGIN {ORS=","} {if (NR==%N) ORS=""; print $1}' %t1` -l --input=none %B/tvdsvr%K -callback_host %H -callback_ports %L -set_pws %P -verbosity %V -working_directory %D %F}

    # Temp File 1 Prototype:
    # Host Lines:
    # SLURM NodeNames need to be unadorned hostnames. In case %R returns
    # fully qualified hostnames, list the hostnames in %t1 here, and use
    # awk in the launch string above to strip away domain name suffixes.
    dset -set_as_default TV::bulk_launch_tmpfile1_host_lines {%R}
    }
    </pre>
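
The awk expression inside the bulk launch string above can be tried on its own to see what it hands to `srun -w`. A minimal sketch, assuming a made-up three-line host file in place of the %t1 temporary file and 3 in place of %N (the hostnames and the `hosts_file` and `node_list` names are hypothetical, not from the thread):

```shell
# Stand-in for the %t1 temp file: one hostname per line, some fully qualified.
hosts_file=$(mktemp)
printf 'node81.example.org\nnode82.example.org\nnode83\n' > "$hosts_file"

# Same awk program as in the launch string, with n playing the role of %N.
# -F. splits on dots so $1 is the unadorned hostname; ORS="," joins the
# names with commas, and ORS is emptied on the last line to avoid a trailing comma.
node_list=$(awk -F. -v n=3 'BEGIN {ORS=","} {if (NR==n) ORS=""; print $1}' "$hosts_file")

echo "$node_list"
rm -f "$hosts_file"
```

The result is a comma-separated list of short hostnames, which is the form the comments above say SLURM NodeNames need to take.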

    -Matt Wolfe, LLNL

  7. #7
    Junior Member
    Join Date
    Nov 2006
    Posts
    12

    Re: [mwolfe] How to start parallel debug session using SLURM

    Hi Matt,

    After copying the lines into my .tvdrc file and running: totalview srun -a -n count app,

    the bulk launch failed and produced the following messages:
    .....
    srun --jobid=704 -N1 -n1 -w`awk -F. 'BEGIN {ORS=","} {if (NR==1) ORS=""; print $1}' /vol5/s611/src/ygxmpi/TVT1I5WIXH` -l --input=none /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr -callback_host server -callback_ports node81:16382 -set_pws 760f1e5b:75c82a89 -verbosity info -working_directory /vol5/s611/src/ygxmpi
    0: /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr: line 1: dirname: command not found
    srun: error: node81: task0: Exited with exit code 1
    0: /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr: line 1: dirname: command not found
    0: /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr: line 1: dirname: command not found
    0: /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr: line 1: /bin/pwd: No such file or directory
    0: /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr: line 1: dirname: command not found
    0: /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr: Unable to find installation directory.

    But the file /vol5/s611/toolworks/totalview.8.0.0-2/linux-ia64/bin/tvdsvr does exist, and there is no problem with its access permissions.
    Note that the tvdsvr in linux-ia64/bin is a symbolic link to bin/tvdsvr. But even if I copy bin/tvdsvr into this directory, the problem still exists.

    Any suggestions?

    Best regards,
    Robbie

  8. #8
    Junior Member
    Join Date
    Aug 2006
    Posts
    12

    Re: [Robbie] How to start parallel debug session using SLURM

    Robbie,

    Do you have coreutils installed?

    <pre>
    tla0{mwolfe}44: which dirname
    /usr/bin/dirname
    tla0{mwolfe}45: rpm -qf /usr/bin/dirname
    coreutils-5.2.1-31.2
    tla0{mwolfe}46: rpm -qa | grep coreutils
    coreutils-5.2.1-31.2
    tla0{mwolfe}47:
    </pre>
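
On systems without rpm, the same question can be asked directly: are the commands that the tvdsvr wrapper complained about (dirname and /bin/pwd in the error output earlier in the thread) present on the node? A minimal sketch, assuming a hypothetical helper named `check_tools`; as written it checks the machine it runs on, so it would have to be run on a compute node to be meaningful there:

```shell
# check_tools NAME...: report each command that cannot be found.
# Returns 0 if everything was found, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# The two commands named in the tvdsvr error messages.
check_tools dirname /bin/pwd || echo "some required commands are missing on $(hostname)"
```

Running the same loop through srun inside an allocation would show what the compute nodes actually provide, which may differ from the login node.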

    -Matt Wolfe, LLNL

  9. #9
    Junior Member
    Join Date
    Nov 2006
    Posts
    12

    Re: [mwolfe] How to start parallel debug session using SLURM

    Hi Matt,

    Oh... Our compute nodes don't have coreutils installed, in order to save ramdisk space!
    Thanks.

    Another problem. After TotalView has launched a parallel debug session using "totalview srun -a -n xxx app",
    sometimes the debuggees (parallel processes) are still in the "Running" state (according to "squeue") even after TotalView has exited (Ctrl+Q).
    This means the allocated nodes can't be released, and the available compute nodes will be exhausted sooner or later.
    Any idea?

    Best regards,
    Robbie

  10. #10
    Junior Member
    Join Date
    Aug 2006
    Posts
    12

    Re: [Robbie] How to start parallel debug session using SLURM

    Robbie,

    You may want to uncomment the last 3 lines of lib/kill_support.tvd .

    Also, we train our users to make a practice of following a TV session
    with "squeue -u $USER" and if necessary, an "scancel" to clean up
    the allocation in which TV was running. In a perfectly integrated world,
    that wouldn't be necessary, but it's a good habit today to prevent wasted
    compute resources.
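
That cleanup habit might be sketched as follows. This is a dry run under assumptions: the `cancel_cmds` helper is hypothetical, and it only prints scancel commands for review, because `squeue -u $USER` lists every job you own, not only the allocation TotalView ran in.

```shell
# cancel_cmds: read job IDs on stdin, print one scancel command per ID.
# Nothing is cancelled here; the output is meant to be reviewed first.
cancel_cmds() {
  while read -r job_id; do
    [ -n "$job_id" ] && echo "scancel $job_id"
  done
  return 0
}

# Only query SLURM if squeue is available on this machine.
# -h drops the header line, -o "%i" prints just the job ID.
if command -v squeue >/dev/null 2>&1; then
  squeue -h -u "$USER" -o "%i" | cancel_cmds
else
  echo "squeue not found; nothing to check"
fi
```

After checking the printed list against the jobs you expect to be running, the lines for leftover debug allocations can be run by hand.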

    -Matt Wolfe, LLNL
