Getting remote debugging working



NeilOwens
02-15-2008, 04:45 AM
I have an FC5 development box and am trying to remotely debug an application running on an x86 Gentoo box. When I try to add a host in the New Program... dialog, I get the message "Can't attach to cluster #<number>: -1". I'm aware that going between two different Linux distros may not work, but I'm trying it anyway; if we end up using it, we can always have dedicated Gentoo development boxes for this specific task. I imagine I've made a noob error somewhere, so I'll describe the setup to see if you can spot where I've got it wrong.

I'm currently evaluating TotalView for use in my department, so I'm using a demo licence file. I've installed TotalView on the target, using the same licence file as on the development box I want to debug from.

ssh definitely works to the box; I can even start tvdsvr manually from the terminal by passing the command to ssh (i.e. if I enter ssh root@<ip> <path>/tvdsvr -server, it starts). I've set up the ssh keys so that no password is required for my user to log in as root on the target.

I've mounted the source folder on my development box as a mount point on the target so the paths are identical.

rsh doesn't work, for reasons I haven't yet worked out. I've changed the single debug server launch command to

ssh -l root %R -n "%B/tvdsvr%K -working_directory %D -callback %L -set_pw %P -verbosity %V %F"

Can anyone see any schoolboy errors?

PeterT-RogueWave
02-17-2008, 09:00 PM
Neil,

I don't think going from FC5 to Gentoo would make that much of a difference. Linux distros are close enough that they should work together. Of course, I don't have a Gentoo system around to check, but I went from FC5 to Ubuntu 6.10 without any problems. Is the FC5 platform the same architecture as the x86 Gentoo? You will get a failure if one of them is 64-bit x86-64 and the other is 32-bit x86. That is considered cross-platform, and we really only handle that on a few platforms (Bluegene, SiCortex, Cray) at the moment. Otherwise, you appear to have done all the steps I would normally suggest for getting the communication working between the systems. I would probably set the environment variable TVDSVRLAUNCHCMD to ssh, rather than changing the %C substitution character directly to ssh in the launch string, but the effect is the same.
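A quick way to check the architecture point is to compare the uname -m output of the two boxes. The sketch below is only an illustration: the helper names normalize_arch and same_arch are made up for this example, and treating i386 through i686 as one family is an assumption about what counts as "the same architecture" here.

```shell
# normalize_arch ARCH: map a `uname -m` string to a coarse family name.
# Assumption: i386..i686 are all treated as 32-bit x86 for this check.
normalize_arch() {
    case "$1" in
        i[3-6]86)      echo x86 ;;
        x86_64|amd64)  echo x86_64 ;;
        *)             echo "$1" ;;
    esac
}

# same_arch LOCAL REMOTE: succeed if both strings fall in the same family.
same_arch() {
    [ "$(normalize_arch "$1")" = "$(normalize_arch "$2")" ]
}

# Usage from the debug box (<ip> is the target, as in the posts above):
#   same_arch "$(uname -m)" "$(ssh root@<ip> uname -m)" && echo match || echo mismatch
```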

If you are still having problems, I would suggest sending your console output from TotalView and the output of uname -a from both systems to support@totalviewtech.com, and we can get into more detail there. I'll sum up any results from that in a reply to this note.

Regards.

NeilOwens
02-18-2008, 01:53 AM
Thanks for your response Pete.

I've made somewhat more progress, but without solving the problem. I changed the start string to

ssh root@%R -n "%B/tvdsvr%K -working_directory %D -callback %L -set_pw %P -verbosity %V %F"

The full start string, as reported by the event log, is

ssh root@10.0.1.13 -n "/usr/toolworks/totalview.8.4.0-0/linux-x86/bin/tvdsvr -working_directory / -callback 10.0.0.2:4142 -set_pw 64b3bd88:5b474b34 -verbosity info "

I can do a ps -ef | grep tvd on the target to confirm that the tvdsvr process is being started via the ssh session. Its command line appears to be largely identical to the start string above, except that the running process is actually called tvdsvrmain and is launched as

"/usr/toolworks/totalview.8.4.0-0/linux-x86/bin/../../linux-x86/bin/tvdsvrmain". Presumably it is started using a relative path from the tvdsvr application folder.

However, I'm still getting my "Can't attach to cluster" error. I've put the hostname of the debug box into the hosts file on the target. I've tried using the %L switch in the startup command and hardcoding the IP address in, but no success.

The Event Log lists the "Launching TotalView Debugger Server with command:" and the command, but nothing else.

I've tried increasing the timeout. It seems to take ~20 seconds for ps to pick up the tvdsvr application, but with a longer timeout (with a good 60 seconds whilst the tvdsvrmain app is running on the target) it sits there showing a "Waiting for connection from debugger server on 10.0.1.13" dialog for the timeout period, and then gives me the "can't attach to cluster" error box.
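Since the server starts but the connection never arrives, one thing worth checking from the target is whether the -callback address in the start string is reachable at all. The following is a rough sketch, not a TotalView feature: can_connect is a made-up helper, it relies on bash's /dev/tcp redirection and the coreutils timeout command being available, and the 5-second limit is arbitrary.

```shell
# can_connect HOST PORT: succeed if a TCP connection to HOST:PORT opens
# within 5 seconds. Uses bash's built-in /dev/tcp pseudo-device, so bash
# is invoked explicitly even if the calling shell is something else.
can_connect() {
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage on the target, while TotalView is showing its "Waiting for
# connection" dialog (address and port copied from the event log):
#   can_connect 10.0.0.2 4142 && echo reachable || echo unreachable
```

The port changes between launches, so it has to be read from the event log each time.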

If I try starting the server manually and then using the hostname:port format to stop TVD from trying to start the server, it immediately fails with a -1014 error.

Additionally, when I try to add a host and it fails, next time I try it doesn't seem to reattempt but fails immediately. I'm having to shut TotalView down and bring it up from workbench to make it retry properly. Is there some way to make it do a proper reattempt?

I think I need to see some debug logs to work out why the connection is failing. I have looked for some kind of output log from totalview to work out what it's doing when it fails but I can't find it. I've looked in the toolworks tree and under /var/log, but no dice. Where does it live?

The uname -r outputs are:

Target: 2.6.20-gentoo-r8
Debug box: 2.6.20-1.2320.fc5smp

NeilOwens
02-18-2008, 06:36 AM
OK, I think I've worked out my problem. My debug box has two NICs, and therefore 2 IP addresses. One is connected to the internet (eth0) and the work domain, and the other is a dedicated interface to the target hardware (eth1).

If I understand correctly, TVD creates a listening socket, and then the TVD server connects back to it? I reckon that TVD is creating its listening socket on eth0, and the TVD server is doing a connect back to the wrong IP address. Is there any way to tell TVD which interface to create that listening socket on? Or am I going to have to frig the routes?
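One way to test this theory, assuming the iproute2 ip command is installed on the debug box, is to ask the kernel which source address it would use to reach the target and compare that with the address in the -callback argument (10.0.0.2 in the start string quoted earlier). The route_src helper below is made up for this sketch; it only parses the output of ip route get.

```shell
# route_src: read `ip route get` output on stdin and print the address
# that follows the "src" keyword (the local address used for that route).
route_src() {
    awk '{ for (i = 1; i < NF; i++) if ($i == "src") { print $(i + 1); exit } }'
}

# Usage on the debug box (10.0.1.13 is the target from the posts above):
#   ip route get 10.0.1.13 | route_src
```

If the printed address differs from the callback address, the server is being told to connect back to an interface other than the one that faces the target.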

PeterT-RogueWave
02-18-2008, 07:27 AM
Neil,

It looks like you're starting this from the workbench. Console output should go to stdout or stderr, but I think that might get diverted by the workbench manager. I'm setting that up on my laptop to see what happens and whether there is a way to avoid it getting lost. However, in the TotalView root window (the small one that lists the attached processes, if any) you can go to Tools->Event Log and see all the output. Or just run TotalView itself without the WorkBench and all the output will go to the terminal window where you started it.
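When running from a terminal, the console output can also be saved to a file so it is easy to send on. This is a generic shell sketch rather than anything TotalView-specific: run_logged and the log file name are made up, and it assumes the debugger is started with the totalview command on your PATH.

```shell
# run_logged LOGFILE CMD [ARGS...]: run CMD, showing its stdout and stderr
# on the terminal while also saving a copy to LOGFILE.
run_logged() {
    log=$1
    shift
    "$@" 2>&1 | tee "$log"
}

# Usage (log file name is arbitrary):
#   run_logged tv_console.log totalview
```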

I'd strongly suggest sending this to support, where we can work on it in more depth. It seems like it might be getting a bit involved, and I hate this font ;-) Send it to support@totalviewtech.com and mark it for the attention of Pete Thompson.