Discussion:
Sporadic errors in array tasks with a PE
kisielk
2010-04-08 16:26:23 UTC
I noticed yesterday that some of my users' array tasks that use a PE are failing at random. Each failure triggers an email from SGE that looks like the following:

Job 60497 caused action: Queue "***@node077.cluster" set to ERROR
User = rosalia
Queue = ***@node077.cluster
Start Time = <unknown>
End Time = <unknown>
failed before job:04/07/2010 21:44:03 [1049:32249]: unable to find job file "/opt/sge/cluster/spool/node077/job_script
Shepherd trace:
04/07/2010 21:44:03 [0:32247]: shepherd called with uid = 0, euid = 0
04/07/2010 21:44:03 [0:32247]: starting up 6.2u5
04/07/2010 21:44:03 [0:32247]: setpgid(32247, 32247) returned 0
04/07/2010 21:44:03 [0:32247]: do_core_binding: "binding" parameter not found in config file
04/07/2010 21:44:03 [0:32247]: no prolog script to start
04/07/2010 21:44:03 [0:32247]: /bin/true
04/07/2010 21:44:03 [0:32247]: /bin/true
04/07/2010 21:44:03 [0:32247]: parent: forked "pe_start" with pid 32248
04/07/2010 21:44:03 [0:32247]: using signal delivery delay of 120 seconds
04/07/2010 21:44:03 [0:32247]: parent: pe_start-pid: 32248
04/07/2010 21:44:03 [0:32248]: child: starting son(pe_start, /bin/true, 0);
04/07/2010 21:44:03 [0:32248]: pid=32248 pgrp=32248 sid=32248 old pgrp=32247 getlogin()=<no login set>
04/07/2010 21:44:03 [0:32248]: reading passwd information for user 'rosalia'
04/07/2010 21:44:03 [0:32248]: setting limits
04/07/2010 21:44:03 [0:32248]: setting environment
04/07/2010 21:44:03 [0:32248]: Initializing error file
04/07/2010 21:44:03 [0:32248]: switching to intermediate/target user
04/07/2010 21:44:03 [1049:32248]: closing all filedescriptors
04/07/2010 21:44:03 [1049:32248]: further messages are in "error" and "trace"
04/07/2010 21:44:03 [1049:32248]: using "/bin/bash" as shell of user "rosalia"
04/07/2010 21:44:03 [1049:32248]: now running with uid=1049, euid=1049
04/07/2010 21:44:03 [1049:32248]: execvp(/bin/true, "/bin/true")
04/07/2010 21:44:03 [0:32247]: wait3 returned 32248 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
04/07/2010 21:44:03 [0:32247]: pe_start exited with exit status 0
04/07/2010 21:44:03 [0:32247]: reaped "pe_start" with pid 32248
04/07/2010 21:44:03 [0:32247]: pe_start exited not due to signal
04/07/2010 21:44:03 [0:32247]: pe_start exited with status 0
04/07/2010 21:44:03 [0:32247]: parent: forked "job" with pid 32249
04/07/2010 21:44:03 [0:32247]: parent: job-pid: 32249
04/07/2010 21:44:03 [0:32249]: child: starting son(job, /opt/sge/cluster/spool/node077/job_scripts/60497, 0);
04/07/2010 21:44:03 [0:32249]: pid=32249 pgrp=32249 sid=32249 old pgrp=32247 getlogin()=<no login set>
04/07/2010 21:44:03 [0:32249]: reading passwd information for user 'rosalia'
04/07/2010 21:44:03 [0:32249]: setosjobid: uid = 0, euid = 0
04/07/2010 21:44:03 [0:32249]: setting limits
04/07/2010 21:44:03 [0:32249]: RLIMIT_CPU setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
04/07/2010 21:44:03 [0:32249]: RLIMIT_FSIZE setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
04/07/2010 21:44:03 [0:32249]: RLIMIT_DATA setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
04/07/2010 21:44:03 [0:32249]: RLIMIT_STACK setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
04/07/2010 21:44:03 [0:32249]: RLIMIT_CORE setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
04/07/2010 21:44:03 [0:32249]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 4294967296 hard 4294967296) resulting: (soft 4294967296 hard 4294967296)
04/07/2010 21:44:03 [0:32249]: RLIMIT_RSS setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
04/07/2010 21:44:03 [0:32249]: setting environment
04/07/2010 21:44:03 [0:32249]: Initializing error file
04/07/2010 21:44:03 [0:32249]: switching to intermediate/target user
04/07/2010 21:44:03 [1049:32249]: closing all filedescriptors
04/07/2010 21:44:03 [1049:32249]: further messages are in "error" and "trace"
04/07/2010 21:44:03 [1049:32249]: now running with uid=1049, euid=1049
04/07/2010 21:44:03 [1049:32249]: unable to find job file "/opt/sge/cluster/spool/node077/job_scripts/60497"
04/07/2010 21:44:03 [0:32247]: wait3 returned 32249 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
04/07/2010 21:44:03 [0:32247]: job exited with exit status 11
04/07/2010 21:44:03 [0:32247]: reaped "job" with pid 32249
04/07/2010 21:44:03 [0:32247]: job exited not due to signal
04/07/2010 21:44:03 [0:32247]: job exited with status 11
04/07/2010 21:44:03 [0:32247]: now sending signal KILL to pid -32249
04/07/2010 21:44:03 [0:32247]: no tasker to notify
04/07/2010 21:44:03 [0:32247]: failed starting job
04/07/2010 21:44:03 [0:32247]: /bin/true
04/07/2010 21:44:03 [0:32247]: /bin/true
04/07/2010 21:44:03 [0:32247]: parent: forked "pe_stop" with pid 32250
04/07/2010 21:44:03 [0:32247]: using signal delivery delay of 120 seconds
04/07/2010 21:44:03 [0:32247]: parent: pe_stop-pid: 32250
04/07/2010 21:44:03 [0:32250]: child: starting son(pe_stop, /bin/true, 0);
04/07/2010 21:44:03 [0:32250]: pid=32250 pgrp=32250 sid=32250 old pgrp=32247 getlogin()=<no login set>
04/07/2010 21:44:03 [0:32250]: reading passwd information for user 'rosalia'
04/07/2010 21:44:03 [0:32250]: setting limits
04/07/2010 21:44:03 [0:32250]: setting environment
04/07/2010 21:44:03 [0:32250]: Initializing error file
04/07/2010 21:44:03 [0:32250]: switching to intermediate/target user
04/07/2010 21:44:03 [1049:32250]: closing all filedescriptors
04/07/2010 21:44:03 [1049:32250]: further messages are in "error" and "trace"
04/07/2010 21:44:03 [1049:32250]: using "/bin/bash" as shell of user "rosalia"
04/07/2010 21:44:03 [1049:32250]: now running with uid=1049, euid=1049
04/07/2010 21:44:03 [1049:32250]: execvp(/bin/true, "/bin/true")
04/07/2010 21:44:03 [0:32247]: wait3 returned 32250 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
04/07/2010 21:44:03 [0:32247]: pe_stop exited with exit status 0
04/07/2010 21:44:03 [0:32247]: reaped "pe_stop" with pid 32250
04/07/2010 21:44:03 [0:32247]: pe_stop exited not due to signal
04/07/2010 21:44:03 [0:32247]: pe_stop exited with status 0
04/07/2010 21:44:03 [0:32247]: no tasker to notify
04/07/2010 21:44:03 [0:32247]: no epilog script to start

Shepherd error:
04/07/2010 21:44:03 [1049:32249]: unable to find job file "/opt/sge/cluster/spool/node077/job_scripts/60497"

Shepherd pe_hostfile:
node077.cluster 2 ***@node077.cluster UNDEFINED
node069.cluster 1 ***@node069.cluster UNDEFINED
node074.cluster 1 ***@node074.cluster UNDEFINED


The thing is, this doesn't happen for all of the array tasks, and tasks from the same array have previously run fine on the same node, so I'm not sure what is going on here. Maybe someone who's more familiar with SGE can help me figure out what is wrong?

The PE is for OpenMPI and is set up according to the instructions in their FAQ.

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=252719

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
kisielk
2010-04-15 16:53:12 UTC
I'm still having this problem. For the time being it seems to be limited to this one user's jobs.

For some more info, the cluster is "stateless" and does not NFS mount the SGE directory. Instead, each node has a tmpfs to which all the SGE binaries, the spool directory, etc. are copied. This is set up mostly according to the info in the howto. These same nodes have no problem running other jobs, apparently including other array jobs.

Not sure how to go about debugging this...
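
One possible starting point is to pull the failing step out of the shepherd trace automatically, so that many failure emails can be compared quickly. The sketch below is only an illustration: the function name `failed_children` and the two regular expressions are made up for this example and are based solely on the trace lines quoted earlier in this thread. It also assumes the `status` value follows the usual Linux wait encoding, where the exit code sits in the high byte, and it only handles children that exited normally.

```python
import re

# Patterns modelled on the shepherd trace lines quoted in this thread.
FORKED = re.compile(r'parent: forked "(\w+)" with pid (\d+)')
WAIT3 = re.compile(r"wait3 returned (\d+) \(status: (\d+);")


def failed_children(trace_text):
    """Return (name, pid, exit_status) for each child that exited non-zero."""
    names = {}      # pid -> step name (pe_start, job, pe_stop, ...)
    failures = []
    for line in trace_text.splitlines():
        forked = FORKED.search(line)
        if forked:
            names[int(forked.group(2))] = forked.group(1)
            continue
        waited = WAIT3.search(line)
        if waited:
            pid = int(waited.group(1))
            status = int(waited.group(2))
            # Assumption: Linux-style wait status, exit code in bits 8-15.
            exit_status = (status >> 8) & 0xFF
            if exit_status != 0:
                failures.append((names.get(pid, "unknown"), pid, exit_status))
    return failures


sample = """\
04/07/2010 21:44:03 [0:32247]: parent: forked "pe_start" with pid 32248
04/07/2010 21:44:03 [0:32247]: wait3 returned 32248 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
04/07/2010 21:44:03 [0:32247]: parent: forked "job" with pid 32249
04/07/2010 21:44:03 [0:32247]: wait3 returned 32249 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
"""
print(failed_children(sample))  # [('job', 32249, 11)]
```

For the trace above, this singles out the "job" step (status 2816, i.e. exit status 11) while pe_start and pe_stop exit cleanly, which matches the "unable to find job file" message in the shepherd error section.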

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=253548

elauzier
2010-04-19 19:16:01 UTC
Use the "-b y" qsub switch. This is stated in the docs, but I missed it too the first time I ran into this issue.

The problem is a race condition that occurs when array jobs use a PE that spreads tasks over many hosts. You will see it mostly with large jobs spread across many hosts, especially when SGE is very busy.

To get around this, eliminate the need for the job file by submitting with "-b y". There is one catch:

With "-b y" you cannot use the embedded SGE options in your script file, so they have to be passed on the qsub command line instead.

It takes a bit of adjustment, but it is very doable and, once done, works very well.
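
As a rough illustration of that adjustment, the sketch below collects the embedded option lines from a job script and moves them onto a `qsub -b y` command line. Everything named here is hypothetical: the helper `binary_mode_command`, the PE name `orte`, and the path `/shared/job.sh` are placeholders, and the code assumes the default `#$` directive prefix. Note also that with "-b y" the script is not copied into the execution host's spool directory, so the path given to qsub presumably has to exist on every execution host.

```python
import shlex


def binary_mode_command(script_path, script_text, script_args=()):
    """Build a `qsub -b y` argument list from a script with embedded options.

    Lines starting with the default "#$" prefix are turned into
    command-line options; the script itself is passed as the command.
    """
    options = []
    for line in script_text.splitlines():
        if line.startswith("#$"):
            # Drop the "#$" prefix and split the rest like a shell would.
            options.extend(shlex.split(line[2:]))
    return ["qsub", "-b", "y", *options, script_path, *script_args]


# Hypothetical job script; the PE name and array range are placeholders.
example_script = """\
#!/bin/bash
#$ -pe orte 4
#$ -t 1-100
#$ -cwd
mpirun ./a.out
"""

argv = binary_mode_command("/shared/job.sh", example_script)
print(" ".join(shlex.quote(a) for a in argv))
```

The resulting list can be handed to `subprocess.run(argv)` on a submit host, or simply printed and checked by hand before the embedded lines are removed from the script.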

Ed Lauzier

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=254049
