Sporadic errors in array tasks with a PE

kisielk

2010-04-08 16:26:23 UTC

I noticed yesterday that some of my users' array tasks that use a PE were failing at random. Each time, I get an email from SGE that looks like the following:

Job 60497 caused action: Queue "***@node077.cluster" set to ERROR
User = rosalia
Queue = ***@node077.cluster
Start Time = <unknown>
End Time = <unknown>
failed before job:04/07/2010 21:44:03 [1049:32249]: unable to find job file "/opt/sge/cluster/spool/node077/job_script
Shepherd trace:
04/07/2010 21:44:03 [0:32247]: shepherd called with uid = 0, euid = 0
04/07/2010 21:44:03 [0:32247]: starting up 6.2u5
04/07/2010 21:44:03 [0:32247]: setpgid(32247, 32247) returned 0
04/07/2010 21:44:03 [0:32247]: do_core_binding: "binding" parameter not found in config file
04/07/2010 21:44:03 [0:32247]: no prolog script to start
04/07/2010 21:44:03 [0:32247]: /bin/true
04/07/2010 21:44:03 [0:32247]: /bin/true
04/07/2010 21:44:03 [0:32247]: parent: forked "pe_start" with pid 32248
04/07/2010 21:44:03 [0:32247]: using signal delivery delay of 120 seconds
04/07/2010 21:44:03 [0:32247]: parent: pe_start-pid: 32248
04/07/2010 21:44:03 [0:32248]: child: starting son(pe_start, /bin/true, 0);
04/07/2010 21:44:03 [0:32248]: pid=32248 pgrp=32248 sid=32248 old pgrp=32247 getlogin()=<no login set>
04/07/2010 21:44:03 [0:32248]: reading passwd information for user 'rosalia'
04/07/2010 21:44:03 [0:32248]: setting limits
04/07/2010 21:44:03 [0:32248]: setting environment
04/07/2010 21:44:03 [0:32248]: Initializing error file
04/07/2010 21:44:03 [0:32248]: switching to intermediate/target user
04/07/2010 21:44:03 [1049:32248]: closing all filedescriptors
04/07/2010 21:44:03 [1049:32248]: further messages are in "error" and "trace"
04/07/2010 21:44:03 [1049:32248]: using "/bin/bash" as shell of user "rosalia"
04/07/2010 21:44:03 [1049:32248]: now running with uid=1049, euid=1049
04/07/2010 21:44:03 [1049:32248]: execvp(/bin/true, "/bin/true")
04/07/2010 21:44:03 [0:32247]: wait3 returned 32248 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
04/07/2010 21:44:03 [0:32247]: pe_start exited with exit status 0
04/07/2010 21:44:03 [0:32247]: reaped "pe_start" with pid 32248
04/07/2010 21:44:03 [0:32247]: pe_start exited not due to signal
04/07/2010 21:44:03 [0:32247]: pe_start exited with status 0
04/07/2010 21:44:03 [0:32247]: parent: forked "job" with pid 32249
04/07/2010 21:44:03 [0:32247]: parent: job-pid: 32249
04/07/2010 21:44:03 [0:32249]: child: starting son(job, /opt/sge/cluster/spool/node077/job_scripts/60497, 0);
04/07/2010 21:44:03 [0:32249]: pid=32249 pgrp=32249 sid=32249 old pgrp=32247 getlogin()=<no login set>
04/07/2010 21:44:03 [0:32249]: reading passwd information for user 'rosalia'
04/07/2010 21:44:03 [0:32249]: setosjobid: uid = 0, euid = 0
04/07/2010 21:44:03 [0:32249]: setting limits
04/07/2010 21:44:03 [0:32249]: RLIMIT_CPU setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
04/07/2010 21:44:03 [0:32249]: RLIMIT_FSIZE setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
04/07/2010 21:44:03 [0:32249]: RLIMIT_DATA setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
04/07/2010 21:44:03 [0:32249]: RLIMIT_STACK setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
04/07/2010 21:44:03 [0:32249]: RLIMIT_CORE setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
04/07/2010 21:44:03 [0:32249]: RLIMIT_VMEM/RLIMIT_AS setting: (soft 4294967296 hard 4294967296) resulting: (soft 4294967296 hard 4294967296)
04/07/2010 21:44:03 [0:32249]: RLIMIT_RSS setting: (soft 0INFINITY hard 0INFINITY) resulting: (soft 0INFINITY hard 0INFINITY)
04/07/2010 21:44:03 [0:32249]: setting environment
04/07/2010 21:44:03 [0:32249]: Initializing error file
04/07/2010 21:44:03 [0:32249]: switching to intermediate/target user
04/07/2010 21:44:03 [1049:32249]: closing all filedescriptors
04/07/2010 21:44:03 [1049:32249]: further messages are in "error" and "trace"
04/07/2010 21:44:03 [1049:32249]: now running with uid=1049, euid=1049
04/07/2010 21:44:03 [1049:32249]: unable to find job file "/opt/sge/cluster/spool/node077/job_scripts/60497"
04/07/2010 21:44:03 [0:32247]: wait3 returned 32249 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)
04/07/2010 21:44:03 [0:32247]: job exited with exit status 11
04/07/2010 21:44:03 [0:32247]: reaped "job" with pid 32249
04/07/2010 21:44:03 [0:32247]: job exited not due to signal
04/07/2010 21:44:03 [0:32247]: job exited with status 11
04/07/2010 21:44:03 [0:32247]: now sending signal KILL to pid -32249
04/07/2010 21:44:03 [0:32247]: no tasker to notify
04/07/2010 21:44:03 [0:32247]: failed starting job
04/07/2010 21:44:03 [0:32247]: /bin/true
04/07/2010 21:44:03 [0:32247]: /bin/true
04/07/2010 21:44:03 [0:32247]: parent: forked "pe_stop" with pid 32250
04/07/2010 21:44:03 [0:32247]: using signal delivery delay of 120 seconds
04/07/2010 21:44:03 [0:32247]: parent: pe_stop-pid: 32250
04/07/2010 21:44:03 [0:32250]: child: starting son(pe_stop, /bin/true, 0);
04/07/2010 21:44:03 [0:32250]: pid=32250 pgrp=32250 sid=32250 old pgrp=32247 getlogin()=<no login set>
04/07/2010 21:44:03 [0:32250]: reading passwd information for user 'rosalia'
04/07/2010 21:44:03 [0:32250]: setting limits
04/07/2010 21:44:03 [0:32250]: setting environment
04/07/2010 21:44:03 [0:32250]: Initializing error file
04/07/2010 21:44:03 [0:32250]: switching to intermediate/target user
04/07/2010 21:44:03 [1049:32250]: closing all filedescriptors
04/07/2010 21:44:03 [1049:32250]: further messages are in "error" and "trace"
04/07/2010 21:44:03 [1049:32250]: using "/bin/bash" as shell of user "rosalia"
04/07/2010 21:44:03 [1049:32250]: now running with uid=1049, euid=1049
04/07/2010 21:44:03 [1049:32250]: execvp(/bin/true, "/bin/true")
04/07/2010 21:44:03 [0:32247]: wait3 returned 32250 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)
04/07/2010 21:44:03 [0:32247]: pe_stop exited with exit status 0
04/07/2010 21:44:03 [0:32247]: reaped "pe_stop" with pid 32250
04/07/2010 21:44:03 [0:32247]: pe_stop exited not due to signal
04/07/2010 21:44:03 [0:32247]: pe_stop exited with status 0
04/07/2010 21:44:03 [0:32247]: no tasker to notify
04/07/2010 21:44:03 [0:32247]: no epilog script to start

Shepherd error:
04/07/2010 21:44:03 [1049:32249]: unable to find job file "/opt/sge/cluster/spool/node077/job_scripts/60497"

Shepherd pe_hostfile:
node077.cluster 2 ***@node077.cluster UNDEFINED
node069.cluster 1 ***@node069.cluster UNDEFINED
node074.cluster 1 ***@node074.cluster UNDEFINED

This doesn't happen for all of the array tasks; even tasks that previously ran on the same node completed without error, so I'm not sure what is going on here. Can someone who's more familiar with SGE help me figure out what is wrong?

The PE is for OpenMPI and is set up according to the instructions in their FAQ.
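
For anyone skimming the trace, here is a minimal sketch (not part of SGE) that picks out which child of the shepherd failed. It assumes the `wait3 returned ...` line format shown in the trace above and decodes the raw status the conventional Unix way: the low 7 bits hold the signal number, the next byte the exit code. The function names are made up for illustration, and stopped/continued statuses are deliberately ignored.

```python
import re

# Matches shepherd trace lines of the form seen in the mail above, e.g.
#   ... [0:32247]: wait3 returned 32249 (status: 2816; WIFSIGNALED: 0, ...)
# Group 1 is the reaped child's pid, group 2 the raw wait status.
WAIT3_RE = re.compile(r"wait3 returned (\d+) \(status: (\d+);")


def decode_wait_status(status):
    """Decode a raw wait status (assumes the conventional Unix encoding)."""
    signal = status & 0x7F
    if signal == 0:
        # Normal exit: the exit code lives in the second byte.
        return ("exited", (status >> 8) & 0xFF)
    return ("signaled", signal)


def failed_children(trace_lines):
    """Return (pid, kind, value) for each reaped child with a non-zero status."""
    failures = []
    for line in trace_lines:
        match = WAIT3_RE.search(line)
        if match is None:
            continue
        pid, status = int(match.group(1)), int(match.group(2))
        if status != 0:
            kind, value = decode_wait_status(status)
            failures.append((pid, kind, value))
    return failures


# The three wait3 lines copied from the trace above.
trace = [
    '04/07/2010 21:44:03 [0:32247]: wait3 returned 32248 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)',
    '04/07/2010 21:44:03 [0:32247]: wait3 returned 32249 (status: 2816; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 11)',
    '04/07/2010 21:44:03 [0:32247]: wait3 returned 32250 (status: 0; WIFSIGNALED: 0, WIFEXITED: 1, WEXITSTATUS: 0)',
]

print(failed_children(trace))  # [(32249, 'exited', 11)]
```

On this trace it flags only pid 32249, the "job" child, with exit status 11 (2816 >> 8). The pe_start and pe_stop children both exit with 0, so the failure is confined to the step that looks for the job script.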
