Discussion:
SGE multiple job performance error.
fernandosilva
2010-11-24 15:12:33 UTC
When my users submit jobs normally (slowly), everything works fine. When they
run scripts that automate many job submissions at once, they eventually
start getting the error below. This is a new qmaster server on an ESXi
VM with a local ext4 filesystem. I would appreciate any help!

Unable to run job: error writing object "3215655" to spooling database
cannot close transaction: There is no open transaction
transaction function of rule "default rule" in context "berkeleydb spooling" failed
job 3215655 was rejected cause it couldn't be written. Exiting.
ERROR: could not close qsub pipe lsq6-img_26oct10.7: Continuing for now, but this pipe might have gone bad.
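
A minimal sketch of what such an automated submission loop might look like (the job script names and the 10-second retry delay are placeholders, not taken from the actual scripts); the idea is simply to retry a rejected submission rather than abort the whole batch:

#!/bin/sh
# Sketch of an automated submission loop; job_*.sh stands in for the real job scripts.
for script in job_*.sh; do
    # qsub exits non-zero when a job is rejected (e.g. with the spooling error above),
    # so pause and retry instead of giving up on the remaining jobs.
    until qsub "$script"; do
        echo "qsub failed for $script, retrying in 10 seconds" >&2
        sleep 10
    done
done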

rdickson
2010-11-25 17:15:48 UTC
We now have both Opteron and Xeon chips in one of our clusters, and the
owner wants to be able to specify jobs which can only be run on the
Xeons. So I defined a custom complex:

$ qconf -sc | grep chip
chip                chip       RESTRING    ==    YES         NO         NONE     0
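
For reference, a complex like this is normally added by editing the complex configuration with 'qconf -mc'; it can also be done non-interactively from a file. A rough sketch, where complexes.txt is just a scratch file name:

$ qconf -sc > complexes.txt
$ echo "chip  chip  RESTRING  ==  YES  NO  NONE  0" >> complexes.txt   # name shortcut type relop requestable consumable default urgency
$ qconf -Mc complexes.txt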

...and then set it in the global exec host to the majority case:

$ qconf -se global | grep complex
complex_values fluent-par=8,fluentall=20,chip=opteron

...and on the Xeon hosts, I set the individual exec host complex_values:

$ qconf -se cl317 | grep complex
complex_values slots=8,h_vmem=24G,chip=xeon
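
For completeness, the same per-host value can be set non-interactively with 'qconf -mattr' (the command that reappears in the follow-up further down):

$ qconf -mattr exechost complex_values chip=xeon cl317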

But then qsub doesn't seem to be able to find the "chip=xeon" hosts:

$ qsub -l h_rt=0:1:0,chip=xeon -b y sleep 15
Unable to run job: error: no suitable queues.
Exiting.

I can submit to one of the Xeon hosts explicitly, and I can request an
Opteron no problem:

$ qsub -l h_rt=0:1:0,h=cl317 -b y sleep 15
Your job 2158687 ("sleep") has been submitted
$ qsub -l h_rt=0:1:0,chip=opteron -b y sleep 15
Your job 2158689 ("sleep") has been submitted

...and the system does seem to recognize what the complex value is on
the host, because it won't let me request 'opteron' and run on a host
that's got 'chip=xeon' set:

$ qsub -l h_rt=0:1:0,chip=opteron,h=cl317 -b y sleep 15
Unable to run job: error: no suitable queues.
Exiting.

Is this a defect, or am I missing something? I searched
http://gridengine.sunsource.net/issues/query.cgi, but couldn't identify
a matching issue. We're running GE 6.1u6.

I'd appreciate any light you can shed on the matter. Thanks.
--
Ross Dickson Computational Research Consultant
+1 902 494 6710 Skype: ross.m.dickson
ACEnet - Compute Canada

rdickson
2010-11-30 18:50:59 UTC
Hi folks.

For anyone else who gets into such a situation: the 'global' setting
for such a custom complex is apparently not overridden by the individual
exechost setting, as I had assumed it would be. 'qsub -w v' helped here (and
I feel foolish for not thinking of it sooner). I solved the problem by removing
the complex_values entry from the global exechost (qconf -me global) and
setting it on each individual exechost:

for host in $hostlist; do    # $hostlist holds the names of the Opteron exec hosts
    qconf -mattr exechost complex_values chip=opteron $host
done
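
Two read-only checks are useful afterwards (a sketch): 'qhost -F' limits its resource report to the named attribute, and 'qsub -w v' only validates a request instead of submitting it:

$ qhost -F chip                                        # show the chip value each exec host now reports
$ qsub -w v -l h_rt=0:1:0,chip=xeon -b y sleep 15      # verify-only: reports whether a suitable queue exists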


Cheers,
Ross Dickson