Discussion:
subnode with empty slots but jobs in queue
jsadino
2010-12-03 23:12:28 UTC
Permalink
---------- Forwarded message ----------
From: Jeff Sadino <***@gmail.com>
Date: Fri, Dec 3, 2010 at 1:10 PM
Subject: subnode with empty slots but jobs in queue
To: ***@gridengine.sunsource.net


Hello,

I have a subnode that is currently using 7 out of its 8 slots. I have jobs
waiting in the queue, but they will not start processing. Everything was
working fine a couple weeks ago, and then it just stopped. I restarted the
subnode a couple of times to try to fix it, but that did not work. I also
modified np_load_avg so that it equals 100, but that did not work
either. There is probably a really easy answer for this. Can anyone point
me in the right direction?

Thank you!
Jeff Sadino

reuti
2010-12-06 09:38:41 UTC
Permalink
Hi,
Post by jsadino
---------- Forwarded message ----------
Date: Fri, Dec 3, 2010 at 1:10 PM
Subject: subnode with empty slots but jobs in queue
Hello,
I have a subnode that is currently using 7 out of its 8 slots. I have jobs waiting in the queue, but they will not start processing. Everything was working fine a couple weeks ago, and then it just stopped. I restarted the subnode a couple of times to try to fix it, but that did not work. I also modified np_load_avg so that it equals 100, but that did not work either.
The load_thresholds can also be set to NONE when cores = slots.
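For example, a minimal sketch (the queue name all.q is only a guess, use your own):

---
# show the current setting
qconf -sq all.q | grep load_thresholds

# edit the queue and change the line to:
#   load_thresholds       NONE
qconf -mq all.q
---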

Did you define/request any memory or other resource? Any resource quota set in place?

The waiting jobs are serial ones?
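To check, something along these lines usually tells you why the scheduler skips the host (the job ID is just a placeholder):

---
# scheduling info for one of the waiting jobs
# (the "scheduling info" section is only filled in when schedd_job_info
#  is enabled in the scheduler configuration)
qstat -j 12345

# any resource quota sets and their current usage
qconf -srqsl
qquota
---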

-- Reuti
jlforrest
2010-12-06 17:47:00 UTC
Permalink
Post by reuti
Post by jsadino
I have a subnode that is currently using 7 out of its 8 slots. I
have jobs waiting in the queue, but they will not start processing.
Everything was working fine a couple weeks ago, and then it just
stopped.
The load_thresholds can also be set to NONE when cores = slots.
Did you define/request any memory or other resource? Any resource quota set in place?
The waiting jobs are serial ones?
I have a similar problem with SGE 6.2u4. I have a node
with 48 cores which will only run 30 jobs. Here is the
relevant output from qconf:

---
hostlist @allhosts
seq_no 0
load_thresholds NONE
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make mpi mpich orte
rerun FALSE
slots 1,[compute-0-0.local=4],[compute-0-1.local=4], \
[compute-0-2.local=4],[compute-0-3.local=4], \
[compute-0-5.local=4],[compute-0-4.local=4], \
[compute-0-6.local=4],[compute-0-7.local=48], \
[compute-0-8.local=48]
---

Right now compute-0-8 is down, although qstat still shows
some jobs for it. (Why would this happen?)

The qstat output for compute-0-7 shows

***@compute-0-7.local BIP 0/48/48 29.05 lx26-amd64

and then it shows 48 serial jobs underneath! Yet, ssh-ing to
compute-0-7 and running ps clearly only shows 29 jobs running.

All the jobs in this cluster are serial jobs. Any idea why
I can't run 18 more jobs on compute-0-7? I restarted the
qmaster but it didn't make any difference.

Cordially,
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
***@berkeley.edu

reuti
2010-12-06 18:04:49 UTC
Permalink
Post by jlforrest
Post by reuti
Post by jsadino
I have a subnode that is currently using 7 out of its 8 slots. I
have jobs waiting in the queue, but they will not start processing.
Everything was working fine a couple weeks ago, and then it just
stopped.
The load_thresholds can also be set to NONE when cores = slots.
Did you define/request any memory or other resource? Any resource quota set in place?
The waiting jobs are serial ones?
I have a similar problem with SGE 6.2u4. I have a node
with 48 cores which will only run 30 jobs. Here is the
---
seq_no 0
load_thresholds NONE
suspend_thresholds NONE
nsuspend 1
suspend_interval 00:05:00
priority 0
min_cpu_interval 00:05:00
processors UNDEFINED
qtype BATCH INTERACTIVE
ckpt_list NONE
pe_list make mpi mpich orte
rerun FALSE
slots 1,[compute-0-0.local=4],[compute-0-1.local=4], \
[compute-0-2.local=4],[compute-0-3.local=4], \
[compute-0-5.local=4],[compute-0-4.local=4], \
[compute-0-6.local=4],[compute-0-7.local=48], \
[compute-0-8.local=48]
---
Right now compute-0-8 is down, although qstat still shows
some jobs for it. (Why would this happen?)
SGE assumes there is some network problem. You will have to use `qdel -f ...` to get rid of these jobs.
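i.e. something like this, with the job ID as a placeholder:

---
# force removal without waiting for the unreachable execd to confirm
qdel -f 12345
---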
Post by jlforrest
The qstat output for compute-0-7 shows
So, all 48 out of 48 seem to be used up.
Post by jlforrest
and then it shows 48 serial jobs underneath! Yet, ssh-ing to
compute-0-7 and running ps clearly only shows 29 jobs running
What is `qstat -g t -l h=compute-0-7.local -s r` showing?

-- Reuti
jlforrest
2010-12-06 18:14:38 UTC
Permalink
Post by reuti
Post by jlforrest
Right now compute-0-8 is down, although qstat still shows
some jobs for it. (Why would this happen?)
SGE assumes some network problems. You will have to use `qdel -f ...` to get rid of these jobs.
I've now done that.
Post by reuti
Post by jlforrest
The qstat output for compute-0-7 shows
So, all 48 out of 48 seem to be used up.
Post by jlforrest
and then it shows 48 serial jobs underneath! Yet, ssh-ing to
compute-0-7 and running ps clearly only shows 29 jobs running
What is `qstat -g t -l h=compute-0-7.local -s r` showing?
It shows nothing. But, it also shows nothing for the
nodes that are working correctly, e.g. consider compute-0-0
whose status is shown as

compute-0-0 lx26-amd64 4 4.97 7.8G 831.5M 11.7G 75.7M

Running 'qstat -g t -l h=compute-0-0 -s' results in
no output. Is this correct?

Cordially,
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
***@berkeley.edu

reuti
2010-12-06 18:17:40 UTC
Permalink
Post by jlforrest
Post by reuti
Post by jlforrest
Right now compute-0-8 is down, although qstat still shows
some jobs for it. (Why would this happen?)
SGE assumes some network problems. You will have to use `qdel -f ...` to get rid of these jobs.
I've now done that.
Post by reuti
Post by jlforrest
The qstat output for compute-0-7 shows
So, all 48 out of 48 seem to be used up.
Post by jlforrest
and then it shows 48 serial jobs underneath! Yet, ssh-ing to
compute-0-7 and running ps clearly only shows 29 jobs running
What is `qstat -g t -l h=compute-0-7.local -s r` showing?
It shows nothing. But, it also shows nothing for the
nodes that are working correctly, e.g. consider compute-0-0
whose status is shown as
compute-0-0 lx26-amd64 4 4.97 7.8G 831.5M 11.7G 75.7M
Running 'qstat -g t -l h=compute-0-0 -s' results in
no output. Is this correct?
No, I forgot to mention that you also need -u "*" to get the list of all users' jobs.
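So the full command would be something like:

---
# -g t: one line per task (shows MASTER/SLAVE), -l h=...: only this host,
# -s r: running jobs, -u "*": jobs of all users
qstat -g t -l h=compute-0-7.local -s r -u "*"
---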

-- Reuti
jlforrest
2010-12-06 18:55:29 UTC
Permalink
Post by reuti
Post by jlforrest
Running 'qstat -g t -l h=compute-0-0 -s' results in
no output. Is this correct?
No, I forgot to mention that you also need -u "*" to get the list of all users' jobs.
No problem. At least it wasn't me screwing up. The output
is below.

I think I might have some idea of what might be causing
this. compute-0-7 crashed last week, I think on 12/02/2010.
I brought it up soon afterwards. So, the jobs that show
a submit time of before 12/02/2010 are not really there.
I counted and there are 19 of them. This, plus the 29 that
are running, equals 48, which is the number of cores.

So the real question is why did these jobs remain
visible to SGE after compute-0-7 was rebooted.

job-ID prior   name       user     state submit/start at     queue                 master ja-task-ID
------------------------------------------------------------------------------------------------------------------
  6954 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  5874 0.55500 Job descri wendy    r 11/30/2010 14:38:49 ***@compute-0-7.local MASTER
  6959 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  5228 0.55500 Job descri maximoff r 11/23/2010 15:22:34 ***@compute-0-7.local MASTER
  6980 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6969 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6088 0.55500 Job descri maximoff r 12/01/2010 11:35:19 ***@compute-0-7.local MASTER
  6965 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6973 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6977 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  5873 0.55500 Job descri wendy    r 11/30/2010 14:37:34 ***@compute-0-7.local MASTER
  5225 0.55500 Job descri maximoff r 11/23/2010 15:14:34 ***@compute-0-7.local MASTER
  6093 0.55500 Job descri maximoff r 12/01/2010 11:37:04 ***@compute-0-7.local MASTER
  5224 0.55500 Job descri maximoff r 11/23/2010 15:13:04 ***@compute-0-7.local MASTER
  6962 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6970 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6091 0.55500 Job descri maximoff r 12/01/2010 11:36:19 ***@compute-0-7.local MASTER
  6979 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6967 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6971 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6957 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6956 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6961 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6098 0.55500 Job descri maximoff r 12/01/2010 11:41:49 ***@compute-0-7.local MASTER
  6096 0.55500 Job descri maximoff r 12/01/2010 11:40:19 ***@compute-0-7.local MASTER
  6084 0.55500 Job descri maximoff r 12/01/2010 11:11:34 ***@compute-0-7.local MASTER
  6090 0.55500 Job descri maximoff r 12/01/2010 11:36:04 ***@compute-0-7.local MASTER
  5226 0.55500 Job descri maximoff r 11/23/2010 15:17:04 ***@compute-0-7.local MASTER
  6978 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  3003 0.55500 QQQ        mforrest r 10/29/2010 11:33:56 ***@compute-0-7.local MASTER
  6960 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6958 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6085 0.55500 Job descri maximoff r 12/01/2010 11:11:49 ***@compute-0-7.local MASTER
  6087 0.55500 Job descri maximoff r 12/01/2010 11:34:49 ***@compute-0-7.local MASTER
  5230 0.55500 Job descri maximoff r 11/23/2010 15:28:04 ***@compute-0-7.local MASTER
  6089 0.55500 Job descri maximoff r 12/01/2010 11:35:34 ***@compute-0-7.local MASTER
  6099 0.55500 Job descri maximoff r 12/01/2010 11:42:34 ***@compute-0-7.local MASTER
  6981 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6955 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6974 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6982 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6963 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6964 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6966 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6972 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6976 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6975 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER
  6968 0.55500 T.1.0.N.11 an       r 12/06/2010 09:07:19 ***@compute-0-7.local MASTER

reuti
2010-12-06 19:02:45 UTC
Permalink
Post by jlforrest
Post by reuti
Post by jlforrest
Running 'qstat -g t -l h=compute-0-0 -s' results in
no output. Is this correct?
No, I forgot to mention that you also need -u "*" to get the list of all users' jobs.
No problem. At least it wasn't me screwing up. The output
is below.
I think I might have some idea of what might be causing
this. compute-0-7 crashed last week, I think on 12/02/2010.
I brought it up soon afterwards. So, the jobs that show
a submit time of before 12/02/2010 are not really there.
I counted and there are 19 of them. This, plus the 29 that
are running, equals 48, which is the number of cores.
So the real question is why did these jobs remain
visible to SGE after compute-0-7 was rebooted.
Was the node only rebooted, or also the local spool directory of SGE removed? When the local spool directory exists after the reboot, the execd would inform the qmaster about the failed jobs. When there is no information on the node about the last running jobs, the execd won't tell anything to the qmaster, and on its own it's waiting for the jobs to reappear.
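For example, with local spooling the execd keeps its record of running jobs under the host's spool directory; the paths below are only an assumption for a Rocks-style install, check your own configuration:

---
# where the execds spool, according to the cluster configuration
qconf -sconf | grep execd_spool_dir

# on the node itself: the jobs the execd still believes are active
ls /opt/gridengine/default/spool/compute-0-7/active_jobs
# if this is empty after a reinstall, the execd has nothing to report
# back to the qmaster when it registers again
---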

-- Reuti
jlforrest
2010-12-06 19:12:27 UTC
Permalink
Post by reuti
Was the node only rebooted, or also the local spool directory of SGE
removed? When the local spool directory exists after the reboot, the
execd would inform the qmaster about the failed jobs. When there is
no information on the node about the last running jobs, the execd
won't tell anything to the qmaster, and on its own it's waiting for
the jobs to reappear.
This is a Rocks cluster so after the node
crashed it was reinstalled from scratch. This
removed the local spool directory, which would
explain my problem. In fact, from what you say,
this would happen whenever a Rocks node
is reinstalled if there were running SGE
jobs when the node crashed, right?

I'm going to manually remove the bogus
jobs.
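Something like this should do it in one go, assuming the 19 stale jobs are exactly the pre-12/02 entries in the listing above:

---
for job in 3003 5224 5225 5226 5228 5230 5873 5874 6084 6085 6087 \
           6088 6089 6090 6091 6093 6096 6098 6099; do
    qdel -f "$job"   # force removal; the reinstalled execd knows nothing about them
done
---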

As always, thanks for your help. You deserve
a Nobel Prize.
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
***@berkeley.edu

jlforrest
2010-12-06 19:29:39 UTC
Permalink
Post by reuti
When the local spool directory exists after the reboot, the
execd would inform the qmaster about the failed jobs. When there is
no information on the node about the last running jobs, the execd
won't tell anything to the qmaster, and on its own it's waiting for
the jobs to reappear.
I was thinking about this. I wonder if this
is the right thing to do. If the actual
contents of the local spool directory is
empty, or different than what the qmaster
expects, then what point is there for the
qmaster to continue to think that the
jobs exist, or will ever come back?
In other words, shouldn't the contents
of the local spool directory determine
the qmaster's conception of reality?
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
***@berkeley.edu

reuti
2010-12-07 18:39:12 UTC
Permalink
Post by jlforrest
Post by reuti
When the local spool directory exists after the reboot, the
execd would inform the qmaster about the failed jobs. When there is
no information on the node about the last running jobs, the execd
won't tell anything to the qmaster, and on its own it's waiting for
the jobs to reappear.
I was thinking about this. I wonder if this
is the right thing to do. If the actual
contents of the local spool directory is
empty, or different than what the qmaster
expects, then what point is there for the
qmaster to continue to think that the
jobs exist, or will ever come back?
In other words, shouldn't the contents
of the local spool directory determine
the qmaster's conception of reality?
I remember that this discussion came up on the list before. I'm not sure of the final conclusion, and I also can't find any issue entered for it. The idea was something like a one-time check when the execd comes up again: compare what should be there from the point of view of the qmaster with what the execd actually finds in its local (spool) directory.

SGE was just not designed to handle reinstalled nodes in combination with local spool directories.

This is not only an issue for the jobs listed in `qstat`, but also for any tightly integrated tasks that were started on the node by `qrsh -inherit`. If there was one, the complete job is invalid. If there was none (as a parallel job can have serial and parallel steps), then you may be lucky and the job won't notice the reinstallation of the node at all.
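(For reference, that is roughly how a tight PE integration starts a slave task on a remote node; the host and command below are just placeholders:)

---
qrsh -inherit compute-0-7.local /path/to/slave_task
---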

Feel free to enter an issue about it.

This would be even more complicated with the persistent scratch directories on the nodes which I suggested here:

http://gridengine.sunsource.net/issues/show_bug.cgi?id=3292

If the local persistent scratch directory is gone, something happened to the node...

-- Reuti