Discussion:
nodes overloaded: processes placed on already full nodes
steve_s
2010-12-15 13:58:14 UTC
Permalink
Hello

We've been using SGE for a while now and are quite happy with it.

However, lately we observed the following. We have a bunch of 8-core
nodes connected by Infiniband and running MPI jobs across nodes. We found
that processes often get placed on full nodes which have 8 MPI processes
already running. This leaves us with many oversubscribed (load 16
instead of 8) nodes. This happens although there are many empty nodes
left in the queue. It is almost as if the slots already taken on one
node are ignored by SGE.

This is seen with OpenMPI and Intel MPI and with different applications.
No application does threading or anything else that would create more
processes than the requested slots.

Has anybody made similar observations? We are thankful for any hints on
how to debug this.

best,
Steve

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=305816

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
reuti
2010-12-15 14:16:49 UTC
Permalink
Hi,
Post by steve_s
We're using SGE for a while now and are quite happy with it.
However, lately we observed the following. We have a bunch of 8-core
nodes connected by Infiniband and running MPI jobs across nodes. We found
that processed often get placed on full nodes which have 8 MPI processes
already running. This leaves us with many oversubscribed (load 16
instead of 8) nodes. This happens although there are many empty nodes
left in the queue. It is almost as if the slots already taken on one
node are ignored by SGE.
how many slots are defined in the queue definition, and how many queues do you have defined?

-- Reuti
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=305818

steve_s
2010-12-15 14:55:34 UTC
Permalink
Post by reuti
Post by steve_s
However, lately we observed the following. We have a bunch of 8-core
nodes connected by Infiniband and running MPI jobs across nodes. We found
that processed often get placed on full nodes which have 8 MPI processes
already running. This leaves us with many oversubscribed (load 16
instead of 8) nodes. This happens although there are many empty nodes
left in the queue. It is almost as if the slots already taken on one
node are ignored by SGE.
how many slots are defined in the queue definition, and how many queues do you have defined?
$ qconf -sql
adde.q
all.q
test.q
vtc.q

Only the first and last queue are used and only the first is used for
parallel jobs. Nodes belong to only one queue at a time such that jobs
in different queues cannot run on the same node.


8 slots (see attachment for full output).

$ qconf -sq adde.q | grep slot
slots 8

Thank you.

best,
Steve

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=305824

templedf
2010-12-15 15:13:12 UTC
Permalink
This is a known issue. When scheduling parallel jobs with 6.2 to 6.2u5,
the scheduler ignores host load. This often results in jobs piling up
on a few nodes while other nodes are idle. The issue is fixed in 6.2u6
(currently only available in product form).

Daniel
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=305831

reuti
2010-12-15 15:28:11 UTC
Permalink
Post by templedf
This is a known issue. When scheduling parallel jobs with 6.2 to 6.2u5,
the scheduler ignores host load.
Yep.
Post by templedf
This often results in jobs piling up
on a few nodes while other nodes are idle.
As far as I understood the problem, the nodes are oversubscribed by getting more than 8 processes scheduled.
Post by templedf
The issue is fixed in 6.2u6
(currently only available in product form).
Daniel
Post by steve_s
Post by reuti
Post by steve_s
However, lately we observed the following. We have a bunch of 8-core
nodes connected by Infiniband and running MPI jobs across nodes. We found
that processed often get placed on full nodes which have 8 MPI processes
already running. This leaves us with many oversubscribed (load 16
instead of 8) nodes. This happens although there are many empty nodes
left in the queue. It is almost as if the slots already taken on one
node are ignored by SGE.
how many slots are defined in the queue definition, and how many queues do you have defined?
$ qconf -sql
adde.q
all.q
test.q
vtc.q
Only the first and last queue are used and only the first is used for
parallel jobs. Nodes belong to only one queue at a time such that jobs
in different queues cannot run on the same node.
Did you change the host assignment to certain queues while jobs were still running? Maybe you need to limit the total number of slots per machine to 8 in an RQS, or set it in each host's complex_values.

Another reason for apparent oversubscription: processes in state "D" count as running, and despite the high load all may actually be in order.

-- Reuti
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=305837

steve_s
2010-12-15 16:23:06 UTC
Permalink
Post by reuti
Post by templedf
This is a known issue. When scheduling parallel jobs with 6.2 to 6.2u5,
the scheduler ignores host load.
Yep.
Post by templedf
This often results in jobs piling up
on a few nodes while other nodes are idle.
OK, good to know. We're running 6.2u3 here.

I'm not sure if I get this right: Even if the load is ignored, doesn't
SGE keep track of already given-away slots on each node? I always
thought that this is the way jobs are scheduled in the first place
(besides policies and all that, but that should have nothing to do with
load or slots in this context).

Given that SGE knows e.g. np_load_avg on each node, I thought we could
circumvent the problem by setting np_load_avg to requestable=YES and
then using something like

$ qsub -hard -l 'np_load_avg < 0.3' ...

but this gives me

"Unable to run job: denied: missing value for request "np_load_avg".
Exiting."

whereas using "=" or ">" works. I guess the reason is what is stated in
complex(5):

">=, >, <=, < operators can only be overridden, when the new value
is more restrictive than the old one."

So, I cannot use "<". If that is the case, what can we do about it? Do
we need to define a new complex attribute (say 'np_load_avg_less') along
with a load_sensor or can we hijack np_load_avg in another way?
Post by reuti
As far as I understood the problem, the nodes are oversubscribed by getting more than 8 processes scheduled.
Exactly.
Post by reuti
Did you change the host assignment to certain queues, while jobs were still running? Maybe you need to limit the number total slots per machine to 8 in an RQS or setting it for each host's complex_values.
No, we didn't change the host assignment.

Sorry, but what do you mean by RQS? I haven't seen that in the
documentation so far.
Post by reuti
Another reason for apparent oversubscription: processes in state "D" count as running, and despite the high load all may actually be in order.
Oversubscribed nodes do not always run 16 instead of 8 processes, some
only 14 or so. Nevertheless, the load is always almost exactly 16. As
far as I can see, processes on these oversubscribed nodes (with > 8
processes) run with ~50% CPU load each.

best,
Steve

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=305856

reuti
2010-12-17 13:16:49 UTC
Permalink
Post by steve_s
Post by reuti
Post by templedf
This is a known issue. When scheduling parallel jobs with 6.2 to 6.2u5,
the scheduler ignores host load.
Yep.
Post by templedf
This often results in jobs piling up
on a few nodes while other nodes are idle.
OK, good to know. We're running 6.2u3 here.
I'm not sure if I get this right: Even if the load is ignored, doesn't
SGE keep track of already given-away slots on each node? I always
thought that this is the way jobs are scheduled in the first place
(besides policies and all that, but that should have nothing to do with
load or slots in this context).
Given that SGE knows i.e. np_load_avg on each node, I thought we could
circumvent the problem by setting np_load_avg to requestable=YES and
then something like
$ qsub -hard -l 'np_load_avg < 0.3' ...
You can only specify a value; the relation is already defined in the complex definition.
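For illustration, the operator sits in the relation column of the complex definition, which you can inspect with qconf. The output below is only a sketch (exact columns vary between versions, and it assumes requestable was already switched to YES as you did):

$ qconf -sc | grep np_load_avg
np_load_avg    np_load    DOUBLE    >=    YES    NO    0    0

Whatever value you request with -l np_load_avg=... is then compared using the operator from this definition; the operator itself can't be changed on the qsub command line.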
Post by steve_s
but this gives me
"Unable to run job: denied: missing value for request "np_load_avg".
Exiting."
whereas using "=" or ">" works. I guess the reason is what is stated in
If ">" is working, it's a bug. I get: Unable to run job: unknown resource "fubar>12" (same for "<"; maybe it was fixed in 6.2u5).
Post by steve_s
">=, >, <=, < operators can only be overridden, when the new value
is more restrictive than the old one."
So, I cannot use "<". If that is the case, what can we do about it? Do
we need to define a new complex attribute (say 'np_load_avg_less') along
with a load_sensor or can we hijack np_load_avg in another way?
Post by reuti
As far as I understood the problem, the nodes are oversubscribed by getting more than 8 processes scheduled.
Exactly.
So, now we know what we have to deal with.
Post by steve_s
Post by reuti
Did you change the host assignment to certain queues, while jobs were still running? Maybe you need to limit the number total slots per machine to 8 in an RQS or setting it for each host's complex_values.
No, we didn't change the host assignment.
Sorry, but what do you mean by RQS? Did not see that in the
documentation so far.
man sge_resource_quota

When you have more than one queue on a machine, the slots of all of them might get used, thus oversubscribing the machine. Hence the total number of slots in use across all queues on each machine must be limited at any time. When you have only one queue per machine, though, this can't happen.
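A minimal rule set of this kind could look like the following (the rule set name is made up; adjust the slot count to your core count):

$ qconf -srqs slots_per_host
{
   name         slots_per_host
   description  limit the total slots per host across all queues
   enabled      TRUE
   limit        hosts {*} to slots=8
}

You would create it with qconf -arqs and can watch it in action with qquota once jobs are running.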
Post by steve_s
Post by reuti
Another reason for apparent oversubscription: processes in state "D" count as running, and despite the high load all may actually be in order.
Oversubscribed nodes do not always run 16 instead of 8 processes, some
only 14 or so. Nevertheless, the load is always almost exactly 16. As
far as I can see, processes on these oversubscribed nodes (with > 8
processes) run with ~50% CPU load each.
What does:

ps -e f

(f w/o -) show on such a node? Are all the processes bound to an sge_shepherd, or did some jump out of the process tree and weren't killed?

-- Reuti

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306441

steve_s
2010-12-21 14:58:50 UTC
Permalink
Post by reuti
Post by steve_s
$ qsub -hard -l 'np_load_avg < 0.3' ...
You can only specify a value, the relation is defined already in the complex definition.
[...]
Post by reuti
When > is working, it's a bug. I get: Unable to run job: unknown
resource "fubar>12". (same for <, maybe it was fixed in 6.2u5).
Yes, you are right. The only thing that works is "=":
$ qsub -hard -l 'np_load_avg=0.3' ...

That is no solution to the original problem, though (but apparently not
required, either -- see my last post).

[...]
Post by reuti
ps -e f
(f w/o -) show on such a node? Are all the processes bound to an
sge_shepherd, or did some jump out of the processes tree and weren't
killed?
There are no sge_shepherds on the nodes. I did not set up SGE on this
machine, but what I understand from the documentation is that
sge_shepherd is only used in the case of "tight integration" of PEs.
In our case, the PE starts the MPI processes.

best,
Steve

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307894

reuti
2010-12-21 17:22:13 UTC
Permalink
Post by steve_s
Post by reuti
Post by steve_s
$ qsub -hard -l 'np_load_avg < 0.3' ...
You can only specify a value, the relation is defined already in the complex definition.
[...]
Post by reuti
When > is working, it's a bug. I get: Unable to run job: unknown
resource "fubar>12". (same for <, maybe it was fixed in 6.2u5).
$ qsub -hard -l 'np_load_avg=0.3' ...
That is no solution to the original problem, though (but apparently not
required, either -- see my last post).
[...]
Post by reuti
ps -e f
(f w/o -) show on such a node? Are all the processes bound to an
sge_shepherd, or did some jump out of the processes tree and weren't
killed?
There are no sge_shepherd's on the nodes. I did not set up SGE on the
machine but what I understand from the documentation is that
sge_shepherd is only used in the case of "tight integration" of PEs.
In our case, the PE starts the MPI processes.
Well, even with a loose integration, you have to honor the list of granted machines for your job. What do you mean in detail by "the PE starts the MPI processes"? You will need at least an sge_execd on the nodes, so that SGE is aware of their existence and can make a suitable slot allocation for your job. (The sge_execd will then start the shepherd in case of a tight integration.)

-- Reuti
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307937

steve_s
2010-12-21 18:21:48 UTC
Permalink
Post by reuti
Post by steve_s
Post by reuti
ps -e f
(f w/o -) show on such a node? Are all the processes bound to an
sge_shepherd, or did some jump out of the processes tree and weren't
killed?
There are no sge_shepherd's on the nodes. I did not set up SGE on the
machine but what I understand from the documentation is that
sge_shepherd is only used in the case of "tight integration" of PEs.
In our case, the PE starts the MPI processes.
Well, even with a loose integration, you have to honor the list of
granted machines for your job. What do you mean in detail by "the PE
starts the MPI processes"? You will need at least an sge_execd on the
nodes, so that SGE is aware of their existence and can make a suitable
slot allocation for your job. (The sge_execd will then start the
shepherd in case of a tight integration.)
Yes, sge_execd is present on each node, as well as sge_shepherd-$JOB_ID
on the master node, where the job-script is executed:

4693 ? Sl 33:32 /cm/shared/apps/sge/current/bin/lx26-amd64/sge_execd
12165 ? S 0:00 \_ sge_shepherd-60013 -bg
12389 ? S 0:00 \_ python /cm/shared/apps/intel/impi/3.2.2.006/bin64/mpiexec ....


Apparently, we have tight integration then. I did look for sge_shepherd
on the wrong node (not the master node). This is the first time I've
taken a closer look at these daemons, hence a little confusion here (we
got the machine pre-configured and all, and getting familiar with the
system always takes a factor of pi longer than expected). Sorry for the noise.

Now that we know what to look for, we can search for jobs which do not
behave.

best,
Steve

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307950

reuti
2010-12-21 18:31:02 UTC
Permalink
Post by steve_s
Post by reuti
Post by steve_s
Post by reuti
ps -e f
(f w/o -) show on such a node? Are all the processes bound to an
sge_shepherd, or did some jump out of the processes tree and weren't
killed?
There are no sge_shepherd's on the nodes. I did not set up SGE on the
machine but what I understand from the documentation is that
sge_shepherd is only used in the case of "tight integration" of PEs.
In our case, the PE starts the MPI processes.
Well, even with a loose integration, you have to honor the lost of
granted machines for your job. What do you mean in detail by "the PE
starts the MPI processes"? You will need at least a sgeexecd on the
nodes, so that SGE is aware of its existence and can make a suitable
slot allocation for your job. (The sgeexecd will then start the
shepherd in case of a tight integration.)
Yes, sge_execd is present on each node, as well as sge_shepherd-$JOB_ID
4693 ? Sl 33:32 /cm/shared/apps/sge/current/bin/lx26-amd64/sge_execd
12165 ? S 0:00 \_ sge_shepherd-60013 -bg
12389 ? S 0:00 \_ python /cm/shared/apps/intel/impi/3.2.2.006/bin64/mpiexec ....
Apparently, we have tight integration then. I did look for sge_shepherd
on the wrong node (not the master node). This is the first time I take a
closer look at these daemons, that's why a little confusion here (we got
the machine pre-configured and all, getting familiar with the system
always takes a factor of pi longer than expected). Sorry for the noise.
The sge_shepherd will be started on each slave node in the case of a tight integration, too. When you have a loose integration and no sge_shepherd on the slaves, then there may be processes which survive the crash of a job and hence result in the effect you observed, simply because SGE doesn't know anything about processes started by a plain rsh/ssh outside of SGE's context.

There is a Howto for the tight integration into SGE of MPICH2 prior to 1.3 and of Intel MPI, which you are using:

http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html

http://gridengine.sunsource.net/howto/remove_orphaned_processes.html

Intel MPI will at some point in the future also use the Hydra startup manager.

-- Reuti
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307954

steve_s
2010-12-21 19:09:27 UTC
Permalink
Post by reuti
The sge_shepherd will be started on each slave node in case of a tight
integration too. When you have a loose integration and no sge_shepherd
on the slaves, then there maybe processes which survive the crash of a
job and hence results in the effect you observed. Simply because SGE
doesn't know anything about the processes started by a simple rsh/ssh
outside of SGE's context.
OK, makes sense. I checked again, and yes: sge_shepherd is only on the
master node; the sge_shepherds on the slaves are from different jobs.
Post by reuti
There is a Howto for the tight integration of MPICH2 prior 1.3 and
http://gridengine.sunsource.net/howto/mpich2-integration/mpich2-integration.html
http://gridengine.sunsource.net/howto/remove_orphaned_processes.html
Intel MPICH2 will at some point in the future also use the Hydra startup manager.
We will have a look at these. Thanks very much indeed.

best,
Steve

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307961

hjmangalam
2010-12-17 19:11:52 UTC
Permalink
I may be either missing info or context, but we had this problem with
6.2 with overlapping Qs, and it was resolved by explicitly specifying
the load threshold for the Qs, setting np_load_avg to just over 1.

$ qconf -sq long |grep load_thresholds
load_thresholds np_load_avg=1.1

We often get overlapping Q execution hosts registering their
displeasure by entering an overload state, but only by a few
percentage points (1 compute process per core plus a few % due to
system processes).

Almost all our Qs are overlapping due to competing requirements /
hardware, and this seems to address that part of it fine (though I'd
much prefer to keep them separate for simplicity's sake).

hjm
--
Harry Mangalam - Research Computing, NACS, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
MSTB=Bldg 415 (G-5 on <http://today.uci.edu/pdf/UCI_09_map_campus.pdf>
Lat/Long: 33.642025,-117.844414 (paste into google maps)
--
Like the autumn leaves / Our rights flutter to the ground /
So too, our trousers. <http://goo.gl/boJcT>

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306535

reuti
2010-12-17 21:17:57 UTC
Permalink
Post by hjmangalam
I may be either missing info or context, but we had this problem with
6.2 with overlapping Qs and it was resolved by explicitly specifying
the threshold for the Qs by setting np_load_avg to be just over 1.
$ qconf -sq long |grep load_thresholds
load_thresholds np_load_avg=1.1
We often get overlapping Q execution hosts registering their
displeasure by entering an overload state, but only by a few
percentage points (1 compute process per core plus a few % due to
system processes).
Yes, this avoids oversubscription, but it may leave slots unused, as processes in state "D" also count as running and can create an artificially high load. The usual approach to limiting slots across several queues is one of these:

http://gridengine.sunsource.net/ds/viewMessage.do?dsMessageId=253527&dsForumId=38

-- Reuti
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=306604

steve_s
2010-12-21 14:57:19 UTC
Permalink
Post by reuti
Post by hjmangalam
I may be either missing info or context, but we had this problem with
6.2 with overlapping Qs and it was resolved by explicitly specifying
the threshold for the Qs by setting np_load_avg to be just over 1.
$ qconf -sq long |grep load_thresholds
load_thresholds np_load_avg=1.1
We often get overlapping Q execution hosts registering their
displeasure by entering an overload state, but only by a few
percentage points (1 compute process per core plus a few % due to
system processes).
Yes, this avoids oversubscription, but may leave slots unused as also
processes in state "D" count as running and can create an artificial
higher load. The usual approach to limit slots across serveral queues
http://gridengine.sunsource.net/ds/viewMessage.do?dsMessageId=253527&dsForumId=38
We have adopted this solution (set up an RQS to limit slots per node)
and it seems to work so far.

Our queues do not overlap, but the overload was (at least partly) caused
by dead jobs of which SGE had apparently no knowledge.

Thank you all for your help.

best,
Steve

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=307892
