steve_s
2010-12-15 13:58:14 UTC
Hello
We've been using SGE for a while now and are quite happy with it.
However, lately we have observed the following. We have a bunch of 8-core
nodes connected by InfiniBand, running MPI jobs across nodes. We found
that processes often get placed on full nodes which already have 8 MPI
processes running. This leaves us with many oversubscribed nodes (load 16
instead of 8), even though there are many empty nodes left in the queue.
It is almost as if SGE ignores the slots already taken on a node.
We see this with both Open MPI and Intel MPI and with different
applications. No application does threading or anything else that would
create more processes than the requested slots.
Has anybody made similar observations? We would be thankful for any hints
on how to debug this.
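In case it helps: these are the sorts of commands one would use to
inspect per-host slot usage and the scheduling configuration (standard
SGE client tools; the queue name and PE name below are placeholders for
whatever your site actually uses):

```
# Per-queue-instance slot usage and load on every host
qstat -f

# Which jobs run on which host, with slot counts
qhost -j

# Parallel environment configuration; the allocation_rule field
# ($fill_up, $round_robin, or a fixed number) controls how slots
# are distributed across hosts
qconf -sp <pe_name>

# Queue configuration, including slots and load_thresholds
qconf -sq all.q
```

If the load_thresholds or the PE's allocation_rule look sane but nodes
still get oversubscribed, comparing `qstat -f` against the actual load
per host is usually the quickest way to see where SGE's view of used
slots diverges from reality.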
best,
Steve
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=305816
To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].