Discussion:
taking hosts offline
neubauer
2010-12-14 15:46:45 UTC
Hello!
If I have to maintain a host, I would like to take it temporarily
offline and afterwards put it back online in the cluster. I need to do
this because I want all running jobs to finish, but no new job should
start on this host.

In PBS I could do it with one command: pbsnodes -o hostname1
And back online: pbsnodes -c hostname1
Is there something similar in GE?

Thanks,
Sebastian

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=305487

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
craffi
2010-12-14 16:18:20 UTC
Hi Sebastian,

Read the man page for 'qmod'; you want 'qmod -d all.q@<nodename>' --
this will (d)isable the node which allows running jobs to finish but
will prevent new work from landing.

Wildcards work as well: qmod -d 'all.q@*'
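
For reference, the disable step and its reverse can be wrapped as below. This is a rough sketch, not a tested site script: the function names and the QMOD override variable are made up for illustration, and the '*@host' pattern assumes the queue wildcard also matches every queue that has an instance on the host. 'qmod -e' is the enabling counterpart of 'qmod -d'.

```shell
# Sketch only. QMOD is a hypothetical override so the functions can be
# pointed at a stub for a dry run instead of a live cluster.
QMOD="${QMOD:-qmod}"

# Disable every queue instance on a host: running jobs are left to
# finish, but no new jobs are scheduled there.
sge_offline() {
    "$QMOD" -d "*@$1"
}

# Re-enable the queue instances once maintenance is done.
sge_online() {
    "$QMOD" -e "*@$1"
}
```

Usage would be 'sge_offline hostname1' before maintenance and 'sge_online hostname1' afterwards, roughly matching 'pbsnodes -o' and 'pbsnodes -c'.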

-Chris
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=305492

neubauer
2010-12-14 16:24:54 UTC
Post by craffi
Hi Sebastian,
this will (d)isable the node which allows running jobs to finish but
will prevent new work from landing.
This was exactly what I was looking for!
Thank you very much for your fast and helpful answer!

Sebastian
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=305494

fx
2010-12-22 23:13:13 UTC
Post by craffi
Hi Sebastian,
this will (d)isable the node which allows running jobs to finish but
will prevent new work from landing.
See http://www.nw-grid.ac.uk/LivScripts for a simple script
(disable-nodes) which does that, and the reverse. However, you might
find it better to restrict the nodes to a specific ACL to allow testing
them (e.g. sge-restrict-nodes from the same page); we'll often have such
nodes running HPL to see if they stand up. Note those scripts use node
numbers, and work cross-cluster, courtesy of genders. (I think there
are similar things by other people lying around, but those are the ones
I can point to.)
--
Dave Love
Advanced Research Computing, Computing Services, University of Liverpool
AKA ***@gnu.org

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=310559

templedf
2010-12-29 00:22:43 UTC
Another approach is to create a host group, e.g. @disabled, and an RQS
that limits "hosts @disabled to slots=0". To disable a host, just add
it to the host group. The benefit of this approach is that it works for
all queues on the host without needing to enumerate them.
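
A minimal sketch of that setup, assuming the host group is called @disabled as above. The rule-set name disabled_hosts, the function names, and the QCONF override variable are invented for illustration; the file contents follow the hostgroup(5) and sge_resource_quota(5) formats as far as I know them, so check them against your installation before relying on this.

```shell
# Sketch only. QCONF is a hypothetical override so the functions can be
# pointed at a stub for a dry run instead of a live cluster.
QCONF="${QCONF:-qconf}"

# One-time setup: create an empty @disabled host group and a resource
# quota set that gives its members zero slots.
sge_setup_disabled_group() {
    tmp=$(mktemp -d) || return 1
    printf 'group_name @disabled\nhostlist NONE\n' > "$tmp/hgrp"
    cat > "$tmp/rqs" <<'EOF'
{
   name         disabled_hosts
   description  "No new jobs on hosts in @disabled"
   enabled      TRUE
   limit        hosts @disabled to slots=0
}
EOF
    "$QCONF" -Ahgrp "$tmp/hgrp" && "$QCONF" -Arqs "$tmp/rqs"
    status=$?
    rm -rf "$tmp"
    return "$status"
}

# Take a host out of service by adding it to the host group.
sge_disable_host() {
    "$QCONF" -aattr hostgroup hostlist "$1" @disabled
}

# Put it back in service by removing it from the host group.
sge_enable_host() {
    "$QCONF" -dattr hostgroup hostlist "$1" @disabled
}
```

After the one-time setup, 'sge_disable_host hostname1' and 'sge_enable_host hostname1' are the only commands needed per host, whatever queues are defined on it.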

Daniel
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=310598

neubauer
2010-12-29 15:27:04 UTC
That's quite a good idea, thank you!

Sebastian
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=310757

fx
2010-12-31 14:58:48 UTC
Post by templedf
it to the host group. The benefit of this approach is that it works for
all queues on the host without needing to enumerate them.
The approach I recommended uses a host group. Do people not normally
test nodes in batch before letting users back on them, which the
additional ACL allows?

I should have mentioned the refinement of maintaining a host comment
complex recording why the node is (semi-)disabled, which isn't in the
version I referred to. I.e. the sge-restrict-nodes should have a
--reason arg, which sets the string-valued `problem' complex, and
sge-unrestrict-nodes nullifies it. (The hostgroup isn't redundant with
the complex defined, because an RQS can't restrict on the basis of the
complex as far as I know.)
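
To illustrate the idea only (this is not the actual sge-restrict-nodes code): assuming a string-valued complex named `problem' has already been defined with 'qconf -mc', recording and displaying the reason could look roughly like this. The function names and the QCONF/QHOST override variables are hypothetical, and the sketch covers only setting and showing the value.

```shell
# Sketch only. QCONF and QHOST are hypothetical overrides so the
# functions can be pointed at stubs for a dry run.
QCONF="${QCONF:-qconf}"
QHOST="${QHOST:-qhost}"

# Record why a host is restricted by setting the "problem" complex in
# the exec host's complex_values. Use a reason without whitespace or
# commas, e.g. bad_dimm.
sge_set_problem() {
    host=$1
    reason=$2
    "$QCONF" -mattr exechost complex_values "problem=$reason" "$host"
}

# Show the recorded "problem" value for each host.
sge_show_problems() {
    "$QHOST" -F problem
}
```

For example, 'sge_set_problem hostname1 bad_dimm' followed by 'sge_show_problems'.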
--
Dave Love
Advanced Research Computing, Computing Services, University of Liverpool
AKA ***@gnu.org

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=311370
