Discussion:
|E|commlib error: got select error (Broken pipe)
adarsh
2010-11-30 04:57:05 UTC
Hi,

I would like to know whether it is necessary to have the default cell directory mounted over NFS.

After installing the qmaster, I simply copied the whole SGE package to the execution hosts with scp.
After this I ran ./install_execd on the slaves.

My Qmon shows all hosts with their loads.

Yet I am not able to finish my job successfully. The execution host log (messages) shows:

11/30/2010 09:30:36| main|ws34-rak-lin|I|starting up GE 6.2u5 (lx24-amd64)

But my job remains in the qw state after submission.

The qmaster log shows:

11/30/2010 09:26:08|listen|ws37-mah-lin|E|commlib error: got select error (Broken pipe)
11/30/2010 09:26:15|listen|ws37-mah-lin|E|commlib error: got read error (closing "ws34-rak-lin/execd/1")

When I issue the ./qstat command, it shows the result in the attached file.

Please help.

Thanks & Regards
Adarsh Sharma

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=300487

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
reuti
2010-12-02 09:26:41 UTC
Hi,
Post by adarsh
I would like to know whether it is necessary to have the default cell directory mounted over NFS.
After installing the qmaster, I simply copied the whole SGE package to the execution hosts with scp.
After this I ran ./install_execd on the slaves.
In principle this is possible. You could even have saved some work by also installing the execd on the qmaster host (just to generate the correct $SGE_ROOT/default/common/sgeexecd) and then removing that host again from the list of execution hosts.

Then, after transferring the complete $SGE_ROOT directory, you only need to copy $SGE_ROOT/default/common/sgeexecd to (or link it from) one or more of the /etc/init.d/rcX.d directories, once you have added each host as an administrative host with `qconf -ah <exechost>`. The hosts will be added as exechosts automatically when the qmaster is contacted.

Some details you may find here:

http://gridengine.sunsource.net/howto/nfsreduce.html
Post by adarsh
My Qmon shows all hosts with their loads.
Yet I am not able to finish my job successfully. The execution host log (messages) shows:
11/30/2010 09:30:36| main|ws34-rak-lin|I|starting up GE 6.2u5 (lx24-amd64)
But my job remains in the qw state after submission.
The qmaster log shows:
11/30/2010 09:26:08|listen|ws37-mah-lin|E|commlib error: got select error (Broken pipe)
11/30/2010 09:26:15|listen|ws37-mah-lin|E|commlib error: got read error (closing "ws34-rak-lin/execd/1")
Is there a firewall on any of the machines blocking ports 6444 and 6445 (these are the default ports, unless you changed them)?
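
A quick way to test this is to try a plain TCP connection to each port from the other side. The following is only a minimal sketch in Python, not part of Grid Engine: the helper name `port_is_reachable` is made up for illustration, and it assumes that ws37-mah-lin is the qmaster host (6444) and ws34-rak-lin an execution host (6445), as the log lines above suggest. A failed connection does not prove that a firewall is the cause; the daemon may simply not be running.

```python
import socket

# Default Grid Engine ports: 6444 (sge_qmaster) and 6445 (sge_execd).
SGE_QMASTER_PORT = 6444
SGE_EXECD_PORT = 6445


def port_is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Hypothetical helper for illustration; it only tests TCP reachability.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


if __name__ == "__main__":
    # Host names are assumptions taken from the log lines in this thread.
    checks = [
        ("ws37-mah-lin", SGE_QMASTER_PORT),  # run from an execution host
        ("ws34-rak-lin", SGE_EXECD_PORT),    # run from the qmaster host
    ]
    for host, port in checks:
        state = "reachable" if port_is_reachable(host, port) else "NOT reachable"
        print(f"{host}:{port} is {state}")
```

Run it once on the qmaster and once on an execution host, since a firewall may block only one direction.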

`qacct` will only work on the headnode, as it needs access to $SGE_ROOT/default/common/accounting, which is not shared in your cluster. Even on the headnode it won't display anything about the job right now, as the accounting record is written only after the job has left the system. For running or pending jobs you can try `qstat -j 6` or `qalter -w p 6`.

-- Reuti
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=301284
