Discussion:
Job remain in qw state
adarsh
2010-11-29 12:30:18 UTC
Dear all,

Thanks for your replies. I have now successfully integrated Hadoop with SGE, and my ./qhost -F | grep hdfs command shows
all data paths.

Now, when I run a simple wordcount job, the job remains in the qw state.
The logs on the execution hosts say:

11/29/2010 16:47:34| main|ws37-user-lin|E|shepherd of job 1.1 exited with exit status = 27
11/29/2010 16:47:34| main|ws37-user-lin|E|can't open usage file "active_jobs/1.1/usage" for job 1.1: No such file or directory
11/29/2010 16:47:34| main|ws37-user-lin|E|11/29/2010 16:47:34 [0:9462]: unable to find shell "/bin/csh"

How can I get rid of this?
Is it sufficient to scp the accounting file to all nodes, or must /default/common be mounted over NFS?

I simply copied it to all execution hosts.

Thanks in Advance
Adarsh Sharma

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=300212

To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
reuti
2010-11-29 13:30:58 UTC
Post by adarsh
Dear all,
Thanks for your replies. I have now successfully integrated Hadoop with SGE, and my ./qhost -F | grep hdfs command shows
all data paths.
Now, when I run a simple wordcount job, the job remains in the qw state.
11/29/2010 16:47:34| main|ws37-user-lin|E|shepherd of job 1.1 exited with exit status = 27
11/29/2010 16:47:34| main|ws37-user-lin|E|can't open usage file "active_jobs/1.1/usage" for job 1.1: No such file or directory
11/29/2010 16:47:34| main|ws37-user-lin|E|11/29/2010 16:47:34 [0:9462]: unable to find shell "/bin/csh"
If csh is not installed, you will most likely want these queue settings:

$ qconf -sq all.q
...
shell /bin/sh
...
shell_start_mode unix_behavior

The first entry (shell) is honored only when the second entry (shell_start_mode) is set to posix_compliant; with unix_behavior, the interpreter named in the script's #! line is used instead.
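
A minimal sketch of how the two queue settings could be checked on an execution host. It assumes the text printed by `qconf -sq all.q` is piped in on stdin; the function name check_queue_shell is made up for this illustration and is not part of Grid Engine.

```shell
#!/bin/sh
# Hypothetical helper: read `qconf -sq <queue>` output on stdin, print the
# shell_start_mode, and say whether the configured shell exists on this host.
check_queue_shell() {
    awk '$1 == "shell" || $1 == "shell_start_mode" { print $1, $2 }' |
    while read -r key value; do
        case $key in
            shell)
                if [ -x "$value" ]; then
                    echo "shell $value: found"
                else
                    echo "shell $value: missing on this host"
                fi
                ;;
            shell_start_mode)
                echo "shell_start_mode $value"
                ;;
        esac
    done
}

# Typical use (requires a working Grid Engine installation):
#   qconf -sq all.q | check_queue_shell
```

A "missing on this host" result for /bin/csh would match the "unable to find shell" message in the execution host log.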

==

If you don't want to change the queue's setting, you can also submit your jobs with:

$ qsub -S /bin/sh ...

instead.

==

The job might now be in an error state (check `qstat`); if so, clear it with `qmod -cj 1`, where 1 is the job ID from the log above (27 is the shepherd's exit status, not the job ID).
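
A hedged sketch for listing jobs whose state contains E. It assumes the default `qstat` layout, with two header lines and the job state in the fifth column; jobs_in_error is a hypothetical helper name, not a Grid Engine command.

```shell
#!/bin/sh
# Hypothetical helper: read plain `qstat` output on stdin and print the IDs
# of jobs whose state column (assumed to be field 5) contains an "E".
jobs_in_error() {
    awk 'NR > 2 && $5 ~ /E/ { print $1 }'
}

# Typical use (requires a working Grid Engine installation):
#   qstat | jobs_in_error
# To clear the error state of every job found:
#   qstat | jobs_in_error | while read -r id; do qmod -cj "$id"; done
```

Clearing the error state only helps once the underlying cause (here, the missing shell) has been fixed; otherwise the job is likely to fail again.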


-- Reuti
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=300222

adarsh
2010-11-30 04:17:02 UTC
Thanks, but I followed your commands and it still isn't working; new jobs remain in the qw state too.

I changed the default shell to /bin/bash with qconf -mq all.q. I also submitted the job with qsub -S /bin/sh, but jobs remain in the qw state.

On the execution hosts there are no useful log entries except "Starting ge6.2u5".

But the qmaster log says:

11/30/2010 09:26:08|listen|ws37-mah-lin|E|commlib error: got select error (Broken pipe)
11/30/2010 09:26:15|listen|ws37-mah-lin|E|commlib error: got read error (closing "ws34-rak-lin/execd/1")

I don't know what to do.

Thanks

Adarsh

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=300481

reuti
2010-12-02 09:28:55 UTC
Post by adarsh
Thanks, but I followed your commands and it still isn't working; new jobs remain in the qw state too.
I changed the default shell to /bin/bash with qconf -mq all.q. I also submitted the job with qsub -S /bin/sh, but jobs remain in the qw state.
Only one of the two measures should be necessary. Alternatively, change shell_start_mode to unix_behavior.

What does your script look like? Is your /home shared? Are the users the same on all systems (via NIS or similar), with the same UIDs?
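
A small sketch of how the UID question could be checked. It assumes a list of "host uid" pairs on stdin, gathered for example over ssh; the function name uids_consistent and the user name in the usage comment are placeholders, and the host names are the ones from the logs in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read lines of the form "<host> <uid>" on stdin and
# report every host whose UID differs from the first host's UID.
# Exit status: 0 if all UIDs agree, 1 if there is at least one mismatch.
uids_consistent() {
    awk '
        NR == 1      { first = $2 }
        $2 != first  { bad = 1
                       print "mismatch on " $1 ": " $2 " (expected " first ")" }
        END          { exit bad }
    '
}

# Typical use (assumes password-less ssh and a user named "adarsh"):
#   for h in ws37-user-lin ws34-rak-lin; do
#       printf '%s %s\n' "$h" "$(ssh "$h" id -u adarsh)"
#   done | uids_consistent
```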

-- Reuti
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=301285

mhanby
2010-12-02 16:10:08 UTC
Doesn't bash also need to be added here:

$ qconf -sconf|grep bash
login_shells bash,sh,ksh,csh,tcsh

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=301379

udo
2010-12-02 16:15:43 UTC
Hi,
Just a naïve suggestion: check whether the queue is in an error state:

qstat -f |grep E

best,
v
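
Since a plain grep for E can also match unrelated lines (for example a job name containing a capital E), a slightly stricter filter might look like the sketch below. It assumes the usual `qstat -f` layout, in which queue instance lines start with queue@host and carry the states in the sixth column; queues_in_error is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: read `qstat -f` output on stdin and print the queue
# instances (queue@host) whose states column (assumed field 6) contains "E".
queues_in_error() {
    awk '$1 ~ /@/ && NF >= 6 && $6 ~ /E/ { print $1 }'
}

# Typical use (requires a working Grid Engine installation):
#   qstat -f | queues_in_error
# The reason for an error state can be shown with:
#   qstat -f -explain E
```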

------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=301380
