tvsingh
2010-12-10 16:56:44 UTC
Hello there,
We have a decent-sized cluster that executes about 3000 jobs per day on average. I have been looking at this setup closely for the last couple of weeks and noticed the following errors in the system's messages file:
Dec 9 10:27:18 localhost kernel: sge_qmaster[20498]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000483e3988 error 4
Dec 9 10:28:48 localhost kernel: sge_qmaster[20826]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000484a5988 error 4
Dec 9 10:52:03 localhost kernel: sge_qmaster[21880]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000486ac988 error 4
Dec 10 00:55:46 localhost kernel: sge_qmaster[7994]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000484df988 error 4
The server runs the SGE 6.2u5 binaries and the OS is CentOS 5.x. I have also noticed that the memory usage of the qmaster often keeps increasing for no visible reason, and that leads the server to crash.
It does not seem to be caused by heavy load, since at other times the system runs normally even when the load (the system's job throughput per hour) is much higher.
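In case it helps anyone reading the log lines above, here is a minimal Python sketch for pulling the fields out of such a message and decoding the trailing error value. It assumes the usual x86 page-fault error-code bits (bit 0 = protection violation, bit 1 = write access, bit 2 = user mode) and that the kernel prints the value in hex. The function name and the regular expression are only for illustration and are not part of SGE.

```python
import re

# Matches the kernel segfault line format shown in the messages file above.
SEGFAULT_RE = re.compile(
    r"(?P<proc>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)

def decode_segfault(line):
    """Parse one segfault line; return its fields, or None if it does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # assumption: error value is printed in hex
    return {
        "process": m.group("proc"),
        "pid": int(m.group("pid")),
        "fault_addr": int(m.group("addr"), 16),
        "rip": int(m.group("rip"), 16),
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
        "cause": "protection violation" if err & 1 else "page not present",
    }

line = ("Dec 9 10:27:18 localhost kernel: sge_qmaster[20498]: segfault at "
        "0000000000000080 rip 0000003b01079a30 rsp 00000000483e3988 error 4")
info = decode_segfault(line)
print(info)
```

Under those assumptions, all four messages decode to a user-mode read of an unmapped address (0x80) at the same instruction address, which would be consistent with the same fault recurring each time rather than random corruption. That is only my reading of the log, not a confirmed diagnosis.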
Any help will be much appreciated,
Thanks in advance,
TV Singh
------------------------------------------------------
http://gridengine.sunsource.net/ds/viewMessage.do?dsForumId=38&dsMessageId=303987
To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].