seg fault with SGE 6.2u5 server
2010-12-10 16:56:44 UTC
Hello there,

We have a decent-sized cluster that executes about 3000 jobs a day on average. I have been looking at this setup closely for the last couple of weeks and noticed the following errors in the system's messages file:

Dec 9 10:27:18 localhost kernel: sge_qmaster[20498]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000483e3988 error 4
Dec 9 10:28:48 localhost kernel: sge_qmaster[20826]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000484a5988 error 4
Dec 9 10:52:03 localhost kernel: sge_qmaster[21880]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000486ac988 error 4
Dec 10 00:55:46 localhost kernel: sge_qmaster[7994]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000484df988 error 4
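
For readers unfamiliar with this kernel message format, here is a minimal Python sketch that decodes one of the lines above. It assumes the standard x86 page-fault error-code bits (bit 0: protection violation rather than a not-present page, bit 1: write access, bit 2: user mode) and that the kernel prints the error code in hex. The name decode_segfault is made up for illustration and is not part of SGE or any library.

```python
import re

# Pattern for the 2.6-era x86-64 "segfault at ... rip ... rsp ... error N" line.
SEGFAULT_RE = re.compile(
    r"(?P<prog>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)

def decode_segfault(line):
    """Return the decoded fields of a kernel segfault line, or None if no match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)  # assumption: error code is printed in hex
    return {
        "prog": m.group("prog"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),   # faulting address
        "rip": int(m.group("rip"), 16),     # instruction pointer at the fault
        "cause": "protection violation" if err & 1 else "page not present",
        "access": "write" if err & 2 else "read",
        "mode": "user" if err & 4 else "kernel",
    }

line = ("Dec 9 10:27:18 localhost kernel: sge_qmaster[20498]: segfault at "
        "0000000000000080 rip 0000003b01079a30 rsp 00000000483e3988 error 4")
info = decode_segfault(line)
print(info)
```

Under those assumptions, "error 4" is a user-mode read of an unmapped page. All four crashes share the same fault address (0x80) and the same rip, which would be consistent with one code path reading through a pointer that is NULL or close to it, though that is only a reading of the numbers and not a diagnosis.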

The server runs the SGE 6.2u5 binaries on CentOS 5.x. I have also noticed that the qmaster's memory usage often keeps increasing for no visible reason, which eventually causes the server to crash.
This does not seem to be due to heavy load, since at other times the system runs normally even when the load (the system's job throughput per hour) is much higher.
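
One way to put numbers on the memory growth is to sample the qmaster's resident set size at intervals and compare it with the job throughput. The following is a rough Python sketch, assuming a Linux /proc filesystem and a reasonably recent Python; the function names are hypothetical and the PID has to be supplied by whoever runs it.

```python
import time

def parse_vmrss_kb(status_text):
    """Extract the VmRSS value (in kB) from the text of /proc/<pid>/status."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # no VmRSS line found

def read_rss_kb(pid):
    """Read the current resident set size of a process, in kB."""
    with open("/proc/%d/status" % pid) as f:
        return parse_vmrss_kb(f.read())

def log_rss(pid, interval_s=60, samples=5):
    """Print a timestamped RSS sample every interval_s seconds."""
    for _ in range(samples):
        print(time.strftime("%Y-%m-%d %H:%M:%S"), read_rss_kb(pid))
        time.sleep(interval_s)

# Parsing demo on a made-up status excerpt (not real qmaster data).
sample = "Name:\tsge_qmaster\nVmRSS:\t  123456 kB\n"
print(parse_vmrss_kb(sample))
```

Calling log_rss with the sge_qmaster PID and redirecting the output to a file would give a simple time series to line up against the segfault timestamps.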

Any help will be much appreciated,

Thanks in advance,
TV Singh


To unsubscribe from this discussion, e-mail: [users-***@gridengine.sunsource.net].
2010-12-28 21:19:01 UTC
Post by tvsingh
> Hello there,
> Dec 9 10:27:18 localhost kernel: sge_qmaster[20498]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000483e3988 error 4
> Dec 9 10:28:48 localhost kernel: sge_qmaster[20826]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000484a5988 error 4
> Dec 9 10:52:03 localhost kernel: sge_qmaster[21880]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000486ac988 error 4
> Dec 10 00:55:46 localhost kernel: sge_qmaster[7994]: segfault at 0000000000000080 rip 0000003b01079a30 rsp 00000000484df988 error 4
> The server is based on the binaries of SGE6.2u5 and OS is CentOS
> 5.x. Also I noticed many a times the memory usage by q master keeps
> increasing without any visible reason and that leads server to crash.

There's at least one known cause of qmaster SEGVs fixed by the source
you can get from https://arc.liv.ac.uk/trac/SGE, as posted about here
many times. I'm not aware of specific memory leaks that might be cured
by it, though.
Dave Love
Advanced Research Computing, Computing Services, University of Liverpool
AKA ***@gnu.org

