We have several core dumps in our product. They can be reproduced, and
they always occur in the same place: the standard library function
std::basic_istream<char,std::char_traits<char>>::getline. The pstack
output for the core dump is
pstack core | c++filt
core 'core of 12214: ../bin/QBE_V5 -X 30017
ffffffff7b944318 __type_0 std::__find_if<const char*,std::_Eq_char_bound<std::char_traits<char> > >(__type_0,__type_0,__type_1,const std::random_access_iterator_tag&)
 (ffffffff7a000bc7, ffffffff7a0014fd, a00000000000001, ffffffff7a0014fd, a, 936) + 20
ffffffff7b952c18 long std::_M_read_buffered<char,std::char_traits<char>,std::_Eq_char_bound<std::char_traits<char> >,std::_Scan_for_char_val<std::char_traits<char> > >(std::basic_istream<__type_0,__type_1>*,std::basic_streambuf<__type_0,__type_1>*,long,__type_0*,__type_2,__type_3,bool,bool,bool)
 (ffffffff7a000bc7, 1001e7170, 3ff, 100243bd0, 0, a00000000000005) + 84
ffffffff7b9537c4 std::istream &std::istream::getline(char*,long,char)
 (1001e7160, 100243bd0, 400, a, 1001e7160, 0) + 7c
ffffffff7e6b905c int Service::readObj(std::ifstream &,XEPersistentObj*&)
 (1001e7160, ffffffff7ffff2a8, 2, 188d04, ffffffff7bac93b8, 10) + 54
000000010001df08 int initialize(unsigned)
 (1001ce030, 0, 1001ce1d0, 1001ce1d0, 3400, 1) + 660
000000010002c37c main (3, ffffffff7ffffac8, ffffffffffecb898, 1001bb020, 1001959e8, 134400) + ac
000000010001b6dc _start (0, 0, 0, 0, 0, 0) + 17c
When we debug it with dbx, dbx reports an object specific hardware
error, that is, a SIGBUS. The dbx output is
t@null (l@1) program terminated by signal BUS (object specific hardware error)
0xffffffff7b944318: __find_if+0x0020: ldsb [%o4], %o0
(dbx) regs
current thread: t@null
current frame: [1]
g0-g1 0x0000000000000000 0xffffffff7b953748
g2-g3 0x0000000000000000 0x000000010022ae0c
g4-g5 0x0000000000000001 0x0000000000000936
g6-g7 0x0000000000000000 0xffffffff7de02000
o0-o1 0xffffffff7a000bc7 0xffffffff7a0014fd
o2-o3 0x000000000000000a 0xffffffff7fffebbe
o4-o5 0xffffffff7a000bc7 0x000000000000024d
o6-o7 0xffffffff7fffe221 0xffffffff7b9442e8
l0-l1 0xffffffff7de02000 0x0000000000000000
l2-l3 0x000000010023ef40 0xffffffff7b3ebec4
l4-l5 0x0000000000000000 0x0000000000000000
l6-l7 0x0000000000000001 0x0000000000000000
i0-i1 0xffffffff7a000bc7 0xffffffff7a0014fd
i2-i3 0x0a00000000000001 0xffffffff7a0014fd
i4-i5 0x000000000000000a 0x0000000000000936
i6-i7 0xffffffff7fffe3c1 0xffffffff7b952c18
y 0x0000000000000000
ccr 0x0000000000000044
pc 0xffffffff7b944318:__find_if+0x20 ldsb [%o4], %o0
npc 0xffffffff7b94431c:__find_if+0x24 cmp %o0, %o2
(dbx) examine $o4 /s
dbx: warning: unknown language, 'c' assumed
0xffffffff7a000bc7: "EngineCkptInput 99 f " ...
int Service::readObj( ifstream& strm, XEPersistentObj*& retObj )
{
char* tmp=0;
tmp = new char [BUFSIZ];
strm.getline(tmp,BUFSIZ);
Looking at the code, we cannot find anything suspicious. It is a plain
library call. I searched Google for similar cases and found two links:
http://groups.google.com/group/comp...._thread/thread...
and
http://groups.google.com/group/comp...._thread/thread....
A Sun engineer said, "It's an error returned by software somewhere deep
down the VM system's hat layer; without knowledge of the mapping at the
address, how it was accessed, it's hard to tell what really is the
matter. Basically, the HAT layer is very low-level part of the virtual
memory system. HAT information describes how a memory page is mapped
on the physical side of the VM (i.e. RAM)." He also suggested, "Start
by finding out which address is giving the problem, which instruction
is using the address and how."
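One way to act on that suggestion from inside the process is sketched below. It is only an illustration, not code from our product: it installs a SIGBUS handler through the POSIX sigaction interface with SA_SIGINFO, prints the faulting address (si_addr) and the reason (si_code), and then lets the default action produce the core as before. The names bus_code_name, bus_reporter and install_bus_reporter are made up for this example; BUS_ADRALN, BUS_ADRERR and BUS_OBJERR are the POSIX si_code values for SIGBUS, and the dbx wording above resembles the POSIX description of BUS_OBJERR.

```cpp
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

// Hypothetical helper: map a SIGBUS si_code to a readable name.
const char* bus_code_name(int code) {
    switch (code) {
    case BUS_ADRALN: return "BUS_ADRALN (invalid address alignment)";
    case BUS_ADRERR: return "BUS_ADRERR (nonexistent physical address)";
    case BUS_OBJERR: return "BUS_OBJERR (object specific hardware error)";
    default:         return "unknown si_code";
    }
}

// Hypothetical handler: report where and why the fault happened.
// snprintf is not formally async-signal-safe, so treat this as a
// one-off diagnostic aid, not production code.
extern "C" void bus_reporter(int, siginfo_t* info, void*) {
    char msg[160];
    int n = snprintf(msg, sizeof msg, "SIGBUS at address %p: %s\n",
                     info->si_addr, bus_code_name(info->si_code));
    if (n > 0) {
        size_t len = (static_cast<size_t>(n) < sizeof msg)
                         ? static_cast<size_t>(n) : sizeof msg - 1;
        ssize_t ignored = write(STDERR_FILENO, msg, len);
        (void)ignored;
    }
    // SA_RESETHAND has already restored the default action, so when the
    // handler returns, the faulting instruction runs again and the
    // process dumps core at the original location.
}

// Hypothetical installer: call once, early in main().
bool install_bus_reporter() {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = bus_reporter;
    sa.sa_flags = SA_SIGINFO | SA_RESETHAND;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, 0) == 0;
}
```

With install_bus_reporter() called at startup, the message on stderr would show whether the fault is an alignment problem or a mapping problem, which is the distinction the quoted advice is after.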
As far as we understand the implementation of getline, a large buffer
is allocated and data is loaded into it. The data is then compared,
byte by byte, with the delimiter character. The ldsb instruction loads
a byte from the buffer, at the address held in register %o4, into
register %o0. The cmp that follows compares %o0 with %o2 (0x0a, the
newline delimiter) to check whether the condition is met.
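The scan described above can be sketched in C++ roughly as follows. This is not the library's source: scan_for_delim is a hypothetical stand-in for the internal __find_if seen in the stack trace, and line_length is a made-up helper added only to show how the result would be used.

```cpp
#include <cstddef>

// Hypothetical stand-in for the library's internal __find_if: walk
// [first, last) one byte at a time and stop at the first byte equal
// to delim. Returns last if no such byte exists.
const char* scan_for_delim(const char* first, const char* last, char delim) {
    for (; first != last; ++first) {
        // On SPARC this test corresponds roughly to
        //   ldsb [%o4], %o0   (load the byte at the current position)
        //   cmp  %o0, %o2     (compare it with the delimiter)
        if (*first == delim) {
            break;
        }
    }
    return first;
}

// Hypothetical helper: number of bytes before the delimiter in the
// first avail bytes of buf, or avail if the delimiter is absent.
std::size_t line_length(const char* buf, std::size_t avail, char delim = '\n') {
    return static_cast<std::size_t>(scan_for_delim(buf, buf + avail, delim) - buf);
}
```

The only memory access in the loop is the single byte read through the current pointer, which matches the one load instruction at the faulting pc.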
According to the SPARC instruction set, ldsb loads a signed byte from
memory into a register. A byte load has no alignment requirement, so it
cannot fault on a misaligned address. The dbx "examine" command also
shows that the address in question holds the correct content loaded
from services.dat. So we really don't know why the core dump happened.
Our product will be delivered to the customer in a few days, so this is
very urgent for us. Any input or help would be highly appreciated.
P.S. The OS version is Solaris 10, 64-bit.
% /usr/platform/sun4u/sbin/prtdiag
System Configuration: Sun Microsystems sun4u Netra t 1400/1405 (4 X
UltraSPARC-II 440MHz)
System clock frequency: 110 MHz
Memory size: 4096 Megabytes
Best Regards
Leslie