|
|
https://www.veritas.com/support/en_US/article.000011284
Panic form had in turnstile_block from zfs
SolarisCAT(vmcore.4/10U)> panic
panic on cpu 16
panic string: GAB: Port h halting system due to client process failure
==== panic kernel thread: 0x2a101947ca0 PID: 0 on CPU: 16 ====
cmd: sched
t_procp: 0x18922c0(proc_sched)
p_as: 0x1893ee0(kas)
zone: global
t_stk: 0x2a101947a90 sp: 0x18c0371 t_stkbase: 0x2a101942000
t_pri: 99(SYS) pctcpu: 0.000000
t_lwp: 0x0 psrset: 0 last CPU: 16
idle: 14 ticks (0.14 seconds)
start: Wed Dec 29 12:58:34 2010
age: 1088859 seconds (12 days 14 hours 27 minutes 39 seconds)
tstate: TS_ONPROC - thread is being run on a processor
tflg: T_TALLOCSTK - thread structure allocated from stk
T_PANIC - thread initiated a system panic
tpflg: none set
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
TS_SIGNALLED - thread was awakened by cv_signal()
pflag: SSYS - system resident process
pc: unix:panicsys+0x48: call unix:setjmp
startpc: genunix:taskq_thread+0x0: save %sp, -0xd0, %sp
unix:panicsys+0x48(0x2a1019475d4, 0x2a1019475b0, 0x18c0d40, 0x1, , , 0x4480001605, , , , , , , , 0x2a1019475d4, 0x2a1019475b0)
unix:vpanic_common+0x78(0x2a1019475d4, 0x2a1019475b0, 0x10003, 0x3de9267e89244, 0x2a101947a8a, 0x2a101947a88)
genunix:cmn_err+0x98(0x3, 0x2a1019475d4, 0x4, 0x2a10194760d, 0x2a10194760e, 0x0)
gab:gab_halt+0xb0(0x37370507, 0x37370507, 0x60063f612c0, 0x0, 0x0, 0x0)
gab:gab_kill_process+0xc8(0x10507)
gab:gab_timerscan+0x400(0x0)
genunix:callout_list_expire+0x5c(0x600565d6e40, 0x600a1af65c0)
genunix:callout_expire+0x1c(0x600565d6e40)
genunix:callout_execute+0xc(0x600565d6e40, 0x60062074dd8, 0x10bd4ec, 0x0, 0xaf434500, 0x0)
genunix:taskq_thread+0x1a4(0x60062beda40, 0x0)
unix:thread_start+0x4()
-- end of kernel thread's stack --
GAB: Port h halting system due to client process failure
SUN ZFS bug
We see customer is om 5.0 MP RP3 HF3
SolarisCAT(vmcore.4/10U)> modinfo | grep -i vx | cut -f 2 -d'('
VxVM 5.0MP3RP3HF2: DMP Driver)
VxVM 5.0MP3RP3HF2 I/O driver)
VxVM 5.0MP3RP3HF2 control/status driver)
VxFS 5.0_REV-5.0MP3RP3i_sol portal driver)
VxFS 5.0_REV-5.0MP3RP3i_sol SunOS 5.10)
VRTS Fence 5.0MP3RP3)
VxQIO 5.0_REV-5.0MP3RP3i_sol Quick I/O driver)
VCS is at 5.0 MP3 RP3
SolarisCAT(vmcore.4/10U)> modinfo | grep -i gab | cut -f 2 -d'('
GAB device 5.0MP3RP3)
Looking through the core we see gab panic the node:
SolarisCAT(vmcore.4/10U)> panic
panic on cpu 16
panic string: GAB: Port h halting system due to client process failure
==== panic kernel thread: 0x2a101947ca0 PID: 0 on CPU: 16 ====
cmd: sched
t_procp: 0x18922c0(proc_sched)
p_as: 0x1893ee0(kas)
zone: global
t_stk: 0x2a101947a90 sp: 0x18c0371 t_stkbase: 0x2a101942000
t_pri: 99(SYS) pctcpu: 0.000000
t_lwp: 0x0 psrset: 0 last CPU: 16
idle: 14 ticks (0.14 seconds)
start: Wed Dec 29 12:58:34 2010
age: 1088859 seconds (12 days 14 hours 27 minutes 39 seconds)
tstate: TS_ONPROC - thread is being run on a processor
tflg: T_TALLOCSTK - thread structure allocated from stk
T_PANIC - thread initiated a system panic
tpflg: none set
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
TS_SIGNALLED - thread was awakened by cv_signal()
pflag: SSYS - system resident process
pc: unix:panicsys+0x48: call unix:setjmp
startpc: genunix:taskq_thread+0x0: save %sp, -0xd0, %sp
unix:panicsys+0x48(0x2a1019475d4, 0x2a1019475b0, 0x18c0d40, 0x1, , , 0x4480001605, , , , , , , , 0x2a1019475d4, 0x2a1019475b0)
unix:vpanic_common+0x78(0x2a1019475d4, 0x2a1019475b0, 0x10003, 0x3de9267e89244, 0x2a101947a8a, 0x2a101947a88)
genunix:cmn_err+0x98(0x3, 0x2a1019475d4, 0x4, 0x2a10194760d, 0x2a10194760e, 0x0)
gab:gab_halt+0xb0(0x37370507, 0x37370507, 0x60063f612c0, 0x0, 0x0, 0x0)
gab:gab_kill_process+0xc8(0x10507)
gab:gab_timerscan+0x400(0x0)
genunix:callout_list_expire+0x5c(0x600565d6e40, 0x600a1af65c0)
genunix:callout_expire+0x1c(0x600565d6e40)
genunix:callout_execute+0xc(0x600565d6e40, 0x60062074dd8, 0x10bd4ec, 0x0, 0xaf434500, 0x0)
genunix:taskq_thread+0x1a4(0x60062beda40, 0x0)
unix:thread_start+0x4()
-- end of kernel thread's stack --
Looking through the message buffer we see load in low and memory is ok:
SolarisCAT(vmcore.4/10U)> msgbuf | more
LLT INFO V-14-1-10024 link 2 (aggr200005) node 1 active
LLT INFO V-14-1-10205 link 2 (aggr200005) node 1 in trouble
LLT INFO V-14-1-10024 link 2 (aggr200005) node 1 active
....
....
GAB WARNING V-15-1-20057 Port h process 6971 inactive 10 sec
GAB WARNING V-15-1-20057 Port h process 6971 inactive 11 sec
GAB WARNING V-15-1-20057 Port h process 6971 inactive 12 sec
GAB WARNING V-15-1-20057 Port h process 6971 inactive 13 sec
GAB WARNING V-15-1-20057 Port h process 6971 inactive 14 sec
GAB WARNING V-15-1-20058 Port h process 6971: heartbeat failed, killing process
GAB INFO V-15-1-20059 Port h heartbeat interval 15000 msec. Statistics:
GAB INFO V-15-1-20129 Port h: heartbeats in 0 ~ 3000 msec: 13604077
GAB INFO V-15-1-20129 Port h: heartbeats in 3000 ~ 6000 msec: 0
GAB INFO V-15-1-20129 Port h: heartbeats in 6000 ~ 9000 msec: 0
GAB INFO V-15-1-20129 Port h: heartbeats in 9000 ~ 12000 msec: 0
GAB INFO V-15-1-20129 Port h: heartbeats in 12000 ~ 15000 msec: 0
GAB INFO V-15-1-20088 System information:
GAB INFO V-15-1-20089 number of cpu: 64
GAB INFO V-15-1-20090 physical memory: 131950112 K
GAB INFO V-15-1-20091 free memory: 10610928 K
GAB INFO V-15-1-20092 average free memory in 5 sec: 10611080 K
GAB INFO V-15-1-20093 average free memory in 30 sec: 10615000 K
GAB INFO V-15-1-20094 number of processes: 1375
GAB INFO V-15-1-20095 load average in 1 min: 11.85
GAB INFO V-15-1-20096 load average in 5 min: 17. 4
GAB INFO V-15-1-20097 load average in 15 min: 25.78
GAB INFO V-15-1-20098 pagein rate: 0
GAB INFO V-15-1-20099 pageout rate: 0
GAB INFO V-15-1-20041 Port h: client process failure: killing process
Looking at had we see it is waiting for an O/S routine to complete:
SolarisCAT(vmcore.4/10U)> proc | grep had
0x6009cfe60e8 6986 1 0 3612672 663552 499712 6 /opt/VRTSvcs/bin/hashadow
0x6009fe4cd18 6971 1 0 73302016 15753216 62332928 255372 /opt/VRTSvcs/bin/had
SolarisCAT(vmcore.4/10U)>
SolarisCAT(vmcore.4/10U)> proc 6986
addr PID PPID RUID/UID size RSS swresv time command
============= ====== ====== ========== ========== ======== ======== ====== =========
0x6009cfe60e8 6986 1 0 3612672 663552 499712 6 /opt/VRTSvcs/bin/hashadow
thread: 0x30030af58a0 state: slp wchan: 0x30030228414 sobj: condition var (from genunix:wait_for_lock+0x34)
SolarisCAT(vmcore.4/10U)> proc 6971
addr PID PPID RUID/UID size RSS swresv time command
============= ====== ====== ========== ========== ======== ======== ====== =========
0x6009fe4cd18 6971 1 0 73302016 15753216 62332928 255372 /opt/VRTSvcs/bin/had
thread: 0x300308b11e0 state: slp wchan: 0x60060d6a7b8 sobj: reader/writer lock
thread: 0x300309f3bc0 state: slp wchan: 0x60060d6a7b8 sobj: reader/writer lock
Looking at the had thread we see:
SolarisCAT(vmcore.4/10U)> proc -L 0x6009fe4cd18
addr PID PPID RUID/UID size RSS swresv time command
============= ====== ====== ========== ========== ======== ======== ====== =========
0x6009fe4cd18 6971 1 0 73302016 15753216 62332928 255372 /opt/VRTSvcs/bin/had
==== user (LWP_SYS) thread: 0x300308b11e0 PID: 6971 ====
cmd: /opt/VRTSvcs/bin/had
t_wchan: 0x60060d6a7b8 sobj: reader/writer lock owner: (reader locked)
t_procp: 0x6009fe4cd18
p_as: 0x60077c7cd30 size: 73302016 RSS: 15753216
hat: 0x3001f009980
cnum: CPU0:439/5958 CPU2:821/6627 CPU4:485/76 CPU6:816/14 CPU8:42/1119 CPU10:501/2450 CPU12:642/9 CPU14:607/3194 CPU16:659/7 CPU18:435/51 CPU20:512/30 CPU22:362/3350 CPU24:432/47 CPU26:342/75 CPU28:525/928 CPU30:130/81 CPU32:24/56 CPU34:11/3 CPU36:11/27 CPU38:9/73 CPU40:4/108 CPU42:9/4 CPU44:11/37 CPU46:10/25 CPU48:15/33 CPU50:17/209 CPU52:11/18 CPU54:14/78 CPU56:16/17 CPU58:11/66 CPU60:13/49 CPU62:43/219
cpusran: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63
zone: global
t_stk: 0x2a10296fae0 sp: 0x2a10296e051 t_stkbase: 0x2a10296a000
t_pri: 157(RT) pctcpu: 0.103167
t_lwp: 0x60063feae60 machpcb: 0x2a10296fae0
mstate: LMS_SLEEP ms_prev: LMS_SYSTEM
ms_state_start: 3 minutes 29.1227795 seconds earlier
ms_start: 12 days 14 hours 24 minutes 39.8208550 seconds earlier
psrset: 0 last CPU: 56
idle: 20913 ticks (3 minutes 29.13 seconds)
start: Wed Dec 29 13:01:28 2010
age: 1088685 seconds (12 days 14 hours 24 minutes 45 seconds)
syscall: #18 stat(, 0xffbf2880) (sysent: genunix:stat32+0x0)
tstate: TS_SLEEP - awaiting an event
tflg: T_DFLTSTK - stack is default size
tpflg: TP_TWAIT - wait to be freed by lwp_wait
TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
pflag: SKILLED - SIGKILL has been posted to the process
SMSACCT - process is keeping micro-state accounting
SMSFORK - child inherits micro-state accounting
pc: genunix:turnstile_block+0x5fc: call unix:swtch
genunix:turnstile_block+0x5fc(0x3008d7e4ec0, 0x1, 0x60060d6a7b8, 0x1833c90, 0x0, 0x0)
unix:rw_enter_sleep+0x19c(0x60060d6a7b8?)
unix:rw_enter(0x60060d6a7b8, 0x1) - frame recycled
zfs:dmu_zfetch_find+0x3c(0x60060d6a7b8, 0x2a10296eb18, 0x1)
zfs:dmu_zfetch+0x74(0x60060d6a7b8, 0x9334000, 0x4000, 0x1)
zfs:dbuf_read+0x120(0x3002b68c018, 0x0, 0x2)
zfs:dnode_hold_impl+0xbc(, , 0x1?)
zfs:dnode_hold(0x60052f6d5c0, 0x499aa, 0x12f7df8, 0x2a10296ed98) - frame recycled
zfs:dmu_buf_hold+0x14(0x60052c45780, 0x499aa, 0x0, 0x0, 0x2a10296ee50)
zfs:zap_lockdir+0x24(, , 0x0, 0x1, 0x1, 0x0, 0x2a10296ef18)
zfs:zap_lookup_norm+0x40(0x60052c45780, 0x499aa, 0x2a10296f680, 0x8, 0x1, 0x2a10296f1d8, 0x0, 0x0, 0x0, 0x0)
zfs:zap_lookup+0x2c(0x60052c45780, 0x499aa, 0x2a10296f680, 0x8, 0x1, 0x2a10296f1d8)
zfs:zfs_match_find+0xd8(0x60052d30380, 0x6009e897a48, 0x2a10296f680, 0x0, 0x1, 0x0)
zfs:zfs_dirent_lock+0x3d0(0x2a10296f2c0, 0x6009e897a48, 0x2a10296f680, 0x2a10296f2c8, 0x6, 0x0)
zfs:zfs_dirlook+0x1a8(0x6009e897a48, 0x2a10296f680, 0x2a10296f678, , 0x0, 0x0)
zfs:zfs_lookup+0x264(0x6009e74e4c0, 0x2a10296f680, 0x2a10296f678, 0x0, 0x0, 0x60092fcef30, 0x60092fcef30, 0x0, 0x0, 0x0)
zfs:zfs_shim_lookup+0x30(0x6009e74e4c0, 0x2a10296f680, 0x2a10296f678, 0x60092fcef30, 0x0, 0x0)
genunix:fop_lookup+0x28(0x6009e74e4c0, 0x2a10296f680, 0x2a10296f678, 0x2a10296f940, 0x0, 0x60052cc7940)
genunix:lookuppnvp+0x344(0x2a10296f940, 0x0, 0x1, 0x0, 0x2a10296fad8, 0x60052cc7940, 0x60052cc7940, 0x60092fcef30)
genunix:lookuppnat+0x120(0x2a10296f940, 0x0, 0x1, 0x0, 0x2a10296fad8, 0x0)
genunix:lookupnameat+0x5c(0x10655a8, 0x0, 0x1, 0x0, 0x2a10296fad8, 0x0)
genunix:cstatat_getvp+0x198(, 0x10655a8, 0x1, 0x1, 0x2a10296fad8, 0x2a10296fad0)
genunix:cstatat32+0x40(, 0x10655a8, , 0xffbf28f4, 0x0, 0x10)
unix:syscall_trap32+0xcc()
-- switch to user thread's user stack --
==== user (LWP_SYS) thread: 0x300309f3bc0 PID: 6971 ====
cmd: /opt/VRTSvcs/bin/had
t_wchan: 0x60060d6a7b8 sobj: reader/writer lock owner: (reader locked)
t_procp: 0x6009fe4cd18
p_as: 0x60077c7cd30 size: 73302016 RSS: 15753216
hat: 0x3001f009980
cnum: CPU0:439/5958 CPU2:821/6627 CPU4:485/76 CPU6:816/14 CPU8:42/1119 CPU10:501/2450 CPU12:642/9 CPU14:607/3194 CPU16:659/7 CPU18:435/51 CPU20:512/30 CPU22:362/3350 CPU24:432/47 CPU26:342/75 CPU28:525/928 CPU30:130/81 CPU32:24/56 CPU34:11/3 CPU36:11/27 CPU38:9/73 CPU40:4/108 CPU42:9/4 CPU44:11/37 CPU46:10/25 CPU48:15/33 CPU50:17/209 CPU52:11/18 CPU54:14/78 CPU56:16/17 CPU58:11/66 CPU60:13/49 CPU62:43/219
cpusran: 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63
zone: global
t_stk: 0x2a1029b7ae0 sp: 0x2a1029b69c1 t_stkbase: 0x2a1029b2000
t_pri: 157(RT) pctcpu: 0.004867
t_lwp: 0x60097ddaf70 machpcb: 0x2a1029b7ae0
mstate: LMS_SLEEP ms_prev: LMS_SYSTEM
ms_state_start: 3 minutes 40.1957090 seconds earlier
ms_start: 12 days 14 hours 24 minutes 39.7662149 seconds earlier
psrset: 0 last CPU: 33
idle: 22020 ticks (3 minutes 40.20 seconds)
start: Wed Dec 29 13:01:28 2010
age: 1088685 seconds (12 days 14 hours 24 minutes 45 seconds)
syscall: #4 write(, 0x1feffa6e8) (sysent: genunix:write32+0x0)
tstate: TS_SLEEP - awaiting an event
tflg: T_DFLTSTK - stack is default size
tpflg: TP_TWAIT - wait to be freed by lwp_wait
TP_MSACCT - collect micro-state accounting information
tsched: TS_LOAD - thread is in memory
TS_DONT_SWAP - thread/LWP should not be swapped
TS_SIGNALLED - thread was awakened by cv_signal()
pflag: SKILLED - SIGKILL has been posted to the process
SMSACCT - process is keeping micro-state accounting
SMSFORK - child inherits micro-state accounting
pc: genunix:turnstile_block+0x5fc: call unix:swtch
genunix:turnstile_block+0x5fc(0x3008d7e4ec0, 0x1, 0x60060d6a7b8, 0x1833c90, 0x0, 0x0)
unix:rw_enter_sleep+0x19c(0x60060d6a7b8?)
unix:rw_enter(0x60060d6a7b8, 0x1) - frame recycled
zfs:dmu_zfetch_find+0x3c(0x60060d6a7b8, 0x2a1029b7488, 0x1)
zfs:dmu_zfetch+0x74(0x60060d6a7b8, 0xb6c8000, 0x4000, 0x1)
zfs:dbuf_read+0x120(0x3006e75dd78, 0x0, 0x2)
zfs:dnode_hold_impl+0xbc(, , 0x1?)
zfs:dnode_hold(0x60052f6d5c0, 0x5b65a, 0x30061a75c40, 0x2a1029b7700) - frame recycled
zfs:dmu_tx_hold_object_impl+0x20(0x30061a75c40, 0x60052c45780, 0x5b65a, 0x2, 0x0, 0x0)
zfs:dmu_tx_hold_bonus+0x18(0x30061a75c40, 0x5b65a, 0x0, 0x30061a75c40, 0x60052f6d5c0, 0x300028a6840)
zfs:zfs_write+0x3fc()
zfs:zfs_shim_write(0x3004bb6b100, 0x2a1029b7a98, 0x0, 0x60092fcef30, 0x0) - frame recycled
genunix:fop_write+0x20(0x3004bb6b100, 0x2a1029b7a98, 0x0, 0x60092fcef30, 0x0)
genunix:write+0x268(0x4)
unix:syscall_trap32+0xcc()
-- switch to user thread's user stack --
The routine run at time of panic was turnstile_block, this is an O/S routine, we need to understand what the O/S was doing at this time as we don't see anything of concern in biowait:
1 thread in biowait() found.
threads in biowait() by device:
count device (thread: max idle time)
1 331(vxio),15005 (0x3002d71bb40: 0.01 seconds) /zones/ss088zq5/root/oracle/DWD/sapdata4
SUN confirmed that a ZFS bug caused the panic
|
|
