How to pinpoint the cause of an Oracle RAC single-node restart

Resolving frequent automatic reboots of both hosts in a two-node RAC
Source: Linux社区 | Author: ccz320
1) Background:
I recently built a two-node Oracle 10g RAC test platform in VMware and upgraded the RAC stack from 10.2.0.1 to 10.2.0.5. Afterwards, both Linux guests began rebooting by themselves at frequent intervals.

2) Platform:
VMware 7 + OEL 5.7 x86_64 + ASMLib 2.0 + Oracle 10.2.0.5

3) /var/log/messages:

NODE1 (Linux1), the first entries after the reboot (the node came back up at 20:44:18):
Apr 18 20:44:18 Linux1 syslogd 1.4.1: restart.
Apr 18 20:44:18 Linux1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Apr 18 20:44:18 Linux1 kernel: Initializing cgroup subsys cpuset
Apr 18 20:44:18 Linux1 kernel: Initializing cgroup subsys cpu
Apr 18 20:44:18 Linux1 kernel: Linux version 2.6.32-200.13.1.el5uek (mockbuild@ca-build9.) (gcc version 4.1.2 ( 4.1.2-50)) #1 SMP Wed Jul 27 21:02:33 EDT 2011
Apr 18 20:44:18 Linux1 kernel: Command line: ro root=/dev/VolGroup00/LogVol00 rhgb quiet
Apr 18 20:44:18 Linux1 kernel: KERNEL supported cpus:
Apr 18 20:44:18 Linux1 kernel:   Intel GenuineIntel
Apr 18 20:44:18 Linux1 kernel:   AMD AuthenticAMD
Apr 18 20:44:18 Linux1 kernel:   Centaur CentaurHauls
Apr 18 20:44:18 Linux1 kernel: BIOS-provided physical RAM map:
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: ... (usable/reserved memory-map entries; addresses were stripped in the source) ...
Apr 18 20:44:18 Linux1 kernel: DMI present.

NODE2 (Linux2), what the surviving node saw while Linux1 went down:
Apr 18 20:43:35 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 has been idle for 30.0 seconds, shutting it down.
Apr 18 20:43:35 Linux2 kernel: (swapper,0,0):o2net_idle_timer:1498 here are some times that might help debug the situation: (tmr .559806 now .306532 dr .559360 adv .807 func (b651ea27:504) .)
Apr 18 20:43:35 Linux2 kernel: o2net: no longer connected to node Linux1 (num 0) at 192.168.3.131:7777
Apr 18 20:43:56 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 7
Apr 18 20:44:05 Linux2 kernel: (o2net,3480,0):o2net_connect_expired:1659 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
Apr 18 20:44:24 Linux2 avahi-daemon[4341]: Registering new address record for 192.168.0.136 on eth0.
Apr 18 20:44:26 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 7
Apr 18 20:44:28 Linux2 last message repeated 2 times
Apr 18 20:44:28 Linux2 kernel: (o2hb-,3564,1):o2dlm_eviction_cb:267 o2dlm has evicted node 0 from group FE7
Apr 18 20:44:28 Linux2 kernel: (ocfs2rec,19793,1):ocfs2_replay_journal:1605 Recovering node 0 from slot 0 on device (8,65)
Apr 18 20:44:30 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 8
Apr 18 20:44:31 Linux2 kernel: (ocfs2rec,19793,0):ocfs2_begin_quota_recovery:407 Beginning quota recovery in slot 0
Apr 18 20:44:31 Linux2 kernel: (ocfs2_wq,3567,1):ocfs2_finish_quota_recovery:598 Finishing quota recovery in slot 0
Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_get_lock_resource:836 FE7:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_get_lock_resource:870 FE7: recovery map is not empty, but must master $RECOVERY lock now
Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_do_recovery:523 (3573) Node 1 is the Recovery Master for the Dead Node 0 for Domain FE7

The same pattern shows up on either machine in turn, so it is not always the same node that times out against the other.

4) Judging from the errors in /var/log/messages, the reboots should be caused by the o2cb network idle timeout being exceeded. The current O2CB service status on the system is:

[oracle@Linux1]$ service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 301
Network idle timeout: 30000        <-- in milliseconds; exactly the 30 seconds reported in /var/log/messages
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Active
Resolving frequent Oracle RAC node restarts
Source: 云栖社区 | Author: a community contributor

This article describes how a case of frequent Oracle RAC node restarts was tracked down and resolved.
This case came from a reader whose RAC cluster had been restarting frequently, roughly once every two days since September. Let's start with the logs.
Node 1's alert log:
Tue Oct 28 10:51:40 2014
Thread 1 advanced to log sequence 22792 (LGWR switch)
Current log# 107 seq# 22792 mem# 0: +ORADATA/portaldb/redo107.log
Tue Oct 28 10:57:16 2014
Thread 1 advanced to log sequence 22793 (LGWR switch)
Current log# 108 seq# 22793 mem# 0: +ORADATA/portaldb/redo108.log
Tue Oct 28 11:04:07 2014
Reconfiguration started (old inc 48, new inc 50)
List of instances:
1 (myinst: 1)
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Tue Oct 28 11:04:09 2014
Tue Oct 28 11:04:09 2014
Next, the CRS alert log on node 1:
07:19:57.145
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.
Removal of this node from cluster in 14.732 seconds
07:20:05.169
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.
Removal of this node from cluster in 6.708 seconds
07:20:09.175
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.
Removal of this node from cluster in 2.702 seconds
07:20:11.880
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation ; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
09:58:11.620
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.
Removal of this node from cluster in 14.141 seconds
09:58:18.634
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.
Removal of this node from cluster in 7.126 seconds
09:58:23.660
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.
Removal of this node from cluster in 2.100 seconds
09:58:25.763
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation ; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
14:31:07.140
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.
Removal of this node from cluster in 14.105 seconds
14:31:14.169
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.
Removal of this node from cluster in 7.075 seconds
14:31:19.181
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.
Removal of this node from cluster in 2.063 seconds
14:31:21.246
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation ; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
06:02:39.191
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.
Removal of this node from cluster in 14.748 seconds
06:02:47.197
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.
Removal of this node from cluster in 6.742 seconds
06:02:51.203
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.
Removal of this node from cluster in 2.736 seconds
06:02:53.941
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation ; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
06:04:02.765
[crsd(6815946)]CRS-2772:Server 'xxdb2' has been assigned to pool 'ora.portaldb'.
11:03:48.965
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.
Removal of this node from cluster in 14.023 seconds
11:03:55.990
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.
Removal of this node from cluster in 6.998 seconds
11:04:00.008
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.
Removal of this node from cluster in 2.979 seconds
11:04:02.988
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation ; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
11:04:05.992
[cssd(6095264)]CRS-1625:Node xxdb2, number 2, was manually shut down
11:04:05.998
[cssd(6095264)]CRS-1601:CSSD Reconfiguration complete. Active nodes are xxdb1 xxdb2 .
From node 1's CRS alert log, node evictions occurred on the 23rd, the 25th and the 28th. Based on this information alone, the problem looks network-related: node 1 repeatedly stops receiving network heartbeats from xxdb2 and then evicts it.
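Before going further, it is worth confirming which timeouts CSS is actually enforcing and which networks it has registered. These are standard 11.2 Grid Infrastructure commands; the grid home below is the path that appears in the log messages:

# run as the Grid Infrastructure owner (or root) on either node
/grid/product/11.2.0/bin/crsctl get css misscount      # network heartbeat timeout, 30 s by default
/grid/product/11.2.0/bin/crsctl get css disktimeout    # voting disk I/O timeout, 200 s by default
/grid/product/11.2.0/bin/oifcfg getif                  # which subnets are public vs. cluster_interconnect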
Node 1's ocssd.log is as follows:
11:03:47.010: [
CSSD][1029]clssscSelect: cookie accept request
11:03:47.010: [
CSSD][1029]clssgmAllocProc: (113c7b590) allocated
11:03:47.016: [
CSSD][1029]clssgmClientConnectMsg: properties of cmProc 113c7b590 - 0,1,2,3,4
11:03:47.016: [
CSSD][1029]clssgmClientConnectMsg: Connect from con(1d287ee) proc(113c7b590) pid() version 11:2:1:4, properties: 0,1,2,3,4
11:03:47.016: [
CSSD][1029]clssgmClientConnectMsg: msg flags 0x0000
11:03:47.061: [
CSSD][2577]clssnmSetupReadLease: status 1
11:03:48.965: [
CSSD][3605]clssnmPollingThread: node xxdb2 (2) at 50% heartbeat fatal, removal in 14.023 seconds
11:03:48.965: [
CSSD][3605]clssnmPollingThread: node xxdb2 (2) is impending reconfig, flag 2294796, misstime 15977
11:03:48.965: [
CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
11:03:48.965: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:03:49.611: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:03:49.612: [
CSSD][2577]clssnmSetupReadLease: status 1
11:03:49.617: [
CSSD][2577]clssnmCompleteGMReq: Completed request type 17 with status 1
11:03:49.617: [
CSSD][2577]clssgmDoneQEle: re-queueing req 112cdfd50 status 1
11:03:49.619: [
CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 17 for proc (113c9aad0), operation status 1, client status 0
11:03:49.633: [
CSSD][2577]clssnmCompleteGMReq: Completed request type 18 with status 1
11:03:49.633: [
CSSD][2577]clssgmDoneQEle: re-queueing req 112cdfd50 status 1
11:03:49.635: [
CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 18 for proc (113c9aad0), operation status 1, client status 0
11:03:49.671: [
CSSD][1029]clssnmGetNodeNumber: xxdb1
11:03:49.725: [
CSSD][1029]clssnmGetNodeNumber: xxdb2
11:03:49.969: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:03:50.970: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:03:51.248: [
CSSD][3862]clssnmSendingThread: sending status msg to all nodes
11:03:51.248: [
CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
11:03:51.975: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:00.007: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:00.008: [
CSSD][3605]clssnmPollingThread: node xxdb2 (2) at 90% heartbeat fatal, removal in 2.979 seconds, seedhbimpd 1
11:04:01.010: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:02.012: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:02.988: [
CSSD][3605]clssnmPollingThread: Removal started for node xxdb2 (2), flags 0x23040c, state 3, wt4c 0
11:04:02.988: [
CSSD][3605]clssnmMarkNodeForRemoval: node 2, xxdb2 marked for removal
11:04:02.988: [
CSSD][3605]clssnmDiscHelper: xxdb2, node(2) connection failed, endp (c1da79), probe(0), ninf-&endp c1da79
11:04:02.988: [
CSSD][3605]clssnmDiscHelper: node 2 clean up, endp (c1da79), init state 5, cur state 5
11:04:02.988: [GIPCXCPT][3605] gipcInternalDissociate: obj
[c1da79] { gipcEndpoint : localAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', remoteAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
11:04:02.988: [GIPCXCPT][3605] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3436]: EXCEPTION[ ret gipcretFail (1) ]
failed to dissociate obj
[c1da79] { gipcEndpoint : localAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', remoteAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 }, flags 0x0
11:04:02.988: [
CSSD][4119]clssnmDoSyncUpdate: Initiating sync
11:04:02.988: [
CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress
val 1, from -1, changes 61
11:04:02.988: [
CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
11:04:02.988: [
CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
11:04:02.988: [
CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation
11:04:02.988: [
CSSD][4119]clssnmSetupAckWait: Ack message type (11)
11:04:02.988: [
CSSD][4119]clssnmSetupAckWait: node(1) is ALIVE
11:04:02.988: [
CSSD][4119]clssnmSendSync: syncSeqNo(), indicating EXADATA fence initialization complete
11:04:02.988: [
CSSD][4119]clssnmNeedConfReq: No configuration to change
11:04:02.988: [
CSSD][4119]clssnmDoSyncUpdate: Terminating node 2, xxdb2, misstime(30001) state(5)
11:04:02.988: [
CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
11:04:02.988: [
CSSD][4119]clssnmCheckDskInfo: Checking disk info...
11:04:02.988: [
CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 1, events 0xa0, state 0x0
11:04:02.988: [
CSSD][4119]clssnmCheckSplit: Node 2, xxdb2, is alive, DHB (, ) more than disk timeout of 27000 after the last NHB (, )
11:04:02.988: [
CSSD][4119]clssnmCheckDskInfo: My cohort: 1
11:04:02.988: [
CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 0, events 0x20, state 0x0
11:04:02.988: [
CSSD][4119]clssnmRemove: Start
11:04:02.988: [
CSSD][4119](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, xxdb2, from the cluster in incarnation , node birth incarnation , death incarnation , stateflags 0x234000 uniqueness value
11:04:02.989: [
CSSD][4119]clssnmrFenceSage: Fenced node xxdb2, number 2, with EXADATA, handle 0
11:04:02.989: [
CSSD][4119]clssnmSendShutdown: req to node 2, kill time
11:04:02.989: [
CSSD][4119]clssnmsendmsg: not connected to node 2
It is clear that on the 28th Oracle's clssnmPollingThread reported trouble; ocssd relies on this thread to decide whether the other cluster nodes' heartbeats are still healthy. The "has a disk HB, but no network HB" messages above indicate that node 2 was still writing its disk heartbeat but had stopped delivering network heartbeats.
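"A disk HB but no network HB" points at the private interconnect rather than at storage, so the next step is to exercise the interconnect directly from each node. A rough sketch; the peer address is a placeholder, not taken from the case:

# from node 1, test node 2's private interconnect address (and the reverse from node 2)
ping <xxdb2-private-ip>
traceroute <xxdb2-private-ip>
netstat -i        # watch the error and drop columns for the interconnect adapter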
The focus now shifts to node 2's logs. First, node 2's alert log:
Tue Oct 28 10:59:40 2014
Thread 2 advanced to log sequence 26516 (LGWR switch)
Current log# 208 seq# 26516 mem# 0: +ORADATA/portaldb/redo208.log
Tue Oct 28 11:04:18 2014
NOTE: ASMB terminating
Errors in file /oracle/diag/rdbms/portaldb/portaldb2/trace/portaldb2_asmb_.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1025 Serial number: 3
Errors in file /oracle/diag/rdbms/portaldb/portaldb2/trace/portaldb2_asmb_.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1025 Serial number: 3
ASMB (ospid: ): terminating the instance due to error 15064
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid () as a result of ORA-1092
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid () as a result of ORA-1092
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid () as a result of ORA-1092
Tue Oct 28 11:04:18 2014
ORA-1092 : opitsk aborting process
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (7864418) as a result of ORA-1092
Instance terminated by ASMB, pid =
Tue Oct 28 11:05:45 2014
Starting ORACLE instance (normal)
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /oracle/product/11.2.0
System name: AIX
Node name: xxdb2
Release: 1
Version: 7
The database instance's alert log gives us little of value: ASMB simply lost its connection to the ASM instance (ORA-15064) and terminated the instance. Next, the most critical piece, node 2's ocssd.log:
11:03:58.792: [
CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
11:04:01.217: [
CSSD][3605]clssnmPollingThread: node xxdb1 (1) at 50% heartbeat fatal, removal in 14.773 seconds
11:04:01.217: [
CSSD][3605]clssnmPollingThread: node xxdb1 (1) is impending reconfig, flag 2491406, misstime 15227
11:04:01.217: [
CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
11:04:01.217: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:13.242: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:13.243: [
CSSD][3605]clssnmPollingThread: node xxdb1 (1) at 90% heartbeat fatal, removal in 2.746 seconds, seedhbimpd 1
11:04:14.244: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:14.843: [
CSSD][3862]clssnmSendingThread: sending status msg to all nodes
11:04:14.843: [
CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
11:04:15.246: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:15.994: [
CSSD][3605]clssnmPollingThread: Removal started for node xxdb1 (1), flags 0x26040e, state 3, wt4c 0
11:04:15.994: [
CSSD][3605]clssnmMarkNodeForRemoval: node 1, xxdb1 marked for removal
11:04:15.994: [
CSSD][3605]clssnmDiscHelper: xxdb1, node(1) connection failed, endp (90e), probe(0), ninf-&endp 90e
11:04:15.994: [
CSSD][3605]clssnmDiscHelper: node 1 clean up, endp (90e), init state 5, cur state 5
11:04:15.994: [GIPCXCPT][3605] gipcInternalDissociate: obj
[090e] { gipcEndpoint : localAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', remoteAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x38606, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
11:04:15.994: [GIPCXCPT][3605] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3436]: EXCEPTION[ ret gipcretFail (1) ]
failed to dissociate obj
[090e] { gipcEndpoint : localAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', remoteAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x38606, usrFlags 0x0 }, flags 0x0
11:04:15.994: [
CSSD][4119]clssnmDoSyncUpdate: Initiating sync
11:04:15.994: [
CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress
val 2, from -1, changes 4
11:04:15.994: [
CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
11:04:15.994: [
CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
11:04:15.994: [
CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation
11:04:15.994: [
CSSD][4119]clssnmSetupAckWait: Ack message type (11)
11:04:15.994: [
CSSD][4119]clssnmSetupAckWait: node(2) is ALIVE
11:04:15.994: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:15.994: [
CSSD][4119]clssnmSendSync: syncSeqNo(), indicating EXADATA fence initialization complete
11:04:15.994: [
CSSD][4119]clssnmDoSyncUpdate: Terminating node 1, xxdb1, misstime(30004) state(5)
11:04:15.995: [
CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
11:04:15.995: [
CSSD][4119]clssnmCheckDskInfo: Checking disk info...
11:04:15.995: [
CSSD][4119]clssnmCheckSplit: Node 1, xxdb1, is alive, DHB (, ) more than disk timeout of 27000 after the last NHB (, )
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 2, events 0xa0, state 0x0
11:04:15.995: [
CSSD][4119]clssnmCheckDskInfo: My cohort: 2
11:04:15.995: [
CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1
11:04:15.995: [
CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, xxdb2, is smaller than cohort of 1 nodes led by node 1, xxdb1, based on map type 2
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 2, events 0x0, state 0x0
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(CRF-) count(3) master(0) event(2), incarn 51, mbrc 3, to member 2, events 0x38, state 0x0
11:04:15.995: [
CSSD][4119]###################################
11:04:15.995: [
CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
11:04:15.995: [
CSSD][4119]###################################
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(ocr_xxdb-scan) count(2) master(1) event(2), incarn 20, mbrc 2, to member 2, events 0x78, state 0x0
11:04:15.995: [
CSSD][4119](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.ONSPROC.MASTER) count(2) master(1) event(2), incarn 20, mbrc 2, to member 2, events 0xa0, state 0x0
11:04:15.995: [    CSSD][4119]
----- Call Stack Trace -----
calling                       call     entry                argument values in hex
location                      type     point                (? means dubious value)
--------------------          -------- -------------------- ----------------------------
clssscExit()+708              call     ...                  (argument values stripped in the source)
clssnmCheckDskInfo()+1600     call     clssscExit()
clssnmDoSyncUpdate()+4016     call     clssnmCheckDskInfo()
clssnmRcfgMgrThread()+2992    call     clssnmDoSyncUpdate()
clssscthrdmain()+20           call     clssnmRcfgMgrThread()
_pthread_body()+240           call     clssscthrdmain()
----- End of Call Stack Trace -----
Other CSSD threads kept logging while the stack was being dumped:
11:04:15.996: [    CSSD][1]clssgmUpdateEventValue: CmInfo State  val 2, changes 13
11:04:15.996: [    CSSD][1]clssgmUpdateEventValue: ConnectedNodes  val , changes 5
11:04:15.996: [    CSSD][1]clssgmCleanupNodeContexts():  cleaning up nodes, rcfg()
11:04:15.996: [    CSSD][1]clssgmCleanupNodeContexts():  successful cleanup of nodes rcfg()
11:04:15.996: [    CSSD][1]clssgmStartNMMon:  completed node cleanup
11:04:15.996: [    CSSD][3348]clssgmUpdateEventValue: HoldRequest  val 1, changes 3
11:04:16.052: [
CSSD][4119]clssnmSendMeltdownStatus: node xxdb2, number 2, has experienced a failure in thread number 3 and is shutting down
11:04:16.052: [
CSSD][4119]clssscExit: Starting CRSD cleanup
11:04:16.052: [
CSSD][2320]clssnmvDiskKillCheck: not evicted, file /dev/rhdisk22 flags 0x, kill block unique 0, my unique
11:04:16.995: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:17.004: [
CSSD][3605]clssnmPollingThread: state(3) clusterState(2) exit
11:04:17.004: [
CSSD][3605]clssscExit: abort already set 1
11:04:17.053: [
CSSD][2320](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node xxdb1, number 1, sync , stamp
11:04:17.053: [
CSSD][2320]clssscExit: abort already set 1
11:04:17.995: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:18.600: [
CSSD][4119]clssscExit: CRSD cleanup successfully completed
11:04:18.602: [ default][4119]kgzf_gen_node_reid2: generated reid cid=2ccaf96bf8ea33b240c4c97,icin=,nmn=2,lnid=,gid=0,gin=0,gmn=0,umemid=0,opid=0,opsn=0,lvl=node hdr=0xfece0100
11:04:18.602: [
CSSD][4119]clssnmrFenceSage: Fenced node xxdb2, number 2, with EXADATA, handle 0
11:04:18.602: [
CSSD][4119]clssgmUpdateEventValue: CmInfo State
val 0, changes 14
11:04:18.602: [
CSSD][1029]clssgmProcClientReqs: Checking RPC Q
11:04:18.602: [
CSSD][1029]clssgmProcClientReqs: Checking dead client Q
11:04:18.602: [
CSSD][1029]clssgmProcClientReqs: Checking dead proc Q
11:04:18.602: [
CSSD][1029]clssgmSendShutdown: Aborting client () proc (111dceb10), iocapables 1.
11:04:18.602: [
CSSD][1029]clssgmSendShutdown: I/O capable proc (111dceb10), pid (), iocapables 1, client ()
11:04:18.602: [
CSSD][1029]clssgmSendShutdown: Aborting client () proc (111d29b50), iocapables 2.
Filtering node 2's ocssd.log down to its key lines, we find the following:
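One way to pull out these lines (a sketch; the path assumes node 2 uses the same layout as the node-1 path shown in the CRS alert messages earlier):

grep -E 'clssscExit|clssnmCheckDskInfo|clssnmvDiskKillCheck|CSSNM000' /grid/product/11.2.0/log/xxdb2/cssd/ocssd.log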
06:03:07.804: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
06:03:07.804: [
CSSD][2320](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node xxdb1, number 1, sync , stamp
06:03:07.804: [
CSSD][2320]###################################
06:03:07.804: [
CSSD][4376]clssnmHandleSync: Node xxdb2, number 2, is EXADATA fence capable
06:03:07.804: [
CSSD][2320]clssscExit: CSSD aborting from thread clssnmvKillBlockThread
06:03:07.804: [
CSSD][4119]clssnmSendSync: syncSeqNo()
06:03:07.804: [
CSSD][2320]###################################
11:04:15.995: [
CSSD][4119]clssnmCheckSplit: Node 1, xxdb1, is alive, DHB (, ) more than disk timeout of 27000 after the last NHB (, )
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 2, events 0xa0, state 0x0
11:04:15.995: [
CSSD][4119]clssnmCheckDskInfo: My cohort: 2
11:04:15.995: [
CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1
11:04:15.995: [
CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, xxdb2, is smaller than cohort of 1 nodes led by node 1, xxdb1, based on map type 2
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 2, events 0x0, state 0x0
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(CRF-) count(3) master(0) event(2), incarn 51, mbrc 3, to member 2, events 0x38, state 0x0
11:04:15.995: [
CSSD][4119]###################################
11:04:15.995: [
CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
11:04:15.995: [
CSSD][4119]###################################
Looking just at the 25th and the 28th, the two failures are actually different. From the key lines above, the CSSD abort on the 25th came from the clssnmvKillBlockThread thread (":CSSNM00005: Aborting, evicted by node xxdb1", i.e. node 2 found a kill block written to the voting disk), while on the 28th CSSD aborted from clssnmRcfgMgrThread after clssnmCheckDskInfo decided to abort the local node to avoid a split-brain (":CSSNM00008:").
Clearly the two threads are of completely different kinds: the first one operates against the voting disk, while the second is driven by the network heartbeat and cluster reconfiguration.
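Since one of the two aborts went through the voting-disk path, it is also worth confirming that the voting disks are healthy and readable from both nodes. A sketch using standard 11.2 commands; /dev/rhdisk22 is the device named in the kill-block message above:

# run as the Grid Infrastructure owner
/grid/product/11.2.0/bin/crsctl query css votedisk     # list the voting disks and their state

# confirm each node can still read the device the kill-block check used
dd if=/dev/rhdisk22 of=/dev/null bs=8192 count=128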
Finally, I asked the DBA what had changed recently: reportedly the trouble started right after a network switch was replaced, which fits the picture of a flaky private interconnect.
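If a replaced switch is the suspect, the error and drop counters on the interconnect adapters (and on the switch ports) are the first things to watch. On AIX (the platform in this case) a typical check looks like the following; the adapter name ent1 is only an assumption:

# run as root on each node
netstat -v                                                 # per-adapter statistics, including error counts
entstat -d ent1 | grep -i -E 'error|overrun|no resource|collision'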
That is the whole case, shared here for reference!