How to pinpoint the cause of an Oracle RAC single-node restart

Resolving frequent automatic reboots of both hosts in a two-node RAC
Source: Linux社区 | Author: ccz320
1) Background:
I recently built a two-node Oracle 10g RAC test platform in VMware and upgraded the RAC stack from 10.2.0.1 to 10.2.0.5. Afterwards, both Linux guests began rebooting by themselves at frequent intervals.

2) Platform:
VMware 7 + OEL 5.7 x86_64 + ASMLib 2.0 + Oracle 10.2.0.5

3) /var/log/messages:

NODE1 (Linux1), the first entries after the reboot (the node came back up at 20:44:18):
Apr 18 20:44:18 Linux1 syslogd 1.4.1: restart.
Apr 18 20:44:18 Linux1 kernel: klogd 1.4.1, log source = /proc/kmsg started.
Apr 18 20:44:18 Linux1 kernel: Initializing cgroup subsys cpuset
Apr 18 20:44:18 Linux1 kernel: Initializing cgroup subsys cpu
Apr 18 20:44:18 Linux1 kernel: Linux version 2.6.32-200.13.1.el5uek (mockbuild@ca-build9.) (gcc version 4.1.2 ( 4.1.2-50)) #1 SMP Wed Jul 27 21:02:33 EDT 2011
Apr 18 20:44:18 Linux1 kernel: Command line: ro root=/dev/VolGroup00/LogVol00 rhgb quiet
Apr 18 20:44:18 Linux1 kernel: KERNEL supported cpus:
Apr 18 20:44:18 Linux1 kernel:   Intel GenuineIntel
Apr 18 20:44:18 Linux1 kernel:   AMD AuthenticAMD
Apr 18 20:44:18 Linux1 kernel:   Centaur CentaurHauls
Apr 18 20:44:18 Linux1 kernel: BIOS-provided physical RAM map:
Apr 18 20:44:18 Linux1 kernel: BIOS-e820: ... (usable/reserved memory-map entries; addresses were stripped in the source) ...
Apr 18 20:44:18 Linux1 kernel: DMI present.

NODE2 (Linux2), what the surviving node saw while Linux1 went down:
Apr 18 20:43:35 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 has been idle for 30.0 seconds, shutting it down.
Apr 18 20:43:35 Linux2 kernel: (swapper,0,0):o2net_idle_timer:1498 here are some times that might help debug the situation: (tmr .559806 now .306532 dr .559360 adv .807 func (b651ea27:504) .)
Apr 18 20:43:35 Linux2 kernel: o2net: no longer connected to node Linux1 (num 0) at 192.168.3.131:7777
Apr 18 20:43:56 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 7
Apr 18 20:44:05 Linux2 kernel: (o2net,3480,0):o2net_connect_expired:1659 ERROR: no connection established with node 0 after 30.0 seconds, giving up and returning errors.
Apr 18 20:44:24 Linux2 avahi-daemon[4341]: Registering new address record for 192.168.0.136 on eth0.
Apr 18 20:44:26 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 7
Apr 18 20:44:28 Linux2 last message repeated 2 times
Apr 18 20:44:28 Linux2 kernel: (o2hb-,3564,1):o2dlm_eviction_cb:267 o2dlm has evicted node 0 from group FE7
Apr 18 20:44:28 Linux2 kernel: (ocfs2rec,19793,1):ocfs2_replay_journal:1605 Recovering node 0 from slot 0 on device (8,65)
Apr 18 20:44:30 Linux2 kernel: o2net: connection to node Linux1 (num 0) at 192.168.3.131:7777 shutdown, state 8
Apr 18 20:44:31 Linux2 kernel: (ocfs2rec,19793,0):ocfs2_begin_quota_recovery:407 Beginning quota recovery in slot 0
Apr 18 20:44:31 Linux2 kernel: (ocfs2_wq,3567,1):ocfs2_finish_quota_recovery:598 Finishing quota recovery in slot 0
Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_get_lock_resource:836 FE7:$RECOVERY: at least one node (0) to recover before lock mastery can begin
Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_get_lock_resource:870 FE7: recovery map is not empty, but must master $RECOVERY lock now
Apr 18 20:44:31 Linux2 kernel: (dlm_reco_thread,3573,0):dlm_do_recovery:523 (3573) Node 1 is the Recovery Master for the Dead Node 0 for Domain FE7

The same pattern shows up on either machine in turn, so it is not always the same node that times out against the other.

4) Judging from the errors in /var/log/messages, the reboots should be caused by the o2cb network idle timeout being exceeded. The current O2CB service status on the system is:

[oracle@Linux1]$ service o2cb status
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster ocfs2: Online
Heartbeat dead threshold = 301
Network idle timeout: 30000        <-- in milliseconds; exactly the 30 seconds reported in /var/log/messages
Network keepalive delay: 2000
Network reconnect delay: 2000
Checking O2CB heartbeat: Active
Resolving frequent Oracle RAC node restarts
Source: 云栖社区 | Author: a community contributor

This article describes how a case of frequent Oracle RAC node restarts was tracked down and resolved.
This case came from a reader whose RAC cluster had been restarting frequently, roughly once every two days since September. Let's start with the logs.
Node 1's alert log:
Tue Oct 28 10:51:40 2014
Thread 1 advanced to log sequence 22792 (LGWR switch)
Current log# 107 seq# 22792 mem# 0: +ORADATA/portaldb/redo107.log
Tue Oct 28 10:57:16 2014
Thread 1 advanced to log sequence 22793 (LGWR switch)
Current log# 108 seq# 22793 mem# 0: +ORADATA/portaldb/redo108.log
Tue Oct 28 11:04:07 2014
Reconfiguration started (old inc 48, new inc 50)
List of instances:
1 (myinst: 1)
Global Resource Directory frozen
* dead instance detected - domain 0 invalid = TRUE
Communication channels reestablished
Master broadcasted resource hash value bitmaps
Non-local Process blocks cleaned out
Tue Oct 28 11:04:09 2014
Tue Oct 28 11:04:09 2014
Next, the CRS alert log on node 1:
07:19:57.145
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.
Removal of this node from cluster in 14.732 seconds
07:20:05.169
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.
Removal of this node from cluster in 6.708 seconds
07:20:09.175
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.
Removal of this node from cluster in 2.702 seconds
07:20:11.880
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation ; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
09:58:11.620
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.
Removal of this node from cluster in 14.141 seconds
09:58:18.634
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.
Removal of this node from cluster in 7.126 seconds
09:58:23.660
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.
Removal of this node from cluster in 2.100 seconds
09:58:25.763
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation ; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
14:31:07.140
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.
Removal of this node from cluster in 14.105 seconds
14:31:14.169
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.
Removal of this node from cluster in 7.075 seconds
14:31:19.181
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.
Removal of this node from cluster in 2.063 seconds
14:31:21.246
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation ; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
06:02:39.191
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.
Removal of this node from cluster in 14.748 seconds
06:02:47.197
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.
Removal of this node from cluster in 6.742 seconds
06:02:51.203
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.
Removal of this node from cluster in 2.736 seconds
06:02:53.941
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation ; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
06:04:02.765
[crsd(6815946)]CRS-2772:Server 'xxdb2' has been assigned to pool 'ora.portaldb'.
11:03:48.965
[cssd(6095264)]CRS-1612:Network communication with node xxdb2 (2) missing for 50% of timeout interval.
Removal of this node from cluster in 14.023 seconds
11:03:55.990
[cssd(6095264)]CRS-1611:Network communication with node xxdb2 (2) missing for 75% of timeout interval.
Removal of this node from cluster in 6.998 seconds
11:04:00.008
[cssd(6095264)]CRS-1610:Network communication with node xxdb2 (2) missing for 90% of timeout interval.
Removal of this node from cluster in 2.979 seconds
11:04:02.988
[cssd(6095264)]CRS-1607:Node xxdb2 is being evicted in cluster incarnation ; details at (:CSSNM00007:) in /grid/product/11.2.0/log/xxdb1/cssd/ocssd.log.
11:04:05.992
[cssd(6095264)]CRS-1625:Node xxdb2, number 2, was manually shut down
11:04:05.998
[cssd(6095264)]CRS-1601:CSSD Reconfiguration complete. Active nodes are xxdb1 xxdb2 .
From node 1's CRS alert log, node evictions occurred on the 23rd, the 25th and the 28th. Based on this information alone, the problem looks network-related: node 1 repeatedly stops receiving network heartbeats from xxdb2 and then evicts it.
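Before going further, it is worth confirming which timeouts CSS is actually enforcing and which networks it has registered. These are standard 11.2 Grid Infrastructure commands; the grid home below is the path that appears in the log messages:

# run as the Grid Infrastructure owner (or root) on either node
/grid/product/11.2.0/bin/crsctl get css misscount      # network heartbeat timeout, 30 s by default
/grid/product/11.2.0/bin/crsctl get css disktimeout    # voting disk I/O timeout, 200 s by default
/grid/product/11.2.0/bin/oifcfg getif                  # which subnets are public vs. cluster_interconnect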
Node 1's ocssd.log is as follows:
11:03:47.010: [
CSSD][1029]clssscSelect: cookie accept request
11:03:47.010: [
CSSD][1029]clssgmAllocProc: (113c7b590) allocated
11:03:47.016: [
CSSD][1029]clssgmClientConnectMsg: properties of cmProc 113c7b590 - 0,1,2,3,4
11:03:47.016: [
CSSD][1029]clssgmClientConnectMsg: Connect from con(1d287ee) proc(113c7b590) pid() version 11:2:1:4, properties: 0,1,2,3,4
11:03:47.016: [
CSSD][1029]clssgmClientConnectMsg: msg flags 0x0000
11:03:47.061: [
CSSD][2577]clssnmSetupReadLease: status 1
11:03:48.965: [
CSSD][3605]clssnmPollingThread: node xxdb2 (2) at 50% heartbeat fatal, removal in 14.023 seconds
11:03:48.965: [
CSSD][3605]clssnmPollingThread: node xxdb2 (2) is impending reconfig, flag 2294796, misstime 15977
11:03:48.965: [
CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
11:03:48.965: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:03:49.611: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:03:49.612: [
CSSD][2577]clssnmSetupReadLease: status 1
11:03:49.617: [
CSSD][2577]clssnmCompleteGMReq: Completed request type 17 with status 1
11:03:49.617: [
CSSD][2577]clssgmDoneQEle: re-queueing req 112cdfd50 status 1
11:03:49.619: [
CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 17 for proc (113c9aad0), operation status 1, client status 0
11:03:49.633: [
CSSD][2577]clssnmCompleteGMReq: Completed request type 18 with status 1
11:03:49.633: [
CSSD][2577]clssgmDoneQEle: re-queueing req 112cdfd50 status 1
11:03:49.635: [
CSSD][1029]clssgmCheckReqNMCompletion: Completing request type 18 for proc (113c9aad0), operation status 1, client status 0
11:03:49.671: [
CSSD][1029]clssnmGetNodeNumber: xxdb1
11:03:49.725: [
CSSD][1029]clssnmGetNodeNumber: xxdb2
11:03:49.969: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:03:50.970: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:03:51.248: [
CSSD][3862]clssnmSendingThread: sending status msg to all nodes
11:03:51.248: [
CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
11:03:51.975: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:00.007: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:00.008: [
CSSD][3605]clssnmPollingThread: node xxdb2 (2) at 90% heartbeat fatal, removal in 2.979 seconds, seedhbimpd 1
11:04:01.010: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:02.012: [
CSSD][2577]clssnmvDHBValidateNCopy: node 2, xxdb2, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:02.988: [
CSSD][3605]clssnmPollingThread: Removal started for node xxdb2 (2), flags 0x23040c, state 3, wt4c 0
11:04:02.988: [
CSSD][3605]clssnmMarkNodeForRemoval: node 2, xxdb2 marked for removal
11:04:02.988: [
CSSD][3605]clssnmDiscHelper: xxdb2, node(2) connection failed, endp (c1da79), probe(0), ninf-&endp c1da79
11:04:02.988: [
CSSD][3605]clssnmDiscHelper: node 2 clean up, endp (c1da79), init state 5, cur state 5
11:04:02.988: [GIPCXCPT][3605] gipcInternalDissociate: obj
[c1da79] { gipcEndpoint : localAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', remoteAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
11:04:02.988: [GIPCXCPT][3605] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3436]: EXCEPTION[ ret gipcretFail (1) ]
failed to dissociate obj
[c1da79] { gipcEndpoint : localAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', remoteAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x138606, usrFlags 0x0 }, flags 0x0
11:04:02.988: [
CSSD][4119]clssnmDoSyncUpdate: Initiating sync
11:04:02.988: [
CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress
val 1, from -1, changes 61
11:04:02.988: [
CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
11:04:02.988: [
CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
11:04:02.988: [
CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation
11:04:02.988: [
CSSD][4119]clssnmSetupAckWait: Ack message type (11)
11:04:02.988: [
CSSD][4119]clssnmSetupAckWait: node(1) is ALIVE
11:04:02.988: [
CSSD][4119]clssnmSendSync: syncSeqNo(), indicating EXADATA fence initialization complete
11:04:02.988: [
CSSD][4119]clssnmNeedConfReq: No configuration to change
11:04:02.988: [
CSSD][4119]clssnmDoSyncUpdate: Terminating node 2, xxdb2, misstime(30001) state(5)
11:04:02.988: [
CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
11:04:02.988: [
CSSD][4119]clssnmCheckDskInfo: Checking disk info...
11:04:02.988: [
CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 1, events 0xa0, state 0x0
11:04:02.988: [
CSSD][4119]clssnmCheckSplit: Node 2, xxdb2, is alive, DHB (, ) more than disk timeout of 27000 after the last NHB (, )
11:04:02.988: [
CSSD][4119]clssnmCheckDskInfo: My cohort: 1
11:04:02.988: [
CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 0, events 0x20, state 0x0
11:04:02.988: [
CSSD][4119]clssnmRemove: Start
11:04:02.988: [
CSSD][4119](:CSSNM00007:)clssnmrRemoveNode: Evicting node 2, xxdb2, from the cluster in incarnation , node birth incarnation , death incarnation , stateflags 0x234000 uniqueness value
11:04:02.989: [
CSSD][4119]clssnmrFenceSage: Fenced node xxdb2, number 2, with EXADATA, handle 0
11:04:02.989: [
CSSD][4119]clssnmSendShutdown: req to node 2, kill time
11:04:02.989: [
CSSD][4119]clssnmsendmsg: not connected to node 2
It is clear that on the 28th Oracle's clssnmPollingThread reported trouble; ocssd relies on this thread to decide whether the other cluster nodes' heartbeats are still healthy. The "has a disk HB, but no network HB" messages above indicate that node 2 was still writing its disk heartbeat but had stopped delivering network heartbeats.
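"A disk HB but no network HB" points at the private interconnect rather than at storage, so the next step is to exercise the interconnect directly from each node. A rough sketch; the peer address is a placeholder, not taken from the case:

# from node 1, test node 2's private interconnect address (and the reverse from node 2)
ping <xxdb2-private-ip>
traceroute <xxdb2-private-ip>
netstat -i        # watch the error and drop columns for the interconnect adapter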
The focus now shifts to node 2's logs. First, node 2's alert log:
Tue Oct 28 10:59:40 2014
Thread 2 advanced to log sequence 26516 (LGWR switch)
Current log# 208 seq# 26516 mem# 0: +ORADATA/portaldb/redo208.log
Tue Oct 28 11:04:18 2014
NOTE: ASMB terminating
Errors in file /oracle/diag/rdbms/portaldb/portaldb2/trace/portaldb2_asmb_.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1025 Serial number: 3
Errors in file /oracle/diag/rdbms/portaldb/portaldb2/trace/portaldb2_asmb_.trc:
ORA-15064: communication failure with ASM instance
ORA-03113: end-of-file on communication channel
Process ID:
Session ID: 1025 Serial number: 3
ASMB (ospid: ): terminating the instance due to error 15064
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid () as a result of ORA-1092
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid () as a result of ORA-1092
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid () as a result of ORA-1092
Tue Oct 28 11:04:18 2014
ORA-1092 : opitsk aborting process
Tue Oct 28 11:04:18 2014
opiodr aborting process unknown ospid (7864418) as a result of ORA-1092
Instance terminated by ASMB, pid =
Tue Oct 28 11:05:45 2014
Starting ORACLE instance (normal)
Starting up:
Oracle Database 11g Enterprise Edition Release 11.2.0.3.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP, Data Mining
and Real Application Testing options.
ORACLE_HOME = /oracle/product/11.2.0
System name: AIX
Node name: xxdb2
Release: 1
Version: 7
The database instance's alert log gives us little of value: ASMB simply lost its connection to the ASM instance (ORA-15064) and terminated the instance. Next, the most critical piece, node 2's ocssd.log:
11:03:58.792: [
CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
11:04:01.217: [
CSSD][3605]clssnmPollingThread: node xxdb1 (1) at 50% heartbeat fatal, removal in 14.773 seconds
11:04:01.217: [
CSSD][3605]clssnmPollingThread: node xxdb1 (1) is impending reconfig, flag 2491406, misstime 15227
11:04:01.217: [
CSSD][3605]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
11:04:01.217: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:13.242: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:13.243: [
CSSD][3605]clssnmPollingThread: node xxdb1 (1) at 90% heartbeat fatal, removal in 2.746 seconds, seedhbimpd 1
11:04:14.244: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:14.843: [
CSSD][3862]clssnmSendingThread: sending status msg to all nodes
11:04:14.843: [
CSSD][3862]clssnmSendingThread: sent 4 status msgs to all nodes
11:04:15.246: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:15.994: [
CSSD][3605]clssnmPollingThread: Removal started for node xxdb1 (1), flags 0x26040e, state 3, wt4c 0
11:04:15.994: [
CSSD][3605]clssnmMarkNodeForRemoval: node 1, xxdb1 marked for removal
11:04:15.994: [
CSSD][3605]clssnmDiscHelper: xxdb1, node(1) connection failed, endp (90e), probe(0), ninf-&endp 90e
11:04:15.994: [
CSSD][3605]clssnmDiscHelper: node 1 clean up, endp (90e), init state 5, cur state 5
11:04:15.994: [GIPCXCPT][3605] gipcInternalDissociate: obj
[090e] { gipcEndpoint : localAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', remoteAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x38606, usrFlags 0x0 } not associated with any container, ret gipcretFail (1)
11:04:15.994: [GIPCXCPT][3605] gipcDissociateF [clssnmDiscHelper : clssnm.c : 3436]: EXCEPTION[ ret gipcretFail (1) ]
failed to dissociate obj
[090e] { gipcEndpoint : localAddr 'gipcha://xxdb2:a2e7-bfa4-887f-6bc', remoteAddr 'gipcha://xxdb1:nm2_xxdb-scan/8be7-baa8-ace4-9d2', numPend 1, numReady 0, numDone 0, numDead 0, numTransfer 0, objFlags 0x0, pidPeer 0, flags 0x38606, usrFlags 0x0 }, flags 0x0
11:04:15.994: [
CSSD][4119]clssnmDoSyncUpdate: Initiating sync
11:04:15.994: [
CSSD][4119]clssscCompareSwapEventValue: changed NMReconfigInProgress
val 2, from -1, changes 4
11:04:15.994: [
CSSD][4119]clssnmDoSyncUpdate: local disk timeout set to 27000 ms, remote disk timeout set to 27000
11:04:15.994: [
CSSD][4119]clssnmDoSyncUpdate: new values for local disk timeout and remote disk timeout will take effect when the sync is completed.
11:04:15.994: [
CSSD][4119]clssnmDoSyncUpdate: Starting cluster reconfig with incarnation
11:04:15.994: [
CSSD][4119]clssnmSetupAckWait: Ack message type (11)
11:04:15.994: [
CSSD][4119]clssnmSetupAckWait: node(2) is ALIVE
11:04:15.994: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:15.994: [
CSSD][4119]clssnmSendSync: syncSeqNo(), indicating EXADATA fence initialization complete
11:04:15.994: [
CSSD][4119]clssnmDoSyncUpdate: Terminating node 1, xxdb1, misstime(30004) state(5)
11:04:15.995: [
CSSD][4119]clssnmDoSyncUpdate: Wait for 0 vote ack(s)
11:04:15.995: [
CSSD][4119]clssnmCheckDskInfo: Checking disk info...
11:04:15.995: [
CSSD][4119]clssnmCheckSplit: Node 1, xxdb1, is alive, DHB (, ) more than disk timeout of 27000 after the last NHB (, )
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 2, events 0xa0, state 0x0
11:04:15.995: [
CSSD][4119]clssnmCheckDskInfo: My cohort: 2
11:04:15.995: [
CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1
11:04:15.995: [
CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, xxdb2, is smaller than cohort of 1 nodes led by node 1, xxdb1, based on map type 2
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 2, events 0x0, state 0x0
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(CRF-) count(3) master(0) event(2), incarn 51, mbrc 3, to member 2, events 0x38, state 0x0
11:04:15.995: [
CSSD][4119]###################################
11:04:15.995: [
CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
11:04:15.995: [
CSSD][4119]###################################
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(ocr_xxdb-scan) count(2) master(1) event(2), incarn 20, mbrc 2, to member 2, events 0x78, state 0x0
11:04:15.995: [
CSSD][4119](:CSSSC00012:)clssscExit: A fatal error occurred and the CSS daemon is terminating abnormally
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.ONSPROC.MASTER) count(2) master(1) event(2), incarn 20, mbrc 2, to member 2, events 0xa0, state 0x0
11:04:15.995: [    CSSD][4119]
----- Call Stack Trace -----
calling                       call     entry                argument values in hex
location                      type     point                (? means dubious value)
--------------------          -------- -------------------- ----------------------------
clssscExit()+708              call     ...                  (argument values stripped in the source)
clssnmCheckDskInfo()+1600     call     clssscExit()
clssnmDoSyncUpdate()+4016     call     clssnmCheckDskInfo()
clssnmRcfgMgrThread()+2992    call     clssnmDoSyncUpdate()
clssscthrdmain()+20           call     clssnmRcfgMgrThread()
_pthread_body()+240           call     clssscthrdmain()
----- End of Call Stack Trace -----
Other CSSD threads kept logging while the stack was being dumped:
11:04:15.996: [    CSSD][1]clssgmUpdateEventValue: CmInfo State  val 2, changes 13
11:04:15.996: [    CSSD][1]clssgmUpdateEventValue: ConnectedNodes  val , changes 5
11:04:15.996: [    CSSD][1]clssgmCleanupNodeContexts():  cleaning up nodes, rcfg()
11:04:15.996: [    CSSD][1]clssgmCleanupNodeContexts():  successful cleanup of nodes rcfg()
11:04:15.996: [    CSSD][1]clssgmStartNMMon:  completed node cleanup
11:04:15.996: [    CSSD][3348]clssgmUpdateEventValue: HoldRequest  val 1, changes 3
11:04:16.052: [
CSSD][4119]clssnmSendMeltdownStatus: node xxdb2, number 2, has experienced a failure in thread number 3 and is shutting down
11:04:16.052: [
CSSD][4119]clssscExit: Starting CRSD cleanup
11:04:16.052: [
CSSD][2320]clssnmvDiskKillCheck: not evicted, file /dev/rhdisk22 flags 0x, kill block unique 0, my unique
11:04:16.995: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:17.004: [
CSSD][3605]clssnmPollingThread: state(3) clusterState(2) exit
11:04:17.004: [
CSSD][3605]clssscExit: abort already set 1
11:04:17.053: [
CSSD][2320](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node xxdb1, number 1, sync , stamp
11:04:17.053: [
CSSD][2320]clssscExit: abort already set 1
11:04:17.995: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
11:04:18.600: [
CSSD][4119]clssscExit: CRSD cleanup successfully completed
11:04:18.602: [ default][4119]kgzf_gen_node_reid2: generated reid cid=2ccaf96bf8ea33b240c4c97,icin=,nmn=2,lnid=,gid=0,gin=0,gmn=0,umemid=0,opid=0,opsn=0,lvl=node hdr=0xfece0100
11:04:18.602: [
CSSD][4119]clssnmrFenceSage: Fenced node xxdb2, number 2, with EXADATA, handle 0
11:04:18.602: [
CSSD][4119]clssgmUpdateEventValue: CmInfo State
val 0, changes 14
11:04:18.602: [
CSSD][1029]clssgmProcClientReqs: Checking RPC Q
11:04:18.602: [
CSSD][1029]clssgmProcClientReqs: Checking dead client Q
11:04:18.602: [
CSSD][1029]clssgmProcClientReqs: Checking dead proc Q
11:04:18.602: [
CSSD][1029]clssgmSendShutdown: Aborting client () proc (111dceb10), iocapables 1.
11:04:18.602: [
CSSD][1029]clssgmSendShutdown: I/O capable proc (111dceb10), pid (), iocapables 1, client ()
11:04:18.602: [
CSSD][1029]clssgmSendShutdown: Aborting client () proc (111d29b50), iocapables 2.
Filtering node 2's ocssd.log down to its key lines, we find the following:
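One way to pull out these lines (a sketch; the path assumes node 2 uses the same layout as the node-1 path shown in the CRS alert messages earlier):

grep -E 'clssscExit|clssnmCheckDskInfo|clssnmvDiskKillCheck|CSSNM000' /grid/product/11.2.0/log/xxdb2/cssd/ocssd.log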
06:03:07.804: [
CSSD][2577]clssnmvDHBValidateNCopy: node 1, xxdb1, has a disk HB, but no network HB, DHB has rcfg , wrtcnt, , LATS , lastSeqNo , uniqueness , timestamp /
06:03:07.804: [
CSSD][2320](:CSSNM00005:)clssnmvDiskKillCheck: Aborting, evicted by node xxdb1, number 1, sync , stamp
06:03:07.804: [
CSSD][2320]###################################
06:03:07.804: [
CSSD][4376]clssnmHandleSync: Node xxdb2, number 2, is EXADATA fence capable
06:03:07.804: [
CSSD][2320]clssscExit: CSSD aborting from thread clssnmvKillBlockThread
06:03:07.804: [
CSSD][4119]clssnmSendSync: syncSeqNo()
06:03:07.804: [
CSSD][2320]###################################
11:04:15.995: [
CSSD][4119]clssnmCheckSplit: Node 1, xxdb1, is alive, DHB (, ) more than disk timeout of 27000 after the last NHB (, )
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(CLSN.AQPROC.portaldb.MASTER) count(2) master(1) event(2), incarn 18, mbrc 2, to member 2, events 0xa0, state 0x0
11:04:15.995: [
CSSD][4119]clssnmCheckDskInfo: My cohort: 2
11:04:15.995: [
CSSD][4119]clssnmCheckDskInfo: Surviving cohort: 1
11:04:15.995: [
CSSD][4119](:CSSNM00008:)clssnmCheckDskInfo: Aborting local node to avoid splitbrain. Cohort of 1 nodes with leader 2, xxdb2, is smaller than cohort of 1 nodes led by node 1, xxdb1, based on map type 2
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(crs_version) count(3) master(0) event(2), incarn 21, mbrc 3, to member 2, events 0x0, state 0x0
11:04:15.995: [
CSSD][1]clssgmQueueGrockEvent: groupName(CRF-) count(3) master(0) event(2), incarn 51, mbrc 3, to member 2, events 0x38, state 0x0
11:04:15.995: [
CSSD][4119]###################################
11:04:15.995: [
CSSD][4119]clssscExit: CSSD aborting from thread clssnmRcfgMgrThread
11:04:15.995: [
CSSD][4119]###################################
Looking just at the 25th and the 28th, the two failures are actually different. From the key lines above, the CSSD abort on the 25th came from the clssnmvKillBlockThread thread (":CSSNM00005: Aborting, evicted by node xxdb1", i.e. node 2 found a kill block written to the voting disk), while on the 28th CSSD aborted from clssnmRcfgMgrThread after clssnmCheckDskInfo decided to abort the local node to avoid a split-brain (":CSSNM00008:").
Clearly the two threads are of completely different kinds: the first one operates against the voting disk, while the second is driven by the network heartbeat and cluster reconfiguration.
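Since one of the two aborts went through the voting-disk path, it is also worth confirming that the voting disks are healthy and readable from both nodes. A sketch using standard 11.2 commands; /dev/rhdisk22 is the device named in the kill-block message above:

# run as the Grid Infrastructure owner
/grid/product/11.2.0/bin/crsctl query css votedisk     # list the voting disks and their state

# confirm each node can still read the device the kill-block check used
dd if=/dev/rhdisk22 of=/dev/null bs=8192 count=128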
Finally, I asked the DBA what had changed recently: reportedly the trouble started right after a network switch was replaced, which fits the picture of a flaky private interconnect.
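If a replaced switch is the suspect, the error and drop counters on the interconnect adapters (and on the switch ports) are the first things to watch. On AIX (the platform in this case) a typical check looks like the following; the adapter name ent1 is only an assumption:

# run as root on each node
netstat -v                                                 # per-adapter statistics, including error counts
entstat -d ent1 | grep -i -E 'error|overrun|no resource|collision'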
That is the whole case, shared here for reference!