NetApp Storage Solution and Health-Check Commands
I. MCC Overview
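Clustered MetroCluster (MCC) is the active-active storage solution provided by NetApp Data ONTAP. The original design stretched a single dual-controller FAS/V-Series system across two data centers to form a geographically separated HA pair, with only one controller node per site. The two sites were connected through additional FC-VI cluster adapters, and the SAS disk shelves in the two data centers were connected through FibreBridge SAS-to-FC bridges. Within 500 m or inside the same machine room, the sites are connected directly through Fibre Channel switches; beyond 500 m (up to 100 km), they are connected through Fibre Channel and DWDM switches.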
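MetroCluster later evolved from this architecture. One dual-controller FAS/V-Series array is placed at each of the two sites, A and B. Controller A of array A pairs with controller A of array B, and controller B of array A pairs with controller B of array B, each pair forming a cluster. This makes full use of the data center resources at both sites, which serve storage at the same time; the A and B controllers inside one array, however, do not form a cluster. If either controller in a cross-site pair fails, the hosts at the failed site must access the controller at the remote site; if both nodes of a cross-site pair fail at the same time, service is interrupted.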
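NetApp Data ONTAP 8.3 introduced a four-controller active-active solution that supports distances of up to 200 km. In a four-controller MetroCluster, two HA pairs first form two local clusters, and the two clusters are then combined into a four-node configuration. The write logs held in controller memory are stored in NVRAM, and NVRAM mirrors the log entries that have not yet been written to disk. After a node failure, the partner node in the HA pair can therefore take over service; after a site failure, the remote HA pair can take over. When the log reaches a certain watermark or a system operation triggers a flush, the data written to disk is synchronized by SyncMirror, which writes to both the primary and the secondary site. If the disks at one site fail, the disks at the other site can still serve I/O, which enables site failover without interrupting service.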
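The write path described above can be illustrated with a toy model. This is only a sketch for intuition, not how ONTAP is implemented: the `Node` and `Aggregate` classes, the `write` and `flush` functions, and the node names are all hypothetical, and details such as watermarks and consistency points are reduced to an explicit `flush` call.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    log: list[str] = field(default_factory=list)       # this node's own unflushed writes
    mirrored: list[str] = field(default_factory=list)  # copies of a partner's unflushed writes


@dataclass
class Aggregate:
    plex_local: list[str] = field(default_factory=list)   # disks at the local site
    plex_remote: list[str] = field(default_factory=list)  # disks at the remote site


def write(owner: Node, partners: list[Node], entry: str) -> None:
    """Log a write on the owner and mirror it to every partner (toy NVRAM mirroring)."""
    owner.log.append(entry)
    for partner in partners:
        partner.mirrored.append(entry)


def flush(owner: Node, partners: list[Node], aggr: Aggregate) -> None:
    """Write logged entries to both plexes (toy SyncMirror), then release the logs."""
    aggr.plex_local.extend(owner.log)
    aggr.plex_remote.extend(owner.log)
    owner.log.clear()
    for partner in partners:
        partner.mirrored.clear()


# Hypothetical nodes: A1 owns the data, A2 is its HA partner, B1 is its remote partner.
a1, a2, b1 = Node("A1"), Node("A2"), Node("B1")
aggr = Aggregate()
write(a1, [a2, b1], "block-1")
# Before the flush, both partners hold a copy and could replay it after a takeover.
flush(a1, [a2, b1], aggr)
```

After `flush`, both plexes hold the same data, which is why either site can keep serving I/O when the disks at the other site are lost.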
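MetroCluster protects data by using mirroring and clustering across two separate locations. Each cluster synchronously mirrors its data and its Storage Virtual Machine (SVM) configuration to the other cluster. When a disaster strikes one site, the administrator can activate the remote SVMs and take over service at the other site. In addition, the nodes in each cluster are configured as local HA pairs, which provides local failover capability.
NetApp MetroCluster is built on NetApp SyncMirror, working together with the Cluster_remote and controller Clustered Failover features.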
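- Clustered Failover – provides high-availability failover between the primary storage and the disaster-recovery storage; the decision to take over is made by the administrator with a single command.
- SyncMirror – keeps an up-to-date copy of the data on the remote storage, so that after a takeover the data can be accessed through the remote storage alone.
- Cluster_remote – provides the management mechanism used to declare that a disaster has occurred and to initiate the takeover by the remote storage.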
II. Common MCC Health-Check Commands
1. Check system health status
- cluster1::> system health status show
- Status
- ---------------
- ok
2. Check cluster status
- cluster1::> cluster show
- Node Health Eligibility
- --------------------- ------- ------------
- cluster1-01 true true
- cluster1-02 true true
- 2 entries were displayed.
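When this check is run on a schedule, the saved output can be screened automatically. The following is a minimal sketch, assuming the `cluster show` output has been captured as text in the layout shown above; the function name `unhealthy_nodes` is hypothetical and not part of any NetApp tool.

```python
def unhealthy_nodes(cluster_show_output: str) -> list[str]:
    """Return the nodes whose Health or Eligibility column is not 'true'."""
    bad = []
    for line in cluster_show_output.splitlines():
        # Drop leading dashes and spaces, so separator lines become empty.
        fields = line.lstrip("- ").split()
        # Data rows have exactly three fields: node, health, eligibility.
        if len(fields) != 3:
            continue
        node, health, eligibility = fields
        if health not in ("true", "false") or eligibility not in ("true", "false"):
            continue  # header row or unrelated text
        if health != "true" or eligibility != "true":
            bad.append(node)
    return bad


SAMPLE = """\
Node                  Health  Eligibility
--------------------- ------- ------------
cluster1-01           true    true
cluster1-02           true    true
2 entries were displayed.
"""

print(unhealthy_nodes(SAMPLE))  # prints []
```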
3. Check cluster statistics
- cluster1::> cluster statistics show
- Counter Value Delta
- ---------------- ----------------- -------------
- CPU Busy: 0% -
- Operations:
- Total: 0 -
- NFS: 0 -
- CIFS: 0 -
- Data Network:
- Busy: 0% -
- Received: 5.78GB -
- Sent: 13.7GB -
- Cluster Network:
- Busy: 0% -
- Received: 967KB -
- Sent: 979KB -
- Storage Disk:
- Read: 6.38PB -
- Write: 6.26PB -
4. View aggregate (RAID) information
- cluster1::> aggr show
- Aggregate Size Available Used% State #Vols Nodes RAID Status
- --------- -------- --------- ----- ------- ------ ---------------- ------------
- aggr0_A1 953.8GB 247.3GB 74% online 1 cluster1-01 raid4,
- mirrored,
- normal
- aggr0_A2 953.8GB 247.3GB 74% online 1 cluster1-02 raid4,
- mirrored,
- normal
- aggr_data_A1
- 68.93TB 16.04TB 77% online 32 cluster1-01 mixed_raid_
- type,
- mirrored,
- hybrid,
- normal
- aggr_data_A2
- 68.93TB 14.77TB 79% online 31 cluster1-02 mixed_raid_
- type,
- mirrored,
- hybrid,
- normal
- 4 entries were displayed.
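Because long aggregate names wrap onto their own line, reading the Used% column by eye is error-prone. The sketch below picks out aggregates that are above a usage threshold or not online. It assumes the raw `aggr show` text in the layout shown above; the function name `aggregates_to_review` and the default threshold of 80 are illustrative assumptions, not NetApp recommendations.

```python
import re

# Size, Available, Used% and State columns, e.g. "68.93TB 16.04TB 77% online".
ROW = re.compile(
    r"(?P<size>[\d.]+[KMGTP]B)\s+(?P<avail>[\d.]+[KMGTP]B)\s+(?P<used>\d+)%\s+(?P<state>\S+)"
)


def aggregates_to_review(output: str, threshold: int = 80) -> list[tuple[str, int]]:
    """Return (name, used%) for aggregates at or above the threshold or not online."""
    results = []
    previous = ""
    for raw in output.splitlines():
        line = raw.strip()
        match = ROW.search(line)
        if match:
            # The name precedes the size column, or sits on the previous
            # line when the row was wrapped.
            name = line[: match.start()].strip() or previous
            used = int(match["used"])
            if used >= threshold or match["state"] != "online":
                results.append((name, used))
        previous = line
    return results


SAMPLE = """\
aggr0_A1   953.8GB   247.3GB   74% online       1 cluster1-01      raid4,
                                                                   mirrored,
                                                                   normal
aggr_data_A1
           68.93TB   16.04TB   77% online      32 cluster1-01      mixed_raid_
                                                                   type,
aggr_data_A2
           68.93TB   14.77TB   79% online      31 cluster1-02      mixed_raid_
                                                                   type,
"""

print(aggregates_to_review(SAMPLE, threshold=75))
```

With a threshold of 75, the two data aggregates from the sample (77% and 79%) are reported, while the root aggregate at 74% is not.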
5. View node information
- cluster1::> node show
- Node Health Eligibility Uptime Model Owner Location
- --------- ------ ----------- ------------- ----------- -------- ---------------
- cluster1-01
- true true
- 369 days 19:12 FAS8040 gz_idc
- cluster1-02
- true true
- 369 days 19:23 FAS8040 gz_idc
- 2 entries were displayed.
6. View version information
- cluster1::> version
- NetApp Release 8.3.2P9: Fri Jan 06 05:54:05 UTC 2017
7. View serial numbers (license information)
- cluster1::> system license show
- Serial Number: 1-80-023992
- Owner: cluster1
- Package Type Description Expiration
- ----------------- ------- --------------------- --------------------
- Base license Cluster Base License -
- Serial Number: 1-81-0000000000000451515******
- Package Type Description Expiration
- ----------------- ------- --------------------- --------------------
- NFS license NFS License -
- iSCSI license iSCSI License -
- Serial Number: 1-81-0000000000000451515******
- Owner: cluster1-02
- Package Type Description Expiration
- ----------------- ------- --------------------- --------------------
- NFS license NFS License -
- iSCSI license iSCSI License -
- 5 entries were displayed.
8. View subsystem health status
- cluster1::> system health subsystem show
- Subsystem Health
- ----------------- ------------------
- SAS-connect ok
- Environment ok
- Memory ok
- Service-Processor ok
- Switch-Health ok
- CIFS-NDO ok
- Motherboard ok
- IO ok
- MetroCluster ok
- MetroCluster_Node ok
- FHM-Switch ok
- FHM-Bridge ok
- 12 entries were displayed.
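A report only needs the subsystems that are not `ok`. The following sketch filters the captured output accordingly. It assumes the two-column layout shown above; the function name `degraded_subsystems` is hypothetical, and the `degraded` value in the usage line is only an illustrative non-ok state.

```python
def degraded_subsystems(output: str) -> dict[str, str]:
    """Map each subsystem whose Health column is not 'ok' to its reported state."""
    bad = {}
    for line in output.splitlines():
        fields = line.split()
        # Data rows have exactly two fields; skip the header and the separator.
        if len(fields) != 2 or fields[0] == "Subsystem" or fields[0].startswith("-"):
            continue
        subsystem, health = fields
        if health != "ok":
            bad[subsystem] = health
    return bad


SAMPLE = """\
Subsystem         Health
----------------- ------------------
SAS-connect       ok
Environment       ok
MetroCluster      ok
FHM-Bridge        ok
"""

print(degraded_subsystems(SAMPLE))                 # prints {}
print(degraded_subsystems("FHM-Bridge degraded"))  # prints {'FHM-Bridge': 'degraded'}
```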
9. View MCC cluster status and node status
- cluster1::> metrocluster show
- Configuration: fabric
- Cluster Configuration State Mode
- ------------------------------ ---------------------- ------------------------
- Local: cluster1 configured normal
- Remote: cluster1_dr configured normal
- cluster1::> metrocluster node show
- DR Configuration DR
- Group Cluster Node State Mirroring Mode
- ----- ------- ------------------ -------------- --------- --------------------
- 1 cluster1
- cluster1-01 configured enabled normal
- cluster1-02 configured enabled normal
- cluster1_dr
- cluster1_dr-01 configured enabled normal
- cluster1_dr-02 configured enabled normal
- 4 entries were displayed.
10. View controller status
- cluster1::> system controller show
- Controller Name System ID Serial Number Model Status
- ------------------------- ------------- ----------------- -------- -----------
- cluster1-01 536964819 451515****** FAS8040 ok
- cluster1-02 536961600 451515****** FAS8040 ok
- 2 entries were displayed.
11. View failed disks
- cluster1::> storage disk show -broken
- There are no entries matching your query.
12. View spare disks
- cluster1::> storage disk show -spare
- Original Owner: cluster1-01
- Checksum Compatibility: block
- Usable Physical
- Disk HA Shelf Bay Chan Pool Type RPM Size Size Owner
- --------------- ------------ ---- ------ ----- ------ -------- -------- --------
- 1.30.11 3a 30 11 A Pool0 SAS 10000 1.09TB 1.09TB cluster1-01
- 1.30.13 3a 30 13 A Pool0 SAS 10000 1.09TB 1.09TB cluster1-01
- 1.31.4 3a 31 4 A Pool0 SAS 10000 1.09TB 1.09TB cluster1-01
- 1.32.20 4b 32 20 B Pool0 SAS 10000 1.09TB 1.09TB cluster1-01
- 1.32.23 3a 32 23 A Pool0 SAS 10000 1.09TB 1.09TB cluster1-01
- 1.33.0 3a 33 0 A Pool0 SAS 10000 1.09TB 1.09TB cluster1-01
- 1.33.1 3a 33 1 A Pool0 SAS 10000 1.09TB 1.09TB cluster1-01
- 1.33.10 4b 33 10 B Pool0 SAS 10000 1.09TB 1.09TB cluster1-01
- 2.42.22 3a 42 22 A Pool1 SAS 10000 1.09TB 1.09TB cluster1-01
- 2.42.23 4b 42 23 B Pool1 SAS 10000 1.09TB 1.09TB cluster1-01
- 2.43.2 4b 43 2 B Pool1 SAS 10000 1.09TB 1.09TB cluster1-01
- 2.43.22 3b 43 22 A Pool1 SAS 10000 1.09TB 1.09TB cluster1-01
- 2.43.23 4b 43 23 B Pool1 SAS 10000 1.09TB 1.09TB cluster1-01
- 3.11.21 4b 11 21 B Pool0 SSD - 372.4GB 372.6GB cluster1-01
- 4.20.21 3a 20 21 A Pool1 SSD - 372.4GB 372.6GB cluster1-01
- 4.21.14 3a 21 14 A Pool1 SAS 10000 1.09TB 1.09TB cluster1-01
- Original Owner: cluster1-02
- Checksum Compatibility: block
- Usable Physical
- Disk HA Shelf Bay Chan Pool Type RPM Size Size Owner
- --------------- ------------ ---- ------ ----- ------ -------- -------- --------
- 2.44.23 3b 44 23 A Pool1 SAS 10000 1.09TB 1.09TB cluster1-02
- 3.12.21 4a 12 21 B Pool0 SSD - 372.4GB 372.6GB cluster1-02
- 4.23.21 3b 23 21 A Pool1 SSD - 372.4GB 372.6GB cluster1-02
- 5.60.23 3b 60 23 B Pool1 SAS 10000 1.09TB 1.09TB cluster1-02
- 20 entries were displayed.
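In a mirrored configuration it helps to know how many spares each node has in each pool, since Pool0 and Pool1 appear per owner in the listing above. The sketch below tallies spares by owner, pool, and disk type from the captured text. It assumes the eleven-column row layout shown above; the function name `count_spares` is hypothetical.

```python
import re
from collections import Counter

# Disk names in this listing look like "1.30.11".
DISK_NAME = re.compile(r"^\d+\.\d+\.\d+$")


def count_spares(output: str) -> Counter:
    """Count spare disks per (owner, pool, disk type)."""
    counts = Counter()
    for line in output.splitlines():
        fields = line.split()
        # Columns: Disk HA Shelf Bay Chan Pool Type RPM Usable Physical Owner
        if len(fields) == 11 and DISK_NAME.match(fields[0]):
            pool, disk_type, owner = fields[5], fields[6], fields[10]
            counts[(owner, pool, disk_type)] += 1
    return counts


SAMPLE = """\
1.30.11         3a   30     11 A     Pool0  SAS     10000   1.09TB   1.09TB cluster1-01
2.42.22         3a   42     22 A     Pool1  SAS     10000   1.09TB   1.09TB cluster1-01
3.11.21         4b   11     21 B     Pool0  SSD         -  372.4GB  372.6GB cluster1-01
2.44.23         3b   44     23 A     Pool1  SAS     10000   1.09TB   1.09TB cluster1-02
"""

for key, number in sorted(count_spares(SAMPLE).items()):
    print(key, number)
```

A node that shows no spares of a given type in one of its pools is worth a closer look before the next disk failure.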
13. Check the SAS bridges for faults
- cluster1::> storage bridge show
- Is Monitor
- Bridge Symbolic Name Monitored Status Vendor Model Bridge WWN
- ------------------------ ------------- --------- ------- ------ --------------------- ----------------
- ATTO_10.0.15.17 BRIDGE_B_1
- true ok Atto FibreBridge 6500N 2000001086627bc0
- ATTO_10.0.15.18 BRIDGE_B_2
- true ok Atto FibreBridge 6500N 2000001086630f0e
- ATTO_10.0.15.19 BRIDGE_B_3
- true ok Atto FibreBridge 6500N 2000001086630edc
- ATTO_10.0.15.20 BRIDGE_B_4
- true ok Atto FibreBridge 6500N 2000001086630ed2
- ATTO_10.0.15.6 BRIDGE_A_1
- true ok Atto FibreBridge 6500N 2000001086630eb4
- ATTO_10.0.15.7 BRIDGE_A_2
- true ok Atto FibreBridge 6500N 2000001086630efa
- ATTO_10.0.15.8 BRIDGE_A_3
- true ok Atto FibreBridge 6500N 2000001086630f18
- ATTO_10.0.15.9 BRIDGE_A_4
- true ok Atto FibreBridge 6500N 2000001086630ef0
- ATTO_FibreBridge6500N_10 -
- false - Atto FibreBridge6500N 200000108663e514
- ATTO_FibreBridge6500N_11 -
- false - Atto FibreBridge6500N 200000108663e3f2
- ATTO_FibreBridge6500N_12 -
- false - Atto FibreBridge6500N 200000108663e488
- ATTO_FibreBridge6500N_13 -
- false - Atto FibreBridge6500N 20000010866114ec
- ATTO_FibreBridge6500N_14 -
- false - Atto FibreBridge6500N 2000001086627bc0
- ATTO_FibreBridge6500N_7 -
- false - Atto FibreBridge6500N 2000001086630e96
- ATTO_FibreBridge6500N_9 -
- false - Atto FibreBridge6500N 200000108663e4c4
- 15 entries were displayed.
14. Check the Fibre Channel switches for faults
- cluster1::> storage switch show
- Symbolic Is Monitor
- Switch Name Vendor Model Switch WWN Monitored Status
- --------------------- -------- ------- ----- ---------------- --------- -------
- Brocade_10.0.15.10
- SW_A_1
- Brocade Brocade6505
- 100050eb1a88327f true ok
- Brocade_10.0.15.11
- SW_A_2
- Brocade Brocade6505
- 100050eb1a881582 true ok
- Brocade_10.0.15.21
- SW_B_3
- Brocade Brocade6505
- 100050eb1a882f69 true ok
- Brocade_10.0.15.22
- SW_B_4
- Brocade Brocade6505
- 100050eb1a881522 true ok
- 4 entries were displayed.
15. View failover status
- cluster1::> storage failover show
- Takeover
- Node Partner Possible State Description
- -------------- -------------- -------- -------------------------------------
- cluster1-01 cluster1-02 true Connected to cluster1-02
- cluster1-02 cluster1-01 true Connected to cluster1-01
- 2 entries were displayed.
16. View critical and error event logs
- cluster1::> event log show -severity critical
- There are no entries matching your query.
- cluster1::> event log show -severity error
- Time Node Severity Event
- ------------------- ---------------- ------------- ---------------------------
- 3/6/2018 02:28:30 cluster1-02 ERROR asup.post.drop: AutoSupport message (HA Group Notification from cluster1-02 (MANAGEMENT_LOG) INFO) for host (0) was not posted to NetApp. The system will drop the message.
- 3/6/2018 01:28:18 cluster1-02 ERROR asup.post.drop: AutoSupport message (HA Group Notification from cluster1-02 (PERFORMANCE DATA) INFO) for host (0) was not posted to NetApp. The system will drop the message.
- 3/6/2018 00:00:07 cluster1-02 ERROR mgmtgwd.certificate.expired: A digital certificate with Fully Qualified Domain Name (FQDN) cluster1, Serial Number 5589765F, Certificate Authority 'cluster1' and type server for Vserver cluster1 has expired.
- 3/6/2018 00:00:07 cluster1-02 ERROR mgmtgwd.certificate.expired: A digital certificate with Fully Qualified Domain Name (FQDN) UC_SVM2, Serial Number 55A03966, Certificate Authority 'SVM2' and type server for Vserver SVM2 has expired.
- 3/6/2018 00:00:07 cluster1-02 ERROR mgmtgwd.certificate.expired: A digital certificate with Fully Qualified Domain Name (FQDN) UC_SVM, Serial Number 559FFD76, Certificate Authority 'SVM' and type server for Vserver SVM has expired.
- 3/6/2018 00:00:07 cluster1-02 ERROR mgmtgwd.certificate.expired: A digital certificate with Fully Qualified Domain Name (FQDN) UCS_SVM_DR, Serial Number 545845C16E278, Certificate Authority 'SVM_DR' and type server for Vserver SVM_DR-mc has expired.
- 3/6/2018 00:00:07 cluster1-02 ERROR mgmtgwd.certificate.expired: A digital certificate with Fully Qualified Domain Name (FQDN) UCS_SVM2_DR, Serial Number 545845A7B01FA, Certificate Authority 'SVM2_DR' and type server for Vserver SVM2_DR-mc has expired.
- 7 entries were displayed.
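The error log above is dominated by a few repeating event names, so a per-event count is easier to read than the raw lines. The following sketch groups entries by node and event name. It assumes each entry starts with a date, a time, a node name, and a severity, as in the output above; the function name `summarize_events` is hypothetical.

```python
import re
from collections import Counter

# Date, time, node, severity, then the event name up to the first colon.
# An optional leading "- " is tolerated in case the lines were copied from a list.
EVENT = re.compile(
    r"^\s*(?:-\s+)?\d+/\d+/\d+\s+\d+:\d+:\d+\s+"
    r"(?P<node>\S+)\s+(?P<severity>\S+)\s+(?P<name>[\w.]+):"
)


def summarize_events(output: str) -> Counter:
    """Count log entries per (node, event name)."""
    counts = Counter()
    for line in output.splitlines():
        match = EVENT.match(line)
        if match:
            counts[(match["node"], match["name"])] += 1
    return counts


SAMPLE = """\
3/6/2018 02:28:30 cluster1-02 ERROR asup.post.drop: AutoSupport message (HA Group Notification from cluster1-02 (MANAGEMENT_LOG) INFO) for host (0) was not posted to NetApp. The system will drop the message.
3/6/2018 01:28:18 cluster1-02 ERROR asup.post.drop: AutoSupport message (HA Group Notification from cluster1-02 (PERFORMANCE DATA) INFO) for host (0) was not posted to NetApp. The system will drop the message.
3/6/2018 00:00:07 cluster1-02 ERROR mgmtgwd.certificate.expired: A digital certificate with Fully Qualified Domain Name (FQDN) cluster1, Serial Number 5589765F, Certificate Authority 'cluster1' and type server for Vserver cluster1 has expired.
"""

for (node, name), number in summarize_events(SAMPLE).items():
    print(node, name, number)
```

Applied to the full output above, this reduces the seven entries to two event names: AutoSupport delivery failures and expired certificates.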
17. View the status of the volumes in an aggregate
- cluster1::> vol show -aggregate aggr_data_A1
18. View LUN information and LUN details
- cluster1::> lun show
- cluster1::> lun show -v
19. View mapping (igroup) information and details
- cluster1::> igroup show
- cluster1::> igroup show -v
20. View LUN mappings
- cluster1::> lun show -m
21. Enter a specific node (nodeshell)
- cluster1::> run -node cluster1-01
- Type 'exit' or 'Ctrl-D' to return to the CLI
- cluster1-01>
22. View spare disks from the node
- cluster1-01> vol status -s
- Local spares
- Pool1 spare disks
- RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
- --------- ------ ------------- ---- ---- ---- ----- -------------- --------------
- Spare disks for block checksum
- spare SW_B_3:6.126L41 3a 21 14 FC:A 1 SAS 10000 1142352/2339537408 1144641/2344225968 (not zeroed)
- spare SW_B_3:7.126L75 3a 42 22 FC:A 1 SAS 10000 1142352/2339537408 1144641/2344225968
- spare SW_B_3:7.126L101 3b 43 22 FC:A 1 SAS 10000 1142352/2339537408 1144641/2344225968
- spare SW_B_4:7.126L76 4b 42 23 FC:B 1 SAS 10000 1142352/2339537408 1144641/2344225968
- spare SW_B_4:7.126L29 4b 43 2 FC:B 1 SAS 10000 1142352/2339537408 1144641/2344225968
- spare SW_B_4:7.126L50 4b 43 23 FC:B 1 SAS 10000 1142352/2339537408 1144641/2344225968
- spare SW_B_3:6.126L22 3a 20 21 FC:A 1 SSD N/A 381304/780910592 381554/781422768
- Pool0 spare disks
- RAID Disk Device HA SHELF BAY CHAN Pool Type RPM Used (MB/blks) Phys (MB/blks)
- --------- ------ ------------- ---- ---- ---- ----- -------------- --------------
- Spare disks for block checksum
- spare SW_A_1:7.126L12 3a 30 11 FC:A 0 SAS 10000 1142352/2339537408 1144641/2344225968
- spare SW_A_1:7.126L14 3a 30 13 FC:A 0 SAS 10000 1142352/2339537408 1144641/2344225968
- spare SW_A_1:7.126L31 3a 31 4 FC:A 0 SAS 10000 1142352/2339537408 1144641/2344225968
- spare SW_A_1:7.126L76 3a 32 23 FC:A 0 SAS 10000 1142352/2339537408 1144641/2344225968
- spare SW_A_1:7.126L79 3a 33 0 FC:A 0 SAS 10000 1142352/2339537408 1144641/2344225968
- spare SW_A_1:7.126L80 3a 33 1 FC:A 0 SAS 10000 1142352/2339537408 1144641/2344225968
- spare SW_A_2:7.126L73 4b 32 20 FC:B 0 SAS 10000 1142352/2339537408 1144641/2344225968
- spare SW_A_2:7.126L37 4b 33 10 FC:B 0 SAS 10000 1142352/2339537408 1144641/2344225968
- spare SW_A_2:6.126L74 4b 11 21 FC:B 0 SSD N/A 381304/780910592 381554/781422768
23. View failed disks from the node
- cluster1-01> vol status -f
- Broken disks (empty)
24. Show disks that have no ownership
- cluster1-01> disk show -n
- disk show : No unassigned disks
25. Assign disk ownership (commonly used after a disk replacement)
- cluster1-01> disk assign all
26. View location information for all disks
- cluster1-01> storage show disk -p