I thought I'd post this here as it appears to be a largely undocumented bug (I suspect) with NetApp that affects only clustered servers:
Analysis of the outage after NetApp upgrade to OnTap version 8.0.5
Symptom:
Clustered disks on SQL08 were not accessible through either cluster node. This happened after the upgrade of NetApp OnTap from 7.3.5 to 8.0.5.
Environment:
HP ProLiant DL380 G6 servers (2 local disks for OS)
Storage: NetApp V3140 series (3 disks presented to cluster nodes)
Windows Server 2008 R2 Enterprise, no service packs
SQL Server 2008 cluster
NetApp Windows Host Utilities ver. 5.3
NetApp DSM (MPIO) ver. 3.3
Impact:
SQL Server was not accessible; various applications were impacted.
Detailed description of the problem:
Both SQL cluster nodes (SQL3 and SQL4) had the clustered disks presented but not accessible. All three disks (Quorum, T-Log, SQL Data) were visible in Disk Management on both nodes; however, they were in “Offline” state, marked “Reserved”, with the message “The disk is offline because of a policy set by the administrator”.
Problem resolution:
The following steps were performed in an attempt to rectify this issue:
- Used the DiskPart utility to identify the disks
- Disks 1-3 (the SAN-attached disks) were reported to be in “Reserved” state
- Running “cluster node <node name> /clearpr:<DiskNumber>”, although reported as successful, did not change the status of the disks from “Reserved” to “Online”
- NetApp support was contacted and the following steps were performed on the controller:
  - Took the PRDSQL08 volume offline and brought it back online; this did not bring the clustered disks online
  - Took the LUNs associated with the SQL cluster offline and brought them back online; this did not bring the clustered disks online
- Microsoft support was contacted, as suggested by NetApp support
- The following steps were performed on the cluster nodes by Microsoft support, without success:
  - Running “cluster node <node name> /clearpr:<DiskNumber>”, although reported as successful, did not change the status of the disks from “Reserved” to “Online”
  - Tried to bring the resources online through the cluster manager
  - Removed the disks from the cluster manager and re-added them
  - Ran the “Validation Report” from within the cluster manager
- Microsoft and NetApp were brought into a conference call to work on a resolution
- The Microsoft engineer reported to NetApp that the issue was with the storage, based on the following error recorded by the “Validation Report”: “Cluster Disk 0 does not support Persistent Reservations. Some storage devices require specific firmware versions or settings to function properly with failover clusters. Please contact your storage administrator or storage vendor to check the configuration of the storage to allow it to function properly with failover clusters.”
- The NetApp engineer suggested upgrading NetApp Host Utilities to ver. 6.1 and NetApp DSM to ver. 3.5
- Prerequisites for the two utilities are:
  - Microsoft Q2522766
  - Microsoft Q2528357
  - Microsoft Q979711
  - Microsoft Q981379
- Installing NetApp Host Utilities ver. 6.1 and NetApp DSM ver. 3.5 resolved the issue
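For anyone wanting to check their own hosts before a similar upgrade, here is a rough Python sketch that compares installed versions against the two versions that fixed things for us. Treating 6.1 and 3.5 as minimums is my assumption, not something NetApp confirmed, and the version strings have to be entered by hand; the function and dictionary names are made up for illustration.

```python
# Versions that resolved the issue in this post.
# Assumption: anything at or above these is treated as OK.
FIXED_VERSIONS = {
    "NetApp Host Utilities": (6, 1),
    "NetApp DSM (MPIO)": (3, 5),
}


def parse_version(text):
    """Turn a dotted version string such as '5.3' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def components_needing_upgrade(installed):
    """Return the components whose installed version is below the fixed one.

    `installed` maps a component name to a version string typed in by hand.
    """
    outdated = []
    for name, fixed in FIXED_VERSIONS.items():
        if parse_version(installed[name]) < fixed:
            outdated.append(name)
    return outdated


if __name__ == "__main__":
    # The versions we were running when the outage happened.
    before = {"NetApp Host Utilities": "5.3", "NetApp DSM (MPIO)": "3.3"}
    for name in components_needing_upgrade(before):
        print("Upgrade needed:", name)
```

Remember that the four Microsoft hotfixes listed above have to be installed before the two NetApp packages.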
Root cause analysis:
It appears that this particular issue is a bug in NetApp DSM ver. 3.3 that affects only clustered servers. Clustered servers require a reservation on the disk, which does not appear to be possible with this version of DSM after the OnTap upgrade. Prior to this upgrade, the issue was not a documented bug with NetApp.
Vic Sabljic Sr. Data Centre Analyst