We're having an issue with our iSCSI SAN constantly doing logical unit (LU) resets, which kills its performance and causes our Exchange databases to dismount because they lose sync. It is also causing problems when live-migrating VM storage to the SAN, which is a Cluster Shared Volume.
Our setup:
Three Dell R710s, all running the same Server 2012 R2 build and fully patched, with current firmware on the BIOS, the PERC 6/i controllers, and the NICs, plus the latest drivers. They are set up identically and are all in a cluster on our domain.
Of the four NICs on each server, the first two are teamed, and Hyper-V uses that team as a Hyper-V virtual NIC interface; the vNICs all have unique static IPs. The other two NICs are reserved for iSCSI: each has its own static IP on a separate iSCSI subnet with no gateway to the main subnet or the internet, since we wanted to keep the two networks completely separate. Jumbo frames are turned on and set to 4088.
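For reference, here is roughly the PowerShell equivalent of how each host's networking is set up. This is a sketch from memory, not the exact commands I ran; the adapter names and IP addresses are placeholders, and the jumbo-frame advanced-property name and allowed values vary by NIC driver:

    # Team the first two NICs and bind a Hyper-V switch to the team
    New-NetLbfoTeam -Name "LAN-Team" -TeamMembers "NIC1","NIC2" -TeamingMode SwitchIndependent -LoadBalancingAlgorithm Dynamic
    New-VMSwitch -Name "vSwitch-LAN" -NetAdapterName "LAN-Team" -AllowManagementOS $true
    # Dedicated iSCSI NICs: static IPs on the iSCSI subnet, no default gateway
    New-NetIPAddress -InterfaceAlias "iSCSI-A" -IPAddress 10.10.10.11 -PrefixLength 24
    New-NetIPAddress -InterfaceAlias "iSCSI-B" -IPAddress 10.10.10.12 -PrefixLength 24
    # Jumbo frames at 4088 (property display name and value list differ per driver)
    Set-NetAdapterAdvancedProperty -Name "iSCSI-A","iSCSI-B" -DisplayName "Jumbo Packet" -DisplayValue "4088"
    Get-NetAdapterAdvancedProperty -Name "iSCSI-A","iSCSI-B" -DisplayName "Jumbo Packet"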
For the iSCSI switch, we are using a Dell 5424 with the latest firmware. I set up a LAG on the first eight ports, which go to our main SAN's eight Ethernet connections, and a matching LAG on the SAN (a D-Link 3200) across all eight ports, assigned a static IP. The switch is running in iSCSI mode, spanning tree is turned off, and storm control is turned on. All three servers connect directly to this switch with their two dedicated NICs. Jumbo frames are turned on.
The SAN, as mentioned, is a D-Link 3200 15-bay unit with up-to-date firmware, connected via its LAG to the LAG on the switch. To start, I created a 4 TB RAID 5 volume for our main VM storage and a small 1 GB RAID 5 volume for our cluster witness. Jumbo frames for each volume are set to 4088.
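Since jumbo frames are set to 4088 on the NICs, the switch, and the SAN volumes, one end-to-end check I can run is a don't-fragment ping sized just under the MTU (4088 minus 28 bytes of IP/ICMP headers = 4060). The addresses below are placeholders for the SAN portal and my two iSCSI NICs; if any hop has a smaller MTU, this reports "Packet needs to be fragmented but DF set":

    ping 10.10.10.100 -f -l 4060 -S 10.10.10.11
    ping 10.10.10.100 -f -l 4060 -S 10.10.10.12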
Back on each server, I used the iSCSI Initiator to connect to the SAN and enabled MPIO. After adding the first session, I went back in and created a second session from the second NIC to the SAN's IP, checked MPIO, and it shows the two connections as it should. In Disk Management on one of the servers, I then brought both new volumes online, created a volume on each without assigning a drive letter, and took them both offline again. The disks showed up on each of the servers successfully. Everything appears to be set up correctly, from the SAN to the servers, according to all the documentation I've been able to find online.
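The initiator work was done in the GUI, but it is roughly equivalent to the following (the IPs are placeholders for the SAN portal and the two local iSCSI NICs), and the last two lines are what I use to confirm both paths are up:

    # One portal entry and one session per dedicated NIC, with MPIO enabled
    New-IscsiTargetPortal -TargetPortalAddress 10.10.10.100 -InitiatorPortalAddress 10.10.10.11
    New-IscsiTargetPortal -TargetPortalAddress 10.10.10.100 -InitiatorPortalAddress 10.10.10.12
    Get-IscsiTarget | Connect-IscsiTarget -IsPersistent $true -IsMultipathEnabled $true -InitiatorPortalAddress 10.10.10.11 -TargetPortalAddress 10.10.10.100
    Get-IscsiTarget | Connect-IscsiTarget -IsPersistent $true -IsMultipathEnabled $true -InitiatorPortalAddress 10.10.10.12 -TargetPortalAddress 10.10.10.100
    # Confirm both sessions are connected and MPIO sees two paths per disk
    Get-IscsiSession | Select-Object InitiatorPortalAddress, TargetNodeAddress, IsConnected
    mpclaim.exe -s -d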
I then created the cluster and added the three servers. The configuration validation test passed with only a few warnings (which I can post if needed), mainly about the two sets of NICs being on separate subnets, and Cluster Manager successfully found and added the two SAN volumes. I made the 1 GB disk the cluster witness disk and the other a CSV.
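The cluster build was done through Failover Cluster Manager, but in PowerShell terms it amounts to something like this (node names, cluster name/IP, and cluster disk names below are placeholders, not my real ones):

    Test-Cluster -Node "HV01","HV02","HV03"
    New-Cluster -Name "HVCLUSTER" -Node "HV01","HV02","HV03" -StaticAddress 192.168.1.50 -NoStorage
    Get-ClusterAvailableDisk | Add-ClusterDisk
    # 1 GB disk as the witness, 4 TB disk as the CSV
    Set-ClusterQuorum -NodeAndDiskMajority "Cluster Disk 1"
    Add-ClusterSharedVolume -Name "Cluster Disk 2"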
With the three servers already running Hyper-V with multiple VMs on each, I went to Failover Cluster Manager, created VM roles for high availability, and imported them all. The only warning was that the VMs' storage was local rather than on the CSV, so my next task was to move the VM storage to the CSV on the SAN. As a test, I transferred our four email servers, which run in a DAG (each on Server 2012), to the SAN one by one. The first two, each about 100 GB, live-migrated within a few minutes apiece with no downtime; it went perfectly. But with the third and fourth servers, as soon as I start moving one of them to the CSV, the transfer slows to a crawl. In Performance Monitor, under Network, the two NICs dedicated to the D-Link mirror each other, but very little traffic is moving: it stops completely, picks up a little, then stops again.
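I'm kicking off the storage moves from Failover Cluster Manager, which as far as I understand is roughly equivalent to the following (the VM name and path are placeholders), and I'm watching the NIC throughput with a counter while it runs:

    # Live storage migration of one VM's files onto the CSV
    Move-VMStorage -VMName "MAIL03" -DestinationStoragePath "C:\ClusterStorage\Volume1\MAIL03"
    # Watch throughput on the NICs while the move runs
    Get-Counter -Counter "\Network Interface(*)\Bytes Total/sec" -SampleInterval 2 -MaxSamples 15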
Confused, I opened the D-Link Xlink manager app to look at the SAN's log and saw constant "LU reset occurred" warnings. These coincide with the sudden drops in speed, and while the warnings are happening, if I try to connect to one of the VMs hosted on the SAN, the VM is non-responsive, almost as if frozen; a minute or two later it "wakes up" and starts responding normally.
Those constant LU (logical unit) resets are causing me a lot of headaches, and I am trying to figure out their source. I have been fighting them for weeks: I've tried multiple configurations, rebuilt the cluster over and over, applied firmware and driver updates, tried different settings on the switch, and even updated the firmware and drivers for the onboard storage on each server, with no luck.
Is anyone familiar with these types of errors who could give me some ideas on where else to look? The log keeps referencing initiator ID 1023, always that exact same ID, and I assume it is one of the servers, but how can I find out which one?
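For what it's worth, this is how I can pull each cluster node's iSCSI initiator name (IQN) on the Windows side, in case someone can tell me how to match those up against the SAN's numeric initiator IDs (just a sketch, run from one of the nodes):

    Invoke-Command -ComputerName (Get-ClusterNode).Name -ScriptBlock {
        # NodeAddress is the host's iSCSI qualified name (IQN)
        Get-InitiatorPort | Where-Object { $_.ConnectionType -eq 'iSCSI' } | Select-Object @{ n='Host'; e={ $env:COMPUTERNAME } }, NodeAddress
    }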
I'd be happy to provide more information or post screenshots; I am pretty stumped as to what the issue might be. Also, we recently upgraded to the Dell 5424 switch because the 2824 we were using was apparently not meant for iSCSI, but we get the same LU reset errors with the new one.
Thank you for your help!