IT engineering and a little bit of hacking: NetApp Experience: MetroCluster Disk Fail

I'm working on a MetroCluster case right now. It started when the customer was doing power maintenance and shut off one of the two PDUs in the rack. In a MetroCluster there are redundant Fibre Channel switches between the system and the shelves. Here each switch had only one power supply, with the two switches spread across the two PDUs, so one of the two switches went offline during the work.

The system failed 9 disks in this case, and all 9 of them were being addressed over the switch that went down. NGS has found iSCSI errors over the switch that stayed up during that time. The only firmware that is backrev'd is ESH firmware, so we're gonna have to dig deeper for a solution.

Those 9 disk failures took down a RAID group, which in turn took down an aggregate. We got the system back up and running by re-seating the disks one at a time, waiting 90 seconds between each.

Update: Here's what NGS had to say. (Quote)

The ports on the brocade switches are not locked:

>>> [-1][01] CRITICAL [01][01][00]:switchname: Port[4] has loop port NOT LOCKED

The ports not being locked can cause a number of instability issues and is the most likely cause of the issue seen. The information on how to lock these can be found in the following document: http://media.netapp.com/documents/tr-3548.pdf

"Not enough power supplies are present...to satisfy disk drive and shelf power requirements."

The logs are erroneous, there's been burts opened to correct this warning, but one PSU should not cause the disk shelf any issues other than it takes longer to get the disks spun up since it will do this in increments.

"Cluster monitor: takeover of eg-nascsi-a02 disabled (unsynchronized log)"
"Cluster Interconnect link 0 is DOWN"

It wouldn't surprise me if the syncmirror lost sync during this period based on the switch issues they experienced. The ports not being locked can cause a large number of unusual errors.

(End Quote)
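
If you have a pile of switch check output to go through, the CRITICAL line quoted above is regular enough to grep for. Here's a rough Python sketch that pulls the switch name and port number out of lines like that one. The pattern is guessed from the single line in the quote, so other versions may format the message differently, and `find_unlocked_ports` is just a name I made up for illustration, not part of any NetApp or Brocade tool.

```python
import re

# Pattern inferred from the single CRITICAL line quoted above; other
# firmware or tool versions may format the message differently.
NOT_LOCKED = re.compile(
    r":(?P<switch>[^:\s]+):\s*Port\[(?P<port>\d+)\]\s+has loop port NOT LOCKED"
)


def find_unlocked_ports(lines):
    """Return (switch, port) pairs for every 'loop port NOT LOCKED' line."""
    found = []
    for line in lines:
        match = NOT_LOCKED.search(line)
        if match:
            found.append((match.group("switch"), int(match.group("port"))))
    return found


if __name__ == "__main__":
    sample = [
        ">>> [-1][01] CRITICAL [01][01][00]:switchname: Port[4] has loop port NOT LOCKED",
    ]
    for switch, port in find_unlocked_ports(sample):
        print(f"{switch}: port {port} is not locked")
```

Feed it the lines of a saved log file instead of `sample` to get a list of ports to go lock.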

Making sense of it: Brocade ports are categorized as E, F, L, G, etc. An L port is a loop port, which means the switch will only initiate loop traffic on it. Locking a port as an L port means the switch won't try to bring that port up in a point-to-point relationship. Directly from Brocade's documentation:

Credit: Brocade Fabric OS Reference 2.6

Here's a case of a NetApp customer working through this issue.

Wednesday, November 9, 2011