Monday, May 23, 2011

NetApp Experience: Think on your feet (2)

Shelf Add Issue (AS) Setup:  
  • One controller was showing an amber light when we arrived.  AutoSupport messages showed traffic intended for the partner's FC ports bouncing off one of the controllers, which suggests the zoning may be misconfigured.
  • Found a DS14 shelf powered on but connected to nothing (!?). We added this to an existing loop after consulting with a very happily surprised customer.
  • Added 4-port HBAs.
OS upgrade issue:  The OS upgrade would not take.  We finally got it to apply by running software update with the -r option, which stopped the system from rebooting automatically.  After a manual reboot, the system worked just fine.  Our working theory is that the backup kernel re-asserted the previous OS on automatic reboots, and that rebooting manually disrupted this process.

Disk issue:  After hot-adding a 6-shelf stack of DS14s, Loop A could see only 2 disks in shelf 5, and Loop B could see only 12 disks in shelf 5.  Both controllers showed the same behavior.  Error observed:
    "[FAS3XXX: fci.device.invalidate.soft.address:error]: Fibre Channel adapter 4a is invalidating disk drive 4a.1 (0x0d000001) which appears to have taken a soft address. Expected hard address 93 (0x45), assigned soft address 1 (0xe8). 
    [FAS3XXX: config.NotMultiPath:warning]: Disk 3a.93 and other disks on this loop/domain are not multipathed and should be for improved availability"
    We attempted to re-seat the ESH modules, to no effect.  NGS recommended removing and re-inserting the disks one by one.  This allowed the system to clear the soft IDs and determine the hard IDs.  Per NGS:
    "Usually, soft address assignments occur when there is a shelf ID conflict.  A mechanism is designed to read the shelf ID from the corresponding select switch by performing a shelf power ON and then recording the shelf ID in memory. 
    This will record the status of the select switch, in case it is changed during the shelf running time. It is possible that the data recorded in the memory was corrupted for some reason, which lead to a situation where the newly inserted ESH4 is provided the same two shelf IDs. There is a possibility that a disk did not accept the hard address provided by ESH4. 
    If there is one disk in the shelf, ESH4 will stop using the hard addresses and then allows the HBA card to assign soft addresses. Such a possibility is higher when there are many disks."
    Conclusion: If that didn’t make sense to you, you’re not alone.  I find that explanation pretty unlikely: memory corruption in a shelf module?  One clue we noted is that a shelf-to-shelf cable wasn't fully seated when we hot-added the stack, and we needed to push it in further.  It’s conceivable that this connection was communicating only intermittently, and the system saw two shelf 5s, one of which was bouncing on- and offline.  Either way, good learning experience!
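
    For anyone reading a similar error, here is a minimal Python sketch that pulls the expected hard address and the assigned soft address out of the message and works out which shelf and bay the hard address points at.  It assumes the usual DS14 numbering of 16 × shelf ID + bay, with bays 0-13; that rule is not stated in the log or by NGS, and the function and pattern names are made up for illustration.

```python
import re

# Assumption (not from the log or NGS): DS14-family shelves number their
# disks as 16 * shelf_id + bay, with 14 bays per shelf (0-13).
ID_STRIDE = 16
BAYS_PER_SHELF = 14

# Matches the address portion of the fci.device.invalidate.soft.address message.
SOFT_ADDR_RE = re.compile(
    r"Expected hard address (?P<hard>\d+) \(0x[0-9a-fA-F]+\), "
    r"assigned soft address (?P<soft>\d+)"
)


def decode_loop_id(loop_id):
    """Return (shelf, bay) for a loop ID, or None if it maps to no bay."""
    shelf, bay = divmod(loop_id, ID_STRIDE)
    if bay >= BAYS_PER_SHELF:
        return None  # IDs 14 and 15 within each stride have no physical bay
    return shelf, bay


def explain(line):
    """Extract hard/soft addresses from a log line; None if it doesn't match."""
    match = SOFT_ADDR_RE.search(line)
    if match is None:
        return None
    hard = int(match.group("hard"))
    soft = int(match.group("soft"))
    return {"hard": hard, "soft": soft, "slot": decode_loop_id(hard)}


if __name__ == "__main__":
    msg = ("Fibre Channel adapter 4a is invalidating disk drive 4a.1 "
           "(0x0d000001) which appears to have taken a soft address. "
           "Expected hard address 93 (0x45), assigned soft address 1 (0xe8).")
    print(explain(msg))
```

    Under that assumption, hard address 93 decodes to shelf 5, bay 13, which lines up with the trouble being confined to shelf 5.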
