Tuesday, April 26, 2011

NetApp Training Brain Dump: Troubleshooting

Beginner tips for troubleshooting with NetApp FAS Systems:
  • Special Boot Menu - Option "4a" (Pre-ONTAP 8.0 only)
    • Look here for a mapping of how to access this menu.
    • Essentially a factory reset. 
    • Wipes out the data on all the disks and re-establishes RAID-DP.
      • Also wipes out the OS (obviously), which resides on the data drives.
      • Wipes out all customization and settings.
    • Re-installs a base level of Data ONTAP from the CompactFlash card.
    • Creates a 3-disk aggregate with a flexible volume.
    • Boots into setup.
    • Does not wipe out the RLM settings/IP address.
  • RLM (aka BMC, SP, or RMC, depending on the model)
    • Each controller has its own RLM
    • Port is referred to as e0p and labeled with a wrench on the back of the CPU module.
    • "system power {off | on | cycle | status}" controls controller power from the RLM console.
    • Has its own IP and is a completely separate entity from the controllers. You can think of it as a direct KVM (keyboard, video, mouse) into the controller. Same basic functionality as an HP iLO/IBM RSA/IBM IMM.
    • "system console" drops you down to the controller console.
    • Control-D brings you back to the RLM console.
  • "aggr status -r" is your friend. If you want a basic view of the health of the machine, this command shows you what data is healthy and being presented, and whether any of your aggregates have an issue.
  • disk show {-v | -a} is very useful for checking physical disk details, such as controller ownership.
  • LEDs
    • LEDs are not 100% reliable.
    • Solid green usually means the system recognizes the disk and can use it, but may not be using it.
    • Blinking green means the disk is healthy and in use.
    • Obviously, amber is bad news.
    • You can manually turn on LEDs one drive at a time, or for a whole loop (update: be careful with whole-loop LEDs. Word is it can freak out your system and take over I/O).
    • The NVRAM LED should be totally off when you pull out the CPU module. If that light is still on, then data could not be flushed to disk and is being kept alive by the battery. This is bad news bears. Exception: "if the controller was waiting for giveback, the flashing led can be ignored."
    • When you enter maintenance mode, all LEDs turn on automatically.
  • Easy to confuse (Pre-ONTAP 8.0 only):
    • Look here for a mapping of how to access these menus.
    • Maintenance Mode: 
      • Prompt: "*>"
      • Accessible from Special Boot Menu.
      • TBD (I'll update later)
      • "halt" to get out.
    • Diagnostic mode: 
      • Prompt: "Enter Diag, Command, or Option:"
      • Accessible from CFE> shell.
      • Allows you to run hardware diagnostic tests, among other things.
      • "exit" to get out.
  • ONTAP versions are customized per FAS system. ONTAP 7.0 for the FAS2020 is not the same code as ONTAP 7.0 for the FAS3020.
    • Are commands universal though?  I'll update later.
  • Replacing the PCI NVRAM card changes the system ID of the controller, as the ID is hard coded into the NVRAM logic.  (what models is this true for?  I'll update later)
  • Control-R: At the console, ONTAP kicks out messages regularly.  These can interrupt you mid-command.  Use Control-R to start a new line and (r)etrieve what you'd typed.
  • Control-C: 
    • Use this to cancel any process you've started and get a new line.
    • Use this to quit whatever command you've been typing and get a new line.
  • Control-G: At the console, use this to access the RLM console.
  • Set PuTTY up to automatically save your logs.
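
Since Maintenance Mode and Diagnostic mode are easy to confuse, here is a minimal Python sketch of telling them apart by prompt, for example when reading back a saved PuTTY log. The prompts and exit commands come from the list above; the names MODES and identify_mode are made up for illustration and are not part of any NetApp tool.

```python
# Hypothetical lookup table built from the notes above:
# prompt text -> (mode name, command that gets you out).
MODES = {
    "*>": ("Maintenance Mode", "halt"),
    "Enter Diag, Command, or Option:": ("Diagnostic mode", "exit"),
}


def identify_mode(console_line):
    """Return (mode name, exit command) if the line starts with a known prompt, else None."""
    text = console_line.strip()
    for prompt, info in MODES.items():
        if text.startswith(prompt):
            return info
    return None


if __name__ == "__main__":
    for line in ["*> disk show -v", "Enter Diag, Command, or Option: ", "filer> version"]:
        print(repr(line), "->", identify_mode(line))
```

Any line that starts with neither prompt returns None, so a normal ONTAP prompt is simply treated as "not one of the two special modes".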
