Sunday, January 20, 2019

Supermicro IPMI - Redux

The X10 IPMI support on my new servers is great - no more Java!
HTML5 virtual console for the win. That, plus the HTML5 clients for vSphere and (increasingly) NSX, means the days of needing Java or Flash are numbered.
  I did have one hiccup, though. Right when I thought my cluster was ready to go, I tested remote access, as I wanted to secure it with an ACL in addition to a non-standard username and a strong password.
  I'd used the IPMI virtual CD-ROM to install ESXi onto these, so I was surprised to find I couldn't access two of the four anymore.  After many reboots, and trying both static and DHCP addressing, I concluded something had become wedged in the firmware: the ports were showing as up on the switches and frames were being sent to them, but the switches' inbound counters were all zeros.
  My theory is that there was a bug in the bit of code that decides between the dedicated IPMI LAN interface and sharing LAN1.  This is the default 'failover' mode, where it uses the dedicated port if that port is determined to be connected when power is applied.  Once it has failed over to LAN1, though, it never recovers without a hard reset.  That is a huge pain in my new 2U boxes, where the two nodes share power: the only way to power cycle just one is to physically pull it from the chassis, which makes my remote switched PDUs pointless.  So don't ever apply power to these until your switches are fully booted.  Mine take several minutes, so a power loss would break it.  I did try putting the LAN1 ports on the lights-out VLAN, without any change, amongst many other experiments.  On my workbench at home I was happy to do those for curiosity's sake, whereas in a datacenter I'd just want the boxes back up ASAP.
  Anyhow, I built a DOS boot USB key (this is useful) and put Supermicro's IPMICFG tool on it, along with the latest IPMI firmware (already a release newer than when I started setting these servers up in December).  After upgrading from 3.77 to 3.78 and setting a static IP again, I was back in business.  Once back in the web interface, I changed the LAN mode from 'failover' to 'dedicated', which will hopefully prevent the issue from recurring.
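
  If you'd rather check the LAN mode from a running OS than from the web interface, here is a rough Python sketch around ipmitool.  It is not what I did above, and it rests on an assumption: the raw OEM command (0x30 0x70 0x0c) and the 0/1/2 mode codes are what is commonly reported for Supermicro X9/X10 BMCs, so verify them against your own board before relying on it.  The function names are just mine for illustration.

```python
import subprocess

# ASSUMPTION: Supermicro OEM raw command commonly reported for X9/X10 BMCs
# (netfn 0x30, cmd 0x70, sub-command 0x0c). Not taken from official docs here.
LAN_MODE_RAW = ["raw", "0x30", "0x70", "0x0c"]

# ASSUMPTION: mode codes as commonly reported for these boards.
LAN_MODES = {0: "dedicated", 1: "shared", 2: "failover"}


def parse_lan_mode(raw_output: str) -> str:
    """Turn ipmitool's hex reply (for example ' 02') into a mode name."""
    code = int(raw_output.split()[0], 16)
    return LAN_MODES.get(code, f"unknown ({code})")


def get_mode_args() -> list[str]:
    """Argument list that asks the BMC for its current LAN mode."""
    return ["ipmitool", *LAN_MODE_RAW, "0"]


def set_mode_args(mode: str) -> list[str]:
    """Argument list that sets the LAN mode; raises KeyError on a bad name."""
    code = {name: num for num, name in LAN_MODES.items()}[mode]
    return ["ipmitool", *LAN_MODE_RAW, "1", str(code)]


def current_lan_mode() -> str:
    """Run ipmitool locally (needs root and the IPMI kernel driver loaded)."""
    result = subprocess.run(
        get_mode_args(), capture_output=True, text=True, check=True
    )
    return parse_lan_mode(result.stdout)
```

  Calling current_lan_mode() on each node and flagging anything that isn't 'dedicated' would be one way to audit a cluster.  Passing set_mode_args("dedicated") to subprocess.run should make the same change as the web interface, if the assumption above holds for your firmware.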


Postscript -
I managed to screw up another one by upgrading it to the current release: it never came back after rebooting.  Querying it with the Linux command-line tools just gave errors.
The AlUpdate tool was able to re-flash it, after which it worked.  Be warned, though: this needs a hard power cycle, which would've been hard if the box had been off in a colo somewhere.
'AlUpdate -f REDFISH_X10_380.bin -kcs -r n'
(Update via KCS channel without preserving config)
