Management Interfaces

Revision as of 13:40, 10 January 2019 by imported>Volans (Add another error example)

Troubleshooting techniques and fixes for the most common IPMI and management interface issues.

Troubleshooting Commands

Does IPMI work locally?

SSH into the host itself (not its management interface) and run:

sudo ipmi-chassis --get-chassis-status

The typical error is: ipmi_cmd_get_chassis_status: internal system error or driver timeout
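
A minimal sketch of how the local check could be scripted, assuming that matching the error text quoted above is enough to spot the failure. The function name classify_local_ipmi and its two labels are hypothetical; it reads the command output on stdin, so it can be tried without touching a BMC.

```shell
# Hypothetical helper: classify the output of the local IPMI check.
# Prints "driver-error" if the typical error text is present,
# otherwise "no-known-error" (which does not prove IPMI is healthy).
classify_local_ipmi() {
    if grep -q 'internal system error or driver timeout'; then
        echo "driver-error"
    else
        echo "no-known-error"
    fi
}

# Example usage on a host (not executed here):
#   sudo ipmi-chassis --get-chassis-status 2>&1 | classify_local_ipmi
```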

Does IPMI work remotely?

SSH into one of the hosts with the cluster::management Puppet role applied (neodymium and sarin at the time of writing) and run the following, using the password stored in pwstore:

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis power status

The typical error on failure is: Error: Unable to establish IPMI v2 / RMCP+ session

If it fails very quickly, the IPMI password might have gone out of sync with the host one (was the host rebooted recently?). See "Set the IPMI password" below on how to set it again.
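
The remote check can be wrapped in small helpers, sketched here under a few assumptions: the management name always follows the "$HOST.mgmt.$DC.wmnet" pattern used above, and the session error text is matched literally. The function names mgmt_fqdn and remote_ipmi_session_failed are hypothetical, as are the example values db1001 and eqiad.

```shell
# Hypothetical helper: build the management FQDN from host and datacenter.
mgmt_fqdn() {
    if [ -z "${1:-}" ] || [ -z "${2:-}" ]; then
        echo "usage: mgmt_fqdn HOST DC" >&2
        return 1
    fi
    printf '%s.mgmt.%s.wmnet\n' "$1" "$2"
}

# Hypothetical helper: succeeds (exit 0) if the ipmitool output read from
# stdin contains the typical RMCP+ session error.
remote_ipmi_session_failed() {
    grep -q 'Unable to establish IPMI v2 / RMCP+ session'
}

# Example usage from a cluster::management host (not executed here):
#   sudo ipmitool -I lanplus -H "$(mgmt_fqdn db1001 eqiad)" -U root -E \
#       chassis power status 2>&1 | remote_ipmi_session_failed \
#       && echo "remote IPMI session could not be established"
```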

Is remote IPMI enabled?

SSH into the host itself (not its management interface) and run the following (on Ubuntu Trusty, use bmc-config instead of ipmi-config):

sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff

No output means that remote IPMI is enabled and the configuration is correct. Any diff shown means that the current values differ from the expected ones.

Re-run the same command replacing --diff with --commit to change the config. Verify it again after the commit.

Are IPMI permissions set correctly?

SSH into the host itself (not its management interface) and run the following (on Ubuntu Trusty, use bmc-config instead of ipmi-config):

sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Channel_Privilege_Limit=Administrator" --key-pair="Lan_Channel:Non_Volatile_Channel_Privilege_Limit=Administrator" --diff

No output means that the permissions are configured correctly. Any diff shown means that the current values differ from the expected ones.

Re-run the same command replacing --diff with --commit to change the config. Verify it again after the commit.
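
Both Lan_Channel checks follow the same diff-then-commit pattern, which could be scripted roughly as below. This is only a sketch: it relies on the rule stated above that empty --diff output means the settings match, and it makes no assumption about the tool's exit status. The function name lan_channel_in_sync is hypothetical.

```shell
# Hypothetical helper: succeeds (exit 0) when the --diff output read from
# stdin is empty or whitespace-only, i.e. no drift was reported.
lan_channel_in_sync() {
    [ -z "$(tr -d '[:space:]')" ]
}

# Example usage on the host (not executed here):
#   sudo ipmi-config --section=Lan_Channel \
#       --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" \
#       --diff | lan_channel_in_sync \
#       || echo "drift detected: re-run with --commit, then --diff again"
```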

Is there any override for the next boot?

To see if the host has any BIOS parameter overridden for the next boot, SSH into one of the hosts with the cluster::management Puppet role applied (neodymium and sarin at the time of writing) and run the following, using the password stored in pwstore:

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootparam get 5

Typical output for clean settings with no overrides is:

Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0000000000
 Boot Flags :
   - Boot Flag Invalid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : No override
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

In case of overrides, the Boot parameter data bitmask will differ from 0000000000 and the lines below it will show the overridden values.
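
The bitmask comparison can be automated by parsing the "Boot parameter data" line, as in this minimal sketch. It assumes the output format shown in the sample above; the function names boot_param_data and has_boot_override are hypothetical.

```shell
# Hypothetical helper: print the hex string that follows
# "Boot parameter data:" in the bootparam output read from stdin.
boot_param_data() {
    sed -n 's/^[[:space:]]*Boot parameter data:[[:space:]]*//p' | tr -d '[:space:]'
}

# Hypothetical helper: succeeds (exit 0) when an override is present,
# i.e. the data line exists and is not 0000000000.
# A missing data line is reported as "no override detected".
has_boot_override() {
    data=$(boot_param_data)
    [ -n "$data" ] && [ "$data" != "0000000000" ]
}

# Example usage from a cluster::management host (not executed here):
#   sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E \
#       chassis bootparam get 5 | has_boot_override && echo "override set"
```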

The wmf-auto-reimage script automatically checks that the host has the PXE boot bit set before rebooting it, and prints a warning after the reimage if the parameters are not all reset to their default values.

To manually reset the parameters and remove any override, run:

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootdev none

Then re-check that the change has been applied.
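
The reset and the follow-up check could be chained as in the sketch below. It assumes the clean bitmask is 0000000000, as in the sample output above. The IPMITOOL variable and the function name clear_boot_override are hypothetical; the variable exists only so the command can be swapped out for a dry run.

```shell
# Command used to invoke ipmitool; overridable for dry runs (assumption).
IPMITOOL=${IPMITOOL:-sudo ipmitool}

# Hypothetical helper: clear any next-boot override, then re-check it.
# $1 = short hostname, $2 = datacenter.
# Returns 0 only if the bitmask reads 0000000000 after the reset.
clear_boot_override() {
    target="$1.mgmt.$2.wmnet"
    # $IPMITOOL is intentionally unquoted so "sudo ipmitool" splits into words.
    $IPMITOOL -I lanplus -H "$target" -U root -E chassis bootdev none || return 1
    $IPMITOOL -I lanplus -H "$target" -U root -E chassis bootparam get 5 |
        grep -q 'Boot parameter data: 0000000000'
}

# Example usage (not executed here):
#   clear_boot_override "$HOST" "$DC" || echo "override still present"
```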

Fix Commands

Set the IPMI password

If the IPMI password is suspected to have gone out of sync with the management one, it can be set again. SSH into the management interface of the host and run:

  • for DELL hosts:
    racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 ${PASSWORD}
    
  • for HP hosts:
    set /map1/accounts1/root password=${PASSWORD}
    

Reset the management card

If the management card is unresponsive to IPMI (and possibly to ping) but SSH is still working, a card reset can be attempted. This only restarts the card's OS and, in theory, does not affect the underlying host. To reset the management card, SSH into the management interface of the host and run:

  • for DELL hosts:
    racadm racreset
    
  • for HP hosts:
    reset /map1
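
When scripting across mixed hardware, the vendor-specific reset command can be picked with a small lookup, sketched below. The function name mgmt_reset_cmd and the vendor labels it accepts are hypothetical; the two commands are the ones listed above.

```shell
# Hypothetical helper: print the management card reset command for a vendor.
# $1 = vendor label ("dell" or "hp", case variants accepted).
mgmt_reset_cmd() {
    case "${1:-}" in
        dell|Dell|DELL) echo "racadm racreset" ;;
        hp|HP)          echo "reset /map1" ;;
        *)
            echo "unknown vendor: ${1:-}" >&2
            return 1
            ;;
    esac
}

# Example usage (not executed here): print the command, then run it by hand
# in an SSH session on the management interface.
#   mgmt_reset_cmd dell
```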
    

Power drain the host

For a full cold reset the host must be shut down and the power cables removed (drain). This is usually the last resort, when ping, IPMI and SSH are all failing and every other attempt to fix the issue has failed.