You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Difference between revisions of "Management Interfaces"

From Wikitech-static
imported>Volans
 
imported>Jcrespo
(→‎Is there any overriding for next boot?: update deprecated reimage script)
(11 intermediate revisions by 5 users not shown)
Revision as of 09:15, 11 October 2021

List of troubleshooting techniques and fixes for the most common IPMI and management interfaces issues.

General advice

How to execute remote IPMI commands

SSH into one of the hosts with the cluster::management Puppet role applied (cumin[1001,2002] at the time of writing) and run ipmitool. It will ask for the management password, which is stored in pwstore.
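
The remote commands on this page all share the same ipmitool prefix. As a minimal sketch, the helpers below only print the command that would be run, so it can be reviewed before executing it on a cluster::management host. The function names mgmt_fqdn and ipmi_remote_cmd are hypothetical and not part of any existing tooling.

```bash
#!/bin/bash
# Hypothetical helpers (not part of any existing tooling): build the mgmt
# FQDN and print, without executing, the remote ipmitool command for a host.

# mgmt_fqdn HOST DC -> HOST.mgmt.DC.wmnet
mgmt_fqdn() {
    printf '%s.mgmt.%s.wmnet\n' "$1" "$2"
}

# ipmi_remote_cmd HOST DC SUBCOMMAND... -> prints the full command line
ipmi_remote_cmd() {
    local host="$1" dc="$2"
    shift 2
    printf 'sudo ipmitool -I lanplus -H %s -U root -E %s\n' \
        "$(mgmt_fqdn "$host" "$dc")" "$*"
}

# Example: print the power status command for db2063 in codfw.
ipmi_remote_cmd db2063 codfw chassis power status
```

Printing instead of executing keeps the sketch safe to run anywhere; the printed line can be pasted into a shell on the management host.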

Troubleshooting Commands

Does IPMI work locally?

SSH into the host (not mgmt) and run:

sudo ipmi-chassis --get-chassis-status

The typical error is: ipmi_cmd_get_chassis_status: internal system error or driver timeout

Does IPMI work remotely?

Execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis power status

The typical error on failure is: Error: Unable to establish IPMI v2 / RMCP+ session

If it fails very quickly:

The IPMI password may have gone out of sync with the host one (was the host rebooted recently?). See "Set the IPMI password" below for how to set it again.

Is remote IPMI enabled?

SSH into the host (not mgmt) and run (on Ubuntu Trusty use bmc-config instead of ipmi-config):

sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Access_Mode=Always_Available" --key-pair="Lan_Channel:Non_Volatile_Access_Mode=Always_Available" --diff

If there is no output it means that remote IPMI is enabled and this configuration is good. If there is any diff shown it means that the current values are not the correct ones.

Re-run the same command replacing --diff with --commit to change the config. Verify it again after the commit.

See "Enable remote IPMI access (over LAN) without local host access" below for how to do the same from the web mgmt interface when no local host access is available (e.g. a new host).
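
The "no output means good" rule for --diff can be wrapped in a small check. This is only a sketch: check_diff_output is a hypothetical name, and it assumes that ipmi-config prints any differences on standard output.

```bash
#!/bin/bash
# Hypothetical helper: read the output of an `ipmi-config ... --diff` run on
# stdin and report whether the configuration matches the expected values.
# Assumption: differences are printed on standard output.
#
# Intended use (not executed here):
#   sudo ipmi-config --section=Lan_Channel --key-pair="..." --diff | check_diff_output
check_diff_output() {
    local diff_output
    diff_output="$(cat)"
    if [ -z "$diff_output" ]; then
        echo "OK: no diff, configuration matches"
        return 0
    fi
    echo "MISMATCH: re-run with --commit to fix:"
    printf '%s\n' "$diff_output"
    return 1
}
```

The same helper applies to the permissions check in the next section, since that check also relies on an empty --diff output.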

Are IPMI permissions set correctly?

SSH into the host (not mgmt) and run (on Ubuntu Trusty use bmc-config instead of ipmi-config):

sudo ipmi-config --section=Lan_Channel --key-pair="Lan_Channel:Volatile_Channel_Privilege_Limit=Administrator" --key-pair="Lan_Channel:Non_Volatile_Channel_Privilege_Limit=Administrator" --diff

If there is no output it means that the user is configured correctly. If there is any diff shown it means that the current values are not the correct ones.

Re-run the same command replacing --diff with --commit to change the config. Verify it again after the commit.

Did you do a reset but are still getting an IPMI connection failure (when using the reimage cookbook)?

Try logging in on mgmt with the regular mgmt password, and then re-set the same password with:

racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 <password>

Then try again. In at least one case this fixed a remote IPMI connection failure that occurred when running the reimage cookbook.

Is there any overriding for next boot?

The BIOS Boot_Device is managed by Puppet, so it should be correct. This can be validated using the custom ipmi_chassis fact. A clean configuration should return NO-OVERRIDE for the ipmi_chassis.boot_flags.device fact:

$ sudo facter -p ipmi_chassis.boot_flags.device

Alternatively, one can execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootparam get 5

Typical output for clean settings with no overrides is:

Boot parameter version: 1
Boot parameter 5 is valid/unlocked
Boot parameter data: 0000000000
 Boot Flags :
   - Boot Flag Invalid
   - Options apply to only next boot
   - BIOS PC Compatible (legacy) boot
   - Boot Device Selector : No override
   - Console Redirection control : System Default
   - BIOS verbosity : Console redirection occurs per BIOS configuration setting (default)
   - BIOS Mux Control Override : BIOS uses recommended setting of the mux at the end of POST

In case of overrides, the Boot parameter data bitmask will differ from 0000000000 and the lines below it will show the overridden values.

The sre.hosts.reimage cookbook automatically checks that the host has the PXE Boot bit set before rebooting it, and prints a warning after the reimage if the parameters have not all been reset to their default values.

If you need to manually reset the parameters to remove any override, run:

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootdev none

And then re-check if the change has been applied.
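
The check on the Boot parameter data line can be sketched as a small filter. boot_override_state is a hypothetical helper, and the labels it prints are chosen for this example only; it relies solely on the "Boot parameter data:" line shown in the sample output above.

```bash
#!/bin/bash
# Hypothetical helper: read `chassis bootparam get 5` output on stdin and
# print NO-OVERRIDE when the "Boot parameter data" value is all zeros,
# OVERRIDE when it is not, UNKNOWN when the line is missing.
#
# Intended use (not executed here):
#   sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E \
#       chassis bootparam get 5 | boot_override_state
boot_override_state() {
    local data
    # Keep only the hex digits that follow the label on the first match.
    data="$(sed -n 's/^Boot parameter data:[[:space:]]*\([0-9A-Fa-f]*\).*/\1/p' | head -n 1)"
    if [ -z "$data" ]; then
        echo "UNKNOWN"
        return 2
    fi
    case "$data" in
        *[!0]*) echo "OVERRIDE"; return 1 ;;
        *)      echo "NO-OVERRIDE"; return 0 ;;
    esac
}
```

The non-zero exit codes make the helper usable in a condition, for example to decide whether chassis bootdev none needs to be run.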

Does IPMI work but SSH to the management console doesn't?

In this case it's possible to reset the management card, see below.

Fix Commands

Set the IPMI password

If you suspect that the IPMI password has gone out of sync with the management one, you can set it again. SSH into the management interface of the host and run:

  • for DELL hosts:
    racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 ${PASSWORD}
    
  • for HP hosts:
    set /map1/accounts1/root password=${PASSWORD}
    

Enable remote IPMI access (over LAN) without local host access

For HP iLO 4 and lower, the option is under:

Administration > Access Settings > IPMI/DCMI

For HP iLO 5, the option is under:

Security > Access Settings > Edit Network settings

Reset the management card

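Warning: Wait at least a couple of minutes after executing the commands below before proceeding with any test, to let the card OS restart. To verify that it has restarted, try to SSH into the management interface.

If the management card is unresponsive to IPMI, SSH or ping but at least one of the following options is available, a card reset can be attempted. It will just restart the card OS, not affecting (in theory) the underlying host.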

From the SSH console

To reset the management card, SSH into the management interface of the host and run:

  • for DELL hosts:
    racadm racreset
    
  • for HP hosts:
    reset /map1
    

From local IPMI

To reset the management card via local IPMI, SSH into the host (not mgmt) and run:

bmc-device --cold-reset; echo $?

The command prints nothing on success, so the exit code is echoed to confirm that it ran correctly (0 means success).
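
For commands that are silent on success, a small wrapper can make the result explicit. run_and_report is a hypothetical name used only for this sketch.

```bash
#!/bin/bash
# Hypothetical wrapper: run a command that is silent on success and report
# its exit code explicitly. The wrapper returns the same exit code.
#
# Intended use (not executed here):
#   run_and_report bmc-device --cold-reset
run_and_report() {
    local rc=0
    "$@" || rc=$?
    if [ "$rc" -eq 0 ]; then
        echo "OK: '$*' exited with 0"
    else
        echo "FAILED: '$*' exited with $rc" >&2
    fi
    return "$rc"
}
```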

From remote IPMI

To reset the management card via remote IPMI, execute this remote IPMI command (see Management Interfaces#How to execute remote IPMI commands):

sudo ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E mc reset cold

For the full list of available commands to manage the management card via remote IPMI, see man ipmitool and search for the section titled mc \| bmc.
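
Whichever reset method is used, the card OS needs some time to restart before further tests are meaningful. The retry loop below is a sketch of how to wait for it. wait_until is a hypothetical helper, and the probe in the usage comment assumes that an nc binary supporting -z and -w is installed.

```bash
#!/bin/bash
# Hypothetical helper: retry a probe command until it succeeds.
# Usage: wait_until ATTEMPTS DELAY_SECONDS COMMAND...
#
# Intended use (not executed here; assumes nc with -z and -w is available):
#   wait_until 20 15 nc -z -w 5 "$HOST.mgmt.$DC.wmnet" 22
wait_until() {
    local attempts="$1" delay="$2" i=1
    shift 2
    while [ "$i" -le "$attempts" ]; do
        if "$@"; then
            echo "probe succeeded on attempt $i"
            return 0
        fi
        # Pause only if another attempt follows.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    echo "probe still failing after $attempts attempts" >&2
    return 1
}
```

A successful probe of the SSH port only shows that the card is listening again; logging in to the management interface remains the actual verification.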

Power drain the host

For a full cold reset the host must be shut down and the power cables removed (drain). This is usually the last resort when ping, IPMI and SSH are all failing and no other attempt to fix it has worked.

Change the mgmt password

cookbook

To change the mgmt (SSH) password there is a Spicerack cookbook called sre.hosts.ipmi-password-reset. Connect to a cumin server and run it like:

@cumin1001:~$ sudo cookbook sre.hosts.ipmi-password-reset 'db2*'

In this example we are running it on all db hosts in codfw, selected by host name with a wildcard. You will be asked to enter the current mgmt password, followed twice by the new password.

ipmitool

After using the cookbook above there might be some failures to "establish an IPMI session". Usually these are HP hosts, and it depends on their iLO version.

The next step is to directly run ipmitool yourself. Example:

ipmitool -I lanplus -H db2063.mgmt.codfw.wmnet -U root -E user set password 1 <password> 16

Note you use the mgmt interface name, not the server name.
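
As a sketch of the naming rule, the helper below converts a server name into its mgmt interface name. to_mgmt_name is a hypothetical helper, and it assumes that server names follow the HOST.DC.wmnet pattern used elsewhere on this page; names that do not match are printed unchanged.

```bash
#!/bin/bash
# Hypothetical helper: turn HOST.DC.wmnet into HOST.mgmt.DC.wmnet.
# Names that already contain ".mgmt." or that do not match the assumed
# HOST.DC.wmnet pattern are printed unchanged.
to_mgmt_name() {
    case "$1" in
        *.mgmt.*) printf '%s\n' "$1" ;;
        *) printf '%s\n' "$1" |
               sed 's/^\([^.]*\)\.\([^.]*\)\.wmnet$/\1.mgmt.\2.wmnet/' ;;
    esac
}

# Example: convert the server name used in the command above.
to_mgmt_name db2063.codfw.wmnet
```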

Replace <password> with the new password. You will be asked for the current (old) password interactively or you can use -f to read it from a file.

user slot

The "1" before the password means we are using user slot 1. Dell servers usually use slot 2 and HP servers usually use slot 1. To be sure, always check whether your password change was successful by connecting via SSH to the mgmt interface.

You can also list the user slots with this command, for example:

ipmitool -I lanplus -H ms-be1039.mgmt.eqiad.wmnet -U root -E user list

The ID column in the output of this command should match your slot number.

The "16" after the password selects the password size that ipmitool uses (16 or 20 bytes); it is not a minimum length.
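
Looking up the slot can be sketched as a filter over the user list output. slot_for_user is a hypothetical helper, and it assumes that the first column of each data row is the numeric ID and the second column is the user name; check this against the real output before relying on it.

```bash
#!/bin/bash
# Hypothetical helper: read `ipmitool ... user list` output on stdin and
# print the ID of the first slot whose name equals the given user name.
# Assumption: column 1 is the numeric ID and column 2 is the user name.
# Exits with status 1 when no matching slot is found.
#
# Intended use (not executed here):
#   ipmitool -I lanplus -H ms-be1039.mgmt.eqiad.wmnet -U root -E user list \
#       | slot_for_user root
slot_for_user() {
    awk -v user="$1" '
        $1 ~ /^[0-9]+$/ && $2 == user { print $1; found = 1; exit }
        END { if (!found) exit 1 }
    '
}
```

Slots with an empty name shift the remaining columns to the left, so a value from another column could in principle be mistaken for a name; the SSH login test described above remains the authoritative check.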

running ipmitool on multiple servers

You can make a simple text file with the list of failed servers (taken from the output of the cookbook, but with regular expressions expanded into a list of all FQDNs) and then run ipmitool in a loop, for example:

[cumin1001:~] $ for host in $(cat failures); do echo "$host"; ipmitool -I lanplus -H "$host" -U root -E user set password 1 <PASSWORD> 16; sleep 1; done

Here "failures" is the text file with host names, "1" is the user slot and "16" is the password size again. Replace <PASSWORD> with the actual password. You will be asked for the current password interactively for each host, or you can use -f to provide it from a file.
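
Before running the loop for real, it can help to print the commands it would execute. The sketch below does only that. print_password_cmds is a hypothetical helper, and skipping blank lines and lines starting with "#" is a convenience added for this example.

```bash
#!/bin/bash
# Hypothetical dry-run helper: for every non-empty, non-comment line of a
# hosts file, print the ipmitool command that would change the password.
# The password is left as the <PASSWORD> placeholder on purpose.
# Usage: print_password_cmds HOSTS_FILE USER_SLOT
print_password_cmds() {
    local file="$1" slot="$2" host
    # The extra test keeps a last line that lacks a trailing newline.
    while IFS= read -r host || [ -n "$host" ]; do
        case "$host" in
            ''|'#'*) continue ;;
        esac
        printf 'ipmitool -I lanplus -H %s -U root -E user set password %s <PASSWORD> 16\n' \
            "$host" "$slot"
    done < "$file"
}
```

For example, print_password_cmds failures 1 lists one command per host in the "failures" file, using user slot 1.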

racadm (Dell)

There might be some cases where IPMI is not working, because IPMI over LAN is disabled in BIOS or because it needs a reset. In these cases you might get failures using ipmitool but you can still ssh to the mgmt interface using an existing/old password. If it's a Dell server you can change the password there using:

racadm set iDRAC.Users.2.Password <Password>

Where <Password> needs to be replaced and the number "2" refers to the same slot number referred to in the ipmitool section above. (If it doesn't work with slot 2, check if it's slot 1).

HP ILO

If it's an HP server with a newer iLO version, SSH to the mgmt interface and run:

set /map1/accounts1/root password=<Password>


Alternatively it's possible to change the password via a web browser UI.

tunnel to web interface (on an HP)

Create an SSH tunnel that jumps via a cumin host, for example:

ssh -L 8000:db2056.mgmt.codfw.wmnet:443 cumin2001.codfw.wmnet

and keep the connection open.

In a browser connect to https://localhost:8000/ and create an exception for the certificate error.

Other useful ipmitool commands

Force PXE boot

ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootdev pxe

Show boot parameter

ipmitool -I lanplus -H "$HOST.mgmt.$DC.wmnet" -U root -E chassis bootparam get 5