You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

SRE/Dc-operations/Platform-specific documentation/Dell Documentation: Difference between revisions

imported>Volans
m (→‎Initial System Setup: Add troubleshooting section)
imported>RobH
Revision as of 19:54, 6 July 2022

  • Lights Out Manager: Dell iDRAC7 (12/13 Gen Servers, R4[23]), Dell iDRAC6 (Dell 11 Gen servers, R410)
  • We always purchase the enterprise version and license, allowing for a dedicated network port, rather than using the ports on the primary Ethernet interfaces.

Lights Out Management

The iDRAC/7 is very similar to iDRAC/6. The initial command line prompt is identical:

 /admin1->

Common Actions

Show logs

racadm getsel
racadm lclog view

Reboot and boot from network then console

racadm config -g cfgServerInfo -o cfgServerBootOnce 1
racadm config -g cfgServerInfo -o cfgServerFirstBootDevice PXE
racadm serveraction powercycle
console com2

Note: If PXE boots are failing, try Ctrl+S to enter the card-level setup for Broadcom (if applicable), and set the boot protocol to PXE instead of NONE inside the card settings. This worked for some new 10G cards in R620s (BCM578xx).

Reboot and boot into BIOS then console

racadm config -g cfgServerInfo -o cfgServerBootOnce 1
racadm config -g cfgServerInfo -o cfgServerFirstBootDevice BIOS
racadm serveraction powercycle
console com2

Connecting to mgmt interface

  • Via SSH
  • ssh root@servername.mgmt.datacenter.wmnet
  • Example: ssh root@bast1001.mgmt.eqiad.wmnet
  • Via Browser
  • Please note you will have to override the warning for an unknown (self-signed) certificate. Do not save the exception permanently: having several of these saved tends to cause errors when connecting to other Dell DRAC interfaces via HTTPS.
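
The naming pattern above can be expressed as a small helper. This is only a sketch: the function names are made up for illustration, and it assumes every management interface follows the servername.mgmt.datacenter.wmnet pattern shown in the example.

```python
def mgmt_fqdn(hostname, datacenter):
    """Build the management FQDN from a short hostname and a datacenter name."""
    if "." in hostname:
        raise ValueError("expected a short hostname, not an FQDN")
    return f"{hostname}.mgmt.{datacenter}.wmnet"


def mgmt_ssh_command(hostname, datacenter):
    """Return the ssh command line used to reach the iDRAC as root."""
    return f"ssh root@{mgmt_fqdn(hostname, datacenter)}"


print(mgmt_ssh_command("bast1001", "eqiad"))
```

The helper only builds the string; it does not open a connection.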

Connecting to Serial Console

  • Attach to the serial console: console com2
  • Detach from serial console: ctrl+\
  • Console Redirection Key Mappings:
Use the <ESC><0> key sequence for <F10>
Use the <ESC><!> key sequence for <F11>
Use the <ESC><@> key sequence for <F12>
Use the <ESC><Ctrl><M> key sequence for <Ctrl><M>
Use the <ESC><Ctrl><H> key sequence for <Ctrl><H>
Use the <ESC><Ctrl><I> key sequence for <Ctrl><I>
Use the <ESC><Ctrl><J> key sequence for <Ctrl><J>
Use the <ESC><X><X> key sequence for <Alt><x>, where x is any letter key, and X is the upper case of that key
Use the <ESC><R><ESC><r><ESC><R> key sequence for <Ctrl><Alt><Del>
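
For scripted console sessions, the mappings above can be kept in a small lookup table. The sketch below encodes only the sequences listed in this section; the names KEY_SEQUENCES, alt_sequence and ctrl_sequence are illustrative, and it assumes <ESC> means the ASCII escape byte (0x1b).

```python
ESC = "\x1b"

# Fixed sequences, taken from the list above.
KEY_SEQUENCES = {
    "F10": ESC + "0",
    "F11": ESC + "!",
    "F12": ESC + "@",
    "Ctrl+Alt+Del": ESC + "R" + ESC + "r" + ESC + "R",
}


def alt_sequence(letter):
    """<ESC><X><X> for <Alt><x>: ESC, then the upper-case letter twice."""
    if len(letter) != 1 or not letter.isalpha():
        raise ValueError("expected a single letter")
    upper = letter.upper()
    return ESC + upper + upper


def ctrl_sequence(letter):
    """<ESC><Ctrl><x> for the four control keys listed above (M, H, I, J)."""
    upper = letter.upper()
    if upper not in ("M", "H", "I", "J"):
        raise ValueError("only Ctrl+M, Ctrl+H, Ctrl+I and Ctrl+J are documented")
    # Ctrl+<letter> is the letter's ASCII code minus 64.
    return ESC + chr(ord(upper) - 64)
```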

If you get locked out of the console, you can reset the iDRAC:

racadm racreset

This can happen if your network connection dies while you are logged into a console session.

Power cycling

Log in with SSH on the mgmt interface:

 racadm serveraction action
  • Where action is one of the following:
  • powerdown - power server off
  • powerup - power server on
  • powercycle - perform server power cycle
  • hardreset - force hard server power reset
  • powerstatus - display current power status of server
  • Alternatively, after logging in, use the following commands in the SM CLP shell:
 reset /system1
 stop /system1
 start /system1
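
When these commands are generated by a script, it helps to reject anything that is not one of the actions listed above. A minimal sketch, with the function name serveraction chosen here for illustration:

```python
# Actions and descriptions as listed above.
POWER_ACTIONS = {
    "powerdown": "power server off",
    "powerup": "power server on",
    "powercycle": "perform server power cycle",
    "hardreset": "force hard server power reset",
    "powerstatus": "display current power status of server",
}


def serveraction(action):
    """Return the racadm command line for a documented power action."""
    if action not in POWER_ACTIONS:
        raise ValueError(f"unknown action {action!r}, expected one of {sorted(POWER_ACTIONS)}")
    return f"racadm serveraction {action}"


print(serveraction("powerstatus"))
```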

Administrative Actions

Updating Firmware

Dell firmware can be updated via the iDRAC command line (which requires an FTP server; we are still working on this setup) or via the HTTPS mgmt interface. When updating via the HTTPS interface, you can queue up multiple firmware updates for BIOS, NIC, RAID, and/or iDRAC together; it applies all the non-iDRAC updates first and the iDRAC update last.

Dell firmware can be downloaded directly from Dell, without login, via the 'Dell Config' link under every server's netbox details page. Click Dell Config, then 'Drivers and Downloads' on the Dell Support Site.

<todo: list standard file format naming and firmware names required>

Commonly updated firmwares: idrac, bios, network, raid, backplane.

Urgent Firmware Revision Notices:
  • Broadcom NetXtreme-E firmware for 10G NICs should only be upgraded to 21.60.22.11, as 22.00.07.60 breaks the installer.
  • iDRAC should not be upgraded to 6.00.00.00 (breaks HTTPS mgmt access); cap at 5.10.30.00.
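
A version check like the one below can guard against accidentally queuing a firmware file above these caps. It is a sketch under assumptions: the component keys are invented for this example, the caps are read as "at most this version", and Dell version strings are assumed to be dot-separated integers.

```python
# Maximum versions from the notices above (component keys are illustrative).
FIRMWARE_CAPS = {
    "idrac": "5.10.30.00",
    "broadcom-10g-nic": "21.60.22.11",
}


def parse_version(version):
    """Turn '5.10.30.00' into (5, 10, 30, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def within_cap(component, version):
    """True if the version is at or below the cap; components without a cap always pass."""
    cap = FIRMWARE_CAPS.get(component)
    if cap is None:
        return True
    return parse_version(version) <= parse_version(cap)
```

For example, within_cap("idrac", "6.00.00.00") is False, while within_cap("idrac", "5.10.30.00") is True.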

Rolling back Firmware updates

The rollback feature is a tab in the HTTPS interface, next to the update tab. If the HTTPS interface is unavailable (for example after updating iDRAC/8 past version 2.51), connect a crash cart and reboot the system into the Lifecycle Controller (this requires a key press during POST via the crash cart). Then go to iDRAC Settings > Update and Rollback firmware > Rollback firmware, where you can select the version/increment to roll back to.

NICs

You can identify which type of NIC the server has from the web interface under System -> Inventory -> Hardware inventory (there are multiple pages).

The Broadcom 10G NICs will be something like "NIC in Slot N Port Y - PCI Device" and the Description field will contain the exact model, typically NetXtreme-E. Alternatively if you have OS access to the host, lshw will display the same information.

With the NIC model known, you can download the driver from the Netbox shortlink. The web interface accepts firmware in Windows (.exe) format.

HTTPS Mgmt Interface
  • This requires that you have an SSH tunnel into our mgmt network via a cumin host.
  • Pull up and log in (user root, and the mgmt password) to the HTTPS mgmt interface. Example: https://mw1446.mgmt.eqiad.wmnet
  • Maintenance > System Update
  • Please note when updating firmware, you can upload multiple files and apply them all in a single batch action, rather than individually uploading and applying. Simply upload the files, and once all are uploaded check all boxes and click 'Install and Reboot'.
    • The order of operations seems to apply Bios first, then network (and raid/backplane, not tested for order), then idrac last when batched together.
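
One possible way to build the SSH tunnel mentioned above is a local port forward. This is an assumption, not the documented team procedure (a SOCKS proxy or another setup may be what is actually used), and the jump host name and local port in the example are placeholders.

```python
def tunnel_command(mgmt_host, jump_host, local_port=8443):
    """Return an ssh argv that forwards a local port to the iDRAC HTTPS port.

    jump_host is whichever cumin host you have access to (placeholder here).
    """
    if not 1 <= local_port <= 65535:
        raise ValueError("local_port must be between 1 and 65535")
    # -N: no remote command, -L: forward local_port to mgmt_host:443 via jump_host
    return ["ssh", "-N", "-L", f"{local_port}:{mgmt_host}:443", jump_host]


print(" ".join(tunnel_command("mw1446.mgmt.eqiad.wmnet", "cumin.example")))
```

With such a forward in place the interface would be reachable at https://localhost:8443, again with a certificate warning.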

Polling for MAC Address

  • SSH into iDRAC interface.
  • Info command:
 nicstatistics
  • Ensure you pick out the proper MAC address for the correct interface (careful of 1G and 10G numbering):
racadm>>nicstatistics
NIC.Embedded.1-1-1:Broadcom Gigabit Ethernet BCM5720 - 2C:EA:7F:7F:C0:B2
PartitionCapable :                            Not Capable
NIC.Embedded.2-1-1:Broadcom Gigabit Ethernet BCM5720 - 2C:EA:7F:7F:C0:B3
PartitionCapable :                            Not Capable
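
To avoid picking the wrong MAC by eye, the output above can be parsed into a mapping from interface name to MAC address. A minimal sketch, assuming every interface line has the "name:model - MAC" shape of the sample; the function name is illustrative.

```python
import re

SAMPLE = """\
NIC.Embedded.1-1-1:Broadcom Gigabit Ethernet BCM5720 - 2C:EA:7F:7F:C0:B2
PartitionCapable :                            Not Capable
NIC.Embedded.2-1-1:Broadcom Gigabit Ethernet BCM5720 - 2C:EA:7F:7F:C0:B3
PartitionCapable :                            Not Capable
"""

# Interface name, then ':', the model, ' - ', and a six-octet MAC address.
NIC_LINE = re.compile(
    r"^(?P<name>NIC\.[\w.-]+):(?P<model>.+?) - "
    r"(?P<mac>(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2})\s*$"
)


def parse_nicstatistics(output):
    """Return {interface name: MAC} for every interface line in the output."""
    macs = {}
    for line in output.splitlines():
        match = NIC_LINE.match(line.strip())
        if match:
            macs[match.group("name")] = match.group("mac")
    return macs


for name, mac in parse_nicstatistics(SAMPLE).items():
    print(name, mac)
```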

Changing iDRAC User Password

  • SSH into iDRAC interface.
  • Change command:
 racadm config -g cfgUserAdmin -o cfgUserAdminPassword -i 2 <newpassword>

iDRAC SSH Key Based Authentication

  • This is not yet standard on all hosts. RobH is working on it, and is merely listing the commands here for posterity.
  • On Dell systems, the root user is userid # 2. (#1 is given to the disabled anon access user.)
  • We only assign a single ssh key per user, though up to 4 can be assigned.
  • SSH into iDRAC interface.
  • List all keys assigned to a user
  • Specific key:
 racadm sshpkauth -i <2 to 16> -v -k <1 to 4>
  • All keys:
 racadm sshpkauth -i <2 to 16> -v -k all
  • List root user keys:
 racadm sshpkauth -i 2 -v -k all
  • Add key:
 racadm sshpkauth -i <2 to 16> -k <1 to 4> -t <key-text>
  • Add key to slot one for root user:
 racadm sshpkauth -i 2 -k 1 -t "contents of the public key file line"
  • Delete key:
  • Specific Key:
 racadm sshpkauth -i <2 to 16> -d -k <1 to 4>
  • All Keys:
 racadm sshpkauth -i <2 to 16> -d -k all
  • Delete all root user keys:
 racadm sshpkauth -i 2 -d -k all
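
The three command forms above differ only in their flags, so a script can build them from one helper. This sketch validates the ranges given above (user index 2 to 16, key slot 1 to 4 or all); the function name and the "view"/"add"/"delete" keywords are invented for this example.

```python
def sshpkauth(action, user_index, key_slot="all", key_text=None):
    """Build a racadm sshpkauth command line for view, add or delete."""
    if not 2 <= user_index <= 16:
        raise ValueError("user index must be between 2 and 16")
    if key_slot != "all" and key_slot not in (1, 2, 3, 4):
        raise ValueError("key slot must be 1 to 4, or 'all'")
    base = f"racadm sshpkauth -i {user_index}"
    if action == "view":
        return f"{base} -v -k {key_slot}"
    if action == "delete":
        return f"{base} -d -k {key_slot}"
    if action == "add":
        if key_slot == "all" or not key_text:
            raise ValueError("adding a key needs a specific slot and the key text")
        if '"' in key_text:
            raise ValueError("key text must not contain double quotes")
        return f'{base} -k {key_slot} -t "{key_text}"'
    raise ValueError(f"unknown action {action!r}")


print(sshpkauth("view", 2))
```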

Changing the iDRAC Network IP Settings

 racadm setniccfg -s <ipaddress> <subnetmask> <gateway>

Enable / Disable IPMI over DRAC

 racadm config -g cfgIpmiLan -o cfgIpmiLanEnable <0 or 1>
  • 0 is off, 1 is on.

Setting a one-time boot option

Sometimes you want the server to reboot into a network boot, or into bios directly. You can set one-time boot options with the following on the mgmt SSH command line:

 racadm config -g cfgServerInfo -o cfgServerBootOnce 1
 racadm config -g cfgServerInfo -o cfgServerFirstBootDevice <BOOT OPTION>
  • Valid boot option targets: No-Override, PXE, HDD, DIAG, CD-DVD, BIOS, vFDD, VCD-DVD, iSCSI, VFLASH partition label, FDD, SDe, RFS (Remote File Share)
  • We most commonly just make use of PXE & BIOS
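
The two-step sequence above, optionally followed by a power cycle, can be generated for any of the listed targets. A minimal sketch: the function name is illustrative, and the free-form "VFLASH partition label" target is left out because its value depends on the host.

```python
# Fixed boot targets from the list above.
BOOT_TARGETS = (
    "No-Override", "PXE", "HDD", "DIAG", "CD-DVD", "BIOS",
    "vFDD", "VCD-DVD", "iSCSI", "FDD", "SDe", "RFS",
)


def one_time_boot(target, powercycle=False):
    """Return the racadm commands that set a one-time boot target."""
    if target not in BOOT_TARGETS:
        raise ValueError(f"unknown boot target {target!r}")
    commands = [
        "racadm config -g cfgServerInfo -o cfgServerBootOnce 1",
        f"racadm config -g cfgServerInfo -o cfgServerFirstBootDevice {target}",
    ]
    if powercycle:
        commands.append("racadm serveraction powercycle")
    return commands


for command in one_time_boot("PXE", powercycle=True):
    print(command)
```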

Changing the User Defined String for the front LCD

  • Please note only models with LCD have this set: R620, R720
  • Checking the string:
 racadm get System.LCD.LCDUserString
  • Setting new string:
 racadm set System.LCD.LCDUserString newstring
  • We need to test this on a new system out of box. Rather than set the display User specified string and string in DRAC setup, skip that, and attempt to set with this single command afterwards. Ideally populating this via the iDrac command line will force it to also be set to display on LCD (rather than default service tag.) Please let RobH know the result of this test (or just put it in here!) --RobH (talk)

Troubleshooting

  • 10Gb NIC systems will occasionally need to have the legacy PXE boot option specifically set in the NIC bios. If you have it hitting the PXE boot step, and simply halting, this setting is not correct.

Initial System Setup

Automatic setup

The initial setup for Dell servers can be done by running the sre.hosts.provision cookbook. The only prerequisite is that the server is racked and connected to both power and network cables. No additional step is needed, and there is no need to attach a physical console to the host.

For a new host just run from within a tmux/screen session the cookbook with the hostname (not the FQDN) as it is defined in Netbox, for example:

sudo cookbook sre.hosts.provision example1001

See the cookbook's help (with -h or --help) for more details on all the steps performed. Its output as of Dec. 2021 is:

$ sudo cookbook sre.hosts.provision -h
usage: cookbook [-h] [--no-dhcp] [--no-users] [--enable-virtualization] host

Provision a new physical host setting up it's BIOS, management console and NICs.

    Actions performed:
        * Validate that the host is a physical host and the vendor is supported (only Dell at this time)
        * Fail if the host is active on Netbox but --no-dhcp and --no-users are not set as a precautionary measure
        * [unless --no-dhcp is set] Setup the temporary DHCP so that the management console can get a connection and
          become reachable
        * Get the current configuration for BIOS, management console and NICs
        * Modify the common settings
          * [if --enable-virtualization is set] Leave virtualization enabled, by default it gets disabled
        * Push back the whole modified configuration
        * Checks that it can still connect to Redfish API
        * Checks that the configuration has been applied correctly dumping the new configuration and trying to apply
          the same changes. In case it detects any non-applied configuration will prompt the user what to do. It can
          retry to apply them, or the user can apply them manually (via web console or ssh) and then skip the step.
        * [unless --no-users is set] Update the root's user password with the production management password
        * Checks that it can connect via remote IPMI

    Usage:
        cookbook sre.hosts.provision example1001
        cookbook sre.hosts.provision --enable-virtualization example1001
        cookbook sre.hosts.provision --no-dhcp --no-users example1001



positional arguments:
  host                  Short hostname of the host to provision, not FQDN

optional arguments:
  -h, --help            show this help message and exit
  --no-dhcp             Skips the DHCP setting, assuming that the management console is already reachable (default: False)
  --no-users            Skips changing the root's user password from Dell's default value to the management one. Uses the management passwords also for the first connection (default:
                        False)
  --enable-virtualization
                        Keep virtualization capabilities on. They are turned off if not speficied. (default: False)
Troubleshooting
Failed to perform GET request to https://$HOSTNAME.mgmt.$DC.wmnet/redfish

If the cookbook fails early on in its run with Failed to perform GET request to https://$HOSTNAME.mgmt.$DC.wmnet/redfish, the most likely reason is that the iDRAC was unable to get an IP address from the DHCP. This might happen for various reasons, one of which is a typo in the device's Service Tag in Netbox. In order to troubleshoot the issue follow these steps:

  1. On the cumin host from which the cookbook was run, find the DHCP snippet that was sent to the install server for the provisioned host. It can be extracted from the cookbook log with: eval $(sudo grep '/usr/bin/base64' /var/log/spicerack/sre/hosts/provision.log | grep $MY_HOSTNAME | tail -n1 | grep -o "/bin/echo.*base64 -d") (where $MY_HOSTNAME is the hostname passed to the cookbook).
  2. On the install server run a tcpdump for a minute or so to check what DHCP requests are coming and what's their Hostname Option with: sudo tcpdump -vvv 'udp and (src port 67 or src port 68 or src port 69)' | grep 'Hostname Option'
  3. Check if the Service Tag from the DHCP setting is present in the tcpdump output. If not, it is possible that the Service Tag in Netbox does not match that of the device. Correct it in Netbox before retrying the cookbook.
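
The comparison in step 3 can be scripted once the tcpdump output has been saved. This is a sketch under assumptions: the sample capture lines and Service Tags below are made up, the exact tcpdump line layout may differ, and the match is done case-insensitively as a plain substring search.

```python
def tag_in_capture(service_tag, capture_lines):
    """True if any 'Hostname Option' line in the capture mentions the Service Tag."""
    needle = service_tag.strip().lower()
    if not needle:
        raise ValueError("empty Service Tag")
    return any(
        "Hostname Option" in line and needle in line.lower()
        for line in capture_lines
    )


# Hypothetical capture lines, for illustration only.
capture = [
    'Hostname Option 12, length 13: "idrac-ABC1234"',
    'Hostname Option 12, length 13: "idrac-XYZ9876"',
]
print(tag_in_capture("ABC1234", capture))
```

If the function returns False for the tag taken from the DHCP snippet, the Service Tag recorded in Netbox is the first thing to check.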

Manual steps

We need to change a number of options in the bios and mgmt configuration to our own specifications. Some of these items MUST be done locally by the on-site technician racking the systems. As such, all of the following steps should be done by the on-site before completing a new server in the racked state.

  • Rack server according to directions on how to do so (including in racking task)
  • Attach physical console to system (keyboard & monitor).
  • Boot system, and enter BIOS by pressing F2 during POST.
  • System POSTS and enters BIOS with screen listing: System Bios, iDRAC Settings, & Device Settings.
  • Please note that on the first boot this will be a GUI that can be driven with the keyboard only. AFTER the serial redirection is set up, this menu will no longer display as a GUI on the physical console, but as a non-graphical command-line type menu system.
  • Enter System Bios.
  • We will be changing the entries for a number of items. If an item is not listed, it doesn't need to change from defaults.
  • Processor Settings
  • Logical Processor set to enable - This is hyperthreading, and sometimes we don't need it. If unsure, ask some application-specific expert. HHVM and Elasticsearch greatly benefit from it.
  • Virtualization Technology set to disabled - leaving this on when not using system for virtual machines leaves a potential security vector.
  • Serial Communication
  • Serial Communication set to: On with console redirection via COM2
  • Serial Port Address set to Serial Device1=COM1,Serial Device2=COM2
  • External Serial Connector: Serial Device 1
  • Failsafe Baud Rate: 115200
  • Remote Terminal Type: VT100/VT200
  • Redirection after boot set to disabled - newer ubuntu versions prefer this during boot, though you may swap it back when troubleshooting PXE boot issues.
  • System Profile Settings
  • System Profile set to Performance Per Watt (OS) - the default dell setting causes power_saving/watchdog kernel threads to spawn and inordinately consume CPU cycles, this setting fixes it.
  • Miscellaneous Settings
  • Asset Tag set to the Asset Tag assigned when system was received into stock.
  • Hit ESC to exit out, saving your changes when prompted. Do not exit the BIOS screen entirely, just go back to the main settings screen (System Bios, iDRAC Settings, & Device Settings).
  • Select iDRAC Settings
  • Network
  • Confirm the following settings:
  • Enable NIC is set to Enabled - should already be set to this, just a double-check
  • Nic Selection is set to Dedicated (iDRAC7 Enterprise only) - should be set to this already, if it won't change, it means the DRAC Enterprise License did not apply correctly during purchase. Please contact RobH or cmjohnson if this occurs so we can get it fixed.
  • Set the Static IP, Static Gateway, & Static Subnet Mask.
  • We don't assign DNS servers to mgmt interfaces.
  • Enable IPMI Over LAN set to enabled - this will allow us to use IPMI commands & scripting in the future.
  • Front Panel Security (Only for systems with front LCD)
  • Set LCD message set to User-Defined String
  • User-defined string set to system name if available, otherwise input asset tag again.
  • User Configuration
  • Set password to the mgmt password
  • Press ESC and save when prompted until you are out of the BIOS entirely; all settings are now in place.
  • All systems must be tested for DRAC and console redirection before the racking and on-site work is complete.
  • Connect to DRAC via SSH
  • Test powercycling, powering down, and powering up.
  • Test console redirection, ensure you can watch system POST via SSH mgmt session.
  • Once testing has passed, system is ready for operations allocation.
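
The System Bios values from the checklist above can be kept as data and compared against what was actually set on a host. This is only a sketch: the keys are the on-screen labels from this page, not the attribute names used by the iDRAC or Redfish, and how the actual values are collected is left to the caller.

```python
# Expected System Bios values, using the on-screen labels from the checklist above.
EXPECTED_BIOS = {
    "Logical Processor": "Enabled",
    "Virtualization Technology": "Disabled",
    "Serial Communication": "On with console redirection via COM2",
    "External Serial Connector": "Serial Device 1",
    "Failsafe Baud Rate": "115200",
    "Remote Terminal Type": "VT100/VT200",
    "Redirection After Boot": "Disabled",
    "System Profile": "Performance Per Watt (OS)",
}


def mismatches(actual, expected=EXPECTED_BIOS):
    """Return {setting: (expected, actual)} for every setting that differs or is missing."""
    return {
        name: (want, actual.get(name))
        for name, want in expected.items()
        if actual.get(name) != want
    }


# Example: a host where virtualization was left enabled.
observed = dict(EXPECTED_BIOS, **{"Virtualization Technology": "Enabled"})
print(mismatches(observed))
```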


External Links

  • Manuals TODO: fix this link
  • [1] TODO: fix this link