SRE/Dc-operations/Platform-specific documentation/HP Documentation

HP ProLiant Server Directions

Please note that our datacenter installations currently include HP ProLiant Gen8, Gen9, and Gen10 servers. This document covers all of them, with details where the commands differ.

  • Lights Out Manager: HP iLO 4

Lights Out Management

The SSH implementation on the remote management interfaces only supports legacy SSH algorithms. When using a current OpenSSH release you'll likely need to explicitly enable a key exchange algorithm supported by iLO, e.g.

 ssh -oKexAlgorithms=+diffie-hellman-group14-sha1 HOST
 -or-
 ssh -oKexAlgorithms=diffie-hellman-group14-sha1 HOST
 -or-
 ssh -oKexAlgorithms=diffie-hellman-group14-sha1 -c 3des-cbc HOST
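
When connecting to many iLO hosts it can help to build the ssh invocation in one place. The following is a minimal sketch, not an established tool: the function name `ilo_ssh_command` is hypothetical, the `root` user is taken from the connection examples on this page, and the `+` prefix is assumed so that the legacy key exchange is appended to OpenSSH's defaults rather than replacing them.

```python
import shlex

# Legacy key exchange algorithm named in this section.
LEGACY_KEX = "diffie-hellman-group14-sha1"


def ilo_ssh_command(host, user="root", legacy_cipher=None):
    """Return the argv list for an ssh call that enables the legacy key exchange.

    legacy_cipher is optional, e.g. "3des-cbc" for interfaces that also
    need an old cipher.
    """
    argv = ["ssh", f"-oKexAlgorithms=+{LEGACY_KEX}"]
    if legacy_cipher:
        argv += ["-c", legacy_cipher]
    argv.append(f"{user}@{host}")
    return argv


if __name__ == "__main__":
    # Print a copy-pasteable command line for one example host.
    print(shlex.join(ilo_ssh_command("bast1001.mgmt.eqiad.wmnet")))
```

The list form can be passed directly to `subprocess.run` if the connection should be started from a script.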

Common Actions

Reboot and boot from network then console

These boxes take ages to boot, so it will look like it's hanging when you connect to the serial console. Give it a few minutes (it takes 4-5 minutes just to get some screen feedback).

set /system1/bootconfig1/bootsource5 bootorder=1
power reset
VSP

That will leave it netbooting forever, so at some point return to iLO and run:

set /system1/bootconfig1/bootsource5 bootorder=5

Note: this assumes the network boot source is bootsource5. Some newer hardware may only have 4 bootsources:

   bootsource2=BootFmDisk      bootorder=1
   bootsource1=BootFmCd        bootorder=2
   bootsource3=BootFmUSBKey    bootorder=3
   bootsource4=BootFmNetwork1   bootorder=4

In this case, restore booting from disk with:

set /system1/bootconfig1/bootsource2 bootorder=1
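
Because the bootsource numbering differs between hardware generations, it can be safer to look up the source by type before issuing commands. The sketch below is illustrative only: the function names are hypothetical, and the dictionary is copied by hand from the four-source listing above rather than parsed from real iLO output.

```python
def find_source(bootsources, kind):
    """Return the bootsource name whose type starts with `kind`."""
    for name, source_type in bootsources.items():
        if source_type.startswith(kind):
            return name
    raise KeyError(kind)


def netboot_commands(bootsources):
    """iLO CLI commands to put the network source first, reset, and attach."""
    network = find_source(bootsources, "BootFmNetwork")
    return [
        f"set /system1/bootconfig1/{network} bootorder=1",
        "power reset",
        "vsp",
    ]


def restore_disk_boot_command(bootsources):
    """iLO CLI command to put the disk source back in first position."""
    disk = find_source(bootsources, "BootFmDisk")
    return f"set /system1/bootconfig1/{disk} bootorder=1"


# Hand-copied from the four-source example in this section (assumption:
# real hardware may report different names or types).
FOUR_SOURCES = {
    "bootsource1": "BootFmCd",
    "bootsource2": "BootFmDisk",
    "bootsource3": "BootFmUSBKey",
    "bootsource4": "BootFmNetwork1",
}

if __name__ == "__main__":
    for line in netboot_commands(FOUR_SOURCES):
        print(line)
    print(restore_disk_boot_command(FOUR_SOURCES))
```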
Get console
</>hpiLO-> vsp

Virtual Serial Port Active: COM2

Starting virtual serial port.
Press 'ESC (' to return to the CLI Session.


or

   start /system1/oemhp_vsp1
Reset Virtual Serial port

Should the message

   Virtual Serial Port is currently in use by another session.

be returned and you know you are not killing someone else's working session, issue

   stop /system1/oemhp_vsp1
Reboot and boot into BIOS then console
Connecting to mgmt interface
  • Via SSH
  • ssh root@servername.mgmt.datacenter.wmnet
  • Example: ssh root@bast1001.mgmt.eqiad.wmnet
  • Via Browser
  • Please note you will have to override the warning for an unknown (self-signed) certificate. Don't save the exception permanently, as having a few of these saved tends to result in errors connecting to other management interfaces (e.g. Dell DRAC) via HTTPS.
Connecting to Serial Console
  • Attach to the serial console: vsp
  • Detach from serial console: esc+(
  • BIOS Serial Console Boot Keys:
  • ESC+9 for ROM
  • ESC+0 for Intelligent Provisioning
  • ESC+! for Default Boot Override Options
  • ESC+@ for Network Boot
  • Crash Cart Boot Keys:
  • F9 for ROM-Based Setup Utility
  • F10 for Intelligent Provisioning
  • F11 for Default Boot Override Options
  • F12 for Network Boot
Power cycling
  • Log in to the iLO command line interface.
  • Power commands are as follows:
  • power -- Displays the current server power state
  • power on -- Turns the server on
  • power off -- Turns the server off
  • power off hard -- Force the server off using press and hold
  • power reset -- Reset the server

Administrative Actions

Polling for MAC Address
  • Log in to the iLO command line interface.
  • Run: show /system1/network1/Integrated_NICs
Changing iLO User Password
Changing the iLO Network IP Settings
Enable / Disable IPMI over iLO
Setting a one-time boot option
Get MAC address
   show /system1/network1/Integrated_NICs

Unless we are bonding, only the first NIC is used, so the first NIC as reported by iLO should be the one that is plugged in.

Disable PCI device
 rbsu> SHOW PCI DEVICE ENABLE/DISABLE 
 rbsu> SET PCI DEVICE ENABLE/DISABLE <entrynum> 0
Enable/Disable Hyperthreading
 rbsu> SHOW CONFIG INTEL(R) HYPERTHREADING OPTIONS
 Intel(R) Hyperthreading Options
 1|Enabled <=
 2|Disabled
 
 rbsu> SET CONFIG INTEL(R) HYPERTHREADING OPTIONS 2
 Intel(R) Hyperthreading Options
 1|Enabled
 2|Disabled <=
Show system event log entries

While at the iLO console:

 # show all recorded entries
 show /system1/log1
 # show a particular entry
 show /system1/log1/record15

ms-be RAID0 config

An easy way to configure all disks of the swift backend (ms-be) machines as RAID 0 using the console above (order is important):

First, reboot the system and press ESC+9 during boot to enter System Utilities. Once in System Utilities, select System Configuration, then Slot 3 : Smart Array P840 Controller. Select Exit and launch HP Smart Storage Administrator (HPSSA). At the next step the error message 'error: no such device: EMBEDDED250.' will appear; there is nothing to do at this point but wait for the hpssacli prompt (=>), then run:

 set target controller slot=3
 array all delete forced
 create type=arrayr0 drivetype=ss_sata
 create type=arrayr0 drivetype=sata
Additional MS-BE RAID details

On ms-be systems each disk is in its own RAID 0, starting with the SSDs. The ms-be systems generally come with a total of 14 disks, in slots numbered 0 to 13, with the SSDs in slots 12 and 13. First create a RAID 0 for the SSD in slot 12, then another RAID 0 for the SSD in slot 13, so that the SSDs are named sda and sdb. After that, do the same for the other 12 disks. At the end you will have:

Array A Array B Array C Array D Array E Array F Array G Array H Array I Array J Array K Array L Array M Array N

Array A being the SSD in slot 12 and Array B the SSD in slot 13
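
The creation order described above can be sketched as a small mapping from array letter to disk slot. This is an illustration only: the function name is hypothetical, and it assumes the twelve non-SSD disks are created in ascending slot order, which this page does not state explicitly.

```python
import string


def msbe_array_layout(total_disks=14, ssd_slots=(12, 13)):
    """Return (array letter, slot) pairs in creation order, SSDs first."""
    # Assumption: remaining disks follow in ascending slot order.
    other_slots = [s for s in range(total_disks) if s not in ssd_slots]
    creation_order = list(ssd_slots) + other_slots
    # zip stops after the last slot, so only letters A..N are used for 14 disks.
    return list(zip(string.ascii_uppercase, creation_order))


if __name__ == "__main__":
    for letter, slot in msbe_array_layout():
        print(f"Array {letter}: slot {slot}")
```

With the defaults, Array A maps to slot 12, Array B to slot 13, and Arrays C through N to slots 0 through 11.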

Once in the BIOS, go to "System Configuration" - "Embedded RAID 1 : HPE Smart Array P816i-a SR Gen10" - "Array Configuration" - "Create Array".


Mark a disk as failed

It might happen that Linux detects errors while writing to a disk but the raid controller itself doesn't see the disk as failed (e.g. https://phabricator.wikimedia.org/T163690). In these cases it is useful to forcefully mark the physical drive as failed as follows:

 set target controller slot=3
 pd all show
 # take note of the disk e.g. 1I:1:5
 pd DISK modify disablepd forced

To reenable the LD (not the PD) after the disk has been swapped:

 ld NUMBER modify reenable
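
Since a typo in the disk identifier would force-fail the wrong drive, a small helper that checks the identifier before producing the commands may be useful. The sketch below is hypothetical: the function names are made up, and the accepted identifier pattern (port, box, bay, as in 1I:1:5) is an assumption based on the single example above.

```python
import re

# Assumed shape of a physical drive id such as "1I:1:5" (port:box:bay).
DISK_ID = re.compile(r"^\d+[IE]:\d+:\d+$")


def fail_disk_commands(disk, slot=3):
    """hpssacli commands to force-fail one physical drive."""
    if not DISK_ID.match(disk):
        raise ValueError(f"unexpected disk id: {disk!r}")
    return [
        f"set target controller slot={slot}",
        f"pd {disk} modify disablepd forced",
    ]


def reenable_ld_command(ld_number):
    """hpssacli command to re-enable a logical drive after the disk swap."""
    return f"ld {int(ld_number)} modify reenable"


if __name__ == "__main__":
    for line in fail_disk_commands("1I:1:5"):
        print(line)
    print(reenable_ld_command(5))
```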

Blink disk led

Via hpssacli:

 set target controller slot=3
 pd DISK modify led=on

ACPI Errors

On first install and after the first puppet run there might be messages similar to this showing up on console:

 ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20160831/exfield-427)
 ACPI Error: Method parse/execution failed [\_SB.PMI0._PMM] (Node ffff8a523f04f2f8), AE_AML_BUFFER_LIMIT (20160831/psparse-543)
 ACPI Exception: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20160831/power_meter-338)

This is caused by the "power meter" ACPI module being loaded. The module has been blacklisted since https://gerrit.wikimedia.org/r/#/c/356422/, so a reboot will make the messages disappear.


Initial System Setup

  • Gen10: press F9 to get to the BIOS.
  • Gen8/Gen9: enter the system setup tool by pressing ESC+9 during boot. The terminal emulation of this tool is lousy, so things will scrawl all over your screen and generally be hard to use.

System Options

Checklist for BIOS settings on HP systems:

  • boot to system setup utilities (F9 on post)
  • Check the following settings in the System Settings > System Configuration > Bios/Platform Configuration (RBSU):
    • System Options > Serial Port Options > Virtual Serial Port : COM 2
    • System Options > USB Options > Internal SD Card Slot : Disabled
    • Processor Options > Intel(R) Hyper-Threading : Enabled
    • Virtualization Options : Disabled on EVERYTHING unless it is a cloudvirt or ganeti server (those are enabled on everything)
      • Cloudvirt/Ganeti settings: Virtualization Technology = Enabled, Intel(R) VT-d = Enabled, SR-IOV = Enabled
    • Boot Options > Boot Mode : Legacy BIOS Mode
    • Legacy BIOS Boot Order > Standard Boot Order (IPL) : Hard drive is located ahead of network port.
    • Network Options > Network Boot Options : Ensure only the active/primary ethernet port is set to Network Boot and the other ports set to Disabled. (This does NOT disable the port, ONLY disables it attempting to network boot.)
    • System Configuration -> BIOS/Platform Configuration (RBSU) -> System Options -> USB Options -> Embedded User Partition = Disabled
    • Save settings, go back to the System Configuration main menu.
  • If systems have hardware RAID, select the embedded RAID controller on the System Configuration screen and set up the RAID.
    • Differing systems have differing raid requirements, please see the individual system setup tasks for details on the raid levels.
Setting Management Network Settings

During POST, hit F8 to get to the iLO setup (it moves slowly).

  • Under Network
  • DHCP/DNS Setting
  • Disable DHCP by pressing the space bar and ensuring it says OFF, then hit F10
  • NIC and TCP/IP
  • Enter IP/Subnet/Gateway
  • Under User
  • edit and change Administrator to root
        user name root
        login name root
        password  mgmt password
  • Ensure all iLO privileges are marked yes
  • File
  • Exit and iLO will reset

RAID controller firmware upgrade

Upgrading RAID controller firmware is relatively straightforward and as of Jan 2019 hasn't posed issues, see also bug T141756 for more context.

This guide assumes one of the common controllers we have deployed on the fleet, usually P840.

 version=ea3138d8e8-6.88-1.1
 cd /tmp
 curl apt.wikimedia.org/firmware/firmware-smartarray-${version}.x86_64.tgz | tar zxv
 sudo ./usr/lib/x86_64-linux-gnu/firmware-smartarray-${version}/setup

Setting proper power option

In BIOS:

  • Select Service Options
  • Select Processor Power Monitoring and choose Disabled
  • Press Enter, then ignore the warning message regarding modification by pressing Enter again. Select Disabled and press Enter again.