SRE/Dc-operations/Platform-specific documentation/ServerTech

Revision as of 21:30, 30 June 2021 by imported>RobH

ServerTech CDUs are smart power supplies with support for power and environmental monitoring and, on some models, outlet control.

Issues

  • After ANY IP/Network/SNMP/syslog changes, the settings must be saved and the PDU/CDU restarted. Restarting the PDU/CDU does NOT interrupt power delivery, and all powered gear should remain unaffected.

Initial Setup

  • Connect the PDU to its serial connection, then connect and log in to the PDU via serial
    • default user/pass is admn/admn
  • Immediately run the following to set the IP info so you can connect via HTTPS to configure the remainder (the example is for PDUs in eqiad; change it for your site's network info):

set dhcp disabled
set ipv4 address <ip address>
set ipv4 gateway 10.65.0.1
set ipv4 subnet 255.255.0.0
set dns primary 10.3.0.1
set dns secondary 10.3.0.1
reboot
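
To illustrate how the site-specific values fit together, here is a minimal Python sketch that renders the serial commands above for a given address. The function name and its parameters are hypothetical and not part of any Wikimedia tooling; the default gateway, netmask, and DNS values are the eqiad ones from the example and must be changed for other sites.

```python
import ipaddress


def serial_bootstrap_commands(address, gateway="10.65.0.1",
                              netmask="255.255.0.0", dns="10.3.0.1"):
    """Return the serial-console commands for a static IPv4 setup.

    Defaults mirror the eqiad example on this page (an assumption for
    any other site). Raises ValueError if the address does not sit in
    the same network as the gateway.
    """
    network = ipaddress.ip_network(f"{gateway}/{netmask}", strict=False)
    if ipaddress.ip_address(address) not in network:
        raise ValueError(f"{address} is outside {network}")
    return [
        "set dhcp disabled",
        f"set ipv4 address {address}",
        f"set ipv4 gateway {gateway}",
        f"set ipv4 subnet {netmask}",
        f"set dns primary {dns}",
        f"set dns secondary {dns}",
        "reboot",
    ]


if __name__ == "__main__":
    print("\n".join(serial_bootstrap_commands("10.65.0.55")))
```

The network check is only a guard against pasting an address from the wrong site; the commands still have to be typed or pasted into the serial session by hand.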

  • Confirm the reboot. You should now be able to connect to the device via https://ps1-rack-site.mgmt.site.wmnet to set up the remainder
  • Connect via HTTPS and log in with the default admn/admn; we'll change this first.
  • Go to Configuration > Access > Local Users >
    • Create Users and create the root user, using the mgmt-pdu password.
    • Edit the root user, and change Access Level to Administrator
  • Log out, and log back in as root
  • Go to Configuration > Access > Local Users >
    • Delete the default admn user.
  • Go to Configuration > Network > DHCP/IP
    • Configure DHCP settings to set the FQDN to the pdu FQDN
    • Un-check Zero Touch Provisioning
    • Apply Settings (don't reboot yet, we'll do that when we are done)
  • Go to Configuration > Network > HTTP/HTTPS
    • Un-check Enable under HTTP server (because unsecured traffic is a bad idea)
    • Apply Settings (don't reboot yet, we'll do that when we are done)
  • Go to Configuration > Network > SNMP
    • Set the SNMPv2 Agent GET and SET passwords (you should copy this from another PDU)
    • Set the system name (FQDN)
    • Set the system location (site)
    • Set system contact: Wikimedia <noc@wikimedia.org>
    • Apply Settings (don't reboot yet, we'll do that when we are done)
  • Go to Configuration > Network > SNTP
    • Set the primary and secondary hosts to ntp.eqiad.wikimedia.org and ntp.codfw.wikimedia.org, using whichever site is closest to the PDU as the primary
    • Apply Settings (don't reboot yet, we'll do that when we are done)
  • Go to Configuration > Network > Syslog
    • Set both Host 1 and Host 2 to syslog.anycast.wmnet
    • Apply Settings
  • Now reboot the PDU so it applies the above: go to Tools > Restart > Action: Restart and Apply.
  • The PDU will restart and apply all the updated settings, and should immediately start showing in LibreNMS.
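
The management hostname pattern used above (https://ps1-rack-site.mgmt.site.wmnet) can be sketched as a small helper. This is only an illustration: the function names are hypothetical, and the pattern is assumed from the placeholder URL on this page, so legacy names that do not follow it (for example ps1-b1-sdtpa.mgmt.pmtpa.wmnet) need to be looked up in Netbox instead.

```python
def pdu_fqdn(rack, site, index=1):
    """Build a PDU management FQDN such as ps1-a8-codfw.mgmt.codfw.wmnet.

    Assumes the ps<index>-<rack>-<site>.mgmt.<site>.wmnet pattern shown
    on this page; Netbox remains the source of truth.
    """
    rack = rack.strip().lower()
    site = site.strip().lower()
    if not rack or not site:
        raise ValueError("rack and site must be non-empty")
    return f"ps{index}-{rack}-{site}.mgmt.{site}.wmnet"


def pdu_url(rack, site, index=1):
    """Return the HTTPS URL of the PDU web interface."""
    return f"https://{pdu_fqdn(rack, site, index)}"


if __name__ == "__main__":
    print(pdu_url("A8", "codfw"))
```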

Adding items to scs

  • Using a browser, go to the scs (example: scs-a1-sdtpa.mgmt.pmtpa.wmnet)
  • Add the PDU to the appropriate port
  • Connect the PDU to the scs switch (be sure to use the Cisco wiring configuration)
  • From a terminal, go to root@scs-a1-sdtpa.mgmt and connect to the port using the commands
  • #pmshell
  • #“port #”
  • Run the following commands; be sure to set the IP address to the appropriate value
  • set dhcp disabled
  • set ipaddress “10.1.5.21”
  • set subnet 255.255.0.0
  • set gateway “10.1.0.1”
  • restart

After the pdu restarts

  • Using a web browser, go to the PDU's web page and accept the security certificate (you can use the IP address in the address bar)
  • Log in to the PDU with the default user/passwd admn/admn
  • Go to Users, create the user "root", and give it the management password. Then click Apply.
  • Under the Action link, use Edit, select admin as the role, and click Apply.
  • Log out of the PDU and then log in again as user root with the mgmt password
  • Go to Users and delete the admn user.

Setting up the Configuration

Configuration:
	System:
		About:
			Location: ulsfo
		Bluetooth:
			Enable unchecked

	Network:
		DHCP/IP:
			Primary DNS:  10.3.0.1
			IPv4 Address/mask/gateway: SET
			FQDN: SET
			DHCP: Uncheck
		FTP:
			FTP Server: Uncheck*
		HTTP/HTTPS:
			HTTP Server: Uncheck*
			HTTPS server: Verify Checked
		SNMP:
			SNMPv2 Agent: Enable
			GET Community: Set to SNMP secret*
			SET Community: Empty
			System Name: Set to hostname
			System Location: Set to site code (ulsfo, etc.)
			System Contact: noc@wikimedia.org
		SNTP:
			Primary host: ntp.eqiad.wikimedia.org
			Secondary host: ntp.codfw.wikimedia.org
		Syslog:
			Host 1: syslog.anycast.wmnet
		Telnet/SSH:
			Telnet server: Uncheck*
			SSH server: verify checked
	Access:
		Local Users:
			admn: remove
Tools:
	restart:
		Action: Restart


* Restart required
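
The checklist above lends itself to a simple expected-versus-actual comparison. The following is a minimal Python sketch of that idea, not an existing tool: the setting keys are hypothetical names invented for illustration, and how the actual values are collected from a PDU (web UI, SNMP, or CLI) is deliberately left out.

```python
# Expected values taken from the checklist on this page.
# The key names are hypothetical; they are not ServerTech identifiers.
EXPECTED = {
    "bluetooth": False,
    "dhcp": False,
    "ftp_server": False,
    "http_server": False,
    "https_server": True,
    "snmpv2_agent": True,
    "snmp_set_community": "",
    "system_contact": "noc@wikimedia.org",
    "sntp_primary": "ntp.eqiad.wikimedia.org",
    "sntp_secondary": "ntp.codfw.wikimedia.org",
    "syslog_host_1": "syslog.anycast.wmnet",
    "telnet_server": False,
    "ssh_server": True,
}


def audit(actual, expected=EXPECTED):
    """Return {key: (expected, actual)} for every setting that differs.

    Keys missing from `actual` are reported with an actual value of None.
    """
    return {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }


if __name__ == "__main__":
    observed = dict(EXPECTED, telnet_server=True)
    for key, (want, got) in audit(observed).items():
        print(f"{key}: expected {want!r}, found {got!r}")
```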

Adding devices to monitoring

Add device to LibreNMS: LibreNMS#Add a device to LibreNMS

Add device to Icinga: duplicate this change

Cookbooks

There are a number of cookbooks to aid in the management of this hardware. Run cookbook -lv sre.pdus from a Cumin host to see a list of the current cookbooks.

$ sudo cookbook -lv sre.pdus
cookbooks
`-- sre: SRE Cookbooks
    `-- sre.pdus: -
        |-- sre.pdus.reboot-and-wait: List PDU 🔌 uptime
        |-- sre.pdus.rotate-password: Update Sentry PDUs 🔌 passwords
        |-- sre.pdus.rotate-snmp: Update Sentry PDUs 🔌 SNMP communities
        `-- sre.pdus.uptime: List PDU 🔌 uptime

PDUs

The PDU cookbooks live under cookbook -l sre.pdus, and they all have the following common arguments:

  • --username USERNAME: the username to use to login to the PDU
  • --check-default: if this flag is passed the script will check if the default username and password is still configured
  • query: either the word all or a list of PDU IPs or Netbox device names. With all, the cookbook runs against all PDUs that have a valid entry in Netbox
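
To make the three accepted forms of the query argument concrete, here is a minimal Python sketch that sorts query tokens into "all", IP addresses, and device names. It is an illustration only and is not taken from the cookbook source; the function name and the returned structure are assumptions.

```python
import ipaddress


def classify_query(tokens):
    """Split query tokens into IP addresses and (assumed) Netbox names.

    The single word "all" selects every PDU. Anything that parses as an
    IP address is treated as an IP; everything else is treated as a
    device name to be resolved elsewhere.
    """
    if list(tokens) == ["all"]:
        return {"all": True, "ips": [], "names": []}
    ips, names = [], []
    for token in tokens:
        try:
            ipaddress.ip_address(token)
        except ValueError:
            names.append(token)
        else:
            ips.append(token)
    return {"all": False, "ips": ips, "names": names}


if __name__ == "__main__":
    print(classify_query(["10.193.0.32", "ps1-a8-codfw"]))
```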

sre.pdus.uptime

This cookbook simply reports the PDU uptime (in unix uptime format) and the PDU version. It has no additional arguments; simply run it as follows:

$ sudo cookbook sre.pdus.uptime all
START - Cookbook sre.pdus.uptime
Current password:
10.65.0.55: uptime 6 days 16 hours 6 minutes 8 seconds
10.193.0.34: uptime 6 days 21 hours 18 minutes 10 seconds
10.65.0.48: uptime 4 days 18 hours 0 minutes 27 seconds
10.65.0.39: uptime 6 days 17 hours 36 minutes 45 seconds
END (PASS) - Cookbook sre.pdus.uptime (exit_code=0)
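
If the uptime output needs to be processed further, for example to find PDUs that have not restarted in a long time, the lines can be parsed as below. This is a minimal sketch that assumes the line format shown in the sample output above; the function name is hypothetical and the format may differ between cookbook versions.

```python
import re
from datetime import timedelta

# Matches lines such as:
# 10.65.0.55: uptime 6 days 16 hours 6 minutes 8 seconds
UPTIME_RE = re.compile(
    r"^(?P<host>\S+): uptime (?P<days>\d+) days (?P<hours>\d+) hours "
    r"(?P<minutes>\d+) minutes (?P<seconds>\d+) seconds$"
)


def parse_uptime(line):
    """Return (host, timedelta) for an uptime line, or None otherwise."""
    match = UPTIME_RE.match(line.strip())
    if match is None:
        return None
    uptime = timedelta(
        days=int(match["days"]),
        hours=int(match["hours"]),
        minutes=int(match["minutes"]),
        seconds=int(match["seconds"]),
    )
    return match["host"], uptime


if __name__ == "__main__":
    sample = "10.65.0.55: uptime 6 days 16 hours 6 minutes 8 seconds"
    host, uptime = parse_uptime(sample)
    print(host, int(uptime.total_seconds()))
```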

sre.pdus.reboot-and-wait

This cookbook can be used to perform a rolling reboot of a set of PDUs. It reboots the PDUs in order and waits for each PDU to fully reboot before moving on.

This cookbook has the following additional arguments

  • --since SINCE: by default the cookbook will try to reboot all nodes. If you pass an integer number of seconds as SINCE, the cookbook only reboots nodes that have not been rebooted within that many seconds

$ sudo cookbook sre.pdus.reboot-and-wait ps1-a8-codfw
START - Cookbook sre.pdus.reboot-and-wait
Current password:
10.193.0.32: rebooting Sentry v3 PDU
10.193.0.32: sleep while reboot
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [1/25, retrying in 5.00s]:
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [2/25, retrying in 5.00s]:
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [3/25, retrying in 5.00s]:
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [4/25, retrying in 5.00s]:
10.193.0.32: found reboot since 2020-09-22 09:38:10.464170
END (PASS) - Cookbook sre.pdus.reboot-and-wait (exit_code=0)
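
The two behaviours visible above, skipping recently rebooted PDUs and polling until the reboot is confirmed, can be sketched in Python as follows. This is not the cookbook's code: both function names are hypothetical, the interpretation of --since is an assumption based on the description above, and the defaults of 25 attempts and 5 seconds are simply read off the sample output.

```python
import time


def needs_reboot(uptime_seconds, since_seconds=None):
    """Decide whether a PDU should be rebooted.

    Assumed reading of --since: a PDU whose uptime is shorter than the
    given number of seconds was rebooted recently enough and is skipped.
    Without --since every PDU is rebooted.
    """
    if since_seconds is None:
        return True
    return uptime_seconds >= since_seconds


def wait_for(check, tries=25, delay=5.0, sleep=time.sleep):
    """Poll check() until it returns True; return the attempt number.

    Raises TimeoutError once `tries` attempts have all failed. The
    `sleep` parameter is injectable so the loop can be tested quickly.
    """
    for attempt in range(1, tries + 1):
        if check():
            return attempt
        if attempt < tries:
            sleep(delay)
    raise TimeoutError(f"no success after {tries} attempts")


if __name__ == "__main__":
    answers = iter([False, False, True])
    print(wait_for(lambda: next(answers), delay=0.01))
```

Processing one PDU at a time and blocking on the wait is what makes the reboot rolling: a PDU that never comes back stops the run instead of letting it continue to the next device.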

sre.pdus.rotate-snmp

This cookbook is used to update all SNMP read-only strings to a specific value and, optionally, to update all SNMP read/write strings to a random value.

This cookbook has the following additional arguments:

  • --force: if passed, force an update of the SNMP strings and a reboot of the PDU even if the configured RO SNMP string is already correct
  • --reset-rw: update the RW SNMP string to a random value

$ sudo cookbook sre.pdus.rotate-snmp ps1-a8-codfw          
START - Cookbook sre.pdus.rotate-snmp
Enter login password:
New SNMP RO String:
Again, just to be sure:
10.193.0.32: Updating SNMP RO
10.193.0.32: SNMP RO: updated
10.193.0.32: rebooting Sentry v3 PDU
10.193.0.32: sleep while reboot
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [1/25, retrying in 5.00s]:
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [2/25, retrying in 5.00s]:
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [3/25, retrying in 5.00s]:
Failed to call 'cookbooks.sre.pdus.wait_reboot_since' [4/25, retrying in 5.00s]:
10.193.0.32: found reboot since 2020-09-22 09:10:50.040743
END (PASS) - Cookbook sre.pdus.rotate-snmp (exit_code=0)
$ sudo cookbook sre.pdus.rotate-snmp ps1-a8-codfw                             
START - Cookbook sre.pdus.rotate-snmp
Enter login password:
New SNMP RO String:
Again, just to be sure:
10.193.0.32: SNMP communities already match (version: 3, uptime: 0 days 0 hours 5 minutes 45 seconds)
END (PASS) - Cookbook sre.pdus.rotate-snmp (exit_code=0)
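
For reference, a random value of the kind --reset-rw sets could be generated as in the sketch below. How the cookbook actually generates its value is not documented on this page, so the function name, the length, and the character set here are all assumptions.

```python
import secrets
import string

# Letters and digits only, to avoid characters a PDU form might reject.
# This alphabet is an assumption, not a documented ServerTech limit.
ALPHABET = string.ascii_letters + string.digits


def random_community(length=32):
    """Return a random SNMP community string of the given length."""
    if length < 16:
        raise ValueError("refusing to generate a short community string")
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(random_community())
```

The secrets module is used rather than random because the value acts as a credential.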

sre.pdus.rotate-password

This cookbook is used to rotate a user's password.

$ sudo cookbook sre.pdus.rotate-password 10.193.0.32
START - Cookbook sre.pdus.rotate-password
Current password:
New password:
Again, just to be sure:
10.193.0.32: Password updated successfully 😌
END (PASS) - Cookbook sre.pdus.rotate-password (exit_code=0)

Other Info

powering off and on outlets

These instructions are good for Sentry Switched CDU 6.0 firmware. (Manuals here, see the manuals for Firmware Version 6.0). They may well work for other versions of the firmware but haven't been tested there.

Please note: we have a very limited number of switched powerstrips in place. Only the network racks (A1-sdtpa, A1-eqiad, A8-eqiad) and B1-sdtpa have these. The rest have normal non-switched powerstrips. The reasoning is that a switched strip has more parts and complexity, and can develop issues more easily than the simpler strips. Networking kit does not normally have full out-of-band lights-out management, so those racks have switched strips. B1-sdtpa is a legacy rack for older servers that do not have full lights-out management remote reboot capabilities.

  1. Find the pdu for your rack and your server. See Netbox
    It will have a name like ps1-b1-sdtpa.mgmt.pmtpa.wmnet (powerstrip-n, rack m, dc name...)
  2. ssh in, using the standard credentials for the management network.
  3. Check what outlets or groups of outlets you want. You can list those by:
    list user
    At the "Username:" prompt, type root. The outlets are first, the groups at the bottom.
  4. To power the outlet off, do
    off name-of-outlet-or-group-here
    Example: off .BC6 or off dataset1-all
  5. To power the outlet on, do
    on name-of-outlet-or-group-here

Sample successful output from a power off command:

Switched CDU: off dataset1-all

   Group: dataset1-all

   Outlet   Outlet                    Outlet     Load      Power     Control
   ID       Name                      Status     (Amps)    (Watts)   State

   .AC6     dataset1_a:xz:6           Off        0.00      0         Off       
   .AC7     dataset1-array1_z:xz:7    Off        0.00      0         Off       
   .BC6     dataset1_b:xz:6           Off        0.00      0         Off       
   .BC7     dataset1-array1_b:xz:7    Off        0.00      0         Off       

   Command successful

Switched CDU:

Power on gives similar results.
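
When the result of an on or off command needs to be checked by a script rather than by eye, the table can be parsed as in this minimal Python sketch. It assumes the six-column layout shown in the sample output above and outlet names without spaces; the function names are hypothetical.

```python
def parse_outlets(output):
    """Extract outlet rows from the CLI output of an on/off command.

    Rows are recognised by an outlet ID starting with "." and exactly
    six whitespace-separated columns (an assumption from the sample).
    """
    rows = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[0].startswith("."):
            outlet_id, name, status, amps, watts, control = fields
            rows.append({
                "id": outlet_id,
                "name": name,
                "status": status,
                "amps": float(amps),
                "watts": int(watts),
                "control": control,
            })
    return rows


def all_in_state(rows, state):
    """True if at least one row exists and every row is in `state`."""
    return bool(rows) and all(
        row["status"] == state and row["control"] == state for row in rows
    )


if __name__ == "__main__":
    sample = """
   .AC6     dataset1_a:xz:6           Off        0.00      0         Off
   .BC6     dataset1_b:xz:6           Off        0.00      0         Off
   Command successful
"""
    print(all_in_state(parse_outlets(sample), "Off"))
```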

Note that there's a five-minute timeout, so if you idle too long you'll have to reconnect.

summary of other commands

A few other useful commands:

  • From the command line, hitting return/enter will show you a list of commands it knows.
  • Check the version of the firmware with show system. The PDU also announces the firmware version when you log in.
  • Some commands for displaying various statuses: show ports/network/options/towers/infeeds/traps/system
  • A couple of monitoring commands: istat, envmon, sysstat
  • Listing user info: list user (you will be prompted for a specific user), list users (you will be shown a list of all users)