Difference between revisions of "Ping offload"

imported>Krinkle
imported>Ayounsi
 

Latest revision as of 09:39, 15 January 2020

Service status: Work in progress

Documentation status: Ready

Goal: Lower the high ICMP load on LVS/CP servers by offloading echo requests to a dedicated server.

Linux has internal ICMP rate limiters that can cause the kernel to drop valuable ICMP packets. By offloading ICMP echo, we make sure the "important" ICMP (eg PMTU discovery) doesn't get dropped.

Deployment

Deployment task: https://phabricator.wikimedia.org/T190090

eqiad: ping1001.eqiad.wmnet

codfw: ping2001.codfw.wmnet

esams: ping3001.esams.wmnet

Monitoring

Icinga: https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=ping1001&style=hostservicedetail (and ping2001)

Grafana dashboard: https://grafana.wikimedia.org/dashboard/db/ping-offload

External monitoring: Ping to VIPs configured in Watchmouse

InAddrErrors alert

Raised by the Grafana dashboard alerting.

This means the server is receiving packets for an IP address that is not configured on it.
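To check the counter by hand on the ping host, the sketch below reads the `InAddrErrors` value that Linux exposes on the `Ip:` lines of `/proc/net/snmp`. It assumes the Grafana alert is driven by this kernel counter, which this page does not state; the function name is illustrative.

```python
def read_in_addr_errors(snmp_text: str) -> int:
    """Return the Ip InAddrErrors counter from /proc/net/snmp-style text.

    The file contains pairs of lines per protocol: a header line listing
    the counter names, then a line with the matching values.
    """
    ip_lines = [line.split() for line in snmp_text.splitlines()
                if line.startswith("Ip:")]
    if len(ip_lines) < 2:
        raise ValueError("expected an 'Ip:' header line and an 'Ip:' value line")
    header, values = ip_lines[0], ip_lines[1]
    # Header and value lines share the same column layout.
    return int(values[header.index("InAddrErrors")])
```

On the host, call it as `read_in_addr_errors(open("/proc/net/snmp").read())` twice, a few seconds apart; a growing value suggests packets are still arriving for an address the host does not hold.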

  1. Run ip addr to check if all the redirected IPs are present on the loopback interface
    1. If not, they can manually be added temporarily with ip addr add <ip>/32 dev lo:ping_offload
  2. If the IPs are present, use tcpdump to find the IP in question (eg. filter out all the present IPs)
  3. In any case, or if the troubleshooting takes too long, disable the redirect (see below)
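The first two steps can be sketched as a small helper. It compares the expected VIPs against the output of `ip addr`, then builds a tcpdump filter that hides ICMP traffic for the addresses already present. The function names and the `198.51.100.x` addresses are illustrative only, not the real VIPs.

```python
import re


def present_ipv4(ip_addr_output: str) -> set[str]:
    """Collect the IPv4 addresses listed in `ip addr` output."""
    # "inet " (with a trailing space) does not match "inet6" lines.
    return set(re.findall(r"\binet (\d+(?:\.\d+){3})/", ip_addr_output))


def missing_ips(expected: list[str], ip_addr_output: str) -> list[str]:
    """Return the expected addresses that are absent from the host."""
    return sorted(set(expected) - present_ipv4(ip_addr_output))


def tcpdump_filter(present: set[str]) -> str:
    """Build a filter that keeps only ICMP sent to addresses NOT in `present`."""
    clauses = " and ".join(f"not dst host {ip}" for ip in sorted(present))
    return f"icmp and {clauses}" if clauses else "icmp"


if __name__ == "__main__":
    # Hypothetical sample; on the host, capture `ip addr show dev lo` instead.
    sample = (
        "1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536\n"
        "    inet 127.0.0.1/8 scope host lo\n"
        "    inet 198.51.100.1/32 scope global lo:ping_offload\n"
    )
    print(missing_ips(["198.51.100.1", "198.51.100.2"], sample))
    print(tcpdump_filter(present_ipv4(sample)))
```

The resulting filter string can be passed as the expression to `sudo tcpdump -i ens5 -nn`, so that only packets for unexpected destination addresses are shown.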

How-to

Deploy a new host

  1. Create a VM, see existing VMs on https://netbox.wikimedia.org/virtualization/virtual-machines/?q=ping
  2. Assign the ping_offload role in Puppet (eg. https://gerrit.wikimedia.org/r/c/operations/puppet/+/564873)
  3. Add the target VIP to its configuration (eg. https://gerrit.wikimedia.org/r/c/operations/puppet/+/564908)
  4. Set the VIP and ping host in Homer (eg. https://gerrit.wikimedia.org/r/c/operations/homer/public/+/564917)

Temporarily stop the ICMP echo redirect

Use this procedure if the system is showing signs of issues or needs to go down for maintenance.

On both cr1 and cr2 routers of the target site, enter the following commands:

# deactivate firewall family inet filter border-in4 term offload-ping4

# deactivate firewall family inet filter transport-in4 term offload-ping4

Then verify that the changes about to be made are correct. The output should be similar to:

# show | compare
[edit firewall family inet filter border-in4]
!       inactive: term offload-ping4 { ... }
[edit firewall family inet filter transport-in4]
!       inactive: term offload-ping4 { ... }

Finish by committing the changes (replace <TASK #> with a Phabricator task ID or a relevant comment):

# commit comment "<TASK #>"

To confirm that the change is effective, monitor tcpdump on the ping host (for example sudo tcpdump -i ens5 icmp -nn) or the dashboard.

To re-activate the redirect, repeat the changes above, replacing deactivate with activate.
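Because the same command sequence is entered on both routers, and again with the opposite verb to revert, it can help to generate it rather than retype it. The sketch below only prints the commands shown on this page for pasting into a Junos configuration session. It does not connect to any router, and the function name is illustrative.

```python
# Filter and term names are taken from the commands documented above.
FILTERS = ("border-in4", "transport-in4")
TERM = "offload-ping4"


def redirect_commands(action: str, comment: str) -> list[str]:
    """Return the Junos configuration commands to toggle the ICMP redirect.

    `action` is "deactivate" to stop the redirect or "activate" to restore it.
    `comment` is the commit comment, for example a Phabricator task ID.
    """
    if action not in ("deactivate", "activate"):
        raise ValueError(f"unsupported action: {action!r}")
    commands = [
        f"{action} firewall family inet filter {name} term {TERM}"
        for name in FILTERS
    ]
    commands.append("show | compare")  # review the pending diff before committing
    commands.append(f'commit comment "{comment}"')
    return commands


if __name__ == "__main__":
    for line in redirect_commands("activate", "<TASK #>"):
        print(line)
```

Rejecting any verb other than `deactivate` or `activate` keeps a typo from producing a command that changes something else in the configuration.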

Possible improvements

  • Use BGP flowspec to automatically advertise/remove the redirect
  • Add IPv6 support
  • Have multiple ping servers per site for redundancy