IPsec connections are currently monitored using a combination of prometheus-ipsec-exporter and Icinga.

A per-site alert fires via the Icinga check_prometheus check when one or more defined IPsec tunnels change to a disconnected, unknown, or other non-connected state.

  1. Troubleshoot the issue.
    1. Review the alert description text carefully. It should hint at where the problem lies.
      1. For example, the alert text "instance=cp1081:9536 site=eqiad tunnel={cp3060_v4,cp3060_v6}" indicates that host cp1081 is reporting a problem with its tunnels to cp3060.
    2. Check the ipsec dashboard at
      1. Investigate any ongoing non-zero values.
      2. Look for commonality, e.g. do multiple instances report problems with the same tunnel?
  2. Document what you did to troubleshoot the issue in this runbook :)
  3. Tell the Traffic team about it.
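
When several alerts fire at once, the "look for commonality" step can be done mechanically. The following is a minimal Python sketch, not part of the monitoring stack: it assumes the alert text always follows the "instance=... site=... tunnel={...}" layout of the example above, and the helper names parse_alert and reporters_by_tunnel are hypothetical. It splits each alert into the reporting host, the site and the affected tunnels, then groups the reporting hosts by tunnel.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the example alert text in this runbook:
#   instance=<host>:<port> site=<site> tunnel={<name>,<name>,...}
# Braces around the tunnel list are treated as optional (an assumption).
ALERT_RE = re.compile(
    r"instance=(?P<host>[^:\s]+):(?P<port>\d+)\s+"
    r"site=(?P<site>\S+)\s+"
    r"tunnel=\{?(?P<tunnels>[^}\s]+)\}?"
)


def parse_alert(text):
    """Return reporter, site and tunnel list for one alert, or None if it does not match."""
    match = ALERT_RE.search(text)
    if match is None:
        return None
    return {
        "reporter": match.group("host"),
        "site": match.group("site"),
        "tunnels": match.group("tunnels").split(","),
    }


def reporters_by_tunnel(alert_texts):
    """Map each tunnel name to the set of hosts reporting a problem with it."""
    grouped = defaultdict(set)
    for text in alert_texts:
        alert = parse_alert(text)
        if alert is None:
            continue  # skip text that does not follow the assumed layout
        for tunnel in alert["tunnels"]:
            grouped[tunnel].add(alert["reporter"])
    return dict(grouped)
```

A tunnel reported by many hosts suggests the problem is on the far end of that tunnel (cp3060 in the example), whereas one host reporting many unrelated tunnels suggests the problem is on the reporting host itself.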