
Difference between revisions of "Monitoring/strongswan"

From Wikitech
imported>Dzahn
 
imported>Herron
(Updated to reflect prometheus-ipsec-exporter, grafana dashboards, and added some troubleshooting hints)
 
Line 1: Line 1:
- check_strongswan is an [[Icinga]] plugin that checks [[Wikipedia:IPSec|IPSec]] connections.
+ [[Wikipedia:IPSec|IPSec]] connections are monitored today using a combination of prometheus-ipsec-exporter and icinga.

- The actual scripts is in the puppet repo in ''modules/strongswan/files/monitoring/check_strongswan''.
+ A per-site alert will fire via icinga check_prometheus should one or more defined IPsec tunnels change to disconnected, unknown, or other non-connected state.

- <pre>
- # Nagios/Icinga check script for Strongswan
- # Parses output of 'ipsec statusall': checks that each defined connection has
- # corresponding established Security Associations (IKE parent + ESP child).
- # Also checks that connections configured by Strongswan have corresponding
- # xfrm policies in place in the kernel, by parsing output of 'ip xfrm state'
- # for matching Security Parameter Index values.
- </pre>
-
- 1.) troubleshoot the issue
-
- 2.) write what you did to troubleshot the issue in this runbook :)
-
- 3.) tell the traffic team about it
+ # troubleshoot the issue.
+ ## Review the alert description text carefully.  You should see some hint about where the problem lies here.
+ ### e.g. alert text of "instance=cp1081:9536 site=eqiad tunnel={cp3060_v4,cp3060_v6}" indicates that there is a problem with the tunnels to cp3060 being reported by host cp1081.
+ ## Check the ipsec dashboard at https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status?orgId=1
+ ### investigate any ongoing non-zero values.
+ ### Look for commonality, e.g. do multiple instances report problems with the same tunnel?
+ # Write more about what you did to troubleshot the issue in this runbook :)
+ # Tell the traffic team about it

  [[Category:Runbooks]]

Latest revision as of 21:09, 8 November 2019

IPsec connections are currently monitored using a combination of prometheus-ipsec-exporter and Icinga.

A per-site alert fires via the Icinga check_prometheus check when one or more defined IPsec tunnels change to a disconnected, unknown, or other non-connected state.

  1. Troubleshoot the issue.
    1. Review the alert description text carefully. It should give some hint about where the problem lies.
      1. For example, the alert text "instance=cp1081:9536 site=eqiad tunnel={cp3060_v4,cp3060_v6}" indicates that host cp1081 is reporting a problem with its tunnels to cp3060.
    2. Check the IPsec dashboard at https://grafana.wikimedia.org/d/B9JpocKZz/ipsec-tunnel-status?orgId=1
      1. Investigate any ongoing non-zero values.
      2. Look for commonality, e.g. do multiple instances report problems with the same tunnel?
  2. Write more about what you did to troubleshoot the issue in this runbook :)
  3. Tell the traffic team about it.
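
When several alerts fire at once, the "look for commonality" step above can be done by hand, but a small script may help. The following is a minimal sketch, assuming the alert text always follows the layout of the single example in this runbook (`instance=host:port site=... tunnel={a,b}`) and that tunnel names end in `_v4` or `_v6`. The function names and the second alert (from `cp1083`) are hypothetical and only for illustration; this is not part of prometheus-ipsec-exporter or Icinga.

```python
import re
from collections import defaultdict

# Assumed label layout, inferred only from the one example alert in this runbook.
ALERT_RE = re.compile(
    r"instance=(?P<host>[^:\s]+):(?P<port>\d+)\s+"
    r"site=(?P<site>\S+)\s+"
    r"tunnel=\{(?P<tunnels>[^}]*)\}"
)


def parse_alert(text):
    """Return (reporting_host, site, [tunnel, ...]), or None if the text does not match."""
    match = ALERT_RE.search(text)
    if match is None:
        return None
    tunnels = [t.strip() for t in match.group("tunnels").split(",") if t.strip()]
    return match.group("host"), match.group("site"), tunnels


def reporters_by_peer(alert_texts):
    """Map each remote peer (tunnel name without _v4/_v6) to the hosts reporting it."""
    peers = defaultdict(set)
    for text in alert_texts:
        parsed = parse_alert(text)
        if parsed is None:
            continue  # skip alert text that does not follow the assumed layout
        host, _site, tunnels = parsed
        for tunnel in tunnels:
            peer = re.sub(r"_v[46]$", "", tunnel)
            peers[peer].add(host)
    return {peer: sorted(hosts) for peer, hosts in peers.items()}


if __name__ == "__main__":
    alerts = [
        "instance=cp1081:9536 site=eqiad tunnel={cp3060_v4,cp3060_v6}",
        "instance=cp1083:9536 site=eqiad tunnel={cp3060_v4}",  # hypothetical second alert
    ]
    for peer, hosts in reporters_by_peer(alerts).items():
        print(f"{peer}: reported by {', '.join(hosts)}")
```

If one peer is reported by many hosts, the problem is more likely on that peer's side; if one host reports many peers, the reporting host itself is the more likely suspect.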