Network monitoring
Revision as of 18:17, 21 September 2017 by imported>Ayounsi (Adding BGPmin + RIPE rpki)
Monitoring resources
Tool | Auth | Alerts | Link |
---|---|---|---|
LibreNMS | LDAP | | https://librenms.wikimedia.org/ |
Smokeping | Open | | https://smokeping.wikimedia.org/ |
Prometheus | Open | | https://grafana.wikimedia.org/dashboard/db/network-performances-global |
Icinga | LDAP | Network monitoring#Icinga alerts | https://icinga.wikimedia.org/icinga/ |
Logstash | LDAP | | https://logstash.wikimedia.org/app/kibana#/dashboard/6bcd2a10-7d21-11e7-86fb-51c84229aeb7 |
External monitoring | Open | | https://status.wikimedia.org/ |
RIPE Atlas | Semi-open | | https://atlas.ripe.net |
Rancid | Internal | | N/A |
BGPmon | External | | https://bgpmon.net/ |
RIPE RPKI | External | | https://my.ripe.net/#/rpki |
Runbooks
Icinga alerts
host (ipv6) down
- If service impacting (e.g. full switch stack down):
  - Depool the site if possible
  - Ping/page netops
- If not service impacting (e.g. loss of redundancy, management network):
  - Decide if depooling the site is necessary
  - Ping netops and open a high-priority task for them
Router interface down
Example
CRITICAL: host '208.80.154.197', interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0<BR>xe-3/2/3: down -> Core: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]<BR>
The part that interests us is the one between the <BR> tags. In this example:
- Interface name is `xe-3/2/3`
- Description is `Core: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]`
  - Type is `Core`; other types are, for example, `Peering`, `Transit`, `OOB`.
  - The other side of the link is `cr2-codfw:xe-5/0/1`
  - The circuit is operated by Zayo, with the circuit ID given right after the provider name
  - The remaining information is optional (latency, speed, cable #)
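The breakdown above can be sketched as a small parser. This is only an illustration: the field layout is inferred from the single example alert on this page, and the names `ALERT_LINE` and `parse_interface_alert` are hypothetical, not part of any existing tooling.

```python
import re

# Assumed layout, inferred only from the example alert above:
#   <ifname>: down -> <Type>: <remote> (<provider>, <circuit ID>) <latency> {#<cable>} [<speed>]
# Everything after the remote side is treated as optional.
ALERT_LINE = re.compile(
    r"^(?P<interface>\S+): down -> "
    r"(?P<type>\w+): (?P<remote>\S+)"
    r"(?: \((?P<provider>[^,)]+), (?P<circuit_id>[^)]+)\))?"
    r"(?: (?P<latency>\d+ms))?"
    r"(?: \{#(?P<cable>\d+)\})?"
    r"(?: \[(?P<speed>[^\]]+)\])?$"
)


def parse_interface_alert(alert: str) -> list[dict]:
    """Return one dict per down interface found between the <BR> tags."""
    results = []
    # The first chunk is the summary ("CRITICAL: host ..."); the rest are interfaces.
    for chunk in alert.split("<BR>")[1:]:
        match = ALERT_LINE.match(chunk.strip())
        if match:
            results.append(match.groupdict())
    return results
```

Fields missing from a description (for example the provider on an internal link) come back as `None`, so a caller can tell a provider circuit apart from an internal one.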
If such an alert shows up:
All links are redundant, but don't hesitate to depool the site if it shows signs of a larger outage.
Identify the type of interface going down
- 3rd party provider: Type can be Core/Transit/Peering/OOB, with an identifiable provider name that is present on that list
- Internal link: Type is Core, no provider name listed
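As a rough sketch of the distinction above, the check below looks for a "(provider, circuit ID)" group in the interface description. It assumes that this group is the only marker of a provider circuit, as in the example description on this page; `classify_link` is a hypothetical helper, and a real check would also compare the provider name against the provider list.

```python
import re

# Assumption: a provider circuit always carries "(Provider, circuit ID)"
# in its description, as in the example alert on this page.
PROVIDER = re.compile(r"\((?P<provider>[^,()]+),\s*(?P<circuit_id>[^()]+)\)")


def classify_link(description: str) -> str:
    """Classify an interface description as provider, internal, or unknown."""
    # The type is the text before the first colon, e.g. "Core" or "Transit".
    link_type = description.split(":", 1)[0].strip()
    if PROVIDER.search(description):
        return "3rd party provider"
    if link_type == "Core":
        return "internal"
    # Anything else needs a human to look at it.
    return "unknown"
```

For instance, `classify_link("Core: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]")` returns `"3rd party provider"`, while the same description without the parenthesised provider returns `"internal"`.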
If 3rd party provider link
- Check the maintenance calendar for a planned maintenance on that circuit ID
- Check whether the provider sent a last-minute maintenance or outage email notification
- If scheduled, or if the provider is aware of the incident:
  - Downtime the alert for the duration of the maintenance
  - Monitor that no other links are going down (risk of total loss of redundancy)
- If unplanned:
  - Open a phabricator task, tag netops, include the alert and timestamp
  - Contact the provider using the information present on that list; make sure to include the circuit ID and the time when the outage started
  - If needed, escalate to netops
  - Monitor for recovery; if there is no reply to the email within 30 minutes, call them
  - Close the task if the link recovers quickly
If internal link
- Open a phabricator task, tag netops and dcops, include the alert and timestamp
- Most likely an optic needs to be replaced on one of the ends.
Juniper alarm
- If warning/yellow: open a phabricator task, tag netops
- If critical/red: open a phabricator task, tag netops, ping/page netops
BGP status
- If warning/yellow: open a phabricator task, tag/ping netops.
  - This is most likely an IXP peer session down
- If critical/red: treat it like a router interface down alert.
  - Identify the peer name: in a terminal type `whois as#####` or look up the AS number on http://peeringdb.com/
  - Follow the router interface down instructions.
Syslog
Some of the syslog messages seen across the infra and their fix or workaround are listed on https://phabricator.wikimedia.org/T174397