
Network monitoring

Revision as of 18:17, 21 September 2017 by imported>Ayounsi (Adding BGPmin + RIPE rpki)

Monitoring resources

Tool                | Auth      | Alerts        | Link
LibreNMS            | LDAP      |               | https://librenms.wikimedia.org/
Smokeping           | Open      |               | https://smokeping.wikimedia.org/
Prometheus          | Open      |               | https://grafana.wikimedia.org/dashboard/db/network-performances-global
Icinga              | LDAP      | Icinga alerts | https://icinga.wikimedia.org/icinga/
Logstash            | LDAP      |               | https://logstash.wikimedia.org/app/kibana#/dashboard/6bcd2a10-7d21-11e7-86fb-51c84229aeb7
External monitoring | Open      |               | https://status.wikimedia.org/
RIPE Atlas          | Semi-open |               | https://atlas.ripe.net
Rancid              | Internal  |               | N/A
BGPmon              | External  |               | https://bgpmon.net/
RIPE RPKI           | External  |               | https://my.ripe.net/#/rpki

Runbooks

Icinga alerts

host (ipv6) down

  • If service impacting (e.g. full switch stack down):
    1. Depool the site if possible
    2. Ping/page netops
  • If not service impacting (e.g. loss of redundancy, management network):
    1. Decide if depooling the site is necessary
    2. Ping netops and open a high-priority task for them

Router interface down

Example

CRITICAL: host '208.80.154.197', interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0<BR>xe-3/2/3: down -> Core: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]<BR>

The part that interests us is the one between the <BR> tags. In this example:

  • Interface name is xe-3/2/3
  • Description is Core: cr2-codfw:xe-5/0/1 (Zayo, circuitID) 36ms {#2909} [10Gbps wave]
    • Type is Core, other types are for example: Peering, Transit, OOB.
    • The other side of the link is cr2-codfw:xe-5/0/1
    • The circuit is operated by Zayo, and the circuit ID follows the provider name
    • The remaining information is optional (latency, cable #, speed)
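
The breakdown above can be sketched as a small parser. This is only an illustration: the regular expression is inferred from the single example alert on this page, so real interface descriptions may not match it, and the function and field names (`parse_down_interfaces`, `extra`, etc.) are hypothetical rather than part of any existing tooling.

```python
import re

# ASSUMPTION: this pattern is inferred from the one example alert on this
# page; real interface descriptions may deviate from it.
SEGMENT_RE = re.compile(
    r"^(?P<interface>\S+): down -> "      # e.g. "xe-3/2/3: down -> "
    r"(?P<type>[^:]+): "                  # e.g. "Core: "
    r"(?P<remote>\S+)"                    # e.g. "cr2-codfw:xe-5/0/1"
    r"(?: \((?P<provider>[^,)]+)(?:, (?P<circuit_id>[^)]+))?\))?"
    r"(?P<extra>.*)$"                     # optional latency, cable #, speed
)


def parse_down_interfaces(alert):
    """Return one dict per '<interface>: down -> <description>' segment."""
    results = []
    # The first <BR>-separated chunk is the summary line, so skip it.
    for segment in alert.split("<BR>")[1:]:
        match = SEGMENT_RE.match(segment.strip())
        if match:
            info = match.groupdict()
            info["extra"] = info["extra"].strip()
            results.append(info)
    return results


EXAMPLE_ALERT = (
    "CRITICAL: host '208.80.154.197', interfaces up: 212, down: 1, "
    "dormant: 0, excluded: 0, unused: 0<BR>"
    "xe-3/2/3: down -> Core: cr2-codfw:xe-5/0/1 (Zayo, circuitID) "
    "36ms {#2909} [10Gbps wave]<BR>"
)

if __name__ == "__main__":
    for row in parse_down_interfaces(EXAMPLE_ALERT):
        print(row)
```

For an internal link with no provider in parentheses, `provider` and `circuit_id` come back as `None`.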

If such an alert shows up:

First, note that all links are redundant. Still, don't hesitate to depool the site if it is showing signs of a larger outage.

Identify the type of interface going down

  • 3rd party provider: Type can be Core/Transit/Peering/OOB, and a provider name is identifiable and present on that list
  • Internal link: Type is Core, no provider name listed
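
The two cases above can be written as a short decision function. This is a minimal sketch: `classify_link` and its return values are hypothetical names, and it cannot check that the provider actually appears on the provider list, so that step stays manual.

```python
from typing import Optional

# Types named in this runbook; other types may exist (assumption: this
# set is not exhaustive).
KNOWN_TYPES = {"Core", "Transit", "Peering", "OOB"}


def classify_link(link_type: str, provider: Optional[str]) -> str:
    """Return 'third-party', 'internal', or 'unknown' for a down interface."""
    if provider and link_type in KNOWN_TYPES:
        # A provider name is present: treat as a 3rd party provider link.
        return "third-party"
    if link_type == "Core" and not provider:
        # Core link with no provider listed: internal link.
        return "internal"
    # Anything else needs a human to look at the description.
    return "unknown"


if __name__ == "__main__":
    print(classify_link("Core", "Zayo"))
    print(classify_link("Core", None))
```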

If 3rd party provider link

  1. Check whether the provider has a planned maintenance for that circuit ID on the maintenance calendar
  2. Check whether the provider sent a last-minute maintenance or outage email notification
  • If the work is scheduled or the provider is aware of the incident
    1. Downtime the alert for the duration of the maintenance
    2. Monitor that no other links are going down (risk of total loss of redundancy)
  • If unplanned
    1. Open a Phabricator task, tag netops, include the alert and timestamp
    2. Contact the provider using the information present on that list; include the circuit ID and the time the outage started
    3. If needed, escalate to netops
    4. Monitor for recovery; if there is no reply to the email within 30 min, call the provider
    5. Close the task if the link recovers quickly

If internal link

  1. Open a Phabricator task, tag netops and dcops, include the alert and timestamp
  2. Most likely the optic needs to be replaced on one of the ends.

Juniper alarm

  • If warning/yellow: open a Phabricator task, tag netops
  • If critical/red: open a Phabricator task, tag netops, ping/page netops

BGP status

  • If warning/yellow: open a Phabricator task, tag/ping netops.
    • This is most likely an IXP peer session down
  • If critical/red: treat it like a router interface down alert.
    1. Identify the peer name: in a terminal type `whois as#####` or look up the AS number on http://peeringdb.com/
    2. Follow the router interface down instructions.
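
The `whois as#####` lookup in step 1 can be wrapped in a small helper when it needs to be scripted. This is a sketch under assumptions: `normalize_asn` and `whois_command` are hypothetical helper names, and AS64496 is an AS number reserved for documentation, not a real peer.

```python
import re


def normalize_asn(value: str) -> int:
    """Accept '64496', 'AS64496' or 'as64496' and return the number."""
    match = re.fullmatch(r"(?:AS)?(\d+)", value.strip(), flags=re.IGNORECASE)
    if not match:
        raise ValueError(f"not an AS number: {value!r}")
    return int(match.group(1))


def whois_command(value: str) -> list:
    """Build the argv for the `whois as#####` lookup from the runbook."""
    return ["whois", f"as{normalize_asn(value)}"]


if __name__ == "__main__":
    # Only prints the command; running it needs the whois client and
    # network access.
    print(" ".join(whois_command("AS64496")))
```

The resulting list can be passed to `subprocess.run(..., capture_output=True, text=True)` on a host where the `whois` client is installed.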

Syslog

Some of the syslog messages seen across the infrastructure, along with their fixes or workarounds, are listed at https://phabricator.wikimedia.org/T174397