Netflow/sflow

A high-level description is available at https://en.wikipedia.org/wiki/NetFlow and https://en.wikipedia.org/wiki/SFlow

Goal

Gather network-level (Layer 4) traffic flow metadata to assist with traffic engineering and DoS mitigation.

How does it work?

Figure: Netflow architecture diagram.

We first started with the "netflow" pipeline.

The sflow pipeline was added later, after SREs requested better visibility into internal flows. It was decided to keep the two pipelines separate:

  • to avoid impacting the primary (netflow) pipeline, which is real-time sensitive and critical to SREs' day-to-day work
  • because functionally different devices (switches vs. routers) historically supported different protocols (the gap has narrowed only very recently)

Netflow, on routers, for external traffic

On the routers side:

  • 1 out of 1000 flows crossing the routers' external interfaces (both inbound and outbound) has its metadata sent to a configured collector once the flow timeout is reached (here 10s)
    • Example metadata: source/destination IP, port and AS number, IP protocol, TCP flags, etc.
  • The routers share their full BGP view with the collector

On the collectors side:

  • Samplicator duplicates the IPFIX (netflow) packets to Fastnetmon and nfacct while spoofing the source IP, so they still appear to come from the routers
  • Nfacct extrapolates the flow size and packet count based on the sampling rate (e.g. multiplies by 1000)
  • Nfacct uses a prefix list (exported from Puppet) to enrich the collected flows with traffic direction
  • Nfacct uses the BGP data provided by the routers to enrich the collected flows metadata (adds peer src/dst AS#, AS path, src/dst AS#)
  • Nfacct uses an IP-to-location database to enrich the collected flows metadata (adds source and destination country)
  • Nfacct exports the enriched flow data to Druid via Kafka
  • Fastnetmon monitors inbound traffic for both known attack patterns and traffic-level thresholds; if any condition is met, it:
    • sends a notification email, including a traffic signature if able
    • triggers our monitoring system
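
The extrapolation and direction-tagging steps can be sketched in Python as follows. This is only an illustration of the idea, not how nfacct is implemented or configured: the record layout, the function names and the direction rule are assumptions, and the prefixes are documentation ranges standing in for the Puppet-exported prefix list.

```python
import ipaddress
from dataclasses import dataclass

SAMPLING_RATE = 1000  # 1-in-1000 sampling, as configured on the routers

# Placeholder prefixes (documentation ranges), NOT the real prefix list.
OWN_PREFIXES = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("2001:db8::/32"),
]


@dataclass
class SampledFlow:
    """Hypothetical minimal flow record as received from a router."""
    src: str
    dst: str
    sampled_bytes: int
    sampled_packets: int


def is_own(address: str) -> bool:
    """Return True if the address falls inside one of our prefixes."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in OWN_PREFIXES)


def enrich(flow: SampledFlow) -> dict:
    """Scale sampled counters up and tag the flow with a direction."""
    src_own, dst_own = is_own(flow.src), is_own(flow.dst)
    if dst_own and not src_own:
        direction = "inbound"
    elif src_own and not dst_own:
        direction = "outbound"
    else:
        direction = "unknown"  # assumption: anything else is left untagged
    return {
        "src": flow.src,
        "dst": flow.dst,
        # Each sampled flow stands in for roughly SAMPLING_RATE real ones.
        "bytes": flow.sampled_bytes * SAMPLING_RATE,
        "packets": flow.sampled_packets * SAMPLING_RATE,
        "direction": direction,
    }
```

For example, a sampled 1500-byte packet from an outside address to an address inside the placeholder prefix is reported as an inbound flow of 1,500,000 bytes and 1000 packets.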

Sflow, on switches, for internal traffic

On the switches side:

  • Sflow is enabled on L3 switches only, as older switches don't support sending sflow data over their management interface
  • 1:1000 sampling is configured on all server-facing ports in the server->switch direction (ingress) only, to prevent double accounting (the same traffic counted inbound on one port and outbound on another)
  • Packets exit through the data plane so as not to risk overwhelming the management plane (or management network)
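
The double-accounting problem that ingress-only sampling avoids can be illustrated with a small Python sketch. It ignores the 1:1000 sampling for clarity and assumes both servers hang off sampled, server-facing ports; the port names and function names are made up for the example.

```python
def observations(src_port: str, dst_port: str, sample_egress: bool) -> list:
    """List the (port, direction) points where a server-to-server flow is seen.

    The flow enters the switch on the sender's port (ingress) and leaves on
    the receiver's port (egress).
    """
    seen = [(src_port, "ingress")]
    if sample_egress:
        seen.append((dst_port, "egress"))
    return seen


def accounted_bytes(flows: list, sample_egress: bool) -> int:
    """Sum the bytes a collector would account for across all observations."""
    total = 0
    for src_port, dst_port, size in flows:
        total += size * len(observations(src_port, dst_port, sample_egress))
    return total


# Hypothetical flows: (sender port, receiver port, bytes)
flows = [
    ("ge-0/0/1", "ge-0/0/2", 1000),
    ("ge-0/0/2", "ge-0/0/3", 500),
]

ingress_only = accounted_bytes(flows, sample_egress=False)  # 1500, the real total
both_ways = accounted_bytes(flows, sample_egress=True)      # 3000, counted twice
```

With ingress-only sampling each flow is seen exactly once, so the accounted total matches the traffic actually sent.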

On the collectors side:

  • Sfacct extrapolates the flow size and packet count based on the sampling rate (e.g. multiplies by 1000)
  • Sfacct uses a prefix list (exported from Puppet) to enrich the collected flows with their scope (e.g. if the source and destination IPs are both within Wikimedia's ranges, it's an internal flow)
  • Sfacct exports the flow data tagged as internal to Druid via Kafka
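
The scope rule can be sketched in Python as below. This is an assumption-laden illustration rather than sfacct's actual mechanism: the `scope` field name, the `external` label for non-internal flows and the function names are invented here, and the prefixes are documentation ranges, not Wikimedia's.

```python
import ipaddress

# Placeholder prefixes (documentation ranges), NOT the real prefix list.
OWN_PREFIXES = [
    ipaddress.ip_network(prefix)
    for prefix in ("198.51.100.0/24", "2001:db8::/32")
]


def in_own_range(address: str) -> bool:
    """Return True if the address belongs to one of our prefixes."""
    ip = ipaddress.ip_address(address)
    return any(ip in net for net in OWN_PREFIXES)


def tag_scope(flow: dict) -> dict:
    """Return a copy of the flow with a hypothetical 'scope' field added.

    A flow is internal only when BOTH endpoints are in our own ranges.
    """
    internal = in_own_range(flow["src"]) and in_own_range(flow["dst"])
    return {**flow, "scope": "internal" if internal else "external"}
```

A flow between `198.51.100.1` and `198.51.100.2` would be tagged `internal`, while the same source talking to `203.0.113.9` would not.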

How to deploy?

Collectors:

  1. Apply role::netinsights to a server (see existing servers for specs)

The network device side is provisioned automatically with Homer. Except:

Troubleshooting

Check if pmacct is sending data to kafka

$ kafkacat -b kafka-jumbo1001.eqiad.wmnet -t netflow -C -o end

$ kafkacat -C -u -b kafka-jumbo1001.eqiad.wmnet:9092 -t network_flows_internal -o end | grep --line-buffered XXX | jq .
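
When a plain grep is too coarse, the JSON messages can instead be filtered on an exact field value. The script below is a minimal sketch: the file name `filter_flows.py` is hypothetical, and the field name used in the usage note (`ip_src`) is an assumption about the message layout that should be checked against the actual topic contents.

```python
import json
import sys


def matching_flows(lines, field, value):
    """Yield decoded flow records whose `field` equals `value` (as a string)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial or non-JSON lines
        if isinstance(record, dict) and str(record.get(field)) == value:
            yield record


if __name__ == "__main__" and len(sys.argv) == 3:
    # Reads one JSON message per line from stdin, e.g. the kafkacat output.
    field, value = sys.argv[1], sys.argv[2]
    for record in matching_flows(sys.stdin, field, value):
        print(json.dumps(record, indent=2), flush=True)
```

To use it, pipe the output of either kafkacat command above into `python3 filter_flows.py ip_src 198.51.100.10`, replacing the field and value with the ones of interest.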

Real time Fastnetmon dashboard

$ fastnetmon_client

Check the logs

Both Pmacct and Fastnetmon log to syslog; grep for nfacctd, sfacctd, or fastnetmon.

Detected attack details are logged in /var/log/fastnetmon_attacks/

Visualization

Limitations

  • Fastnetmon misreports attack type and protocol - T241374

Resources

https://www.juniper.net/documentation/en_US/junos/topics/task/configuration/services-ipfix-flow-template-flow-aggregation-configuring.html

https://github.com/pavel-odintsov/fastnetmon/

https://github.com/pmacct/

Roll out sensible flow-table-sizes to Juniper core routers with sampling enabled - T248394

Collect netflow data for internal traffic - T263277