Network best practices

A collection of best practices, and their exceptions, to follow for efficient use of our network and for high availability.

Please reach out to SRE/Infrastructure Foundations if you need help designing a service or if something doesn't fit those best practices.

Server uplinks

Except for the special cases listed below, servers MUST use a single uplink.

When a server needs access to more than one logical network (VLAN), those networks MUST be trunked over that single uplink.
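
For illustration, here is a minimal sketch of creating a tagged subinterface on the single trunked uplink with pyroute2 on Linux; the interface name eno1, VLAN ID 1017, and addressing are hypothetical, not taken from this page:

  from pyroute2 import IPRoute  # netlink-based Linux network configuration

  ipr = IPRoute()
  uplink = ipr.link_lookup(ifname="eno1")[0]  # the server's single uplink (hypothetical name)

  # Add a tagged subinterface for VLAN 1017, carried over the trunked uplink
  ipr.link("add", ifname="eno1.1017", kind="vlan", link=uplink, vlan_id=1017)
  vlan = ipr.link_lookup(ifname="eno1.1017")[0]
  ipr.addr("add", index=vlan, address="10.64.0.10", prefixlen=22)  # hypothetical address
  ipr.link("set", index=vlan, state="up")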

General reasons for servers to have multiple uplinks

Redundancy

When configured in an active/passive mode, if any element of the active path (switch, switch interface, cable, server interface) fails, the alternate link takes over.
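
As a sketch only (since, as listed below, we don't do this in production), an active/passive pair on Linux would typically be an active-backup bond; the interface names here are hypothetical:

  from pyroute2 import IPRoute

  ipr = IPRoute()

  # Active/passive bond: one slave carries traffic, the other takes over on failure
  ipr.link("add", ifname="bond0", kind="bond", bond_mode="active-backup")
  bond = ipr.link_lookup(ifname="bond0")[0]

  for name in ("eno1", "eno2"):  # two uplinks, ideally to two distinct switches
      idx = ipr.link_lookup(ifname=name)[0]
      ipr.link("set", index=idx, state="down")  # slaves must be down before enslaving
      ipr.link("set", index=idx, master=bond)

  ipr.link("set", index=bond, state="up")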

Not done in production for multiple reasons:

  • Switch cost (this would double our switch budget)
  • Low frequency of failure
    • It's more likely that a server fails or needs to be restarted than that a cable or interface fails
    • For example, even primary DBs use a single uplink
  • Higher setup complexity (more difficult troubleshooting, special cases to adapt in server lifecycle, more cabling)

One exception is the Fundraising Infrastructure: it is both critical and has a tiny footprint (2 racks), which allows it to have 2 ToR switches. The complexity downside applies here too (see task T268802)

Capacity

When configured in an active/active mode, 2x10G links provide more bandwidth than a single 10G link.
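
The active/active counterpart of the active-backup sketch above would be an LACP bond; only the bond mode changes (again purely illustrative, and not done in production as explained below):

  from pyroute2 import IPRoute

  ipr = IPRoute()
  # 802.3ad (LACP) load-shares across all slaves instead of keeping one passive;
  # the switch side must be configured as a matching aggregated link
  ipr.link("add", ifname="bond0", kind="bond", bond_mode="802.3ad")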

Not done in production for multiple reasons:

  • Switch cost (this would double our switch budget)
  • No need; this could change in the future, but current services rarely require more than 10G per server
    • New ToR switches support 1/10/25/50/100G uplinks
    • SPOF risk (it's better to have many distributed nodes than a few large ones)
  • Less visibility (risk that one link failing saturates the remaining link)
  • Backbone capacity: while we're upgrading backbone links and gear, significantly increasing capacity for a few servers could cause congestion in legacy parts of the infra
  • Higher setup complexity (more difficult troubleshooting, special cases to adapt in server lifecycle, more cabling)

Required L2 adjacency to physically distant networks

LVS - This use case will go away with the future L4LB project.

Failure domains

Services MUST tolerate the failure of an entire failure domain

Achieve this by spreading servers across multiple failure domains (in other words, by not putting all our eggs/servers in the same basket/failure domain).

In networking (and in our network) multiple elements are considered failure domains:

Virtual Chassis

Our legacy network design includes the use of the Juniper Virtual Chassis technology. While it eases management, it puts all members of a VC at risk of going down together due to a bug, misconfiguration, or maintenance.

You can find the list of active VCs on Netbox. For production they're the A/B/C/D rows in eqiad/codfw, as well as the esams/ulsfo/eqsin switches.
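
For example, a hedged sketch of pulling that list from the Netbox REST API (the URL and token are placeholders; /api/dcim/virtual-chassis/ is a standard Netbox endpoint):

  import requests

  NETBOX = "https://netbox.example.org"  # placeholder URL and token
  HEADERS = {"Authorization": "Token 0123456789abcdef"}

  r = requests.get(f"{NETBOX}/api/dcim/virtual-chassis/", headers=HEADERS)
  r.raise_for_status()
  for vc in r.json()["results"]:
      print(vc["name"], "members:", vc["member_count"])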

The new network design doesn't have this constraint; for example, F1 and F2 are distinct failure domains.

L2 domains

Little control can be exercised over traffic transiting an L2 domain: any misbehaving device will impact all the other servers, a cabling issue can flood the network, and efficient scaling is not possible.

In production, the legacy network design conveniently fits the VC model (L2 domains are stretched across the VC, e.g. private1-a-eqiad).

The new network design restricts each L2 domain to a single ToR switch.

For example, a service with 3 servers MUST NOT have those servers in eqiad racks A1, A2, and A3, but rather in A1, B1, C1, or in E1, E2, E3.
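
A hedged sketch of auditing rack diversity for a service against Netbox (the hostnames, URL, and token are hypothetical; /api/dcim/devices/ and its rack field are standard Netbox):

  import requests

  NETBOX = "https://netbox.example.org"  # placeholder URL and token
  HEADERS = {"Authorization": "Token 0123456789abcdef"}

  def rack_of(device):
      """Return the rack name Netbox records for a device."""
      r = requests.get(f"{NETBOX}/api/dcim/devices/",
                       params={"name": device}, headers=HEADERS)
      r.raise_for_status()
      return r.json()["results"][0]["rack"]["name"]

  servers = ["myservice1001", "myservice1002", "myservice1003"]  # hypothetical
  racks = {rack_of(s) for s in servers}
  # Note: on the legacy network, distinct racks in the same row still share a VC
  if len(racks) < len(servers):
      print("WARNING: some servers share a rack (and thus a failure domain):", sorted(racks))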

In the cloud realm, the cloud-instances vlan is stretched across multiple switches.

Ganeti clusters

Ganeti clusters follow the L2 domains; each cluster is thus a matching failure domain.

Core DCs

At a higher level, eqiad and codfw are disaster recovery pairs. Critical services should be able to perform their duty from the secondary site if the primary one becomes unreachable. See Switch Datacenter.

Public IPs

Except for special cases, servers MUST use private IPs

Public IPs, and more specifically public IPv4 addresses, carry the following risks:

  • They are directly exposed to the Internet, and thus have fewer safeguards if a misconfiguration or bug is introduced to their firewall rules or host services
  • They are scarce, and thus expensive, and could negatively impact our growth or the deployment of a new service/POP/etc.

They should only be used if there are no other options (for example if a service shouldn't depend on LVS), and after thorough discussion. The diagram below can help figure out whether we're indeed in such a special case.

We should additionally strive to migrate existing servers that use direct public IPs when the requirement or dependency is no longer valid.

[Figure: New service IP flow chart.png. Flow chart to help a user pick an appropriate IP type for production.]
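
As a quick complement to the flow chart, Python's standard ipaddress module can flag hosts holding globally routable addresses (a minimal sketch; the address list is made up):

  import ipaddress

  # Made-up inventory of primary-interface addresses to audit
  addresses = ["10.64.0.10", "8.8.8.8", "2001:4860:4860::8888"]

  for a in addresses:
      ip = ipaddress.ip_address(a)
      print(a, "->", "PUBLIC" if ip.is_global else "private/special")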

IPv6

Except for special cases, servers MUST be dual-stacked (have both a v4 and a v6 IP on their primary interface)

TODO
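
In the meantime, a minimal sketch of checking that a host resolves over both address families, using only the standard library (the hostname is hypothetical):

  import socket

  host = "myservice1001.eqiad.wmnet"  # hypothetical hostname

  def addrs(family):
      """Addresses of `host` for one address family, or an empty set."""
      try:
          return {ai[4][0] for ai in socket.getaddrinfo(host, None, family)}
      except socket.gaierror:
          return set()

  v4, v6 = addrs(socket.AF_INET), addrs(socket.AF_INET6)
  print(host, "v4:", sorted(v4), "v6:", sorted(v6))
  if not (v4 and v6):
      print("WARNING: not dual-stacked")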

Congestion

Cross DC traffic flows MUST be capped at 5Gbps

Cluster traffic exchanges within a DC MUST NOT exceed 30Gbps

The network is a shared resource: while we're working on increasing backbone capacity (hardware/links) and safeguards (QoS), we all need to be careful about large data transfers.
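
For bulk copies, pacing the sender is the simplest way to stay under those caps. A minimal token-bucket-style sketch (the 5Gbps figure is the cross-DC cap above; the function and file objects are hypothetical):

  import time

  CAP_BPS = 5 * 10**9 / 8  # the 5Gbps cross-DC cap, in bytes per second
  CHUNK = 4 * 1024 * 1024  # copy in 4 MiB chunks

  def throttled_copy(src, dst, rate=CAP_BPS):
      """Copy file-like src to dst, sleeping to stay under `rate` bytes/s."""
      start, sent = time.monotonic(), 0
      while chunk := src.read(CHUNK):
          dst.write(chunk)
          sent += len(chunk)
          ahead = sent / rate - (time.monotonic() - start)
          if ahead > 0:  # we're ahead of the allowed rate: back off
              time.sleep(ahead)

In practice a transfer should stay well under the cap, to leave headroom for the other flows sharing the same links.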