Network best practices

A collection of best practices to follow for efficient use of our network and for high availability, along with their exceptions.
Please reach out to [[SRE/Infrastructure Foundations]] if you need help designing a service or if something doesn't fit those best practices.
== Server uplinks ==
<big>Except for the special cases listed below, servers MUST use a single uplink.</big>
When a server needs access to more than one logical network (vlan), those vlans MUST be [[:en:IEEE_802.1Q|trunked]] over that single uplink.
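As an illustration, the single-uplink-plus-trunking setup can be sketched as a small helper that generates the iproute2 commands for tagged sub-interfaces. This is only a sketch: the interface name, vlan IDs, and addresses below are made-up examples, not our actual configuration.

```python
def trunk_commands(nic, vlans):
    """Build the iproute2 commands that expose several tagged vlans
    (802.1Q trunking) on one physical uplink.

    nic:   physical interface name, e.g. "eno1" (hypothetical)
    vlans: mapping of vlan ID -> CIDR address for the sub-interface
    """
    cmds = []
    for vlan_id, cidr in sorted(vlans.items()):
        sub = f"{nic}.{vlan_id}"  # conventional <nic>.<vlan-id> naming
        cmds.append(f"ip link add link {nic} name {sub} type vlan id {vlan_id}")
        cmds.append(f"ip addr add {cidr} dev {sub}")
        cmds.append(f"ip link set {sub} up")
    return cmds

# Hypothetical example: one uplink carrying two tagged vlans
for cmd in trunk_commands("eno1", {1017: "10.64.0.10/22", 1117: "10.65.0.10/24"}):
    print(cmd)
```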
=== General reasons for servers to have multiple uplinks ===
==== Redundancy ====
When configured in an active/passive mode, if any element of the path (switch, switch interface, cable, or server interface) fails, the alternate link takes over.
Not done in production for multiple reasons:
* Switch cost (this would double our switch budget)
* Low frequency of failure
** It's more likely that a server fails or needs to be restarted than that a cable or interface fails
** For example, even primary DBs use a single uplink
* Higher setup complexity (more difficult troubleshooting, special cases to adapt in server lifecycle, more cabling)
One exception is the Fundraising Infrastructure, as it is both critical and has a tiny footprint (2 racks), allowing it to have 2 ToR switches. The complexity downside applies here too (see {{Phabricator/en|T268802}})
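The active/passive behaviour described above can be modelled in a few lines. This is a simplified sketch of the failover logic, not how link bonding is actually implemented:

```python
def active_link(primary_up, secondary_up):
    """Active/passive failover: traffic always uses the primary link
    while it is healthy; the secondary only carries traffic when the
    primary path (switch, interface, or cable) has failed."""
    if primary_up:
        return "primary"
    if secondary_up:
        return "secondary"
    return None  # both links down: the host is unreachable
```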
==== Capacity ====
When configured in an active/active mode, 2x10G links will have more bandwidth than 1x10G link.
Not done in production for multiple reasons:
* Switch cost (this would double our switch budget)
* No need; this could change in the future, but current services rarely require more than 10G per server
** New ToR switches support up to 25G uplinks (with 40/50/100G for exceptional cases)
** SPOF risk (it's recommended to have multiple distributed nodes rather than a few large ones)
* Risk that a failing link saturates the remaining one (e.g. a server pushing >10G through 2 NICs, which we don't monitor for)
* Backbone capacity: while we're upgrading backbone links and gear, significantly increasing capacity for a few servers could cause congestion in legacy parts of the infrastructure
* Higher setup complexity (more difficult troubleshooting, special cases to adapt in server lifecycle, more cabling)
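The saturation risk above can be made concrete with a quick back-of-the-envelope check. A sketch only; the 10G default matches the link sizes mentioned in this section:

```python
def saturation_risk(throughput_gbps, links=2, link_capacity_gbps=10.0):
    """In an active/active setup, if one link fails the remaining
    link(s) must absorb the whole flow; return True when that would
    exceed their combined capacity."""
    remaining_capacity = (links - 1) * link_capacity_gbps
    return throughput_gbps > remaining_capacity

# A server pushing 12G over 2x10G links would saturate the survivor
# if one link failed; 8G would still fit.
```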
==== Required L2 adjacency to physically distant networks ====
[[LVS]] - This use-case will go away with the future L4LB project.
== Failure domains ==
<big>Services MUST tolerate the failure of an entire failure domain</big>
This is achieved by spreading servers across multiple [[:en:Failure_domain|failure domains]] (in other words, not putting all our eggs/servers in the same basket/failure domain).
In networking (and in our network) multiple elements are considered failure domains:
=== Virtual Chassis ===
Our legacy network design includes the usage of the Juniper Virtual Chassis technology. While it eases management, it puts all members of a VC at risk of going down together due to a bug, misconfiguration, or maintenance.
You can find the list of active VCs on [ Netbox]. For production they're the A/B/C/D rows in eqiad/codfw, as well as the esams/ulsfo/eqsin switches.
The new network design doesn't have this constraint; for example, F1 and F2 are distinct failure domains.
=== L2 domains ===
There is little control that can be exercised over traffic transiting a [[:en:Data_link_layer|L2 domain]]: any misbehaving device will impact all the other servers, a cabling issue can flood the network, and efficient scaling is not possible.
For production, the legacy network design conveniently fits the VC model (L2 domains are stretched across the VC, e.g. private1-a-eqiad).
The new network design restricts each L2 domain to a single ToR switch.
For example, a service with 3 servers MUST NOT have those servers in eqiad A1 A2 A3, but rather A1 B1 C1, or E1 E2 E3.
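A placement rule like the one above can be expressed as a small check. This is a simplified model: it assumes rack names like "A1", treats rows A-D as legacy per-row VCs (one failure domain per row) and rows E/F as the new per-rack design (one failure domain per rack):

```python
def valid_placement(racks, new_design_rows={"E", "F"}):
    """Return True when no two servers share a failure domain.

    racks: list of rack names, e.g. ["A1", "B1", "C1"] (hypothetical)
    Legacy rows (A-D) form per-row Virtual Chassis, so servers must
    sit in different rows; new-design rows (E/F) have per-rack L2
    domains, so distinct racks are enough."""
    rows = [rack[0] for rack in racks]
    if all(row in new_design_rows for row in rows):
        return len(set(racks)) == len(racks)  # distinct racks suffice
    return len(set(rows)) == len(rows)        # need distinct rows
```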
In the cloud realm, the cloud-instances vlan is stretched across multiple switches.
==== Ganeti clusters ====
[[labsconsole:Ganeti|Ganeti]] [ clusters] follow the L2 domains; each cluster is thus a matching failure domain.
=== Core DCs ===
At a higher level, eqiad and codfw are [[:en:Disaster_recovery|disaster recovery]] pairs. Critical services should be able to perform their duty from the secondary site if the primary one becomes unreachable. See [[Switch Datacenter]].
== Public IPs ==
<big>Except for special cases, servers MUST use private IPs</big>
Services requiring public Internet connectivity can be deployed in several ways.  The most straightforward, deploying hosts to a public vlan with a public IP assigned directly to their primary interface, is discouraged for several reasons:
* There is no load-balancing / redundancy built in when reaching the IP
* They are directly exposed to the Internet, and thus have fewer safeguards if a misconfiguration or bug is introduced to their firewall rules or host services
* IPv4 space is scarce, and pre-allocating large public subnets to vlans is difficult to do without wastage.
Where services need to be made available to the internet they should ideally sit behind a load-balancer, or expose the service IP with another technique (BGP etc.).  Where hosts need outbound web access they should use our [[HTTP proxy|HTTP proxies]] where possible.
Public vlans should be used only if there is no other option (for example, if a service cannot sit behind a load-balancer, or needs external access that cannot be provided any other way).  The below diagram can help figure out whether we're indeed in such a special case.
Additionally, we should strive to migrate services away from public vlans if the requirement or dependency is no longer valid, or can be satisfied in a different way.[[File:New service IP flow chart.png|none|thumb|Flow chart to help a user pick an appropriate IP type for production.]]
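The guidance above can be condensed into a rough decision helper. This is an illustrative simplification of the flow chart, not an official encoding of the policy; the function name and return strings are made up:

```python
def recommended_ip_type(can_sit_behind_lb, only_needs_outbound_web):
    """Rough order of preference when a service needs Internet
    connectivity; a public vlan is the last resort."""
    if can_sit_behind_lb:
        return "private IP, exposed via load balancer / BGP service IP"
    if only_needs_outbound_web:
        return "private IP, outbound via HTTP proxies"
    return "public vlan (special case)"
```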
== IPv6 ==
<big>Except for special cases, servers MUST be dual-stacked (have both a v4 and a v6 IP on their primary interface)</big>
This aligns with the longer-term goal of deprecating IPv4 and eventually only having one protocol to configure.
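The dual-stack requirement is easy to express programmatically; the sketch below uses Python's ipaddress module, and the example addresses are made up:

```python
import ipaddress

def is_dual_stacked(addresses):
    """True when the address list contains at least one IPv4 and one
    IPv6 address, i.e. the interface is dual stacked."""
    versions = {ipaddress.ip_address(a).version for a in addresses}
    return {4, 6} <= versions
```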
== Congestion ==
<big>Cross DC traffic flows SHOULD be capped at 5Gbps</big>
<big>Cluster traffic exchanges within a DC SHOULD NOT exceed 30Gbps</big>
The network is a shared resource; while we're working on increasing backbone capacity (hardware/links) and safeguards (QoS), we all need to be careful about large data transfers.
If planning a service that is expected to consume a lot of bandwidth please discuss with Netops to ensure optimal placement and configuration of the network. It is extremely important that we don't introduce new services which may end up negatively impacting the overall network or existing applications.
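When sizing a planned flow against the SHOULD caps above, a trivial check is enough; a sketch, with the two thresholds taken directly from this section:

```python
CROSS_DC_CAP_GBPS = 5    # SHOULD cap for cross-DC traffic flows
INTRA_DC_CAP_GBPS = 30   # SHOULD cap for cluster exchanges within a DC

def within_caps(flow_gbps, cross_dc):
    """Check a planned flow against the soft caps; flows over the
    cap warrant a discussion with Netops first."""
    cap = CROSS_DC_CAP_GBPS if cross_dc else INTRA_DC_CAP_GBPS
    return flow_gbps <= cap
```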

Latest revision as of 10:47, 2 September 2022