Network design - Eqiad WMCS Network Infra


This page details the configuration of the network devices managed by SRE Infrastructure Foundations (netops) to support cloud services in the Eqiad (Equinix, Ashburn) datacenter. Further information on the overall WMCS networking setup, including elements managed by the WMCS team themselves, is on the Portal:Cloud VPS/Admin/Network page.

Physical Network

File:WMCS network-L1.png

The dedicated physical network currently consists of four racks of equipment: C8, D5, E4 and F4. Six Juniper QFX-series switches are deployed across the four racks. Additionally, rack C8 is connected to the virtual-chassis switches in row B, to provide connectivity for legacy servers installed there.

Racks C8 and D5 each have 2 switches, a main switch that connects servers and also has an uplink to one of the core routers, and a second switch which provides additional ports for servers. Most cloud hosts consume 2 switch ports, which means a single 48-port switch is not sufficient to connect all hosts in the racks, hence the second switch in each.

Racks E4 and F4 currently only have a single top-of-rack switch each, and it is hoped that in time WMCS can adjust the server configs to use 802.1Q / vlan tagging, so that separate physical ports are not required to connect to two or more networks.

The network is configured in a basic Spine/Leaf structure, with the switches in C8 and D5 acting as Spines, aggregating traffic from E4 and F4, and connecting to the outside world via the CR routers. Connections between racks E4/F4 and C8/D5 are optical 40G Ethernet (40GBase-LR) connections over single-mode fiber. The topology is not a perfect Spine/Leaf, however, as there is also a direct connection between cloudsw1-c8 and cloudsw1-d5. This is required for various reasons, principally that there is only a single uplink from each cloudsw to the CR routers, and an alternate path is needed in case of a link or CR failure.

Logical Network

Several networks are configured on the switches described in the last section. At a high level, networks are divided into the "cloud" and "production" realms, which are logically isolated from each other. This isolation is used to support the agreed Cross-Realm traffic guidelines.

Isolation is achieved through the use of Vlans and VRFs (routing-instances in JunOS) on the cloudsw devices. The default routing-instance on the cloudsw devices is used for the production realm traffic, and a named routing-instance, 'cloud', is used for the cloud realm.

Some networks exist purely at layer-2, with the switches only forwarding traffic between servers based on destination MAC address. The switches are unaware of the IP addressing used on those layer-2 segments and do not participate in routing. Those networks only carry traffic internal to the cloud realm. Specific cloud hosts, like cloudnet and cloudgw, act as the layer-3 routers for devices on these segments. These networks are not technically part of the cloud VRF, as the switches have no IP interfaces in them, but they are considered part of the cloud realm.
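As a rough illustration of the difference, on a Juniper ELS switch a Vlan only participates in routing if an IRB (l3-interface) is attached to it. The Vlan names, IDs and addresses below are placeholders, not the actual eqiad assignments:

  # Hypothetical L2-only cloud Vlan: the switch just bridges frames; cloudnet/cloudgw provide L3
  set vlans cloud-instances-example vlan-id 1105
  # Hypothetical routed Vlan: an IRB interface makes the switch the L3 gateway
  set vlans cloud-hosts-example vlan-id 1111
  set vlans cloud-hosts-example l3-interface irb.1111
  set interfaces irb unit 1111 family inet address 192.0.2.1/24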

Production Realm

File:WMCS network-L3 - Prod Realm.drawio.png

The above diagram shows an overview of the production realm routing configured on the cloud switches.

CR Uplinks

Cloudsw1-c8 and cloudsw1-d5 each have a 10G uplink to one of our core routers (CRs). 802.1q sub-interfaces are configured on these links, and one sub-interface on each switch is used for production realm traffic. eBGP is used to exchange routes with the CRs, with separate BGP sessions used to exchange IPv4 and IPv6 routes: IPv4 routes are exchanged over a session between the IPv4 addresses on either side, and IPv6 routes over a session between the IPv6 addresses.
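A minimal sketch of such an uplink sub-interface is shown below; the interface name, unit/VLAN number and addresses are placeholders rather than the real eqiad values:

  set interfaces xe-0/0/48 description "cr1-eqiad uplink"
  set interfaces xe-0/0/48 vlan-tagging
  # Production realm sub-interface, dual-stacked so both BGP sessions can be established
  set interfaces xe-0/0/48 unit 2001 vlan-id 2001
  set interfaces xe-0/0/48 unit 2001 family inet address 192.0.2.2/31
  set interfaces xe-0/0/48 unit 2001 family inet6 address 2001:db8:1::2/64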

The CRs only announce default routes to the switches. The switches announce all routes from the production realm to the CR routers. This includes all the connected subnets on each switch (both the end-host/production and link/infrastructure networks), as well as the production loopbacks configured on the switches. A maximum-prefix setting of 1,000 routes is applied to the eBGP sessions towards the CR routers. This is purely a safeguard, in case a full routing table were ever announced by a CR in error, to protect the switches, which have limited TCAM space for routes.
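A sketch of what the cloudsw side of these sessions could look like follows. The group and policy names and neighbor addresses are placeholders; AS64710 is the cloudsw ASN noted later on this page, and AS14907 is the Wikimedia production network ASN:

  set routing-options autonomous-system 64710
  # IPv4 session to the CR, with the 1,000-route safety limit
  set protocols bgp group CR4 type external
  set protocols bgp group CR4 peer-as 14907
  set protocols bgp group CR4 family inet unicast prefix-limit maximum 1000
  set protocols bgp group CR4 family inet unicast prefix-limit teardown
  set protocols bgp group CR4 export prod-to-cr4
  set protocols bgp group CR4 neighbor 192.0.2.3
  # Matching IPv6 session
  set protocols bgp group CR6 type external
  set protocols bgp group CR6 peer-as 14907
  set protocols bgp group CR6 family inet6 unicast prefix-limit maximum 1000
  set protocols bgp group CR6 family inet6 unicast prefix-limit teardown
  set protocols bgp group CR6 export prod-to-cr6
  set protocols bgp group CR6 neighbor 2001:db8:1::3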

On the CR routers these peerings are in the Switch4 and Switch6 groups, along with the peerings to the EVPN Spine switches, which similarly announce private production subnets and device loopbacks. Filters are deployed on these peerings to ensure only the expected routes are accepted by the CRs. The CR sub-interfaces have the cr-labs filters applied to them, which control what traffic is allowed onto these networks from cloud hosts. Additionally, uRPF is configured on these interfaces to ensure traffic is sourced from valid source networks.
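For illustration, applying the traffic filter and uRPF check on a CR sub-interface might look roughly like the following; the interface and unit numbers are placeholders, and the IPv6 filter name is an assumption:

  set interfaces xe-3/1/4 unit 1102 family inet filter input cr-labs
  set interfaces xe-3/1/4 unit 1102 family inet rpf-check
  set interfaces xe-3/1/4 unit 1102 family inet6 filter input cr-labs6
  set interfaces xe-3/1/4 unit 1102 family inet6 rpf-check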

Local Networks

Each rack has a dedicated Vlan for production realm hosts to connect to. All of these Vlans have a /24 IPv4 subnet and a /64 IPv6 subnet configured, and the cloudsw1 device in each rack acts as the L3 gateway for these subnets. In most cases the switches multicast IPv6 RAs to all hosts in the Vlan. RAs are not, however, enabled on the cloudsw2 devices in C8/D5, to prevent hosts connected to cloudsw1 in those racks from using them as gateway.

IPv4 DHCP relay is enabled on all switches, forwarding DHCP messages from local hosts to the install server which processes the requests. DHCP Option 82 information is added to DHCP DISCOVER messages by the switch, allowing the install server to identify the host making the request and assign the correct IP. The cloudsw2 devices in C8 and D5 each have an IP interface on the cloud-hosts1 vlan for their rack, even though they do not act as gateway for the Vlan; they use these IPs as the source for relayed DHCP messages sent to the install server.
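A rough sketch of the per-rack configuration described above, with placeholder Vlan names, IDs and addresses, could look like:

  # Per-rack cloud-hosts Vlan, with the cloudsw1 IRB acting as L3 gateway
  set vlans cloud-hosts1-example vlan-id 1111
  set vlans cloud-hosts1-example l3-interface irb.1111
  set interfaces irb unit 1111 family inet address 192.0.2.1/24
  set interfaces irb unit 1111 family inet6 address 2001:db8:111::1/64
  # Multicast IPv6 RAs so hosts learn their gateway (omitted on the cloudsw2 devices)
  set protocols router-advertisement interface irb.1111 prefix 2001:db8:111::/64
  # Relay IPv4 DHCP to the install server, adding Option 82 so it can identify the requesting host
  set forwarding-options dhcp-relay server-group install-servers 192.0.2.10
  set forwarding-options dhcp-relay group cloud-hosts active-server-group install-servers
  set forwarding-options dhcp-relay group cloud-hosts interface irb.1111
  set forwarding-options dhcp-relay relay-option-82 circuit-id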

The switches in racks D5 and C8, as well as the asw2-b-eqiad virtual-chassis, also have the legacy cloud-hosts1-eqiad Vlan (1118) configured on them. This Vlan is trunked across these switches, and cloud hosts provisioned prior to the redesign are connected to it. Cloudsw1-c8 and cloudsw1-d5 run VRRP between them over this Vlan, acting as gateway for the hosts. No specific VRRP priority is configured, so master selection is non-deterministic. Over time this Vlan will be phased out, as replacement hosts will automatically get added to the new, rack-specific Vlans. In this way all hosts will eventually use an L3 gateway in the same rack.
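On the two gateway switches the VRRP configuration for this Vlan would look something like the sketch below; the addresses and group number are placeholders, and, as noted, no priority is set so mastership is non-deterministic:

  # On cloudsw1-c8; cloudsw1-d5 mirrors this with its own physical address
  set interfaces irb unit 1118 family inet address 192.0.2.2/24 vrrp-group 118 virtual-address 192.0.2.1
  set interfaces irb unit 1118 family inet address 192.0.2.2/24 vrrp-group 118 accept-data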

Link Networks

Several Vlans are used as "link networks". These are configured on trunks between two switches as required, with a /30 IPv4 subnet configured on matching IRB/Vlan interfaces on each side. These Vlans are used as a next-hop for routed traffic between switches, and to establish BGP peerings. Ideally one would use regular routed interfaces, with IPs bound directly to the interface (or sub-interfaces of it), but these inter-switch links instead need to be configured as L2 trunks to support the stretched cloud-instances Vlan.
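A sketch of one such trunk and its link network is below; the interface, Vlan names, IDs and addresses are placeholders:

  # Inter-switch trunk carrying the link-network Vlan plus the stretched cloud Vlans
  set interfaces et-0/0/54 unit 0 family ethernet-switching interface-mode trunk
  set interfaces et-0/0/54 unit 0 family ethernet-switching vlan members [ xlink-c8-e4 cloud-instances-example ]
  # /30 link network; the matching IRB addresses on each side are used as BGP next-hops
  set vlans xlink-c8-e4 vlan-id 1121
  set vlans xlink-c8-e4 l3-interface irb.1121
  set interfaces irb unit 1121 family inet address 192.0.2.5/30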


L2 Vlans

The production realm has no Vlans configured that operate purely at layer-2. All production realm Vlans have routed IRB interfaces on the switches which act as L3 gateway for connected hosts.

Within racks C8 and D5 the cloud-hosts vlans are extended at layer-2 to the cloudsw2 devices in those racks, to provide additional ports for end servers (not shown on diagram).

Cloudsw BGP Routing

The cloudsw1 devices in all racks run BGP in the default routing-instance, and exchange production realm prefixes with each other. The "Spine" devices, cloudsw1-c8 and cloudsw1-d5, both use AS64710. Each of them has eBGP configured over the 'linknet' Vlans to cloudsw1-e4 (AS4264710003) and cloudsw1-f4 (AS4264710004). Cloudsw1-c8 and cloudsw1-d5 peer with each other using iBGP. The BGP export policy on all of these peerings accepts all local networks (direct), static routes and BGP routes; no more finely-grained filters are used.
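The sketch below shows roughly how this could be expressed on one of the spines; the policy and group names and neighbor addresses are placeholders, while the AS numbers are those listed above:

  # Export policy: re-advertise connected subnets, static routes and BGP-learned routes
  set policy-options policy-statement cloudsw-out term direct from protocol direct
  set policy-options policy-statement cloudsw-out term direct then accept
  set policy-options policy-statement cloudsw-out term static from protocol static
  set policy-options policy-statement cloudsw-out term static then accept
  set policy-options policy-statement cloudsw-out term bgp from protocol bgp
  set policy-options policy-statement cloudsw-out term bgp then accept
  set policy-options policy-statement cloudsw-out term other then reject
  # eBGP from a spine (AS64710) to a leaf, plus iBGP between the two spines
  set routing-options autonomous-system 64710
  set protocols bgp group leaves type external
  set protocols bgp group leaves export cloudsw-out
  set protocols bgp group leaves neighbor 192.0.2.9 peer-as 4264710003
  set protocols bgp group spines type internal
  set protocols bgp group spines export cloudsw-out
  set protocols bgp group spines neighbor 192.0.2.13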

Cloud Realm / VRF

The below diagram provides an overview of routing in the Cloud VRF:

File:WMCS network-L3 - Cloud VRF Realm.drawio.png

Routing Instance

All IP interfaces on the cloudsw devices in the cloud realm are placed into a dedicated VRF / routing-instance. This alternate routing instance has no entries for any of the production realm networks, and thus traffic cannot route directly between the cloud and production realms via the switches.

In general the routed topology for the cloud VRF mirrors the one in the default instance (production realm), just isolated from it. The cloud vrf is IPv4-only, as WMCS does not yet support IPv6.
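A minimal sketch of such an instance is shown below. Whether the real devices use instance-type vrf or virtual-router is not covered on this page, so virtual-router is shown for simplicity; the interface units are examples only (1120 and 1106 are the transit and legacy storage Vlans described below, while the CR sub-interface unit is a placeholder):

  set routing-instances cloud instance-type virtual-router
  # Cloud-realm IP interfaces are placed in the instance, keeping them out of the production table
  set routing-instances cloud interface irb.1120
  set routing-instances cloud interface irb.1106
  set routing-instances cloud interface xe-0/0/48.1107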

Static Routes

The two IPv4 ranges used internally by WMCS are statically routed to the cloudgw VIP on Vlan 1120 (cloud-instance-transport1-b-eqiad), on cloudsw1-d5 and cloudsw1-c8:

Prefix          Description
172.16.0.0/21   Cloud instance (VM) IP range; ideally should not be routable from WMF production (see T209011)
185.15.56.0/24  WMCS public IPv4 aggregate route
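Expressed as JunOS config, these statics would look roughly like the lines below, with the cloudgw VIP shown as a placeholder address:

  set routing-instances cloud routing-options static route 172.16.0.0/21 next-hop 192.0.2.17
  set routing-instances cloud routing-options static route 185.15.56.0/24 next-hop 192.0.2.17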

CR Uplinks

Dedicated sub-interfaces are configured for the cloud vrf on the same physical CR uplinks (from cloudsw1-c8 and cloudsw1-d5) as the production realm uses.

eBGP is configured over these links, similar to the setup in the default table. The two static routes described in the previous section are exported to the CRs, making these ranges routable from WMF production. The CR sub-interfaces connecting to the cloud vrf have the cloud-in filter applied, to control what traffic is allowed between the cloud realm and WMF production.
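A sketch of how the cloud vrf side of this could be configured follows; the policy and group names, neighbor address, interface and unit are placeholders, while cloud-in is the filter named above:

  # On the cloudsw: announce only the two static aggregates to the CR
  set policy-options policy-statement cloud-to-cr term statics from protocol static
  set policy-options policy-statement cloud-to-cr term statics then accept
  set policy-options policy-statement cloud-to-cr term other then reject
  set routing-instances cloud protocols bgp group CR-cloud type external
  set routing-instances cloud protocols bgp group CR-cloud peer-as 14907
  set routing-instances cloud protocols bgp group CR-cloud export cloud-to-cr
  set routing-instances cloud protocols bgp group CR-cloud neighbor 192.0.2.21
  # On the CR side, the cloud-in filter is applied to the sub-interface facing the cloud vrf
  set interfaces xe-3/1/4 unit 1107 family inet filter input cloud-in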

Local Networks

Transit Vlan

Vlan 1120 (cloud-instance-transport1-b-eqiad) is configured on cloudsw1-c8 and cloudsw1-d5. On each switch only a single host is connected to it: cloudgw1001 is connected to cloudsw1-c8, and cloudgw1002 is connected to cloudsw1-d5. The two switches provide a VRRP gateway in this Vlan, which the cloudgw devices use as their next hop for traffic leaving the cloud realm.

TODO: - VRRP on this Vlan is somewhat pointless. If either cloudsw fails then one of the cloudgws will be offline, which should force cloudgw to fail over. The use of a VRRP VIP on the cloudgw

Storage Networks

On the cloudsw devices themselves, the only local cloud-realm Vlans / subnets configured are the cloud-storage networks. These are used by the WMCS Ceph hosts as their 'cluster' network. The L2 Vlan and IRB interface MTUs are set to allow jumbo Ethernet frames to pass over these Vlans, and the Ceph hosts are configured with a 9,000 byte MTU. All the cloud-storage networks are within the RFC1918 192.168.0.0/16 range, and the Ceph hosts need a static route for that supernet towards the cloudsw IRB IP (the last IP in the subnet) to communicate with each other. None of these ranges are announced in BGP to the CR routers; they are used only locally between the Ceph hosts within the cloud vrf.
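A sketch of the jumbo-frame settings on a storage Vlan is below; the interface names and address are placeholders (the Ceph hosts themselves additionally carry the 192.168.0.0/16 static route towards the switch IRB):

  # Raise the MTU on ports carrying the storage Vlan, and on the IRB, so the hosts' 9,000-byte frames fit
  set interfaces et-0/0/50 mtu 9216
  set interfaces irb mtu 9192
  set interfaces irb unit 1106 family inet address 192.168.4.2/24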

Vlan 1106 is the 'legacy' storage Vlan for Ceph hosts, and is trunked between all the cloudsw devices in racks c8/d5, as well as to the asw2-b-eqiad virtual-chassis. This is similar to Vlan 1118 in the production realm. Cloudsw1-c8 and cloudsw1-d5 run VRRP between them and act as gateway for cloud hosts on this Vlan, providing the VIP 192.168.4.254/24 as the gateway to the other storage subnets.

TODO: - Create two new, per-rack storage Vlans / subnets for racks c8 and d5, so we can move Ceph hosts to always using a local gateway.

Link Networks

The MTU on these interfaces is also set high, to allow routing of jumbo frames on the Ceph cluster network.

L2 Vlans

The cloud-instances Vlan exists purely at layer-2 on the switches and is stretched across the inter-switch trunks. The WMCS cloudnet hosts, implementing the OpenStack Neutron router, act as gateway for cloud instances (VMs) on this network.

Cloudsw BGP Routing

Notes