Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh

Revision as of 16:54, 27 August 2020 by imported>Ayounsi (Merge cloudsw ideas to the proposed solution)

This page describes the cloudgw project, a refresh of the CloudVPS edge network that reworks the edge by offloading functionality from the Neutron virtual router to a dedicated L3 Linux box, and from the prod core routers to the cloudsw physical network switches.

Why the cloudgw project

This section explains the rationale behind the cloudgw project.


It is important to understand the context and scope of the project to better evaluate the proposed solution.

Current edge network setup

Eqiad1 transport.png

From a network point of view, we can understand the WMF prod network as the upstream (or ISP) connection for the CloudVPS service. Our Neutron virtual router is defined as the gateway between the virtual network (the CloudVPS virtual network) and our upstream connection. There is no other L3 device between Neutron and the prod core router. Given we don't have any device with proper firewalling capabilities in the network, the core router also acts as a firewall for the CloudVPS virtual network.

There is static routing in the prod core routers to support this setup, and past BGP experiments showed the limits of Neutron for acting as a true edge router. In this setup, we are limited to static routing only.

The virtual machines inside the CloudVPS virtual network use private addressing. When virtual machines contact the outside internet, a NAT mechanism in the Neutron virtual router SNATs the traffic using a public IPv4 address. In our setup, we refer to this address as the routing_source_ip. We also have a feature called floating IP, which associates a VM instance with a public IPv4 address; that floating IP address is then used for all ingress/egress traffic of the VM instance.
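The source-address selection described above can be sketched in a few lines of Python. This is an illustrative model only: the addresses and the floating IP mapping below are invented placeholders (the real routing_source_ip and floating IP pool live in the production Neutron configuration).

```python
import ipaddress

# Placeholder values for illustration only; not the real production addresses.
ROUTING_SOURCE_IP = ipaddress.ip_address("198.51.100.1")
FLOATING_IPS = {"172.16.0.10": ipaddress.ip_address("198.51.100.20")}

def egress_source(vm_private_ip):
    """Pick the public source address egress traffic would be SNATed to."""
    # A VM with an associated floating IP uses it for all ingress/egress.
    floating = FLOATING_IPS.get(vm_private_ip)
    if floating is not None:
        return floating
    # Otherwise the shared routing_source_ip is used.
    return ROUTING_SOURCE_IP
```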

Traditionally, we have needed some VMs to contact certain WMF prod internal services directly, without NAT involved, so that WMF services know which particular VM instance is using them. We implement this NAT exclusion by means of a mechanism called dmz_cidr. Currently, some of our services, like NFS, rely on this setup to have proper control over VM usage of the service.
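The dmz_cidr exclusion boils down to a destination check before NAT is applied. A minimal sketch, assuming a hypothetical exclusion list (the real one enumerates prod service ranges and is maintained in the Neutron customization):

```python
import ipaddress

# Hypothetical dmz_cidr list; placeholder network, not the production value.
DMZ_CIDR = [ipaddress.ip_network("10.0.0.0/8")]

def skip_nat(dst_ip):
    """Return True when traffic to dst_ip must bypass SNAT, so prod
    services (e.g. NFS) see the VM's real private address."""
    dst = ipaddress.ip_address(dst_ip)
    return any(dst in net for net in DMZ_CIDR)
```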

The Neutron virtual router is implemented as a Linux network namespace on the cloudnet servers. The different netns are dynamically managed by Neutron and must therefore be configured and operated through the openstack networking API and CLI utilities. All the routing, NAT and firewalling done by Neutron uses standard Linux components: the network stack, the netfilter engine, keepalived for high availability, etc.

The current CloudVPS network setup is extensively described in the Neutron page. This includes documentation about both the edge/backbone network and the internal software defined network.

WMCS eqiad1 network topology.png

Specific problems

There are a number of problems and other limiting factors we aim to address, solve or ease with this project.

reduce technical debt

We are currently using the neutron virtual router as the edge router for the CloudVPS service internal virtual network. This is against upstream openstack design recommendations, as can be seen in some docs. Moreover, this has proven challenging for proper CloudVPS network administration, given we don't have enough configuration flexibility in Neutron for managing the virtual router as a general purpose network gateway.

Additionally, for the current setup to work, we carry custom code patches in Neutron. These customizations were introduced to make Neutron behave like the old nova-network openstack component. That was a requirement during the nova-network to neutron migration done years ago, but it is no longer needed. The custom code is a pain point when upgrading openstack versions, given we have to rebase all the patches and test that everything works as expected, adding unnecessary complexity to our operations.

In an ideal model, Neutron would just do what it was designed for, which is enabling software defined networking (SDN) inside the cloud. Neutron wasn't designed to act as the edge router for an openstack-based public cloud. So, by offloading/decoupling some of the current Neutron virtual router responsibilities to an external server, we would effectively reduce technical debt from the CloudVPS service point of view.

further separate CloudVPS/prod network

The current setup has some flaws regarding CloudVPS/prod separation.

Currently, CloudVPS internal IP addresses reach the production network without NAT involved, by means of the dmz_cidr mechanism, which is part of our Neutron code customization. There are some long overdue revisions to this, as can be seen for example in phabricator T209011 - Change routing to ensure that traffic originating from Cloud VPS is seen as non-private IPs by Wikimedia wikis. Neutron controls this NAT and, as described above, cannot easily be changed with enough flexibility, so the first step is to address that lack of flexibility. Again, offloading/decoupling the edge NAT/firewalling setup from Neutron feels like the right move.

Neutron is not designed to work as an arbitrary edge network router or firewall. Currently, all edge network firewalling for the CloudVPS service network is implemented in the prod core routers, which aren't designed to be this kind of firewall either. Managing the firewalling policy for CloudVPS in the prod core routers has traditionally been a challenge for us. In an ideal separation from the prod realm, CloudVPS would have its own edge network firewalling.

Enabling more network separation between prod and the CloudVPS network could help us to eventually relocate some important services like storage (NFS, ceph) and others into CloudVPS own network.

prepare for the future

We feel that investing proper engineering time in the network architecture is long overdue in the cloud realm. We need to invest that engineering time to prepare the ground for the future of our public cloud. This includes introducing technologies we don't currently use, like IPv6 and BGP, and brand new cloud features like Neutron tenant networks.

It is widely accepted that the future of the CloudVPS service is completely separated from the prod architecture, and therefore any engineering time to move in that direction would be more than welcome.

other considerations

In the current architecture, with Neutron acting as the edge router for the CloudVPS virtual network, we are forced to use the Neutron API and other openstack abstractions (the very openstack design can be seen as an abstraction itself) to manage what would otherwise be a very simple setup. We would rather use standard linux utilities to directly manage certain components, like routing, addressing, NAT/firewalling, etc.

Since all the Neutron configuration lives in a mysql database with no external RO/RW access, and the Neutron API itself is strictly restricted, external contributors have little chance to learn about and contribute to our setup. This is something we would like to improve. Using git operations, like in our current puppet model or similar, is a much friendlier way of welcoming and engaging technical contributors.

We also identified the need to follow more closely what upstream Linux communities and projects are doing, instead of cooking our own stuff. Our Neutron code customization is just one clear example of us not following upstream patterns.

In FY19/20 Q4, new dedicated switches were procured for the WMCS team. These switches, called cloudsw1-c8-eqiad and cloudsw1-d5-eqiad, are already racked and connected. We refer to them generically as cloudsw. We can leverage these dedicated switches to improve our edge routing by introducing advanced L3 setups based on OSPF and BGP, and improve our general L2 VLAN management and setup. This, again, is the right move in the long road of an eventual full separation from the prod network.


This is an executive summary of all the concerns and desires/goals expressed above:

  • stop using neutron as the CloudVPS edge router, which is against upstream recommendations
  • eliminate our neutron code customizations
  • simplify the neutron setup by offloading functionalities to the cloudgw servers
  • stop CloudVPS internal IPs from reaching prod networks
  • manage our own firewalling policies without relying on the prod core routers
  • eventually relocate storage (NFS, ceph, etc) inside our own L3 domain
  • build the initial changes that will eventually allow a full separation from prod
  • prepare the ground for better IPv6 and BGP support
  • an upstream-openstack-like setup lets us consider introducing tenant networks in the CloudVPS service
  • use standard linux utilities to manage the network (instead of the Neutron API or other abstractions)
  • give external contributors easy ways to learn about and contribute to our setup; Neutron is not very friendly to external contributors
  • keep contributing/integrating more with Linux upstream communities and projects
  • use the cloudsw dedicated switches to introduce better L3 edge routing with protocols such as BGP and OSPF, along with better future L2 VLAN management and isolation

Why the cloudsw project?


A. Remove physical and logical dependency on eqiad row B (asw2-b-eqiad)

Because of historical reasons and technical limitations in OpenStack, WMCS only grew in eqiad row B.

Our current eqiad HA design is done per row, which means production (core) services aim to be equally balanced across the 4 rows we have in the datacenter.

Since its initial deployment, WMCS has grown significantly, competing for rack space and, more importantly, 10G switch ports in that row.

In addition, bandwidth requirements and traffic flows are different between WMCS and the production infrastructure, which brings a risk of WMCS saturating the underlying switch and router infrastructure, impacting the production environment.

Providing dedicated L2 switches and progressively moving WMCS servers to them (as they get refreshed) will eliminate those issues.

B. Standardize production<->WMCS physical and logical interconnect

The way WMCS grew within the production realm is a snowflake compared to the industry best practice of having a PE/CE interconnect between functionally different infrastructures. This "snowflakiness" and lack of a clear boundary bring a few issues:

  • It increases the complexity of managing the core network, introducing technical debt
  • It increases the security risk as VMs in an untrusted domain could gain access to critical infrastructure
  • It prevents having good traffic flow visibility for traffic engineering and analysis

Configuring the above mentioned switches to additionally act as L3 gateways will help solve this issue, while providing the tools (eg. flow visibility) to fully address it in the long term.

C. High availability of the WMCS network layer

As mentioned in (A), all the WMCS servers are hosted on the same virtual switch, which means a maintenance or an outage takes all WMCS hosts offline.

Using multiple dedicated WMCS switches sharing the same L2 domain, but each with an individual control plane, will ease maintenance and limit the blast radius of an outage.

D. Provide groundwork for WMCS growth and infrastructure independence

Due to (A) and (B), all changes to the WMCS foundations (network or other) have been challenging as they either require tight synchronization between several teams, or could cause unplanned issues due to unexpected dependencies.

Clearly defining the WMCS realm (with dedicated L2 and L3 domains) will significantly ease future changes (eg. new vlans, ACLs, experimentation, etc) without risking impact on the production realm.

Proposed solution

Cloudgw new device.png

The proposed solution comes in 2 independent parts:

  • One is to introduce 2 linux boxes acting as L3 gateways for the CloudVPS network. We refer to these servers as cloudgw. We relocate edge NAT/firewalling functionalities into these new servers.

For the cloudgw servers we will use standard puppet management, netfilter for NAT/firewalling, prometheus metrics, icinga monitoring, redundancy (HA) using standard mechanisms like keepalived, corosync/pacemaker, or similar.
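As a rough illustration of the active/standby behaviour that keepalived (VRRP) or corosync/pacemaker would provide between the two cloudgw nodes, here is a minimal election sketch. The hostnames and priorities are invented for illustration; the real HA mechanism and its configuration are still to be decided.

```python
# Hypothetical node names and VRRP-style priorities; higher priority wins.
NODE_PRIORITY = {"cloudgw1001": 150, "cloudgw1002": 100}

def elect_master(healthy):
    """Return the healthy node that should hold the virtual gateway IP,
    or None if no node is healthy."""
    candidates = [n for n in NODE_PRIORITY if n in healthy]
    if not candidates:
        return None
    return max(candidates, key=NODE_PRIORITY.get)
```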

  • The other is to introduce two switches dedicated to the WMCS realm at both L2 (all servers will be connected to them) and L3 (all server traffic will be routed through those switches), named cloudsw.

This proposal includes a timeline detailing the different operation stages.

Proposed timeline

Those two independent parts will move in parallel to eventually be integrated together:

  • cloudgw in codfw
  • cloudsw in eqiad

This is due to not having the required equipment for staging cloudsw in codfw, as well as the urgency of solving some of the mentioned issues in eqiad, where they are present.

  • stage X: All new WMCS Openstack and Ceph servers are connected to dedicated WMCS switches - DONE
  • stage X: Route cloud-hosts vlan through cloudsw (see below)
  • stage 0: changes to L3 edge routing.
  • stage 0A: introduce the cloudgw L3 nodes. They don't have any NAT/firewalling enabled yet, but we introduce the required L3 routing changes to have traffic flowing through them.
  • stage 0B: enable L3 routing on cloudsw nodes. They don't have any BGP/OSPF enabled yet, but we introduce static L3 routing to have traffic flowing.
  • stage 1: initial basic feature relocation.
  • stage 1A: introduce basic NAT / firewalling capabilities into cloudgw servers. Relocate prod core router cloud firewalling to cloudgw.
  • stage 1B: enable BGP/OSPF in cloudsw nodes.
  • stage 2: offload Neutron NAT / firewalling functions to cloudgw, specifically the dmz_cidr and routing_source_ip mechanisms.
  • stage 3: review / rework the dmz_cidr and routing_source_ip mechanisms. Evaluate entirely dropping or narrowing down the NAT exclusion mechanism for contacting prod network.
  • stage 4: evaluate reworking the L2/L3 setup for storage (NFS/Ceph).
  • stage 5: evaluate reworking the L2/L3 setup for wiki replicas.
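The stage 2 offload of the dmz_cidr and routing_source_ip mechanisms amounts to a single ordered egress decision that cloudgw would take over from Neutron. A sketch of that ordering (all addresses are placeholders, and stage 3 may drop or narrow the dmz_cidr branch):

```python
import ipaddress

# Placeholder values for illustration only.
DMZ_CIDR = [ipaddress.ip_network("10.0.0.0/8")]
ROUTING_SOURCE_IP = "198.51.100.1"
FLOATING_IPS = {"172.16.0.10": "198.51.100.20"}

def snat_source(src, dst):
    """Return the SNAT source address for an egress packet, or None when
    the dmz_cidr exclusion applies and the private source is kept."""
    if any(ipaddress.ip_address(dst) in net for net in DMZ_CIDR):
        return None              # NAT exclusion: prod sees the real VM IP
    if src in FLOATING_IPS:
        return FLOATING_IPS[src]  # floating IPs keep their 1:1 mapping
    return ROUTING_SOURCE_IP      # default shared egress address
```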

Implementation details

Cloudgw-stage 3.png

Some details about what the implementation will look like.

specs for eqiad1

On cloudgw side, each server:

  • Hardware, misc box
    • CPU: 16 CPUs
    • RAM: 32 GB
    • Disk: 500GB
    • 2 x 10Gbps NICs. NICs are bonded/teamed/aggregated for redundancy.
  • Software
    • standard puppet management
    • prometheus metrics, icinga monitoring
    • netfilter for NAT/firewalling
    • keepalived or corosync/pacemaker for HA

On cloudsw side, each device:

  • already procured and racked in eqiad

specs for codfw1dev

For cloudgw, TBD. Probably repurpose some old server.

  • Arturo proposes we repurpose labtestvirt2003 (currently spare) as cloudgw2001-dev.

For cloudsw, TBD.

network setup in codfw1dev

  • connectivity between cloudgw and the cloud-hosts1-b-codfw subnet.
    • L3:
      • a single IP address allocated by standard methods for ssh management, puppet, monitoring, etc. Gateway for this subnet lives in cloudsw.
    • L2:
  • connectivity between Neutron (cloudnet) and cloudgw:
    • L3:
      • cloudnet keeps the current connection to the cloud-hosts1-b-codfw subnet for ssh management, puppet, monitoring, etc. Gateway for this subnet lives in cloudsw.
      • keep the current cloud-instances-transport1-b-codfw (vlan 2120)
      • keep the current cloud-instances2-b-codfw (vlan 2105)
    • L2:
  • connectivity between cloudgw and cloudsw:
    • L3:
      • allocate new transport range and vlan 21XX.
    • L2:
  • connectivity between cloudsw and prod core router:
    • L3:
      • allocate new transport range and vlan 21YY.
      • BGP/OSPF in stage 1B
    • L2:
      • cloudsw has ports connected to the prod core routers (direct? using some asw?, TBD)
      • prod core router: TBD.

network setup in eqiad1

Current status


stage X: Route cloud-hosts vlan through cloudsw

The cloud-hosts vlan, which is part of the production realm, is currently routed on cr1/2-eqiad:ae2.1118, the interfaces facing asw2-b-eqiad.

To better separate the WMCS and production realms, that routing should be moved to cr1/2-eqiad:xe-3/0/4.1118, the interfaces facing cloudsw.

This will contribute to goals (A) and (C) of the cloudsw project.

This is a low complexity change.

  1. Force VRRP master on cr1-eqiad
  2. Move inet/inet6 configuration on cr2-eqiad from ae2.1118 to xe-3/0/4.1118
  3. Check if IP is working as expected
    1. Reachability from cr1:ae2.1118 to cr2:xe-3/0/4.1118
    2. VRRP state sharing between cr1 and cr2
  4. Move VRRP mastership to cr2
  5. Check reachability of hosts on the cloud-hosts vlan
  6. Move inet/inet6 configuration on cr1-eqiad from ae2.1118 to xe-3/0/4.1118
stage X: enable L3 routing on cloudsw nodes

This will contribute to goals (A), (B), (C) and (D) of the cloudsw project.

WMCS network-L2 L3.png

High level plan:

  1. Decide on public IP assignments (suggestions on diagram)
  2. Configure OSPF, iBGP, eBGP
  3. Configure firewall filters
  4. Move cloud-instance-transport routing to cloudsw (similar to cloud-host: 1 VRRP member at a time)
  5. Advertise WMCS public space from cloudsw to the cr routers
stage X: enable L3 routing on cloudgw nodes


Final status


Cloudgw L2 stage 3 eqiad.png
  • connectivity between cloudgw and the cloud-hosts1-b-eqiad subnet.
    • L3:
      • a single IP address allocated by standard methods for ssh management, puppet, monitoring, etc. Gateway for this subnet lives in cloudsw.
    • L2:
  • connectivity between cloudgw and cloudsw:
    • L3:
      • allocate new transport range and vlan 11XX.
    • L2:
  • connectivity between cloudsw and prod core router:
    • L3:
      • allocate new transport range and vlan 11YY.
      • BGP/OSPF in stage 1B
    • L2:
      • cloudsw has ports connected to the prod core routers (direct? using some asw?, TBD)
      • prod core router: TBD.

Why this solution

This proposed solution addresses all the concerns detailed in the background section.


  • stop exposing neutron virtual router directly to the internet.
  • allow us to simplify/eliminate Neutron code customization.
  • gives our virtual network additional controls and flexibility regarding future network architectures and growth.
  • the "right" move in the long road of the eventual full separation from prod.
  • fully compatible with future works related to IPv6, BGP, etc
  • could allow deeper network isolation for Ceph, NFS, and other supporting services.
  • relatively "easy" tech solution: standard linux servers using puppet.
  • relatively "cheap". A couple (for redundancy) of linux servers per openstack deployment.
  • leverage already racked cloudsw devices for advanced dynamic routing.

Why Linux and not dedicated network hardware

  • The cloudgw setup would be very simple, especially compared to what Neutron does.
  • We contribute/integrate a bit more with the Linux upstream communities and projects.
  • Standard puppet workflow is a plus.
  • The flexibility of a full linux shell is interesting: scripts, debugging, tooling, etc.
  • The price of the hardware is not too high, and not a limiting factor: small misc commodity boxes with 10G NICs.
  • Better integration with many other external systems, like prometheus, backups, etc.
  • Having a Linux box act as a gateway is not very complex in general.
  • Introducing HA and redundancy support is not complex either (plenty of options: keepalived, corosync/pacemaker, etc).
  • The Linux networking and NAT engines are industry standards.
  • Basically, we will be offloading some of the functions that Neutron (Linux) already does to a dedicated box. The need for specific network hardware is literally zero; a Linux box will suffice.
  • This all may better engage external contributors.
  • It is pretty common in corporate realms to separate routing/switching/firewalling into different components.
  • The cloudgw is a brand new piece of infra. This should not scare us; we do this all the time, even with more complex technologies.

Additional notes


  • cloudgw needs a leg in the cloud-host subnet for puppet etc.
  • shall we consider racking space issues when planning the different stages?
  • collect intel on why/what uses the dmz_cidr NAT exclusion mechanism.
  • budgeting for hardware?

See also

Other useful information: