Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh/Implementation details


This page contains the implementation details for the 2020 Network refresh project.

eqiad

[Figure: Cloudgw-stage 3.png]

This section covers the eqiad datacenter, home of the eqiad1 openstack deployment.

specs for eqiad1

On cloudgw side, each server:

  • Hardware, misc box
    • CPU: 16 cores
    • RAM: 32 GB
    • Disk: 500GB
    • 2 x 10Gbps NICs. NICs are bonded/teamed/aggregated for redundancy.
  • Software
    • standard puppet management
    • prometheus metrics, icinga monitoring
    • netfilter for NAT/firewalling
    • keepalived or corosync/pacemaker for HA
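
To make the netfilter + keepalived combination concrete, here is a minimal, hypothetical sketch of what a cloudgw node could run; it is not the final design. The interface name, VRID and gateway VIP are placeholders (192.0.2.0/24 is the documentation range), and the NAT address is just an example picked from the 185.15.56.0/25 range allocated below:

    # keepalived.conf fragment: active/backup failover of the gateway VIP
    vrrp_instance cloudgw_v4 {
        state BACKUP                  # both nodes start as BACKUP; priority elects the master
        interface vlan1120            # placeholder: cloudsw-facing transport vlan
        virtual_router_id 20          # placeholder VRID
        priority 100
        virtual_ipaddress {
            192.0.2.3/30 dev vlan1120    # placeholder gateway VIP
        }
    }

    # nftables fragment: source-NAT the instance range behind a routable address
    table ip cloudgw {
        chain postrouting {
            type nat hook postrouting priority srcnat; policy accept;
            ip saddr 172.16.0.0/21 oifname "vlan1120" snat to 185.15.56.1   # placeholder NAT address
        }
    }

keepalived fits this sketch because the failover objects are a handful of VIPs and routes rather than a service tree; corosync/pacemaker remains the alternative if more orchestration turns out to be needed.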

On cloudsw side, each device:

  • Juniper QFX5100 switches with L3 routing licenses
network setup in eqiad1
allocations

IPv4 allocations:

185.15.56.0/24
    185.15.56.0/25 - Openstack instances NAT
    185.15.56.128/26 - reserved for growth of the above
    185.15.56.192/27 - unused
    185.15.56.224/28 - unused
    185.15.56.240/28 - infrastructure
        185.15.56.240/29 - 1120 - cloud-instances-transport1
        185.15.56.248/31 - 1104 - cloudsw1-c8<->cloudsw1-d5 - cloud-xlink1
        185.15.56.250/31 - unused
        185.15.56.252/30 - loopbacks
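
(For reference, the five top-level blocks tile the /24 exactly: 128 + 64 + 32 + 16 + 16 = 256 addresses.)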

VLAN allocations:

1102 - cr1<->cloudsw1-c8 - cloud-transit1-eqiad
1103 - cr2<->cloudsw1-d5 - cloud-transit2-eqiad
1104 - cloudsw1-c8<->cloudsw1-d5 - cloud-xlink1-eqiad
1105 - cloud-instances1-eqiad
1106 - cloud-storage1-eqiad
1107 - cloudsw1<->cloudgw - cloud-gw-transport-eqiad ?
1118 - cloud-hosts1-eqiad

1120 - cloud-instances-transport1-eqiad
stage 0: starting network setup

TODO: for reference, include here some bits about the starting setup of the network?

VLAN                             | Switched on L2 | Members                                                    | L3 Gateway (“to internet”)
cloud-hosts1-eqiad               | asw2-b         | all cloudvirt eth0, all Ceph OSD eth0                      | cr1/2 (via asw2-b)
cloud-instances2-eqiad           | asw2-b         | all cloud VPS, all cloudvirt eth1, cloudnet1003/1004 eth1  | cloudnet1003/1004 eth1
cloud-instances-transport1-eqiad | asw2-b         | cloudnet1003/1004 eth0                                     | cr1/2
cloud-storage1-eqiad             | asw2-b         | all cloudcephosd eth1                                      | (none)

stage 1: Route cloud-hosts vlan through cloudsw

The cloud-hosts vlan, which is part of the production realm, is currently routed on cr1/2-eqiad:ae2.1118, the interfaces facing asw2-b-eqiad.

For a better separation of the WMCS and production realms, that routing should be moved to cr1/2-eqiad:xe-3/0/4.1118, the interfaces facing cloudsw.

This will contribute to goals (A) and (C) of the cloudsw project.

This is a low complexity change. See https://phabricator.wikimedia.org/T261866 for the implementation.
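
On the router side this boils down to moving one L3 subinterface. A hypothetical JunOS sketch for one of the core routers follows; the gateway address is a placeholder (whatever cr1/2 currently hold on vlan 1118 would be kept), and vlan 1118 must of course be trunked through cloudsw towards asw2-b so hosts keep reaching their gateway:

    # move the cloud-hosts1-eqiad gateway from the asw2-b-facing LAG to the cloudsw-facing port
    set interfaces xe-3/0/4 unit 1118 description cloud-hosts1-eqiad
    set interfaces xe-3/0/4 unit 1118 vlan-id 1118
    set interfaces xe-3/0/4 unit 1118 family inet address 192.0.2.2/24   # placeholder gateway address
    delete interfaces ae2 unit 1118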

stage 2A: enable L3 routing on cloudsw nodes

This will contribute to goals (A), (B), (C) and (D) of the cloudsw project.

[Figure: WMCS network-L2 L3.png]


Steps (to be moved to a task for implementation):

  1. Baseline configuration
    1. Cloudsw vlans (L2) - 1102, 1103, 1104, 1120
    2. iBGP and OSPF between cloudsw
    3. eBGP between core routers and cloudsw (advertise 208.80.155.88/29, 185.15.56.0/24 and 172.16.0.0/21, receive 0/0; sketched after this list)
    4. Static route for 185.15.56.0/25 and 172.16.0.0/21 on cloudsw
    5. Firewall filters - lo, cloud-in4 (on core routers)
    6. Test connectivity
  2. cloud-instances-transport migration (downtime required [!])
    1. Ensure cr1 is VRRP master for all vlans, including 1120
    2. Move cr2:ae2.1120 to cloudsw1-d5:irb.1120
    3. Test cr1:ae2.1120 to cloudsw1-d5:irb.1120 connectivity (and VRRP sync)
    4. [!] Move vlan 1120 VRRP master to cloudsw1-d5:irb.1120
    5. [!] Remove static routes for 185.15.56.0/25 and 172.16.0.0/21 on core routers
    6. Test connectivity
    7. Move cr1:ae2.1120 to cloudsw1-c8:irb.1120
    8. Cleanup (remove passive OSPF, trunked vlans, update Netbox)
  3. Renumber cloud-instances-transport (downtime required [!]) [could be done when introducing cloudgw], similar to https://phabricator.wikimedia.org/T207663
    1. Configure 185.15.56.240/29 IPs on all devices
    2. [!] Reconfigure cloudnet with new gateway IP (to be confirmed)
    3. Update static routes on cloudsw to point to new VIP
    4. Cleanup 208.80.155.88/29 IPs and advertisement (+Netbox)
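
A hypothetical cloudsw-side sketch of the eBGP and static-route baseline from step 1; the local ASN, neighbor address and next hop are placeholders, only the prefixes come from this page:

    set routing-options autonomous-system 65001                      # placeholder private ASN for cloudsw
    set protocols bgp group cr type external
    set protocols bgp group cr peer-as 14907                          # production side
    set protocols bgp group cr neighbor 208.80.154.210                # assumed cr end of the new interco /31
    set protocols bgp group cr export CLOUD-OUT
    set policy-options policy-statement CLOUD-OUT term nets from route-filter 208.80.155.88/29 exact
    set policy-options policy-statement CLOUD-OUT term nets from route-filter 185.15.56.0/24 exact
    set policy-options policy-statement CLOUD-OUT term nets from route-filter 172.16.0.0/21 exact
    set policy-options policy-statement CLOUD-OUT term nets then accept
    # step 1.4: static routes towards the neutron VIP on vlan 1120 (placeholder next hop)
    set routing-options static route 185.15.56.0/25 next-hop 185.15.56.243
    set routing-options static route 172.16.0.0/21 next-hop 185.15.56.243

The received 0/0 default would arrive over the same sessions from the core routers, so no import policy is sketched here.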

At this stage:

VLAN                             | Switched on L2   | Members                                                    | L3 Gateway (“to internet”)
cloud-hosts1-eqiad               | asw2-b*, cloudsw | all cloudvirt eth0, all Ceph OSD eth0                      | cr1/2 (via cloudsw)
cloud-instances2-eqiad           | asw2-b*, cloudsw | all cloud VPS, all cloudvirt eth1, cloudnet1003/1004 eth1  | cloudnet1003/1004 eth1
cloud-instances-transport1-eqiad | asw2-b*, cloudsw | cloudsw, cloudnet1003/1004 eth0                            | cloudsw
cloud-transit1/2-eqiad           | cloudsw          | cr1/2, cloudsw                                             | cr1/2
cloud-storage1-eqiad             | asw2-b*, cloudsw | all cloudcephosd eth1                                      | (none)

* To be removed when hosts are moved away from that device

stage 2B: enable L3 routing on cloudgw nodes

TBD

stage 3: final status for all main network components

TBD

[Figure: Cloudgw L2 stage 3 eqiad.png]

  • connectivity between cloudgw and the cloud-hosts1-b-eqiad subnet:
    • L3:
      • a single IP address allocated by standard methods for ssh management, puppet, monitoring, etc. Gateway for this subnet lives on the core routers, but is switched through cloudsw after stage 1.
    • L2:
  • connectivity between cloudgw and cloudsw:
    • L3:
      • allocate a new transport range and vlan 11XX.
      • static routes between cloudgw and cloudsw (see the sketch after this list)
    • L2:
  • connectivity between cloudsw and prod core router:
    • L1: cloudsw are directly connected to the prod core routers using 1x10G port each
    • L2: 2 vlans are trunked between the two sides: vlan 1118 (cloud-hosts) and 1102 (public interco vlan)
    • L3: allocate two new interco /31 prefixes (208.80.154.210/31 and 208.80.154.212/31), configure eBGP in stage 2A
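
For the cloudgw <-> cloudsw static routing mentioned in the list above, a minimal Linux-side sketch; vlan 1107 is the tentatively listed cloud-gw-transport-eqiad, and the addresses are placeholders from the documentation range:

    # cloudsw-facing transport on the cloudgw side
    ip link add link bond0 name vlan1107 type vlan id 1107
    ip addr add 192.0.2.2/30 dev vlan1107
    ip route add default via 192.0.2.1   # cloudsw end of the transport

    # cloudsw conversely needs static routes for 185.15.56.0/25 and
    # 172.16.0.0/21 pointing at the cloudgw VIP on this vlan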

codfw

[Figure: Cloudgw-L3 stage 3 codfw(1).png]

This section covers the codfw datacenter, home of the codfw1dev openstack deployment.

specs for codfw1dev

For cloudgw, repurpose labtestvirt2003 as cloudgw2001-dev.

For cloudsw, we assume we won't have the device anytime soon.

network setup in codfw1dev

Specific configuration details for each stage.

allocations

IPv4 allocations:

185.15.57.0/24
    185.15.57.0/29 - Openstack instances NAT (floating IPs)
    185.15.57.8/29 - reserved for growth of the above
    185.15.57.16/28 - unused
    185.15.57.32/27 - unused
    185.15.57.64/26 - unused
    185.15.57.128/25 - infrastructure
        185.15.57.128/29 - 2120 - cloud-instances-transport1-b-codfw (cr-codfw <-> cloudgw)
        185.15.57.144/29 - 2107 - cloud-gw-transport-codfw (cloudgw <-> neutron)

VLAN allocations:

2105 - cloud-instances1-codfw (172.16.128.0/24)
2107 - cloud-gw-transport-codfw (cloudgw <-> neutron) (185.15.57.144/29)
2118 - cloud-hosts1-codfw (10.192.20.0/24)
2120 - cloud-instances-transport1-codfw (cr-codfw <-> cloudgw) (185.15.57.128/29)


stage 0: starting network setup

TODO: for reference, include here some bits about the starting setup of the network?

stage 1: Route cloud-hosts vlan through cloudsw

We don't have hardware for cloudsw in codfw, so this stage is a NOOP.

stage 2A: enable L3 routing on cloudsw nodes

We don't have hardware for cloudsw in codfw, so this stage is a NOOP.

stage 2B: enable L3 routing on cloudgw nodes

TODO: describe here the PoC we will be doing with labtestvirt2003.
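
Pending that writeup, a hypothetical sketch of what the labtestvirt2003 / cloudgw2001-dev PoC could look like, using the vlans and prefixes allocated above; the NIC name, the host addresses inside the /29s and the NAT address are placeholders:

    # upstream transport towards cr-codfw (vlan 2120, 185.15.57.128/29)
    ip link add link eno1 name vlan2120 type vlan id 2120
    ip addr add 185.15.57.130/29 dev vlan2120
    ip route add default via 185.15.57.129             # placeholder: cr-codfw gateway
    # downstream transport towards neutron (vlan 2107, 185.15.57.144/29)
    ip link add link eno1 name vlan2107 type vlan id 2107
    ip addr add 185.15.57.145/29 dev vlan2107
    ip route add 172.16.128.0/24 via 185.15.57.146     # placeholder: cloudnet side of the /29
    # route between the two legs and source-NAT the instances
    sysctl -w net.ipv4.ip_forward=1
    nft add table ip cloudgw
    nft 'add chain ip cloudgw postrouting { type nat hook postrouting priority srcnat ; }'
    nft add rule ip cloudgw postrouting ip saddr 172.16.128.0/24 oifname vlan2120 snat to 185.15.57.2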

stage 3: final status for all main network components

TODO: due to the lack of hardware in codfw we don't yet have an estimate of when this stage can be implemented.

  • connectivity between cloudgw and the cloud-hosts1-b-codfw subnet.
    • L3:
      • a single IP address allocated by standard methods for ssh management, puppet, monitoring, etc. Gateway for this subnet lives on cloudsw.
    • L2:
  • connectivity between Neutron (cloudnet) and cloudgw:
    • L3:
      • cloudnet keeps the current connection to the cloud-hosts1-b-codfw subnet for ssh management, puppet, monitoring, etc. Gateway for this subnet lives on cloudsw.
      • drop the current cloud-instances-transport1-b-codfw (vlan 2120) 208.80.153.184/29
      • add cloud-gw-transport-codfw (cloudgw <-> neutron) (vlan 2107) 185.15.57.144/29
      • keep the current cloud-instances2-b-codfw (vlan 2105) 172.16.128.0/24
    • L2:
  • connectivity between cloudgw and cr-codfw:
    • L3:
      • relocate cloud-instances-transport1-codfw (cr-codfw <-> cloudgw) (185.15.57.128/29) vlan 2120 (see the sketch below)
    • L2:
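
A hypothetical cr-codfw-side counterpart for the relocated transport vlan; the cr interface name and the exact host addresses are placeholders, the prefixes are the ones allocated above:

    set interfaces ae2 unit 2120 description cloud-instances-transport1-codfw
    set interfaces ae2 unit 2120 vlan-id 2120
    set interfaces ae2 unit 2120 family inet address 185.15.57.129/29         # placeholder cr side
    # return routes for the cloud ranges now point at the cloudgw VIP
    set routing-options static route 172.16.128.0/24 next-hop 185.15.57.130   # placeholder cloudgw VIP
    set routing-options static route 185.15.57.0/29 next-hop 185.15.57.130    # floating IP range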