SRE/business case/Network - move from esams to knams
1. Executive Summary
The Wikimedia Foundation operates 4 caching POPs across the world. Their purpose is to reduce user latency (and thus improve user experience and retention) by storing content closer to the users. Before we opened our Marseille POP, esams (Iron Mountain, located near Amsterdam) was the largest of all, receiving about 50% of user traffic. We also operate a “network POP” in knams (Interxion Science Park, in Amsterdam) to compensate for the lack of provider diversity in esams. It consists of a single router connected to esams via 3 dark fibers on one side, and to 5 transit providers on the other. Lastly, the lifecycle of our servers is 5 years, and that of our network equipment 8 years. These durations are based on vendor support, failure rates and overall industry evolution.
2. Business Problem
3. Problem Analysis
The uniqueness and lack of standardization of this setup have both direct and indirect costs.
Explicit costs include the knams rack, as well as the 3 dark fibers and their 6 matching cross-connects.
Implicit costs, which are more difficult to evaluate, include the engineering time required for day-to-day operations (maintenance and break/fix, amplified by the lack of OOB access) as well as for maintaining code to support this special site. On the procurement side it also means an additional site to account for.
Operationally, the current setup performs well on a day-to-day basis, so there is no urgency in improving the situation. However, three factors have brought the topic back on the table and make now a good opportunity to fix those issues:
First, for reasons prior to my tenure and still unclear to me, it was not possible for us to have servers in knams. This limitation does not exist anymore. Second, before opening drmrs (the Marseille POP), esams was “too big to fail”, in that any downtime would have had a terrible impact on latency for users in the EMEA region. With drmrs, however, medium to long maintenance windows in Amsterdam are now bearable.
Third, the refresh cycles line up. Moving the existing servers, circuits and network from esams to knams outside of their refresh cycle would be time consuming, and would require us to do another round of “heavy lifting” once the equipment needs to be refreshed. The opportunity here is that the servers will reach their 5-year mark in Sept 2024, and the network equipment its 8-year mark in 2023 for half of it and 2025 for the other half. The esams contract also ends in 2023.
4. Current Technology/Solution
5. Available Options
5.1 Option 1 - esams->knams move
Instead of refreshing the servers and equipment in esams, have the new ones delivered and installed in knams, with a sufficient overlap period for a smooth transition and minimal downtime.
As a side note, all the esams providers are present in knams, but the reverse is not true.
5.1.2 Benefits, Goals and Measurement Criteria
Once completed, the migration will reduce the overall infrastructure cost, both direct and indirect (see the “problems” section). Furthermore, this would be close to a greenfield deployment, eliminating the cruft and technical debt accumulated over the years and letting us deploy our new POP standards, for example reducing our rack needs from the current 3 in esams to 2.
The component shortage could lengthen already long lead times. Going live in knams needs to be carefully orchestrated so that it happens before our esams contract ends. If our vendor suddenly announces a longer delay, we could end up having to rely on drmrs longer than we would like. To prevent this we need to add padding to our current estimates (e.g. target completion 3 to 6 months before the esams contract ends), clearly define the strict minimum list of equipment, and have alternate plans ready if the situation occurs. Extending our esams contract by a few months is not an option, as any renewal is for a minimum of 1 year.
Moving a caching POP between facilities with minimal downtime will require precise engineering from multiple groups (dcops, traffic and I/F). This can be mitigated by relying more on drmrs.
Esams is the only caching POP with expensive and powerful MX480 routers, a chassis that takes linecards for features and connectivity. This was mostly due to the lack of alternatives at the time and the central role esams played among our caching POPs. All of our other caching POPs use much smaller MX204s; unfortunately the latter was recently announced as EOL. Not reusing the MX480s could be seen as “wasteful”, but this is to be weighed against their usefulness in the current context, the space and power they would occupy in a 2-rack model, the complexity of physically moving them, the complexity of running esams with 1 router during the transition, and the cost of refreshing linecards. Cr2-esams linecards and routing engines will reach their 9-year mark in 2023, while those of cr3-esams are 2 to 4 years younger depending on the component. The 2 options here are either to purchase an MX304 (about the cost of a linecard, but more expensive than the previously budgeted MX204), or to evaluate the cost of moving cr3-esams and keeping it for a few more years. There is no value in moving cr2-esams.
5.2 Option 2 - Push the move 1 year later
The current plan includes refreshing servers 1 year before their 5-year mark, to fit the end of the esams contract. Keep in mind that once the esams contract ends, it doesn’t go month-to-month like circuits do, but needs to be renewed for at least 1 year.
Pushing the move 1 year later could be a viable option, and one should run the numbers to see if it’s financially interesting. Basically, the question is how 1 additional year of the items listed in the “problems” section compares to 1/5th of the server cost and 1/8th of the network equipment cost. Delaying the move could also give us more headroom in terms of planning: the shortage could be resolved by then, and new, more suitable equipment could be announced.
Savings from pushing the migration 1 year later (~76k if we keep cr3-esams, 95k if not):
- 1/5th of all existing server costs:
  - Depreciation of $51,855 in 5th year of server life cycle
- 1/8th of all existing network equipment cost:
  - Depreciation of $6,442/yr for cr3 linecards (T230165) starting in 2019
  - Depreciation of $375/yr for scs-oe16 starting in 2019
  - Depreciation of $18,809/yr for cr3, cr3 linecards, & asw2-oe[15,16] (T161930) starting in 2017
  - Depreciation of $17,500/yr for cr2, cr2 linecards, and asw2-oe14 (RT #9300) starting in 2015
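The two headline figures can be reproduced from the depreciation lines above. The following is a minimal sketch, not an accounting tool: the amounts are copied from the list, while the variable and function names are made up for illustration. It also assumes that “keeping cr3-esams” means leaving out the T161930 line (cr3, its original linecards and asw2-oe[15,16]), since that is the reading that matches the ~76k figure.

```python
# Yearly depreciation (USD) copied from the list above.
SERVER_DEPRECIATION = 51_855

NETWORK_DEPRECIATION = {
    "cr3 linecards (T230165)": 6_442,
    "scs-oe16": 375,
    "cr3, cr3 linecards, asw2-oe[15,16] (T161930)": 18_809,
    "cr2, cr2 linecards, asw2-oe14 (RT #9300)": 17_500,
}

# Assumption: if cr3-esams is reused in knams, this line is not counted
# as a saving, because the equipment would be kept either way.
CR3_LINE = "cr3, cr3 linecards, asw2-oe[15,16] (T161930)"


def savings_from_delay(keep_cr3: bool) -> int:
    """Depreciation (USD) recovered by pushing the migration 1 year later."""
    total = SERVER_DEPRECIATION + sum(NETWORK_DEPRECIATION.values())
    if keep_cr3:
        total -= NETWORK_DEPRECIATION[CR3_LINE]
    return total


if __name__ == "__main__":
    print(savings_from_delay(keep_cr3=True))   # 76172, i.e. ~76k
    print(savings_from_delay(keep_cr3=False))  # 94981, i.e. ~95k
```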
Savings from staying on the current schedule (~90k):
- 1 year of esams contract, minus 1 year of 1 additional rack in knams: 47k€
- 1 year of 3 knams-esams dark fibers: 2369×12 = 28428€
- 1 year of 3 knams-esams dark fibers x-connects (6): total of 3,747€ (42.9€/mo for each x-connect in esams, 61.2€/mo for each x-connect in knams)
- 1 year of support on 3 switches + 1 router: ~3x300=~1900usd
Pushing the migration doesn’t bring any benefits.
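As a sanity check on the circuit figures, the yearly cost of the esams-knams dark fibers and their cross-connects can be recomputed from the monthly prices quoted above. This is only a sketch: the prices come from the list, the names are illustrative, and it assumes that 2,369€ is the monthly price for the 3 fibers combined and that each fiber has one cross-connect at each end (6 in total).

```python
# Monthly prices (EUR) copied from the list above.
DARK_FIBERS_MONTHLY_EUR = 2369.0  # assumed to cover all 3 fibers
XCONNECT_MONTHLY_EUR = {"esams": 42.9, "knams": 61.2}  # per cross-connect
FIBER_COUNT = 3  # one cross-connect per fiber at each site
MONTHS = 12


def yearly_fiber_cost_eur() -> float:
    """Yearly cost of the 3 dark fibers."""
    return DARK_FIBERS_MONTHLY_EUR * MONTHS


def yearly_xconnect_cost_eur() -> float:
    """Yearly cost of the 6 cross-connects (3 per site)."""
    return sum(
        price * FIBER_COUNT * MONTHS for price in XCONNECT_MONTHLY_EUR.values()
    )


if __name__ == "__main__":
    fibers = yearly_fiber_cost_eur()
    xconnects = yearly_xconnect_cost_eur()
    print(f"dark fibers:    {fibers:.0f} EUR/yr")
    print(f"cross-connects: {xconnects:.0f} EUR/yr")
    print(f"total:          {fibers + xconnects:.0f} EUR/yr")
```

The total comes to a little over 32k€ per year, which is in line with the circuit cost saving quoted in the recommended option.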
5.3 Option 3 - Stay in esams and knams
This is the status quo, with the downsides listed in the “problems” section.
5.4 Option 4 - Stay in esams, move out of knams
The main limitation of this option is the low provider diversity in esams, which would prevent us from making proper decisions on who to connect to, reduce our flexibility and have an impact on our service quality. While all the transit providers in knams are free (donated), any new provider in esams would be a paid one, which would reduce the benefits of leaving knams.
6. Recommended Option
Option 1 - esams->knams move
- Standardized cache site design
- Less operational cost
- More resilient (OOB, all in 1 site, no uplink/transport oversubscription)
- Moving from 4 (3 in esams, 1 in knams) to 2 racks
- Yearly cost saving
  - 3 esams racks - 1 knams rack (inc power, etc): 47k€/y
  - Support on 1 router and 1 ToR switch: ~300 x 2 = ~600usd
  - 1 less router and switch: 5k€/y
- Migration cost
  - Travel/installation cost:
  - Engineering time:
  - DC contracts overlap:
- Terminate the 3 esams<->knams circuits (and matching x-connects)
  - Cost saving: 32k€/y
- MX480 sunk-cost
- Heavier logistics to move a POP vs. stay in place
Why not keep the MX480s?
- Easier migration: set up the new routers directly in their final rack while keeping esams active
- Form factor: using MX304s allows us to consolidate the caching site into 2 racks
- Power usage: MX304s use less power, which allows us to use more efficient CP servers and consolidate into 2 racks
- Cost: cr2-esams linecards will reach their refresh date next FY (8 years), and linecards are more expensive than MX304s
- Usefulness: even though it’s always better to have more capacity, we confirmed with drmrs that 2 MX204s can handle the load of all the European POPs
- Repurposing: we are evaluating whether it would be useful to reuse the MX480s or some of the cr3-esams linecards in eqiad/codfw instead
7. Implementation Approach
The approach focuses on minimizing downtime, at the cost of spreading the work out (compared to depooling and moving everything at once). It is to be adjusted based on gear and rack availability.
- Warn Tilaa about our move (OOB swap)
- Procure optics/fibers
- Procure OOB transit for knams
- Rack + initial setup of mr1-knams + console server + mgmt switches (these will need to either be shipped pre-configured or have a WMF staff member do the config)
- Shift traffic to drmrs so esams peaks under 10Gbps (so if cr2/3-esams fails, we have enough capacity on the other transit)
- Move cr3-knams and its x-connects to new racks (put back in production)
- Setup prod switches
- Setup cr3-knams<->prod switches links/routing using cr3-knams 40/100G interfaces
- Setup + test servers (but don't put them in production)
- Move cr3-esams to knams (rename cr1-knams)
- Setup cr1-knams<->prod switches links/routing
- Depool esams
- Move production services from esams to knams
- Test redundancy/failovers
- Pool knams
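One step above gates the router moves on esams peaking under 10Gbps. A minimal sketch of such a gate is shown below. The 10Gbps limit comes from the step itself; the function name, the input format (a list of peak readings in Gbps) and the sample values are hypothetical and not taken from any existing tooling.

```python
ESAMS_PEAK_LIMIT_GBPS = 10.0  # threshold taken from the implementation step


def safe_to_move_routers(peak_samples_gbps, limit_gbps=ESAMS_PEAK_LIMIT_GBPS):
    """Return True if every observed esams peak is strictly under the limit.

    peak_samples_gbps: recent peak readings in Gbps (hypothetical input,
    e.g. daily peaks observed after shifting traffic to drmrs).
    """
    if not peak_samples_gbps:
        raise ValueError("no traffic samples provided")
    return max(peak_samples_gbps) < limit_gbps


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    print(safe_to_move_routers([6.2, 7.9, 8.4]))   # True
    print(safe_to_move_routers([6.2, 10.3, 8.4]))  # False
```

Using the maximum rather than an average keeps the check conservative: a single peak above the limit is enough to hold the move.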