Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh/2021-02-03-checkin
imported>Arturo Borrero Gonzalez (refresh) |
imported>Arturo Borrero Gonzalez No edit summary |
Latest revision as of 16:04, 3 February 2021
2021-02-03 WMCS network checkin
agenda
- Q3 goals, how are they going
- https://wikitech.wikimedia.org/wiki/News/CloudVPS_NAT_wikis
- this one, initially approached as a potential low hanging fruit, is proving to be way more challenging and will need to be delayed.
- see all subtasks of https://phabricator.wikimedia.org/T209011
- https://phabricator.wikimedia.org/T272397 cloud: drop NAT exception for dumps NFS
- might continue with this one instead, should be easier?
- procurement of hardware for new edge network setup:
- renaming labtestvirt2003 to cloudgw https://phabricator.wikimedia.org/T271519 (done)
- codfw cloudgw device procurement https://phabricator.wikimedia.org/T268016 (done)
- codfw cloudsw https://phabricator.wikimedia.org/T272348 (needs discussion)
- eqiad cloudgw devices procurement https://phabricator.wikimedia.org/T270705 (ordered)
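- Production Cloud services relationship review
- https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Production_Cloud_services_relationship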
- wiki replicas
notes
cloudVPS NAT
- CloudVPS NAT wiki changes: several moving parts
- faidon: how can we help
- arturo: we need some help on the communications side, but Joaquin doesn't have time this quarter
- faidon: try talking to each team's manager for coordination
- nicholas: timeline needs to be extended
- faidon: yes, ACK complexity
- arzhel: what about introducing a window, perform the change for 1h, see what happens, collect intel for a later final "date".
- faidon: ideally we don't need 5 teams green light, that sounds like too much. Faidon can handle part of the internal comms within the SRE sub teams
- faidon: what about dropping the exceptions progressively rather than all at the same time?
- bstorm: bot accounts store IP addresses, how do we handle that
- arturo: we could drop requests per DC
- faidon: All traffic should be running through eqiad
- brandon: this is a large fraction of traffic coming from a single IP address. Our services are designed for a different case.
- faidon: let's try to break down the problem into smaller pieces
- brandon: if we were talking about 8 or 16 different source IP addresses, then things would be different
- nicholas: there are risks and concerns surrounding this whole project, perhaps we can introduce a task in the form of a blocker
- How to do NAT pooling?
- faidon: can we patch neutron?
- arturo: we are moving away from patching
- arzhel: ipv6 would help here
- faidon: want to avoid tying this work to ipv6
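The NAT pooling question above can be illustrated with a minimal sketch, assuming a fixed pool of 8 source addresses (the "8 or 16" figure brandon mentions) and a deterministic hash from each instance's internal address to one pool member. This is not how Neutron works and not a decision from this meeting: the function name `pick_nat_source`, the pool contents (taken from the 198.51.100.0/24 documentation range) and the 172.16.0.x instance addresses are all hypothetical.

```python
import hashlib
import ipaddress


def pick_nat_source(instance_ip: str, pool: list[str]) -> str:
    """Map an internal instance address to one address of the NAT pool.

    Hypothetical helper: the mapping is a plain hash, so the same
    instance always gets the same source address while the pool is unchanged.
    """
    if not pool:
        raise ValueError("NAT pool must not be empty")
    # Hash the packed address bytes so the result does not depend on
    # how the address string happens to be written.
    packed = ipaddress.ip_address(instance_ip).packed
    digest = hashlib.sha256(packed).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


# Assumed pool of 8 public source addresses (documentation range, not real).
POOL = [str(ipaddress.ip_address("198.51.100.0") + i) for i in range(8)]

if __name__ == "__main__":
    for last_octet in (10, 11, 12):
        instance = f"172.16.0.{last_octet}"
        print(instance, "->", pick_nat_source(instance, POOL))
```

Because the mapping is deterministic, a given instance keeps the same source address for as long as the pool stays the same, while the overall traffic is spread across several addresses instead of one. Changing the pool size would remap instances.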
wiki replicas
- brandon: Are we trying to get rid of cloud VLANs or ?
- bstorm: labs VLAN trying to go away. However, the wiki replicas design was intended to reuse the existing network design, so they inherited it
- brandon: What other services will be LVS? Are there more VLANs coming?
- arturo: Understand LVS to be part of solution for handling "public" traffic.
- faidon: Why do wiki replicas today need to be in?
- bstorm: no technical reason. Legacy, presumption?
- faidon: access by anything besides NAT'd network?
- bstorm: dbproxy1018/19 are still accessed the legacy way. Would need to be changed first. New replica ports are out on LVS, but nothing else.
- brandon: for things moving forward to go through LVS, can things like dbproxy live in production VLANs or do they need to stay in labs VLANs?
- bstorm: should be possible to change. Account creation is done inside production realm. No LVS required.
- bstorm: Dumps NFS might be a possible service to move to LVS. Don't need write locks, so maybe?
- arturo: Expectation is that wiki replicas are an exception, and future services will do something else
- faidon: Should plan for LVS future. Understand migration and timelines
- nicholas: once the old cluster is gone, what's blocking?
- faidon: the old cluster is accessed by cloud private addresses. The new cluster doesn't need to be. But the new proxies live in the cloud-support VLAN, which has implications for LVS.
- faidon: if being used by cloud private IPs, don't renumber. Remove the use case, and then renumber to solve
- arturo: very small machines, easily fixed
- nicholas: perhaps by the end of the FY we can get rid of the old cluster
- faidon: if you end up thinking that procuring a couple new proxy servers would make things easier, then go for it