Difference between revisions of "Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh/2021-02-03-checkin"

imported>Arturo Borrero Gonzalez
(refresh)
 
imported>Arturo Borrero Gonzalez
 
Line 6: Line 6:
** https://wikitech.wikimedia.org/wiki/News/CloudVPS_NAT_wikis
*** this one, initially approached as a potential low hanging fruit, is proving to be way more challenging and will need to be delayed.
** https://phabricator.wikimedia.org/T272397cloud: drop NAT exception for dumps NFS
*** see all subtasks of https://phabricator.wikimedia.org/T209011
** https://phabricator.wikimedia.org/T272397 cloud: drop NAT exception for dumps NFS
*** might continue with this one instead, should be easier?


Line 17: Line 18:
* Production Cloud services relationship review
** https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Production_Cloud_services_relationship
* wiki replicas


== notes ==
=== cloudVPS NAT ===
* CloudVPS NAT wiki changes: several moving parts
* faidon: how can we help
* arturo: we need some help on the communications side, but Joaquin doesn't have time this quarter
* faidon: try talking to each team's manager for coordination
* nicholas: timeline needs to be extended
* faidon: yes, ACK complexity
* arzhel: what about introducing a window, perform the change for 1h, see what happens, collect intel for a later final "date".
* faidon: ideally we shouldn't need a green light from 5 teams, that sounds like too much. Faidon can handle part of the internal comms within the SRE sub-teams
* faidon: what about dropping the exceptions progressively rather than all at the same time?
* bstorm: bot accounts store IP addresses, how do we handle that
* arturo: we could drop requests per DC
* faidon: All traffic should be running through eqiad
* brandon: this is a large fraction of traffic coming from a single IP address. Our services are designed for a different case.
* faidon: let's try to break down the problem into smaller pieces
* brandon: if we were talking about 8 or 16 different source IP addresses, then things would be different
* nicholas: there are risks and concerns surrounding this whole project, perhaps we can introduce a task in the form of a blocker
* How to do NAT pooling?
* faidon: can we patch neutron?
* arturo: we are moving away from patching
* arzhel: ipv6 would help here
* faidon: want to avoid tying this work to ipv6
=== wiki replicas ===
* brandon: Are we trying to get rid of cloud VLANs, or?
* bstorm: the labs VLAN is meant to go away. However, the wiki replicas design was intended to reuse the existing network design, so they inherited it
* brandon: What other services will be on LVS? Are there more VLANs coming?
* arturo: my understanding is that LVS is part of the solution for handling "public" traffic.
* faidon: Why do wiki replicas today need to be in?
* bstorm: no technical reason. Legacy, presumption?
* faidon: are they accessed by anything besides the NAT'd network?
* bstorm: dbproxy1018/19 are still accessed the legacy way. Would need to be changed first. New replica ports are out on LVS, but nothing else.
* brandon: for things moving forward to go through LVS, can things like dbproxy live in production VLANs or do things need to stay in labs VLANs?
* bstorm: should be possible to change. Account creation is done inside the production realm. No LVS required.
* bstorm: Dumps NFS might be a possible service to move to LVS. Don't need write locks, so maybe?
* arturo: The expectation is that wiki replicas are an exception, and future services will do something else
* faidon: Should plan for an LVS future. Understand migration and timelines
* nicholas: once the old cluster is gone, what's blocking?
* faidon: the old cluster is accessed by cloud private addresses. The new cluster doesn't need to be. But the new proxies live in the cloud-support VLAN, which has implications for LVS.
* faidon: if they are being used by cloud private IPs, don't renumber. Remove the use case first, and then renumber to solve it
* arturo: very small machines, easily fixed
* nicholas: perhaps by the end of the FY we can get rid of the old cluster
* faidon: if you end up thinking that procuring a couple new proxy servers would make things easier, then go for it

Latest revision as of 16:04, 3 February 2021

2021-02-03 WMCS network checkin

agenda

  • wiki replicas

notes

cloudVPS NAT

  • CloudVPS NAT wiki changes: several moving parts
  • faidon: how can we help
  • arturo: we need some help on the communications side, but Joaquin doesn't have time this quarter
  • faidon: try talking to each team's manager for coordination
  • nicholas: timeline needs to be extended
  • faidon: yes, ACK complexity
  • arzhel: what about introducing a window, perform the change for 1h, see what happens, collect intel for a later final "date".
  • faidon: ideally we shouldn't need a green light from 5 teams, that sounds like too much. Faidon can handle part of the internal comms within the SRE sub-teams
  • faidon: what about dropping the exceptions progressively rather than all at the same time?
  • bstorm: bot accounts store IP addresses, how do we handle that
  • arturo: we could drop requests per DC
  • faidon: All traffic should be running through eqiad
  • brandon: this is a large fraction of traffic coming from a single IP address. Our services are designed for a different case.
  • faidon: let's try to break down the problem into smaller pieces
  • brandon: if we were talking about 8 or 16 different source IP addresses, then things would be different
  • nicholas: there are risks and concerns surrounding this whole project, perhaps we can introduce a task in the form of a blocker
  • How to do NAT pooling?
  • faidon: can we patch neutron?
  • arturo: we are moving away from patching
  • arzhel: ipv6 would help here
  • faidon: want to avoid tying this work to ipv6
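
The NAT pooling question above was left open in the meeting. As a rough illustration only, and not something that was agreed or that neutron is known to provide, one way to spread instances over a small pool of source addresses is a deterministic hash of the instance's private address. The function name, the pool addresses (taken from the RFC 5737 documentation range) and the instance address below are all hypothetical.

```python
import hashlib
import ipaddress
from typing import Sequence


def pick_source_ip(instance_ip: str, pool: Sequence[str]) -> str:
    """Map a private instance address to one public address in the pool.

    The mapping is deterministic: the same instance always gets the same
    source address for as long as the pool itself does not change.
    """
    if not pool:
        raise ValueError("NAT pool must not be empty")
    # Hash the packed address bytes so the result does not depend on
    # how the address happens to be written as a string.
    packed = ipaddress.ip_address(instance_ip).packed
    digest = hashlib.sha256(packed).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


# Hypothetical pool of 8 source addresses (documentation range, not real).
POOL = ["198.51.100.{}".format(n) for n in range(1, 9)]

if __name__ == "__main__":
    # Hypothetical instance address, used only for the example.
    print(pick_source_ip("172.16.0.10", POOL))
```

A stable mapping like this would keep a given instance on one source address, which may matter for the stored bot IP addresses raised above. Resizing the pool would reshuffle most mappings, so this is a sketch of the idea rather than a deployment plan.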

wiki replicas

  • brandon: Are we trying to get rid of cloud VLANs, or?
  • bstorm: the labs VLAN is meant to go away. However, the wiki replicas design was intended to reuse the existing network design, so they inherited it
  • brandon: What other services will be on LVS? Are there more VLANs coming?
  • arturo: my understanding is that LVS is part of the solution for handling "public" traffic.
  • faidon: Why do wiki replicas today need to be in?
  • bstorm: no technical reason. Legacy, presumption?
  • faidon: are they accessed by anything besides the NAT'd network?
  • bstorm: dbproxy1018/19 are still accessed the legacy way. Would need to be changed first. New replica ports are out on LVS, but nothing else.
  • brandon: for things moving forward to go through LVS, can things like dbproxy live in production VLANs or do things need to stay in labs VLANs?
  • bstorm: should be possible to change. Account creation is done inside the production realm. No LVS required.
  • bstorm: Dumps NFS might be a possible service to move to LVS. Don't need write locks, so maybe?
  • arturo: The expectation is that wiki replicas are an exception, and future services will do something else
  • faidon: Should plan for an LVS future. Understand migration and timelines
  • nicholas: once the old cluster is gone, what's blocking?
  • faidon: the old cluster is accessed by cloud private addresses. The new cluster doesn't need to be. But the new proxies live in the cloud-support VLAN, which has implications for LVS.
  • faidon: if they are being used by cloud private IPs, don't renumber. Remove the use case first, and then renumber to solve it
  • arturo: very small machines, easily fixed
  • nicholas: perhaps by the end of the FY we can get rid of the old cluster
  • faidon: if you end up thinking that procuring a couple new proxy servers would make things easier, then go for it