
Wikimedia Cloud Services team/EnhancementProposals/2020 Network refresh/2020-11-25-checkin

imported>Arturo Borrero Gonzalez (refresh after the meeting happened)
* questions, feedback
* next, TODO, etc
* Q3 OKR planning
** Faidon's strawdog proposal:
    "Reduce the number of ACL exceptions from the cloud tenant network to production (cloud-in4) by {N terms/N%/etc.}"
        aligned to TI-HC-FLD
    a) Complete T264993 (Audit cloud-in4 ACL)
    b) Complete & merge r641977 ("cloud: dmz_cidr: detail the list of private production addresses")
    c) Complete & merge r643269 ("Allow specific flows from 172.16/12 to prod"); carry that to dmz_cidr
    d) Reduce the list by a meaningful percentage/amount. Potentially in scope:
        - https://phabricator.wikimedia.org/T209011 (NAT wiki traffic)
        - https://phabricator.wikimedia.org/T207533 (Move labs-recursors in WMCS)
        - https://phabricator.wikimedia.org/T207543 (Move labmon (Graphite, StatsD) into a Cloud VPS)
        - https://phabricator.wikimedia.org/T207536 (parent task for support services)
        - https://phabricator.wikimedia.org/T216422 (Virtualize NFS servers used exclusively by Cloud VPS tenants)
        - others not previously documented but discovered during (a)/(b)/(c)
       
    (a), (b) and (c) can happen in the remainder of Q2, paving the road for (d) in Q3
   
* FYI, slightly related: Move some of wikimediacloud.org 185.15.56.0/23 to Netbox - https://phabricator.wikimedia.org/T268621
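The audit in (a) and the trimming in (d) come down to checking which cloud-side sources are still allowed to reach which production destinations. A minimal sketch of that kind of check, using Python's ipaddress module (the exception list below is an illustrative placeholder, not the real contents of cloud-in4 or dmz_cidr):

```python
import ipaddress

# Hypothetical ACL exceptions: (source, destination) network pairs where
# cloud tenant traffic may reach production. Placeholder values only --
# the real list lives in the cloud-in4 filter / dmz_cidr configuration.
EXCEPTIONS = [
    ("172.16.0.0/12", "10.64.0.0/22"),   # e.g. recursors (cf. T207533)
    ("172.16.0.0/12", "10.64.16.0/24"),  # e.g. labmon (cf. T207543)
]

def allowed(src_ip: str, dst_ip: str) -> bool:
    """Return True if the src->dst flow matches any ACL exception."""
    src = ipaddress.ip_address(src_ip)
    dst = ipaddress.ip_address(dst_ip)
    return any(
        src in ipaddress.ip_network(s) and dst in ipaddress.ip_network(d)
        for s, d in EXCEPTIONS
    )

print(allowed("172.16.1.5", "10.64.0.10"))   # True: covered by an exception
print(allowed("172.16.1.5", "10.64.48.10"))  # False: no exception matches
```

Shrinking the exception list (goal (d)) then means retiring entries as each service moves into the cloud realm, until the check returns False for as many flows as possible.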


== status updates from arturo ==
* requested a server for a 2nd cloudgw device in codfw: https://phabricator.wikimedia.org/T268016
* arturo's plan is that once this new server arrives and we finish all the testing and validation, we move forward with eqiad and with a cloudsw device in codfw.
* refreshed NFS ideas page: https://wikitech.wikimedia.org/w/index.php?title=Portal:Cloud_VPS/Admin/notes/NAT_loophole/NFS
* bootstrapped a practical guide for prod<->cloud networking:
** https://wikitech.wikimedia.org/wiki/Wikimedia_Cloud_Services_team/EnhancementProposals/Production_Cloud_bridging
** it was hinted in a meeting with analytics that this guidelines page could be interesting for other teams as well as for ourselves.
** the source 'policy' for the guidelines is this document: https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Network_and_Policy
* we can generalize the NFS architecture problem into a broader one: how to 'bridge' prod/cloud when we need VMs with private addresses contacting a prod service endpoint?
** this might be the case for both NFS and Cinder/Ceph, or others in the future.
* Arturo proposes to discuss the following topics today:
** how to serve ceph to cloud realm clients from production (cinder)
** or, specify general rules/mechanisms on how to bridge the 2 realms when unavoidable (cinder and/or NFS).
* misc: 2 patches under review for clarity in network policies:
** 641977: cloud: dmz_cidr: detail the list of private production addresses | https://gerrit.wikimedia.org/r/c/operations/puppet/+/641977
** 643269: Allow specific flows from 172.16/12 to prod | https://gerrit.wikimedia.org/r/c/operations/homer/public/+/643269


== notes ==
* Arzhel thinks NFS document should be reduced in number of options. Arturo agrees.
* Faidon thanks for audits to cloud-in filter etc. More reviews to come.
* Faidon: (a), (b) and (c) can happen in the remainder of Q2
* NAT wiki traffic -- is this more bad news for the community? This will restrict how they query the replicas, and could introduce limits on API calls, etc., that folks are using to mitigate the wiki replicas changes.
** Birgitt: Ideally we don't overload community; needs a balance to encourage buy-in
** Faidon: Intention isn't to rate limit, don't need to focus on it first if there's community impact concerns
* Arzhel: IPv6 could solve some of this; some intelligent ordering would help
** IPv6 would be last; IPv6 requires large design changes on the Kubernetes end, and GridEngine CANNOT do it
* Nicholas: Another potential Q3 goal is to look at Network Security Audit
** Faidon: Network/Infra Security is in SRE. May have someone to help.
* Nicholas: Q2 OKR concerns?
** SRE KR's complete.
* Arturo: If we can't provide a service natively within the cloud, how should we bridge them? (premade IP from VM reaching an IP from outside)
** Brooke: Can this be done without exposing the private IP?
** Brooke: For example, how to setup an OLAP view
* Bridging; it's more than just network, see https://upload.wikimedia.org/wikipedia/labs/thumb/9/9c/NFS.png/1920px-NFS.png
* Arturo: Can this interaction be generalized in some way?
* Faidon: Mental model; think about it as similar to external provider wanting access to internal resources
* Faidon: There should be clear lines of separation. Think about it as if there was no private backhaul to the internal network.
* Arturo: Openstack has idea of provider services; provider hardware is co-located next to the cloud.
* Brooke: Services that don't live in VMs are not in a segregated space. Nothing we are doing seems like a multi-tenant network. It would be good to think of VMs as unable to bridge, but bridges do exist, so we must think about them.
* Brooke: In process of redesigning wiki replicas, so some of these questions are relevant today. We can't pretend it's completely external as it's not.
* Faidon: Unclear if this is a special case at the moment. Provider network doesn't have to be expanded to everything in the network.
* Arturo: Can we have cloud-dedicated VLANs on production hardware, accessible by VMs?
* Faidon: Yes, possible.
* VLAN would be to host services that can't be hosted anywhere else. Inside Cloud first preference.
* Faidon: We should be looking to reduce the number of crosses; the number of places things can cross
* This tradeoff already exists as data is crossing.
* Loki example; no ssh access from VMs, but it needs to access them. Can't virtualize for reasons*
* Faidon: Bare metal for users. Openstack Ironic. One tenant managed by cloud services team. This could be a solution. Nothing about loki requires it to be in production.
* Arturo: How do we have something physical without being in production.
* Brooke: Maybe Ironic is an option?
* Faidon: Goal is to avoid bridging. Any option is open. More services shouldn't mean more exposure.
* Faidon: Last time this happened, the cloud infra project was created: make one tenant in cloud services to run all the ancillary services for the rest of the cloud.
* Brooke: Cloudinfra has worked, but has limits. "Bridging the realms". 1) Networking side, needs to be clear and understood. 2) Data flows for data services. This one is harder.
* Faidon: Data services could even be thought of separately from the provider network.
* Brooke: Production data somehow has to get to cloud guests.
* Arturo: Do we fork production services?
* Brooke: This would solve network bridging concerns, then it would only be about data flows
* Faidon: Wouldn't object. But that just moves the boundary. They don't disappear, they just move.
* Brooke: But then a cloud guest wouldn't be an attack vector for production anymore. Only for cloud.
* Faidon: We're all in this together though. So don't want cloud to be owned either.

== actions ==
* Plan shared Q3 objective to reduce ACL exceptions

Latest revision as of 11:37, 26 November 2020

2020-11-25 WMCS network checkin
