
User:Rush/august 2018 brain dump

From Wikitech-static
Revision as of 03:38, 19 July 2018 by imported>Rush (nfs-manage sucks :))


(Arturo: I can't think of concrete technical questions yet, so some open questions follow)

  • Arturo: How much of the Toolforge/Cloud VPS/WikiReplicas/NFS (or other) services was your original engineering design?

To get a sense of how much of the original thinking behind this stuff is yours.

   Almost none :) Nearly everything is on gen 3 or 4 of an original thought to offer 
   something.  Some of the worst stuff (read: NFS cluster) is also some of the most improved
   from the last gen but that's just because it was problematic before.  I suppose this
   iteration of the wikireplica offering is very different but that was a shared
   output between us and the DBAs.

  • Arturo: In your view, what is the most complex, hidden, or difficult corner-case system, service, or workflow?

You mentioned backups, but is there something else we should pay special attention to while learning?

  The NFS setup, I think, because it was in such a state of disrepair that even with tons of
  work and investment it's nothing any of us are in love with, and it also fails in the
  most interesting and complex ways.  i.e. NFS goes crazy and load starts rising everywhere,
  which makes further debugging that much more difficult.  see:
  The other is probably our lack of a real storage solution.  The instance thin provisioning (COW),
  combined with the practice of unformatted instance space (allocate 80G but format 20G), combined with our overprovisioning
  via the scheduler, along with our relatively large default project grant (IMO), combined with our use of only ephemeral storage, makes reasoning about
  how much storage can sanely be counted on really complicated in some cases.
  We have been wanting to write scripts that boil this all down in some consistent way for
  ongoing auditing forever, and it's just never happened completely.  There are lots of little artifacts from this
  issue being kicked down the road, like the ugly scratch NFS volume and large ephemeral disk asks that have
  sat for months or maybe even years.  I don't know if storage is the most urgent problem but it's probably the messiest one.
  Maybe also monitoring since it's such a mess all around.
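
The auditing script described above never got written; here is a toy sketch of what such a script might boil down. All numbers and the function name are hypothetical; only the 80G-allocated / 20G-formatted pattern comes from the text.

```python
# Toy sketch of a storage-overcommit audit.  All figures are made up for
# illustration; only the 80G-allocated vs 20G-formatted pattern is real.
def storage_picture(instances, physical_gb, allocated_gb=80, formatted_gb=20):
    committed = instances * allocated_gb      # what the scheduler has promised
    formatted = instances * formatted_gb      # what tenants have actually formatted
    return {
        "committed_gb": committed,
        "formatted_gb": formatted,
        # A ratio above 1.0 means the pool cannot back every promise at once.
        "overcommit_ratio": committed / physical_gb,
    }

# Hypothetical fleet: 2000 instances against 100T of raw pool.
print(storage_picture(instances=2000, physical_gb=100_000))
# → {'committed_gb': 160000, 'formatted_gb': 40000, 'overcommit_ratio': 1.6}
```

The interesting number is the ratio between promised and physically available space, which is exactly what thin provisioning plus scheduler overcommit makes hard to reason about by hand.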

  • Arturo: Could you please brain-dump on what you like and/or dislike about our current physical/logical networking setup?

You know our needs in the short, mid and long terms regarding the network.

  My biggest dislike historically has been the mixing of tenant and provider traffic in cloud-hosts and the associated
  mixed ACL needed to make that even semi-sane.  Thankfully that dies with Neutron.  The other is the constraint to
  row B for all things hypervisor and net, which in theory could be overcome a few ways, either with VXLAN or through
  looking at using public cloud provider resources (that second one is like 1000x more controversial :).  The third
  is the use of VXLAN for the k8s setup in Toolforge and that conflicting with any provider-level (tenant-agnostic) VXLAN
  strategy to sanely address the above issues.  Then there is the lack of firewalling at the instance gateway, which could
  be dealt with using the firewall extension in Neutron.  There is also the issue that we keep adding 1G hypervisors (20+!)
  that are all constrained by 1G at the labnet server(s).  That at least improves with cloudnet100[34], but
  since we are using subinterfaces on the 10G eth1 it's not a 10x increase.  Ideally I would have liked to have 3 NICs for
  those and have the instance VLAN and transport VLAN each get 10G.  But in practice iirc we were not bumping up against
  our 1G thresholds too much, so it's never been an emergency.

  • Arturo: Which technological challenges do you see in the future for WMCS?

For example: Keep an eye on OpenStack upstream doing stupi^W weird things. Keep an eye on ops/puppet divergence between main WMF prod and us.

  I don't think there are any surprises here but to reinforce:
      * Labstore1003 is going to die and labstore1008/9 probably have to take priority really really soon
      * labstore1004/5 performance issues beyond the currently pinned kernel
      * labstore1004/5 failover issues, essentially right now the sanest (ha!) thing is to reboot the old-primary
         to allow DRBD to failover as the unmounting of the hard links fails spectacularly.  I talked with Brooke about
         this quite a bit and I'm not sure what the right solution is.
      * A real PaaS strategy for Toolforge
      * Moving k8s forward from the patched-up 1.4 currently used.  I think the RBAC stuff that landed in (1.8?) will
        cover the current custom use cases, but it needs thinking for sure; if Cloud could move forward and settle into the
        upstream RBAC or webhook models (for external calls) then keeping k8s maintained here becomes 100x easier.
      * HA story for cloud in general 
      * Yes, def Puppet between Cloud Services Ops, Core Ops, Tenant VPS Projects, and Toolforge.  There is so much going on
         in the mono-repo now and it's really one long-running organic thing rather than a planned-out thing.
      * Related to above Hiera use and integration across those same categories
      * Migration of labsdb100[4567] to instances and the management of that
      * Aforementioned storage madness :)
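
On the labstore1004/5 failover point above: the "textbook" manual DRBD handoff looks roughly like the sequence below; per the text it's the unmount step that fails spectacularly in practice, which is why rebooting the old primary is the current workaround. The resource name and mount point here are hypothetical, and this is a dry-run sketch, not a procedure anyone has blessed.

```python
# Dry-run sketch of a clean DRBD primary handoff.  Resource name ("misc")
# and mount point are hypothetical; the real hosts skip step 1 by rebooting.
import subprocess

def failover_steps(resource, mountpoint):
    """Return the command sequence for a textbook DRBD failover."""
    return [
        ["umount", mountpoint],                 # this is the step that fails here
        ["drbdadm", "secondary", resource],     # demote the old primary
        # ...then on the peer host:
        ["drbdadm", "primary", resource],       # promote the new primary
        ["mount", f"/dev/drbd/by-res/{resource}/0", mountpoint],
    ]

def run(steps, dry_run=True):
    for cmd in steps:
        if dry_run:
            print(" ".join(cmd))                # show what would be run
        else:
            subprocess.run(cmd, check=True)     # actually run it

run(failover_steps("misc", "/srv/misc"))
```

The reboot hack effectively replaces the first two commands: killing the old primary forces DRBD to treat the peer as the surviving node.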



Tools share and misc share backups

We get emails to the admin list weekly that look like this:

2018-05-30 20:00:02,667 INFO force is enabled
2018-05-30 20:00:02,736 INFO removing misc-project-backup
2018-05-30 20:00:02,853 INFO removing misc-project-backup
2018-05-30 20:00:03,327 INFO creating misc-project-backup at 2T
2018-05-30 20:00:04,246 INFO force is enabled
2018-05-30 20:00:04,285 INFO removing misc-snap
2018-05-30 20:00:04,319 INFO removing misc-snap
2018-05-30 20:00:04,631 INFO creating misc-snap at 1T

One for Tools and one for Misc. This is the backup job that uses bdsync to make an exact copy of the two shares on the relevant server in codfw. We strive to take our backup from the standby host so as not to affect users. Changing which host backups are taken from is a manual step.
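
The copy itself can be sketched roughly as below. The hostname and device paths are made up, and this only builds the client-side command; bdsync's general shape is that the client runs `bdsync REMSHCMD LOCDEV REMDEV`, spawning `bdsync --server` on the far end over ssh, diffing the two block devices, and emitting a binary patch to be applied remotely.

```python
# Sketch of building a client-side bdsync invocation.  Hostname and device
# paths are hypothetical; consult the bdsync docs for the exact CLI.
import shlex

def bdsync_cmd(remote_host, local_dev, remote_dev):
    """Build the bdsync command that diffs a local block device against
    its remote copy via an ssh-spawned server process."""
    server = f"ssh {remote_host} bdsync --server"
    return ["bdsync", server, local_dev, remote_dev]

cmd = bdsync_cmd("labstore2003.codfw.wmnet",    # assumed codfw backup host
                 "/dev/backup/misc-snap",       # assumed snapshot device
                 "/dev/backup/misc-project-backup")
print(" ".join(shlex.quote(c) for c in cmd))
```

The log lines above (removing and recreating `misc-project-backup` and `misc-snap`) are the snapshot churn around this copy; the bdsync transfer is what makes the codfw copy exact without shipping the whole device every week.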