User:Rush/august 2018 brain dump
(Arturo: I can't think of concrete technical questions yet, so some open questions follow)
- Arturo: How many of the Toolforge/Cloud VPS/WikiReplicas/NFS (or other) services were your original engineering design?
To get a sense of how much of the original thinking behind this stuff is yours.
Almost none :) Nearly everything is on gen 3 or 4 of some original idea to offer something. Some of the worst stuff (read: the NFS cluster) is also some of the most improved from the last gen, but that's just because it was so problematic before. I suppose this iteration of the WikiReplica offering is very different, but that was a shared output between us and the DBAs.
- Arturo: In your view, what is the most complex, hidden, or difficult corner-case system, service, or workflow?
You mentioned backups, but is there something else we should pay special attention to learning?
The NFS setup, I think, because it was in such a state of disrepair that even with tons of work and investment it's nothing any of us are in love with, and it also fails in the most interesting and complex ways: NFS goes crazy, load starts rising everywhere, and that makes further debugging that much more difficult. See: https://gerrit.wikimedia.org/r/c/operations/puppet/+/446715
The other is probably our lack of a real storage solution. The instance thin provisioning (COW), combined with the practice of unformatted instance space (allocate 80G but format 20G), combined with our overprovisioning via the scheduler, along with our relatively large default project grant (IMO), combined with our use of only ephemeral storage, makes it really complicated in some cases to reason about how much storage can sanely be counted on as available. We have been wanting forever to write scripts that boil this all down in some consistent way for ongoing auditing, and it's just never happened completely. There are lots of little artifacts from this issue being kicked down the road, like the ugly scratch NFS volume and large ephemeral disk asks that have sat for months or maybe even years. I don't know if storage is the most urgent problem but it's probably the messiest one.
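To make the accounting problem concrete, here is a minimal sketch of the kind of auditing script described above. Everything in it is an assumption for illustration: the helper name, the per-instance fields, and all of the numbers are hypothetical, not data from the real deployment.

```python
# Hypothetical sketch of a thin-provisioning audit; names and numbers are
# illustrative assumptions, not real deployment data.

def storage_picture(instances, physical_tb):
    """Summarize COW/thin-provisioning risk for a pool of instances.

    Each instance dict carries three different "sizes" that all diverge:
      allocated_gb  - the flavor grant (e.g. 80G allocated)
      formatted_gb  - what the user actually formatted (e.g. 20G)
      used_gb       - blocks actually written (what COW really consumes)
    """
    allocated = sum(i["allocated_gb"] for i in instances)
    formatted = sum(i["formatted_gb"] for i in instances)
    used = sum(i["used_gb"] for i in instances)
    physical = physical_tb * 1024  # physical capacity in GB
    return {
        # Worst case: every instance fills its full allocation.
        "overcommit_ratio": allocated / physical,
        # Latent growth: formatted-but-unwritten space that can fill up
        # "for free" from the user's point of view.
        "latent_gb": formatted - used,
        "physical_free_gb": physical - used,
    }

demo = [
    {"allocated_gb": 80, "formatted_gb": 20, "used_gb": 5},
    {"allocated_gb": 160, "formatted_gb": 160, "used_gb": 40},
    {"allocated_gb": 80, "formatted_gb": 40, "used_gb": 38},
]
print(storage_picture(demo, physical_tb=0.25))
```

The point of the sketch is that no single number answers "how much storage is left": the allocated, formatted, and written totals all disagree, and each one matters for a different failure mode.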
Maybe also monitoring since it's such a mess all around.
- Arturo: Could you please brain-dump on what you like and/or dislike about our current physical/logical networking setup?
You know our needs in the short, mid, and long term regarding the network.
My biggest dislike historically has been the mixing of tenant and provider traffic in cloud-hosts and the mixed ACL needed to make that even semi-sane. Thankfully that dies with Neutron. Beyond that: * The constraint to row B for all things hypervisor and network, which in theory could be overcome a few ways, either with VXLAN or by looking at using public cloud provider resources (that second one is like 1000x more controversial :) * The use of VXLAN for the k8s setup in Toolforge, which conflicts with any provider-level (tenant-agnostic) VXLAN strategy to sanely address the above issues * The lack of firewalling at the instance gateway, which could be dealt with using the firewall extension in Neutron * We keep adding 1G hypervisors (20+!) but they have all been constrained by 1G at the labnet server(s). That at least improves with cloudnet100, but since we are using subinterfaces on the 10G eth1 it's not a 10x increase. Ideally I would have liked to have 3 NICs for those, so the instance VLAN and transport VLAN each get 10G. But in practice, IIRC, we were not bumping up against our 1G thresholds too much, so it has never been an emergency.
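The funnel in that last point can be put as a back-of-the-envelope calculation. The hypervisor count and link speeds below are rough assumptions taken from the text, not an inventory of the actual racks.

```python
# Back-of-the-envelope oversubscription sketch; counts are illustrative
# assumptions from the prose above, not a real hardware inventory.

def oversubscription(hypervisors, nic_gbps, uplink_gbps):
    """Ratio of aggregate hypervisor NIC capacity to the shared uplink."""
    return (hypervisors * nic_gbps) / uplink_gbps

# 20+ hypervisors with 1G NICs all behind a single 1G labnet link:
legacy = oversubscription(hypervisors=20, nic_gbps=1, uplink_gbps=1)

# cloudnet with subinterfaces on a shared 10G eth1: better, but the
# instance and transport VLANs still split the same 10G rather than
# each getting a dedicated 10G NIC.
cloudnet = oversubscription(hypervisors=20, nic_gbps=1, uplink_gbps=10)

print(legacy, cloudnet)  # 20.0 2.0
```

The takeaway matches the prose: moving to subinterfaces on one 10G NIC cuts the worst-case oversubscription from 20:1 to 2:1, but a dedicated NIC per VLAN would be needed to eliminate it.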
- Arturo: Which technological challenges do you see in the future for WMCS?
For example: keep an eye on OpenStack upstream doing stupi^W weird things. Keep an eye on ops/puppet divergences between main WMF prod and us.
I don't think there are any surprises here, but to reinforce:
* Labstore1003 is going to die, and labstore1008/9 probably have to take priority really, really soon
* labstore1004/5 performance issues beyond the currently pinned kernel
* labstore1004/5 failover issues: essentially, right now the sanest (ha!) thing is to reboot the old primary to allow DRBD to fail over, since the unmounting of the hard links fails spectacularly. I talked with Brooke about this quite a bit and I'm not sure what the right solution is.
* A real PaaS strategy for Toolforge
* Moving k8s forward from the patched-up 1.4 currently used. I think the RBAC stuff that landed in (1.8?) will cover the current custom use cases, but it needs thinking for sure. If Cloud could move forward and settle into the upstream RBAC or webhook models (for external calls), then keeping k8s maintained here becomes 100x easier.
* An HA story for Cloud in general
* Yes, definitely Puppet between Cloud Services Ops, Core Ops, tenant VPS projects, and Toolforge. There is so much going on in the mono-repo now, and it's really one long-running organic thing rather than a planned-out thing.
* Related to the above: Hiera use and integration across those same categories
* Migration of labsdb100 to instances and the management of that
* The aforementioned storage madness :)
We get emails to the admin list weekly that look like this:
2018-05-30 20:00:02,667 INFO force is enabled
2018-05-30 20:00:02,736 INFO removing misc-project-backup
2018-05-30 20:00:02,853 INFO removing misc-project-backup
2018-05-30 20:00:03,327 INFO creating misc-project-backup at 2T
2018-05-30 20:00:04,246 INFO force is enabled
2018-05-30 20:00:04,285 INFO removing misc-snap
2018-05-30 20:00:04,319 INFO removing misc-snap
2018-05-30 20:00:04,631 INFO creating misc-snap at 1T
One for Tools and one for Misc. This is the backup job that uses bdsync to make an exact copy of the two shares on the relevant server in codfw. We strive to take our backup from the standby host so as not to affect users. Changing which host backups are taken from is a manual thing.
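From the log lines, the job appears to forcibly remove and recreate LVM volumes at fixed sizes before the bdsync copy. Here is a hypothetical reconstruction of that step only; the volume group, origin LV, and function names are guesses from the log, not the real script.

```python
# Hypothetical reconstruction of the snapshot-recreation step implied by
# the log above. VG/LV names and the helper are assumptions, not the
# actual job. Runs in dry-run mode by default so nothing is touched.
import logging
import subprocess

logging.basicConfig(format="%(asctime)s %(levelname)s %(message)s",
                    level=logging.INFO)
log = logging.getLogger(__name__)

def recreate_snapshot(vg, origin_lv, snap_name, size, dry_run=True):
    """Drop any existing snapshot LV and recreate it at `size` (e.g. '1T')."""
    log.info("force is enabled")
    remove = ["lvremove", "--force", f"{vg}/{snap_name}"]
    create = ["lvcreate", "--snapshot", "--name", snap_name,
              "--size", size, f"{vg}/{origin_lv}"]
    log.info("removing %s", snap_name)
    if not dry_run:
        subprocess.run(remove, check=False)  # snapshot may not exist yet
    log.info("creating %s at %s", snap_name, size)
    if not dry_run:
        subprocess.run(create, check=True)
    return [remove, create]

# Mirrors the "removing misc-snap ... creating misc-snap at 1T" lines:
cmds = recreate_snapshot("misc", "misc-project", "misc-snap", "1T")
```

The dry-run default returns the commands it would execute, which makes the recreate-at-fixed-size pattern in the log easy to see without touching any volumes.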