Nova Resource:Tools/SAL

Revision as of 18:07, 31 July 2019 by imported>Stashbot (bstorm_: drained tools-worker-1015/05/03/17 to rebalance load)

2019-07-31

  • 18:07 bstorm_: drained tools-worker-1015/05/03/17 to rebalance load
  • 17:41 bstorm_: drained tools-worker-1025 and 1026 to rebalance load
  • 17:32 bstorm_: drained tools-worker-1028 to rebalance load
  • 17:29 bstorm_: drained tools-worker-1008 to rebalance load
  • 17:23 bstorm_: drained tools-worker-1021 to rebalance load
  • 17:17 bstorm_: drained tools-worker-1007 to rebalance load
  • 17:07 bstorm_: drained tools-worker-1004 to rebalance load
  • 16:27 andrewbogott: moving tools-static-12 to cloudvirt1018
  • 15:33 bstorm_: T228573 spinning up 5 worker nodes for kubernetes cluster (tools-worker-1035-9)

2019-07-27

2019-07-26

  • 17:39 bstorm_: restarted maintain-kubeusers because it was suspiciously tardy and quiet
  • 17:14 bstorm_: drained tools-worker-1013.tools.eqiad.wmflabs to rebalance load
  • 17:09 bstorm_: draining tools-worker-1020.tools.eqiad.wmflabs to rebalance load
  • 16:32 bstorm_: created tools-worker-1034 - T228573
  • 15:57 bstorm_: created tools-worker-1032 and 1033 - T228573
  • 15:55 bstorm_: created tools-worker-1031 - T228573

2019-07-25

  • 22:01 bstorm_: T228573 created tools-worker-1030
  • 21:22 jeh: rebooting tools-worker-1016 (unresponsive)

2019-07-24

  • 10:14 arturo: reallocating tools-puppetmaster-01 from cloudvirt1027 to cloudvirt1028 (T227539)
  • 10:12 arturo: reallocating tools-docker-registry-04 from cloudvirt1027 to cloudvirt1028 (T227539)

2019-07-22

  • 18:39 bstorm_: repooled tools-sgeexec-0905 after reboot
  • 18:33 bstorm_: depooled tools-sgeexec-0905 because it's acting kind of weird and not responding to prometheus
  • 18:32 bstorm_: repooled tools-sgewebgrid-lighttpd-0902 after restarting the grid-exec service
  • 18:28 bstorm_: depooled tools-sgewebgrid-lighttpd-0902 to find out why it is behaving weird
  • 17:55 bstorm_: draining tools-worker-1023 since it is having issues
  • 17:38 bstorm_: Adding the prometheus servers to the ferm rules via wikitech hiera for kubelet stats T228573

2019-07-20

  • 19:52 andrewbogott: rebooting tools-worker-1023

2019-07-17

  • 20:23 andrewbogott: migrating tools-sgegrid-shadow to cloudvirt1014

2019-07-15

  • 14:50 bstorm_: cleared error state from tools-sgeexec-0911 which went offline after error from job 5190035

2019-06-25

  • 09:30 arturo: detected puppet issue in all VMs: T226480

2019-06-24

  • 17:42 andrewbogott: moving tools-sgeexec-0905 to cloudvirt1015

2019-06-17

  • 14:07 andrewbogott: moving tools-sgewebgrid-lighttpd-0903 to cloudvirt1015
  • 13:59 andrewbogott: moving tools-sgewebgrid-generic-0902 and tools-sgewebgrid-lighttpd-0902 to cloudvirt1015 (optimistic re: T220853)

2019-06-11

  • 18:03 bstorm_: deleted anomalous kubernetes node tools-worker-1019.eqiad.wmflabs

2019-06-05

  • 18:33 andrewbogott: repooled tools-sgeexec-0921 and tools-sgeexec-0929
  • 18:16 andrewbogott: depooling and moving tools-sgeexec-0921 and tools-sgeexec-0929

2019-05-30

  • 13:01 arturo: uncordon/repool tools-worker-1001/2/3. They should be fine now. I'm only leaving 1029 cordoned for testing purposes
  • 13:01 arturo: reboot tools-worker-1003 to clean up sssd config and let nslcd/nscd start freshly
  • 12:47 arturo: reboot tools-worker-1002 to clean up sssd config and let nslcd/nscd start freshly
  • 12:42 arturo: reboot tools-worker-1001 to clean up sssd config and let nslcd/nscd start freshly
  • 12:35 arturo: enable puppet in tools-worker nodes
  • 12:29 arturo: switch hiera setting back to classic/sudoldap for tools-worker because T224651 (T224558)
  • 12:25 arturo: cordon/drain tools-worker-1002 because T224651
  • 12:23 arturo: cordon/drain tools-worker-1001 because T224651
  • 12:22 arturo: cordon/drain tools-worker-1029 because T224651
  • 12:20 arturo: cordon/drain tools-worker-1003 because T224651
  • 11:59 arturo: T224558 repool tools-worker-1003 (using sssd/sudo now!)
  • 11:23 arturo: T224558 depool tools-worker-1003
  • 10:48 arturo: T224558 drop/build a VM for tools-worker-1002. It didn't like the sssd/sudo change :-(
  • 10:33 arturo: T224558 switch tools-worker-1002 to sssd/sudo. Includes drain/depool/reboot/repool
  • 10:28 arturo: T224558 use hiera config in prefix tools-worker for sssd/sudo
  • 10:27 arturo: T224558 switch tools-worker-1001 to sssd/sudo. Includes drain/depool/reboot/repool
  • 10:09 arturo: T224558 disable puppet in all tools-worker- nodes
  • 10:01 arturo: T224558 add tools-worker-1029 to the nodes pool of k8s
  • 09:58 arturo: T224558 reboot tools-worker-1029 after puppet changes for sssd/sudo in jessie

2019-05-29

  • 11:13 arturo: briefly tested some sssd config changes in tools-sgebastion-09
  • 10:13 arturo: enroll the tools-worker-1029 VM into toolforge k8s, but leave it cordoned for sssd testing purposes (T221225)
  • 10:12 arturo: re-create the tools-worker-1001 VM, already enrolled into toolforge k8s
  • 09:34 arturo: delete tools-worker-1001, it was totally malfunctioning

2019-05-28

  • 18:15 arturo: T221225 for the record, tools-worker-1001 is not working after trying with sssd
  • 18:13 arturo: T221225 created tools-worker-1029 to test sssd/sudo stuff
  • 17:49 arturo: T221225 repool tools-worker-1002 (using nscd/nslcd and sudoldap)
  • 17:44 arturo: T221225 back to classic/ldap hiera config in the tools-worker puppet prefix
  • 17:35 arturo: T221225 hard reboot tools-worker-1001 again
  • 17:27 arturo: T221225 hard reboot tools-worker-1001
  • 17:12 arturo: T221225 depool & switch to sssd/sudo & reboot & repool tools-worker-1002
  • 17:09 arturo: T221225 depool & switch to sssd/sudo & reboot & repool tools-worker-1001
  • 17:08 arturo: T221225 switch to sssd/sudo in puppet prefix for tools-worker
  • 13:04 arturo: T221225 depool and rebooted tools-worker-1001 in preparation for sssd migration
  • 12:39 arturo: T221225 disable puppet in all tools-worker nodes in preparation for sssd
  • 12:32 arturo: drop the tools-bastion puppet prefix, unused
  • 12:31 arturo: T221225 set sssd/sudo in the hiera config for the tools-checker prefix, and reboot tools-checker-03
  • 12:27 arturo: T221225 set sssd/sudo in the hiera config for the tools-docker-registry prefix, and reboot tools-docker-registry-[03-04]
  • 12:16 arturo: T221225 set sssd/sudo in the hiera config for the tools-sgebastion prefix, and reboot tools-sgebastion-07/08
  • 11:26 arturo: merged change to the sudo module to allow sssd transition

2019-05-27

  • 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90% (on tools-sgebastion-08)
  • 09:47 arturo: run `apt-get clean` to wipe 4GB of unused .deb packages, usage on / (root) was > 90%

2019-05-21

  • 12:35 arturo: T223992 rebooting tools-redis-1002

2019-05-20

  • 11:25 arturo: T223332 enable puppet agent in tools-k8s-master and tools-docker-registry nodes and deploy new SSL cert
  • 10:53 arturo: T223332 disable puppet agent in tools-k8s-master and tools-docker-registry nodes

2019-05-18

  • 11:13 chicocvenancio: PAWS update helm chart to point to new singleuser image (T217908)
  • 09:06 bd808: Rebuilding all stretch docker images to pick up toollabs-webservice 0.45

2019-05-17

  • 17:36 bd808: Rebuilding all docker images to pick up toollabs-webservice 0.45
  • 17:35 bd808: Deployed toollabs-webservice 0.45 (python 3.5 and nodejs 10 containers)

2019-05-16

  • 11:22 chicocvenancio: PAWS: restart hub to get new configured announcement
  • 11:05 chicocvenancio: PAWS: change configmap to reference WMHACK 2019 as busiest time

2019-05-15

  • 16:20 arturo: T223148 repool both tools-sgeexec-0921 and -0929
  • 15:32 arturo: T223148 depool tools-sgeexec-0921 and move to cloudvirt1014
  • 15:32 arturo: T223148 depool tools-sgeexec-0920 and move to cloudvirt1014
  • 12:29 arturo: T223148 repool both tools-sgeexec-09[37,39]
  • 12:13 arturo: T223148 depool tools-sgeexec-0937 and move to cloudvirt1008
  • 12:13 arturo: T223148 depool tools-sgeexec-0939 and move to cloudvirt1007
  • 11:34 arturo: T223148 repool tools-sgeexec-0940
  • 11:20 arturo: T223148 depool tools-sgeexec-0940 and move to cloudvirt1006
  • 11:11 arturo: T223148 repool tools-sgeexec-0941
  • 10:46 arturo: T223148 depool tools-sgeexec-0941 and move to cloudvirt1005
  • 09:44 arturo: T223148 repool tools-sgeexec-0901
  • 09:00 arturo: T223148 depool tools-sgeexec-0901 and reallocate to cloudvirt1004

2019-05-14

  • 17:12 arturo: T223148 repool tools-sgeexec-0920
  • 16:37 arturo: T223148 depool tools-sgeexec-0920 and reallocate to cloudvirt1003
  • 16:36 arturo: T223148 repool tools-sgeexec-0911
  • 15:56 arturo: T223148 depool tools-sgeexec-0911 and reallocate to cloudvirt1003
  • 15:52 arturo: T223148 repool tools-sgeexec-0909
  • 15:24 arturo: T223148 depool tools-sgeexec-0909 and reallocate to cloudvirt1002
  • 15:24 arturo: T223148 last SAL entry is bogus, please ignore (depool tools-worker-1009)
  • 15:23 arturo: T223148 depool tools-worker-1009
  • 15:13 arturo: T223148 repool tools-worker-1023
  • 13:16 arturo: T223148 repool tools-sgeexec-0942
  • 13:03 arturo: T223148 repool tools-sgewebgrid-generic-0904
  • 12:58 arturo: T223148 reallocating tools-worker-1023 to cloudvirt1001
  • 12:56 arturo: T223148 depool tools-worker-1023
  • 12:52 arturo: T223148 reallocating tools-sgeexec-0942 to cloudvirt1001
  • 12:50 arturo: T223148 depool tools-sgeexec-0942
  • 12:49 arturo: T223148 reallocating tools-sgewebgrid-generic-0904 to cloudvirt1001
  • 12:43 arturo: T223148 depool tools-sgewebgrid-generic-0904

2019-05-13

  • 08:15 zhuyifei1999_: `truncate -s 0 /var/log/exim4/paniclog` on tools-sgecron-01.tools.eqiad.wmflabs & tools-sgewebgrid-lighttpd-0921.tools.eqiad.wmflabs

2019-05-07

  • 14:38 arturo: T222718 uncordon tools-worker-1019, I couldn't find a reason for it to be cordoned
  • 14:31 arturo: T222718 reboot tools-worker-1009 and 1022 after being drained
  • 14:28 arturo: k8s drain tools-worker-1009 and 1022
  • 11:46 arturo: T219362 enable puppet in tools-redis servers and use the new puppet role
  • 11:33 arturo: T219362 disable puppet in tools-redis servers for puppet code cleanup
  • 11:12 arturo: T219362 drop the `tools-services` puppet prefix (we are actually using `tools-sgeservices`)
  • 11:10 arturo: T219362 enable puppet in tools-static servers and use new puppet role
  • 11:01 arturo: T219362 disable puppet in tools-static servers for puppet code cleanup
  • 10:16 arturo: T219362 drop the `tools-webgrid-lighttpd` puppet prefix
  • 10:14 arturo: T219362 drop the `tools-webgrid-generic` puppet prefix
  • 10:06 arturo: T219362 drop the `tools-exec-1` puppet prefix

2019-05-06

  • 11:34 arturo: T221225 reenable puppet
  • 10:53 arturo: T221225 disable puppet in all toolforge servers for testing sssd patch (puppetmaster livehack)

2019-05-03

  • 09:43 arturo: fixed puppet in tools-puppetdb-01 too
  • 09:39 arturo: puppet should be now fine across toolforge (except tools-puppetdb-01 which is WIP I think)
  • 09:37 arturo: fix puppet in tools-elastic-03, archived jessie repos, weird rsyslog-kafka package situation
  • 09:33 arturo: fix puppet in tools-elastic-02, archived jessie repos, weird rsyslog-kafka package situation
  • 09:24 arturo: fix puppet in tools-elastic-01, archived jessie repos, weird rsyslog-kafka package situation
  • 09:18 arturo: solve a weird apt situation in tools-puppetmaster-01 regarding the rsyslog-kafka package (puppet agent was failing)
  • 09:16 arturo: solve a weird apt situation in tools-worker-1028 regarding the rsyslog-kafka package

2019-04-30

  • 12:50 arturo: enable puppet in all servers T221225
  • 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd (T221225)
  • 12:45 arturo: adding `sudo_flavor: sudo` hiera config to all puppet prefixes with sssd
  • 11:07 arturo: T221225 disable puppet in toolforge
  • 10:56 arturo: T221225 create tools-sgebastion-0test for more sssd tests

2019-04-29

  • 11:22 arturo: T221225 re-enable puppet agent in all toolforge servers
  • 10:27 arturo: T221225 reboot tools-sgebastion-09 for testing sssd
  • 10:21 arturo: disable puppet in all servers to livehack tools-puppetmaster-01 to test T221225
  • 08:29 arturo: cleanup disk in tools-sgebastion-09, was full of debug logs and unused apt packages

2019-04-26

  • 12:20 andrewbogott: rescheduling every pod everywhere
  • 12:18 andrewbogott: rescheduling all pods on tools-worker-1023.tools.eqiad.wmflabs

2019-04-25

  • 12:49 arturo: T221225 using `profile::ldap::client::labs::client_stack: sssd` in horizon for tools-sgebastion-09 (testing)
  • 11:43 arturo: T221793 removing prometheus crontab and letting puppet agent re-create it again to resolve staleness

2019-04-24

  • 12:54 arturo: puppet broken, fixing right now
  • 09:18 arturo: T221225 reallocating tools-sgebastion-09 to cloudvirt1008

2019-04-23

  • 15:26 arturo: T221225 rebooting tools-sgebastion-08 to cleanup sssd
  • 15:19 arturo: T221225 creating tools-sgebastion-09 for testing sssd stuff
  • 13:06 arturo: T221225 use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix, again. Rollback again.
  • 12:57 arturo: T221225 use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix, try again with sssd in the bastions, reboot them
  • 10:28 arturo: T221225 use `profile::ldap::client::labs::client_stack: classic` in the puppet bastion prefix
  • 10:27 arturo: T221225 rebooting tools-sgebastion-07 to clean sssd configuration
  • 10:16 arturo: T221225 disable puppet in tools-sgebastion-08 for sssd testing
  • 09:49 arturo: T221225 run puppet agent in the bastions and reboot them with sssd
  • 09:43 arturo: T221225 use `profile::ldap::client::labs::client_stack: sssd` in the puppet bastion prefix
  • 09:41 arturo: T221225 disable puppet agent in the bastions

2019-04-17

  • 12:09 arturo: T221225 rebooting bastions to clean sssd. We are back to nscd/nslcd until we figure out what's wrong here
  • 11:59 arturo: T221205 sssd was deployed successfully into all webgrid nodes
  • 11:39 arturo: deploy sssd to tools-sge-services-03/04 (includes reboot)
  • 11:31 arturo: reboot bastions for sssd deployment
  • 11:30 arturo: deploy sssd to bastions
  • 11:24 arturo: disable puppet in bastions to deploy sssd
  • 09:52 arturo: T221205 tools-sgewebgrid-lighttpd-0915 requires some manual intervention because issues in the dpkg database prevent deleting nscd/nslcd packages
  • 09:45 arturo: T221205 tools-sgewebgrid-lighttpd-0913 requires some manual intervention because unconfigured packages prevent a clean puppet agent run
  • 09:12 arturo: T221205 start deploying sssd to sgewebgrid nodes
  • 09:00 arturo: T221205 add `profile::ldap::client::labs::client_stack: sssd` in horizon for the puppet prefixes `tools-sgewebgrid-lighttpd` and `tools-sgewebgrid-generic`
  • 08:57 arturo: T221205 disable puppet in all tools-sgewebgrid-* nodes

2019-04-16

  • 20:49 chicocvenancio: change paws announcement in configmap hub-config back to a welcome message
  • 17:15 chicocvenancio: add paws outage announcement in configmap hub-config
  • 17:00 andrewbogott: moving tools-k8s-master-01 to eqiad1-r

2019-04-15

  • 18:50 andrewbogott: moving tools-elastic-01 to cloudvirt1008 to make spreadcheck happy
  • 15:01 andrewbogott: moving tools-redis-1001 to eqiad1-r

2019-04-14

  • 16:23 andrewbogott: moved all tools-worker nodes off of cloudvirt1015 and uncordoned them

2019-04-13

  • 21:09 bstorm_: Moving tools-prometheus-01 to cloudvirt1009 and tools-clushmaster-02 to cloudvirt1008 for T220853
  • 20:36 bstorm_: moving tools-elastic-02 to cloudvirt1009 for T220853
  • 19:58 bstorm_: started migrating tools-k8s-etcd-03 to cloudvirt1012 T220853
  • 19:51 bstorm_: started migrating tools-flannel-etcd-02 to cloudvirt1013 T220853

2019-04-11

  • 22:38 andrewbogott: moving tools-paws-worker-1005 to cloudvirt1009 to make spreadcheck happier
  • 21:49 bd808: Re-enabled puppet on tools-elastic-02 and forced puppet run
  • 21:44 andrewbogott: moving tools-mail-02 to eqiad1-r
  • 20:56 andrewbogott: shutting down tools-logs-02 — seems unused
  • 19:44 bd808: Disabled puppet on tools-elastic-02 and set to 1-node master
  • 19:34 andrewbogott: moving tools-puppetmaster-01 to eqiad1-r
  • 15:40 andrewbogott: moving tools-redis-1002 to eqiad1-r
  • 13:52 andrewbogott: moving tools-prometheus-01 and tools-elastic-01 to eqiad1-r
  • 12:01 arturo: T151704 deploying oidentd
  • 11:54 arturo: disable puppet in all hosts to deploy oidentd
  • 02:33 andrewbogott: tools-paws-worker-1005, tools-paws-worker-1006 to eqiad1-r
  • 00:03 andrewbogott: tools-paws-worker-1002, tools-paws-worker-1003 to eqiad1-r

2019-04-10

  • 23:58 andrewbogott: moving tools-clushmaster-02, tools-elastic-03 and tools-paws-worker-1001 to eqiad1-r
  • 18:52 bstorm_: depooled and rebooted tools-sgeexec-0929 because systemd was in a weird state
  • 18:46 bstorm_: depooled and rebooted tools-sgewebgrid-lighttpd-0913 because high load was caused by ancient lsof processes
  • 14:49 bstorm_: cleared E state from 5 queues
  • 13:06 arturo: T218126 hard reboot tools-sgeexec-0906
  • 12:31 arturo: T218126 hard reboot tools-sgeexec-0926
  • 12:27 arturo: T218126 hard reboot tools-sgeexec-0925
  • 12:06 arturo: T218126 hard reboot tools-sgeexec-0901
  • 11:55 arturo: T218126 hard reboot tools-sgeexec-0924
  • 11:47 arturo: T218126 hard reboot tools-sgeexec-0921
  • 11:23 arturo: T218126 hard reboot tools-sgeexec-0940
  • 11:03 arturo: T218126 hard reboot tools-sgeexec-0928
  • 10:49 arturo: T218126 hard reboot tools-sgeexec-0923
  • 10:43 arturo: T218126 hard reboot tools-sgeexec-0915
  • 10:27 arturo: T218126 hard reboot tools-sgeexec-0935
  • 10:19 arturo: T218126 hard reboot tools-sgeexec-0914
  • 10:02 arturo: T218126 hard reboot tools-sgeexec-0907
  • 09:41 arturo: T218126 hard reboot tools-sgeexec-0918
  • 09:27 arturo: T218126 hard reboot tools-sgeexec-0932
  • 09:26 arturo: T218216 hard reboot tools-sgeexec-0932
  • 09:04 arturo: T218216 add `profile::ldap::client::labs::client_stack: sssd` to prefix puppet for sge-exec nodes
  • 09:03 arturo: T218216 do a controlled rollover of sssd, depooling sgeexec nodes, reboot and repool
  • 08:39 arturo: T218216 disable puppet in all tools-sgeexec-XXXX nodes for controlled sssd rollout
  • 00:32 andrewbogott: migrating tools-worker-1022, 1023, 1025, 1026 to eqiad1-r

2019-04-09

  • 22:04 bstorm_: added the new region on port 80 to the elasticsearch security group for stashbot
  • 21:16 andrewbogott: moving tools-flannel-etcd-03 to eqiad1-r
  • 20:43 andrewbogott: moving tools-worker-1018, 1019, 1020, 1021 to eqiad1-r
  • 20:04 andrewbogott: moving tools-k8s-etcd-03 to eqiad1-r
  • 19:54 andrewbogott: moving tools-flannel-etcd-02 to eqiad1-r
  • 18:36 andrewbogott: moving tools-worker-1016, tools-worker-1017 to eqiad1-r
  • 18:05 andrewbogott: migrating tools-k8s-etcd-02 to eqiad1-r
  • 18:00 andrewbogott: migrating tools-flannel-etcd-01 to eqiad1-r
  • 17:36 andrewbogott: moving tools-worker-1014, tools-worker-1015 to eqiad1-r
  • 17:05 andrewbogott: migrating tools-k8s-etcd-01 to eqiad1-r
  • 15:56 andrewbogott: moving tools-worker-1012, tools-worker-1013 to eqiad1-r
  • 14:56 bstorm_: cleared 4 queues on gridengine of E status (ldap again)
  • 14:07 andrewbogott: moving tools-worker-1010, tools-worker-1011, tools-worker-1001 to eqiad1-r
  • 03:48 andrewbogott: moving tools-worker-1008 and tools-worker-1009 to eqiad1-r
  • 02:07 bstorm_: reloaded ferm on tools-flannel-etcd-0[1-3] to get the k8s node moves to register

2019-04-08

  • 22:36 andrewbogott: moving tools-worker-1006 and tools-worker-1007 to eqiad1-r
  • 20:03 andrewbogott: moving tools-worker-1003 and tools-worker-1004 to eqiad1-r

2019-04-07

  • 16:54 zhuyifei1999_: tools-sgeexec-0928 unresponsive since around 22 UTC. No data on Graphite. Can't ssh in even as root. Hard rebooting via Horizon
  • 01:06 bstorm_: cleared E state from 6 queues

2019-04-05

  • 15:44 bstorm_: cleared E state from two exec queues

2019-04-04

  • 21:21 bd808: Uncordoned tools-worker-1013.tools.eqiad.wmflabs after reboot and forced puppet run
  • 20:53 bd808: Rebooting tools-worker-1013
  • 20:50 bd808: Draining tools-worker-1013.tools.eqiad.wmflabs
  • 20:29 bd808: Released floating IP and deleted instance tools-checker-01 via Horizon
  • 20:28 bd808: Shutdown tools-checker-01 via Horizon
  • 20:17 bd808: Repooled tools-webgrid-lighttpd-0906 after reboot, apt-get dist-upgrade, and forced puppet run
  • 20:13 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0906 via Horizon
  • 20:09 bd808: Repooled tools-webgrid-lighttpd-0912 after reboot, apt-get dist-upgrade, and forced puppet run
  • 20:05 bd808: Depooled and rebooted tools-sgewebgrid-lighttpd-0912
  • 20:05 bstorm_: rebooted tools-webgrid-lighttpd-0912
  • 20:03 bstorm_: depooled tools-webgrid-lighttpd-0912
  • 19:59 bstorm_: depooling and rebooting tools-webgrid-lighttpd-0906
  • 19:43 bd808: Repooled tools-sgewebgrid-lighttpd-0926 after reboot, apt-get dist-upgrade, and forced puppet run
  • 19:36 bd808: Hard reboot of tools-sgewebgrid-lighttpd-0926 via Horizon
  • 19:30 bd808: Rebooting tools-sgewebgrid-lighttpd-0926
  • 19:28 bd808: Depooled tools-sgewebgrid-lighttpd-0926
  • 19:13 bstorm_: cleared E state from 7 queues
  • 17:32 andrewbogott: moving tools-static-12 to cloudvirt1023 to keep the two static nodes off the same host

2019-04-03

  • 11:22 arturo: puppet breakage due to me introducing the openstack-mitaka-jessie repo by mistake. Cleaning up already

2019-04-02

  • 12:11 arturo: icinga downtime toolschecker for 1 month T219243
  • 03:55 bd808: Added etcd service group to tools-k8s-etcd-* (T219243)

2019-04-01

  • 19:44 bd808: Deleted tools-checker-02 via Horizon (T219243)
  • 19:43 bd808: Shutdown tools-checker-02 via Horizon (T219243)
  • 16:53 bstorm_: cleared E state on 6 grid queues
  • 14:54 andrewbogott: moving tools-static-12 to eqiad1-r (for real this time maybe)

2019-03-29

  • 21:13 bstorm_: depooled tools-sgewebgrid-generic-0903 because of some stuck jobs and odd load characteristics
  • 21:09 bd808: Updated cherry-pick of https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ on tools-puppetmaster-01 (T219243)
  • 20:48 bd808: Using root console to fix broken initial puppet run on tools-checker-03.
  • 20:32 bd808: Creating tools-checker-03 with role::wmcs::toolforge::checker (T219243)
  • 20:24 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/500095/ to tools-puppetmaster-01 for testing (T219243)
  • 20:22 bd808: Disabled puppet on tools-checker-0{1,2} to make testing new role::wmcs::toolforge::checker easier (T219243)
  • 17:25 bd808: Cleared the "Eqw" state of 44 jobs with `qstat -u '*' | grep Eqw | awk '{print $1;}' | xargs -L1 sudo qmod -cj` on tools-sgegrid-master
  • 17:16 andrewbogott: aborted move of tools-static-12; will wait until tomorrow and give DNS caches more time to update
  • 17:11 bd808: Restarted nginx on tools-static-13
  • 16:53 andrewbogott: moving tools-static-12 to eqiad1-r
  • 16:49 bstorm_: cleared E state from 21 queues
  • 14:34 andrewbogott: moving tools-static.wmflabs.org to point to tools-static-13 in eqiad1-r
  • 13:54 andrewbogott: moving tools-static-13 to eqiad1-r
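
The 17:25 entry above clears stuck jobs with a shell pipeline: `grep Eqw` keeps the matching `qstat` lines, `awk '{print $1;}'` keeps the first column (the job ID), and `xargs -L1 sudo qmod -cj` runs once per ID. A minimal Python sketch of the filtering half is below. The function name `eqw_job_ids` and the sample text are hypothetical, and the sample is made up for illustration rather than copied from real `qstat` output.

```python
def eqw_job_ids(qstat_output: str) -> list[str]:
    """Return the first whitespace-separated field of every line containing 'Eqw'."""
    job_ids = []
    for line in qstat_output.splitlines():
        if "Eqw" in line:            # same substring match as `grep Eqw`
            fields = line.split()    # awk-style splitting on runs of whitespace
            if fields:
                job_ids.append(fields[0])
    return job_ids


# Hypothetical sample text; the column layout is an assumption for this sketch.
SAMPLE = """\
1234567 0.25 webjob tools.example Eqw 03/29/2019 17:00:00
7654321 0.30 cronjob tools.other r 03/29/2019 16:58:12
2345678 0.25 webjob tools.third Eqw 03/29/2019 17:01:30
"""

if __name__ == "__main__":
    # Print the commands instead of running them, so the sketch has no side effects.
    for job_id in eqw_job_ids(SAMPLE):
        print(f"sudo qmod -cj {job_id}")
```

Printing the `qmod -cj` commands first gives a chance to review the job list before anything is cleared.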

2019-03-28

  • 01:00 bstorm_: cleared error states from two queues
  • 00:23 bstorm_: T216060 created tools-sgewebgrid-generic-0901...again!

2019-03-27

  • 23:35 bstorm_: rebooted tools-paws-master-01 for NFS issue T219460
  • 14:45 bstorm_: cleared several "E" state queues
  • 12:26 gtirloni: truncated exim4/paniclog on tools-sgewebgrid-lighttpd-0921
  • 12:25 gtirloni: truncated exim4/paniclog on tools-sgecron-01
  • 12:15 arturo: T218126 `aborrero@tools-sgegrid-master:~$ sudo qmod -d 'test@tools-sssd-sgeexec-test-2'` (and 1)

2019-03-26

  • 22:00 gtirloni: downtimed toolschecker
  • 17:31 arturo: T218126 create VM instances tools-sssd-sgeexec-test-[12]
  • 00:26 bd808: Deleted DNS record for login-trusty.tools.wmflabs.org
  • 00:26 bd808: Deleted DNS record for trusty-dev.tools.wmflabs.org

2019-03-25

  • 21:21 bd808: All Trusty grid engine hosts shutdown and deleted (T217152)
  • 21:19 bd808: Deleted tools-grid-{master,shadow} (T217152)
  • 21:18 bd808: Deleted tools-webgrid-lighttpd-14* (T217152)
  • 20:55 bstorm_: reboot tools-sgewebgrid-generic-0903 to clear up some issues
  • 20:52 bstorm_: rebooting tools-package-builder-02 due to lots of hung `/usr/bin/lsof +c 15 -nXd DEL` processes
  • 20:51 bd808: Deleted tools-webgrid-generic-14* (T217152)
  • 20:49 bd808: Deleted tools-exec-143* (T217152)
  • 20:49 bd808: Deleted tools-exec-142* (T217152)
  • 20:48 bd808: Deleted tools-exec-141* (T217152)
  • 20:47 bd808: Deleted tools-exec-140* (T217152)
  • 20:43 bd808: Deleted tools-cron-01 (T217152)
  • 20:42 bd808: Deleted tools-bastion-0{2,3} (T217152)
  • 20:35 bstorm_: rebooted tools-worker-1025 and tools-worker-1021
  • 19:59 bd808: Shutdown tools-exec-143* (T217152)
  • 19:51 bd808: Shutdown tools-exec-142* (T217152)
  • 19:47 bstorm_: depooling tools-worker-1025.tools.eqiad.wmflabs because it's not responding and showing insane load
  • 19:33 bd808: Shutdown tools-exec-141* (T217152)
  • 19:31 bd808: Shutdown tools-bastion-0{2,3} (T217152)
  • 19:19 bd808: Shutdown tools-exec-140* (T217152)
  • 19:12 bd808: Shutdown tools-webgrid-generic-14* (T217152)
  • 19:11 bd808: Shutdown tools-webgrid-lighttpd-14* (T217152)
  • 18:53 bd808: Shutdown tools-grid-master (T217152)
  • 18:53 bd808: Shutdown tools-grid-shadow (T217152)
  • 18:49 bd808: All jobs still running on the Trusty job grid force deleted.
  • 18:46 bd808: All Trusty job grid queues marked as disabled. This should stop all new Trusty job submissions.
  • 18:43 arturo: icinga downtime tools-checker for 24h due to trusty grid shutdown
  • 18:39 bd808: Shutdown tools-cron-01.tools.eqiad.wmflabs (T217152)
  • 15:27 bd808: Copied all crontab files still on tools-cron-01 to tool's $HOME/crontab.trusty.save
  • 02:34 bd808: Disassociated floating IPs and deleted shutdown Trusty grid nodes tools-exec-14{33,34,35,36,37,38,39,40,41,42} (T217152)
  • 02:26 bd808: Deleted shutdown Trusty grid nodes tools-webgrid-lighttpd-14{20,21,22,24,25,26,27,28} (T217152)
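
Several entries above name hosts with shell-style brace shorthand, for example `tools-bastion-0{2,3}` or `tools-exec-14{33,34,...}`. As a reading aid, here is a minimal Python sketch that expands comma-separated brace groups into full names. The helper `expand` is hypothetical and deliberately simplified: it does not handle nested groups or `{1..3}` ranges, and it is not a full reimplementation of shell brace expansion.

```python
import re

# Matches one brace group that contains no further braces, e.g. "{2,3}".
_GROUP = re.compile(r"\{([^{}]*)\}")


def expand(pattern: str) -> list[str]:
    """Expand comma-separated brace groups, leftmost group first."""
    match = _GROUP.search(pattern)
    if match is None:
        return [pattern]             # nothing left to expand
    head = pattern[:match.start()]
    tail = pattern[match.end():]
    names = []
    for option in match.group(1).split(","):
        # Recurse so that later groups in the same pattern are expanded too.
        names.extend(expand(head + option + tail))
    return names


if __name__ == "__main__":
    print(expand("tools-bastion-0{2,3}"))
    # → ['tools-bastion-02', 'tools-bastion-03']
```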

2019-03-22

  • 17:16 andrewbogott: switching all instances to use ldap-ro.eqiad.wikimedia.org as both primary and secondary ldap server
  • 16:12 bstorm_: cleared errored out stretch grid queues
  • 15:56 bd808: Rebooting tools-static-12
  • 03:09 bstorm_: T217280 depooled and rebooted 15 other nodes. Entire stretch grid is in a good state for now.
  • 02:31 bstorm_: T217280 depooled and rebooted tools-sgeexec-0908 since it had no jobs but very high load from an NFS event that was no longer happening
  • 02:09 bstorm_: T217280 depooled and rebooted tools-sgewebgrid-lighttpd-0924
  • 00:39 bstorm_: T217280 depooled and rebooted tools-sgewebgrid-lighttpd-0902

2019-03-21

  • 23:28 bstorm_: T217280 depooled, reloaded and repooled tools-sgeexec-0938
  • 21:53 bstorm_: T217280 rebooted and cleared "unknown status" from tools-sgeexec-0914 after depooling
  • 21:51 bstorm_: T217280 rebooted and cleared "unknown status" from tools-sgeexec-0909 after depooling
  • 21:26 bstorm_: T217280 cleared error state from a couple queues and rebooted tools-sgeexec-0901 and 04 to clear other issues related

2019-03-18

  • 18:43 bd808: Rebooting tools-static-12
  • 18:42 chicocvenancio: PAWS: 3 nodes still in not ready state, `worker-10(01|07|10)` all else working
  • 18:41 chicocvenancio: PAWS: deleting pods stuck in Unknown state with ` --grace-period=0 --force`
  • 18:40 andrewbogott: rebooting tools-static-13 in hopes of fixing some nfs mounts
  • 18:25 chicocvenancio: removing postStart hook for PWB update and restarting hub while gerrit.wikimedia.org is down

2019-03-17

  • 23:41 bd808: Cherry-picked https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/497210/ as a quick fix for T218494
  • 22:30 bd808: Investigating strange system state on tools-bastion-03.
  • 17:48 bstorm_: T218514 rebooting tools-worker-1009 and 1012
  • 17:46 bstorm_: depooling tools-worker-1009 and tools-worker-1012 for T218514
  • 17:13 bstorm_: depooled and rebooting tools-worker-1018
  • 15:09 andrewbogott: running `killall dpkg` and `dpkg --configure -a` on all nodes to try to work around a race with initramfs

2019-03-16

  • 22:34 bstorm_: clearing errored out queues again

2019-03-15

  • 21:08 bstorm_: cleared error state on several queues T217280
  • 15:58 gtirloni: rebooted tools-clushmaster-02
  • 14:40 mutante: tools-sgebastion-07 - dpkg-reconfigure locales and adding Korean ko_KR.EUC-KR - T130532
  • 14:32 mutante: tools-sgebastion-07 - generating locales for user request in T130532

2019-03-14

  • 23:52 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{21,22,23,24,25,26,27,28,29,30,31,32} (T217152)
  • 23:28 bd808: Deleted tools-bastion-05 (T217152)
  • 22:30 bd808: Removed obsolete submit hosts from Trusty grid config
  • 22:20 bd808: Removed tools-webgrid-lighttpd-142{0,1,2,5} from the grid and shutdown instances via horizon (T217152)
  • 22:10 bd808: Depooled tools-webgrid-lighttpd-142{0,1,2,5} (T217152)
  • 21:55 bd808: Removed submit host flag from tools-bastion-05.tools.eqiad.wmflabs, removed floating ip, and shutdown instance via horizon (T217152)
  • 21:48 bd808: Removed tools-exec-14{33,34,35,36,37,38,39,40,41,42} from the grid and shutdown instances via horizon (T217152)
  • 21:38 gtirloni: rebooted tools-sgewebgrid-generic-0904 (T218341)
  • 21:32 gtirloni: rebooted tools-exec-1020 (T218341)
  • 21:23 gtirloni: rebooted tools-sgeexec-0919, tools-sgeexec-0934, tools-worker-1018 (T218341)
  • 21:19 bd808: Killed jobs still running on tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs 2 weeks after being depooled (T217152)
  • 20:58 bd808: Repooled tools-sgeexec-0941 following reboot
  • 20:57 bd808: Hard reboot of tools-sgeexec-0941 via horizon
  • 20:54 bd808: Depooled and rebooted tools-sgeexec-0941.tools.eqiad.wmflabs
  • 20:53 bd808: Repooled tools-sgeexec-0917 following reboot
  • 20:52 bd808: Hard reboot of tools-sgeexec-0917 via horizon
  • 20:47 bd808: Depooled and rebooted tools-sgeexec-0917
  • 20:44 bd808: Repooled tools-sgeexec-0908 after reboot
  • 20:36 bd808: depooled and rebooted tools-sgeexec-0908
  • 19:08 gtirloni: rebooted tools-worker-1028 (T218341)
  • 19:08 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914 (T218341)
  • 19:07 gtirloni: rebooted tools-sgewebgrid-lighttpd-0914
  • 18:13 gtirloni: drained tools-worker-1028 for reboot (processes in D state)

2019-03-13

  • 23:30 bd808: Rebuilding stretch Kubernetes images
  • 22:55 bd808: Rebuilding jessie Kubernetes images
  • 17:11 bstorm_: specifically rebooted SGE cron server tools-sgecron-01
  • 17:10 bstorm_: rebooted cron server
  • 16:10 bd808: Updated DNS for dev.tools.wmflabs.org to point to Stretch secondary bastion. This was missed on 2019-03-07
  • 12:33 arturo: reboot tools-sgebastion-08 (T215154)
  • 12:17 arturo: reboot tools-sgebastion-07 (T215154)
  • 11:53 arturo: enable puppet in tools-sgebastion-07 (T215154)
  • 11:20 arturo: disable puppet in tools-sgebastion-07 for testing T215154
  • 05:07 bstorm_: re-enabled puppet for tools-sgebastion-07
  • 04:59 bstorm_: disabled puppet for a little bit on tools-sgebastion-07
  • 00:22 bd808: Raise web-memlimit for isbn tool to 6G for tomcat8 (T217406)

2019-03-11

  • 15:53 bd808: Manually started `service gridengine-master` on tools-sgegrid-master after reboot (T218038)
  • 15:47 bd808: Hard reboot of tools-sgegrid-master via Horizon UI (T218038)
  • 15:42 bd808: Rebooting tools-sgegrid-master (T218038)
  • 14:49 gtirloni: deleted tools-webgrid-lighttpd-1419
  • 00:53 bd808: Re-enabled 13 queue instances that had been disabled by LDAP failures during job initialization (T217280)

2019-03-10

  • 22:36 gtirloni: increased nscd group TTL from 60 to 300sec

2019-03-08

  • 19:48 andrewbogott: repooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
  • 19:21 andrewbogott: depooling tools-exec-1430 and tools-sgeexec-0905 to compare ldap usage
  • 17:49 bd808: Re-enabled 4 queue instances that had been disabled by LDAP failures during job initialization (T217280)
  • 00:30 bd808: DNS record created for trusty-dev.tools.wmflabs.org (Trusty secondary bastion)

2019-03-07

  • 23:31 bd808: Updated DNS to point login.tools.wmflabs.org at 185.15.56.48 (Stretch bastion)
  • 04:15 bd808: Killed 3 orphan processes on Trusty grid
  • 04:01 bd808: Cleared error state on a large number of Stretch grid queues which had been disabled by LDAP and/or NFS hiccups (T217280)
  • 00:49 zhuyifei1999_: clushed misctools 1.37 upgrade on @bastion,@cron,@bastion-stretch T217406
  • 00:38 zhuyifei1999_: published misctools 1.37 T217406
  • 00:34 zhuyifei1999_: begin building misctools 1.37 using debuild T217406

2019-03-06

  • 13:57 gtirloni: fixed SSH warnings in tools-clushmaster-02

2019-03-04

  • 19:07 bstorm_: umounted /mnt/nfs/dumps-labstore1006.wikimedia.org for T217473
  • 14:05 gtirloni: rebooted tools-docker-registry-{03,04}, tools-puppetmaster-02 and tools-puppetdb-01 (load avg >45, not accessible)

2019-03-03

  • 20:54 andrewbogott: cleaning out /tmp on tools-exec-1412

2019-02-28

  • 19:36 zhuyifei1999_: built with debuild instead T217297
  • 19:08 zhuyifei1999_: test failures during build, see ticket
  • 18:55 zhuyifei1999_: start building jobutils 1.36 T217297

2019-02-27

  • 20:41 andrewbogott: restarting nginx on tools-checker-01
  • 19:34 andrewbogott: uncordoning tools-worker-1028, 1002 and 1005, now in eqiad1-r
  • 16:20 zhuyifei1999_: regenerating k8s creds for tools.whichsub & tools.permission-denied-test T176027
  • 15:40 andrewbogott: moving tools-worker-1002, 1005, 1028 to eqiad1-r
  • 01:36 bd808: Shutdown tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs via horizon (T217152)
  • 01:29 bd808: Depooled tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs (T217152)
  • 01:26 bd808: Disabled job queues and rescheduled continuous jobs away from tools-exec-14{33,34,35,36,37,38,39,40,41,42}.tools.eqiad.wmflabs (T217152)

2019-02-26

  • 20:51 gtirloni: reboot tools-package-builder-02 (unresponsive)
  • 19:01 gtirloni: pushed updated docker images
  • 17:30 andrewbogott: draining and cordoning tools-worker-1027 for a region migration test

2019-02-25

  • 23:20 bstorm_: Depooled tools-sgeexec-0914 and tools-sgeexec-0915 for T217066
  • 21:41 andrewbogott: depooling tools-sgeexec-0911, tools-sgeexec-0912, tools-sgeexec-0913 to test T217066
  • 13:11 chicocvenancio: PAWS: Stopped AABot notebook pod T217010
  • 12:54 chicocvenancio: PAWS: Restarted Criscod notebook pod T217010
  • 12:21 chicocvenancio: PAWS: killed proxy and hub pods to attempt to get it to see routes to open notebooks servers to no avail. Restarted BernhardHumm's notebook pod T217010
  • 09:50 gtirloni: rebooted tools-sgeexec-09{16,22,40} (T216988)
  • 09:41 gtirloni: rebooted tools-sgeexec-09{16,22,40}
  • 08:37 zhuyifei1999_: uncordon tools-worker-1015.tools.eqiad.wmflabs
  • 08:34 legoktm: hard rebooted tools-worker-1015 via horizon
  • 07:48 zhuyifei1999_: systemd stuck in D state. :(
  • 07:44 zhuyifei1999_: I saved dmesg and process list to a few files in /root if that helps debugging
  • 07:43 zhuyifei1999_: D states are not responding to SIGKILL. Will reboot.
  • 07:37 zhuyifei1999_: tools-worker-1015.tools.eqiad.wmflabs having severe NFS issues (all NFS accessing processes are stuck in D state). Draining.

2019-02-22

  • 16:29 gtirloni: upgraded and rebooted tools-puppetmaster-01 (new kernel)
  • 15:59 gtirloni: started tools-puppetmaster-01 (new size: m1.large)
  • 15:13 gtirloni: shutdown tools-puppetmaster-01

2019-02-21

  • 09:59 gtirloni: upgraded all packages in all stretch nodes
  • 00:12 zhuyifei1999_: forcing puppet run on tools-k8s-master-01
  • 00:08 zhuyifei1999_: running /usr/local/bin/git-sync-upstream on tools-puppetmaster-01 to speed puppet changes up

2019-02-20

  • 23:30 zhuyifei1999_: begin rebuilding all docker images T178601 T193646 T215683
  • 23:25 zhuyifei1999_: upgraded toollabs-webservice on tools-bastion-02 to 0.44 (newly-built version)
  • 23:19 zhuyifei1999_: this was built for stretch. hopefully it works for all distros
  • 23:17 zhuyifei1999_: begin build new tools-webservice package T178601 T193646 T215683
  • 21:57 andrewbogott: moving tools-static-13 to a new virt host
  • 21:34 andrewbogott: moving the tools-static IP from tools-static-13 to tools-static-12
  • 19:17 andrewbogott: moving tools-bastion-02 to labvirt1004
  • 16:56 andrewbogott: moving tools-paws-worker-1003
  • 15:53 andrewbogott: moving tools-worker-1017, tools-worker-1027, tools-worker-1028
  • 15:04 andrewbogott: moving tools-exec-1413 and tools-exec-1442

2019-02-19

  • 01:49 bd808: Revoked Toolforge project membership for user DannyS712 (T215092)

2019-02-18

  • 20:45 gtirloni: upgraded and rebooted tools-sgebastion-07 (login-stretch)
  • 20:22 gtirloni: enabled toolsdb monitoring in Icinga
  • 20:03 gtirloni: pointed tools-db.eqiad.wmflabs to 172.16.7.153
  • 18:50 chicocvenancio: moving paws back to toolsdb T216208
  • 13:47 arturo: rebooting tools-sgebastion-07 to try fixing general slowness

2019-02-17

  • 22:23 zhuyifei1999_: uncordon tools-worker-1010.tools.eqiad.wmflabs
  • 22:11 zhuyifei1999_: rebooting tools-worker-1010.tools.eqiad.wmflabs
  • 22:10 zhuyifei1999_: draining tools-worker-1010.tools.eqiad.wmflabs, `docker ps` is hanging. no idea why. also other weirdness like ContainerCreating forever

2019-02-16

  • 05:00 zhuyifei1999_: fixed by restarting flannel. another puppet run simply started kubelet
  • 04:58 zhuyifei1999_: puppet logs: https://phabricator.wikimedia.org/P8097. Docker is failing with 'Failed to load environment files: No such file or directory'
  • 04:52 zhuyifei1999_: copied the resolv.conf from tools-k8s-master-01, removing secondary DNS to make sure puppet fixes that, and starting puppet
  • 04:48 zhuyifei1999_: that host's resolv.conf is badly broken https://phabricator.wikimedia.org/P8096. The last Puppet run was at Thu Feb 14 15:21:09 UTC 2019 (2247 minutes ago)
  • 04:44 zhuyifei1999_: puppet is also failing bad here 'Error: Could not request certificate: getaddrinfo: Name or service not known'
  • 04:43 zhuyifei1999_: this one has logs full of 'Can't contact LDAP server'
  • 04:41 zhuyifei1999_: nslcd also broken on tools-worker-1005
  • 04:34 zhuyifei1999_: uncordon tools-worker-1014.tools.eqiad.wmflabs
  • 04:33 zhuyifei1999_: the issue was, /var/run/nslcd/socket was somehow a directory, AFAICT
  • 04:31 zhuyifei1999_: then started nslcd via systemctl and `id zhuyifei1999` returns correct stuffs
  • 04:30 zhuyifei1999_: `nslcd -nd` complains about 'nslcd: bind() to /var/run/nslcd/socket failed: Address already in use'. SIGTERMed a background nslcd, `rmdir /var/run/nslcd/socket`, and `nslcd -nd` seemingly starts to work
  • 04:23 zhuyifei1999_: drained tools-worker-1014.tools.eqiad.wmflabs
  • 04:16 zhuyifei1999_: logs: https://phabricator.wikimedia.org/P8095
  • 04:14 zhuyifei1999_: restarting nslcd on tools-worker-1014 in an attempt to fix that, service failed to start, looking into logs
  • 04:12 zhuyifei1999_: restarting nscd on tools-worker-1014 in an attempt to fix seemingly-not-attached-to-LDAP

2019-02-14

  • 21:57 bd808: Deleted old tools-proxy-02 instance
  • 21:57 bd808: Deleted old tools-proxy-01 instance
  • 21:56 bd808: Deleted old tools-package-builder-01 instance
  • 20:57 andrewbogott: rebooting tools-worker-1005
  • 20:34 andrewbogott: moving tools-exec-1409, tools-exec-1410, tools-exec-1414, tools-exec-1419
  • 19:55 andrewbogott: moving tools-webgrid-generic-1401 and tools-webgrid-lighttpd-1419
  • 19:33 andrewbogott: moving tools-checker-01 to labvirt1003
  • 19:25 andrewbogott: moving tools-elastic-02 to labvirt1003
  • 19:11 andrewbogott: moving tools-k8s-etcd-01 to labvirt1002
  • 18:37 andrewbogott: moving tools-exec-1418, tools-exec-1424 to labvirt1003
  • 18:34 andrewbogott: moving tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1406, tools-webgrid-lighttpd-1410 to labvirt1002
  • 17:35 arturo: T215154 tools-sgebastion-07 now running systemd 239 and starts enforcing user limits
  • 15:33 andrewbogott: moving tools-worker-1002, 1003, 1005, 1006, 1007, 1010, 1013, 1014 to different labvirts in order to move labvirt1012 to eqiad1-r

2019-02-13

  • 19:16 andrewbogott: deleting tools-sgewebgrid-generic-0901, tools-sgewebgrid-lighttpd-0901, tools-sgebastion-06
  • 15:16 zhuyifei1999_: `sudo /usr/local/bin/grid-configurator --all-domains --observer-pass $(grep OS_PASSWORD /etc/novaobserver.yaml|awk '{gsub(/"/,"",$2);print $2}')` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 T216042
  • 15:06 zhuyifei1999_: `sudo systemctl restart gridengine-master` on tools-sgegrid-master to attempt to make it recognize -sgebastion-07 T216042
  • 13:03 arturo: T216030 switch login-stretch.tools.wmflabs.org floating IP to tools-sgebastion-07
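
The `grep ... | awk ...` pipeline quoted in the 15:16 entry pulls the value of `OS_PASSWORD` out of a YAML file and strips the double quotes around it. A minimal sketch of just that extraction step, run against a throwaway file; the `extract_os_password` helper name and the sample file contents are hypothetical and only serve as illustration:

```shell
# Hypothetical helper: print the unquoted OS_PASSWORD value found in file $1.
extract_os_password() {
    # grep keeps the matching line; awk removes the quotes from the 2nd field.
    grep OS_PASSWORD "$1" | awk '{gsub(/"/,"",$2); print $2}'
}

# Made-up sample contents (not the real /etc/novaobserver.yaml).
sample=$(mktemp)
printf 'OS_USERNAME: "example-user"\nOS_PASSWORD: "example-secret"\n' > "$sample"
extract_os_password "$sample"
rm -f "$sample"
```

This relies on the key and value being separated by whitespace on a single line, as the logged command does.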

2019-02-12

  • 01:24 bd808: Stopped maintain-kubeusers, edited /etc/kubernetes/tokenauth, restarted maintain-kubeusers (T215704)

2019-02-11

  • 22:57 bd808: Shutoff tools-webgrid-lighttpd-14{01,13,24,26,27,28} via Horizon UI
  • 22:34 bd808: Decommissioned tools-webgrid-lighttpd-14{01,13,24,26,27,28}
  • 22:23 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1401.tools.eqiad.wmflabs
  • 22:21 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1413.tools.eqiad.wmflabs
  • 22:18 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1428.tools.eqiad.wmflabs
  • 22:07 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1427.tools.eqiad.wmflabs
  • 22:06 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
  • 22:05 bd808: sudo exec-manage depool tools-webgrid-lighttpd-1426.tools.eqiad.wmflabs
  • 20:06 bstorm_: Ran apt-get clean on tools-sgebastion-07 since it was running out of disk (and lots of it was the apt cache)
  • 19:09 bd808: Upgraded tools-manifest on tools-cron-01 to v0.19 (T107878)
  • 18:57 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.19 (T107878)
  • 18:57 bd808: Built tools-manifest_0.19_all.deb and published to aptly repos (T107878)
  • 18:26 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.18 (T107878)
  • 18:25 bd808: Built tools-manifest_0.18_all.deb and published to aptly repos (T107878)
  • 18:12 bd808: Upgraded tools-manifest on tools-sgecron-01 to v0.17 (T107878)
  • 18:08 bd808: Built tools-manifest_0.17_all.deb and published to aptly repos (T107878)
  • 10:41 godog: flip tools-prometheus proxy back to tools-prometheus-01 and upgrade to prometheus 2.7.1

2019-02-08

  • 19:17 hauskatze: Stopped webservice of `tools.sulinfo` which redirects to `tools.quentinv57-tools` which is also unavailable
  • 18:32 hauskatze: Stopped webservice for `tools.quentinv57-tools` for T210829.
  • 13:49 gtirloni: upgraded all packages in SGE cluster
  • 12:25 arturo: install aptitude in tools-sgebastion-06
  • 11:08 godog: flip tools-prometheus.wmflabs.org to tools-prometheus-02 - T215272
  • 01:07 bd808: Creating tools-sgebastion-07

2019-02-07

  • 23:48 bd808: Updated DNS to make tools-trusty.wmflabs.org and trusty.tools.wmflabs.org CNAMEs for login-trusty.tools.wmflabs.org
  • 20:18 gtirloni: cleared mail queue on tools-mail-02
  • 08:41 godog: upgrade prometheus-02 to prometheus 2.6 - T215272

2019-02-04

  • 13:20 arturo: T215154 another reboot for tools-sgebastion-06
  • 12:26 arturo: T215154 another reboot for tools-sgebastion-06. Puppet is disabled
  • 11:38 arturo: T215154 reboot tools-sgebastion-06 to totally refresh systemd status
  • 11:36 arturo: T215154 manually install systemd 239 in tools-sgebastion-06

2019-01-30

  • 23:54 gtirloni: cleared apt cache on sge* hosts

2019-01-25

  • 20:50 bd808: Deployed new tcl/web Kubernetes image based on Debian Stretch (T214668)
  • 14:22 andrewbogott: draining and moving tools-worker-1016 to a new labvirt for T214447
  • 14:22 andrewbogott: draining and moving tools-worker-1021 to a new labvirt for T214447

2019-01-24

  • 11:09 arturo: T213421 delete tools-services-01/02
  • 09:46 arturo: T213418 delete tools-docker-registry-02
  • 09:45 arturo: T213418 delete tools-docker-builder-05 and tools-docker-registry-01
  • 03:28 bd808: Fixed rebase conflict in labs/private on tools-puppetmaster-01

2019-01-23

  • 22:18 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance using Stretch base image (T214519)
  • 22:09 bd808: Deleted tools-sgewebgrid-lighttpd-0904 instance via Horizon, used wrong base image (T214519)
  • 21:04 bd808: Building new tools-sgewebgrid-lighttpd-0904 instance (T214519)
  • 20:53 bd808: Deleted broken tools-sgewebgrid-lighttpd-0904 instance via Horizon (T214519)
  • 19:49 andrewbogott: shutting down eqiad-region proxies tools-proxy-01 and tools-proxy-02
  • 17:44 bd808: Added rules to default security group for prometheus monitoring on port 9100 (T211684)

2019-01-22

  • 20:21 gtirloni: published new docker images (all)
  • 18:57 bd808: Changed deb-tools.wmflabs.org proxy to point to tools-sge-services-03.tools.eqiad.wmflabs

2019-01-21

  • 05:25 andrewbogott: restarted tools-sgeexec-0906 and tools-sgeexec-0904; they seem better now but I have not repooled them yet

2019-01-18

  • 21:22 bd808: Forcing php-igbinary update via clush for T213666

2019-01-17

  • 23:37 bd808: Shutdown tools-package-builder-01. Use tools-package-builder-02 instead!
  • 22:09 bd808: Upgrading tools-manifest to 0.16 on tools-sgecron-01
  • 22:05 bd808: Upgrading tools-manifest to 0.16 on tools-cron-01
  • 21:51 bd808: Upgrading tools-manifest to 0.15 on tools-cron-01
  • 20:41 bd808: Building tools-package-builder-02 to replace tools-package-builder-01
  • 17:16 arturo: T213421 shutdown tools-services-01/02. Will delete VMs after a grace period
  • 12:54 arturo: add webservice security group to tools-sge-services-03/04

2019-01-16

  • 17:29 andrewbogott: depooling and moving tools-sgeexec-0904 tools-sgeexec-0906 tools-sgewebgrid-lighttpd-0904
  • 16:38 arturo: T213418 shutdown tools-docker-registry-01 and 02. Will delete the instances in a week or so
  • 14:34 arturo: T213418 point docker-registry.tools.wmflabs.org to tools-docker-registry-03 (was in -02)
  • 14:24 arturo: T213418 allocate floating IPs for tools-docker-registry-03 & 04

2019-01-15

  • 21:02 bstorm_: restarting webservicemonitor on tools-services-02 -- acting funny
  • 18:46 bd808: Dropped A record for www.tools.wmflabs.org and replaced it with a CNAME pointing to tools.wmflabs.org.
  • 18:29 bstorm_: T213711 installed python3-requests=2.11.1-1~bpo8+1 python3-urllib3=1.16-1~bpo8+1 on tools-proxy-03, which stopped the bleeding
  • 14:55 arturo: disable puppet in tools-docker-registry-01 and tools-docker-registry-02, trying with `role::wmcs::toolforge::docker::registry` in the puppetmaster for -03 and -04. The registry shouldn't be affected by this
  • 14:21 arturo: T213418 put a backup of the docker registry in NFS just in case: `aborrero@tools-docker-registry-02:$ sudo cp /srv/registry/registry.tar.gz /data/project/.system_sge/docker-registry-backup/`

2019-01-14

  • 22:03 bstorm_: T213711 Added UDP port needed for flannel packets to work to k8s worker sec groups in both eqiad and eqiad1-r
  • 22:03 bstorm_: T213711 Added ports needed for etcd-flannel to work on the etcd security group in eqiad
  • 21:42 zhuyifei1999_: also `write`-ed to them (as root). auth on my personal account would take a long time
  • 21:37 zhuyifei1999_: that command belonged to tools.scholia (with fnielsen as the ssh user)
  • 21:36 zhuyifei1999_: killed an egrep using too much NFS bandwidth on tools-bastion-03
  • 21:33 zhuyifei1999_: SIGTERM PID 12542 24780 875 14569 14722. `tail`s with parent as init, belonging to user maxlath. they should submit to grid.
  • 16:44 arturo: T213418 docker-registry.tools.wmflabs.org point floating IP to tools-docker-registry-02
  • 14:00 arturo: T213421 disable updatetools in the new services nodes while building them
  • 13:53 arturo: T213421 delete tools-services-03/04 and create them with another prefix: tools-sge-services-03/04 to actually use the new role
  • 13:47 arturo: T213421 create tools-services-03 and tools-services-04 (stretch) they will use the new puppet role `role::wmcs::toolforge::services`

2019-01-11

  • 11:55 arturo: T213418 shutdown tools-docker-builder-05, will give a grace period before deleting the VM
  • 10:51 arturo: T213418 created tools-docker-builder-06 in eqiad1
  • 10:46 arturo: T213418 migrating tools-docker-registry-02 from eqiad to eqiad1

2019-01-10

  • 22:45 bstorm_: T213357 - Added 24 lighttpd nodes to the new grid
  • 18:54 bstorm_: T213355 built and configured two more generic web nodes for the new grid
  • 10:35 gtirloni: deleted non-puppetized checks from tools-checker-0[1,2]
  • 00:12 bstorm_: T213353 Added 36 exec nodes to the new grid

2019-01-09

  • 20:16 andrewbogott: moving tools-paws-worker-1013 and tools-paws-worker-1007 to eqiad1
  • 17:17 andrewbogott: moving paws-worker-1017 and paws-worker-1016 to eqiad1
  • 14:42 andrewbogott: experimentally moving tools-paws-worker-1019 to eqiad1
  • 09:59 gtirloni: rebooted tools-checker-01 (T213252)

2019-01-07

  • 17:21 bstorm_: T67777 - set the max_u_jobs global grid config setting to 50 in the new grid
  • 15:54 bstorm_: T67777 Set stretch grid user job limit to 16
  • 05:45 bd808: Manually installed python3-venv on tools-sgebastion-06. Gerrit patch submitted for proper automation.

2019-01-06

  • 22:06 bd808: Added floating ip to tools-sgebastion-06 (T212360)

2019-01-05

  • 23:54 bd808: Manually installed php-mbstring on tools-sgebastion-06. Gerrit patch submitted to install it on the rest of the Son of Grid Engine nodes.

2019-01-04

  • 21:37 bd808: Truncated /data/project/.system/accounting after archiving ~30 days of history

2019-01-03

  • 21:03 bd808: Enabled Puppet on tools-proxy-02
  • 20:53 bd808: Disabled Puppet on tools-proxy-02
  • 20:51 bd808: Enabled Puppet on tools-proxy-01
  • 20:49 bd808: Disabled Puppet on tools-proxy-01

2018-12-21

  • 16:29 andrewbogott: migrating tools-exec-1416 to labvirt1004
  • 16:01 andrewbogott: moving tools-grid-master to labvirt1004
  • 00:35 bd808: Installed tools-manifest 0.14 for T212390
  • 00:22 bd808: Rebuilding all docker containers with toollabs-webservice 0.43 for T212390
  • 00:19 bd808: Installed toollabs-webservice 0.43 on all hosts for T212390
  • 00:01 bd808: Installed toollabs-webservice 0.43 on tools-bastion-02 for T212390

2018-12-20

  • 20:43 andrewbogott: moving tools-prometheus-02 to labvirt1004
  • 20:42 andrewbogott: moving tools-k8s-etcd-02 to labvirt1003
  • 20:41 andrewbogott: moving tools-package-builder-01 to labvirt1002

2018-12-17

  • 22:16 bstorm_: Adding a bunch of hiera values and prefixes for the new grid - T212153
  • 19:18 gtirloni: decreased nfs-mount-manager verbosity (T211817)
  • 19:02 arturo: T211977 add package tools-manifest 0.13 to stretch-tools & stretch-toolsbeta in aptly
  • 13:46 arturo: T211977 `aborrero@tools-services-01:~$ sudo aptly repo move trusty-tools stretch-toolsbeta 'tools-manifest (=0.12)'`

2018-12-11

  • 13:19 gtirloni: Removed BigBrother (T208357)

2018-12-05

  • 12:17 gtirloni: removed node tools-worker-1029.tools.eqiad.wmflabs from cluster (T196973)

2018-12-04

  • 22:47 bstorm_: gtirloni added back main floating IP for tools-k8s-master-01 and removed unnecessary ones to stop k8s outage T164123
  • 20:03 gtirloni: removed floating IPs from tools-k8s-master-01 (T164123)

2018-12-01

  • 02:44 gtirloni: deleted instance tools-exec-gift-trusty-01 (T194615)
  • 00:10 andrewbogott: moving tools-worker-1020 and tools-worker-1022 to different labvirts

2018-11-30

  • 23:13 andrewbogott: moving tools-worker-1009 and tools-worker-1019 to different labvirts
  • 22:18 gtirloni: Pushed new jdk8 docker image based on stretch (T205774)
  • 18:15 gtirloni: shutdown tools-exec-gift-trusty-01 instance (T194615)

2018-11-27

  • 17:49 bstorm_: restarted maintain-kubeusers just in case it had any issues reconnecting to toolsdb

2018-11-26

  • 17:39 gtirloni: updated tools-manifest package on tools-services-01/02 to version 0.12 (10->60 seconds sleep time) (T210190)
  • 17:34 gtirloni: T186571 removed legofan4000 user from project-tools group (again)
  • 13:31 gtirloni: deleted instance tools-clushmaster-01 (T209701)

2018-11-20

  • 23:05 gtirloni: Published stretch-tools and stretch-toolsbeta aptly repositories individually on tools-services-01
  • 14:18 gtirloni: Created Puppet prefixes 'tools-clushmaster' & 'tools-mail'
  • 13:24 gtirloni: shutdown tools-clushmaster-01 (use tools-clushmaster-02)
  • 10:52 arturo: T208579 distributing now misctools and jobutils 1.33 in all aptly repos
  • 09:43 godog: restart prometheus@tools on prometheus-01

2018-11-16

  • 21:16 bd808: Ran grid engine orphan process kill script from T153281. Only 3 orphan php-cgi processes belonging to iluvatarbot found.
  • 17:47 gtirloni: deleted tools-mail instance
  • 17:23 andrewbogott: moving tools-docker-registry-02 to labvirt1001
  • 17:07 andrewbogott: moving tools-elastic-03 to labvirt1007
  • 13:36 gtirloni: rebooted tools-static-12 and tools-static-13 after package upgrades

2018-11-14

  • 17:29 andrewbogott: moving tools-worker-1027 to labvirt1008
  • 17:18 andrewbogott: moving tools-webgrid-lighttpd-1417 to labvirt1005
  • 17:15 andrewbogott: moving tools-exec-1420 to labvirt1009

2018-11-13

  • 17:40 arturo: remove misctools 1.31 and jobutils 1.30 from the stretch-tools repo (T207970)
  • 13:32 gtirloni: pointed mail.tools.wmflabs.org to new IP 208.80.155.158
  • 13:29 gtirloni: Changed active mail relay to tools-mail-02 (T209356)
  • 13:22 arturo: T207970 misctools and jobutils v1.32 are now in both `stretch-tools` and `stretch-toolsbeta` repos in tools-services-01
  • 13:05 arturo: T207970 there is now a `stretch-toolsbeta` repo in tools-services-01, still empty
  • 12:59 arturo: the puppet issue has been solved by reverting the code
  • 12:28 arturo: puppet broken in toolforge due to a refactor. Will be fixed in a bit

2018-11-08

  • 18:12 gtirloni: cleaned up old tmp files on tools-bastion-02
  • 17:58 arturo: installing jobutils and misctools v1.32 (T207970)
  • 17:18 gtirloni: cleaned up old tmp files on tools-exec-1406
  • 16:56 gtirloni: cleaned up /tmp on tools-bastion-05
  • 16:37 gtirloni: re-enabled tools-webgrid-lighttpd-1424.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1414.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1408.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-webgrid-lighttpd-1403.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-exec-1429.tools.eqiad.wmflabs
  • 16:36 gtirloni: re-enabled tools-exec-1411.tools.eqiad.wmflabs
  • 16:29 bstorm_: re-enabled tools-exec-1433.tools.eqiad.wmflabs
  • 11:32 gtirloni: removed temporary /var/mail fix (T208843)

2018-11-07

  • 10:37 gtirloni: removed invalid apt.conf.d file from all hosts (T110055)

2018-11-02

  • 18:11 arturo: T206223 some disturbances due to the certificate renewal
  • 17:04 arturo: renewing *.wmflabs.org T206223

2018-10-31

  • 18:02 gtirloni: truncated big .err and error.log files
  • 13:15 addshore: removing Jonas Kress (WMDE) from tools project, no longer with wmde

2018-10-29

  • 17:00 bd808: Ran grid engine orphan process kill script from T153281

2018-10-26

  • 10:34 arturo: T207970 added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo
  • 10:32 arturo: T209970 added misctools 1.31 and jobutils 1.30 to stretch-tools aptly repo

2018-10-19

  • 14:17 andrewbogott: moving tools-clushmaster-01 to labvirt1004
  • 00:29 andrewbogott: migrating tools-exec-1411 and tools-exec-1410 off of cloudvirt1017

2018-10-18

  • 19:57 andrewbogott: moving tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420 and tools-webgrid-lighttpd-1421 to labvirt1009, 1010 and 1011 as part of (gradually) draining labvirt1017

2018-10-16

  • 15:13 bd808: (repost for gtirloni) T186571 removed legofan4000 user from project-tools group (leftover from T165624 legofan4000->macfan4000 rename)

2018-10-07

  • 21:57 zhuyifei1999_: restarted maintain-kubeusers on tools-k8s-master-01 T194859
  • 21:48 zhuyifei1999_: maintain-kubeusers on tools-k8s-master-01 seems to be in an infinite loop of 10 seconds. installed python3-dbg
  • 21:44 zhuyifei1999_: journal on tools-k8s-master-01 is full of etcd failures, did a puppet run, nothing interesting happens

2018-09-21

  • 12:35 arturo: cleanup stalled apt preference files (pinning) in tools-clushmaster-01
  • 12:14 arturo: T205078 same for {jessie,stretch}-wikimedia
  • 12:12 arturo: T205078 upgrade trusty-wikimedia packages (git-fat, debmonitor)
  • 11:57 arturo: T205078 purge packages smbclient libsmbclient libwbclient0 python-samba samba-common samba-libs from trusty machines

2018-09-17

  • 09:13 arturo: T204481 aborrero@tools-mail:~$ sudo exiqgrep -i | xargs sudo exim -Mrm

2018-09-14

  • 11:22 arturo: T204267 stop the corhist tool (k8s) because it is hammering the wikidata API
  • 10:51 arturo: T204267 stop the openrefine-wikidata tool (k8s) because it is hammering the wikidata API

2018-09-08

  • 10:35 gtirloni: restarted cron and truncated /var/log/exim4/paniclog (T196137)

2018-09-07

  • 05:07 legoktm: uploaded/imported toollabs-webservice_0.42_all.deb

2018-08-27

  • 23:40 bd808: `# exec-manage repool tools-webgrid-generic-1402.eqiad.wmflabs` T202932
  • 23:28 bd808: Restarted down instance tools-webgrid-generic-1402 & ran apt-upgrade
  • 22:36 zhuyifei1999_: `# exec-manage depool tools-webgrid-generic-1402.eqiad.wmflabs` T202932

2018-08-22

  • 13:02 arturo: I used this command: `sudo exim -bp | sudo exiqgrep -i | xargs sudo exim -Mrm`
  • 13:00 arturo: remove all emails in tools-mail.eqiad.wmflabs queue, 3378 bounce msgs, mostly related to @qq.com

2018-08-19

2018-08-14

2018-08-13

  • 23:31 legoktm: rebuilding docker images for webservice upgrade
  • 23:16 legoktm: published toollabs-webservice_0.41_all.deb
  • 23:06 legoktm: fixed permissions of tools-package-builder-01:/srv/src/tools-webservice

2018-08-09

  • 10:40 arturo: T201602 upgrade packages from jessie-backports (excluding python-designateclient)
  • 10:30 arturo: T201602 upgrade packages from jessie-wikimedia
  • 10:27 arturo: T201602 upgrade packages from trusty-updates

2018-08-08

  • 10:01 zhuyifei1999_: building & publishing toollabs-webservice 0.40 deb, and all Docker images T156626 T148872 T158244

2018-08-06

  • 12:33 arturo: T197176 installing texlive-full in toolforge

2018-08-01

  • 14:31 andrewbogott: temporarily depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 to try to give labvirt1009 a break

2018-07-30

  • 20:33 bd808: Started rebuilding all Kubernetes Docker images to pick up latest apt updates
  • 04:47 legoktm: added toollabs-webservice_0.39_all.deb to stretch-tools

2018-07-27

  • 04:52 zhuyifei1999_: rebuilding python/base docker container T190274

2018-07-25

  • 19:02 chasemp: tools-worker-1004 reboot
  • 19:01 chasemp: ifconfig eth0:fakenfs 208.80.155.106 netmask 255.255.255.255 up on tools-worker-1004 (late log)

2018-07-18

  • 13:24 arturo: upgrading packages from `stretch-wikimedia` T199905
  • 13:18 arturo: upgrading packages from `stable` T199905
  • 12:51 arturo: upgrading packages from `oldstable` T199905
  • 12:31 arturo: upgrading packages from `trusty-updates` T199905
  • 12:16 arturo: upgrading packages from `jessie-wikimedia` T199905
  • 12:09 arturo: upgrading packages from `trusty-wikimedia` T199905

2018-06-30

  • 18:15 chicocvenancio: pushed new config to PAWS to fix dumps nfs mountpoint
  • 16:40 zhuyifei1999_: because tools-paws-master-01 was having ~1000 loadavg due to NFS having issues and processes stuck in D state
  • 16:39 zhuyifei1999_: reboot tools-paws-master-01
  • 16:35 zhuyifei1999_: `root@tools-paws-master-01:~# sed -i 's/^labstore1006.wikimedia.org/#labstore1006.wikimedia.org/' /etc/fstab`
  • 16:34 andrewbogott: "sed -i '/labstore1006/d' /etc/fstab" everywhere
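
The two `sed` one-liners in the 16:35 and 16:34 entries take different approaches: the first comments out the fstab line that starts with the host name, the second deletes every line mentioning `labstore1006`. A rough sketch of the difference, written without `-i` so it only prints the result and never edits a file in place; the helper names and the sample fstab lines are hypothetical:

```shell
# Hypothetical helper: comment out lines of file $2 that begin with host $1.
comment_out_host() {
    sed "s/^$1/#$1/" "$2"
}

# Hypothetical helper: drop every line of file $2 that contains pattern $1.
delete_matching() {
    sed "/$1/d" "$2"
}

# Made-up fstab contents for illustration only.
fstab=$(mktemp)
printf '%s\n' \
    'labstore1006.wikimedia.org:/example /mnt/nfs/dumps-labstore1006.wikimedia.org nfs ro 0 0' \
    '/dev/vda1 / ext4 defaults 0 1' > "$fstab"

comment_out_host labstore1006.wikimedia.org "$fstab"
delete_matching labstore1006 "$fstab"
rm -f "$fstab"
```

Commenting out keeps the entry around for later restoration, while deleting leaves nothing to re-enable by hand.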

2018-06-29

  • 17:41 bd808: Rescheduling continuous jobs away from tools-exec-1408 where load is high
  • 17:11 bd808: Rescheduled jobs away from tools-exec-1404 where linkwatcher is currently stealing most of the CPU (T123121)
  • 16:46 bd808: Killed orphan tool owned processes running on the job grid. Mostly jembot and wsexport php-cgi processes stuck in deadlock following an OOM. T182070

2018-06-28

  • 19:50 chasemp: tools-clushmaster-01:~$ clush -w @all 'sudo umount -fl /mnt/nfs/dumps-labstore1006.wikimedia.org'
  • 18:02 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo umount -fl /mnt/nfs/dumps-labstore1007.wikimedia.org"
  • 17:53 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --disable 'labstore1007 outage'"
  • 17:20 chasemp: tools-worker-1007:~# /sbin/reboot
  • 16:48 arturo: rebooting tools-docker-registry-01
  • 16:42 andrewbogott: rebooting tools-worker-<everything> to get NFS unstuck
  • 16:40 andrewbogott: rebooting tools-worker-1012 and tools-worker-1015 to get their nfs mounts unstuck

2018-06-21

  • 13:18 chasemp: tools-bastion-03:~# bash -x /data/project/paws/paws-userhomes-hack.bash

2018-06-20

  • 15:09 bd808: Killed orphan processes on webgrid nodes (T182070); most owned by jembot and croptool

2018-06-14

  • 14:20 chasemp: timeout 180s bash -x /data/project/paws/paws-userhomes-hack.bash

2018-06-11

  • 10:11 arturo: T196137 `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo wc -l /var/log/exim4/paniclog 2>/dev/null | grep -v ^0 && sudo rm -rf /var/log/exim4/paniclog && sudo service prometheus-node-exporter restart || true'`
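
The `clush` command above chains `wc -l ... | grep -v ^0 && ... || true` so that the cleanup only runs on hosts where the paniclog has at least one line, and so that a missing or empty file does not count as a failure. A minimal local sketch of that control flow, leaving out `sudo`, `clush` and the `prometheus-node-exporter` restart; the `remove_if_nonempty` name is hypothetical, and the check assumes `wc -l` prints the count without leading padding, as GNU coreutils does:

```shell
# Hypothetical helper: delete file $1 only if it exists and has >= 1 line.
# Always returns success, mirroring the trailing `|| true` in the log entry.
remove_if_nonempty() {
    wc -l "$1" 2>/dev/null | grep -v '^0' >/dev/null && rm -f "$1" || true
}

# Illustration in a scratch directory.
scratch=$(mktemp -d)
printf 'some panic message\n' > "$scratch/paniclog"
: > "$scratch/empty"

remove_if_nonempty "$scratch/paniclog"   # has one line: removed
remove_if_nonempty "$scratch/empty"      # zero lines: kept
remove_if_nonempty "$scratch/missing"    # absent: silently ignored
ls "$scratch"
rm -rf "$scratch"
```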

2018-06-08

  • 07:46 arturo: T196137 more rootspam today, restarting again `prometheus-node-exporter` and force rotating exim4 paniclog in 12 nodes

2018-06-07

  • 11:01 arturo: T196137 force rotate all exim paniclog files to avoid rootspam `aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo logrotate /etc/logrotate.d/exim4-paniclog -f -v'`

2018-06-06

  • 22:00 bd808: Scripting a restart of webservice for tools that are still in CrashLoopBackOff state after 2nd attempt (T196589)
  • 21:10 bd808: Scripting a restart of webservice for 59 tools that are still in CrashLoopBackOff state after last attempt (P7220)
  • 20:25 bd808: Scripting a restart of webservice for 175 tools that are in CrashLoopBackOff state (P7220)
  • 19:04 chasemp: tools-bastion-03 is virtually unusable
  • 09:49 arturo: T196137 aborrero@tools-clushmaster-01:~$ clush -w@all 'sudo service prometheus-node-exporter restart' <-- procs using the old uid

2018-06-05

  • 18:02 bd808: Forced puppet run on tools-bastion-03 to re-enable logins by debenben (T196486)
  • 17:39 arturo: T196137 clush: delete `prometheus` user and re-create it locally. Then, chown prometheus dirs
  • 17:38 bd808: Added grid engine quota to limit user debenben to 2 concurrent jobs (T196486)

2018-06-04

  • 10:28 arturo: T196006 installing sqlite3 package in exec nodes

2018-06-03

  • 10:19 zhuyifei1999_: Grid is full. qdel'ed all jobs belonging to tools.dibot except lighttpd, and tools.mbh that has a job name starting 'comm_delin', 'delfilexcl' T195834

2018-05-31

2018-05-30

  • 10:52 zhuyifei1999_: undid both changes to tools-bastion-05
  • 10:50 zhuyifei1999_: also making /proc/sys/kernel/yama/ptrace_scope 0 temporarily on tools-bastion-05
  • 10:45 zhuyifei1999_: installing mono-runtime-dbg on tools-bastion-05 to produce debugging information; was previously installed on tools-exec-1413 & 1441. Might be a good idea to uninstall them once we can close T195834

2018-05-28

  • 12:09 arturo: T194665 adding mono packages to apt.wikimedia.org for jessie-wikimedia and stretch-wikimedia
  • 12:06 arturo: T194665 adding mono packages to apt.wikimedia.org for trusty-wikimedia

2018-05-25

  • 05:31 zhuyifei1999_: Edit /data/project/.system/gridengine/default/common/sge_request, h_vmem 256M -> 512M, release precise -> trusty T195558

2018-05-22

2018-05-18

  • 16:36 bd808: Restarted bigbrother on tools-services-02

2018-05-16

  • 21:17 zhuyifei1999_: maintain-kubeusers is stuck in infinite sleeps of 10 seconds

2018-05-15

  • 04:28 andrewbogott: depooling, rebooting, re-pooling tools-exec-1414. It's hanging for unknown reasons.
  • 04:07 zhuyifei1999_: Draining unresponsive tools-exec-1414 following Portal:Toolforge/Admin#Draining_a_node_of_Jobs
  • 04:05 zhuyifei1999_: Force deletion of grid job 5221417 (tools.giftbot sga), host tools-exec-1414 not responding

2018-05-12

  • 10:09 Hauskatze: tools.quentinv57-tools@tools-bastion-02:~$ webservice stop | T194343

2018-05-11

  • 14:34 andrewbogott: repooling labvirt1001 tools instances
  • 13:59 andrewbogott: depooling a bunch of things before rebooting labvirt1001 for T194258: tools-exec-1401 tools-exec-1407 tools-exec-1408 tools-exec-1430 tools-exec-1431 tools-exec-1432 tools-exec-1435 tools-exec-1438 tools-exec-1439 tools-exec-1441 tools-webgrid-lighttpd-1402 tools-webgrid-lighttpd-1407

2018-05-10

  • 18:55 andrewbogott: depooling, rebooting, repooling tools-exec-1401 to test a kernel update

2018-05-09

  • 21:11 Reedy: Added Tim Starling as member/admin

2018-05-07

  • 21:02 zhuyifei1999_: re-building all docker images T190893
  • 20:49 zhuyifei1999_: building, signing, and publishing toollabs-webservice 0.39 T190893
  • 00:25 zhuyifei1999_: `renice -n 15 -p 28865` (`tar cvzf` of `tools.giftbot`) on tools-bastion-02, been hogging the NFS IO for a few hours

2018-05-05

  • 23:37 zhuyifei1999_: regenerate k8s creds for tools.zhuyifei1999-test because I messed up while testing

2018-05-03

  • 14:48 arturo: uploaded a new ruby docker image to the registry with the libmysqlclient-dev package T192566

2018-05-01

  • 14:05 andrewbogott: moving tools-webgrid-lighttpd-1406 to labvirt1016 (routine rebalancing)

2018-04-27

  • 18:26 zhuyifei1999_: `$ write` doesn't seem to be able to write to their tmux tty, so echoed into their pts directly: `# echo -e '\n\n[...]\n' > /dev/pts/81`
  • 18:17 zhuyifei1999_: SIGTERM tools-bastion-03 PID 6562 tools.zoomproof celery worker

2018-04-23

  • 14:41 zhuyifei1999_: `chown tools.pywikibot:tools.pywikibot /shared/pywikipedia/` Prior owner: tools.russbot:project-tools T192732

2018-04-22

  • 13:07 bd808: Kill orphan php-cgi processes across the job grid via `clush -w @exec -w @webgrid -b 'ps axwo user:20,ppid,pid,cmd | grep -E " 1 " | grep php-cgi | xargs sudo kill -9'`

2018-04-15

  • 17:51 zhuyifei1999_: forced puppet runs across tools-elastic-0[1-3] T192224
  • 17:45 zhuyifei1999_: granted elasticsearch credentials to tools.flaky-ci T192224

2018-04-11

  • 13:25 chasemp: cleanup exim frozen messages in an effort to alleviate queue pressure

2018-04-06

  • 16:30 chicocvenancio: killed job in bastion, tools.gpy affected
  • 14:30 arturo: add puppet class `toollabs::apt_pinning` to tools-puppetmaster-01 using horizon, to add some apt pinning related to T159254
  • 11:23 arturo: manually upgrade apache2 on tools-puppetmaster for T159254

2018-04-05

  • 18:46 chicocvenancio: killed wget that was hogging io

2018-03-29

  • 20:09 chicocvenancio: killed interactive processes in tools-bastion-03
  • 19:56 chicocvenancio: several interactive jobs running in bastion-03. I am writing to connected users and will kill the jobs once done

2018-03-28

  • 13:06 zhuyifei1999_: SIGTERM PID 30633 on tools-bastion-03 (tool 3d2commons's celery). Please run this on grid

2018-03-26

  • 21:34 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'
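
Because `find ... -delete` is irreversible, a preview of what would be removed can be useful before running it across the grid. The following is a minimal sketch, assuming GNU find; `list_stale_files` is a hypothetical name used only for illustration.

```shell
# Hypothetical helper: list regular files under DIR whose last access is more
# than DAYS whole days old (default 1, matching `-atime +1`), deleting nothing.
list_stale_files() {
    find "$1" -type f -atime +"${2:-1}" -print
}

# Example use (not executed here):
#   list_stale_files /tmp        # preview
#   sudo find /tmp -type f -atime +1 -delete   # the logged cleanup
```

Note that `-atime +1` rounds the file age down to whole days, so a file has to be at least two days past its last access before it matches.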

2018-03-23

2018-03-22

  • 22:04 bd808: Forced puppet run on tools-proxy-02 for T130748
  • 21:52 bd808: Forced puppet run on tools-proxy-01 for T130748
  • 21:48 bd808: Disabled puppet on tools-proxy-* for https://gerrit.wikimedia.org/r/#/c/420619/ rollout
  • 03:50 bd808: clush -w @exec -w @webgrid -b 'sudo find /tmp -type f -atime +1 -delete'

2018-03-21

  • 17:50 bd808: Cleaned up stale /project/.system/bigbrother.scoreboard.* files from labstore1004
  • 01:09 bd808: Deleting /tmp files owned by tools.wsexport with -mtime +2 across grid (T190185)

2018-03-20

  • 08:28 zhuyifei1999_: unmount dumps & remount on tools-bastion-02 (can someone clush this?) T189018 T190126

2018-03-19

  • 11:02 arturo: reboot tools-exec-1408, to balance load. Server is unresponsive due to high load by some tools

2018-03-16

  • 22:44 zhuyifei1999_: suspended process 22825 (BotOrderOfChapters.exe) on tools-bastion-03. Threads continuously going to D-state & R-state. Also sent message via $ write on pts/10
  • 12:13 arturo: reboot tools-webgrid-lighttpd-1420 due to almost full /tmp

2018-03-15

  • 16:56 zhuyifei1999_: granted elasticsearch credentials to tools.denkmalbot T185624

2018-03-14

  • 20:57 bd808: Upgrading elasticsearch on tools-elastic-01 (T181531)
  • 20:53 bd808: Upgrading elasticsearch on tools-elastic-02 (T181531)
  • 20:51 bd808: Upgrading elasticsearch on tools-elastic-03 (T181531)
  • 12:07 arturo: reboot tools-webgrid-lighttpd-1415, almost full /tmp
  • 12:01 arturo: repool tools-webgrid-lighttpd-1421, /tmp is now empty
  • 11:56 arturo: depool tools-webgrid-lighttpd-1421 for reboot due to /tmp almost full

2018-03-12

  • 20:09 madhuvishy: Run clush -w @all -b 'sudo umount /mnt/nfs/labstore1003-scratch && sudo mount -a' to remount scratch across all of tools
  • 17:13 arturo: T188994 upgrading packages from `stable`
  • 16:53 arturo: T188994 upgrading packages from stretch-wikimedia
  • 16:33 arturo: T188994 upgrading packages from jessie-wikimedia
  • 14:58 zhuyifei1999_: building, publishing, and deploying misctools 1.31 5f3561e T189430
  • 13:31 arturo: tools-exec-1441 and tools-exec-1442 rebooted fine and are repooled
  • 13:26 arturo: depool tools-exec-1441 and tools-exec-1442 for reboots
  • 13:19 arturo: T188994 upgrade packages from jessie-backports in all jessie servers
  • 12:49 arturo: T188994 upgrade packages from trusty-updates in all ubuntu servers
  • 12:34 arturo: T188994 upgrade packages from trusty-wikimedia in all ubuntu servers

2018-03-08

  • 16:05 chasemp: tools-clushmaster-01:~$ clush -g all 'sudo puppet agent --test'
  • 14:02 arturo: T188994 upgrading trusty-tools packages in all the cluster, this includes jobutils, openssh-server and openssh-sftp-server

2018-03-07

2018-03-06

  • 16:15 madhuvishy: Reboot tools-docker-registry-02 T189018
  • 15:50 madhuvishy: Rebooting tools-worker-1011
  • 15:08 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1011.tools.eqiad.wmflabs
  • 15:03 arturo: drain and reboot tools-worker-1011
  • 15:03 chasemp: rebooted tools-worker 1001-1008
  • 14:58 arturo: drain and reboot tools-worker-1010
  • 14:27 chasemp: multiple tools running on k8s workers report issues reading replica.my.cnf file atm
  • 14:27 chasemp: reboot tools-worker-100[12]
  • 14:23 chasemp: downtime icinga alert for k8s workers ready
  • 13:21 arturo: T188994 in some servers there was some race in the dpkg lock between apt-upgrade and puppet. Also, I forgot to use DEBIAN_FRONTEND=noninteractive, so debconf prompts happened and stalled dpkg operations. Already solved, but some puppet alerts were produced
  • 12:58 arturo: T188994 upgrading packages in jessie nodes from the oldstable source
  • 11:42 arturo: clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoclean" <-- free space in filesystem
  • 11:41 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo DEBIAN_FRONTEND=noninteractive apt-get autoremove -y" <-- we did this in canary servers last week and it went fine, so now running it fleet-wide
  • 11:36 arturo: (ubuntu) removed linux-image-3.13.0-142-generic and linux-image-3.13.0-137-generic (T188911)
  • 11:33 arturo: removing unused kernel packages in ubuntu nodes
  • 11:08 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all "sudo rm /etc/apt/preferences.d/* ; sudo puppet agent -t -v" <--- rebuild directory, it contains stale files across all the cluster

2018-03-05

  • 18:56 zhuyifei1999_: also published jobutils_1.30_all.deb
  • 18:39 zhuyifei1999_: built and published misctools_1.30_all.deb T167026 T181492
  • 14:33 arturo: delete `linux-image-4.9.0-6-amd64` package from stretch instances for T188911
  • 14:01 arturo: deleting old kernel packages in jessie instances for T188911
  • 13:58 arturo: running `apt-get autoremove` with clush in all jessie instances
  • 12:16 arturo: apply role::toollabs::base to tools-paws prefix in horizon for T187193
  • 12:10 arturo: apply role::toollabs::base to tools-prometheus prefix in horizon for T187193

2018-03-02

  • 13:41 arturo: doing some testing with puppet classes in tools-package-builder-01 via horizon

2018-03-01

2018-02-27

  • 17:37 chasemp: add chico as admin to toolsbeta
  • 12:23 arturo: running `apt-get autoclean` in canary servers
  • 12:16 arturo: running `apt-get autoremove` in canary servers

2018-02-26

  • 19:17 chasemp: tools-clushmaster-01:~$ clush -w @all "sudo puppet agent --test"
  • 10:35 arturo: enable puppet in tools-proxy-01
  • 10:23 arturo: disable puppet in tools-proxy-01 for apt pinning tests

2018-02-25

  • 19:04 chicocvenancio: killed jobs in tools-bastion-03, wrote notice to tools owners' terminals

2018-02-23

  • 19:11 arturo: enable puppet in tools-proxy-01
  • 18:53 arturo: disable puppet in tools-proxy-01 for apt preferences testing
  • 13:52 arturo: deploying https://gerrit.wikimedia.org/r/#/c/413725/ across the fleet
  • 13:04 arturo: install apt-rdepends in tools-paws-master-01 which triggered some python libs to be upgraded

2018-02-22

  • 16:31 bstorm_: Enabled puppet on tools-static-12 as the test server

2018-02-21

  • 19:02 bstorm_: disabled puppet on tools-static-* pending change 413197
  • 18:15 arturo: puppet should be fine across the fleet
  • 17:24 arturo: another try: merged https://gerrit.wikimedia.org/r/#/c/413202/
  • 17:02 arturo: revert last change https://gerrit.wikimedia.org/r/#/c/413198/
  • 16:59 arturo: puppet is broken across the cluster due to last change
  • 16:57 arturo: deploying https://gerrit.wikimedia.org/r/#/c/410177/
  • 16:26 bd808: Rebooting tools-docker-registry-01, NFS mounts are in a bad state
  • 11:43 arturo: package upgrades in tools-webgrid-lighttpd-1401
  • 11:35 arturo: package upgrades in tools-package-builder-01 tools-prometheus-01 tools-static-10 and tools-redis-1001
  • 11:22 arturo: package upgrades in tools-mail, tools-grid-master, tool-logs-02
  • 10:51 arturo: package upgrades in tools-checker-01 tools-clushmaster-01 and tools-docker-builder-05
  • 09:18 chicocvenancio: killed io intensive tool job in bastion
  • 03:32 zhuyifei1999_: removed /data/project/.elasticsearch.ini, owned by root and mode 644, leaks the creds of /data/project/strephit/.elasticsearch.ini Might need to cycle it as well...

2018-02-20

  • 12:42 arturo: upgrading tools-flannel-etcd-01
  • 12:42 arturo: upgrading tools-k8s-etcd-01

2018-02-19

  • 19:13 arturo: upgrade all packages of tools-services-01
  • 19:02 arturo: move tools-services-01 from puppet3 to puppet4 (manual package upgrade). No issues detected.
  • 18:23 arturo: upgrade packages of tools-cron-01 from all channels (trusty-wikimedia, trusty-updates and trusty-tools)
  • 12:54 arturo: puppet run with clush to ensure puppet is back to normal after being broken due to duplicated python3 declaration

2018-02-16

  • 18:21 arturo: upgrading tools-proxy-01 and tools-paws-master-01, same as others
  • 17:36 arturo: upgrading oldstable, jessie-backports, jessie-wikimedia packages in tools-k8s-master-01 (excluding linux*, libpam*, nslcd)
  • 13:00 arturo: upgrades in tools-exec-14[01-10].eqiad.wmflabs were fine
  • 12:42 arturo: aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'DEBIAN_FRONTEND=noninteractive sudo apt-upgrade -u upgrade trusty-updates -y'
  • 11:58 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-wikimedia -y
  • 11:57 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade jessie-backports -y
  • 11:53 arturo: (10 exec canary nodes) aborrero@tools-clushmaster-01:~$ clush -q -w @exec-upgrade-canarys 'sudo apt-upgrade -u upgrade trusty-wikimedia -y'
  • 11:41 arturo: aborrero@tools-elastic-01:~$ sudo apt-upgrade -u -f exclude upgrade oldstable -y

2018-02-15

  • 13:54 arturo: cleanup ferm (deinstall) in tools-services-01 for T187435
  • 13:28 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-tools
  • 13:16 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-updates -y
  • 13:13 arturo: aborrero@tools-bastion-02:~$ sudo apt-upgrade -u upgrade trusty-wikimedia
  • 13:06 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-tools
  • 12:57 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-updates
  • 12:51 arturo: aborrero@tools-webgrid-generic-1401:~$ sudo apt-upgrade -u upgrade trusty-wikimedia

2018-02-14

  • 13:09 arturo: the reboot was OK, the server seems to be working and kubectl sees all the pods running in the deployment (T187315)
  • 13:04 arturo: reboot tools-paws-master-01 for T187315

2018-02-11

  • 01:28 zhuyifei1999_: `# find /home/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected: only /home/tr8dr, mode 0777 -> 0775
  • 01:21 zhuyifei1999_: `# find /data/project/ -maxdepth 1 -perm -o+w \! -uid 0 -exec chmod -v o-w {} \;` Affected tools: wikisource-tweets, gsociftttdev, dow, ifttt-testing, elobot. All mode 2777 -> 2775
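
The two `find ... -exec chmod` commands above change modes in place. A read-only preview along the same lines might look like the sketch below, assuming GNU find (for `-printf`); `list_world_writable` is a hypothetical name. Unlike the logged commands it adds `-mindepth 1`, so the top-level directory itself is never reported.

```shell
# Hypothetical helper: print "<octal mode> <path>" for each entry directly
# under DIR that is writable by "other", skipping entries owned by SKIP_UID
# (default 0, i.e. root, mirroring `\! -uid 0` in the logged commands).
list_world_writable() {
    find "$1" -mindepth 1 -maxdepth 1 -perm -o+w ! -uid "${2:-0}" -printf '%m %p\n'
}

# Example use (not executed here):
#   list_world_writable /data/project/
```

Recording the printed modes before running `chmod -v o-w` gives the same "2777 -> 2775" audit trail that the log entries note by hand.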

2018-02-09

  • 10:35 arturo: deploy https://gerrit.wikimedia.org/r/#/c/409226/ T179343 T182562 T186846
  • 06:15 bd808: Killed orphan processes owned by iabot, dupdet, and wsexport scattered across the webgrid nodes
  • 05:07 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1426
  • 05:06 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1411
  • 05:05 bd808: Killed 1 orphan php-fcgi process from jembot that was running on tools-webgrid-lighttpd-1409
  • 05:02 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1421 and pegging the cpu there
  • 04:56 bd808: Rescheduled 30 of the 60 tools running on tools-webgrid-lighttpd-1421 (T186830)
  • 04:39 bd808: Killed 4 orphan php-fcgi processes from jembot that were running on tools-webgrid-lighttpd-1417 and pegging the cpu there

2018-02-08

  • 18:38 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1002.tools.eqiad.wmflabs
  • 18:35 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade jessie-wikimedia -v
  • 18:33 arturo: aborrero@tools-worker-1002:~$ sudo apt-upgrade -u upgrade oldstable -v
  • 18:28 arturo: cordon & drain tools-worker-1002.tools.eqiad.wmflabs
  • 18:10 arturo: uncordon tools-paws-worker-1019. Package upgrades were OK.
  • 18:08 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stable -v
  • 18:06 arturo: aborrero@tools-paws-worker-1019:~$ sudo apt-upgrade upgrade stretch-wikimedia -v
  • 18:02 arturo: cordon tools-paws-worker-1019 to do some package upgrades
  • 17:29 arturo: repool tools-exec-1401.tools.eqiad.wmflabs. Package upgrades were OK.
  • 17:20 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-updates -vy
  • 17:15 arturo: aborrero@tools-exec-1401:~$ sudo apt-upgrade upgrade trusty-wikimedia -vy
  • 17:11 arturo: depool tools-exec-1401.tools.eqiad.wmflabs to do some package upgrades
  • 14:22 arturo: it was some kind of transient error. After a second puppet run across the fleet, all seems fine
  • 13:53 arturo: deploy https://gerrit.wikimedia.org/r/#/c/407465/ which is causing some puppet issues. Investigating.

2018-02-06

  • 13:15 arturo: deploy https://gerrit.wikimedia.org/r/#/c/408529/ to tools-services-01
  • 13:05 arturo: unpublish/publish trusty-tools repo
  • 13:03 arturo: install aptly v0.9.6-1 in tools-services-01 for T186539 after adding it to trusty-tools repo (self contained)

2018-02-05

  • 17:58 arturo: publishing/unpublishing trusty-tools repo in tools-services-01 to address T186539
  • 13:27 arturo: for the record, not a single warning or error (orange/red messages) in puppet in the toolforge cluster
  • 13:06 arturo: deploying fix for T186230 using clush

2018-02-03

  • 01:04 chicocvenancio: killed io intensive process in bastion-03 "vltools python3 ./broken_ref_anchors.py"

2018-01-31

  • 22:54 chasemp: add bstorm to sudoers as root

2018-01-29

  • 20:02 chasemp: add zhuyifei1999_ tools root for T185577
  • 20:01 chasemp: blast a puppet run to see if any errors are persistent

2018-01-28

  • 22:49 chicocvenancio: killed compromised session generating miner processes
  • 22:48 chicocvenancio: killed miner processes in tools-bastion-03

2018-01-27

  • 00:55 arturo: at tools-static-11 the kernel OOM killer stopped git gc at about 20% :-(
  • 00:25 arturo: (/srv is almost full) aborrero@tools-static-11:/srv/cdnjs$ sudo git gc --aggressive

2018-01-25

  • 23:47 arturo: fix last deprecation warnings in tools-elastic-03, tools-elastic-02, tools-proxy-01 and tools-proxy-02 by replacing by hand configtimeout with http_configtimeout in /etc/puppet/puppet.conf
  • 23:20 arturo: T179386 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
  • 05:25 arturo: deploying misctools and jobutils 1.29 for T179386

2018-01-23

  • 19:41 madhuvishy: Add bstorm to project admins
  • 15:48 bd808: Admin clean up; removed Coren, Ryan Lane, and Springle.
  • 14:17 chasemp: add me, arturo, chico to sudoers and removed marc

2018-01-22

  • 18:32 arturo: T181948 T185314 deploying jobutils and misctools v1.28 in the cluster
  • 11:21 arturo: puppet in the cluster is mostly fine, except for a couple of deprecation warnings, a conn timeout to services-01 and https://phabricator.wikimedia.org/T181948#3916790
  • 10:31 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v' <--- check again how the cluster is doing with puppet
  • 10:18 arturo: T181948 deploy misctools 1.27 in the cluster

2018-01-19

  • 17:32 arturo: T185314 deploying new version of jobutils 1.27
  • 12:56 arturo: the puppet status across the fleet seems good, only minor things like T185314 , T179388 and T179386
  • 12:39 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'

2018-01-18

  • 16:11 arturo: aborrero@tools-clushmaster-01:~$ sudo aptitude purge vblade vblade-persist runit (for something similar to T182781)
  • 15:42 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent -t -v'
  • 13:52 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -f 1 -w @all 'sudo facter | grep lsbdistcodename | grep trusty && sudo apt-upgrade trusty-wikimedia -v'
  • 13:44 chasemp: upgrade wikimedia packages on tools-bastion-05
  • 12:24 arturo: T178717 aborrero@tools-exec-1401:~$ sudo apt-upgrade trusty-wikimedia -v
  • 12:11 arturo: T178717 aborrero@tools-webgrid-generic-1402:~$ sudo apt-upgrade trusty-wikimedia -v
  • 11:42 arturo: T178717 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'

2018-01-17

  • 18:47 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'apt-show-versions | grep upgradeable | grep trusty-wikimedia' | tee pending-upgrades-report-trusty-wikimedia.txt
  • 17:55 arturo: aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo report-pending-upgrades -v' | tee pending-upgrades-report.txt
  • 15:15 andrewbogott: running purge-old-kernels on all Trusty exec nodes
  • 15:15 andrewbogott: repooling exec-manage tools-exec-1430.
  • 15:04 andrewbogott: depooling exec-manage tools-exec-1430. Experimenting with purge-old-kernels
  • 14:09 arturo: T181647 aborrero@tools-clushmaster-01:~$ clush -w @all 'sudo puppet agent --test'

2018-01-16

  • 22:01 chasemp: qstat -explain E -xml | grep 'name' | sed 's/<name>//' | sed 's/<\/name>//' | xargs qmod -cq
  • 21:54 chasemp: tools-exec-1436:~$ /sbin/reboot
  • 21:24 andrewbogott: repooled tools-exec-1420 and tools-webgrid-lighttpd-1417
  • 21:14 andrewbogott: depooling tools-exec-1420 and tools-webgrid-lighttpd-1417
  • 20:58 andrewbogott: depooling tools-exec-1412, 1415, 1417, tools-webgrid-lighttpd-1415, 1416, 1422, 1426
  • 20:56 andrewbogott: repooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
  • 20:46 andrewbogott: depooling tools-exec-1416, 1418, 1424, tools-webgrid-lighttpd-1404, tools-webgrid-lighttpd-1410
  • 20:46 andrewbogott: repooled tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
  • 20:33 andrewbogott: depooling tools-exec-1406, 1421, 1436, 1437, tools-webgrid-generic-1404, tools-webgrid-lighttpd-1409, tools-webgrid-lighttpd-1411, tools-webgrid-lighttpd-1418, tools-webgrid-lighttpd-1425
  • 20:20 andrewbogott: depooling tools-webgrid-lighttpd-1412 and tools-exec-1423 for host reboot
  • 20:19 andrewbogott: repooled tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 20:02 andrewbogott: depooling tools-exec-1409, 1410, 1414, 1419, 1427, 1428 tools-webgrid-generic-1401, tools-webgrid-lighttpd-1406
  • 20:00 andrewbogott: depooled and repooled tools-webgrid-lighttpd-1427 tools-webgrid-lighttpd-1428 tools-exec-1413 tools-exec-1442 for host reboot
  • 18:50 andrewbogott: switched active proxy back to tools-proxy-02
  • 18:50 andrewbogott: repooling tools-exec-1422 and tools-webgrid-lighttpd-1413
  • 18:34 andrewbogott: moving proxy from tools-proxy-02 to tools-proxy-01
  • 18:31 andrewbogott: depooling tools-exec-1422 and tools-webgrid-lighttpd-1413 for host reboot
  • 18:26 andrewbogott: repooling tools-exec-1404 and 1434 after host reboot
  • 18:06 andrewbogott: depooling tools-exec-1404 and 1434 for host reboot
  • 18:04 andrewbogott: repooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 17:48 andrewbogott: depooling tools-exec-1402, 1426, 1429, 1433, tools-webgrid-lighttpd-1408, 1414, 1424
  • 17:28 andrewbogott: disabling tools-webgrid-generic-1402, tools-webgrid-lighttpd-1403, tools-exec-1403 for host reboot
  • 17:26 andrewbogott: repooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 after host reboot
  • 17:08 andrewbogott: depooling tools-exec-1405, 1425, tools-webgrid-generic-1403, tools-webgrid-lighttpd-1401, 1405 for host reboot
  • 16:19 andrewbogott: repooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 after host reboot
  • 15:52 andrewbogott: depooling tools-exec-1401, 1407, 1408, 1430, 1431, 1432, 1435, 1438, 1439, 1441, tools-webgrid-lighttpd-1402, 1407 for host reboot
  • 13:35 chasemp: tools-mail almouked@ltnet.net 719 pending messages cleared
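
The 22:01 entry above clears error-state queues by pulling queue names out of `qstat -xml` output with `grep` and two `sed` calls. The extraction step can be sketched as a single `sed` filter; `extract_queue_names` is a hypothetical name, and this version matches only exact `<name>` elements, which may differ slightly from what the logged `grep 'name'` lets through.

```shell
# Hypothetical helper: read qstat XML on stdin and print the text of each
# <name>...</name> element, one per line. Lines without an exact <name> tag
# (for example <JB_name>) are ignored.
extract_queue_names() {
    sed -n 's:.*<name>\(.*\)</name>.*:\1:p'
}

# Example use (not executed here):
#   qstat -explain E -xml | extract_queue_names | xargs qmod -cq
```

Printing the names first, without the final `xargs qmod -cq`, shows which queue instances would have their error state cleared.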

2018-01-11

  • 20:33 andrewbogott: repooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
  • 20:33 andrewbogott: uncordoning tools-worker-1012 and tools-worker-1017
  • 20:06 andrewbogott: cordoning tools-worker-1012 and tools-worker-1017
  • 20:02 andrewbogott: depooling tools-exec-1411, tools-exec-1440, tools-webgrid-lighttpd-1419, tools-webgrid-lighttpd-1420, tools-webgrid-lighttpd-1421
  • 19:00 chasemp: reboot tools-worker-1015
  • 15:08 chasemp: reboot tools-exec-1405
  • 15:06 chasemp: reboot tools-exec-1404
  • 15:06 chasemp: reboot tools-exec-1403
  • 15:02 chasemp: reboot tools-exec-1402
  • 14:57 chasemp: reboot tools-exec-1401 again...
  • 14:53 chasemp: reboot tools-exec-1401
  • 14:46 chasemp: install meltdown kernel and reboot workers 1011-1016 as jessie pilot

2018-01-10

  • 15:14 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
  • 15:03 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1016`; do kubectl cordon $n; done
  • 14:41 chasemp: tools-clushmaster-01:~$ clush -w @k8s-worker "sudo puppet agent --disable 'chase rollout'"
  • 14:01 chasemp: tools-k8s-master-01:~# kubectl uncordon tools-worker-1001.tools.eqiad.wmflabs
  • 13:57 arturo: T184604 cleaned stalled log files that prevented logrotate from working. Triggered a couple of logrotate runs by hand in tools-worker-1020.tools.eqiad.wmflabs
  • 13:46 arturo: T184604 aborrero@tools-k8s-master-01:~$ sudo kubectl uncordon tools-worker-1020.tools.eqiad.wmflabs
  • 13:45 arturo: T184604 aborrero@tools-worker-1020:/var/log$ sudo mkdir /var/lib/kubelet/pods/bcb36fe1-7d3d-11e7-9b1a-fa163edef48a/volumes
  • 13:26 arturo: sudo kubectl drain tools-worker-1020.tools.eqiad.wmflabs
  • 13:22 arturo: empty by hand syslog and daemon.log files. They are so big that logrotate won't handle them
  • 13:20 arturo: aborrero@tools-worker-1020:~$ sudo service kubelet restart
  • 13:18 arturo: aborrero@tools-k8s-master-01:~$ sudo kubectl cordon tools-worker-1020.tools.eqiad.wmflabs for T184604
  • 13:13 arturo: detected low space in tools-worker-1020, big files in /var/log due to kubelet issue. Opened T184604

2018-01-09

  • 23:21 yuvipanda: paws new cluster master is up, re-adding nodes by executing same sequence of commands for upgrading
  • 23:08 yuvipanda: turns out the version of k8s we had wasn't recent enough to support easy upgrades, so destroy entire cluster again and install 1.9.1
  • 23:01 yuvipanda: kill paws master and reboot it
  • 22:54 yuvipanda: kill all kube-system pods in paws cluster
  • 22:54 yuvipanda: kill all PAWS pods
  • 22:53 yuvipanda: redo tools-paws-worker-1006 manually, since clush seems to have missed it for some reason
  • 22:49 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/init-worker.bash' to bring paws workers back up again, but as 1.8
  • 22:48 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/install-kubeadm.bash' to set up kubeadm on all paws worker nodes
  • 22:46 yuvipanda: reboot all paws-worker nodes
  • 22:46 yuvipanda: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
  • 22:46 madhuvishy: run clush -w tools-paws-worker-10[01-20] 'sudo bash /home/yuvipanda/kubeadm-bootstrap/remove-worker.bash' to completely destroy the paws k8s cluster
  • 21:17 chasemp: ...rush@tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable && sudo puppet agent --test"
  • 21:17 chasemp: tools-clushmaster-01:~$ clush -f 1 -w @k8s-worker "sudo puppet agent --enable --test"
  • 21:10 chasemp: tools-k8s-master-01:~# for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016 -e tools-worker-1028 -e tools-worker-1029 `; do kubectl uncordon $n; done
  • 20:55 chasemp: for n in `kubectl get nodes | awk '{print $1}' | grep -v -e tools-worker-1001 -e tools-worker-1016`; do kubectl cordon $n; done
  • 20:51 chasemp: kubectl cordon tools-worker-1001.tools.eqiad.wmflabs
  • 20:15 chasemp: disable puppet on proxies and k8s workers
  • 19:50 chasemp: clush -w @all 'sudo puppet agent --test'
  • 19:42 chasemp: reboot tools-worker-1010
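
The cordon and uncordon loops above select nodes with `kubectl get nodes | awk '{print $1}' | grep -v -e ...`. By default `kubectl get nodes` prints a header row, so that pipeline also emits the literal word `NAME`. A minimal sketch of the selection step that skips the header is shown below; `nodes_except` is a hypothetical name, and exclusions are matched against either the full node name or its short hostname rather than as substrings.

```shell
# Hypothetical helper: read `kubectl get nodes` output on stdin and print the
# node names, skipping the header row and any nodes named in the arguments.
nodes_except() {
    awk -v skip="$*" '
        BEGIN {
            n = split(skip, names, " ")
            for (i = 1; i <= n; i++) excluded[names[i]] = 1
        }
        NR > 1 {
            split($1, parts, ".")   # parts[1] is the short hostname
            if (!($1 in excluded) && !(parts[1] in excluded)) print $1
        }
    '
}

# Example use (not executed here):
#   for n in $(kubectl get nodes | nodes_except tools-worker-1001 tools-worker-1016); do
#       kubectl cordon "$n"
#   done
```

Running the selection on its own before wrapping it in the loop makes it easy to confirm that the canary nodes are really left out.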

2018-01-08

  • 20:34 madhuvishy: Restart kube services and uncordon tools-worker-1001
  • 19:26 chasemp: sudo service docker restart; sudo service flannel restart; sudo service kube-proxy restart on tools-proxy-02

2018-01-06

  • 00:35 madhuvishy: Run `clush -w @paws-worker -b 'sudo iptables -L FORWARD'`
  • 00:05 madhuvishy: Drain and cordon tools-worker-1001 (for debugging the dns outage)

2018-01-05

  • 23:49 madhuvishy: Run clush -w @k8s-worker -x tools-worker-1001.tools.eqiad.wmflabs 'sudo service docker restart; sudo service flannel restart; sudo service kubelet restart; sudo service kube-proxy restart' on tools-clushmaster-01
  • 16:22 andrewbogott: moving tools-worker-1027 to labvirt1015 (CPU balancing)
  • 16:01 andrewbogott: moving tools-worker-1017 to labvirt1017 (CPU balancing)
  • 15:32 andrewbogott: moving tools-exec-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
  • 15:18 andrewbogott: moving tools-exec-1411.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 15:02 andrewbogott: moving tools-exec-1440.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 14:47 andrewbogott: moving tools-webgrid-lighttpd-1421.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 14:25 andrewbogott: moving tools-webgrid-lighttpd-1420.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
  • 14:05 andrewbogott: moving tools-webgrid-lighttpd-1417.tools.eqiad.wmflabs to labvirt1015 (CPU balancing)
  • 13:46 andrewbogott: moving tools-webgrid-lighttpd-1419.tools.eqiad.wmflabs to labvirt1017 (CPU balancing)
  • 05:33 andrewbogott: migrating tools-worker-1012 to labvirt1017 (CPU load balancing)

2018-01-04

  • 17:24 andrewbogott: rebooting tools-paws-worker-1019 to verify repair of T184018

2018-01-03

Archives