Release Engineering/SAL: Difference between revisions

From Wikitech-static
Revision as of 02:42, 27 June 2015

June 27

  • 02:42 legoktm: deploying &
  • 02:36 Krinkle: Reloading Zuul to deploy
  • 02:22 Krinkle: Reloading Zuul to deploy
  • 02:15 legoktm: deploying
  • 01:56 legoktm: deploying &
  • 01:42 legoktm: deploying
  • 01:36 legoktm: deploying
  • 01:28 legoktm: deploying
  • 01:13 legoktm: deploying

June 26

  • 22:39 marxarelli: Reloading Zuul to deploy I3deec5e5a7ce7eee75268d0546eafb3e4145fdc7
  • 22:20 marxarelli: Reloading Zuul to deploy I7affe14e878d5c1fc4bcb4dfc7f2d1494cd795b7
  • 21:45 legoktm: deploying
  • 21:21 marxarelli: running `jenkins-jobs update` to deploy I7affe14e878d5c1fc4bcb4dfc7f2d1494cd795b7
  • 18:46 marxarelli: running `jenkins-jobs update '*bundle*'` to deploy Icb31cf57bee0483800b41a2fb60d236fcd2d004e

June 25

  • 23:38 legoktm: deploying
  • 21:21 thcipriani: updated deployment-salt to match puppet by rm /var/lib/git/operations/puppet/modules/cassandra per godog's instructions
  • 19:09 hashar: purged all WikidataQuality workspaces. Got renamed to WikibaseQuality*
  • 14:22 jzerebecki: reloading zuul for
  • 14:20 jzerebecki: killing a fellow's idle shell: zuul@gallium:~$ kill 13602
  • 11:03 hashar: Rebooting integration-raita and integration-vmbuilder-trusty
  • 11:01 hashar: Unmounting /data/project and /home NFS mounts from integration-raita and integration-vmbuilder-trusty
  • 10:45 hashar: deployment-sca02 deleted /var/lib/puppet/state/agent_catalog_run.lock from June 5th
  • 08:57 hashar: Fixed puppet "Can't dup Symbol" on deployment-pdf01 by deleting puppet, /var/lib/puppet and reinstalling it from scratch
  • 08:39 hashar: apt-get upgrade deployment-salt
  • 08:08 hashar: deployment-pdf01 deleted /var/log/ocg/ content. Last entry is from July 25th 2014 and puppet complains with File[/var/log/ocg]: Not removing directory; use 'force' to override
  • 08:04 hashar: apt-get upgrade deployment-pdf01
  • 06:37 Krinkle: Reloading Zuul to deploy
  • 06:33 Krinkle: Reloading Zuul to deploy

June 24

  • 19:31 hashar: rebooting deployment-cache-upload02
  • 19:28 hashar: fixing DNS puppet etc on deployment-cache-upload02
  • 19:24 hashar: rebooting deployment-zookeeper to get rid of the /home NFS
  • 19:06 hashar: beta: salt 'i-00*' "echo 'domain integration.eqiad.wmflabs\nsearch integration.eqiad.wmflabs eqiad.wmflabs\nnameserver\noptions timeout:5' > /etc/resolv.conf"
  • 19:06 hashar: fixing DNS / puppet and salt on i-000008d5.eqiad.wmflabs i-000002de.eqiad.wmflabs i-00000958.eqiad.wmflabs
  • 15:35 hashar: integration-dev recovered! puppet hasn't run for ages but caught up with changes
  • 15:13 hashar: removed /var/lib/puppet/state/agent_catalog_run.lock on integration-dev
  • 09:52 hashar: Java 6 removed from gallium / lanthanum and CI labs slaves.
  • 09:18 hashar: getting rid of java 6 on CI machines ( )
  • 07:58 hashar: Bah, puppet reenabled NFS on deployment-parsoidcache02 for some reason
  • 07:57 hashar: disabling NFS on deployment-parsoidcache02
  • 00:38 marxarelli: reloading zuul to deploy
  • 00:32 marxarelli: running `jenkins-jobs update` to create 'mwext-MobileFrontend-mw-selenium' with I7affe14e878d5c1fc4bcb4dfc7f2d1494cd795b7
  • 00:20 marxarelli: running `jenkins-jobs update` to create 'mediawiki-selenium-integration' with I7affe14e878d5c1fc4bcb4dfc7f2d1494cd795b7

June 23

  • 23:29 Krinkle: Reloading Zuul to deploy
  • 21:34 bd808: updated scap to 33f3002 (Ensure that the minimum batch size used by cluster_ssh is 1)
  • 19:53 legoktm: deleted broken renames from centralauth.renameuser_status on beta cluster
  • 18:28 jzerebecki: zuul reload for
  • 16:33 bd808: updated scap to da64a65 (Cast pid read from file to an int)
  • 16:20 bd808: updated scap to 947b93f (Fix reference to _get_apache_list)
  • 12:24 hashar: rebooting integration-labvagrant (stuck)
  • 00:07 legoktm: deploying

June 22

  • 22:23 legoktm: deploying
  • 21:47 bd808: scap emitting soft failures due to missing python-netifaces on deployment-videoscaler01; should be fixed by a current puppet run
  • 21:37 bd808: Updated scap to 81b7c14 (Move dsh group file names to config)
  • 14:58 hashar: disabled sshd MAC/KEX hardening on beta (was )
  • 14:32 hashar: restarting Jenkins
  • 14:30 hashar: Reenable sshd MAC/KEX hardening on beta by cherry picking
  • 13:17 moritzm: activated firejail service containment for graphoid, citoid and mathoid in deployment-sca
  • 11:07 hashar: fixing puppet on integration-zuul-server
  • 10:29 hashar: rebooted deployment-kafka02 to get rid of /home NFS share
  • 10:25 hashar: fixed puppet.conf on deployment-urldownloader
  • 10:20 hashar: enabled puppet agent on deployment-urldownloader
  • 10:05 hashar: removing puppet lock on deployment-elastic07 ( rm /var/lib/puppet/state/agent_catalog_run.lock )
  • 09:40 hashar: fixed puppet certificates on integration-lightslave-jessie-1002 by deleting the SSL certs
  • 09:31 hashar: can't reach integration-lightslave-jessie-1002, probably NFS related
  • 09:22 hashar: upgrading Jenkins gearman plugin from 0.1.1 to latest master (f2024bd).

June 21

June 20

June 19

  • 18:39 thcipriani: running `salt -b 2 '*' 'puppet agent -t'` from deployment salt to remount /data/projects
  • 18:36 thcipriani: added role::deployment::repo_config to deployment-prep hiera, to be removed after patched in ops/puppet
  • 16:48 thcipriani: primed keyholder on deployment-bastion
  • 15:35 hashar: nodepool manages to boot instances and ssh to them. Now attempting to add them as slave in Jenkins!

June 17

June 16

  • 15:55 bd808: Resolved rebase conflicts on deployment-salt caused by code review changes of prior to merge
  • 13:05 hashar: upgrading HHVM on CI trusty slaves salt -v -t 30 --out=json -C 'G@oscodename:trusty and *slave*' pkg.install pkgs='["hhvm","hhvm-dev","hhvm-fss","hhvm-luasandbox","hhvm-tidy","hhvm-wikidiff2"]'
  • 11:45 hashar: integration-slave-trusty-1021 downgrading hhvm plugins to match hhvm 3.3.1
  • 11:42 hashar: integration-slave-trusty-1021 downgrading hhvm, hhvm-dev from 3.3.6 to 3.3.1
  • 11:19 hashar: rebooting integration-dev, unreachable
  • 11:09 hashar: apt-get upgrade on integration-slave-trusty-1021
  • 08:19 hashar: rebooting integration-slave-jessie-1001, unreachable

June 15

June 13

June 12

June 11

June 10

  • 20:18 legoktm: deploying
  • 10:42 hashar: restarted jobchron/jobrunner on deployment-jobrunner01
  • 10:42 hashar: manually nuked and repopulated jobqueue:aggregator:s-wikis:v2 on deployment-redis01. It now only contains entries from all-labs.dblist
  • 09:46 hashar: deployment-videoscaler restarted jobchron
  • 08:19 mobrovac: reboot deployment-restbase01 due to ssh problems

June 9

  • 22:13 thcipriani: are we back?
  • 17:31 twentyafterfour: Branching 1.26wmf9
  • 17:10 hashar: restart puppet master on deployment-salt. Was overloaded with wait I/O since roughly 1am UTC
  • 16:56 hashar: restarted puppetmaster on deployment-salt

June 8

June 7

  • 20:43 Krinkle: Rebooting integration-slave-trusty-1015 to see if it comes back so we can inspect logs (T101658)
  • 20:16 Krinkle: Per Yuvi's advice, disabled "Shared project storage" (/data/project NFS mount) for the integration project. Mostly unused. Two existing directories were archived to /home/krinkle/integration-nfs-data-project/
  • 17:51 Krinkle: integration-slave-trusty-1012, trusty-1013 and 1015 unresponsive to pings or ssh. Other trusty slaves still reachable.

June 6

June 5

June 4

June 3

  • 23:31 Krinkle: Reloading Zuul to deploy
  • 20:49 hashar: restarted zuul entirely to remove some stalled jobs
  • 20:47 marxarelli: Reloading Zuul to deploy I96649bc92a387021a32d354c374ad844e1680db2
  • 20:28 hashar: Restarting Jenkins to release a deadlock
  • 20:22 hashar: deployment-bastion Jenkins slave is stalled again :-( No code update happening on beta cluster
  • 18:50 thcipriani: change use_dnsmasq: false for deployment-prep
  • 18:24 thcipriani: updating deployment-salt puppet in prep for use_dnsmasq=false
  • 11:58 kart_: Cherry-picked 213840 to test logstash
  • 10:08 hashar: Update JJB fork again f966521..4135e14 . Will remove the http notification to zuul {{bug:T93321}}. REFRESHING ALL JOBS!
  • 10:03 hashar: Further updated JJB fork c7231fe..f966521
  • 09:10 hashar: Refreshing almost all jenkins jobs to take into account the Jenkins Git plugin upgrade
  • 03:07 Krinkle: Reloading Zuul to deploy

June 2

  • 20:58 bd808: redis-cli srem "deploy:scap/scap:minions" i-000002f4.eqiad.wmflabs
  • 20:54 bd808: deleted unused deployment-rsync01 instance
  • 20:49 bd808: Updated scap to 62d5cb2 (Lint JSON files)
  • 20:40 marxarelli: cherry-picked on integration-puppetmaster
  • 20:38 marxarelli: manually rebased operations/puppet on integration-puppetmaster to fix empty commit from cherry-pick
  • 17:01 hashar: updated JJB fork to e3199d9..c7231fe
  • 15:16 hashar: updated integration/jenkins-job-builder to e3199d9
  • 13:16 hashar: restarted deployment-salt

June 1

  • 08:18 hashar: Jenkins: upgrading git plugin from 1.5.0 to latest

May 31

May 29

May 28

  • 20:50 bd808: Ran "del jobqueue:aggregator:h-ready-queues:v2" on deployment-redis01
  • 13:46 hashar: upgrading Jenkins git plugin from 1.4.6+wmf1 to 1.7.1 bug T100655 and restarting Jenkins

May 27

  • 15:09 hashar: Jenkins slaves are all back up. Root cause was some ssh algorithm in their sshd which is not supported by Jenkins jsch embedded lib.
  • 14:30 hashar: manually rebasing puppet git on deployment-salt (stalled)
  • 14:27 hashar: restarting deployment-salt / some process is 100% wa/IO
  • 13:38 hashar: restarted integration puppetmaster (memory leak)
  • 13:35 hashar: integration-puppetmaster apparently out of memory
  • 13:30 hashar: All Jenkins slaves are disconnected due to some ssh error. CI is down.

May 24

May 23

May 20

  • 17:19 thcipriani|afk: add --fail to curl inside mwext-Wikibase-qunit jenkins job
  • 15:59 bd808: Applied role::beta::puppetmaster on deployment-salt to get Puppet logstash reports back

May 19

  • 02:54 bd808: Primed keyholder agent via `sudo -u keyholder env SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add /etc/keyholder.d/mwdeploy_rsa`
  • 02:40 Krinkle: deployment-bastion.eqiad magically back online and catching up jobs, though failing due to T99644
  • 02:36 Krinkle: Jenkins is unable to launch slave agent on deployment-bastion.eqiad. Using "Jenkins Script Console" throws HTTP 503.
  • 02:30 Krinkle: Various beta-mediawiki-config-update-eqiad jobs have been stuck for over 13 hours.

May 12

May 11

  • 22:50 legoktm: deploying
  • 22:29 bd808: removed duplicate local group l10nupdate from deployment-bastion that was shadowing the ldap group of the same name
  • 22:24 bd808: removed duplicate local group mwdeploy from deployment-bastion that was shadowing the ldap group of the same name
  • 22:15 bd808: Removed role::logging::mediawiki from deployment-bastion
  • 20:55 legoktm: deleted operations-puppet-tox-py27 workspace on integration-slave-precise-1012, it was corrupt (fatal: loose object b48ccc3ef5be2d7252eb0f0f417f1b5b7c23fd5f (stored in .git/objects/b4/8ccc3ef5be2d7252eb0f0f417f1b5b7c23fd5f) is corrupt)
  • 13:54 hashar: Jenkins: removing label hasContintPackages from production slaves, it is no longer needed :)

May 9

May 8

  • 23:59 bd808: Created /data/project/logs/WHERE_DID_THE_LOGS_GO.txt to point folks to the right places
  • 23:54 bd808: Switched MediaWiki debug logs to deployment-fluorine:/srv/mw-log
  • 20:05 bd808: Cherry-picked
  • 18:15 bd808: Cherry-picked
  • 05:14 bd808: apache2 access logs now only locally on instances in /var/log/apache2/other_vhosts_access.log; error log in /var/log/apache2.log and still relayed to deployment-bastion and logstash (works like production now)
  • 04:49 bd808: Symbolic link not allowed or link target not accessible: /srv/mediawiki/docroot/bits/static/master/extensions
  • 04:47 bd808: cherry-picked

May 7

  • 20:48 bd808: Updated kibana to bb9fcf6 (Merge remote-tracking branch 'upstream/kibana3')
  • 18:00 greg-g: brought deployment-bastion.eqiad back online in Jenkins (after Krinkle disconnected it some hours ago). Jobs are processing
  • 16:05 bd808: Updated scap to 5d681af (Better handling for php lint checks)
  • 14:05 Krinkle: deployment-bastion.eqiad has been stuck for 10 hours.
  • 14:05 Krinkle: For two days now, Jenkins has always returned the Wikimedia 503 error page after logging in. The login session itself is fine.
  • 05:02 legoktm: slaves are going up/down likely due to automated labs migration script

May 6

  • 15:13 bd808: Updated scap to 57036d2 (Update statsd events)

May 5

May 4

  • 23:50 hashar: restarted Jenkins (deadlock with deployment-bastion)
  • 23:49 hashar: restarted Jenkins
  • 22:50 hashar: Manually retriggering last change of operations/mediawiki-config.git with: zuul enqueue --trigger gerrit --pipeline postmerge --project operations/mediawiki-config --change 208822,1
  • 22:49 hashar: restarted Zuul to clear out a bunch of operations/mediawiki-config.git jobs
  • 22:20 hashar: restarting Jenkins from gallium :/
  • 22:18 thcipriani: jenkins restarted
  • 22:12 thcipriani: preparing jenkins for shutdown
  • 21:59 hashar: disconnected reconnected Jenkins Gearman client
  • 21:41 thcipriani: deployment-bastion still not accepting jobs from jenkins
  • 21:35 thcipriani: disconnecting deployment-bastion and reconnecting, again
  • 20:54 thcipriani: marking node deployment-bastion offline due to stuck jenkins execution lock
  • 19:03 legoktm: deploying
  • 17:46 bd808: integration-slave-precise-1014 died trying to clone mediawiki/core.git with "fatal: destination path 'src' already exists and is not an empty directory."

May 2

April 30

  • 19:26 Krinkle: Repooled integration-slave-trusty-1013. IP unchanged.
  • 19:00 Krinkle: Depooled integration-slave-trusty-1013 for labs maintenance (per andrewbogott)
  • 14:17 hashar: Jenkins: properly downgraded IRC plugin from 2.26 to 2.25
  • 13:40 hashar: Jenkins: downgrading IRC plugin from 2.26 to 2.25
  • 12:09 hashar: restarting Jenkins

April 29

April 28

  • 23:37 hoo: Ran foreachwiki extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --load-from '' --force-protocol http (because some sites are http only, although the sitematrix claims otherwise)
  • 23:33 hoo: Ran foreachwiki extensions/Wikidata/extensions/Wikibase/lib/maintenance/populateSitesTable.php --load-from '' to fix all sites tables
  • 23:18 hoo: Ran mysql> INSERT INTO sites (SELECT * FROM wikidatawiki.sites); on enwikinews to populate the sites table
  • 23:18 hoo: Ran mysql> INSERT INTO sites (SELECT * FROM wikidatawiki.sites); on testwiki to populate the sites table
  • 17:48 James_F: Restarting grrrit-wm for config change.
  • 16:24 bd808: Updated scap to ef15380 (Make scap localization cache build $TMPDIR aware)
  • 15:42 bd808: Freed 5G on deployment-bastion by deleting abandoned /tmp/scap_l10n_* directories
  • 14:01 marxarelli: reloading zuul to deploy
  • 00:17 greg-g: after the 3rd or so time doing it (while on the Golden Gate Bridge, btw) it worked
  • 00:11 greg-g: still nothing...
  • 00:10 greg-g: after disconnecting, marking temp offline, bringing back online, and launching slave agent: "Slave successfully connected and online"
  • 00:07 greg-g: deployment-bastion is idle, yet we have 3 pending jobs waiting for an executor on it - will disconnect/reconnect it in Jenkins

April 27

  • 21:45 bd808: Manually triggered beta-mediawiki-config-update-eqiad for zuul build df1e789c726ad4aae60d7676e8a4fc8a2f6841fb
  • 21:20 bd808: beta-scap-eqiad job green again after adding a /srv/ disk to deployment-jobrunner01
  • 21:08 bd808: Applied role::labs::lvm::srv on deployment-jobrunner01 and forced puppet run
  • 21:08 bd808: Deleted deployment-jobrunner01:/srv/* in preparation for applying role::labs::lvm::srv
  • 21:06 bd808: deployment-jobrunner01 missing role::labs::lvm::srv
  • 21:00 bd808: Root partition full on deployment-jobrunner01
  • 20:53 bd808: removed mwdeploy user from deployment-bastion:/etc/passwd
  • 20:15 Krinkle: Relaunched Gearman connection
  • 19:53 Krinkle: Jenkins unable to re-create Gearman connection. (HTTP 503 error from /configure). Have to force restart Jenkins
  • 17:32 Krinkle: Relauch slave agent on deployment-bastion
  • 17:31 Krinkle: Jenkins slave deployment-bastion deadlock waiting for executors

April 26

  • 06:09 thcipriani|afk: rm scap l10n files from /tmp on deployment-bastion; root partition 100% again...

April 25

  • 16:00 thcipriani|afk: manually ran logrotate on deployment-jobrunner01, root partition at 100%
  • 15:16 thcipriani|afk: clear /tmp/scap files on deployment-bastion, root partition at 100%

April 24

  • 18:01 thcipriani: ran sudo chown -R mwdeploy:mwdeploy /srv/mediawiki on deployment-bastion to fix beta-scap-eqiad, hopefully
  • 17:26 thcipriani: remove deployment-prep from domain in /etc/puppet/puppet.conf on deployment-stream, puppet now OK
  • 17:20 thcipriani: rm stale lock on deployment-rsync01, puppet fine
  • 17:10 thcipriani: gzip /var/log/account/pacct.0 on deployment-bastion: ought to revisit logrotate on that instance.
  • 17:00 thcipriani: rm stale /var/lib/puppet/state/agent_catalog_run.lock on deployment-kafka02
  • 09:56 hashar: restarted mysql on both deployment-db1 and deployment-db2. The service is apparently not started on instance boot.
  • 09:08 hashar: beta: manually rebased operations/puppet.git
  • 08:43 hashar: Enabling puppet on deployment-eventlogging02.eqiad.wmflabs bug T96921

April 23

  • 06:11 Krinkle: Running git-cache-update inside screen on integration-slave-trusty-1021 at /mnt/git
  • 06:11 Krinkle: integration-slave-trusty-1021 stays depooled (see T96629 and T96706)
  • 04:35 Krinkle: Reloading Zuul to deploy and
  • 00:29 bd808: cherry-picked and applied (logstash: Convert $::realm switches to hiera)
  • 00:17 bd808: beta cluster fatal monitor full of "Bad file descriptor: AH00646: Error writing to /data/project/logs/apache-access.log"
  • 00:03 bd808: cleaned up redis leftovers on deployment-logstash1

April 22

  • 23:57 bd808: cherry-picked and applied (remove redis from logstash)
  • 23:33 bd808: reset deployment-salt:/var/lib/git/operations/puppet HEAD to production; forced update with upstream; re-cherry-picked I46e422825af2cf6f972b64e6d50040220ab08995
  • 23:28 bd808: deployment-salt:/var/lib/git/operations/puppet in detached HEAD state; looks to be for cherry pick of I46e422825af2cf6f972b64e6d50040220ab08995 ?
  • 21:40 thcipriani: restarted mariadb on deployment-db{1,2}
  • 20:20 thcipriani: gzipped /var/log/pacct.0 on deployment-bastion
  • 19:50 hashar: zuul/jenkins are back up (blame Jenkins)
  • 19:40 hashar: reenabling Jenkins gearman client
  • 19:30 hashar: Gearman went back. Reenabling Jenkins as a Gearman client
  • 19:27 hashar: Zuul gearman is stalled. Disabling Jenkins gearman client to free up connections
  • 17:58 Krinkle: Creating integration-slave-trusty-1021 per T96629 (using ci1.medium type)
  • 14:34 hashar: beta: failures on instances are due to them being moved to different openstack compute nodes (virt***)
  • 13:51 jzerebecki: integration-slave-trusty-1015:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-qunit/src/node_modules
  • 12:48 hashar: beta: Andrew B. starting to migrate beta cluster instances on new virt servers
  • 11:34 hashar: integration: apt-get upgrade on integration-slave-trusty* instances
  • 11:31 hashar: integration: Zuul package has been uploaded for Trusty! Deleting the .deb from /home/hashar/

April 21

April 20

  • 23:34 legoktm: deploying
  • 19:20 legoktm: mediawiki-extensions-hhvm workspace on integration-slave-trusty-1011 had bad lock file, wiping
  • 16:10 hashar: deployment-salt kill -9 of puppetmaster processes
  • 16:08 hashar: deployment-salt: killed git-sync-upstream; netcat to labmon1001.eqiad.wmnet 8125 was eating all memory
  • 16:04 hashar: beta: manually rebasing operations/puppet on deployment-salt . Might have killed some live hack in the process :/
  • 13:58 hashar: In Gerrit, hid the integration/jenkins-job-builder-config and integration/zuul-config historical repositories. Suggested by addshore on {{bug:T96522}}
  • 03:39 legoktm: deploying

April 19

April 18

April 17

  • 17:52 Krinkle: Reloading Zuul to deploy
  • 17:45 Krinkle: Creating integration-slave-trusty-1017
  • 16:29 Krinkle: Reloading Zuul to deploy
  • 16:00 Krinkle: Reloading Zuul to deploy
  • 12:42 hashar: restarting Jenkins
  • 12:38 hashar: Switching zuul on lanthanum.eqiad.wmnet to the Debian package version
  • 12:14 hashar: Switching Zuul scheduler on to the Debian package version
  • 12:12 hashar: Jenkins: enabled plugin "ZMQ Event Publisher" and publishing all jobs result on TCP port 8888
  • 05:37 legoktm: deploying
  • 01:11 Krinkle: Repool integration-slave-precise-1013 and integration-slave-trusty-1015 (live hack with libeatmydata enabled for mysql; T96308)

April 16

  • 22:08 Krinkle: Rebooting integration-slave-precise-1013 (depooled; experimenting with libeatmydata)
  • 22:07 Krinkle: Rebooted integration-slave-trusty-1015 (experimenting with libeatmydata)
  • 18:31 Krinkle: Rebooting integration-slave-precise-1012 and integration-slave-trusty-1012
  • 17:57 Krinkle: Repooled instances. Conversion of mysql.datadir to tmpfs complete.
  • 17:22 Krinkle: Gracefully depool integration slaves to deploy (T96230)
  • 14:35 thcipriani: running dpkg --configure -a on deployment-bastion to correct puppet failures

April 15

  • 23:21 Krinkle: beta-update-databases-eqiad stuck waiting for executors on a node that has plenty of executors available
  • 21:15 hashar: Jenkins browser test jobs sometime deadlock because of the IRC notification plugin
  • 20:34 hashar: hard restarting Jenkins
  • 19:24 Krinkle: Aborting browser tests jobs. Stuck for over 5 hours.
  • 19:24 Krinkle: Aborting beta-scap-eqiad. Has been stuck for 2 hours on "Notifying IRC" after "Connection time out" from scap.
  • 08:22 hashar: restarted Jenkins
  • 08:20 hashar: Exception in thread "RequestHandlerThread[#2]" java.lang.OutOfMemoryError: Java heap space
  • 08:16 hashar: Jenkins process went wild, keeping all CPUs busy on gallium

April 14

  • 20:43 legoktm: starting SULF on beta cluster
  • 20:42 marktraceur: stopping all beta jobs, aborting running (and stuck) beta DB update, kicking bastion, to try and get beta to update
  • 19:49 Krinkle: All systems go.
  • 19:48 Krinkle: Jenkins configuration panel won't load ("Loading..." stays indefinitely; "Uncaught TypeError: Cannot convert to object at prototype.js:195")
  • 19:46 Krinkle: Jenkins restarted. Relaunching Gearman
  • 19:42 Krinkle: Jenkins still unable to obtain Gearman connection. (HTTP 503 error from /configure). Have to force restart Jenkins.
  • 19:42 Krinkle: deployment-bastion jobs were stuck. marktraceur cancelled queue and relaunched slave. Now processing again.
  • 15:27 Krinkle: puppetmaster: Re-apply I05c49e5248cb operations/puppet patch to re-fix T91524. Somehow the patch got lost.
  • 08:46 hashar: does qa-morebots work?

April 13

  • 20:14 Krinkle: Restarting Zuul, Jenkins and aborting all builds. Everything got stuck following NFS outage in lab
  • 19:28 Krinkle: Restarting Zuul, Jenkins and aborting all builds. Everything crashed following NFS outage in labs
  • 17:01 legoktm: deploying
  • 13:56 Krinkle: Delete old integration-slave1001...1004 (T94916)
  • 10:43 hashar: reducing number of executors on Precise instances from 5 to 4 and on Trusty instances from 6 to 4. The Jenkins scheduler tends to assign the unified jobs to the same slave, which overloads a single slave while others idle.
  • 10:43 hashar: reducing number of executors from 5 to 4
  • 08:46 hashar: jenkins removed #wikimedia-qa IRC channel from the global configuration
  • 08:42 hashar: kill -9 jenkins because it was stuck in some deadlock related to the IRC plugin :(
  • 08:34 zeljkof: restarting stuck Jenkins

April 12

  • 23:58 bd808: sudo ln -s /srv/l10nupdate/mediawiki /var/lib/l10nupdate/mediawiki on deployment-bastion
  • 23:11 greg-g: 0bytes left on /var on deployment-bastion

April 11

April 10

  • 13:50 Krinkle: Pool integration-slave-precise-1012..integration-slave-precise-1014
  • 11:43 hashar: Filed to migrate "Global-Dev Dashboard Data" to JJB/Zuul
  • 11:40 Krinkle: Deleting various jobs from Jenkins that can be safely deleted (no longer in jjb-config). Will report the others to T91410 for inspection.
  • 11:29 Krinkle: Fixed job "Global-Dev Dashboard Data" to be restricted to node "gallium" because it fails to connect to from lanthanum 1/2 builds.
  • 11:26 Krinkle: Re-established Gearman connection from Jenkins
  • 11:20 Krinkle: Jenkins unable to re-establish Gearman connection. Full restart.
  • 10:39 Krinkle: Deleting the old integration1401...integration1405 instances. They've been depooled for 24h and their replacements are OK. This is to free up quota to create new Precise instances.
  • 10:35 Krinkle: Creating integration-slave-precise-1012...integration-slave-precise-1014
  • 10:31 Krinkle: Pool integration-slave-precise-1011
  • 09:02 hashar: integration: Refreshed Zuul packages under /home/hashar
  • 08:57 Krinkle: Fixed puppet failure for missing Zuul package on integration-dev by applying

April 9

  • 19:50 legoktm: deployed
  • 17:20 Krinkle: Creating integration-slave-precise-1011
  • 17:11 Krinkle: Depool integration-slave1402...integration-slave1405
  • 16:52 Krinkle: Pool integration-slave-trusty-1011...integration-slave-trusty-1016
  • 16:00 hashar: integration-slave-jessie-1001 recreated. Applying role::ci::slave::labs to it, which should also bring in the package builder role under /mnt/pbuilder
  • 15:32 thcipriani: added mwdeploy_rsa to keyholder agent.sock via chmod 400 /etc/keyholder.d/mwdeploy_rsa && SSH_AUTH_SOCK=/run/keyholder/agent.sock ssh-add /etc/keyholder.d/mwdeploy_rsa && chmod 440 /etc/keyholder.d/mwdeploy_rsa; permissions in puppet may be wrong?
  • 14:24 hashar: deleting integration-slave-jessie-1001 extended disk is too small
  • 13:14 hashar: integration-zuul-packaged applied role::labs::lvm::srv
  • 13:01 hashar: integration-zuul-packaged applied zuul::merger and zuul::server
  • 12:59 Krinkle: Creating integration-slave-trusty-1011 - integration-slave-trusty-1016
  • 12:40 hashar: spurts out Permission denied (publickey).
  • 12:39 hashar: is still broken :-(
  • 12:31 hashar: beta: reset hard of operations/puppet repo on the puppetmaster since it has been stalled for 9+days
  • 10:46 hashar: repacked extensions in deployment-bastion staging area: find /mnt/srv/mediawiki-staging/php-master/extensions -maxdepth 2 -type f -name .git -exec bash -c 'cd `dirname {}` && pwd && git repack -Ad && git gc' \;
  • 10:31 hashar: deployment-bastion has a lock file remaining /mnt/srv/mediawiki-staging/php-master/extensions/.git/refs/remotes/origin/master.lock
  • 09:55 hashar: restarted Zuul to clear out some stalled jobs
  • 09:35 Krinkle: Pooled integration-slave-trusty-1010
  • 08:59 hashar: rebooted deployment-bastion and cleared some files under /var/
  • 08:51 hashar: deployment-bastion is out of disk space on /var/  :(
  • 08:50 hashar: timed out after 30 minutes while trying to git pull
  • 08:50 hashar: job stalled for some reason
  • 06:15 legoktm: deploying
  • 06:02 legoktm: deploying
  • 05:11 legoktm: deleted core dumps from integration-slave1002, /var had filled up
  • 04:36 legoktm: deploying
  • 00:32 legoktm: deploying

April 8

  • 21:56 legoktm: deploying
  • 21:15 legoktm: deleting non-existent jobs' workspaces on labs slaves
  • 19:09 Krinkle: Re-establishing Gearman-Jenkins connection
  • 19:00 Krinkle: Restarting Jenkins
  • 19:00 Krinkle: Jenkins Master unable to re-establish Gearman connection
  • 19:00 Krinkle: Zuul queue is not being distributed properly. Many slaves are idling waiting to receive builds but not getting any.
  • 18:29 Krinkle: Another attempt at re-creating the Trusty slave pool (T94916)
  • 18:07 legoktm: deploying and
  • 18:01 Krinkle: Jobs for Precise slaves are not starting. Stuck in Zuul as 'queued'. Disconnected and restarted slave agent on them. Queue is back up now.
  • 17:36 legoktm: deployed
  • 13:32 hashar: Disabled Zuul install based on git clone / by cherry picking . Installed the Zuul debian package on all slaves
  • 13:31 hashar: integration: running apt-get upgrade on Trusty slaves
  • 13:30 hashar: integration: upgrading python-gear and python-six on Trusty slaves
  • 12:43 hasharLunch: Zuul is back and it is nasty
  • 12:24 hasharLunch: killed zuul on gallium :/

April 7

  • 16:26 Krinkle: git-deploy: Deploying integration/slave-scripts 4c6f541
  • 12:57 hashar: running apt-get upgrade on integration-slave-trusty* hosts
  • 12:45 hashar: recreating integration-slave-trusty-1005
  • 12:26 hashar: deleting integration-slave-trusty-1005 has been provisioned with role::ci::website instead of role::ci::slave::labs
  • 12:11 hashar: retriggering a bunch of browser tests hitting
  • 12:07 hashar: Puppet being fixed, it is finishing the installation of integration-slave-trusty-*** hosts
  • 12:03 hashar: Browser tests against beta cluster were all failing due to an improper DNS resolver being applied on CI labs instances bug T95273. Should be fixed now.
  • 12:00 hashar: running puppet on all integration machines and resigning puppet client certs
  • 11:31 hashar: integration-puppetmaster is back and operational with local puppet client working properly.
  • 11:28 hashar: restored /etc/puppet/fileserver.conf
  • 11:08 hashar: dishing out puppet SSL configuration on all integration nodes. Can't figure it out, so let's restart from scratch
  • 10:52 hashar: made puppetmaster certname = integration-puppetmaster.eqiad.wmflabs instead of the ec2 id :(
  • 10:49 hashar: manually hacking integration-puppetmaster /etc/puppet/puppet.conf config file which is missing the [master] section
  • 09:37 hashar: integration project has been switched to a new labs DNS resolver ( ). It is missing the dnsmasq hack to resolve beta cluster URLs to the instance IP instead of the public IP. Causes a wide range of jobs to fail.
  • 01:25 Krinkle: Reloading Zuul to deploy

April 6

April 5

  • 11:13 Krinkle: New integration-slave-trusty-1001..1005 must remain unpooled. Provisioning failed. details at
  • 10:48 Krinkle: Puppet on integration-puppetmaster has been failing for the past 2 days: "Failed when searching for node i-0000063a.eqiad.wmflabs: You must set the 'external_nodes' parameter to use the external node terminus" (=integraton-dev.eqiad.wmflabs)
  • 10:22 Krinkle: Creating integration-slave-trusty-1001-1005 per T94916.

April 3

  • 23:47 greg-g: for Krinkle 23:31 "Finished npm upgrade on trusty slaves."
  • 23:08 Krinkle: Finished npm upgrade on precise slaves. Rolling trusty slaves now.
  • 22:55 bd808: Updated scap to a1a5235 (Add a logo banner to scap)
  • 21:31 Krinkle: Upgrading npm from v2.4.1 to v2.7.6 (rolling, slave by slave graceful)
  • 21:11 ^d: puppet re-enabled on staging-palladium, running fine again
  • 21:05 Krinkle: Delete unfinished/unpooled instances integration-slave-precise-1011-1014. (T94916)
  • 14:49 hashar: integration-slave-jessie-1001 : manually installed jenkins-debian-glue Debian packages. It is pending upload by ops to bug T95006
  • 12:56 hashar: installed zuul_2.0.0-304-g685ca22-wmf1precise1_amd64.deb on integration-slave-precise-101* instances
  • 12:56 hashar: installed zuul_2.0.0-304-g685ca22-wmf1precise1_amd64.deb on integration-slave-precise-1011.eqiad.wmflabs
  • 12:35 hashar: Switching Jessie slave from role::ci::slave::labs::common to role::ci::slave::labs which will bring a whole lot of packages and break
  • 12:28 hashar: integration-slave-jessie-1001 applying role::ci::slave::labs::common to pool it as a very basic Jenkins slave
  • 12:19 hashar: enabled puppetmaster::autosigner on integration-puppetmaster
  • 11:58 hashar: Applied role::ci::slave::labs on integration-slave-precise-101[1-4] that Timo created earlier
  • 11:58 hashar: Cherry picked a couple patches to fix puppet Package[] definitions issues
  • 11:49 hashar: made integration-puppetmaster to self update its puppet clone
  • 11:42 hashar: recreating integration-slave-precise-1011 stalled with a puppet oddity related to Package['gdb'] defined twice bug T94917
  • 11:30 hashar: integration-puppetmaster migrated down to Precise
  • 11:23 hashar: rebooting integration-publisher : cant ssh to it
  • 10:37 hashar: disabled some hiera configuration related to puppetmaster.
  • 10:22 hashar: Created instance i-00000a4a with image "ubuntu-12.04-precise" and hostname i-00000a4a.eqiad.wmflabs.
  • 10:21 hashar: downgrading integration-puppetmaster from Trusty to Precise
  • 05:42 legoktm: deploying
  • 03:58 Krinkle: Jobs were throwing NOT_RECOGNISED. Relaunched Gearman. Jobs are now happy again.
  • 03:51 Krinkle: Jenkins is unable to re-establish Gearman connection. Have to force restart Jenkins master.
  • 03:42 Krinkle: Reloading Jenkins config repaired the broken references. However Jenkins is still unable to make new references properly. New builds are 404'ing the same way.
  • 03:26 Krinkle: Reloading Jenkins configuration from disk
  • 03:18 Krinkle: Build metadata exists properly at /var/lib/jenkins/jobs/:jobname/builds/:nr, but the "last*Build" symlinks are outdated.
  • 03:12 Krinkle: As of 03:03, recent builds are mysteriously missing their entry in Jenkins. They show up on the dashboard when running, but their build log is never published (url is 404). E.g. and
  • 02:47 Krinkle: Reloading Zuul to deploy
  • 00:31 greg-g: rm 'd .gitignore in /srv/mediawiki-staging/php-master/skins due to clashing with a local untracked version

April 2

  • 22:56 Krinkle: New integration-slave-precise-101x are unfinished and must remain depooled. See T94916.
  • 22:53 Krinkle: Most puppet failures blocking T94916 may be caused by the fact that integration-puppetmaster was inadvertently changed to Trusty; the puppetmaster version on Trusty is not yet supported by ops
  • 21:41 Krinkle: It seems integration-slave-jessie-1001 has role::ci::slave::labs::common instead of role::ci::slave::labs. Intentional?
  • 21:25 Krinkle: Re-creating integration-dev-slave-precise in preparation of re-creating precise slaves
  • 14:51 hashar: applying role::ci::slave::labs::common on integration-slave-jessie-1001
  • 14:49 hashar: integration: nice thing, newly created instances are automatically made to point to integration-puppetmaster via hiera! Just have to sign the certificate on the master using: puppet ca list ; puppet ca sign i-000xxxx.eqiad.wmflabs
  • 14:42 hashar: Created integration-slave-jessie-1001 to try out CI slave on Jessie (phab:T94836)
  • 14:11 hashar: reduced integration-slave1004 executors from 6 to 5 to make it on par with the other precise slaves
  • 14:10 hashar: integration-slave100[1-4] are now using Zuul provided by a Debian package as of PS 16
  • 14:04 hashar: uninstall the pip installed zuul version from Precise labs slaves by doing: pip uninstall zuul && rm /usr/local/bin/zuul* . Switching them all to a Debian package
  • 13:45 hashar: pooling back integration-slave1001 and 1002 which are using zuul-cloner provided by a debian package
  • 13:35 hashar: reloading Jenkins configuration files from disk to make it know about a change manually applied to most jobs' config.xml files for
  • 13:01 Krinkle: Reloading Zuul to deploy
  • 12:19 hashar: preventing job to run on integration-slave1001 by replacing its label with 'DoNotLabelThisSlaveHashar'. Going to install Zuul debian package on it
  • 09:37 hashar: rebooting integration-zuul-server: homedir seems to be stalled/missing
  • 08:12 hashar: upgrading packages on integration-dev
  • 05:14 greg-g: and right when I log'd that, things seem to be recovering
  • 05:12 greg-g: the shinken alerts about beta cluster issues are due to wmflabs having issues.

April 1

  • 07:17 Krinkle: Creating integration-slave1410 as test. Will re-create our pool later today.
  • 06:26 Krinkle: Apply puppetmaster::autosigner to integration-puppetmaster
  • 05:51 legoktm: deleting non-existent job workspaces from integration slaves
  • 05:42 Krinkle: Free up space on integration-slave1001-1004 by removing obsolete phplint and qunit workspaces
  • 02:05 Krinkle: Restarting Jenkins again..
  • 01:35 legoktm: started zuul on gallium
  • 01:00 Krinkle: Restarting Jenkins
  • 01:00 Krinkle: Jenkins is unable to start Gearman connection (HTTP 503);
  • 01:00 Krinkle: Force restarted Zuul, didn't help
  • 00:55 Krinkle: Jenkins stuck. Builds are queued in Zuul but nothing is sent to Jenkins.

March 31

March 30

  • 22:58 legoktm: 1001-1003 were depooled, restarted and repooled. 1004 is depooled and restarted
  • 22:40 legoktm: rebooting precise jenkins slaves
  • 21:40 greg-g: Beta Cluster is down due to WMF Labs issues, being taken care of now (by Coren and Yuvi)
  • 19:53 legoktm: deleted core dumps from integration-slave1001
  • 19:11 legoktm: deploying
  • 16:29 jzerebecki: another damaged git repo integration-slave1001:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-qunit/src/vendor/
  • 16:07 jzerebecki: removing workspaces of deleted jobs integration-slave100{1,2,3,4}:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-{client,repo,repo-api}-tests{,@*}
  • 15:14 jzerebecki: integration-slave1001:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-repo-api-tests-sqlite
  • 15:05 jzerebecki: integration-slave1001:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-repo-api-tests-mysql/src/extensions/cldr
  • 14:36 jzerebecki: integration-slave1001:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-*-tests{,@*}
  • 13:06 jzerebecki: integration-slave1001:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-client-tests@*
  • 13:05 jzerebecki: integration-slave1001:~$ sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-Wikibase-client-tests

March 29

March 28

  • 04:02 bd808: manually updated beta-code-update-eqiad job to remove sudo to mwdeploy; needs associated jjb change for T94261

March 27

  • 23:28 bd808: applied beta::autoupdater directly to deployment-bastion via wikitech interface
  • 23:21 bd808: Duplicate declaration: Git::Clone[operations/mediawiki-config] is already declared in file /etc/puppet/modules/beta/manifests/autoupdater.pp:46; cannot redeclare at /etc/puppet/modules/scap/manifests/master.pp:22
  • 23:01 bd808: restarted puppetmaster
  • 22:52 hashar: integration: jzerebecki addition and sudo policy tracked for history purpose as bug T94280
  • 22:52 bd808: chown -R l10nupdate:wikidev /srv/mediawiki-staging/php-master/cache/l10n
  • 22:44 bd808: deployment-bastion: chown -R jenkins-deploy:wikidev /srv/mediawiki-staging/
  • 22:41 bd808: forcing puppet run on deployment-bastion
  • 22:41 bd808: cherry-picked and
  • 22:40 hashar: integration: created sudo policy allowing members to run any command as jenkins-deploy on all hosts.
  • 22:40 hashar: added jzerebecki to the integration labs project as a normal member
  • 22:34 hashar: integration-slave1001 rm -fR mwext-Wikibase-repo-api-tests/src/vendor
  • 21:13 greg-g: things be better
  • 20:56 greg-g: Beta Cluster is down, known
  • 18:50 marxarelli: running `jenkins-jobs update` to update 'browsertests-UploadWizard-*' with Id33ffde07f0c15e153d52388cf130be4c59b4559
  • 17:50 legoktm: deleted core dumps from integration-slave1002
  • 17:48 legoktm: marked integration-slave1002 as offline, /var filled up
  • 05:42 legoktm: marked integration-slave1001 as offline due to

March 26

  • 23:47 legoktm: deploying
  • 19:22 bd808: Manually added missing !log entries from 2015-03-25 from my bouncer logs
  • 17:14 greg-g: jobs appear to be processing according to zuul, the Jenkins UI just takes forever to load, apparently
  • 17:12 greg-g: "Please wait while Jenkins is getting ready to work"
  • 17:08 greg-g: 0:07 < robh> kill -9 and restarted per instructions
  • 16:53 greg-g: Still.... "Please wait while Jenkins is restarting..."
  • 16:49 greg-g: "Please wait while Jenkins is restarting..."
  • 16:39 greg-g: going to do a safe-restart of Jenkins
  • 16:38 greg-g: nothing executing on deployment-bastion, that is
  • 16:38 greg-g: same, nothing executing
  • 16:37 greg-g: did that checklist once, jobs still not executing, doing again
  • 16:32 greg-g: I'll start going through the checklist at
  • 16:30 hashar: deadlock on deployment-bastion slave. Someone need to restart Jenkins :(
  • 13:25 hashar: yamllint job fixed by altering the label
  • 13:17 hashar: Changes blocked because there is nothing able to run yamllint ( status|grep build:yamllint , shows 8 jobs pending and no worker available)
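The `status|grep build:yamllint` check in the 13:17 entry queries Gearman's plain-text admin protocol, whose status lines are tab-separated `function<TAB>queued<TAB>running<TAB>workers`. A minimal sketch of spotting the "jobs pending, no worker" symptom follows; the sample data is made up, and on the real server you would feed it from something like `echo status | nc localhost 4730` instead.

```shell
# List Gearman functions that have queued jobs but zero registered workers --
# the deadlock symptom hashar describes above. The two sample lines below are
# hypothetical; pipe real `status` output from the admin port in practice.
sample="$(printf 'build:yamllint\t8\t0\t0\nbuild:phplint\t0\t0\t3')"
printf '%s\n' "$sample" | awk -F'\t' '$2 > 0 && $4 == 0 {print $1}'
# prints build:yamllint
```

Any function printed by this filter will hold its pipeline until a worker registers or the function's jobs are cancelled.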

March 25

  • 23:23 bd808: chown -R jenkins-deploy:project-deployment-prep /srv/mediawiki-staging/php-master/cache/gitinfo
  • 23:14 bd808: chown -R l10nupdate:project-deployment-prep /srv/mediawiki-staging/php-master/cache/l10n
  • 23:14 bd808: chown -R l10nupdate:project-deployment-prep /srv/mediawiki-staging/php-master/cache/l10n
  • 23:04 bd808: chown -R mwdeploy:project-deployment-prep /srv/mediawiki-staging
  • 22:58 bd808: File permissions in deployment-bastion:/srv/mediawiki-staging are part mwdeploy:mwdeploy, part mwdeploy:project-deployment-prep, and part jenkins-deploy:project-deployment-prep
  • 21:52 legoktm: deploying
  • 18:49 legoktm: deploying
  • 15:13 bd808: Updated scap to include 4a63a63 (Copy l10n CDB files to rebuildLocalisationCache.php tmp dir)
  • 03:44 legoktm: deploying and
  • 00:52 Krinkle: Restarted Jenkins-Gearman connection
  • 00:50 Krinkle: Jenkins is unable to start Gearman connection (HTTP 503); Restarting Jenkins.
  • 00:32 legoktm: disabling/enabling gearman in jenkins

March 24

  • 23:32 Krinkle: Force restart Zuul
  • 22:25 hashar: marked gallium and lanthanum slaves as temp offline, then back. Seems to have cleared some Jenkins internal state and resumed the build
  • 21:55 bd808: Ran trebuchet for scap to keep cherry-pick of I01b24765ce26cf48d9b9381a476c3bcf39db7ab8 on top of active branch; puppet was forcing back to prior trebuchet sync tag
  • 21:42 hashar: Reconfigured mediawiki-core-code-coverage
  • 21:22 hashar: Zuul gate is deadlocked for up to half an hour due to change being force merged :(
  • 21:15 hashar: beta: deleted untracked file /srv/mediawiki-staging/php-master/extensions/.gitignore . That fixed the Jenkins job
  • 20:31 twentyafterfour: sudo ln -s /srv/l10nupdate/ /var/lib/
  • 20:31 twentyafterfour: sudo mv /var/lib/l10nupdate/ /srv/
  • 20:28 bd808: deployment-bastion -- rm -r pacct.1.gz pacct.2.gz pacct.3.gz pacct.4.gz pacct.5.gz pacct.6.gz
  • 20:24 bd808: Deleted junk in deployment-bastion:/tmp
  • 18:57 legoktm: deploying
  • 18:25 legoktm: deploying
  • 17:06 legoktm: deploying
  • 11:23 hashar: beta-scap-eqiad keeps regenerating l10n cache
  • 08:35 hashar: restarting Jenkins for some plugins upgrades
  • 08:07 legoktm: deployed
  • 07:21 legoktm: deploying
  • 07:17 legoktm: deploying
  • 07:08 legoktm: deploying
  • 06:46 legoktm: freed ~6G on lanthanum by deleting mediawiki-extensions-zend* workspaces
  • 05:04 legoktm: deleting workspaces of jobs that no longer exist in jjb on lanthanum
  • 04:11 legoktm: deploying
  • 03:14 Krinkle: Deleting old job workspaces on gallium not touched since 2013
  • 02:42 Krinkle: Restarting Zuul, wikimedia-fundraising-civicrm is stuck as of 46min ago waiting for something already merged
  • 02:32 legoktm: toggling gearman off/on in jenkins
  • 01:47 twentyafterfour: deployed scap/scap-sync-20150324-014257 to beta cluster
  • 00:23 Krinkle: Restarted Zuul

March 23

  • 23:18 hasharDinner: Stopping Jenkins for an upgrade
  • 23:16 legoktm: deleting mwext-*-lint* workspaces on gallium, shouldn't be needed
  • 23:11 legoktm: deleting mwext-*-qunit* workspaces on gallium, shouldn't be needed
  • 23:07 legoktm: deleting mwext-*-lint workspaces on gallium, shouldn't be needed
  • 23:00 legoktm: lanthanum is now online again, with 13G free disk space
  • 22:58 legoktm: deleting mwext-*-qunit* workspaces on lanthanum, shouldn't be needed any more
  • 22:54 legoktm: deleting mwext-*-qunit-mobile workspaces on lanthanum, shouldn't be needed any more
  • 22:48 legoktm: deleting mwext-*-lint workspaces on lanthanum, shouldn't be needed any more
  • 22:45 legoktm: took lanthanum offline in jenkins
  • 20:59 bd808: Last log copied from #wikimedia-labs
  • 20:58 bd808: 20:41 cscott deployment-prep updated OCG to version 11f096b6e45ef183826721f5c6b0f933a387b1bb
  • 19:28 YuviPanda: created staging-rdb01.eqiad.wmflabs
  • 19:19 YuviPanda: disabled puppet on staging-palladium to test a puppet patch
  • 18:41 legoktm: deploying
  • 13:11 hashar: and I restarted qa-morebots a minute or so ago (see )
  • 13:11 hashar: Jenkins: deleting unused jobs mwext-.*-phpcs-HEAD and mwext-.*-lint

March 21

March 20

March 19

March 18

  • 17:52 legoktm: deployed and
  • 17:27 legoktm: deployed
  • 15:20 hashar: setting gallium # of executors from 5 back to 3. When jobs run on it, they slow down the zuul scheduler and merger!
  • 15:06 legoktm: deployed
  • 02:02 bd808: Updated scap to I58e817b (Improved test for content preceeding <?php opening tag)
  • 01:48 marxarelli: memory usage, swap, io wait seem to be back to normal on deployment-salt and kill/start of puppetmaster
  • 01:45 marxarelli: kill 9'd puppetmaster processes on deployment-salt after repeated attempts to stop
  • 01:28 marxarelli: restarting salt master on deployment-salt
  • 01:20 marxarelli: deployment-salt still unresponsive, lots of io wait (94%) + swapping
  • 00:32 marxarelli: seeing heavy swapping on deployment-salt; puppet processes using 250M+ memory each
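The swap pressure marxarelli chased above can be quantified directly from /proc/meminfo. A minimal Linux-only sketch (field names are standard kernel ones; nothing here is specific to deployment-salt):

```shell
# Report swap usage as a percentage -- the symptom seen on deployment-salt
# above. Reads the kernel's SwapTotal/SwapFree counters (in kB).
awk '/^SwapTotal/ {t=$2} /^SwapFree/ {f=$2}
     END {if (t > 0) printf "swap used: %d%%\n", (t - f) * 100 / t;
          else print "no swap configured"}' /proc/meminfo
```

Pairing this with `vmstat 1` (watch the `wa` column for io wait) gives the same picture recorded in the 01:20 entry.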

March 17

  • 21:42 YuviPanda: recreated staging-sca01, let’s wait and see if it just automagically configures itself :)
  • 21:40 YuviPanda: deleted staging-sca01 because why not :)
  • 17:52 Krinkle: Reloading Zuul to deploy I206c81fe9bb88feda6
  • 16:28 bd808: Updated scap to include I61dcf7ae6d52a93afc6e88d3481068f09a45736d (Run rebuildLocalisationCache.php as www-data)
  • 16:25 bd808: chown -R trebuchet:wikidev && chmod -R g+rwX deployment-bastion:/srv/deployment/scap/scap
  • 16:16 YuviPanda: created staging-sca01
  • 14:39 hashar: me versus debian packaging tool chain
  • 09:24 hashar: deleted operations-puppet-validate
  • 09:21 hashar: deleted mwext-Wikibase-lint job, not triggered anymore

March 16

March 15

  • 07:39 legoktm: deleting non-generic, unused *-rubylint1.9.3lint & *-ruby2.0lint jobs
  • 00:56 Krinkle: Reload Zuul to deploy Idb2f15a94a67

March 14

March 13

March 12

  • 23:34 Krinkle: Depooling integration-slave1402 to play with T92351
  • 20:26 Krinkle: Re-established Gearman connection from Zuul due to deadlock
  • 17:39 YuviPanda: killed deployment-rsync01, wasn’t being used for anything discernible, and that’s not how proxies work in prod
  • 15:31 Krinkle: Reloading Zuul to deploy Ia289ebb0
  • 15:22 Krinkle: Fix Jenkins UI (was stuck in German)
  • 15:05 YuviPanda: jenkins loves german again
  • 07:11 YuviPanda: scap still failing on beta, I'll check when I'm back from lunch
  • 07:11 YuviPanda: rebooted puppetmaster, was dead

March 11

  • 19:47 legoktm: deployed
  • 15:11 Krinkle: Jenkins UI in German, again
  • 14:05 Krinkle: Jenkins web dashboard is in German
  • 11:02 hashar: created integration-zuul-packaged.eqiad.wmflabs to test out the Zuul debian package
  • 09:07 hashar: Deleted refs/heads/labs branch in integration/zuul.git
  • 09:01 hashar:
  • 09:01 hashar: made the Zuul clone on labs use the master branch instead of the labs one. There is no point in keeping separate ones anymore

March 10

  • 15:22 apergos: after update of salt in deployment-prep git deploy restart is likely broken. details;
  • 14:50 Krinkle: Browsertest job was stuck for > 10hrs. Jobs should not be allowed to run that long.

March 9

  • 23:57 legoktm: deployed
  • 22:49 Krinkle: Reloading Zuul to deploy I229d24c57d90ef
  • 20:37 legoktm: doing the gearman shuffle dance thing
  • 19:42 Krinkle: Reloading Zuul to deploy I48cb4db87
  • 19:35 Krinkle: Delete integration-slave1010
  • 19:31 Krinkle: Restarted slave agent on gallium
  • 19:30 Krinkle: Re-established Gearman connection from Jenkins

March 8

March 7

  • 22:10 legoktm: deployed
  • 14:44 Krinkle: Depool integration-slave1008 and integration-slave1010 (not deleting yet, just in case)
  • 14:43 Krinkle: Depool integration-slave1006 and integration-slave1007 (not deleting yet, just in case)
  • 14:41 Krinkle: Pool integration-slave1404
  • 14:35 Krinkle: Reloading Zuul to deploy I864875aa4acc
  • 06:28 Krinkle: Reloading Zuul to deploy I8d7e0bd315c4fc2
  • 04:53 Krinkle: Reloading Zuul to deploy I585b7f026
  • 04:51 Krinkle: Pool integration-slave1403
  • 03:55 Krinkle: Pool integration-slave1402
  • 03:31 Krinkle: Reloading Zuul to deploy I30131a32c7f1
  • 02:59 James_F: Pushed Ib4f6e9 and Ie26bb17 to grrrit-wm and restarted
  • 02:54 Krinkle: Reloading Zuul to deploy Ia82a0d45ac431b5

March 6

  • 23:30 Krinkle: Pool integration-slave1401
  • 22:24 Krinkle: Re-establishing Gearman connection from Jenkins (deployment-bastion was deadlocked)
  • 22:16 Krinkle: beta-scap-eqiad has been waiting for 50 minutes for an executor on deployment-bastion.eqiad (which has 5/5 slots idle)
  • 21:36 Krinkle: Provisioning integration-slave1401 - integration-slave1404
  • 20:14 legoktm: deployed for reals this time
  • 20:12 legoktm: deployed
  • 18:22 ^d: staging: set has_ganglia to false in hiera
  • 16:57 legoktm: deployed
  • 16:40 Krinkle: Jenkins auto-depooled integration-slave1008 due to low /tmp space. Purged /tmp/npm-* to bring back up.
  • 16:27 Krinkle: Delete integration-slave1005
  • 09:17 hasharConf: Jenkins: upgrading and restarting. Wish me luck.
  • 06:29 Krinkle: Re-creating integration-slave1401 - integration-slave1404
  • 02:21 legoktm: deployed
  • 02:12 Krinkle: Pooled integration-slave1405
  • 01:52 legoktm: deployed
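Krinkle's 16:40 entry above purged `/tmp/npm-*` to bring integration-slave1008 back up. A sketch of that cleanup follows, demonstrated on a scratch directory so it is safe to run anywhere; on a real slave the target would be /tmp itself.

```shell
# Remove leftover npm temp directories (the /tmp/npm-* clutter that gets
# slaves auto-depooled, per the log above). Uses a scratch dir for the demo.
TMPDIR_DEMO="$(mktemp -d)"
mkdir -p "$TMPDIR_DEMO/npm-1234-abc" "$TMPDIR_DEMO/npm-5678-def"
touch "$TMPDIR_DEMO/keep.txt"
# The actual cleanup: top-level entries matching npm-* only.
find "$TMPDIR_DEMO" -maxdepth 1 -name 'npm-*' -exec rm -rf {} +
ls "$TMPDIR_DEMO"   # prints keep.txt
```

`-maxdepth 1` keeps the delete from descending into unrelated trees, and `-exec rm -rf {} +` handles the directories npm leaves behind.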

March 5

  • 22:01 Krinkle: Reloading Zuul to deploy I97c1d639313b
  • 21:15 hashar: stopping Jenkins
  • 21:08 hashar: killing browser tests running
  • 20:48 Krinkle: Re-establishing Gearman connection from Jenkins
  • 20:44 Krinkle: Deleting integration-slave1201-integration-slave1204, and integration-slave1401-integration-slave1404.
  • 20:18 Krinkle: Finished creation and provisioning of integration-slave1405
  • 19:34 legoktm: deploying, lots of new jobs
  • 18:50 Krinkle: Re-creating integration-slave1405
  • 17:52 twentyafterfour: pushed wmf/1.25wmf20 branch to submodule repos
  • 16:18 greg-g: now there are jobs running on the zuul status page
  • 16:16 greg-g: getting "/zuul/status.json: Service Temporarily Unavailable" after the zuul restart
  • 16:12 ^d: restarted zuul
  • 16:06 greg-g: jenkins doesn't have anything queued and is processing jobs apparently, not sure why zuul is showing two jobs queued for almost 2 hours (one with all tests passing, the other with nothing tested yet)
  • 16:04 greg-g: not sure it helped
  • 16:02 greg-g: about to disconnect/reconnect gearman per
  • 00:34 legoktm: deployed

March 4

  • 17:34 Krinkle: Depooling all new integation-slave12xx and integration-slave14xx instances again (See T91524)
  • 17:11 Krinkle: Pooled integration-slave1201, integration-slave1202, integration-slave1203, integration-slave1204
  • 17:06 Krinkle: Pooled integration-slave1402, integration-slave1403, integration-slave1404, integration-slave1405
  • 16:56 Krinkle: Pooled integration-slave1401
  • 16:26 Krinkle: integration-slave12xx and integration-slave14xx are now provisioned. Old slaves will be depooled later and eventually deleted.

March 3

  • 22:00 hashar: reboot integration-puppetmaster in case it solves a NFS mount issue
  • 20:33 legoktm: manually created centralauth.users_to_rename table
  • 18:28 Krinkle: Lots of Jenkins builds are stuck even though they're "Finished". All services look up. (Filed T91430.)
  • 17:18 Krinkle: Reloading Zuul to deploy Icad0a26dc8 and Icac172b16
  • 15:39 hashar: cancelled logrotate update of all jobs since that seems to kill the Jenkins/Zuul gearman connection. Probably because all jobs are registered on each config change.
  • 15:31 hashar: updating all jobs in Jenkins based on PS2 of
  • 10:56 hashar: Created instance i-000008fb with image "ubuntu-14.04-trusty" and hostname i-000008fb.eqiad.wmflabs.
  • 10:52 hashar: deleting integration-puppetmaster to recreate it with a new image, bug T87484 . Will have to reapply I5335ea7cbfba33e84b3ddc6e3dd83a7232b8acfd and I30e5bfeac398e0f88e538c75554439fe82fcc1cf
  • 03:47 Krinkle: git-deploy: Deploying integration/slave-scripts 05a5593..1e64ed9
  • 01:11 marxarelli: gzip'd /var/log/account/pacct.0 on deployment-bastion to free space

March 2

  • 21:35 twentyafterfour: <Krenair> (per #mediawiki-core, have deleted the job queue key in redis, should get regenerated. also cleared screwed up log and restarted job runner service)
  • 15:39 Krinkle: Removing /usr/local/src/zuul from integration-slave12xx and integration-slave14xx to let puppet re-install zuul-cloner (T90984)
  • 13:39 Krinkle: integration-slave12xx and integration-slave14xx instances still depooled due to T90984

February 27

  • 21:58 Krinkle: Ragekilled all queued jobs related to beta and force restarted Jenkins slave agent on deployment-bastion.eqiad
  • 21:56 Krinkle: Job beta-update-databases-eqiad and node deployment-bastion.eqiad have been stuck for the past 4 hours
  • 21:49 marxarelli: Reloading Zuul to deploy I273270295fa5a29422a57af13f9e372bced96af1 and I81f5e785d26e21434cd66dc694b4cfe70c1fa494
  • 18:08 Krenair: Kicked deployment-bastion node in jenkins to try to fix jobs
  • 06:42 legoktm: deployed
  • 01:01 Krinkle: Keeping all integration-slave12xx and slave14xx instances depooled.
  • 00:53 Krinkle: Finished provisioning of integration-slave12xx and slave14xx instances. Initial testing failed due to "/usr/local/bin/zuul-cloner: No such file or directory"

February 26

  • 23:24 Krinkle: integration-puppetmaster /var disk is full (1.8 of 1.9GB) - /var/log/puppet/reports is 1.1GB - purging
  • 23:23 Krinkle: Puppet failing on new instances due to "Error 400 on SERVER: cannot generate tempfile `/var/lib/puppet/yaml/"
  • 13:27 Krinkle: Provisioning the new integration-slave12xx and integration-slave14xx instances
  • 05:05 legoktm: deployed
  • 03:48 Krinkle: Creating integration-slave1201,02,03,04 and integration-slave1401,02,03,04,05 per T74011 (not yet setup/provisioned, keep depooled)
  • 03:39 Krinkle: Cleaned up and re-pooled integration-slave1006 (was depooled since yesterday)
  • 03:39 Krinkle: Cleaned up and re-pooled integration-slave1007 and integration-slave1008 (was auto-depooled by Jenkins)
  • 01:54 Krinkle: integration-slave1007 and integration-slave1008 were auto-depooled due to main disk (/ and its /tmp) being < 900 MB free
  • 01:20 legoktm: actually deployed this time
  • 01:16 legoktm: deployed
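The auto-depool condition in the 01:54 entry (root disk under roughly 900 MB free) can be checked by hand before Jenkins acts on it. A sketch follows; the mount point and threshold are taken from the log entry above, and GNU coreutils `df` is assumed.

```shell
# Warn when free space on / drops below the ~900 MB mark that got
# integration-slave1007/1008 auto-depooled (threshold from the log above).
free_kb="$(df --output=avail -k / | tail -n 1 | tr -d ' ')"
if [ "$free_kb" -lt $((900 * 1024)) ]; then
  echo "LOW: $((free_kb / 1024)) MB free on /"
else
  echo "OK: $((free_kb / 1024)) MB free on /"
fi
```

Running this from cron on each slave would surface the condition before builds start failing on a full /tmp.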

February 25

  • 23:55 Krinkle: Re-established Jenkins-Gearman connection
  • 23:54 Krinkle: Zuul queue is growing. Nothing is added to its dashboard. Jenkins executers all idle. Gearman deadlock?
  • 20:38 legoktm: deployed
  • 20:18 legoktm: deployed
  • 17:22 ^d: reloading zuul to pick up utfnormal jobs
  • 02:15 Krinkle: integration-slave1006 has <700MB free disk space (including /tmp)

February 24

  • 18:41 marxarelli: Running `jenkins-jobs update` to create
  • 17:55 Krinkle: It seems xdebug was enabled on integration slaves running trusty. This makes errors in build logs incomprehensible.

February 21

  • 03:01 Krinkle: Reloading Zuul to deploy I3bcd3d17cb886740bd67b33b573aa25972ddb574

February 20

  • 07:25 Krinkle: Finished setting up integration-slave1010 and added it to Jenkins slave pool
  • 00:54 Krinkle: Setting up integration-slave1010 (replacement for integration-slave1009)

February 19

  • 23:13 bd808: added Thcipriani to under_NDA sudoers group; WMF staff
  • 19:45 Krinkle: Destroying integration-slave1009 and re-imaging
  • 19:02 bd808: VICTORY! deployment-bastion jenkins slave unstuck
  • 19:01 bd808: toggling gearman plugin in jenkins admin console
  • 18:58 bd808: took deployment-bastion jenkins connection offline and online 5 times; gearman plugin still stuck
  • 18:41 bd808: cleaned up mess in /tmp on integration-slave1008
  • 18:38 bd808: brought integration-slave1007 back online
  • 18:37 bd808: cleaned up mess in /tmp on integration-slave1007
  • 18:29 bd808: restarting jenkins because I messed up and disabled gearman plugin earlier
  • 16:30 bd808: disconnected and reconnected deployment-bastion.eqiad again
  • 16:28 bd808: reconnected deployment-bastion.eqiad to jenkins
  • 16:28 bd808: disconnected deployment-bastion.eqiad from jenkins
  • 16:27 bd808: killed all pending jobs for deployment-bastion.eqiad
  • 16:26 bd808: disconnected deployment-bastion.eqiad from jenkins
  • 16:20 legoktm: updated phpunit for

February 18

  • 23:50 marxarelli: Reloading Zuul to deploy Id311d632e5032ed153277ccc9575773c0c8f30f1
  • 23:37 marxarelli: Running `jenkins-jobs update` to create mediawiki-vagrant-bundle17-cucumber job
  • 23:15 marxarelli: Running `jenkins-jobs update` to update mediawiki-vagrant-bundle17 jobs
  • 22:56 marxarelli: Reloading Zuul to deploy I3b71f4dc484d5f9ac034dc1050faf3ba6f321752
  • 22:42 marxarelli: running `jenkins-jobs update` to create mediawiki-vagrant-bundle17 jobs
  • 22:13 hashar: saving Jenkins configuration at to reset the locale
  • 16:41 bd808: beta-scap-eqiad job fixed after manually rebuilding git clones of scap/scap on rsync01 and videoscaler01
  • 16:39 bd808: rebuilt corrupt deployment-videoscaler01:/srv/deployment/scap/scap
  • 16:36 bd808: rebuilt corrupt deployment-rsync01:/srv/deployment/scap/scap
  • 16:26 bd808: scap failures only from deployment-videoscaler01 and deployment-rsync01
  • 16:25 bd808: scap failing with "ImportError: cannot import name cli" after latest update; investigating
  • 16:23 bd808: redis-cli srem 'deploy:scap/scap:minions' i-0000059b.eqiad.wmflabs i-000007f8.eqiad.wmflabs i-0000022e.eqiad.wmflabs i-0000044e.eqiad.wmflabs i-000004ba.eqiad.wmflabs
  • 16:16 bd808: 5 deleted instances in trebuchet redis cache for salt/salt repo
  • 16:16 bd808: updated scap to 7c64584 (Add universal argument to ignore ssh_auth_sock)
  • 16:14 bd808: scap clone on deployment-mediawiki02 corrupt; git fsck did not fix; will delete and refetch
  • 01:41 bd808: fixed git rebase conflict on deployment-salt caused by outdated cherry-pick; cherry-picks are merged now so reset to tracking origin/production

February 17

  • 17:47 hashar: beta cluster is mostly down because the instance supporting the main database (deployment-db1) is down. The root cause is an outage on the labs infra
  • 03:43 Krinkle: Depooled integration-slave1009 (Debugging T89180)
  • 03:38 Krinkle: Depooled integration-slave1009

February 14

  • 00:55 marxarelli: gzip'd /var/log/account/pacct.0 on deployment-bastion
  • 00:02 bd808: Stopped udp2log and started udp2log-mw on deployment-bastion

February 13

  • 23:25 bd808: cherry-picked to deployment-salt for testing
  • 14:03 Krinkle: Jenkins UI stuck in Spanish. Resetting configuration.
  • 13:05 Krinkle: Reloading Zuul to deploy I0eaf2085576165b

February 12

  • 11:11 hashar: changed passwords of selenium users.
  • 10:41 hashar: Removing MEDIAWIKI_PASSWORD* global env variables from Jenkins configuration bug T89226

February 11

  • 19:39 Krinkle: Jenkins UI is stuck in French. Resetting..
  • 17:56 greg-g: hashar saved Jenkins global configuration at to hopefully reset the web interface default locale
  • 09:57 hashar: restarting Jenkins to upgrade the Credentials plugin
  • 09:25 hashar: bunch of puppet failure since 8:00am UTC. Seems to be DNS timeouts.

February 10

  • 09:18 hashar: re-enabling puppet-agent on deployment-salt. It was disabled with no reason given and no SAL entry.
  • 06:32 Krinkle: Fix lanthanum:/srv/ssd/jenkins-slave/workspace/mediawiki-extensions-zend@3/src/extensions/Flow/.git/config.lock
  • 00:50 bd808: Updated integration/slave-scripts to "Load extensions using wfLoadExtensions() if possible" (b532a9a)

February 9

  • 22:40 Krinkle: Various mediawiki-extensions-zend builds are jammed half-way through phpunit execution (filed T89050)
  • 21:31 hashar: Deputized legoktm to the Gerrit 'integration' group. Brings +2 on integration/* repos.
  • 20:38 hashar: reconnected jenkins slave agents 1006 1007 and 1008
  • 20:37 hashar: deleted /tmp on integration slaves 1006 1007 and 1008. Filled with npm temp directories
  • 15:51 hashar: integration : allowed ssh from gallium to the instances
  • 09:20 hashar: starting puppet agent on integration-puppetmaster

February 7

  • 16:23 hashar: puppet is broken on integration project for some reason. No clue what is going on :-( bug T88960
  • 16:19 hashar: restarted puppetmaster on integration-puppetmaster.eqiad.wmflabs
  • 00:42 Krinkle: Jenkins is alerting for integration-slave1006, integration-slave1007 and integration-slave1008 having low /tmp space free (< 0.8GB)

February 6

  • 22:40 Krinkle: Installed dsh on integration-dev
  • 05:46 Krinkle: Reloading Zuul to deploy I096749565 and I405bea9d3e
  • 01:35 Krinkle: Upgraded all integration slaves to npm v2.4.1

February 5

  • 13:11 hasharAway: restarted Zuul server to clear out stalled jobs
  • 12:25 hashar: Upgrading puppet-lint from 0.3.2 to 1.1.0 on all repositories. All jobs are non-voting besides mediawiki-vagrant-puppetlint-lenient, which passes just fine with 1.1.0
  • 03:21 Krinkle: Reloading Zuul to deploy I08a524ea195c
  • 00:22 marxarelli: Reloaded Zuul to deploy Iebdd0d2ddd519b73b1fc5e9ce690ecb59da9b2db

February 4

  • 10:43 hashar: beta-scap-eqiad job is broken because mwdeploy can no longer ssh from deployment-bastion to deployment-mediawiki01 . Filed as bug T88529
  • 10:30 hashar: piok

February 3

  • 13:55 hashar: ElasticSearch /var/log/ filling up is bug T88280
  • 09:15 hashar: Running puppet on deployment-eventlogging02 has been stalled for 3d15h. No log :-(
  • 09:08 hashar: cleaning /var/log on deployment-elastic06 and deployment-elastic07
  • 00:44 Krinkle: Restarting Jenkins-Gearman connection

February 2

  • 21:39 Krinkle: Deployed I94f65b56368 and reloading Zuul

January 31

  • 20:31 hashar: canceling a bunch of browser tests jobs that are deadlocked waiting for SauceLabs. The HTTP request has no timeout; bug T88221

January 29

January 28

  • 22:53 Krinkle: integration-slave1007: rm -rf /mnt/jenkins-workspace/workspace/mwext-DonationInterface-np*
  • 22:43 Krinkle: /srv/deployment/integration/slave-scripts got corrupted by puppet on labs slaves. No longer has the appropriate permission flags.
  • 16:52 marktraceur: restarting nginx on deployment-upload so beta images might work again

January 27

  • 18:54 Krinkle: integration-slave1007: rm -rf mwext-VisualEditor-*

January 26

  • 23:22 bd808: rm integration-slave1006:/mnt/jenkins-workspace/workspace/mediawiki-phpunit-hhvm/src/.git/HEAD.lock (file was timestamped Jan 22 23:55)
  • 21:06 bd808: I just merged a scap change that probably will break the beta-recomile-math-textvc-eqiad job --

January 24

  • 01:05 hashar: restarting Jenkins (deadlock on deployment-bastion slave)

January 20

  • 18:50 Krinkle: Reconfigure Jenkins default language back to 'en' as it was set to Turkish

January 17

  • 20:20 James_F: Brought deployment-bastion.eqiad back online, but without effect AFAICS.
  • 20:19 James_F: Marking deployment-bastion.eqiad as temporarily offline to try to fix the backlog.

January 16

  • 23:26 bd808: cherry-picked to fix puppet errors on deployment-prep
  • 12:43 _joe_: added hhvm.pcre_cache_type = "lru" to beta hhvm config
  • 12:32 _joe_: installing the new HHVM package on mediawiki hosts
  • 11:59 akosiaris: removed ferm from all beta hosts via salt

January 15

January 14

  • 23:22 mutante: cherry-picked I1e5f9f7bcbbe6c4 on deployment-bastion
  • 20:37 hashar: Restarting Zuul
  • 20:36 hashar: Zuul applied Ori patch to fix a git lock contention in Zuul-cloner bug T86730 . Tagged wmf-deploy-20150114-1
  • 16:58 greg-g: rm -rf'd the Wikigrok checkout in integration-slave1006:/mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions to (hopefully) fix
  • 14:56 anomie: Cherry-pick to Beta Labs
  • 02:05 bd808: There is some kind of race / conflict with the mediawiki-extensions-hhvm; I cleaned up the same error for a different extension yesterday
  • 02:04 bd808: integration-slave1006 IOError: Lock for file '/mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions/WikiGrok/.git/config' did already exist, delete '/mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions/WikiGrok/.git/config.lock' in case the lock is illegal

January 13

  • 22:37 hashar: Restarted Zuul, deadlocked waiting for Gerrit
  • 21:38 ori: deployment-prep upgraded nutcracker on mw1/mw2 to 0.4.0+dfsg-1+wm1
  • 17:49 hashar: If the Zuul status page shows a lot of changes with completed jobs and the number of results growing, Zuul is deadlocked waiting for Gerrit. Have to restart it with /etc/init.d/zuul restart
  • 17:43 hashar: Restarted deadlocked Zuul, which drops ALL events. Reason is Gerrit lost connection with its database, which is not handled by Zuul. See
  • 17:32 James_F: No effect from restarting Gearman. Getting Timo to restart Zuul.
  • 17:30 James_F: No effect. Restarting Gearman.
  • 17:26 James_F: Trying a shutdown/re-enable of Jenkins.
  • 13:59 YuviPanda: running scap via jenkins, hitting buttons on
  • 13:58 YuviPanda: scap failed
  • 13:58 YuviPanda: running scap, because why not
  • 13:58 YuviPanda: modified PrivateSettings.php to make it use wikiadmin user rather than mw user
  • 13:51 YuviPanda: created user wikiadmin on deployment-db1
  • 04:31 James_F: Zuul now appears fixed.
  • 04:29 marktraceur: FORCE RESTART ZUUL (James_F told me to)
  • 04:28 marktraceur: Attempting graceful zuul restart
  • 04:26 marktraceur: Reloaded zuul to see if it will help
  • 04:24 James_F: Took the gallium Jenkins slave offline, disconnected and relaunched; no effect.
  • 04:19 James_F: Disabled and re-enabled Gearman, no effect.
  • 04:15 James_F: Flagged and unflagged Jenkins for restart, no effect.
  • 04:10 James_F: Jenkins/zuul/whatever not working, investigating.
  • 01:12 marxarelli: Added twentyafterfour as an admin to the integration project
  • 01:08 bd808: Added Dduvall as an admin in the integration project
  • 00:55 bd808: zuul is plugged up because a gate-and-submit job failed on integration-slave1006 (ZeroBanner clone problem) and then the patch was force merged
  • 00:48 bd808: deleted integration-slave1006:/mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions/ZeroBanner to try and clear the git clone problem there
  • 00:35 bd808: git clone failure in blocking merge of core patch

January 12

  • 21:17 hashar: qa-morebots moved from #wikimedia-qa to #wikimedia-releng bug T86053
  • 20:57 greg-g: yuvi removed webserver:php5-mysql role from deployment-sentry2, thus getting puppet on it to unfail
  • 20:57 greg-g: test-qa
  • 11:41 hashar: foo
  • 10:28 hashar: Removing Jenkins IRC notifications from #wikimedia-qa , please switch to #wikimedia-releng
  • 09:06 hashar: Tweak Zuul configuration to pin python-daemon <= 2.0 and deploying tag wmf-deploy-20150112-1. bug T86513

January 8

  • 19:21 Krinkle: Force restart Zuul
  • 19:21 Krinkle: Gearman is back up but Zuul itself still stuck (no longer processing new events, doing "Updating information for .." for the same three jobs over and over again)
  • 19:08 Krinkle: Relaunched Gearman from Jenkins manager
  • 19:05 Krinkle: Zuul/Gearman stuck
  • 18:26 YuviPanda: purged nscd cache on all deployment-prep hosts
  • 16:34 Krinkle: Reload Zuul to deploy I9bed999493feb715
  • 14:58 hashar: contintcloud labs project has been created! bug T86170. Added Krinkle and 20after4 as project admins.
  • 14:44 hashar: on gallium and lanthanum, pushing integration/jenkins.git which would: 1b6a290 - Upgrade JSHint from v2.5.6 to 2.5.11

January 7

  • 10:57 hashar: Taught Jenkins configuration about Java 8. Name: "Ubuntu - OpenJdk 8" JAVA_HOME: /usr/lib/jvm/java-8-openjdk-amd64/ . Only available on Trusty slaves though
  • 10:56 hashar: installed openjdk 8 on CI Trusty labs slaves
  • 10:34 hashar: varnish text cache is back up. Had to delete /etc/varnish and reinstall varnish from scratch + rerun puppet.
  • 10:25 hashar: deleting /etc/varnish on deployment-cache-text02 and running puppet
  • 10:24 hashar: beta varnish text cache is broken. The vcl refuses to load because of undefined probes
  • 10:01 hashar: restarted deployment-cache-mobile03 and deployment-cache-text02
  • 09:49 hashar: rebooting deployment-cache-bits01
  • 00:41 Krinkle: rm -rf slave-scripts and re-cloning from integration/jenkins.git on all slaves (under sudo, just like puppet originally did) - git-status and jshint both work fine now
  • 00:40 Krinkle: Permissions of deployment/integration/slave-scripts on labs slave are all screwed up (git-status says files are dirty, but when run as root git-status is clean and jshint also works fine via sudo)
  • 00:29 Krinkle: Tried reconnecting Gearman, relaunching slave agents. Force-restarting Zuul now.
  • 00:15 Krinkle: Permissions in deployment/integration/slave-scripts on integration-slave1003 are screwed up as well

January 6

  • 22:13 hashar: jshint complains with: Error: Cannot find module './lib/node'  :-(
  • 22:12 hashar: integration-slave1005 chmod -R go+r /srv/deployment/integration/slave-scripts
  • 22:08 hashar: integration-slave1007 chmod -R go+r /srv/deployment/integration/slave-scripts . cscott mentioned build failures of parsoidsvc-jslint which could not read /srv/deployment/integration/slave-scripts/tools/node_modules/jshint/src/cli.js
  • 02:29 ori: qdel -f'd qa-morebots and started a new instance

December 22

December 21

  • 08:31 Krinkle: /var on integration-slave1005 had 93% of 2GB full. Removed some large items in /var/cache/apt/archives that seemed unneeded and don't exist on other slaves.

December 19

  • 23:01 greg-g: Krinkle restarted Gearman, which got the jobs to flow again
  • 20:51 Krinkle: integration-slave1005 (new Ubuntu Trusty instance) is now pooled
  • 18:51 Krinkle: Re-created and provisioning integration-slave1005 (UbuntuTrusty)
  • 18:23 bd808: redis input to logstash stuck; restarted service
  • 18:16 bd808: ran `apt-get dist-upgrade` on logstash01
  • 18:02 bd808: removed local mwdeploy user & group from videoscaler01
  • 18:01 bd808: deployment-videoscaler01 has mysteriously acquired a local mwdeploy user instead of the ldap one
  • 17:58 bd808: forcing puppet run on deployment-videoscaler01
  • 07:24 Krinkle: Restarting Gearman connection to Jenkins
  • 07:24 Krinkle: Attempt #5 at re-creating integration-slave1001. Completed provisioning per Setup instructions. Pooled.
  • 05:33 Krinkle: Rebasing integration-puppetmaster with latest upstream operations/puppet (5 local patches) and labs/private
  • 00:06 bd808: restored local commit with ssh keys for scap to deployment-salt

December 18

  • 23:57 bd808: temporarily disabled jenkins scap job
  • 23:56 bd808: killed some ancient screen sessions on deployment-bastion
  • 23:53 bd808: Restarted udp2log-mw on deployment-bastion
  • 23:53 bd808: Restarted salt-minion on deployement-bastion
  • 23:47 bd808: Updated scap to latest HEAD version
  • 21:57 Krinkle: integration-slave1005 is not ready. It's incompletely setup due to
  • 19:29 marxarelli: restarted puppetmaster on deployment-salt
  • 19:29 marxarelli: seeing "Could not evaluate: getaddrinfo: Temporary failure in name resolution" in the deployment-* puppet logs
  • 14:17 hashar: deleting instance deployment-parsoid04 and removing it from Jenkins
  • 14:08 hashar: restarted varnish backend on parsoidcache02
  • 14:00 hashar: parsoid05 seems happy: curl http://localhost:8000/_version: {"name":"parsoid","version":"0.2.0-git","sha":"d16dd2db6b3ca56e73439e169d52258214f0aeb2"}
  • 14:00 hashar: parsoid05 seems happy: curl http://localhost:8000/_version
  • 13:56 hashar: applying latest changes of Parsoid on parsoid05 via: zuul enqueue --trigger gerrit --pipeline postmerge --project mediawiki/services/parsoid --change 180671,2
  • 13:56 hashar: parsoid05: disabling puppet, stopping parsoid, rm -fR /srv/deployment/parsoid ; rerunning the Jenkins beta-parsoid-update-eqiad to hopefully recreate everything properly
  • 13:52 hashar: making parsoid05 a Jenkins slave to replace parsoid04
  • 13:24 hashar: apt-get upgrade on parsoidcache02 and parsoid04
  • 13:23 hashar: updated labs/private on puppet master to fix a puppet dependency cycle with sudo-ldap
  • 13:19 hashar: rebased puppetmaster repo
  • 12:53 hashar: reenqueuing last merged change of Parsoid in Zuul postmerge pipeline in order to trigger the beta-parsoid-update-eqiad job properly. zuul enqueue --trigger gerrit --pipeline postmerge --project mediawiki/services/parsoid --change 180671,2
  • 12:52 hashar: deleting the workspace for the beta-parsoid-update-eqiad jenkins job on deployment-parsoid04 . Some file belong to root which prevent the job from processing
  • 09:13 hashar: enabled MediaWiki core 'structure' PHPUnit tests for all extensions. Will require folks to fix their incorrect AutoLoader and ResourceLoader entries. 180496 bug T78798

December 17

  • 21:02 hashar: cancelled all browser tests, suspecting them to deadlock Jenkins somehow :(

December 16

  • 17:17 bd808: git-sync-upstream runs cleanly on deployment-salt again!
  • 17:16 bd808: removed cherry pick of Ib2a0401a7aa5632fb79a5b17c0d0cef8955cf990 (-2 by _joe_; replaced by Ibcad98a95413044fd6c5e9bd3c0a6fb486bd5fe9)
  • 17:15 bd808: removed cherry pick of I3b6e37a2b6b9389c1a03bd572f422f898970c5b4 (modified in gerrit by bd808 and not repicked; merged)
  • 17:15 bd808: removed cherry pick of I08c24578596506a1a8baedb7f4a42c2c78be295a (-2 by _joe_ in gerrit; replaced by Iba742c94aa3df7497fbff52a856d7ba16cf22cc7)
  • 17:13 bd808: removed cherry pick of I6084f49e97c855286b86dbbd6ce8e80e94069492 (merged by Ori with a change)
  • 17:09 bd808: trying to fix it without using important changes
  • 17:08 bd808: deployment-salt:/var/lib/git/operations/puppet is a rebase hell of cherry-picks that don't apply
  • 13:51 hashar: deleting integration-slave1001 and recreating it. It is blocked on boot and we can't console on it

December 15

  • 23:24 Krinkle: integration-slave1001 isn't coming back (T76250), building integration-slave1005 as its replacement.
  • 12:53 YuviPanda: manually restarted diamond on all betalabs host, to see if that is why metrics aren’t being sent anymore
  • 09:41 hashar: deleted hhvm core files in /var/tmp/core from both mediawiki01 and mediawiki02 (T1259 and T71979)

December 13

  • 18:51 bd808: Running chmod -R g+s /data/project/upload7 on deployment-mediawiki02
  • 18:25 bd808: Running chmod -R u=rwX,g=rwX,o=rX /data/project/upload7 from deployment-mediawiki02
  • 18:16 bd808: chown done for /data/project/upload7
  • 17:51 bd808: Running chown -R apache:apache on /data/project/upload7 from deployment-mediawiki02
  • 17:11 bd808: Labs DNS seems to be flaking out badly and causing random scap and puppet failures
  • 16:58 bd808: restarted puppetmaster on deployment-salt
  • 16:31 bd808: apache user renumbered on deployment-mediawiki03
  • 16:23 bd808: apache and hhvm restarted on beta app servers following apache user renumber
  • 16:09 bd808: apache and hhvm stopped on beta app server tier. All requests expected to return 503 from varnish
  • 16:03 bd808: Starting work on phab:T78076 to renumber apache users in beta
  • 08:21 YuviPanda|zzz: forcing puppet run on all deployment-prep hosts

December 12

  • 22:38 bd808: Fixed scap by deleting /srv/mediawiki/~tmp~ on deployment-rsync01
  • 22:27 hashar: Creating 1300 Jenkins jobs to run extensions PHPUnit tests under either HHVM or Zend PHP flavors.
  • 18:35 bd808: Added puppet config to record !log messages in logstash
  • 17:32 bd808: forcing puppet runs on deployment-mediawiki0[12]; hiera settings specific to beta were not applied on the hosts leading to all kinds of problems
  • 17:12 bd808: restarted hhvm on deployment-mediawiki0[12] and purged hhbc database
  • 17:00 bd808: restarted apache2 on deployment-mediawiki01
  • 16:59 bd808: restarted apache2 on deployment-mediawiki02

December 11

  • 22:13 hashar: Adding chrismcmahon to the 'integration' Gerrit group so he can +2 changes made to integration/config.git
  • 21:47 hashar: Jenkins re adding integration-slave1009 to the pool of slaves
  • 19:45 bd808|LUNCH: I got nerd sniped into looking at beta. Major personal productivity failure.
  • 19:43 bd808|LUNCH: nslcd log noise is probably a red herring --
  • 19:39 bd808|LUNCH: lots of nslcd errors in syslog on deployment-rsync01 which may be causing scap failures
  • 07:45 YuviPanda: shut up shinken-wm

December 10

  • 22:17 bd808: restarted logstash on logstash1001. redis event queue not being processed
  • 10:30 hashar: Adding hhvm on Trusty slaves, using depooled integration-slave1009 as the main work area

December 9

  • 16:33 bd808: restarted puppetmaster to pick up changes to custom functions
  • 16:19 bd808: forced install of sudo-ldap across beta with: salt '*' 'env SUDO_FORCE_REMOVE=yes DEBIAN_FRONTEND=noninteractive apt-get -y install sudo-ldap'

December 8

  • 23:45 bd808: deleted hhvm core on mediawiki01
  • 23:43 bd808: Ran `apt-get clean` on deployment-mediawiki01

December 5

  • 22:21 bd808: 1.1G free on deployment-mediawiki02:/var after removing a lot of crap from logs and /var/tmp/cores
  • 22:06 bd808: /var full on deployment-mediawiki02 again :(((
  • 10:50 hashar: applying mediawiki::multimedia class on contint slaves
  • 01:01 bd808: Deleted a ton of jeprof.*.heap files from deployment-mediawiki02:/
  • 00:54 YuviPanda: cleared out pngs from mediawiki02 to kill low space warning
  • 00:53 YuviPanda: mediawiki02 instance is low on space, /tmp has lots of... pngs?

December 4

  • 22:48 YuviPanda: manually rebased puppet on deployment-prep
  • 00:29 bd808: deleted instance "udplog"

December 3

  • 19:11 bd808: Cleaned up legacy jobrunner scripts on deployment-jobrunner01 (/etc/default/mw-job-runner /etc/init.d/mw-job-runner /usr/local/bin/

December 2

  • 23:39 bd808: Cause of full disk on deployment-mediawiki01 was an hhvm core file; fixed now
  • 23:35 bd808: /var full on deployment-mediawiki01
  • 11:27 hashar: deleting /srv/vdb/varnish* files on all varnish instances
  • 10:23 hashar: restarted parsoid on deployment-parsoid05
  • 05:26 Krinkle: integration-slave1001 has been down since the failed reboot on 28 November 2014. Still unreachable over ssh and no Jenkins slave agent.

December 1

  • 18:54 bd808: Got jenkins updates working again by taking deployment-bastion node offline, killing waiting jobs and bringing it back online again.
  • 18:51 bd808: updates in beta stuck with the "Waiting for next available executor" deadlock again
  • 17:59 bd808: Testing rsyslog event forwarding to logstash via puppet cherry-pick

November 27

  • 12:28 hashar: enabled puppet master autoupdate by setting puppetmaster_autoupdate: true in Hiera:Integration .
  • 12:28 hashar: rebased integration puppetmaster : 5d35de4..1a5ebee
  • 00:32 bd808: Testing local hack on deployment-salt to switch order of heira backends
  • 00:16 bd808: Testing a proposed puppet patch to allow pointing hhvm logs back to deploment-bastion

November 26

  • 00:51 bd808: cherry-picked patch for redis logstash input from MW 175896
  • 00:50 bd808: Restored puppet cherry-picks from reflog [phab:T75947]

November 25

  • 23:45 hashar: Fixed upload cache on beta cluster, the Varnish backend had a mmap SILO error that prevented the backend from starting.
  • 21:05 bd808: Running `sudo find . -type d ! -perm -o=w -exec chmod 0777 {} +` to fix upload permissions
  • 18:01 legoktm: cleared out renameuser_status table (old broken global merges)
  • 18:00 legoktm: 4086 rows deleted from localnames, 3929 from localuser
  • 17:59 legoktm: clearing out localnames/localuser where wikis don't exist on beta
  • 17:10 legoktm: ran migratePass0.php on all wikis
  • 17:09 legoktm: ran checkLocalUser.php --delete on all wikis
  • 17:08 legoktm: PHP Notice: Undefined index: wmgExtraLanguageNames in /mnt/srv/mediawiki/php-master/includes/SiteConfiguration.php on line 307
  • 17:07 legoktm: ran checkLocalNames.php --delete on all wikis
  • 04:37 jgage: restarted jenkins at 20:31

November 24

  • 17:24 greg-g: stupid https
  • 16:40 bd808|deploy: My problem with was caused by a forceHTTPS cookie being set in my browser and redirecting to the broken https endpoint
  • 16:33 bd808|deploy: scap fixed by reverting bad config patch; still looking into failures from
  • 16:27 bd808: Looking at scap crash
  • 15:18 YuviPanda: restored local hacks + fixed 'em to account for 47dcefb74dd4faf8afb6880ec554c7e087aa947b on deployment-salt puppet repo, puppet failures recovering now

November 21

  • 17:06 bd808: deleted salt keys for deleted instances: i-00000289, i-0000028a, i-0000028b, i-0000028e, i-000002b7, i-000006ad
  • 15:57 hashar: fixed puppet cert on deployment-restbase01
  • 15:50 hashar: deployment-sca01 regenerating puppet CA for deployment-sca01
  • 15:34 hashar: Regenerated puppet master certificate on deployment-salt. It needs to be named deployment-salt.eqiad.wmflabs not i-0000015c.eqiad.wmflabs. Puppet agent works on deployment-salt now.
  • 15:19 hashar: I have revoked the deployment-salt certificates. All puppet agent are thus broken!
  • 15:01 hashar: deployment-salt cleaning certs with puppet cert clean
  • 14:52 hashar: manually switching restbase01 puppet master from virt1000 to deployment-salt.eqiad.wmflabs
  • 14:50 hashar: deployment-restbase01 has some puppet error: Error 400 on SERVER: Must provide non empty value. on node i-00000727.eqiad.wmflabs . That is due to puppet pickle() function being given an empty variable

November 20

November 19

  • 21:27 bd808: Ran `GIT_SSH=/var/lib/git/ssh git pull --rebase` in deployment-salt:/srv/var-lib/git/labs/private

November 18

November 17

  • 09:24 YuviPanda: moved *old* /var/log/eventlogging into /home/yuvipanda so puppet can run without bitching
  • 04:57 YuviPanda: cleaned up coredump on mediawiki02 on deployment-prep

November 14

  • 21:03 marxarelli: loaded and re-saved jenkins configuration to get it back to english
  • 17:27 bd808: /var full on deployment-mediawiki02. Adjusted ~bd808/cleanup-hhvm-cores for core found in /var/tmp/core rather than the expected /var/tmp/hhvm
  • 11:14 hashar: Recreated a labs Gerrit setup on integration-zuul-server . Available from using OpenID for authentication.

November 13

November 12

  • 21:03 hashar: Restarted Jenkins due to a deadlock with deployment-bastion slave

November 9

  • 16:51 bd808: Running `chmod -R =rwX .` in /data/project/upload7

November 8

  • 08:06 YuviPanda: that fixed it
  • 08:04 YuviPanda: disabling/enabling gearman

November 6

November 5

  • 16:14 bd808: Updated scap to include Ic4574b7fed679434097be28c061927ac459a86fc (Revert "Make scap restart HHVM")

October 31

October 30

  • 16:34 hashar: cleared out /var/ on integration-puppetmaster
  • 16:34 bd808: Upgraded kibana to v3.1.1
  • 15:54 hashar: Zuul: merging in which should fix jobs being stuck in queue on merge/gearman failures. bug 72113
  • 15:45 hashar: Upgrading Zuul reference copy from upstream c9d11ab..1f4f8e1
  • 15:43 hashar: Going to upgrade Zuul and monitor the result over the next hour.

October 29

October 28

  • 21:39 bd808: RoanKattouw creating deployment-parsoid05 as a replacement for the totally broken deployment-parsoid04

October 24

  • 13:36 hashar: That bumps hhvm on contint from 3.3.0-20140925+wmf2 to 3.3.0-20140925+wmf3
  • 13:36 hashar: apt-get upgrade on Trusty Jenkins slaves

October 23

  • 22:43 hashar: Jenkins resumed activity. Beta cluster code is being updated
  • 21:36 hashar: Jenkins: disconnected / reconnected slave node deployment-bastion.eqiad

October 22

  • 20:54 bd808: Enabled puppet on deployment-logstash1
  • 09:07 hashar: Jenkins: upgrading gearman-plugin from 0.0.7-1-g3811bb8 to 0.1.0-1-gfa5f083, i.e. bringing us to the latest version + 1 commit

October 21

  • 21:10 hashar: contint: refreshed slave-scripts 0b85d48..8c3f228 sqlite files will be cleared out after 20 minutes (instead of 60 minutes) bug 71128
  • 20:51 cscott: deployment-prep _joe_ promises to fix this properly tomorrow am
  • 20:51 cscott: deployment-prep turned off puppet on deployment-pdf01, manually fixed broken /etc/ocg/mw-ocg-service.js
  • 20:50 cscott: deployment-prep updated OCG to version 523c8123cd826c75240837c42aff6301032d8ff1
  • 10:55 hashar: deleted salt master key on deployment-elastic{06,07}, restarted salt-minion and reran puppet. It is now passing on both instances \O/
  • 10:48 hashar: rerunning puppet manually on deployment-elastic{06,07}
  • 10:48 hashar: beta: signing puppet cert for deployment-elastic{06,07}. On deployment-salt ran: puppet ca sign i-000006b6.eqiad.wmflabs; puppet ca sign i-000006b7.eqiad.wmflabs
  • 09:29 hashar: forget me deployment-logstash1 has a puppet agent error but it is simply because the agent is disabled "'debugging logstash config'"
  • 09:28 hashar: deployment-logstash1 disk full

October 20

  • 17:41 bd808: Disabled redis input plugin and restarted logstash on deployment-logstash1
  • 17:39 bd808: Disabled puppet on deployment-logstash1 for some live hacking of logstash config
  • 15:27 apergos: upgraded salt-master on virt1000 (master for labs)

October 17

  • 22:34 subbu: live fixed bad logger config in /srv/deployment/parsoid/deploy/conf/wmf/betalabs.localsettings.js and verified that parsoid doesn't crash anymore -- fix now on gerrit and being merged
  • 20:48 hashar: qa-morebots is back
  • 20:30 hashar: beta: switching Parsoid config file to the one in mediawiki/services/parsoid/deploy.git instead of the puppet maintained config file for subbu. Parsoid seems happy :)
  • hashar: qa-morebots disappeared :( bug 72179
  • hashar: deployment-logstash1 unlocking puppet by deleting left over /var/lib/puppet/state/agent_catalog_run.lock
  • hashar: logstash1 instance being filled up is bug 72175 probably caused by the Diamond collector spamming /server-status?auto
  • hashar: deployment-logstash1 deleting files under /var/log/apache2/ gotta file a bug to prevent access log from filling the partition

October 16

  • 06:14 apergos: updated remaining beta instances to salt-minion 2014.1.11 from salt ppa

October 15

  • 12:56 apergos: updated i-000002f4, i-0000059b, i-00000504, i-00000220 salt-minion to 2014.1.11
  • 12:20 apergos: updated salt-master and salt-minion on the deployment-salt host _only_ to 2014.1.11 (using salt ppa for now)
  • 01:08 Krinkle: Pooled integration-slave1009
  • 01:00 Krinkle: Setting up integration-slave1009 (bug 72014 fixed)
  • 01:00 Krinkle: integration-publisher and integration-zuul-server were rebooted by me yesterday. Seems they only show up in graphite now. Maybe they were shutdown or had puppet stuck.

October 14

  • 21:00 JohnLewis: icinga says deployment-sca01 is good (yay)
  • 20:42 JohnLewis: deleted and recreated deployment-sca01 (still needs puppet set up)
  • 20:24 JohnLewis: rebooted deployment-sca01
  • 09:26 hashar: renamed deployment-cxserver02 node slaves to 03 and updated the ip address
  • 06:49 Krinkle: Did a slow-rotating graceful depool/reboot/repool of all integration-slaves over the past hour to debug problems whilst waiting for puppet to unblock and set up new slaves.
  • 06:43 Krinkle: Keeping the new integration-slave1009 unpooled because setup could not be completed due to bug 72014.
  • 06:43 Krinkle: Pooled integration-slave1004
  • 05:40 Krinkle: Setting up integration-slave1004 and integration-slave1009 (bug 71873 fixed)

October 10

  • 20:53 Krinkle: Deleted integration-slave1004 and integration-slave1009. When bug 71873 is fixed, they'll need to be re-created.
  • 19:11 Krinkle: integration-slave1004 (new instance, not set up yet) was broken (bug 71741). The bug seems fixed for new instances so, I deleted and re-created it. Will be setting up as a Precise instance and pool it.
  • 19:09 Krinkle: integration-slave1009 (new instance) remains unpooled as it is not yet fully set up (bug 71874). See Nova_Resource:Integration/Setup

October 9

  • 20:17 bd808: rebooted deployment-sca01 via wikitech ui
  • 20:16 bd808: deployment-sca01 dead -- Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000100
  • 19:44 bd808: added role::deployment::test to deployment-rsync01 and deployment-mediawiki03 for trebuchet testing
  • 19:07 bd808: updated scap to include 8183d94 (Fix "TypeError bufsize must be an integer")
  • 09:34 hashar: migrating deployment-cxserver02 to beta cluster puppet and salt masters
  • 09:22 hashar: Renamed Jenkins slave deployment-cxserver01 to deployment-cxserver02 and updated IP. It is marked offline until the instance is ready and has the relevant puppet classes applied.
  • 09:19 hashar: deleting deployment-cxserver01 (borked since virt1005 outage) creating deployment-cxserver02 to replace it bug 71783

October 7

  • 19:19 bd808: ^d deleted all files/directories in gallium:/var/lib/jenkins-slave/tmpfs
  • 18:24 bd808: /var/lib/jenkins-slave/tmpfs full (100%) on gallium
  • 11:54 Krinkle: The new integration-slave1009 must remain unpooled because Setup failed (puppet unable to mount /mnt, bug 71874) - see also Nova Resource:Integration/Setup
  • 11:53 Krinkle: Deleted integration-slave1004 because bug 71741
  • 10:16 hashar: beta: apt-get upgraded all instances besides the lucid one.
  • 09:57 hashar: beta: deleting old occurrences of /etc/apt/preferences.d/puppet_base_2.7
  • 09:53 hashar: apt-get upgrade on all beta cluster instances
  • 09:34 Krinkle: Rebase integration-puppetmaster on latest operations-puppet (patches: I7163fd38bcd082a1, If2e96bfa9a1c46)
  • 09:32 Krinkle: Apply I44d33af1ce85 instead of Ib95c292190d on integration-puppetmaster (remove php5-parsekit package)
  • 09:28 hashar: upgrading php5-fss on both beta-cluster and integration instances. bug 66092
  • 08:55 Krinkle: Building additional contint slaves in labs (integration-slave1004 with precise and integration-slave1009 with trusty)
  • 08:21 Krinkle: Reload Zuul to deploy 5e905e7c9dde9f47482d

October 3

  • 22:53 bd808: Had to stop and start zuul due to NoConnectedServersError("No connected Gearman servers") in zuul.log on gallium
  • 22:34 bd808|deploy: Merged Ie731eaa7e10548a947d983c0539748fe5a3fe3a2 (Regenerate autoloader) to integration/phpunit for bug 71629
  • 14:01 manybubbles: rebuilding beta's simplewiki cirrus index
  • 08:24 hashar: deployment-bastion clearing up /var/log/account a bit bug 69604. Puppet patch pending :]

October 2

  • 19:42 bd808: Updated scap to include eff0d01 Fix format specifier for error message
  • 11:58 hashar: Migrated all mediawiki-core-regression* jobs to Zuul cloner bug 71549
  • 11:57 hashar: Migrated all mediawiki-core-regression* jobs to Zuul cloner

October 1

  • 20:57 bd808: hhvm servers broken because of I5f9b5c4e452e914b33313d0774fb648c1cdfe7ad
  • 17:29 bd808: Stopped service udp2log and started service udp2log-mw on deployment-bastion
  • 16:21 bd808: Cherry-picked into scap for beta. hhvm will be restarted on each scap. Keep your eyes open for weird problems like 503 responses that this may cause.
  • 14:14 hashar: rebased contint puppetmaster

September 30

  • 23:47 bd808: jobrunner using outdated ip address for redis01. Testing patch to use hostname rather than hardcoded ip
  • 21:45 bd808: jobrunner not running. ebernhardson is debugging.
  • 21:38 bd808: /srv on rsync01 now has 3.2G of free space and should be fine for quite a while again.
  • 21:37 bd808: I figured out the disk space problem on rsync01 (just as I was ready to replace it with rsync02). The old /srv/common-local directory was still there, which doubled the disk utilization. /srv/mediawiki is the correct sync dir now following prod changes.
  • 21:15 bd808: local l10nupdate users on bastion, mediawiki01 and rsync01
  • 21:06 bd808: Local mwdeploy user on deployment-bastion making things sad
  • 20:36 bd808: lots and lots of "file has vanished" errors from rsync. Not sure why
  • 20:35 bd808: Initial puppet run with role::beta::rsync_slave applied on rsync02 failed spectacularly in /Stage[main]/Mediawiki::Scap/Exec[fetch_mediawiki] stage
  • 20:02 bd808: Started building deployment-rsync02 to replace deployment-rsync01
  • 19:59 bd808|LUNCH: /srv partition on deployment-rsync01 full again. We need a new rsync server with more space
  • 17:44 bd808: Updated scap to 064425b (Remove restart-nutcracker and restart-twemproxy scripts)
  • 16:08 bd808: Occasional memcached-serious errors in beta for something trying to connect to the default memcached port (11211) rather than the nutcracker port (11212).
  • 15:58 bd808: scap happy again after fixing rogue group/user on rsync01 \o/ Not sure why they were created but likely an ldap hiccup during a puppet run
  • 15:56 bd808: removed local group/user mwdeploy on deployment-rsync01
  • 15:54 bd808: Local mwdeploy (gid=996) shadowing ldap group gid=603(mwdeploy) on deployment-rsync01
  • 15:49 bd808: apt-get dist-upgrade fixed hhvm on deployment-mediawiki03
  • 15:45 hashar: Updating our Jenkins job builder fork 686265a..ee80dbc (no job changed)
  • 15:44 bd808: scap failing in beta due to "Permission denied (publickey)" talking to deployment-rsync01.eqiad.wmflabs
  • 15:39 bd808: hhvm not starting after puppet run on deployment-mediawiki03. Investigating.
  • 15:36 bd808: enabling puppet and forcing run on deployment-mediawiki03
  • 15:34 bd808: enabling puppet and forcing run on deployment-mediawiki02
  • 15:29 bd808: puppet showed no changes on mediawiki01‽
  • 15:27 bd808: enabling puppet and forcing run on deployment-mediawiki01
  • 15:13 bd808: Fixed logstash by installing
  • 15:02 bd808: Logstash doesn't bundle the prune filter by default any more --
  • 14:59 bd808: Logstash rules need to be adjusted for latest upstream version: "Couldn't find any filter plugin named 'prune'"
  • 12:37 hashar: Fixed some file permissions under deployment-bastion:/srv/mediawiki-staging/php-master/vendor/.git; some files belonged to root instead of mwdeploy
  • 00:34 bd808: Updated kibana to latest upstream head 8653aba

September 29

  • 14:22 hashar: apt-get upgrade and reboot of all integration-slaveXX instances
  • 14:07 hashar: updated puppetmaster labs/private on both integration and beta cluster projects ( a41fcdd..84f0906 )
  • 08:57 hashar: rebased puppetmaster

September 26

  • 22:16 bd808: Deleted deployment-mediawiki04 (i-000005ba.eqiad.wmflabs) and removed from salt and trebuchet
  • 07:50 hashar: Pooled back integration-slave1006, which was removed because of bug 71314
  • 07:41 hashar: Updated our Jenkins Job Builder fork 2d74b16..686265a

September 25

  • 23:35 bd808: Done messing with puppet repo. Replaced 2 local commits with proper gerrit cherry picks. Removed a cherry-pick that had been rearranged and merged. Removed a cherry-pick that had been abandoned in gerrit.
  • 23:10 bd808: removed cherry-pick of abandoned; if beta wikis stop working this would be a likely culprit
  • 22:36 bd808: Trying to reduce the number of untracked changes in puppet repo. Expect some short term breakage.
  • 22:21 bd808: cleaned up puppet repo with `git rebase origin/production; git submodule update --init --recursive`
  • 22:18 bd808: puppet repo on deployment-salt out of whack. I will try to fix.
  • 08:15 hashar: beta: puppetmaster rebased
  • 08:10 hashar: beta: dropped a patch that reverted the OCG LVS configuration; it has since been fixed
  • 08:04 hashar: attempting to rebase beta cluster puppet master. Currently at 74036376
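
The puppet-repo cleanup logged above boils down to rebasing the local cherry-pick stack onto the upstream branch and re-syncing submodules. A minimal rehearsal in a throwaway repo (on deployment-salt the upstream branch is origin/production and the repo is operations/puppet; branch and file names here are illustrative):

```shell
#!/bin/sh
# Rehearse the 22:21 cleanup flow in a scratch repo: rebase local work
# onto the upstream branch, then re-sync submodules.
set -eu
export GIT_AUTHOR_NAME=sal GIT_AUTHOR_EMAIL=sal@example.invalid
export GIT_COMMITTER_NAME=sal GIT_COMMITTER_EMAIL=sal@example.invalid

repo="$(mktemp -d)"
cd "$repo"
git init -q
git checkout -qb production
echo 'base config' > site.pp
git add site.pp
git commit -qm 'upstream: base'

git checkout -qb local-cherry-picks   # the local patch stack
echo 'beta tweak' > beta.pp
git add beta.pp
git commit -qm 'cherry-pick: beta tweak'

git checkout -q production            # upstream moves ahead
echo 'more upstream' >> site.pp
git commit -qam 'upstream: more'

git checkout -q local-cherry-picks
git rebase -q production                      # replay local picks on top of upstream
git submodule --quiet update --init --recursive  # no-op here; the real repo has submodules
```

After the rebase, the local commit sits on top of the new upstream tip, which is what makes stale or merged cherry-picks easy to spot and drop.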

September 24

September 23

  • 23:08 bd808: Jenkins and deployment-bastion talking to each other again after six (6!) disconnect, cancel jobs, reconnect cycles
  • 22:53 greg-g: The dumb "waiting for executors" bug is
  • 22:51 bd808: Jenkins stuck trying to update database in beta again with the dumb "waiting for executors" bug/problem

September 22

  • 16:09 bd808: Ori updating HHVM to 3.3.0-20140918+wmf1 (from deployment-prep SAL)
  • 09:37 hashar_: Jenkins: deleting old mediawiki extensions jobs (rm -fR /var/lib/jenkins/jobs/*testextensions-master). They are no longer triggered and have been superseded by the *-testextension jobs.

September 20

  • 21:30 bd808: Deleted /var/log/atop.* on deployment-bastion to free some disk space in /var
  • 21:29 bd808: Deleted /var/log/account/pacct.* on deployment-bastion to free some disk space in /var
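
Several entries in this log reclaim /var by deleting rotated process-accounting and atop logs. A sketch of that cleanup, rehearsed against a scratch directory rather than a live /var/log (on a real host the targets are /var/log/account/pacct.* and /var/log/atop.*):

```shell
#!/bin/sh
# Rehearse the /var cleanup from the entries above in a scratch directory.
set -eu

var_log="$(mktemp -d)"
mkdir -p "$var_log/account"
touch "$var_log/account/pacct" "$var_log/account/pacct.1" "$var_log/account/pacct.2.gz"
touch "$var_log/atop.log" "$var_log/atop.log.1.gz"

# Delete only the rotated copies; the live pacct and atop.log files stay.
rm -f "$var_log"/account/pacct.[0-9]* "$var_log"/atop.log.[0-9]*
```

Pairing this with `apt-get clean`, as in the September 10 entry, is the other easy win when /var fills up.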

September 19

  • 21:16 hashar: puppet is broken on Trusty integration slaves because they try to install the nonexistent package php-parsekit. WIP; will get it sorted out eventually.
  • 14:57 hashar: Jenkins friday deploy: migrate all MediaWiki extension qunit jobs to Zuul cloner.

September 17

  • 12:20 hashar: upgrading jenkins 1.565.1 -> 1.565.2

September 16

  • 16:36 bd808: Updated scap to 663f137 (Check php syntax with parallel `php -l`)
  • 04:01 jeremyb: deployment-mediawiki02: salt was broken with a msgpack exception. mv -v /var/cache/salt{,.old} && service salt-minion restart fixed it. also did salt-call saltutil.sync_all
  • 04:00 jeremyb: deployment-mediawiki02: (/run was 99%)
  • 03:59 jeremyb: deployment-mediawiki02: rm -rv /run/hhvm/cache && service hhvm restart
  • 00:51 jeremyb: deployment-pdf01 removed base::firewall (ldap via wikitech)
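
The 04:01 salt fix above follows a reusable pattern: when the minion dies on a msgpack exception, move its cache aside (keeping it for inspection) and restart. A rehearsal against scratch paths; on a real host the cache is /var/cache/salt, the restart is `service salt-minion restart`, followed by `salt-call saltutil.sync_all`:

```shell
#!/bin/sh
# Rehearse the salt-minion cache reset in a scratch directory.
set -eu

root="$(mktemp -d)"
cache="$root/var/cache/salt"
mkdir -p "$cache/minion"
printf 'truncated msgpack payload' > "$cache/minion/data.p"   # stand-in for the corrupt cache

mv -v "$cache" "$cache.old"   # keep the old cache around for forensics
mkdir -p "$cache"             # the restarted minion repopulates from here
```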

September 15

  • 22:53 jeremyb: deployment-pdf01: pkill -f grain-ensure
  • 21:36 bd808: Trying to fix salt with `salt '*' service.restart salt-minion`
  • 21:32 bd808: only hosts responding to salt in beta are deployment-mathoid, deployment-pdf01 and deployment-stream
  • 21:29 bd808: salt calls failing in beta with errors like "This master address: 'salt' was previously resolvable but now fails to resolve!"
  • 20:18 hashar: restarted salt-master
  • 19:50 hashar: killed a bunch of `python /usr/local/sbin/grain-ensure contains ...` and `/usr/bin/python /usr/bin/salt-call --out=json grains.append deployment_target scap` commands on deployment-bastion
  • 18:57 hashar: scap breakage due to ferm is logged as
  • 18:48 hashar: tweaked a default ferm configuration file, which caused puppet to reload ferm. It ended up with rules that prevent ssh from other hosts, thus breaking rsync \O/
  • 18:37 hashar: beta-scap-eqiad job has been broken since ~17:20 UTC || rsync: failed to connect to deployment-bastion.eqiad.wmflabs (Connection timed out (110))

September 13

  • 01:07 bd808: Moved /srv/scap-stage-dir to /srv/mediawiki-staging; put a symlink in as a failsafe
  • 00:31 bd808: scap staging dir needs some TLC on deployment-bastion; working on it
  • 00:30 bd808: Updated scap to I083d6e58ecd68a997dd78faabe60a3eaf8dfaa3c
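
The 01:07 move above is a rename plus a failsafe symlink at the old path, so anything still hard-coded to /srv/scap-stage-dir keeps working. A rehearsal under a scratch root (the file name is illustrative; on deployment-bastion the paths live directly in /srv):

```shell
#!/bin/sh
# Rehearse the staging-dir move with a failsafe symlink at the old path.
set -eu

srv="$(mktemp -d)"
mkdir -p "$srv/scap-stage-dir"
echo '{}' > "$srv/scap-stage-dir/wikiversions.json"

mv "$srv/scap-stage-dir" "$srv/mediawiki-staging"
ln -s "$srv/mediawiki-staging" "$srv/scap-stage-dir"

# The old path still resolves through the symlink.
cat "$srv/scap-stage-dir/wikiversions.json"
```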

September 12

  • 01:28 ori: services promoted User:Catrope to projectadmin

September 11

  • 20:59 spagewmf: is down with 503 errors
  • 16:13 bd808: Now that scap is pointed to labmon1001.eqiad.wmnet the deployment-graphite.eqiad.wmflabs host can probably be deleted; it never really worked anyway
  • 16:12 bd808: Updated scap to include I0f7f5cae72a87f68d861340d11632fb429c557b9
  • 15:09 bd808: Updated hhvm-luasandbox to latest version on mediawiki03 and verified that mediawiki0[12] were already updated
  • 15:01 bd808: Fixed incorrect $::deployment_server_override var on deployment-videoscaler01; deployment-bastion.eqiad.wmflabs is correct and deployment-salt.eqiad.wmflabs is not
  • 10:05 ori: deployment-prep upgraded luasandbox and hhvm across the cluster
  • 08:41 spagewmf: deployment-mediawiki01/02 are not getting latest code
  • 05:10 bd808: Reverted cherry-pick of I621d14e4b75a8415b16077fb27ca956c4de4c4c3 in scap; not the actual problem
  • 05:02 bd808: Cherry-picked I621d14e4b75a8415b16077fb27ca956c4de4c4c3 to scap to try and fix l10n update issue

September 10

  • 19:38 bd808: Fixed beta-recompile-math-texvc-eqiad job on deployment-bastion
  • 19:38 bd808: Made /usr/local/apache/common-local a symlink to /srv/mediawiki on deployment-bastion
  • 19:37 bd808: Deleted old /srv/common-local on deployment-videoscaler01
  • 19:32 bd808: Killed tasks on deployment-jobrunner01
  • 19:30 bd808: Removed old mw-job-runner cron job on deployment-jobrunner01
  • 19:19 bd808: Deleted /var/log/account/pacct* and /var/log/atop.log.* on deployment-jobrunner01 to make some temporary room in /var
  • 19:14 bd808: Deleted /var/log/mediawiki/jobrunner.log and restarted jobrunner on deployment-jobrunner01
  • 19:11 bd808: /var full on deployment-jobrunner01
  • 19:05 bd808: Deleted /srv/common-local on deployment-jobrunner01
  • 19:04 bd808: Changed /usr/local/apache/common-local symlink to point to /srv/mediawiki on deployment-jobrunner01
  • 19:03 bd808: w00t!!! scap job is green again --
  • 19:00 bd808: sync-common finished on deployment-jobrunner01; trying Jenkins scap job again
  • 18:53 bd808: Removed symlink and made /srv/mediawiki a proper directory on deployment-jobrunner01; running sync-common to populate.
  • 18:45 bd808: Made /srv/mediawiki a symlink to /srv/common-local on deployment-jobrunner01
  • 10:20 jeremyb: deployment-bastion /var at 97%, freed up ~500MB. apt-get clean && rm -rv /var/log/account/pacct*
  • 10:17 jeremyb: deployment-bastion good puppet run
  • 10:16 jeremyb: deployment-salt had an oom-kill recently. and some box (maybe master, maybe client?) had a disk fill up
  • 10:15 jeremyb: deployment-mediawiki0[12] both had good puppet runs
  • 10:15 jeremyb: deployment-salt started puppetmaster && puppet run
  • 10:14 jeremyb: deployment-bastion killed puppet lock
  • 03:04 bd808: Ori made puppet changes that moved the MediaWiki install dir to /srv/mediawiki. I didn't see that in the SAL so I'm adding it here.
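
The 18:45/18:53 jobrunner01 entries above undo a mistake in two steps: the /srv/mediawiki symlink is removed (not its target), then a real directory is created and repopulated. A rehearsal under a scratch root, with a plain cp standing in for sync-common:

```shell
#!/bin/sh
# Rehearse replacing a mistakenly created symlink with a real directory.
set -eu

root="$(mktemp -d)"
mkdir -p "$root/srv/common-local"
echo '<?php' > "$root/srv/common-local/index.php"
ln -s "$root/srv/common-local" "$root/srv/mediawiki"   # the mistaken state

rm "$root/srv/mediawiki"      # removes the symlink only, not its target
mkdir "$root/srv/mediawiki"   # proper directory in its place
cp -R "$root/srv/common-local/." "$root/srv/mediawiki/"   # stand-in for sync-common
```

The `rm` on a symlink is the step that matters: `rm -r` on the resolved path would have destroyed the shared /srv/common-local tree instead.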

September 9

  • 03:06 bd808: Restarted Jenkins agent on deployment-bastion twice to resolve executor deadlock (bug 70597)

September 7

  • 07:00 jeremyb: testing 1,2,3