Release Engineering/SAL
2016-06-04
- 00:09 Krinkle: krinkle@integration-slave-trusty-1017:~$ sudo rm -rf /mnt/jenkins-workspace/workspace/mediawiki-extensions-hhvm/src/extensions/Babel (T86730)
2016-06-03
- 19:18 hashar: Image ci-jessie-wikimedia-1464981111 in wmflabs-eqiad is ready Zend 5.x for qunit | T136301
- 15:17 hashar: refreshed Nodepool Trusty image due to an imagemagick upgrade issue. Image ci-trusty-wikimedia-1464966671 in wmflabs-eqiad is ready (see the sketch below)
- 10:40 hashar: scandium (zuul merger): rm -fR /srv/ssd/zuul/git/mediawiki/extensions/Collection T136930
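Nodepool image refreshes like the ones above are driven through the Nodepool CLI; a minimal sketch, with subcommand names assumed from Nodepool 0.1.x rather than taken from this log:

    # rebuild the snapshot image for a provider/image pair (CLI form assumed)
    nodepool image-update wmflabs-eqiad ci-trusty-wikimedia
    # check that the new snapshot shows up as ready
    nodepool image-list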
2016-06-02
- 12:10 hashar: Upgraded Zuul upstream code being 66c8e52..30a433b package is 2.1.0-151-g30a433b-wmf1precise1
2016-06-01
- 17:49 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/292186
- 16:45 tgr: enabling AuthManager on beta cluster
- 15:20 legoktm: deploying https://gerrit.wikimedia.org/r/292153
- 14:44 twentyafterfour: jenkins restart completed
- 14:36 twentyafterfour: restarting jenkins to install "single use slave" plugin (jenkins will restart when all builds are finished)
- 13:49 hashar: Beta: clearing temporary files under /data/project/upload7 (mainly wikimedia/commons/temp; see the sketch below)
- 10:29 hashar: Upgraded Linux kernel on deployment-salt02 T136411
- 10:14 hashar: beta: salt-key -d deployment-salt.deployment-prep.eqiad.wmflabs T136411
- 09:16 hashar: Enabling puppet again on Trusty slaves. Chromium is now properly pinned to version 49 ( https://gerrit.wikimedia.org/r/#/c/291116/3 | T136188 )
- 08:55 hashar: integration slaves : salt -v '*' pkg.upgrade
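A minimal sketch of the 13:49 temporary-file cleanup above; the age threshold and the exact subdirectory layout are assumptions:

    # prune stale upload-stash files on beta (threshold assumed)
    find /data/project/upload7/wikimedia/commons/temp -type f -mtime +7 -delete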
2016-05-31
- 20:24 bd808: Reloading zuul to pick up I58f878f3fd19dfa21a46a52464575cb06aacbb22
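Reloading Zuul (as opposed to restarting it) makes it re-read its layout without dropping queued jobs. A sketch of the usual sequence, assuming the standard init script and the layout living in a git checkout on the Zuul server:

    cd /etc/zuul/wikimedia && sudo git pull   # pull the merged layout change (path assumed)
    sudo service zuul reload                  # signal zuul to re-read the layout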
2016-05-30
- 18:39 hashar: Upgraded our Jenkins Job Builder fork to 1.5.0 + a couple of cherry picks: cd63874...10f2bcd
- 12:53 hashar: Upgrading Zuul 1cc37f7..66c8e52 T128569
- 08:04 ori: zuul is back up but jobs which were enqueued are gone
- 07:50 ori: restarting jenkins on gallium, too
- 07:49 ori: restarted zuul-merger service on gallium
- 07:44 ori: Disconnecting and then reconnecting Gearman from Jenkins did not appear to do anything; going to depool / repool nodes.
- 07:42 ori: Temporarily disconnecting Gearman from Jenkins, per <https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues>
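The 07:42-08:04 recovery above roughly maps to the following commands on gallium; a sketch only: service names are inferred from other entries in this log, and the Gearman disconnect/reconnect itself happens in the Jenkins management UI:

    sudo service zuul-merger restart   # 07:49 step
    sudo service jenkins restart       # 07:50 step
    sudo service zuul restart          # brings Zuul back, but already-enqueued jobs are lost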
2016-05-28
- 04:43 ori: depooling integration-slave-trusty-1015 to profile phpunit runs
2016-05-27
- 19:29 hasharAway: Refreshed Nodepool images
- 18:13 thcipriani: restarting zuul for deadlock
- 18:00 thcipriani: Reloading Zuul to deploy I0c3aeacf92d430ad1272f5f00e7fb7182b8a05bf
- 02:55 bd808: Deleted deployment-fluorine:/srv/mw-log/archive/*-20160[34]* logs; freed 26G
2016-05-26
- 22:23 hashar: salt -v '*trusty*' cmd.run 'puppet agent --disable "Chromium needs to be v49. See T136188"'
- 21:47 hashar: integration-slave-trusty-1015 still on Chromium 50 .. T136188
- 21:42 hashar: downgrading chromium-browser on integration-slave-trusty-1015 T136188
- 09:24 jzerebecki: reloading zuul for d38ad0a..6798539
- 07:48 gehel: deployment-prep upgrading elasticsearch to 2.3.3 and restarting (T133124)
- 07:36 dcausse: deployment-prep elastic: updating cirrussearch warmers (T133124)
- 07:31 gehel: deployment-prep deploying new elasticsearch plugins (T133124)
2016-05-25
- 22:38 Amir1: running puppet agent manually on sca01
- 16:26 hashar: 2016-05-25 16:24:35,491 INFO nodepool.image.build.wmflabs-eqiad.ci-trusty-wikimedia: Notice: /Stage[main]/Main/Package[ruby-jsduck]/ensure: ensure changed 'purged' to 'present' T109005
- 15:07 hashar: g++ added to Jessie and Trusty Nodepool instances | T119143
- 14:12 hashar: Regenerating Nodepool snapshot to include g++ which is required by some NodeJS native modules T119143
- 10:58 hashar: Updating Nodepool ci-jessie-wikimedia snapshot image to get netpbm package installed into it. T126992 https://gerrit.wikimedia.org/r/290651
- 09:30 hashar: Clearing git-sync-upstream script on integration-slave-trusty-1013 and integration-slave-trusty-1017. That is only supposed to be on the puppetmaster
- 09:15 hashar: Fixed resolv.conf on integration-slave-trusty-1013 and force running puppet to catch up with change since May 16 19:52
- 09:11 hashar: restarting puppetmaster on integration-puppetmaster ( memory leak / can not fork)
2016-05-24
- 07:03 mobrovac: rebooting deployment-tin, can't log in
2016-05-23
- 19:35 hashar: killed all mysqld processes on Trusty CI slaves (see the sketch below)
- 15:49 thcipriani: beta code update not running, disconnect-reconnect dance resulted in: [05/23/16 15:48:39] [SSH] Authentication failed.
- 14:32 jzerebecki: offlined integration-slave-trusty-1004 because it can't connect to mysql T135997
- 13:32 hashar: Upgrading Jenkins git plugins and restarting Jenkins
- 11:01 hashar: Upgrading hhvm on Trusty slaves. Brings in hhvm compiled against libicu52 instead of libicu48
- 09:12 _joe_: deployment-prep: all hhvm hosts in beta upgraded to run on the newer libicu; now running updateCollation.php (T86096)
- 09:11 hashar: Image ci-jessie-wikimedia-1463994307 in wmflabs-eqiad is ready
- 09:01 hashar: Image ci-trusty-wikimedia-1463993508 in wmflabs-eqiad is ready
- 08:56 _joe_: deployment-prep: starting upgrade of HHVM to a version linked to libicu52, T86096
- 08:54 hashar: Regenerating Nodepool image manually. Broke over the week-end due to a hhvm/libicu transition. Should get pip 8.1.x now
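Mass-killing a daemon across CI slaves, as in the 19:35 mysqld entry above, is normally done from the saltmaster; a minimal sketch where the pkill invocation is an assumption:

    # kill every mysqld on the Trusty CI slaves (exact kill command assumed)
    salt -v '*trusty*' cmd.run 'pkill mysqld'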
2016-05-20
- 20:30 bd808: Killing https://integration.wikimedia.org/ci/job/mediawiki-extensions-qunit/43608/ which has been running for 5 hours
2016-05-19
- 16:47 thcipriani: deployment-tin jenkins worker seems to be back online after some prodding
- 16:41 thcipriani: beta-code-update eqiad hung for past few hours
- 15:16 hashar: Restarted zuul-merger daemons on both gallium and scandium : file descriptors leaked
- 11:59 hashar: CI: salt -v '*' cmd.run 'pip install --upgrade pip==8.1.2'
- 11:54 hashar: Upgrading pip on CI slaves from 7.0.1 to 8.1.2 https://gerrit.wikimedia.org/r/#/c/289639/
- 10:15 hashar: puppet broken on deployment-tin : Error: Could not retrieve catalog from remote server: Error 400 on SERVER: Invalid parameter trusted_group on node deployment-tin.deployment-prep.eqiad.wmflabs
2016-05-18
- 13:16 Amir1: deploying a05e830 to ores nodes (sca01 and ores-web)
- 12:46 urandom: (re)cherry-picking c/284078 to deployment-prep
- 11:36 hashar: Restarted qa-morebots
- 11:36 hashar: Marked mediawiki/core/vendor repository as hidden in Gerrit. It got moved to mediawiki/vendor including the whole history. Settings page: https://gerrit.wikimedia.org/r/#/admin/projects/mediawiki/core/vendor
2016-05-13
- 14:39 thcipriani: remove shadow l10nupdate user from deployment-tin and mira in beta
- 10:20 hashar: Put integration-slave-trusty-1004 offline. Ssh/passwd is borked T135217
- 09:59 hashar: Deleting non nodepool mediawiki PHPUnit jobs for T135001 (mediawiki-phpunit-hhvm mediawiki-phpunit-parsertests-hhvm mediawiki-phpunit-parsertests-php55 mediawiki-phpunit-php55)
- 04:06 thcipriani|afk: changing ownership of mwdeploy public keys after shadow mwdeploy user removal is important
- 03:47 thcipriani|afk: ldap failure has created a shadow mwdeploy user on beta, deleted using vipw
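The two shadow-user entries above (l10nupdate, mwdeploy) come from an LDAP failure leaving stray local passwd entries; vipw edits the passwd database with proper locking. A sketch, assuming the duplicate is a plain local entry:

    sudo vipw      # remove the duplicate user line from /etc/passwd
    sudo vipw -s   # remove the matching line from /etc/shadow, if any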
2016-05-12
- 22:53 bd808: Started dead mysql on integration-slave-precise-1011
2016-05-11
- 21:05 hashar: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/288128 #T134946
- 20:26 hashar: integration-slave-trusty-1016 is back up after reboot
- 20:15 hashar: rebooting integration-slave-trusty-1016, unreachable somehow
- 16:43 hashar: Reduced number of executors on Trusty instances from 3 to 2. Memory gets exhausted, causing the tmpfs to drop files and thus MW jobs to fail randomly.
- 13:33 hashar: Added contint::packages::php to Nodepool images T119139
- 12:59 hashar: Dropping texlive and its dependencies from gallium.
- 12:52 hashar: deleted integration-dev
- 12:51 hashar: creating integration-dev instance to hopefully have Shinken clean itself
- 11:42 hashar: rebooting deployment-aqs01 via wikitech T134981
- 10:46 hashar: beta/ci puppetmaster : deleting old tags in /var/lib/git/operations/puppet and repacking the repos
- 08:49 hashar: Deleting instances deployment-memc02 and deployment-memc03 (Precise instances, migrated to Jessie) #T134974
- 08:43 hashar: Beta: switching memcached to new Jessie servers by cherry picking https://gerrit.wikimedia.org/r/#/c/288156/ and running puppet on mw app servers #T134974
- 08:20 hashar: Creating deployment-memc04 and deployment-memc05 to switch beta cluster memcached to Jessie. m1.medium with security policy "cache" T134974
- 01:44 matt_flaschen: Created Flow-specific External Store tables (blobs_flow1) on all wiki databases on Beta Cluster: T128417
2016-05-10
- 19:17 hashar: beta / CI purging old Linux kernels: salt -v '*' cmd.run 'dpkg -l|grep ^rc|awk "{ print \$2 }"|grep linux-image|xargs dpkg --purge'
- 17:34 cscott: updated OCG to version b0c57a1c6890e9fa1f2c3743fc14cb6a7f244fc3
- 16:44 bd808: Cleaned up 8.5G of pbuilder tmp output on integration-slave-jessie-1001 with `sudo find /mnt/pbuilder/build -maxdepth 1 -type d -mtime +1 -exec rm -r {} \+`
- 16:35 bd808: https://integration.wikimedia.org/ci/job/debian-glue failure on integration-slave-jessie-1001 due to /mnt being 100% full
- 14:20 hashar: deployment-puppetmaster mass cleaned packages/service/users etc T134881
- 13:54 moritzm: restarted zuul-merger on scandium for openssl update
- 13:52 moritzm: restarting zuul on gallium for openssl update
- 13:51 moritzm: restarted apache and zuul-merger on gallium for openssl update
- 13:48 hashar: deployment-puppetmaster : dropping role::ci::jenkins_access role::ci::slave::labs and role::ci::slave::labs::common T134881
- 13:46 hashar: Deleting Jenkins slave deployment-puppetmaster T134881
- 13:45 hashar: Change https://integration.wikimedia.org/ci/job/beta-build-deb/ job to use label selector "DebianGlue && DebianJessie" instead of "BetaDebianRepo" T134881
- 13:33 hashar: Migrating all debian glue jobs to Jessie permanent slaves T95545
- 13:30 hashar: Adding integration-slave-jessie-1002 in Jenkins. it is all puppet compliant
- 12:59 thcipriani|afk: triggering puppet run on scap targets in beta for https://gerrit.wikimedia.org/r/#/c/287918/ cherry pick
- 09:07 hashar: fixed puppet.conf on deployment-cache-text04
2016-05-09
- 20:58 hashar: Unbroke puppet on integration-raita.integration.eqiad.wmflabs . Puppet was blocked because role::ci::raita was no more. Fixed by rebasing https://gerrit.wikimedia.org/r/#/c/208024 T115330
- 20:13 hashar: beta: salt -v '*' cmd.run 'dpkg --purge libganglia1 ganglia-monitor; rm -fR /etc/ganglia' # T134808
- 20:06 hashar: CI, removing ganglia configuration entirely via: salt -v '*' cmd.run 'rm -fRv /etc/ganglia' # T134808
- 20:04 hashar: CI, removing ganglia configuration entirely via: salt -v '*' cmd.run 'dpkg --purge ganglia-monitor' # T134808
- 16:32 jzerebecki: reloading zuul for 3e2ab56..d663fd0
- 15:39 andrewbogott: migrating deployment-fluorine to labvirt1009
- 15:39 hashar: Adding label contintLabsSlave to integration-slave-jessie-1001 and integration-slave-jessie-1002
- 15:26 hashar: Creating integration-slave-jessie-1001 T95545
2016-05-06
- 19:45 urandom: Restart cassandra-metrics-collector on deployment-restbase0[1-2]
- 19:41 urandom: Rebasing 02ae1757 on deployment-puppetmaster : T126629
2016-05-05
- 22:09 MaxSem: Promoted Yurik and Jgirault to sysops on beta enwiki. Through shell because logging in is broken for me.
2016-05-04
- 21:28 cscott: deployed puppet FQDN domain patch for OCG: https://gerrit.wikimedia.org/r/286068 and restarted ocg on deployment-pdf0[12]
- 15:03 hashar: beta-scap: deployment-tin.deployment-prep.eqiad.wmflabs Name or service not known
- 12:24 hashar: deleting Jenkins job mediawiki-core-phpcs , replaced by Nodepool version mediawiki-core-phpcs-trusty T133976
- 12:11 hashar: beta: restarted nginx on varnish caches ( systemctl restart nginx.service ) since they were not listening on port 443 #T134362
- 11:07 hashar: restarted CI puppetmaster (out of memory leak)
- 10:57 hashar: CI: mass upgrading deb packages
- 10:53 hashar: beta: clearing out leftover apt conf that points to unreachable web proxy : salt -v '*' cmd.run "find /etc/apt -name '*-proxy' -delete"
- 10:48 hashar: Manually fixing nginx upgrade on deployment-cache-text04 and deployment-cache-upload04 see T134362 for details
- 09:27 hashar: deployment-cache-text04 systemctl stop varnish-frontend.service . To clear out all the stuck CLOSE_WAIT connections T134346
- 08:33 hashar: fixed puppet on deployment-cache-text04 (race condition generating puppet.conf )
2016-05-03
- 23:21 bd808: Changed "Maximum Number of Retries" for ssh agent launch in jenkins for deployment-tin from "0" to "10"
- 23:01 twentyafterfour: rebooting deployment-tin
- 23:00 bd808: Jenkins agent on deployment-tin not spawning; investigating
- 20:02 hashar: Restarting Jenkins
- 16:49 hashar: Notice: /Stage[main]/Contint::Packages::Python/Package[pypy]/ensure: ensure changed 'purged' to 'present' | T134235
- 16:46 hashar: Refreshing Nodepool Jessie image to have it include pypy | T134235 poke @jayvdb
- 14:49 mobrovac: deployment-tin rebooting it
- 14:25 hashar: beta salt -v '*' pkg.upgrade
- 14:19 hashar: beta: added unattended upgrade to Hiera::deployment-prep
- 13:30 hashar: Restarted nslcd on deployment-tin , pam was refusing authentication for some reason
- 13:29 hashar: beta: got rid of a leftover Wikidata/Wikibase patch that broke scap salt -v 'deployment-tin*' cmd.run 'sudo -u jenkins-deploy git -C /srv/mediawiki-staging/php-master/extensions/Wikidata/ checkout -- extensions/Wikibase/lib/maintenance/populateSitesTable.php'
- 13:23 hashar: deployment-tin force upgraded HHVM from 3.6 to 3.12
- 09:42 hashar: adding puppet class contint::slave_scripts to deployment-sca01 and deployment-sca02 . Ships multigit.sh T134239
- 09:31 hashar: Deleting CI slave deployment-cxserver03 , added deployment-sca01 and deployment-sca02 in Jenkins. T134239
- 09:28 hashar: deployment-sca01 removing puppet lock /var/lib/puppet/state/agent_catalog_run.lock and running puppet again
- 09:26 hashar: Applying puppet class role::ci::slave::labs::common on deployment-sca01 and deployment-sca02 (cxserver and parsoid being migrated T134239 )
- 03:33 kart_: Deleted deployment-cxserver03, replaced by deployment-sca0x
2016-05-02
- 21:27 cscott: updated OCG to version b775e612520f9cd4acaea42226bcf34df07439f7
- 21:26 hashar: Nodepool is acting just fine: Demand from gearman: ci-trusty-wikimedia: 457 | <AllocationRequest for 455.0 of ci-trusty-wikimedia>
- 21:25 hashar: restarted qa-morebots "2016-05-02 21:22:23,599 ERROR: Died in main event loop"
- 21:23 hashar: gallium: enqueued 488 jobs directly in Gearman. That is to test https://gerrit.wikimedia.org/r/#/c/286462/ ( mediawiki/extensions to hhvm/zend5.5 on Nodepool). Progress /home/hashar/gerrit-286462.log
- 20:14 hashar: MediaWiki phpunit jobs to run on Nodepool instances \O/
- 16:41 urandom: Forcing puppet run and restarting Cassandra on deployment-restbase0[1-2] : T126629
- 16:40 urandom: Cherry-picking https://gerrit.wikimedia.org/r/operations/puppet refs/changes/78/284078/12 to deployment-puppetmaster : T126629
- 16:24 urandom: Restart Cassandra on deployment-restbase0[1-2] : T126629
- 16:21 urandom: forcing puppet run on deployment-restbase0[1-2] : T126629
- 16:21 urandom: cherry-picking latest refs/changes/78/284078/11 onto deployment-puppetmaster : T126629
- 09:44 hashar: On zuul-merger instances (gallium / scandium), cleared out pywikibot/core working copy ( rm -fR /srv/ssd/zuul/git/pywikibot/core/ ) T134062
2016-04-30
- 18:31 Amir1: deploying d4f63a3 from github.com/wiki-ai/ores-wikimedia-config into targets in beta cluster via scap3
2016-04-29
- 16:37 jzerebecki: restarting zuul for 4e9d180..ebb191f
- 15:45 hashar: integration: deleting integration-trusty-1026 and cache-rsync . Maybe that will clear them up from Shinken
- 15:14 hashar: integration: created 'cache-rsync' and 'integration-trusty-1026', attempting to get Shinken to deprovision them
2016-04-28
- 22:03 urandom: deployment-restbase01 upgrade to 2.2.6 complete : T126629
- 21:56 urandom: Stopping Cassandra on deployment-restbase01, upgrading package to 2.2.6, and forcing puppet run : T126629
- 21:55 urandom: Snapshotting Cassandra tables on deployment-restbase01 (name = 1461880519833) : T126629
- 21:52 urandom: Forcing puppet run on deployment-restbase02 : T126629
- 21:51 urandom: Cherry picking operations/puppet refs/changes/78/284078/10 to puppetmaster : T126629
- 20:46 urandom: Starting Cassandra on deployment-restbase02 (now v2.2.6) : T126629
- 20:41 urandom: Re-enable puppet and force run on deployment-restbase02 : T126629
- 20:38 urandom: Halting Cassandra on deployment-restbase02, masking systemd unit, and upgrading package(s) to 2.2.6 : T126629
- 20:37 urandom: Snapshotting Cassandra tables on deployment-restbase02 (snapshot name = 1461875833996) : T126629
- 20:33 urandom: Cassandra on deployment-restbase01.deployment-prep started : T126629
- 20:25 urandom: Restarting Cassandra on deployment-restbase01.deployment-prep : T126629
- 20:14 urandom: Re-enable puppet on deployment-restbase01.deployment-prep, and force a run : T126629
- 20:12 urandom: cherry-picking https://gerrit.wikimedia.org/r/#/c/284078/ to deployment-puppetmaster : T126629
- 20:06 urandom: Disabling puppet on deployment-restbase0[1-2].deployment-prep : T126629
- 14:43 hashar: Rebuild Nodepool Jessie image. Comes with hhvm
- 12:52 hashar: Puppet is happy on deployment-changeprop
- 12:47 hashar: apt-get upgrade deployment-changeprop (outdated exim package)
- 12:42 hashar: Rebuild Nodepool Trusty instance to include the PHP wrapper script T126211
2016-04-27
- 23:57 thcipriani: nodepool instances running again after an openstack rabbitmq restart by andrewbogott
- 22:51 duploktm: also ran openstack server delete ci-jessie-wikimedia-85342
- 22:42 legoktm: nodepool delete 85342
- 22:41 matt_flaschen: Deployed https://gerrit.wikimedia.org/r/#/c/285765/ to enable External Store everywhere on Beta Cluster
- 22:38 legoktm: stop/started nodepool
- 22:36 thcipriani: I don't have permission to restart nodepool
- 22:35 thcipriani: restarting nodepool
- 22:18 matt_flaschen: Deployed https://gerrit.wikimedia.org/r/#/c/282440/ to switch Beta Cluster to use External Store for new testwiki writes
- 21:00 hashar: thcipriani downgraded git plugins successfully (we wanted to rule out their upgrade for some weird issue)
- 20:13 cscott: updated OCG to version e39e06570083877d5498da577758cf8d162c1af4
- 14:10 hashar: restarting Jenkins
- 14:09 hashar: Jenkins upgrading credential plugin 1.24 > 1.27 And Credentials binding plugin 1.6 > 1.7
- 14:07 hashar: Jenkins upgrading git plugin 2.4.1 > 2.4.4
- 14:01 hashar: Jenkins upgrading git client plugin 1.19.1. > 1.19.6
- 13:13 jzerebecki: reloading zuul for 81a1f1a..0993349
- 11:43 hashar: fixed puppet on deployment-cache-text04 T132689
- 10:38 hashar: Rebuild done. Image ci-trusty-wikimedia-1461753210 in wmflabs-eqiad is ready
- 09:43 hashar: tmh01.deployment-prep.eqiad.wmflabs denies mwdeploy user, breaking https://integration.wikimedia.org/ci/job/beta-scap-eqiad/
2016-04-26
- 20:45 hashar: Regenerating Nodepool Jessie snapshot to include composer and HHVM | T128092
- 20:23 jzerebecki: reloading zuul for eb480d8..81a1f1a
- 19:25 jzerebecki: reload zuul for 4675213..eb480d8
- 14:18 hashar: Applied security patches to 1.27.0-wmf.22 | T131556
- 12:39 hashar: starting cut of 1.27.0-wmf.22 branch ( poke ostriches )
- 10:29 hashar: restored integration/phpunit on CI slaves due to https://integration.wikimedia.org/ci/job/operations-mw-config-phpunit/ failing
- 09:11 hashar: CI is back up!
- 08:20 hashar: shutoff instance castor, does not seem to be able to start again :( | T133652
- 08:12 hashar: hard rebooting castor instance | T133652
- 08:10 hashar: soft rebooting castor instance | T133652
- 08:06 hashar: CI jobs deadlocked due to castor being unavailable | https://phabricator.wikimedia.org/T133652
- 00:46 thcipriani: temporary keyholder fix in place in beta
- 00:18 thcipriani: beta-scap-eqiad failure due to bad keyholder-auth.d fingerprints
2016-04-25
- 20:58 cscott: updated OCG to version 58a720508deb368abfb7652e6a8c7225f95402d2
- 19:46 hashar: Nodepool now has a couple of trusty instances intended to experiment with the Zend 5.5 / HHVM migration. https://phabricator.wikimedia.org/T133203#2236625
- 13:34 hashar: Nodepool is attempting to create a Trusty snapshot with name ci-trusty-wikimedia-1461591203 | T133203
- 13:15 hashar: openstack image create --file /home/hashar/image-trusty-20160425T124552Z.qcow2 ci-trusty-wikimedia --disk-format qcow2 --property show=true # T133203
- 10:38 hashar: Refreshing Nodepool Jessie snapshot based on new image
- 10:35 hashar: Refreshed Nodepool Jessie image ( image-jessie-20160425T100035Z )
- 09:24 hashar: beta / scap failure filed as T133521
- 09:20 hashar: Keyholder / mwdeploy ssh keys have been messed up on beta cluster somehow :-(
- 08:47 hashar: mwdeploy@deployment-tin has lost ssh host keys file :(
2016-04-24
- 17:14 jzerebecki: reloading e06f1fe..672fc84
2016-04-22
- 18:13 legoktm: deploying https://gerrit.wikimedia.org/r/284841
- 08:13 legoktm: deploying https://gerrit.wikimedia.org/r/284860
2016-04-21
- 19:07 thcipriani: scap version testing should be done, puppet should no longer be disabled on hosts
- 18:02 thcipriani: disabling puppet on scap targets to test scap_3.1.0-1+0~20160421173204.70~1.gbp6706e0_all.deb
2016-04-20
- 22:28 thcipriani: rolling back scap version in beta, legit failure :(
- 21:52 thcipriani: testing new scap version in beta on deployment-tin
- 17:54 thcipriani: Reloading Zuul to deploy gerrit:284494
- 13:58 hashar: Stopping HHVM on CI slaves by cherry picking a couple puppet patches | T126594
- 13:33 hashar: salt -v '*trusty*' cmd.run 'rm /usr/lib/x86_64-linux-gnu/hhvm/extensions/current' # Cleanup on CI slaves for T126658
- 13:27 hashar: Restarted integration puppet master service (out of memory / mem leak)
2016-04-17
- 01:01 legoktm: deploying https://gerrit.wikimedia.org/r/283837
2016-04-16
- 14:21 Krenair: restarted qa-morebots per request
- 14:18 Krenair: <jzerebecki> !log reloading zuul for 3f64dbd..c6411a1
2016-04-13
- 01:48 legoktm: deploying https://gerrit.wikimedia.org/r/282952
2016-04-12
- 19:47 bd808: Cleaned up large hhbc cache file on deployment-mediawiki03 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
- 19:47 bd808: Cleaned up large hhbc cache file on deployment-mediawiki02 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
- 19:46 bd808: Cleaned up large hhbc cache file on deployment-mediawiki01 via `sudo service hhvm stop; sudo rm /var/cache/hhvm/fcgi.hhbc.sq3; sudo service hhvm start`
- 19:10 Amir1: manually rebooted deployment-ores-web
- 19:08 Amir1: manually cherry-picked 282992/2 into the puppetmaster
- 17:05 Amir1: ran puppet agent manually on sca01 in the /srv directory
- 11:34 hashar: Jenkins upgrading "Script Security Plugin" from 1.17 to 1.18.1 https://wiki.jenkins-ci.org/display/SECURITY/Jenkins+Security+Advisory+2016-04-11
2016-04-11
- 21:23 csteipp: deployed and reverted oath
- 20:30 thcipriani: relaunched slave-agent on integration-slave-trusty-1025, back online
- 20:19 thcipriani: integration-slave-trusty-1025 horizon console filled with INFO: task jbd2/vda1-8:170 blocked for more than 120 seconds. rebooting
- 20:13 thcipriani: killing stuck jobs, marking integration-slave-trusty-1025 as offline temporarily
- 14:42 thcipriani: deployment-mediawiki01 disk full :(
2016-04-08
- 22:46 matt_flaschen: Created blobs1 table for all wiki DBs on Beta Cluster
- 14:34 hashar: Image ci-jessie-wikimedia-1460125717 in wmflabs-eqiad is ready, adds package 'unzip' | T132144
- 12:49 hashar: Image ci-jessie-wikimedia-1460119481 in wmflabs-eqiad is ready , adds package 'zip' | T132144
- 09:30 hashar: Removed label hasAndroidSdk from gallium. That prevents that slave from sometimes running the job apps-android-commons-build
- 08:42 hashar: Rebased puppet master and fixed conflict with https://gerrit.wikimedia.org/r/#/c/249490/
2016-04-07
- 20:16 hashar: deployment-mediawiki02.deployment-prep.eqiad.wmflabs , cleared up random left over stuff / big logs etc
- 20:08 hashar: deployment-mediawiki02.deployment-prep.eqiad.wmflabs / is full
2016-04-05
- 23:56 marxarelli: Removed cherry-pick and rebased /var/lib/git/operations/puppet on integration-puppetmaster after merge of https://gerrit.wikimedia.org/r/#/c/281706/
- 21:58 marxarelli: Restarting puppetmaster on integration-puppetmaster
- 21:53 marxarelli: Cherry picked https://gerrit.wikimedia.org/r/#/c/281706/ on integration-puppetmaster and applying on integration-slave-trusty-1014
- 10:32 hashar: gallium removing texlive
- 10:29 hashar: gallium removing libav / ffmpeg. No longer needed since jobs no longer run on that server
2016-04-04
- 17:30 greg-g: Phabricator going down in about 10 minutes to hopefully address the overheating issue: T131742
- 10:06 hashar: integration: salt -v '*-slave*' cmd.run 'rm /usr/local/bin/grunt; rm -fR /usr/local/lib/node_modules/grunt-cli' | T124474
- 10:04 hashar: integration: salt -v '*-slave*' cmd.run 'npm -g uninstall grunt-cli' | T124474
- 03:15 greg-g: Phabricator is down
2016-04-03
- 07:02 legoktm: deploying https://gerrit.wikimedia.org/r/281079
- 03:16 Amir1: manually rebooted deployment-ores-web and deployment-sca01
2016-04-02
- 22:58 Amir1: added local hack to puppetmaster to make scap3 provider more verbose
- 19:46 hashar: Upgrading Jenkins Gearman plugin to v2.0 , bring in diff registration for faster updates of Gearman server
- 14:39 Amir1: manually added 281170/5 to beta puppetmaster
- 14:22 Amir1: manually added 281161/1 to beta puppetmaster
- 11:31 Reedy: deleted archived logs older than 30 days from deployment-fluorine
2016-04-01
- 22:16 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/281046
- 21:13 hashar: Image ci-jessie-wikimedia-1459544873 in wmflabs-eqiad is ready
- 20:57 hashar: Refreshing Nodepool snapshot to hopefully get npm 2.x installed T124474
- 20:37 hashar: Added Luke081515 as a member of deployment-prep (beta cluster) labs project
- 20:31 hashar: Dropping grunt-cli from the permanent slaves. People can have it installed by listing it in their package.json devDependencies https://gerrit.wikimedia.org/r/#/c/280974/
- 14:06 hashar: integration: removed sudo policy permitting sudo as any member of the project for any member of the project, which included jenkins-deploy user
- 14:05 hashar: integration: removed sudo policy permitting sudo as root for any member of the project, which included jenkins-deploy user
- 11:23 bd808: Freed 4.5G on deployment-fluorine:/srv/mw-log by deleting wfDebug.log
- 04:00 Amir1: manually rebooted deployment-sca01
- 00:16 csteipp: created oathauth_users table on centralauth db in beta
2016-03-31
- 21:19 legoktm: deploying https://gerrit.wikimedia.org/r/280756
- 13:52 hashar: rebasing integration puppetmaster (it had a merge commit)
- 01:40 Krinkle: Purged npm cache in integration-slave-trusty-1015:/mnt/home/jenkins-deploy/.npm; it was corrupted around March 23 19:00 for unknown reasons (T130895)
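A minimal sketch of that purge; the path comes from the log entry, while running it as jenkins-deploy is an assumption:

    # wipe the corrupted npm cache on the slave
    sudo -u jenkins-deploy rm -rf /mnt/home/jenkins-deploy/.npm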
2016-03-30
- 19:32 twentyafterfour: deleted some nutcracker and hhvm log files on deployment-mediawiki01 to free space
- 15:37 hashar: Gerrit has trouble sending emails T131189
- 13:48 Reedy: deployment-prep Make that deployment-tmh01
- 13:48 Reedy: deployment-prep upgrade hhvm on deployment-mediawiki01 and reboot
- 13:35 Reedy: deployment-prep upgrade hhvm on deployment-mediawiki03 and reboot
- 12:16 gehel: deployment-prep restarting varnish on deployment-cache-text04
- 11:04 Amir1: cherry-picked 280413/1 in beta puppetmaster, manually running puppet agent in deployment-ores-web
- 10:22 Amir1: cherry-picking 280403 to beta puppetmaster and manually running puppet agent in deployment-ores-web
2016-03-29
- 23:22 marxarelli: running jenkins-jobs update config/ 'mwext-donationinterfacecore125-testextension-zend53' to deploy https://gerrit.wikimedia.org/r/#/c/280261/
- 19:52 Amir1: manually updated puppetmaster, deleted SSL cert key in deployment-ores-web in VM, running puppet agent manually
- 02:20 jzerebecki: reloading zuul for 46923c8..c0937ee
2016-03-26
- 22:38 jzerebecki: reloading zuul for 2d7e050..46923c8
2016-03-25
- 23:55 marxarelli: deleting instances integration-slave-trusty-1002 and integration-slave-trusty-1005
- 23:54 marxarelli: deleting jenkins nodes integration-slave-trusty-1002 and integration-slave-trusty-1005
- 23:41 marxarelli: completed rolling manual deploy of https://gerrit.wikimedia.org/r/#/c/279640/ to trusty slaves
- 23:27 marxarelli: starting rolling offline/remount/online of trusty slaves to increase tmpfs size
- 23:22 marxarelli: pooled new trusty slaves integration-slave-trusty-1024 and integration-slave-trusty-1025
- 23:13 jzerebecki: reloading zuul for 0aec21d..2d7e050
- 22:14 marxarelli: creating new jenkins node for integration-slave-trusty-1024
- 22:11 marxarelli: rebooting integration-slave-trusty-{1024,1025} before pooling as replacements for trusty-1002 and trusty-1005
- 21:06 marxarelli: repooling integration-slave-trusty-{1005,1002} to help with load while replacement instances are provisioning
- 16:59 marxarelli: depooling integration-slave-trusty-1002 until DNS resolution can be fixed. still investigating disk space issue
2016-03-24
- 16:39 thcipriani: restarted rsync service on deployment-tin
- 13:45 thcipriani|afk: rearmed keyholder on deployment-tin (see the sketch below)
- 04:41 Krinkle: beta-update-databases-eqiad and beta-scap-eqiad stuck for over 8 hours (IRC notifier plugin deadlock)
- 03:28 Krinkle: beta-mediawiki-config-update-eqiad has been stuck 'queued' for over 5 hours.
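Rearming keyholder (the 13:45 entry) re-loads the deploy key into the shared ssh-agent after it is lost, e.g. across a reboot. A sketch, assuming the keyholder wrapper used on Wikimedia deployment hosts:

    sudo keyholder arm      # prompts for the key passphrase and loads the key
    sudo keyholder status   # verify the agent now holds the expected key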
2016-03-23
- 23:00 Krinkle: rm -rf integration-slave-trusty-1013:/mnt/home/jenkins-deploy/tmpfs/jenkins-2/karma-54925082/ (bad permissions, caused Karma issues)
- 19:02 legoktm: restarted zuul
2016-03-22
- 17:40 legoktm: deploying https://gerrit.wikimedia.org/r/278926
2016-03-21
- 21:55 hashar: zuul: almost all MediaWiki extensions migrated to run the npm job on Nodepool (with Node.js 4.3) T119143. All tested. Will monitor the overnight build results tomorrow
- 20:28 hashar: Mass running npm-node-4.3 jobs against MediaWiki extensions to make sure they all pass ( https://gerrit.wikimedia.org/r/#/c/278004/ | T119143 )
- 17:40 elukey: executed git rebase --interactive on deployment-puppetmaster.deployment-prep.eqiad.wmflabs to remove https://gerrit.wikimedia.org/r/#/c/278713/
- 15:46 elukey: hacked manually the cdh puppet submodule on deployment-puppetmaster.deployment-prep.eqiad.wmflabs - please let me know if interfere with anybody's tests
- 14:24 elukey: executed git submodule update --init on deployment-puppetmaster.deployment-prep.eqiad.wmflabs
- 11:25 elukey: beta: cherry picked https://gerrit.wikimedia.org/r/#/c/278713/ to test an updated to the cdh module (analytics)
- 11:13 hashar: beta: rebased puppet master which had a conflict on https://gerrit.wikimedia.org/r/#/c/274711/ which got merged meanwhile (saves Elukey )
- 11:02 hashar: beta: added Elukey (wikimedia ops) to the project as member and admin
2016-03-19
- 13:04 hashar: Jenkins: added ldap-labs-codfw.wikimedia.org as a fallback LDAP server T130446
2016-03-18
- 17:16 jzerebecki: reloading zuul for e33494f..89a9659
2016-03-17
- 21:10 thcipriani: updating scap on deployment-tin to test D133
- 18:31 cscott: updated OCG to version c1a8232594fe846bd2374efd8f7c20d7e97ac449
- 09:34 hashar: deployment-jobrunner01 deleted /var/log/apache/*.gz T130179
- 09:04 hashar: Upgrading hhvm and related extensions on jobrunner01 T130179
2016-03-16
- 14:28 hashar: Updated jobs having the package manager cache system (castor) via https://gerrit.wikimedia.org/r/#/c/277774/
2016-03-15
- 15:17 jzerebecki: added wikidata.beta.wmflabs.org in https://wikitech.wikimedia.org/wiki/Special:NovaAddress to deployment-cache-text04.deployment-prep.eqiad.wmflabs
- 14:19 hashar: Image ci-jessie-wikimedia-1458051246 in wmflabs-eqiad is ready T124447
- 14:14 hashar: Refreshing Nodepool snapshot images so they get a fresh copy of slave-scripts T124447
- 14:08 hashar: Deploying slave script change https://gerrit.wikimedia.org/r/#/c/277508/ "npm-install-dev.py: Use config.dev.yaml instead of config.yaml" for T124447
2016-03-14
- 22:18 greg-g: new jobs weren't processing in Zuul, lego fixed it and blamed Reedy
- 20:13 hashar: Updating Jenkins jobs mwext-Wikibase-* so they no longer rely on --with-phpunit (ping @hoo https://gerrit.wikimedia.org/r/#/c/277330/)
- 17:03 Krinkle: Doing full Zuul restart due to deadlock (T128569)
- 10:18 moritzm: re-enabled systemd unit for logstash on deployment-logstash2
2016-03-11
- 22:42 legoktm: deploying https://gerrit.wikimedia.org/r/276901
- 19:41 legoktm: legoktm@integration-slave-trusty-1001:/mnt/jenkins-workspace/workspace$ sudo rm -rf mwext-Echo-testextension-* # because it was broken
2016-03-10
- 20:22 hashar: Nodepool Image ci-jessie-wikimedia-1457641052 in wmflabs-eqiad is ready
- 20:19 hashar: Refreshing Nodepool to include the 'varnish' package T128188
- 20:05 hashar: apt-get upgrade integration-slave-jessie1001 (bring in ffmpeg update and nodejs among other things)
- 12:22 hashar: Nodepool Image ci-jessie-wikimedia-1457612269 in wmflabs-eqiad is ready
- 12:18 hashar: Nodepool: rebuilding image to get mathoid/graphoid packages included (hopefully) T119693 T128280
2016-03-09
- 17:56 bd808: Cleaned up git clone state in deployment-tin.deployment-prep:/srv/mediawiki-staging/php-master and queued beta-code-update-eqiad to try again (T129371)
- 17:48 bd808: Git clone at deployment-tin.deployment-prep:/srv/mediawiki-staging/php-master in completely horrible state. Investigating
- 17:22 bd808: Fixed https://integration.wikimedia.org/ci/job/beta-mediawiki-config-update-eqiad/4452/
- 17:19 bd808: Manually cleaning up broken rebase in deployment-tin.deployment-prep:/srv/mediawiki-staging
- 16:27 bd808: Removed cherry-pick of https://gerrit.wikimedia.org/r/#/c/274696 ; manually cleaned up systemd unit and restarted logstash on deployment-logstash2
- 14:59 hashar: Image ci-jessie-wikimedia-1457535250 in wmflabs-eqiad is ready T129345
- 14:57 hashar: Rebuilding snapshot image to get Xvfb enabled at boot time T129345
- 13:04 moritzm: cherrypicked patch to deployment-prep which provides a systemd unit for logstash
- 10:52 hashar: Image ci-jessie-wikimedia-1457520493 in wmflabs-eqiad is ready
- 10:29 hashar: Nodepool: created new image and refreshing snapshot in attempt to get Xvfb running T129320 T128090
2016-03-08
- 23:42 legoktm: running CentralAuth's checkLocalUser.php --verbose=1 --delete=1 on deployment-tin for T115198 (see the sketch after this block)
- 21:33 hashar: Nodepool Image ci-jessie-wikimedia-1457472606 in wmflabs-eqiad is ready
- 19:23 hashar: Zuul inject DISPLAY https://gerrit.wikimedia.org/r/#/c/273269/
- 16:03 hashar: Image ci-jessie-wikimedia-1457452766 is ready T128090
- 15:59 hashar: Nodepool: refreshing snapshot image to ship browsers+Xvfb for T128090
- 14:27 hashar: Mass refreshed CI slave-scripts 1d2c60d..e27c292
- 13:38 hashar: Rebased integration puppet master. Dropped a make-wmf-branch patch and the one for raita role
- 11:26 hashar: Nodepool: created new snapshot to set puppet $::labsproject : ci-jessie-wikimedia-1457436175 hoping to fix hiera lookup T129092
- 02:51 ori: deployment-prep Updating HHVM on deployment-mediawiki01
- 02:27 ori: deployment-prep Updating HHVM on deployment-mediawiki02
- 01:50 Krinkle: integration-saltmaster: salt -v '*slave-trusty*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/BlueSky' (T117710)
- 01:50 Krinkle: integration-saltmaster: salt -v '*slave-trusty*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm-composer/src/skins/BlueSky'
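Maintenance scripts on beta run through mwscript, as in the 23:42 checkLocalUser.php entry above; a sketch following the invocation pattern used elsewhere in this log, with the extension-relative path and the target wiki as assumptions:

    # check (and delete) stray local users against a given wiki
    mwscript extensions/CentralAuth/maintenance/checkLocalUser.php --wiki=enwiki --verbose=1 --delete=1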
2016-03-07
- 21:03 hashar: Nodepool upgraded to 0.1.1-wmf.4; it no longer waits 1 minute before deleting a used node | T118573
- 20:05 hashar: Upgrading Nodepool from 0.1.1-wmf.3 to 0.1.1-wmf.4 with andrewbogott | T118573
2016-03-06
- 10:20 legoktm: deploying https://gerrit.wikimedia.org/r/274911
2016-03-04
- 19:31 hashar: Nodepool Image ci-jessie-wikimedia-1457119603 in wmflabs-eqiad is ready - T128846
- 13:29 hashar: Nodepool Image ci-jessie-wikimedia-1457097785 in wmflabs-eqiad is ready
- 08:42 hashar: CI deleting integration-slave-precise-1001 (2 executors). It is not in labs DNS, which causes a bunch of issues; no need for the capacity anymore. T128802
- 02:49 Krinkle: Reloading Zuul to deploy https://gerrit.wikimedia.org/r/274889
- 00:11 Krinkle: salt -v --show-timeout '*slave*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"
2016-03-03
- 23:37 legoktm: salt -v --show-timeout '*slave*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"
- 22:34 legoktm: mysql not running on integration-slave-precise-1002, manually starting (T109704; see the sketch after this block)
- 22:30 legoktm: mysql not running on integration-slave-precise-1011, manually starting (T109704)
- 22:19 legoktm: mysql not running on integration-slave-precise-1012, manually starting (T109704)
- 22:07 legoktm: deploying https://gerrit.wikimedia.org/r/274821
- 21:58 Krinkle: Reloading Zuul to deploy (EventLogging and AdminLinks) https://gerrit.wikimedia.org/r/274821 /
- 18:49 thcipriani: killing deployment-bastion since it is no longer used
- 14:23 hashar: https://integration.wikimedia.org/ci/computer/integration-slave-trusty-1011/ is out of disk space
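For the "mysql not running" entries above, the manual start is just the init script (T109704 tracks the recurring deaths); a minimal sketch:

    sudo service mysql start    # bring mysqld back up on the slave
    sudo service mysql status   # confirm it is actually running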
2016-03-02
- 16:22 jzerebecki: reloading zuul for 9398fa1..943f17b
- 10:38 hashar: Zuul should no longer be caught in a death loop due to Depends-On on an event-schemas change. Hole filled with https://gerrit.wikimedia.org/r/#/c/274356/ T128569
- 08:53 hashar: gerrit set-account Jsahleen --inactive T108854
- 01:19 thcipriani: force restarting zuul because the queue is very stuck https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Restart
- 01:13 thcipriani: following steps for gearman deadlock: https://www.mediawiki.org/wiki/Continuous_integration/Zuul#Known_issues
2016-03-01
- 23:10 Krinkle: Updated Jenkins configuration to also support php5 and hhvm for Console Sections detection of "PHPUnit"
- 17:05 hashar: gerrit: set accounts inactive for Eloquence and Mgrover. Former WMF employees; their mail bounces back
- 16:41 hashar: Restarted Jenkins
- 16:32 hashar: A bunch of Jenkins jobs got stalled because I killed threads in Jenkins to unblock integration-slave-trusty-1003 :-(
- 12:14 hashar: integration-slave-trusty-1003 is back online
- 12:13 hashar: Might have killed the proper Jenkins thread to unlock integration-slave-trusty-1003
- 12:03 hashar: Jenkins cannot pool back integration-slave-trusty-1003. The Jenkins master has a bunch of blocking threads piling up, with hudson.plugins.sshslaves.SSHLauncher.afterDisconnect() locked somehow
- 11:41 hashar: Rebooting integration-slave-trusty-1003 (does not reply to salt / ssh)
- 10:34 hashar: Image ci-jessie-wikimedia-1456827861 in wmflabs-eqiad is ready
- 10:24 hashar: Refreshing Nodepool snapshot instances
- 10:22 hashar: Refreshing Nodepool base image to speed instances boot time (dropping open-iscsi package https://gerrit.wikimedia.org/r/#/c/273973/ )
2016-02-29
- 16:23 hashar: salt -v '*slave*' cmd.run 'rm -fR /mnt/jenkins-workspace/workspace/mwext*jslint' T127362
- 16:17 hashar: Deleting all mwext-.*-jslint jobs from Jenkins. Paladox has migrated all of them to jshint/jsonlint generic jobs T127362
- 09:46 hashar: Jenkins installing Yaml Axis Plugin 0.2.0
2016-02-28
- 01:30 Krinkle: Rebooting integration-slave-precise-1012 – Might help T109704 (MySQL not running)
2016-02-26
- 15:14 jzerebecki: salt -v --show-timeout '*slave*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'" T128191
- 14:44 hashar: (since it started, don't be that scared!)
- 14:44 hashar: Nodepool has triggered 40 000 instances
- 11:53 hashar: Restarted memcached on deployment-memc02 T128177
- 11:53 hashar: memcached process on deployment-memc02 seems to have a nice leak of socket usage (from lost) and plainly refuses connections (bunch of CLOSE_WAIT) T128177
- 11:40 hashar: deployment-memc04 find /etc/apt -name '*proxy' -delete (prevented apt-get update)
- 11:26 hashar: beta: salt -v '*' cmd.run 'apt-get -y install ruby-msgpack' . I am tired of seeing puppet debug messages: "Debug: Failed to load library 'msgpack' for feature 'msgpack'"
- 11:24 hashar: puppet keeps restarting nutcracker apparently T128177
- 11:20 hashar: Memcached error for key "enwiki:flow_workflow%3Av2%3Apk:63dc3cf6a7184c32477496d63c173f9c:4.8" on server "127.0.0.1:11212": SERVER HAS FAILED AND IS DISABLED UNTIL TIMED RETRY
2016-02-25
- 22:38 hashar: beta: maybe deployment-jobrunner01 is processing jobs a bit faster now. Seems like hhvm went wild
- 22:23 hashar: beta: jobrunner01 had apache/hhvm killed somehow .... Blame me
- 21:56 hashar: beta: stopped jobchron / jobrunner on deployment-jobrunner01 and restarting them by running puppet
- 21:49 hashar: beta did a git-deploy of jobrunner/jobrunner hoping to fix puppet run on deployment-jobrunner01 and apparently it did! T126846
- 11:21 hashar: deleting workspace /mnt/jenkins-workspace/workspace/browsertests-Wikidata-WikidataTests-linux-firefox-sauce on slave-trusty-1015
- 10:08 hashar: Jenkins upgraded T128006
- 01:44 legoktm: deploying https://gerrit.wikimedia.org/r/273170
- 01:39 legoktm: deploying https://gerrit.wikimedia.org/r/272955 (undeployed) and https://gerrit.wikimedia.org/r/273136
- 01:37 legoktm: deploying https://gerrit.wikimedia.org/r/273136
- 00:31 thcipriani: running puppet on beta to update scap to latest packaged version: sudo salt -b '10%' -G 'deployment_target:scap/scap' cmd.run 'puppet agent -t'
- 00:20 thcipriani: deployment-tin not accepting jobs for some time, ran through https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update, is back now
2016-02-24
- 19:55 legoktm: legoktm@deployment-tin:~$ mwscript extensions/ORES/maintenance/PopulateDatabase.php --wiki=enwiki
- 18:30 bd808: "configuration file '/etc/nutcracker/nutcracker.yml' syntax is invalid"
- 18:27 bd808: nutcracker dead on mediawiki01; investigating
- 17:20 hashar: Deleted Nodepool instances so new ones get to use the new snapshot ci-jessie-wikimedia-1456333979
- 17:12 hashar: Refreshing nodepool snapshot. Been stale since Feb 15th T127755
- 17:01 bd808: https://wmflabs.org/sal/releng missing SAL data since 2016-02-20T20:19 due to bot crash; needs to be backfilled from wikitech data (T127981)
- 16:43 hashar: SAL on elasticsearch is stale https://phabricator.wikimedia.org/T127981
- 15:07 hasharAW: beta app servers have lost access to memcached due to bad nutcracker conf | T127966
- 14:41 hashar: beta: we have lost a memcached server at 11:51am UTC
2016-02-23
- 22:45 thcipriani: deployment-puppetmaster is in a weird rebase state
- 22:25 legoktm: running sync-common manually on deployment-mediawiki02
- 09:59 hashar: Deleted a bunch of mwext-.*-jslint jobs that are no longer in use (migrated to either 'npm' or 'jshint' / 'jsonlint')
2016-02-22
- 22:06 bd808: Restarted puppetmaster service on deployment-puppetmaster to "fix" error "invalid byte sequence in US-ASCII"
- 17:46 jzerebecki: ssh integration-slave-trusty-1017.eqiad.wmflabs 'sudo -u jenkins-deploy rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/.git/config.lock'
- 16:47 gehel: deployment-prep upgrading deployment-logstash2 to elasticsearch 1.7.5
- 10:26 gehel: deployment-prep upgrading elastic-search to 1.7.5 on deployment-elastic0[5-8]
2016-02-20
- 20:19 Krinkle: beta-code-update-eqiad job repeatedly stuck at "IRC notifier plugin"
- 19:29 Krinkle: beta-code-update-eqiad broken because deployment-tin:/srv/mediawiki-staging/php-master/extensions/MobileFrontend/includes/MobileFrontend.hooks.php was modified on the server without commit (see the sketch below)
- 19:22 Krinkle: Various beta-mediawiki-config-update-eqiad jobs have been stuck 'queued' for > 24 hours
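The 19:29 breakage above is the classic case of a local modification blocking the update job. A sketch of the cleanup on the staging copy; the paths are from the log, and discarding via git checkout mirrors the approach logged on 2016-05-03 for Wikidata, though how this instance was actually resolved isn't recorded here:

    cd /srv/mediawiki-staging/php-master/extensions/MobileFrontend
    sudo -u jenkins-deploy git checkout -- includes/MobileFrontend.hooks.php   # drop the uncommitted edit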
2016-02-19
- 12:09 hashar: killed https://integration.wikimedia.org/ci/job/beta-code-update-eqiad/ which had been running for 13 hours. Blocked because the slave went offline due to labs reboots yesterday
- 10:15 hashar: Creating a bunch of repositories in GitHub to fix Gerrit replication errors
2016-02-18
- 19:20 legoktm: deploying https://gerrit.wikimedia.org/r/271583 and https://gerrit.wikimedia.org/r/271581, both no-ops
- 18:14 legoktm: deploying https://gerrit.wikimedia.org/r/271012
- 17:36 legoktm: deploying https://gerrit.wikimedia.org/r/271555
- 16:01 hashar: deleting instance integration-slave-precise-1003 think we have enough precise slaves
- 10:44 hashar: Nodepool: JenkinsException: Could not parse JSON info for server[1]
2016-02-17
- 07:36 legoktm: deploying https://gerrit.wikimedia.org/r/271201
- 01:01 yuvipanda: attempting to turn off NFS on 52 instances on deployment-prep project
2016-02-16
- 23:22 yuvipanda: new instances on deployment-prep no longer get NFS because of https://wikitech.wikimedia.org/w/index.php?title=Hiera%3ADeployment-prep&type=revision&diff=311783&oldid=311781
- 23:18 hashar: jenkins@gallium find /var/lib/jenkins/config-history/nodes -maxdepth 1 -type d -name 'ci-jessie*' -exec rm -vfR {} \;
- 23:17 hashar: Jenkins accepting slave creations again. Root cause is /var/lib/jenkins/config-history/nodes/ has reached the 32k inode limit.
- 23:14 hashar: Jenkins: Could not create rootDir /var/lib/jenkins/config-history/nodes/ci-jessie-wikimedia-34969/2016-02-16_22-40-23
- 23:02 hashar: Nodepool can not authenticate with Jenkins anymore. Thus it can not add slaves it spawned.
- 22:56 hashar: contint: Nodepool instances pool exhausted
- 21:14 andrewbogott: deployment-logstash2 migration finished
- 20:49 jzerebecki: reloading zuul for 3bf7584..67fec7b
- 19:58 andrewbogott: migrating deployment-logstash2 to labvirt1010
- 19:00 hashar: tin: checking out mw 1.27.0-wmf.14
- 15:23 hashar: integration-make-wmfbranch : /mnt/make-wmf-branch mount now has gid=wikidev and setgid (i.e. mode 2775)
- 15:20 hashar: integration-make-wmfbranch : change tmpfs to /mnt/make-wmf-branch (from /var/make-wmf-branch )
- 11:30 jzerebecki: T117710 integration-saltmaster:~# salt -v '*slave-trusty*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm-composer/src/skins/BlueSky'
- 09:52 hashar: will cut the wmf branches this afternoon starting around 14:00 CET
2016-02-15
- 16:28 jzerebecki: reloading zuul for 2d16ad3..3bb0afa
- 16:10 hashar: Image ci-jessie-wikimedia-1455552377 in wmflabs-eqiad is ready
- 15:25 jzerebecki: reloading zuul for e174335..2d16ad3
- 15:23 hashar: Image ci-jessie-wikimedia-1455549539 in wmflabs-eqiad is ready
- 15:19 hashar: Regenerating Nodepool snapshot. Slave scripts have 0 bytes...
- 15:04 hashar: Slave scripts added to Nodepool instances! Image ci-jessie-wikimedia-1455548346 in wmflabs-eqiad is ready
- 11:05 hashar: Image ci-jessie-wikimedia-1455534001 in wmflabs-eqiad is ready
- 07:52 legoktm: deploying https://gerrit.wikimedia.org/r/270686
- 06:52 legoktm: legoktm@gallium:/srv/org/wikimedia/doc$ sudo -u jenkins-slave rm -rf EventLogging/ GuidedTour/ MultimediaViewer/ TemplateData/
- 06:22 legoktm: deploying https://gerrit.wikimedia.org/r/270677
- 06:12 legoktm: deploying https://gerrit.wikimedia.org/r/270675
- 06:02 legoktm: deploying https://gerrit.wikimedia.org/r/270674
- 05:56 legoktm: deploying https://gerrit.wikimedia.org/r/270673
- 05:32 legoktm: deploying https://gerrit.wikimedia.org/r/270670
- 04:05 legoktm: deploying https://gerrit.wikimedia.org/r/270667
- 03:26 legoktm: deploying https://gerrit.wikimedia.org/r/270665
- 02:56 legoktm: deploying https://gerrit.wikimedia.org/r/270657
2016-02-14
- 23:54 legoktm: deploying https://gerrit.wikimedia.org/r/270656
- 23:25 legoktm: deploying https://gerrit.wikimedia.org/r/270654
- 23:13 legoktm: also deploying https://gerrit.wikimedia.org/r/#/c/265098/
- 23:11 legoktm: deploying https://gerrit.wikimedia.org/r/270651
- 05:18 bd808: tools.stashbot Testing after restart (T126419)
2016-02-13
- 06:42 bd808: restarted nutcracker on deployment-mediawiki01
- 06:32 bd808: jobrunner on deployment-jobrunner01 enabled after reverting changes from T87928 that caused T126830
- 05:51 bd808: disabled jobrunner process on jobrunner01; queue full of jobs broken by T126830
- 05:31 bd808: trebuchet clone of /srv/jobrunner/jobrunner broken on jobrunner01; failing puppet runs
- 05:25 bd808: jobrunner process on deployment-jobrunner01 badly broken; investigating
- 05:20 bd808: Ran https://phabricator.wikimedia.org/P2273 on deployment-jobrunner01.deployment-prep.eqiad.wmflabs; freed ~500M; disk utilization still at 94%
2016-02-12
- 23:54 hashar: beta cluster broken since 20:30 UTC https://logstash-beta.wmflabs.org/#/dashboard/elasticsearch/fatalmonitor haven't looked
- 17:36 hashar: salt -v '*slave-trusty*' cmd.run 'apt-get -y install texlive-generic-extra' # T126422
- 17:32 hashar: adding texlive-generic-extra on CI slaves by cherry picking https://gerrit.wikimedia.org/r/#/c/270322/ - T126422
- 17:19 hashar: got rid of integration-dev, it is broken somehow
- 17:10 hashar: Nodepool back at spawning instances. contintcloud has been migrated in wmflabs
- 16:51 thcipriani: running sudo salt '*' -b '10%' deploy.fixurl to fix deployment-prep trebuchet urls
- 16:31 hashar: bd808 added support for saltbot to update tasks automagically!!!! T108720
- 03:10 yurik: attempted to sync graphoid from gerrit 270166 from deployment-tin, but it wouldn't sync. Tried to git pull sca02, submodules wouldn't pull
2016-02-11
- 22:53 thcipriani: shutting down deployment-bastion
- 21:28 hashar: pooling back slaves 1001 to 1006
- 21:18 hashar: re-enabling hhvm service on slaves ( https://phabricator.wikimedia.org/T126594 ) Some symlink is missing and only provided by the upstart script grrrrrrr https://phabricator.wikimedia.org/T126658
- 20:52 legoktm: deploying https://gerrit.wikimedia.org/r/270098
- 20:35 hashar: depooling the six recent slaves: /usr/lib/x86_64-linux-gnu/hhvm/extensions/current/luasandbox.so cannot open shared object file
- 20:29 hashar: pooling integration-slave-trusty-1004 integration-slave-trusty-1005 integration-slave-trusty-1006
- 20:14 hashar: pooling integration-slave-trusty-1001 integration-slave-trusty-1002 integration-slave-trusty-1003
- 19:35 marxarelli: modifying deployment server node in jenkins to point to deployment-tin
- 19:27 thcipriani: running sudo salt -b '10%' '*' cmd.run 'puppet agent -t' from deployment-salt
- 19:27 twentyafterfour: Keeping notes on the ticket: https://phabricator.wikimedia.org/T126537
- 19:24 thcipriani: moving deployment-bastion to deployment-tin
- 17:59 hashar: recreated instances with proper names: integration-slave-trusty-{1001-1006}
- 17:52 hashar: Created integration-slave-trusty-{1019-1026} as m1.large (note 1023 is an exception it is for Android). Applied role::ci::slave , lets wait for puppet to finish
- 17:42 Krinkle: Currently testing https://gerrit.wikimedia.org/r/#/c/268802/ in Beta Labs
- 17:27 hashar: Depooling all the ci.medium slaves and deleting them.
- 17:27 hashar: I tried. The ci.medium instances are too small and MediaWiki tests really need 1.5GBytes of memory :-(
- 16:00 hashar: rebuilding integration-dev https://phabricator.wikimedia.org/T126613
- 15:27 Krinkle: Deploy Zuul config change https://gerrit.wikimedia.org/r/269976
- 11:46 hashar: salt -v '*' cmd.run '/etc/init.d/apache2 restart' might help for Wikidata browser tests failing
- 11:32 hashar: disabling hhvm service on CI slaves ( https://phabricator.wikimedia.org/T126594 , cherry picked both patches )
- 10:50 hashar: reenabled puppet on CI. All transitioned to a 128MB tmpfs (was 512MB)
- 10:16 hashar: pooling back integration-slave-trusty-1009 and integration-slave-trusty-1010 (tmpfs shrunken)
- 10:06 hashar: disabling puppet on all CI slaves. Trying to lower tmpfs 512MB to 128MB ( https://gerrit.wikimedia.org/r/#/c/269880/ )
- 02:45 legoktm: deploying https://gerrit.wikimedia.org/r/269853 https://gerrit.wikimedia.org/r/269893
2016-02-10
- 23:54 hashar_: depooling Trusty slaves that only have 2GB of RAM, which is not enough. https://phabricator.wikimedia.org/T126545
- 22:55 hashar_: gallium: find /var/lib/jenkins/config-history/config -type f -wholename '*/2015*' -delete ( https://phabricator.wikimedia.org/T126552 )
- 22:34 Krinkle: Zuul is back up and processing Gerrit events, but jobs are still queued indefinitely. Jenkins is not accepting new jobs
- 22:31 Krinkle: Full restart of Zuul. Seems Gearman/Zuul got stuck. All executors were idling. No new Gerrit events processed either.
- 21:22 legoktm: cherry-picking https://gerrit.wikimedia.org/r/#/c/269370/ on integration-puppetmaster again
- 21:17 hashar: The CI dust has settled. Krinkle and I have pooled a lot more Trusty slaves to accommodate the overload caused by switching to php55 (jobs run on Trusty)
- 21:08 hashar: pooling trusty slaves 1009, 1010, 1021, 1022 with 2 executors (they are ci.medium)
- 20:38 hashar: cancelling mediawiki-core-jsduck-publish and mediawiki-core-doxygen-publish jobs manually. They will catch up on next merge
- 20:34 Krinkle: Pooled integration-slave-trusty-1019 (new)
- 20:28 Krinkle: Pooled integration-slave-trusty-1020 (new)
- 20:24 Krinkle: created integration-slave-trusty-1019 and integration-slave-trusty-1020 (ci1.medium)
- 20:18 hashar: created integration-slave-trusty-1009 and 1010 (trusty ci.medium)
- 20:06 hashar: creating integration-slave-trusty-1021 and integration-slave-trusty-1022 (ci.medium)
- 19:48 greg-g: that cleanup was done by apergos
- 19:48 greg-g: did cleanup across all integration slaves, some were very close to out of room. results: https://phabricator.wikimedia.org/P2587
- 19:43 hashar: Dropping slaves Precise m1.large integration-slave-precise-1014 and integration-slave-precise-1013 , most load shifted to Trusty (php53 -> php55 transition)
- 18:20 Krinkle: Creating a Trusty slave to support increased demand following the MediaWiki php53(precise) > php55(trusty) bump
- 16:06 jzerebecki: reloading zuul for 41a92d5..5b971d1
- 15:42 jzerebecki: reloading zuul for 639dd40..41a92d5
- 14:12 jzerebecki: recover a bit of disk space: integration-saltmaster:~# salt --show-timeout '*slave*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/*WikibaseQuality*'
- 13:46 jzerebecki: reloading zuul for 639dd40
- 13:15 jzerebecki: reloading zuul for 3be81c1..e8e0615
- 08:07 legoktm: deploying https://gerrit.wikimedia.org/r/269619
- 08:03 legoktm: deploying https://gerrit.wikimedia.org/r/269613 and https://gerrit.wikimedia.org/r/269618
- 06:41 legoktm: deploying https://gerrit.wikimedia.org/r/269607
- 06:34 legoktm: deploying https://gerrit.wikimedia.org/r/269605
- 02:59 legoktm: deleting 14GB broken workspace of mediawiki-core-php53lint from integration-slave-precise-1004
- 02:37 legoktm: deleting /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm-composer on trusty-1017, it had a skin cloned into it
- 02:26 legoktm: queuing mwext jobs server-side to identify failing ones
- 02:21 legoktm: deploying https://gerrit.wikimedia.org/r/269582
- 01:03 legoktm: deploying https://gerrit.wikimedia.org/r/269576
2016-02-09
- 23:17 legoktm: deploying https://gerrit.wikimedia.org/r/269551
- 23:02 legoktm: gracefully restarting zuul
- 22:57 legoktm: deploying https://gerrit.wikimedia.org/r/269547
- 22:29 legoktm: deploying https://gerrit.wikimedia.org/r/269540
- 22:18 legoktm: re-enabling puppet on all CI slaves
- 22:02 legoktm: reloading zuul to see if it'll pickup the new composer-php53 job
- 21:53 legoktm: enabling puppet on just integration-slave-trusty-1012
- 21:52 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/269370/ onto integration-puppetmaster
- 21:50 legoktm: disabling puppet on all trusty/precise CI slaves
- 21:40 legoktm: deploying https://gerrit.wikimedia.org/r/269533
- 17:49 marxarelli: disabled/enabled gearman in jenkins, connection works this time
- 17:49 marxarelli: performed stop/start of zuul on gallium to restore zuul and gearman
- 17:45 marxarelli: "Failed: Unable to Connect" in jenkins when testing gearman connection
- 17:40 marxarelli: killed old zuul process manually and restarted service
- 17:39 marxarelli: restart of zuul fails as well. old process cannot be killed
- 17:38 marxarelli: reloading zuul fails with "failed to kill 13660: Operation not permitted"
- 16:06 bd808: Deleted corrupt integration-slave-precise-1003:/mnt/jenkins-workspace/workspace/mediawiki-core-php53lint/.git
- 15:11 hashar: mira: /srv/mediawiki-staging/multiversion/checkoutMediaWiki 1.27.0-wmf.13 php-1.27.0-wmf.13
- 14:51 hashar: ./make-wmf-branch -n 1.27.0-wmf.13 -o master
- 14:50 hashar: pooling back integration-slave-precise-1001 - 1004. Manually fetched git repos in the workspace for mediawiki core php53
- 14:49 hashar: make-wmf-branch instance: created a local ssh key pair and set the config to use User: hashar
- 14:13 hashar: pooling https://integration.wikimedia.org/ci/computer/integration-slave-precise-1012/ MySQL is back... blame puppet
- 14:12 hashar: depooling https://integration.wikimedia.org/ci/computer/integration-slave-precise-1012/ MySQL is gone somehow
- 14:04 hashar: Manually git fetching mediawiki-core in /mnt/jenkins-workspace/workspace/mediawiki-core-php53lint of slaves precise 1001 to 1004 (git on Precise is remarkably slow)
- 13:28 hashar: salt '*trusty*' cmd.run 'update-alternatives --set php /usr/bin/hhvm'
- 13:28 hashar: salt '*precise*' cmd.run 'update-alternatives --set php /usr/bin/php5'
- 13:18 hashar: salt -v --batch=3 '*slave*' cmd.run 'puppet agent -tv'
- 13:15 hashar: removing https://gerrit.wikimedia.org/r/#/c/269370/ from CI puppet master
- 13:14 hashar: slaves recurse infinitely running /bin/bash -eu /srv/deployment/integration/slave-scripts/bin/mw-install-mysql.sh, which then loops over /bin/bash /usr/bin/php maintenance/install.php --confpath /mnt/jenkins-workspace/workspace/mediawiki-core-qunit/src --dbtype=mysql --dbserver=127.0.0.1:3306 --dbuser=jenkins_u2 --dbpass=pw_jenkins_u2 --dbname=jenkins_u2_mw --pass testpass TestWiki WikiAdmin https://phabricator.wikimedia.org/T126327
- 12:46 hashar: Mass testing php loop of death: salt -v '*slave*' cmd.run 'timeout 2s /srv/deployment/integration/slave-scripts/bin/php --version'
- 12:40 hashar: mass rebooting CI slaves from wikitech
- 12:39 hashar: salt -v '*' cmd.run "bash -c 'cd /srv/deployment/integration/slave-scripts; git pull'"
- 12:33 hashar: all slaves dying due to PHP looping
- 12:02 legoktm: re-enabling puppet on all trusty/precise slaves
- 11:20 legoktm: cherry-picked https://gerrit.wikimedia.org/r/#/c/269370/ on integration-puppetmaster
- 11:20 legoktm: enabling puppet just on integration-slave-trusty-1012
- 11:13 legoktm: disabling puppet on all *(trusty|precise)* slaves
- 10:26 hashar: pooling in integration-slave-trusty-1018
- 03:19 legoktm: deploying https://gerrit.wikimedia.org/r/269359
- 02:53 legoktm: deploying https://gerrit.wikimedia.org/r/238988
- 00:39 hashar: gallium: edited /usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/trigger/gerrit.py and modified: replication_timeout = 300 -> replication_timeout = 10 (sketch below)
- 00:37 hashar: live hacking Zuul code to have it stop sleeping() on force merge
- 00:36 hashar: killing zuul
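The 00:36-00:39 live hack above, expressed as a one-liner. The file path and the two timeout values are exactly as logged; using sed for the edit is an assumption:

```bash
# Lower Zuul's Gerrit replication timeout from 300s to 10s so force-merged
# changes stop sleeping for minutes waiting on replication.
sudo sed -i 's/replication_timeout = 300/replication_timeout = 10/' \
    /usr/share/python/zuul/local/lib/python2.7/site-packages/zuul/trigger/gerrit.py
# Zuul has to be restarted to load the edited module (hence the kill at 00:36).
```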
2016-02-08
- 23:48 legoktm: finally deploying https://gerrit.wikimedia.org/r/269327
- 23:14 hashar: zuul promote --pipeline gate-and-submit --changes 269065,2 https://gerrit.wikimedia.org/r/#/c/269065/
- 23:10 hashar: pooling integration-slave-precise-1001 1002 1004
- 22:47 hashar: Err, need to reboot newly provisioned instances before adding them to Jenkins (kernel upgrade, apache restart, etc.)
- 22:45 hashar: Pooled https://integration.wikimedia.org/ci/computer/integration-slave-precise-1003/
- 22:25 hashar: integration-slave-precise-{1001-1004} applied role::ci::slave::labs, running puppet on the slaves. I have added the instances as Jenkins slaves and put them offline. Once puppet is done, we can mark them online in Jenkins and monitor that the jobs running on them work properly
- 22:15 hashar: Provisioning integration-slave-precise-{1001-1004} https://phabricator.wikimedia.org/T126274 (need more php53 slots)
- 22:13 hashar: Deleted cache-rsync instance superseded by castor instance
- 22:10 hashar: Deleting pmcache.integration.eqiad.wmflabs (was used to investigate various kinds of central caches).
- 20:14 marxarelli: aborting pending mediawiki-extensions-php53 job for CheckUser
- 20:08 bd808: toggled "Enable Gearman" off and on in Jenkins to wake up deployment-bastion workers
- 14:54 hashar: nodepool: refreshed snapshot image. Image ci-jessie-wikimedia-1454942958 in wmflabs-eqiad is ready
- 14:47 hashar: regenerated nodepool reference image (got rid of grunt-cli https://gerrit.wikimedia.org/r/269126 )
- 09:41 legoktm: deploying https://gerrit.wikimedia.org/r/269093 https://gerrit.wikimedia.org/r/269094
- 09:36 hashar: restarting integration puppetmaster (out of memory / cannot fork)
- 06:11 bd808: tgr set $wgAuthenticationTokenVersion on beta cluster (test run for T124440)
- 02:09 legoktm[NE]: deploying https://gerrit.wikimedia.org/r/268047
- 00:57 legoktm[NE]: deploying https://gerrit.wikimedia.org/r/268031
2016-02-06
- 18:34 jzerebecki: reloading zuul for bdb2ed4..46ccca9
2016-02-05
- 13:30 hashar: beta: cleaning out /data/project/logs/archive, which dates from the pre-logstash era. Apparently we have not logged this way since May 2015
- 13:29 hashar: beta: deleting /data/project/swift-disk, created in August 2014 and unused since June 2015. It was a failed attempt at bringing swift to beta
- 13:27 hashar: beta: reclaiming disk space from extensions.git. On bastion: find /srv/mediawiki-staging/php-master/extensions/.git/modules -maxdepth 1 -type d -print -execdir git gc \;
- 13:03 hashar: integration-slave-trusty-1011 went out of disk space. Did some brute-force cleanup and git gc.
- 05:21 Tim: configured mediawiki-extensions-qunit to only run on integration-slave-trusty-1017, did a rebuild and then switched it back
2016-02-04
- 22:08 jzerebecki: reloading zuul for bed7be1..f57b7e2
- 21:51 hashar: salt-key -d integration-slave-jessie-1001.eqiad.wmflabs
- 21:50 hashar: salt-key -d integration-slave-precise-1011.eqiad.wmflabs
- 00:57 bd808: Got deployment-bastion processing Jenkins jobs again via instructions left by my past self at https://phabricator.wikimedia.org/T72597#747925
- 00:43 bd808: Jenkins agent on deployment-bastion.eqiad is again doing that trick where it doesn't pick up jobs
2016-02-03
- 22:24 bd808: Manually ran sync-common on deployment-jobrunner01.eqiad.wmflabs to pick up wmf-config changes that were missing (InitializeSettings, Wikibase, mobile)
- 17:43 marxarelli: Reloading Zuul to deploy previously undeployed Icd349069ec53980ece2ce2d8df5ee481ff44d5d0 and Ib18fe48fe771a3fe381ff4b8c7ee2afb9ebb59e4
- 15:12 hashar: apt-get upgrade deployment-sentry2
- 15:03 hashar: redeployed rcstream/rcstream on deployment-stream by using git-deploy on deployment-bastion
- 14:55 hashar: upgrading deployment-stream
- 14:42 hashar: pooled back integration-slave-trusty-1015. Seems OK
- 14:35 hashar: manually triggered a bunch of browser tests jobs
- 11:40 hashar: apt-get upgrade deployment-ms-be01 and deployment-ms-be02
- 11:32 hashar: fixing puppet.conf on deployment-memc04
- 11:09 hashar: restarting beta cluster puppetmaster just in case
- 11:07 hashar: beta: apt-get upgrade on deployment-cache* hosts and checking puppet
- 10:59 hashar: integration/beta: deleting /etc/apt/apt.conf.d/*proxy files. There is no need for them; in fact, the web proxy is not reachable from labs
- 10:53 hashar: integration: switched puppet repo back to 'production' branch, rebased.
- 10:49 hashar: various beta cluster hosts have puppet errors
- 10:46 hashar: integration-slave-trusty-1013 is close to running out of disk space on /mnt ...
- 10:42 hashar: integration-slave-trusty-1016 out of disk space on /mnt ...
- 03:45 bd808: Puppet failing on deployment-fluorine with "Error: Could not set uid on user[datasets]: Execution of '/usr/sbin/usermod -u 10003 datasets' returned 4: usermod: UID '10003' already exists"
- 03:44 bd808: Freed 28G by deleting deployment-fluorine:/srv/mw-log/archive/*2015*
- 03:42 bd808: Ran deployment-bastion.deployment-prep:/home/bd808/cleanup-var-crap.sh and freed 565M
2016-02-02
- 18:32 marxarelli: Reloading Zuul to deploy If1f3cb60f4ccb2c1bca112900dbada03a8588370
- 17:42 marxarelli: cleaning mwext-donationinterfacecore125-testextension-php53 workspace on integration-slave-precise-1013
- 17:06 ostriches: running sync-common on mw2051 and mw1119
- 09:38 hashar: Jenkins is fully up and operational
- 09:33 hashar: restarting Jenkins
- 08:47 hashar: pooling back integration-slave-precise-1011, puppet run got fixed ( https://phabricator.wikimedia.org/T125474 )
- 03:48 legoktm: deploying https://gerrit.wikimedia.org/r/267828
- 03:29 legoktm: deploying https://gerrit.wikimedia.org/r/266941
- 00:42 legoktm: marked integration-slave-precise-1011 as offline due to T125474
- 00:39 legoktm: precise-1011 slave hasn't had a puppet run in 6 days
2016-02-01
- 23:53 bd808: Logstash working again; I applied a change to the default mapping template for Elasticsearch that ensures fields named "timestamp" are indexed as plain strings (sketch at the end of this day's entries)
- 23:46 bd808: Elasticsearch index template for beta logstash cluster making crappy guesses about syslog events; dropped 2016-02-01 index; trying to fix default mappings
- 23:09 bd808: HHVM logs causing rejections during document parse when inserting in Elasticsearch from logstash. They contain a "timestamp" field that looks like "Feb 1 22:56:39" which is making the mapper in Elasticsearch sad.
- 23:04 bd808: Elasticsearch on deployment-logstash2 rejecting all documents with 400 status. Investigating
- 22:50 bd808: Copying deployment-logstash2.deployment-prep:/var/log/logstash/logstash.log to /srv for debugging later
- 22:48 bd808: deployment-logstash2.deployment-prep:/var/log/logstash/logstash.log is 11G of fail!
- 22:46 bd808: root partition on deployment-logstash2 full
- 22:43 bd808: No data in logstash since 2016-01-30T06:55:37.838Z; investigating
- 15:33 hashar: Image ci-jessie-wikimedia-1454339883 in wmflabs-eqiad is ready
- 15:01 hashar: Refreshing Nodepool image. Might have npm/grunt properly set up
- 03:15 legoktm: deploying https://gerrit.wikimedia.org/r/267630
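For the 2016-02-01 logstash outage above: the fix was a change to Elasticsearch's default mapping template so fields named "timestamp" index as plain strings instead of being date-detected. A hedged sketch of that kind of template update against the Elasticsearch 1.x API; the template name and exact body are illustrative assumptions, not the actual change:

```bash
# Illustrative only: make "timestamp" a plain string in new logstash-* indices
# so HHVM values like "Feb  1 22:56:39" stop failing the date parser.
curl -XPUT 'http://deployment-logstash2:9200/_template/string_timestamp' -d '{
  "template": "logstash-*",
  "mappings": {
    "_default_": {
      "properties": {
        "timestamp": { "type": "string", "index": "not_analyzed" }
      }
    }
  }
}'
```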
2016-01-31
- 13:35 hashar: Jenkins IRC bot started failing at Jan 30 01:04:00 2016 for whatever reason... Should be fine now
- 13:33 hashar: cancelling/aborting jobs that are stuck while reporting to IRC (mostly browser tests and beta cluster jobs)
- 13:32 hashar: Jenkins jobs are being blocked because they can no longer report back to IRC :-(((
- 13:28 hashar: Jenkins jobs are being blocked because they can no longer report back to IRC :-(((
2016-01-30
- 12:46 hashar: integration-slave-jessie-1001: fixed puppet.conf server name and ran puppet
2016-01-29
- 18:43 thcipriani: updated scap on beta
- 16:44 thcipriani: deployed scap updates on beta
- 11:58 _joe_: upgraded hhvm to 3.6 wm8 in deployment-prep
2016-01-28
- 23:22 MaxSem: Updated portals on betalabs to master
- 22:23 hashar: salt '*slave-precise*' cmd.run 'apt-get install php5-ldap' ( https://phabricator.wikimedia.org/T124613 ) will need to be puppetized
- 18:17 thcipriani: cleaning npm cache on slave machines: salt -v '*slave*' cmd.run 'sudo -i -u jenkins-deploy -- npm cache clean'
- 18:12 thcipriani: running npm cache clean on integration-slave-precise-1011 sudo -i -u jenkins-deploy -- npm cache clean
- 15:25 hashar: apt-get upgrade deployment-sca01 and deployment-sca02
- 15:09 hashar: fixing puppet.conf hostname on deployment-upload, deployment-conftool, deployment-tmh01, deployment-zookeeper01, and deployment-urldownloader
- 15:06 hashar: fixing puppet.conf hostname on deployment-upload.deployment-prep.eqiad.wmflabs and running puppet
- 15:00 hashar: Running puppet on deployment-memc02 and deployment-elastic07 . It is catching up with lot of changes
- 14:59 hashar: fixing puppet hostnames on deployment-elastic07
- 14:59 hashar: fixing puppet hostnames on deployment-memc02
- 14:55 hashar: Deleted salt keys deployment-pdf01.eqiad.wmflabs and deployment-memc04.eqiad.wmflabs (obsolete, entries with '.deployment-prep.' are already there)
- 07:38 jzerebecki: reload zuul for 4951444..43a030b
- 05:55 jzerebecki: doing https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update
- 03:49 mobrovac: deployment-prep re-enabled puppet on deployment-restbase0x
- 02:49 mobrovac: deployment-prep deployment-restbase01 disabled puppet to set up cassandra for
- 02:27 mobrovac: deployment-prep recreating deployment-restbase01 for T125003
- 02:23 mobrovac: deployment-prep deployment-restbase02 disabled puppet to recreate deployment-restbase01 for T125003
- 01:42 mobrovac: deployment-prep recreating deployment-sca02 for T125003
- 01:28 mobrovac: deployment-prep recreating deployment-sca01 for T125003
- 00:36 mobrovac: deployment-prep re-imaging deployment-mathoid for T125003
- 00:02 jzerebecki: integration-slave-trusty-1016:~$ sudo -i rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/Donate
2016-01-27
- 23:49 jzerebecki: integration-slave-precise-1011:~$ sudo -i /etc/init.d/salt-minion restart
- 23:46 jzerebecki: work around https://phabricator.wikimedia.org/T117710 : salt --show-timeout '*slave*' cmd.run 'rm -rf /mnt/jenkins-workspace/workspace/mwext-testextension-hhvm/src/skins/BlueSky'
- 21:19 cscott: updated OCG to version 64050af0456a43344b32e3e93561a79207565eaf (should be no-op after yesterday's deploy)
- 10:29 hashar: triggered a bunch of browser test jobs; deployment-redis01 was dead/faulty
- 10:08 hashar: mass restarting redis-server process on deployment-redis01 (for https://phabricator.wikimedia.org/T124677 )
- 10:07 hashar: mass restarting redis-server process on deployment-redis01
- 09:00 hashar: beta: commenting out "latency-monitor-threshold 100" parameter from any /etc/redis/redis.conf we have ( https://phabricator.wikimedia.org/T124677 ). Puppet will not reapply it unless distribution is Jessie
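A sketch of the 09:00 redis edit above, assuming it was fanned out from the saltmaster like the other mass operations in this log (the entry does not say how the files were edited; sed is an assumption):

```bash
# Comment out the parameter that older redis-server builds reject; per the
# entry, puppet only reapplies it on Jessie, so the edit sticks elsewhere.
salt -v '*redis*' cmd.run \
    "sed -i 's/^latency-monitor-threshold/# latency-monitor-threshold/' /etc/redis/redis.conf && service redis-server restart"
```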
2016-01-26
- 16:51 cscott: updated OCG to version 64050af0456a43344b32e3e93561a79207565eaf
- 12:14 hashar: Added Jenkins IRC bot (wmf-insecte) to #wikimedia-perf for https://gerrit.wikimedia.org/r/#/c/265631/
- 09:30 hashar: restarting Jenkins to upgrade the gearman plugin with https://review.openstack.org/#/c/271543/
- 04:18 bd808: integration-slave-jessie-1001:/mnt full; cleaned up 15G of files in /mnt/pbuilder/build (27 hours after the last time I did that)
2016-01-25
- 18:59 twentyafterfour: started redis-server on deployment-redis01 by commenting out latency-monitor-threshold from the redis.conf
- 15:22 hashar: CI: fixing kernels that were not upgrading via: rm /boot/grub/menu.lst ; update-grub -y (i.e. regenerate the Grub menu from scratch)
- 14:21 hashar: integration-slave-trusty-1015.integration.eqiad.wmflabs is gone. I have failed the kernel upgrade / grub update
- 01:35 bd808: integration-slave-jessie-1001:/mnt full; cleaned up 15G of files in /mnt/pbuilder/build
2016-01-24
- 06:45 legoktm: deploying https://gerrit.wikimedia.org/r/266039
- 06:13 legoktm: deploying https://gerrit.wikimedia.org/r/266041
2016-01-22
- 23:58 legoktm: removed skins from mwext-qunit workspace on trusty-1013 slave
- 23:34 legoktm: rm -rf /mnt/jenkins-workspace/workspace/mediawiki-phpunit-php53 on slave precise 1012
- 22:45 legoktm: deploying https://gerrit.wikimedia.org/r/265864
- 22:27 hashar: rebooted all CI slaves using OpenStackManager
- 22:09 hashar: rebooting deployment-redis01 (kernel upgrade)
- 21:22 hashar: Image ci-jessie-wikimedia-1453497269 in wmflabs-eqiad is ready (with node 4.2 for https://phabricator.wikimedia.org/T119143 )
- 21:14 hashar: updating nodepool snapshot based on new image
- 21:12 hashar: rebuilding nodepool reference image
- 20:04 hashar: Image ci-jessie-wikimedia-1453492820 in wmflabs-eqiad is ready
- 20:00 hashar: Refreshing nodepool image to hopefully get Nodejs 4.2.4 https://phabricator.wikimedia.org/T124447 https://gerrit.wikimedia.org/r/#/c/265802/
- 16:32 hashar: Nuked corrupted git repo on integration-slave-precise-1012 /mnt/jenkins-workspace/workspace/mediawiki-extensions-php53
- 12:23 hashar: beta: reinitialized keyholder on deployment-bastion. The proxy apparently had no identity (sketch at the end of this day's entries)
- 09:32 hashar: beta cluster Jenkins jobs have been stalled for 9 hours and 25 minutes. Disabling/re-enabling the Gearman plugin to remove the deadlock
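Re the keyholder entry at 12:23: when the keyholder proxy holds no identities, jenkins-deploy cannot ssh out and beta deploy jobs stall. A sketch of the usual re-arm, assuming the standard keyholder CLI was already in place on deployment-bastion:

```bash
# Assumed keyholder CLI. Check which identities the proxy agent holds,
# then re-add the deploy keys (prompts for their passphrases).
sudo keyholder status
sudo keyholder arm
```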
2016-01-21
- 21:41 hashar: restored role::mail::mx on deployment-mx
- 21:36 hashar: dropping role::mail::mx from deployment-mx to let puppet run
- 21:33 hashar: rebooting deployment-jobrunner01 / kernel upgrade / /tmp is only 1MBytes
- 21:19 hashar: fixing up deployment-jobrunner01 /tmp and / disks are full
- 19:57 thcipriani: ran REPAIR TABLE globalnames; on centralauth db
- 19:48 legoktm: deploying https://gerrit.wikimedia.org/r/265552
- 19:39 legoktm: deploying jjb changes for https://gerrit.wikimedia.org/r/264990
- 19:25 legoktm: deploying https://gerrit.wikimedia.org/r/265546
- 01:59 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions/SpellingDictionary$ rm -r modules/jquery.uls && git rm modules/jquery.uls
- 01:00 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions$ git pull && git submodule update --init --recursive
- 00:57 jzerebecki: jenkins-deploy@deployment-bastion:/srv/mediawiki-staging/php-master/extensions$ git reset HEAD SpellingDictionary
2016-01-20
- 20:05 hashar: beta: sudo find /data/project/upload7/math -type f -delete (probably some old leftovers)
- 19:50 hashar: beta: on commons ran deleteArchivedFiles.php: Nuked 7130 files
- 19:49 hashar: beta: foreachwiki deleteArchivedRevisions.php -delete
- 19:26 hasharAway: Nuked all files from http://commons.wikimedia.beta.wmflabs.org/wiki/Category:GWToolset_Batch_Upload
- 19:19 hasharAway: beta: sudo find /data/project/upload7/*/*/temp -type f -delete
- 19:14 hasharAway: beta: sudo rm /data/project/upload7/*/*/lockdir/*
- 18:57 hasharAway: beta cluster code has been stalled for roughly 2.5 hours
- 18:55 hasharAway: disconnecting Gearman plugin to remove deadlock for beta cluster jobs
- 17:06 hashar: clearing files from beta-cluster to prepare for Swift migration. python pwb.py delete.py -family:betacommons -lang:en -cat:'GWToolset Batch Upload' -verbose -putthrottle:0 -summary:'Clearing out old batched upload to save up disk space for Swift migration'
2016-01-19
- 22:25 legoktm: deleting *zend* workspaces on precise slaves
- 21:58 thcipriani: trying https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update again
- 21:57 thcipriani: beta-scap-eqiad still can't find executor on deployment-bastion.eqiad
- 21:52 thcipriani: following steps at https://www.mediawiki.org/wiki/Continuous_integration/Jenkins#Hung_beta_code.2Fdb_update for deployment-bastion
- 19:34 legoktm: deleting all *zend* jobs from jenkins
- 09:40 hashar: Created github repo https://github.com/wikimedia/operations-debs-varnish4
- 03:59 legoktm: deploying https://gerrit.wikimedia.org/r/264912 and https://gerrit.wikimedia.org/r/264922
2016-01-17
- 18:02 legoktm: deploying https://gerrit.wikimedia.org/r/264605
2016-01-16
- 21:47 legoktm: deploying https://gerrit.wikimedia.org/r/264489
- 21:36 legoktm: deploying https://gerrit.wikimedia.org/r/264488
- 21:29 legoktm: deploying https://gerrit.wikimedia.org/r/264487
- 21:21 legoktm: deploying https://gerrit.wikimedia.org/r/264483 https://gerrit.wikimedia.org/r/264485
- 20:58 legoktm: deploying https://gerrit.wikimedia.org/r/264492
- 18:55 jzerebecki: reloading zuul for 996c558..5f8eb50
- 09:12 legoktm: deploying https://gerrit.wikimedia.org/r/264448
- 09:01 legoktm: deploying https://gerrit.wikimedia.org/r/264446 and https://gerrit.wikimedia.org/r/264447
- 07:46 legoktm: sudo -u jenkins-deploy mv /mnt/jenkins-workspace/workspace/mediawiki-core-phplint /mnt/jenkins-workspace/workspace/mediawiki-core-php53lint on all precise slaves (fan-out sketch at the end of this day's entries)
- 07:17 legoktm: deploying https://gerrit.wikimedia.org/r/264444
- 06:31 legoktm: deploying https://gerrit.wikimedia.org/r/264441
- 06:10 legoktm: added phpflavor-php53 label to all phpflavor-zend slaves
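The 07:46 rename above was applied "on all precise slaves"; a sketch of the fan-out, assuming it went through the saltmaster like similar mass operations in this log:

```bash
# Rename the workspace on every Precise slave so the renamed job
# (mediawiki-core-phplint -> mediawiki-core-php53lint) reuses the existing
# mediawiki/core clone instead of re-cloning it on each slave.
salt -v '*precise*' cmd.run \
    "sudo -u jenkins-deploy mv /mnt/jenkins-workspace/workspace/mediawiki-core-phplint /mnt/jenkins-workspace/workspace/mediawiki-core-php53lint"
```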
2016-01-15
- 12:17 hashar: restarting Jenkins for plugins updates
- 02:49 bd808: Trying to fix submodules in deployment-bastion:/srv/mediawiki-staging/php-master/extensions for T123701
2016-01-14
- 20:06 legoktm: deploying https://gerrit.wikimedia.org/r/264122
- 19:32 legoktm: deploying https://gerrit.wikimedia.org/r/264114
- 19:18 legoktm: deploying https://gerrit.wikimedia.org/r/264108
2016-01-13
- 21:06 hashar: beta cluster code is up to date again. Got delayed by roughly 4 hours.
- 20:55 hashar: unlocked Jenkins jobs for beta cluster by disabling/reenabling Jenkins Gearman client
- 10:15 hashar: beta: fixed puppet on deployment-elastic06. It was still using a cert/hostname without '.deployment-prep.'. Mass update occurring.
2016-01-12
- 23:30 legoktm: deploying https://gerrit.wikimedia.org/r/263757 https://gerrit.wikimedia.org/r/263756
- 13:32 hashar: beta cluster: running /usr/local/sbin/cleanup-pam-config
- 13:29 hashar: integration running /usr/local/sbin/cleanup-pam-config on slaves
2016-01-11
- 22:24 hashar: Deleting old references on Zuul-merger for mediawiki/core : /usr/share/python/zuul/bin/python /home/hashar/zuul-clear-refs.py --until 15 /srv/ssd/zuul/git/mediawiki/core
- 22:21 hashar: gallium in /srv/ssd/zuul/git/mediawiki/core$ git gc --prune=all && git remote update --prune
- 22:21 hashar: scandium in /srv/ssd/zuul/git/mediawiki/core$ git gc --prune=all && git remote update --prune
- 07:35 legoktm: deploying https://gerrit.wikimedia.org/r/263319
2016-01-07
- 23:16 legoktm: deleted /mnt/jenkins-workspace/workspace/mediawiki-extensions-qunit/src/extensions/PdfHandler/.git/refs/heads/wmf/1.26wmf16.lock on slave 1013
- 06:32 legoktm: deploying https://gerrit.wikimedia.org/r/262868
- 02:24 legoktm: deploying https://gerrit.wikimedia.org/r/262855
- 01:25 jzerebecki: reloading zuul for b0a5335..c16368a
2016-01-06
- 21:06 thcipriani: kicking integration puppetmaster; weird error where it was unable to find the node definition
- 21:11 jzerebecki: on scandium: sudo -u zuul rm -rf /srv/ssd/zuul/git/mediawiki/services/mathoid
- 21:04 legoktm: ^ on gallium
- 21:04 legoktm: manually deleted /srv/ssd/zuul/git/mediawiki/services/mathoid to force zuul to re-clone it
- 20:17 hashar: beta: dropped a few more /etc/apt/apt.conf.d/*-proxy files. The webproxy is no longer reachable from labs
- 09:44 hashar: CI/beta: deleting all git tags from /var/lib/git/operations/puppet and doing a git repack (sketch at the end of this day's entries)
- 09:39 hashar: restoring puppet hacks on beta cluster puppetmaster.
- 09:35 hashar: beta/CI: salt -v '*' cmd.run 'rm -v /etc/apt/apt.conf.d/*-proxy' https://phabricator.wikimedia.org/T122953
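The 09:44 tag cleanup above, sketched with standard git commands (the exact invocation is not in the log):

```bash
cd /var/lib/git/operations/puppet
# Delete every local tag, then repack; objects previously pinned by the tags
# get consolidated, which is what reclaims the disk space.
git tag -l | xargs -r git tag -d
git repack -Ad
```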
2016-01-05
- 16:54 hashar_: Removed elastic search from CI slaves https://phabricator.wikimedia.org/T89083 https://gerrit.wikimedia.org/r/#/c/259301/
- 03:45 Krinkle: integration-slave-trusty-1015: rm -rf /mnt/home/jenkins-deploy/.npm per https://integration.wikimedia.org/ci/job/mediawiki-core-qunit/56577/console
2016-01-04
- 21:06 hashar: gallium has puppet enabled again
- 20:53 hashar: stopping puppet on gallium and live hacking Zuul configuration for https://phabricator.wikimedia.org/T122656
2016-01-02
- 03:17 yurik: purged varnish caches on deployment-cache-text04
2016-01-01
- 22:17 bd808: No nodepool ci-jessie-* hosts seen in Jenkins interface and rake-jessie jobs backing up
Archive
- Archive 1 (September 2014 - December 2015)