You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Incident documentation/20160915-MediaWiki: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Hashar
m (action: Decide a plan to push <code>1.28.0-wmf.19</code> again. To be discussed by Release Engineering team on September 19th team meeting.)
 
imported>Krinkle
 
(3 intermediate revisions by 2 users not shown)
Line 1: Line 1:
== Summary ==
#REDIRECT [[Incidents/20160915-MediaWiki]]
 
Deployment of <code>1.28.0-wmf.19</code> caused jobs to fail when invoking <code>SiteConfiguration::getConfig</code>. That initially caused Wikidata to no more dispatch updates to the wiki clients (until Thursday 15th 20:55 UTC) and '''prevented any account creation''' from Thursday 15th 19:10UTC to Friday 16th 12:50 UTC.
 
Reverting all wikis to <code>1.28.0-wmf.18</code> solved the problem.
 
The root cause is a failure to update the HHVM bytecode cache because of a filesize limit of 512MBytes enforced via <code>ulimit</code>. The rough chain of call being:
 
* jobrunner service
* /rpc/RunJobs.php
* wfShellExec( 'mwscript maintenance/getConfiguration.php' ); with ulimit filesize=512MBytes
* HHVM in CLI mode attempting to write to either /var/cache/hhvm/fcgi.sq3 or /var/cache/hhvm/cli.sq3 (undetermined). <code>EFBIG (File too large)</code>
 
As a result wfShellExec() fails with exit code 153 (which is 128 + value of Posix signal SIGXFSZ 25), since the job fails to get configuration from the shell call, it errors out.
 
The extensive debugging details are on {{Phabricator|T145819}}.
 
As a result ALL wikis are at <code>1.28.0-wmf.18</code> as of Friday 15th 12:50 UTC.
== Timeline ==
 
''This is a wall of text, you can probably skip it entirely and move straight to'' Conclusions ''below.''
 
'''Tuesday 13th at 19:00 UTC'''
 
<code>1.28.0-wmf.19</code> is pushed to group0 as scheduled. That caused renames to get stuck on mediawiki.org ({{Phabricator|T145596}}) fixed via https://phabricator.wikimedia.org/rECAU20f54121d177f996c155d5feea6c799a8c9ed242
 
'''Wednesday 14th at 19:00 UTC'''
 
<code>1.28.0-wmf.19</code> pushed to group1. It is promptly reverted due to MediaWiki:Common.css no more being included ({{Phabricator|T145673}}), first noticed on the Hebrew Wikipedia which has the Infobox on the left instead of floating on the right.
 
After the rollback train conductor Antoine unblocks stuck renames on mediawiki.org ({{Phabricator|T145596}}). They originally failed due to a database related issue ([[phab:rECAU20f54121d177f996c155d5feea6c799a8c9ed242|commit]])
 
Only group0 is running <code>1.28.0-wmf.19</code>
 
'''Thursday 15th at 19:00 UTC'''
 
The renaming being fixed, <code>1.28.0-wmf.19</code> is pushed to group1 at 19:10 UTC. A user notice that Wikidata is lagged. The Wikidata process that dispatch updates to client Wikis is lagging. Aude assumes it is a temporary spike and but we fill {{Phabricator|T145819}} nonetheless. After more monitoring it is evident the lag is not resolving, the Grafana boards for Wikidata Dispatch having all the informations.
 
Logs show that <code>SiteConfiguration::getConfig()</code> fails on terbium.
 
We rollback Wikidata to <code>1.28.0-wmf.18</code>. The Wikidata dispatch jobs are processed again. The situation seems to be resolved.
 
After some more monitoring, and to stay on schedule with the train deployment all wikis '''but wikidata ones''' are upgraded to <code>1.28.0-wmf.19</code> at 20:55 UTC.  The account renaming have been fixed the day before, the Wikidata dispatch works fine, there is not much error in fatalmonitor. It is assumed that every is fine.
 
At 22:00 UTC everything looks fine.
 
'''''<big>Night goes on in Europe</big>'''''
 
'''Friday 16th at 8:00 UTC'''
Antoine dig and further investigate the Wikidata <code>SiteConfiguration::getConfig</code> issue. The extensive debugging is show on {{Phabricator|T145819}} with support from Addshore. That points out a rather messy situation with MediaWiki has a web service having to shell out to mwscript maintenance/getConfiguration.php T111441.  Since that is invoked in the context of the webservice, ulimit is applied and restrict file size to 512 MBytes. Turns out that an unknown reason, HHVM attempt to refresh its Sqlite3 bytecode cache when being run on <code>1.28.0-wmf.19</code> but not on <code>1.28.0-wmf.18</code> .
 
The error is manually reproducible from the CLI on terbium. At 12:00 UTC, Antoine deletes <code>/var/cache/hhvm/cli.hhbc.sq3</code> and the error is gone.
 
12:40 UTC, Antoine announces decision to rollback all wikis to <code>1.28.0-wmf.18</code>, with the hope that HHVM would not have to trigger the write to the cli cache since  <code>1.28.0-wmf.18</code> did not.  That is the safe thing to do on a Friday, potentially a hack to remove the filesize limit would have worked as well.
 
12:45 UTC Antoine notices quite late that Account creation are 100% erroring out since  <code>1.28.0-wmf.19</code> deployment at 19:10 UTC the day before ([[phab:T145839#2643508|see comment and graph on T145839]]). That confirms rollbacking.
 
12:50 UTC rebuilt wikiversions.php and synchronized wikiversions files: All wikis back to <code>1.28.0-wmf.18</code> :( T145819
 
Monitoring ensue, the jobs are all happy.
 
== Conclusions ==
* Despite an enormous amount of metrics being collected for the infrastructure or the MediaWiki stack, most do not generate an alarm. The 100% account creation error rate must be made a critical alarm.
* Antoine could not find a good dashboard or overview of jobs failure by type of job/wiki/version/error reason
* Having MediaWiki as a web service shelling out to run a maintenance script is absolutely terrible and confusing
* It has been absolutely not trivial to find out that HHVM tried to write to the bytecode cache and got denied. The shell out fails to capture standard error.
 
== Actionables ==
 
<onlyinclude>
* <code>SiteConfiguration::getConfig</code> must stop using mwscript maintenance/getConfiguration.php ({{Phabricator|T111441}})
* Investigate HHVM writing to the bytecode cache even when <code>1.28.0-wmf.19</code> got deployed everywhere. ('''Task to be filled''')
* Have the train deployment to purge HHVM bytecode cache for both cli and fcgi. ('''Task to be filled''')
* Add monitoring alarm for Account creation errors ('''Task to be filled''')
* Add monitoring alarm for global and type of jobs errors ('''Task to be filled''')
* Decide on proper fix to push again <code>1.28.0-wmf.19</code> . Easiest seems to be one of ('''Task to be filled'''):
** clear HHVM bytecode cache everywhere
** monkey patch <code>SiteConfiguration::getConfig</code> to override the filesize limit (it already does for memorylimit)
</onlyinclude>
* Decide a plan to push <code>1.28.0-wmf.19</code> again. To be discussed by Release Engineering team on September 19th team meeting.
[[Category:Incident documentation]]

Latest revision as of 17:45, 8 April 2022