You are browsing a read-only backup copy of Wikitech. The live site can be found at

Incident documentation/20170512-API: Difference between revisions

From Wikitech-static
Jump to navigation Jump to search
imported>Subramanya Sastry
Line 1: Line 1:
== Summary ==
#REDIRECT [[Incident documentation/2017-05-12 API]]
The TextExtracts extension (which provides the <code>extracts</code> prop to the MediaWiki API) was providing empty <code>extract</code> fields for queries for a period of 18 hours. This affected RESTBase and subsequently mobileapps.
== Timeline ==
* '''14:26:30''': Icinga begins reporting problems with RESTBase <pre>PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /{format} (Random title redirect) is CRITICAL: Test Random title redirect returned the unexpected status 503 (expecting: 303): /{title} (Get summary from storage) is WARNING: Test Get summary from storage responds with unexpected body: /extract = : /{title} (Get rev by title</pre>
* '''14:27:34''': <code>bblack</code>, <code>_Joe_</code>, <code>godog</code>, <code>elukey</code>, and <code>mobrovac</code> begin investigation in IRC
* '''15:06:40''':
<pre><mobrovac> the "extract" field is empty for any page i try to fetch the summary for
<mobrovac> we get that from the MW API
* '''15:08:14''': <code>thcipriani</code> is pinged to rollback from <code>1.30.0-wmf.1</code> → <code>1.29.0-wmf.21</code> since API has been pinpointed, <code>thcipriani</code> rolls back all wikis on <code>mwdebug1002</code> to <code>1.29.0-wmf.21</code> and pings <code>mobrovac</code> and <code>_Joe_</code> to test.
* '''15:24:19''': [ RESTBase query] is tried and determined that a rollback does not fix it, but rollback + null edit does, caching is involved
* ~'''15:27''': Help is needed from Mediawiki devs, <code>ebernhardson</code> troubleshoots <code>anomie</code> is in meeting but helps, too
* '''15:47:58''':
<pre><ebernhardson> hmm, so i can verify at this point that ExtractFormatter::filterContent() is the part that's stripping everything, but not sure yet what exactly
<ebernhardson> a full page goes in, and what comes out is just a body tag and an html comment
* '''15:58:00''': problem found
<pre><ebernhardson> so, after manually removing the <div class="mw-parser-output"> from the beginning, we now get the correct response from (because it's cached in memcached from my mwrepl session)</pre>
<pre>16:13:08:    <ebernhardson> so random friday options: #1) -> some other tag? not sure what though #2) revert  #3) Drop 'div' from list of things to strip from text extracts, live with some extra content that wasn't wanted
16:15:05:    <thcipriani> option 2 is the one that seems like not new code :)
16:32:19:    <mobrovac> anomie, ebernhardson, thcipriani: at this point, I honestly prefer #2, it seems the safest, the rest can be dealt with during quieter and more appropriate times
16:33:03:    * anomie submitted and FWIW
16:41:09:    <thcipriani> anomie: so the options to fix would be revert or merge these 2 ?
16:42:37:    <anomie> thcipriani: Yes. Either of those options should also fix T165115. In both cases additional caching being done by TextExtracts/MobileFrontend might mask the fix.
16:52:23:    <thcipriani> mobrovac: anomie ok, I'm going to revert and run a sync since (a) it's friday before a week where releng won't be looking deployment stuffs and I don't want to deploy new code and (b) it looks like it was supposed to be reverted for wmf.1 anyway Is there any concern about a revert breaking anything additional
17:01:58:    <anomie> thcipriani: I don't think reverting in 1.30.0-wmf.1 will break anything else.</pre>
* '''18:10''': thcipriani@tin: Finished scap: Revert "Wrap parser output in" 4/4 (duration: 19m 13s)
* '''18:20''': <code>legoktm</code> reports revert not working submits to version the memcached key so that the cache will be cleared
* '''18:27''': code live on <code>mwdebug1002</code>, but determined not to be working, <code>MaxSem</code> explains: <code>18:35:37: <MaxSem> however, due to reverts of the whole wrapping diff thing, we have corrupt cache</code>
* '''18:43''': <code>legoktm</code> creates to clear parser cache and it's deployed, scap catches problem and is created and sync to correct problem
* '''19:17''': <code>thcipriani</code> syncs to change memcached key
== Other service impacts because of the core change ==
* [[phab:T165139|T165139]] - inline <nowiki><math> editing broken in VE</nowiki>
* [[phab:T165070|T165070]] - spurious linter errors reported for <nowiki><maplink> tags</nowiki>
Underlying cause for both of these is the extra <nowiki><div> tag around Parsoid's extension output because Parsoid calls the mediawiki API (action=parse) to process extension tags. This has since been addressed in </nowiki> and so that when the core change is redeployed, Parsoid and downstream clients aren't impacted.
== Conclusions ==
* TextExtracts enables a lot of services on which we depend
* TextExtracts can be broken by adding a div to parser output
* There are, evidently, few alarms that catch empty textextract summaries in the API
* This outcome may have been expected judging from a revert that was proposed before the train
* The core patch was proposed to be reverted in master as part of but an alternative proposal was to not revert it in master, but to revert it in 1.30.0-wmf.1 instead. But, it appears that the revert didn't happen in wmf.1 before deploy.
* Parsoid uses the mediawiki API to process extension tags and there a number of downstream services that depend on Parsoid: VE, Flow, Content Translation, Android App (via Mobile Content Service), and Linter. This bug would have showed up in the beta cluster as reported in T165139. So, there is clearly a testing gap here.
== Actionables ==
* Better alarm mechanism for empty textextract summaries so they don't languish for 18 hours
* (/me is likely missing lots of subtlety, but armchair-quarterbacks anyway) Adding a div shouldn't break textextract, is there some kind of class/id (e.g., <code>.mw-ext-textextract-start</code>) that can be added to parser output that can then be used by textextract that will have the single express purpose of being used by textextract? Adding a div breaking stuff makes the mechanisms by which it is extracting data seem fragile.
* If it is likely that deploying a change will cause breakage (i.e., let deployers/branchers know so that they can backport fixes or reverts before problems manifest.
* Identify how to update the QA tests that are run on the beta cluster so that any potential breakage of VE, etc. is caught before changes are rolled out to production.
{{#ifeq:{{SUBPAGENAME}}|Report Template||
[[Category:Incident documentation]]

Revision as of 22:24, 31 March 2021