
Incident documentation/2018-04-10 Deleting a page on enwiki: Difference between revisions

From Wikitech-static
imported>Krinkle
 
 
#REDIRECT [[Incidents/2018-04-10 Deleting a page on enwiki]]
 
== Summary ==
An ALTER TABLE running on the archive table on the English Wikipedia master database server caused page deletions to fail on en.wikipedia.org for ~40 minutes.
 
It was originally reported at {{Phabricator|T191875}}.
According to Logstash, 85 errors occurred during that time.
 
== Timeline (IN UTC) ==
*05:20 An ALTER TABLE on the externallinks table was started on the enwiki master (db1052)
 
*08:20 An ALTER TABLE on the archive table was started on the enwiki master (db1052)
 
*08:30 Errors started to appear in the error log: https://logstash.wikimedia.org/goto/becc429ddb975af71624b66402c3f6bb
 
Example of a failure:
 
''Read timeout is reached (10.64.16.77) INSERT  INTO archive (ar_namespace,ar_title,ar_timestamp,ar_minor_edit,ar_rev_id,ar_parent_id,ar_text_id,ar_text,ar_flags,ar_len,ar_page_id,ar_deleted,ar_sha1,ar_comment,ar_comment_id,ar_user,ar_user_text,ar_content_model,ar_content_format) VALUES ('14','Articles_needing_sections_from_August_2015','xx','0','xx','0','xx','','','xx','xx','0','xx','Creating monthly dated maintenance category for current month','xx','xx','xx',NULL,NULL)''
 
*09:19 The ALTER TABLE finished and the errors stopped
 
*09:22 It was confirmed that everything works again at {{Phabricator|T191875#4119589}}
 
A total of 85 errors occurred between 08:20 and 09:19.
 
== Conclusions ==
The following ALTER TABLE caused issues on db1052 (enwiki master) when deleting (and probably when moving) a page, but not on the other 6 masters where it had been executed earlier:
''SET SESSION innodb_lock_wait_timeout=1; SET SESSION lock_wait_timeout=30; ALTER TABLE archive MODIFY COLUMN ar_text mediumblob NULL, MODIFY COLUMN ar_flags tinyblob NULL;''
 
There were a total of 85 errors during the time of the incident.
 
It was also noticed that the master had ongoing queries on the archive table ({{Phabricator|T191875#4119715}}), which might also have contributed to this issue. Whether they were a cause or a consequence is still unknown.
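
Ongoing queries of this kind can be spotted by inspecting the process list on the master. The following is a minimal sketch, assuming a MariaDB/MySQL server that exposes <code>information_schema.PROCESSLIST</code>. The <code>LIKE</code> filter and the 120-character truncation are illustrative choices, not the query that was run during the incident.

<syntaxhighlight lang="sql">
-- List active (non-idle) connections whose current statement mentions
-- the archive table, longest-running first.
-- The session running this query is excluded, since its own statement
-- text would otherwise match the filter.
SELECT ID, USER, TIME, STATE, LEFT(INFO, 120) AS query_snippet
FROM information_schema.PROCESSLIST
WHERE COMMAND <> 'Sleep'
  AND INFO LIKE '%archive%'
  AND ID <> CONNECTION_ID()
ORDER BY TIME DESC;
</syntaxhighlight>

A plain substring match also catches any other statement whose text happens to contain "archive", so the result needs a manual look.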
 
As described at {{Phabricator|T191875#4119820}}, this ALTER is fully online and has not caused issues anywhere else.
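
The expectation that the change is online can be stated in the statement itself. The following is a sketch only, not what was executed during the incident: the <code>ALGORITHM</code> and <code>LOCK</code> clauses are an addition, and whether the server accepts them for this particular column change depends on its version, which is an assumption here.

<syntaxhighlight lang="sql">
-- Same session timeouts as the original statement.
SET SESSION innodb_lock_wait_timeout=1;
SET SESSION lock_wait_timeout=30;

-- Assumption: ALGORITHM=INPLACE, LOCK=NONE is added for illustration.
-- With these clauses the ALTER is rejected with an error if it cannot
-- run in place while allowing concurrent reads and writes.
ALTER TABLE archive
  MODIFY COLUMN ar_text mediumblob NULL,
  MODIFY COLUMN ar_flags tinyblob NULL,
  ALGORITHM=INPLACE, LOCK=NONE;
</syntaxhighlight>

Even an online ALTER takes a brief exclusive metadata lock on the table, and <code>lock_wait_timeout</code> bounds how long the statement waits to acquire it.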
 
== Actionables ==
We are still investigating why this happened, but so far everything points to a race condition.
 
The following task has also been created to check whether the new queries may be creating additional contention (although it is not yet clear that this is the root or only cause of the incident):
<onlyinclude>
* Reduce locking contention on deletion of pages ([[phab:T191892]])
</onlyinclude>
 
[[Category:Incident documentation]]

Latest revision as of 17:46, 8 April 2022