
Incident documentation/20160907-Android

Revision as of 22:56, 7 September 2016 by imported>Mobrovac

Note: this is a draft in progress.

Summary

On September 7, for approximately 6.5 hours, the Wikipedia app for Android crashed on startup (or on navigation to the feed activity) for all users with an app language other than English. The crash was caused by a recent change to how feed content responses are composed in RESTBase, and it affected both the beta and stable versions of the app. The team regrets the error and apologizes to all affected users.

Timeline

13:31: A change is deployed in RESTBase so that it composes content for the production app feed endpoint[1] from individual feed component endpoints in the mobile content service, rather than from the mobile content service's own aggregated feed endpoint. However, a bug in the individual feed endpoints causes RESTBase to return a 500 code (relayed from the mobile content service) for all non-English aggregated feed requests.

14:55: The bug fix for the above is deployed. From this point, production feed endpoint responses can contain empty JSON objects that the app does not expect and can't handle.

17:14: A Phabricator ticket is filed stating that the app is crashing on startup.[2]

18:35: A user reports on #wikimedia-mobile that the app is crashing on startup, and the apps and services engineers begin to investigate. It is soon determined that the crash affects at least Hebrew and German language users.

19:52: Fixes to RESTBase and the mobile content service are deployed and the RESTBase cache is cleared, fixing the crash.

Discussion

The mobile content service, which provides the content for the app's Explore feed feature, can surface a Wikipedia's featured article for the day. Currently, this is enabled for English only.

In production, RESTBase obtains the feed endpoint response from the mobile content service and stores it for faster retrieval. Previously, RESTBase obtained this content from the mobile content service's aggregated feed content endpoint, which requests content from internal service endpoints for single pieces of feed content and aggregates the results. The aggregated response omits empty properties from the response object, including the featured article for languages in which it is not to be included.

After the 13:31 deployment, RESTBase instead requested content from the individual internal feed content endpoints and composed the response on its own. This change was intended to improve performance, since the aggregated response is currently updated every two seconds. Unfortunately, for languages other than English, the response composed by RESTBase contained empty JSON objects that the app did not expect and was not prepared to handle. Encountering these objects caused the app to crash. Since the app starts on the feed activity by default, the problem most often would have manifested as a crash on startup.
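The difference between the two composition paths can be illustrated with a minimal Python sketch. The component names (`tfa`, `mostread`), the payload shapes, and the helpers `is_empty`, `compose_naive` and `compose_pruned` are hypothetical and chosen for illustration only; they are not the actual RESTBase or mobile content service code or schema. The sketch assumes that an individual endpoint yields an empty object when it has no content for a language.

```python
def is_empty(value):
    """True for the placeholder values a component may return when it has no content."""
    return value is None or value == {} or value == []


def compose_naive(parts):
    """Merge per-component responses as-is, so empty objects reach the client."""
    return dict(parts)


def compose_pruned(parts):
    """Merge per-component responses, omitting components that have no content."""
    return {name: body for name, body in parts.items() if not is_empty(body)}


# Hypothetical German-language components: no featured article is available.
parts_de = {
    "tfa": {},
    "mostread": {"articles": [{"title": "Berlin"}]},
}

naive = compose_naive(parts_de)    # still contains the empty "tfa" object
pruned = compose_pruned(parts_de)  # "tfa" is omitted entirely
```

Under these assumptions, `compose_naive` corresponds to the behaviour that triggered the crash, while `compose_pruned` mirrors what the aggregated endpoint is described as doing: leaving out empty properties rather than sending them.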

It should be noted that aggregated feed content responses from the mobile content service always omitted empty objects as intended. Because our content service unit tests exercised only that aggregated path, they did not surface the bug before it reached production.

Conclusions

What weaknesses did we learn about and how can we address them?

Actionables

  • The app should gracefully handle unexpected responses, within the bounds of the API contract. (TODO: Create Phab ticket)
  • Unit tests should be added to RESTBase to ensure the absence of unexpected fields and empty JSON objects. (TODO: Create Phab ticket and/or GitHub issue)
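
Both actionables could be approached roughly as sketched below. Python is used only for brevity, and everything here is an assumption made for illustration: the helper names `find_empty_objects` and `featured_title`, the `tfa` key, and the `title` field are hypothetical and do not reflect the real feed schema, the app's parsing code, or RESTBase's test suite.

```python
import json


def find_empty_objects(value, path="$"):
    """Return the JSON paths of every empty object ({}) nested in value."""
    found = []
    if isinstance(value, dict):
        if not value:
            found.append(path)
        for key, child in value.items():
            found.extend(find_empty_objects(child, f"{path}.{key}"))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            found.extend(find_empty_objects(child, f"{path}[{index}]"))
    return found


def featured_title(raw_response):
    """Client-side read that tolerates a missing, empty, or malformed part."""
    try:
        feed = json.loads(raw_response)
    except ValueError:
        # Malformed JSON: treat it as "no featured article" instead of crashing.
        return None
    if not isinstance(feed, dict):
        return None
    tfa = feed.get("tfa")
    if not isinstance(tfa, dict):
        return None
    title = tfa.get("title")
    return title if isinstance(title, str) and title else None
```

A server-side test could assert that `find_empty_objects(response)` returns an empty list for a composed response in each supported language, and a client-side test could feed `featured_title` a response containing `{"tfa": {}}` and check that it returns `None` rather than raising.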