You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Maps/Maintenance: Difference between revisions

imported>MSantos
(Created page with "= Maps maintenance = This document outlines the maintenance activities and points to further documentation explaining each process. The idea is to make the maps infrastructur...")
 
imported>MSantos
Line 36: Line 36:
=== Application layer ===
=== Application layer ===


'''Tilerator'''
'''Kartotherian'''


{| class="wikitable"
{| class="wikitable"
|-
|-
! Activity !! SRE !! PI !! Automated?
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
|-
| Monitor tile generation triggered by OSM replication ||  ||  ||
| Investigate and fix application production errors ||  ||  || ||
|-
| Monitor z0 - z9 monthly tile regeneration ||  ||  ||
|-
| Manually trigger a tile regeneration for a specific part of the planet ||  ||  ||
|}
|}
'''Kartotherian'''
 
 
'''Tegola'''


{| class="wikitable"
{| class="wikitable"
|-
|-
! Activity !! SRE !! PI !! Automated?
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
|-
| Investigate and fix application production errors ||  ||  ||
| Investigate and fix application production errors and submit code to upstream ||  ||  || ||
|}
|}
=== Infrastructure ===
=== Infrastructure ===


'''Beta Cluster'''
'''Beta Cluster''' <syntaxhighlight>*.maps-experiments.eqiad1.wikimedia.cloud</syntaxhighlight>


{| class="wikitable"
{| class="wikitable"
|-
|-
! Activity !! SRE !! PI !! Automated?
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
|-
|  ||  ||  ||
|  ||  ||  || ||
|}
|}
'''Varnish'''
'''Varnish'''


{| class="wikitable"
{| class="wikitable"
|-
|-
! Activity !! SRE !! PI !! Automated?
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
|-
| Purge tile cache when vandalism occurs ||  ||  ||
| Purge tile cache when vandalism occurs ||  ||  || ||
|}
|}
'''Cassandra'''
 
'''PostgreSQL/OSM'''


{| class="wikitable"
{| class="wikitable"
|-
|-
! Activity !! SRE !! PI !! Automated?
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
| Restore replica out of sync with main DB ||  ||  || ||
|-
|-
| Restore node replica when there is a disk space issue ||  ||  ||
| Initial OSM import  ||  ||  || ||
|-
|-
| Setup storage keyspace on master machine ||  ||  ||
| Restore DB causing disk space issue ||  ||  || ||
|-
|-
| Enable data replication for the remaining nodes ||  ||  ||
| Restore OSM replication because OSM lag is falling behind ||  ||  || ||
|-
|-
| Setup storage keyspace on master machine ||  ||  ||
| Make sure that the proper binaries are successfully installed in the infrastructure ||  ||  || ||
|}
|}
'''PostgreSQL'''
 
'''Swift'''


{| class="wikitable"
{| class="wikitable"
|-
|-
! Activity !! SRE !! PI !! Automated?
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
|-
| Restore replica out of sync with main DB || || ||
| || || ||
|}
 
'''Kafka'''
 
{| class="wikitable"
|-
|-
| Initial OSM import  ||  ||  ||
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
|-
| Restore DB causing disk space issue || || ||
| Wipe Kafka topic (empty queue) by skipping all events in the stream || || ||
|}
 
'''Tegola'''
 
{| class="wikitable"
|-
|-
| Restore OSM replication because OSM lag is falling behind ||  ||  ||
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
|-
| Make sure that the proper binaries are successfully installed in the infrastructure || || ||
| Enable pre-generation on eqiad/codfw tegola || || ||
|}
|}
'''Redis'''
> No known production issues or maintenance tasks

Revision as of 16:50, 5 February 2022

Maps maintenance

This document outlines the maintenance activities and points to further documentation explaining each process. The idea is to make the maps infrastructure:

Understood

Teams responsible for aspects of the service understand where their responsibilities begin and end, and have the information required to fulfill those responsibilities (alerting, documentation, SLO, etc.).

Supported

Modern components running up-to-date versions (where possible and realistic) and, if possible, internally standardized (for example Prometheus, but also possibly the discussion around Cassandra that emerged in our meeting, running maps services on Buster where metal is needed, Node.js updates, etc.).

Automated

Wherever possible, updates and self-healing require no manual intervention. This is not a problem as such at the moment, but it would be excellent to avoid things like resyncing the databases the way we do now.

Distributed/fault tolerant

Currently, if we lose an individual maps node, we lose four services at once. The plan to move components to k8s where possible greatly improves this situation, as does avoiding tight coupling between application components and state.

Known issues

  • Services are not in k8s
  • Currently if we lose an individual maps node, we lose four services at once
  • Components tightly coupled
  • Resyncing the DB has a high cost
  • Metrics need to move from Graphite to Prometheus
  • Service is not paging
  • Kartotherian is not publicly available to third parties

Maintenance activities responsibility and support

What's needed for the maintenance work? R = Responsible, S = Support

Application layer

Kartotherian

{| class="wikitable"
|-
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
| Investigate and fix application production errors ||  ||  ||  ||
|}


Tegola

{| class="wikitable"
|-
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
| Investigate and fix application production errors and submit code to upstream ||  ||  ||  ||
|}

Infrastructure

Beta Cluster

*.maps-experiments.eqiad1.wikimedia.cloud
{| class="wikitable"
|-
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
|  ||  ||  ||  ||
|}

Varnish

{| class="wikitable"
|-
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
| Purge tile cache when vandalism occurs ||  ||  ||  ||
|}

PostgreSQL/OSM

{| class="wikitable"
|-
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
| Restore replica out of sync with main DB ||  ||  ||  ||
|-
| Initial OSM import ||  ||  ||  ||
|-
| Restore DB causing disk space issue ||  ||  ||  ||
|-
| Restore OSM replication because OSM lag is falling behind ||  ||  ||  ||
|-
| Make sure that the proper binaries are successfully installed in the infrastructure ||  ||  ||  ||
|}
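
This page does not say how OSM replication lag is measured. As a rough sketch only, assuming the replication tooling keeps an osmosis-style state.txt file (a Java properties file with a timestamp key whose colons are escaped), the lag could be estimated as below. The function names and the two-day threshold are hypothetical and are not taken from the maps codebase.

```python
from datetime import datetime, timezone

# Hypothetical alert threshold; the real value is not stated on this page.
MAX_LAG_SECONDS = 2 * 24 * 3600


def parse_state(text):
    """Parse an osmosis-style state.txt into a dict of key/value strings."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comment lines, and anything without a key/value pair.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Java properties files escape colons as "\:", so undo that here.
        state[key.strip()] = value.strip().replace("\\:", ":")
    return state


def replication_lag_seconds(state_text, now=None):
    """Return how many seconds the replicated data trails behind `now`."""
    state = parse_state(state_text)
    stamp = datetime.strptime(state["timestamp"], "%Y-%m-%dT%H:%M:%SZ")
    stamp = stamp.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - stamp).total_seconds()


def is_falling_behind(state_text, now=None, max_lag=MAX_LAG_SECONDS):
    """True when the lag exceeds the (assumed) threshold."""
    return replication_lag_seconds(state_text, now) > max_lag
```

A check like this could feed an alert, so that restoring replication starts before the backlog grows large. Whether the production setup exposes such a state file, and where, would need to be confirmed against the actual deployment.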

Swift

{| class="wikitable"
|-
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
|  ||  ||  ||  ||
|}

Kafka

{| class="wikitable"
|-
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
| Wipe Kafka topic (empty queue) by skipping all events in the stream ||  ||  ||  ||
|}

Tegola

{| class="wikitable"
|-
! Activity !! Responsible !! Consulted !! Informed !! Automated?
|-
| Enable pre-generation on eqiad/codfw tegola ||  ||  ||  ||
|}