Wikidata Query Service/Manual maintenance

This page describes the maintenance tasks that are currently manual. It is not meant to be documentation, but rather to track the actions done or required by members of the mw:Wikimedia Search Platform team. Ideally this page should not exist.

Workaround for missing deletes in the current updater

  • Workaround: re-sync deleted entities by polling the deletion log with an ad-hoc tool.
  • Tool: https://people.wikimedia.org/~dcausse/wdqs_manual_deletes/
  • Last run: 2021-09-17 for the period 2021-08-23T00:00:00Z to 2021-09-17T00:00:00Z

Note: if you run the tool, please update the info above.

Streaming updater: test with wdqs1009

As of 2021-02-19 the streaming updater is running a custom build:

  • user: analytics-search
  • machine: stat1004
  • flink base: /home/dcausse/flink-1.12.0-wdqs
  • session cluster started with: sudo -u analytics-search kerberos-run-command analytics-search sh -c 'HADOOP_CLASSPATH="`hadoop classpath`" ./bin/yarn-session.sh -tm 4g -jm 2600m -s 4 -nm "WDQS Streaming Updater"'
  • application started with: sudo -u analytics-search kerberos-run-command analytics-search sh -c 'export HADOOP_CLASSPATH=`hadoop classpath`; ./bin/flink run -p 12 -s swift://updater.thanos-swift/wdqs_streaming_updater/bootstrap_savepoint_20210201 -c org.wikidata.query.rdf.updater.UpdaterJob ~/streaming-updater-producer-0.3.64-SNAPSHOT-jar-with-dependencies.jar ../updater-job-partial-reordering.properties'
  • custom build: /home/dcausse/streaming-updater-producer-0.3.64-SNAPSHOT-jar-with-dependencies.jar (master minus https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/649715)

Verify that it's running by checking that the rate of offset commits is not zero: https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=wdqs_streaming_updater
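
If Grafana is unavailable, the committed offsets can also be checked with the stock Kafka CLI (a sketch, assuming kafka-consumer-groups.sh is available on a kafka-main broker; run it twice and verify that CURRENT-OFFSET advances):

kafka-consumer-groups.sh --bootstrap-server kafka-main1002.eqiad.wmnet:9092 --describe --group wdqs_streaming_updater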

It might fail because:

  • stat1004 was rebooted
  • the node running the jobmanager was killed by yarn/analytics

How to recover it

Find the last valid checkpoint in the application logs using:

sudo -u analytics-search kerberos-run-command analytics-search yarn logs -applicationId application_1612875249838_7657 | less

Find the proper application id using https://yarn.wikimedia.org/cluster/apps and searching for WDQS. If yarn still has the data, there should be one line with:

  • User: analytics-search
  • Name: WDQS Streaming Updater
  • Application Type: Apache Flink
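
The application id can also be listed from the command line (a sketch, assuming yarn application -list works for the analytics-search user):

sudo -u analytics-search kerberos-run-command analytics-search yarn application -list | grep -i 'WDQS Streaming Updater'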

In the logs, search for the last instance of "Completed checkpoint NNNN for job XYZ":

2021-02-19 13:42:54,810 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 3087 (type=CHECKPOINT) @ 1613742173917 for job b85e02696673c5d09d41918872c98898.
2021-02-19 13:43:00,980 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 3087 for job b85e02696673c5d09d41918872c98898 (103619620 bytes in 5275 ms).
2021-02-19 13:43:24,763 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 3088 (type=CHECKPOINT) @ 1613742203917 for job b85e02696673c5d09d41918872c98898.
2021-02-19 13:43:30,731 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Completed checkpoint 3088 for job b85e02696673c5d09d41918872c98898 (103072267 bytes in 5037 ms).
2021-02-19 13:43:54,776 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] - Triggering checkpoint 3089 (type=CHECKPOINT) @ 1613742233917 for job b85e02696673c5d09d41918872c98898.
End of LogType:jobmanager.log.This log file belongs to a running container (container_1612875249838_46677_01_000001) and so may not be complete.
*******************************************************************************

Here 3088 is the right one; 3089 is not yet complete.
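
To jump straight to those lines, the same log dump can be filtered (a convenience variant of the command above, assuming the same application id):

sudo -u analytics-search kerberos-run-command analytics-search yarn logs -applicationId application_1612875249838_7657 | grep 'Completed checkpoint' | tail -n 5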

The checkpoint path should be wdqs_streaming_updater/checkpoints/b85e02696673c5d09d41918872c98898/chk-3088.

If yarn has purged the application logs, you can still try to dig into swift to find the checkpoint. The command:

swift -A https://thanos-swift.discovery.wmnet/auth/v1.0 -U wdqs:flink -K CHANGEME list updater -l

will list the files on the store used for checkpointing; you must find the last valid checkpoint created around the date of the failure. It is a folder named:

wdqs_streaming_updater/checkpoints/<flink job id>/chk-NNNN

It must have a _metadata file in it, example:

           0 2021-02-17 10:11:41 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786
           0 2021-02-17 10:28:09 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/chk-10349
           0 2021-02-17 10:44:36 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/chk-10353
           0 2021-02-17 10:45:25 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/chk-10355
      413964 2021-02-17 10:45:29 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/chk-10355/_metadata
           0 2021-02-17 10:11:41 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared
      169742 2021-02-17 10:29:11 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared/033fa5a9-4fe8-4cfe-ae8f-367b589b345c
       90547 2021-02-17 10:44:56 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared/1186aa00-32cb-41de-8fe7-c3e961c77100
     1029972 2021-02-17 10:29:10 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared/13ae0214-c902-4c6f-943f-27cc5d1ac5c9
      800483 2021-02-17 10:29:07 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared/162aae61-11fd-4ace-8440-6cf62a9f58f0
     1045523 2021-02-17 10:29:08 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared/1ed541ce-a530-4853-8ad2-b0f6b6799709
       32453 2021-02-17 10:44:57 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared/2743e36e-7d98-43a8-85cc-f75845473148

wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/chk-10355 sounds like a good checkpoint to use if the failure happened around 2021-02-17T10:45:29.
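
To narrow the listing down to complete checkpoints, the same command can be filtered (a sketch, assuming the --prefix option of python-swiftclient; replace CHANGEME with the real key):

swift -A https://thanos-swift.discovery.wmnet/auth/v1.0 -U wdqs:flink -K CHANGEME list updater -l --prefix wdqs_streaming_updater/checkpoints/ | grep _metadata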

How to stop it

List of flink apps:

sudo -u analytics-search kerberos-run-command analytics-search sh -c 'HADOOP_CLASSPATH="`hadoop classpath`" ./bin/flink list'

Stop with a savepoint:

sudo -u analytics-search kerberos-run-command analytics-search sh -c 'HADOOP_CLASSPATH="`hadoop classpath`" ./bin/flink stop c4ebc03ec0c5431fb6b8232f45b54aa2 --drain -p swift://updater.thanos-swift/wdqs_streaming_updater/savepoints'

Note the savepoint path on the last line, e.g.:

Savepoint completed. Path: swift://updater.thanos-swift/wdqs_streaming_updater/savepoints/savepoint-c4ebc0-e5907d976b64

The swift path is the path to use as the -s param when starting the application.
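
To restart the job from that savepoint, reuse the submission command from the custom-build section above, changing only the -s argument (the jar and properties paths shown are those of the 2021-02-19 deployment; adjust if they changed):

sudo -u analytics-search kerberos-run-command analytics-search sh -c 'export HADOOP_CLASSPATH=`hadoop classpath`; ./bin/flink run -p 12 -s swift://updater.thanos-swift/wdqs_streaming_updater/savepoints/savepoint-c4ebc0-e5907d976b64 -c org.wikidata.query.rdf.updater.UpdaterJob ~/streaming-updater-producer-0.3.64-SNAPSHOT-jar-with-dependencies.jar ../updater-job-partial-reordering.properties'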

Downstream consumer (wdqs1009)

As of writing, the downstream consumer is still reloading the data, but as soon as the data reload is complete and the flag file /srv/wdqs/data_loaded is created, the updater (consumer) will start on this machine. It will start consuming the kafka-main topic eqiad.rdf-streaming-updater.mutation at whatever offset the consumer group wdqs1009 is currently at. Note that the group is currently well positioned according to the last run (2021-02-18 09:33:53).
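
To see where things stand on the host, check for the flag file and the updater service (a sketch, assuming the standard wdqs-updater systemd unit on wdqs1009):

ls -l /srv/wdqs/data_loaded
systemctl status wdqs-updater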

Even though this should not be needed for this particular reload, here is how it works:

After the initial load, the consumer group must be positioned at the offset of the first message produced by flink when the application was started from the bootstrap savepoint.

Use /home/dcausse/set_offsets.py on stat1004 to manipulate offsets.

The backfill from the bootstrap state started around 2021-02-18T09:33:53, so position the offset around this time:

dcausse@stat1004:~/flink-1.12.0-wdqs$ python3.7 ~/set_offsets.py -t eqiad.rdf-streaming-updater.mutation -c wdqs1009 -b kafka-main1002.eqiad.wmnet:9092 -s 2021-02-18T09:33:53
Configuring offsets for wdqs1009@kafka-main1002.eqiad.wmnet:9092 for topics ['eqiad.rdf-streaming-updater.mutation'] at time 2021-02-18 09:33:53
Setting: TopicPartition(topic='eqiad.rdf-streaming-updater.mutation', partition=0) to OffsetAndTimestamp(offset=12080891, timestamp=1613640833827)
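
The new committed position can be double-checked with the stock Kafka CLI (again assuming kafka-consumer-groups.sh on a kafka-main broker):

kafka-consumer-groups.sh --bootstrap-server kafka-main1002.eqiad.wmnet:9092 --describe --group wdqs1009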

Check that the data at this offset corresponds to the event time of the bootstrap state (the date of the dumps):

dcausse@stat1004:~/flink-1.12.0-wdqs$ kafkacat -b kafka-main1003.eqiad.wmnet:9092 -t eqiad.rdf-streaming-updater.mutation -o 12080891 -c 1 | jq .event_time
% Auto-selecting Consumer mode (use -P or -C to override)
"2021-01-29T23:06:26Z"

Double check that it's the first message by looking at the previous one (offset minus one):

dcausse@stat1004:~/flink-1.12.0-wdqs$ kafkacat -b kafka-main1003.eqiad.wmnet:9092 -t eqiad.rdf-streaming-updater.mutation -o 12080890 -c 1 | jq .event_time
% Auto-selecting Consumer mode (use -P or -C to override)
"2021-01-30T17:22:00Z"

Useful dashboard: https://grafana-rw.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1