Wikidata Query Service/Manual maintenance
Latest revision as of 16:05, 27 September 2021
This page describes the maintenance tasks that are currently manual. It is not meant to be documentation, but rather to track the actions done or required by members of the mw:Wikimedia Search Platform team. Ideally this page should not exist.
Workaround for missing deletes in the current updater
- Context: phab:T272120
- Workaround: re-sync deleted entities polling the deletion log using an ad-hoc tool.
- Tool: https://people.wikimedia.org/~dcausse/wdqs_manual_deletes/
- Last run: 2021-09-27 for the period 2021-09-17T00:00:00Z to 2021-09-27T00:00:00Z
Note: If you run it, please update the info above.
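The ad-hoc tool's source is not reproduced here; below is a minimal sketch of the idea, assuming the standard MediaWiki Action API deletion log (`list=logevents`). The helper names `deletion_log_params` and `extract_deleted_entities` are hypothetical, not part of the actual tool:

```python
import re

def deletion_log_params(start, end, namespace=0):
    """Build MediaWiki API parameters to walk the deletion log for a
    time window (hypothetical helper, not the actual tool)."""
    return {
        "action": "query",
        "list": "logevents",
        "letype": "delete",
        "lestart": start,    # e.g. "2021-09-17T00:00:00Z"
        "leend": end,        # e.g. "2021-09-27T00:00:00Z"
        "ledir": "newer",    # walk from lestart forward to leend
        "lenamespace": namespace,
        "lelimit": "max",
        "format": "json",
    }

def extract_deleted_entities(api_response):
    """Pull Wikidata entity IDs (Q/P/L items) out of a logevents response;
    titles that are not plain entity IDs are ignored."""
    ids = []
    for ev in api_response.get("query", {}).get("logevents", []):
        if re.fullmatch(r"[QPL]\d+", ev.get("title", "")):
            ids.append(ev["title"])
    return ids
```

The extracted IDs would then be re-synced (deleted) on the query service side.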
Streaming updater: test with wdqs1009
- Context: phab:T266470
As of 2021-02-19 the streaming updater is running a custom build:
- user: analytics-search
- machine: stat1004
- flink base: /home/dcausse/flink-1.12.0-wdqs
- session cluster started with:
sudo -u analytics-search kerberos-run-command analytics-search sh -c 'HADOOP_CLASSPATH="`hadoop classpath`" ./bin/yarn-session.sh -tm 4g -jm 2600m -s 4 -nm "WDQS Streaming Updater"'
- application started with:
sudo -u analytics-search kerberos-run-command analytics-search sh -c 'export HADOOP_CLASSPATH=`hadoop classpath`; ./bin/flink run -p 12 -s swift://updater.thanos-swift/wdqs_streaming_updater/bootstrap_savepoint_20210201 -c org.wikidata.query.rdf.updater.UpdaterJob ~/streaming-updater-producer-0.3.64-SNAPSHOT-jar-with-dependencies.jar ../updater-job-partial-reordering.properties'
- custom build: /home/dcausse/streaming-updater-producer-0.3.64-SNAPSHOT-jar-with-dependencies.jar (master minus https://gerrit.wikimedia.org/r/c/wikidata/query/rdf/+/649715)
Verify that it's running by checking that the rate of offset commits is not zero: https://grafana-rw.wikimedia.org/d/000000484/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=wdqs_streaming_updater
It might fail because:
- stat1004 was rebooted
- the jobmanager's node was killed by yarn/analytics
How to recover it
Find the last valid checkpoint from application log using:
sudo -u analytics-search kerberos-run-command analytics-search yarn logs -applicationId application_1612875249838_7657 | less
Find the proper application id using https://yarn.wikimedia.org/cluster/apps and search for WDQS. If yarn still has the data, there should be one line with:
- User: analytics-search
- Name: WDQS Streaming Updater
- Application Type: Apache Flink
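If you prefer scripting over the web UI, the same lookup can be sketched against `yarn application -list` output. The column layout (tab-separated, application id first) is an assumption here, and `find_wdqs_app_id` is a hypothetical helper:

```python
def find_wdqs_app_id(yarn_list_output,
                     name="WDQS Streaming Updater",
                     user="analytics-search"):
    """Scan `yarn application -list` output (tab-separated columns
    assumed) for the WDQS Flink session and return its application id."""
    for line in yarn_list_output.splitlines():
        if name in line and user in line and line.startswith("application_"):
            return line.split("\t")[0].strip()
    return None
```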
In the logs search for the last instance of Completed checkpoint NNNN for job XYZ
2021-02-19 13:42:54,810 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 3087 (type=CHECKPOINT) @ 1613742173917 for job b85e02696673c5d09d41918872c98898.
2021-02-19 13:43:00,980 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 3087 for job b85e02696673c5d09d41918872c98898 (103619620 bytes in 5275 ms).
2021-02-19 13:43:24,763 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 3088 (type=CHECKPOINT) @ 1613742203917 for job b85e02696673c5d09d41918872c98898.
2021-02-19 13:43:30,731 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed checkpoint 3088 for job b85e02696673c5d09d41918872c98898 (103072267 bytes in 5037 ms).
2021-02-19 13:43:54,776 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering checkpoint 3089 (type=CHECKPOINT) @ 1613742233917 for job b85e02696673c5d09d41918872c98898.
End of LogType:jobmanager.log. This log file belongs to a running container (container_1612875249838_46677_01_000001) and so may not be complete.
*******************************************************************************
Here 3088 is the right one; 3089 is not yet complete. The checkpoint path should be wdqs_streaming_updater/checkpoints/b85e02696673c5d09d41918872c98898/chk-3088.
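The "last completed checkpoint" lookup can be scripted rather than eyeballed in `less`; a sketch (hypothetical helpers) that would take the `yarn logs` output as text:

```python
import re

def last_completed_checkpoint(log_text):
    """Return (checkpoint_number, job_id) for the last
    'Completed checkpoint NNNN for job XYZ' line, or None."""
    found = None
    for m in re.finditer(r"Completed checkpoint (\d+) for job ([0-9a-f]+)",
                         log_text):
        found = (int(m.group(1)), m.group(2))
    return found

def checkpoint_path(log_text, base="wdqs_streaming_updater/checkpoints"):
    """Build the swift checkpoint path from the last completed checkpoint."""
    hit = last_completed_checkpoint(log_text)
    if hit is None:
        return None
    num, job = hit
    return f"{base}/{job}/chk-{num}"
```

A "Triggering checkpoint" line without a matching "Completed checkpoint" line is ignored, which is exactly the 3089-vs-3088 distinction above.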
If yarn has purged the application logs you can still try to dig into swift to find the checkpoint. The command:
swift -A https://thanos-swift.discovery.wmnet/auth/v1.0 -U wdqs:flink -K CHANGEME list updater -l
will list the files on the store used for checkpointing. You must find the last valid checkpoint created around the date of the failure. It is a folder named wdqs_streaming_updater/checkpoints/<flink application id>/chk-NNNN. It must have a _metadata file in it, for example:
      0 2021-02-17 10:11:41 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786
      0 2021-02-17 10:28:09 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/chk-10349
      0 2021-02-17 10:44:36 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/chk-10353
      0 2021-02-17 10:45:25 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/chk-10355
 413964 2021-02-17 10:45:29 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/chk-10355/_metadata
      0 2021-02-17 10:11:41 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared
 169742 2021-02-17 10:29:11 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared/033fa5a9-4fe8-4cfe-ae8f-367b589b345c
  90547 2021-02-17 10:44:56 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared/1186aa00-32cb-41de-8fe7-c3e961c77100
1029972 2021-02-17 10:29:10 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared/13ae0214-c902-4c6f-943f-27cc5d1ac5c9
 800483 2021-02-17 10:29:07 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared/162aae61-11fd-4ace-8440-6cf62a9f58f0
1045523 2021-02-17 10:29:08 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared/1ed541ce-a530-4853-8ad2-b0f6b6799709
  32453 2021-02-17 10:44:57 application/octet-stream wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/shared/2743e36e-7d98-43a8-85cc-f75845473148
Here wdqs_streaming_updater/checkpoints/a2f9c2f3d78e904a87e08bade5a44786/chk-10355 sounds like a good checkpoint to use if the failure happened around 2021-02-17T10:45:29.
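Picking the candidate out of a long `swift list -l` output can also be scripted; a sketch assuming the five-column listing format shown above (size, date, time, content-type, path), with hypothetical helper names:

```python
import re

def valid_checkpoints(listing):
    """From `swift list -l` output, map each chk-NNNN prefix that has a
    non-empty _metadata file to the _metadata upload timestamp."""
    result = {}
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 5:
            continue
        size, date, time_, _ctype, path = parts
        m = re.search(r"(.*/chk-\d+)/_metadata$", path)
        if m and int(size) > 0:
            result[m.group(1)] = f"{date} {time_}"
    return result

def best_checkpoint(listing, failure_time):
    """Latest valid checkpoint uploaded at or before failure_time
    ('YYYY-MM-DD HH:MM:SS'; string comparison works for this format)."""
    candidates = [(ts, path) for path, ts in valid_checkpoints(listing).items()
                  if ts <= failure_time]
    return max(candidates)[1] if candidates else None
```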
Warning: Using the wrong checkpoint/savepoint will cause the pipeline to produce data inconsistent with what downstream consumers might expect.
How to stop it
List of flink apps:
sudo -u analytics-search kerberos-run-command analytics-search sh -c 'HADOOP_CLASSPATH="`hadoop classpath`" ./bin/flink list'
Stop with a savepoint:
sudo -u analytics-search kerberos-run-command analytics-search sh -c 'HADOOP_CLASSPATH="`hadoop classpath`" ./bin/flink stop c4ebc03ec0c5431fb6b8232f45b54aa2 --drain -p swift://updater.thanos-swift/wdqs_streaming_updater/savepoints'
Note the savepoint path on the last line e.g.:
Savepoint completed. Path: swift://updater.thanos-swift/wdqs_streaming_updater/savepoints/savepoint-c4ebc0-e5907d976b64
The swift path is the path to use as the -s param when starting the application.
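Grabbing that savepoint path out of the `flink stop` output can be done with a one-line parse; `savepoint_path` is a hypothetical helper, and the output format is as shown above:

```python
import re

def savepoint_path(stop_output):
    """Extract the savepoint path from `flink stop` output, to be passed
    back to `flink run -s` on restart."""
    m = re.search(r"Savepoint completed\. Path: (\S+)", stop_output)
    return m.group(1) if m else None
```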
Downstream consumer (wdqs1009)
As of writing, the downstream consumer is still reloading the data, but as soon as the data reload is complete and the flag file /srv/wdqs/data_loaded is created, the updater (consumer) will start on this machine. It will start consuming the topic kafka-main eqiad.rdf-streaming-updater.mutation at whatever offset the consumer group wdqs1009 is currently at. Note that it is currently well positioned according to the last run (2021-02-18 09:33:53).
Even though this should not be needed for this particular reload, here is how it works: after the initial load, the consumer must be positioned at the offset corresponding to the first message produced by flink when the app is started using the bootstrap savepoint. Use stat1004:/home/dcausse/set_offsets.py to manipulate offsets. The backfill from the bootstrap state started around 2021-02-18T09:33:53, so position the offset around this time:
dcausse@stat1004:~/flink-1.12.0-wdqs$ python3.7 ~/set_offsets.py -t eqiad.rdf-streaming-updater.mutation -c wdqs1009 -b kafka-main1002.eqiad.wmnet:9092 -s 2021-02-18T09:33:53
Configuring offsets for wdqs1009@kafka-main1002.eqiad.wmnet:9092 for topics ['eqiad.rdf-streaming-updater.mutation'] at time 2021-02-18 09:33:53
Setting: TopicPartition(topic='eqiad.rdf-streaming-updater.mutation', partition=0) to OffsetAndTimestamp(offset=12080891, timestamp=1613640833827)
Check that the data at this offset corresponds to the event time of the bootstrap state (date of the dumps):
dcausse@stat1004:~/flink-1.12.0-wdqs$ kafkacat -b kafka-main1003.eqiad.wmnet:9092 -t eqiad.rdf-streaming-updater.mutation -o 12080891 -c 1 | jq .event_time
% Auto-selecting Consumer mode (use -P or -C to override)
"2021-01-29T23:06:26Z"
Double check that it's the first message by looking at the previous one (offset minus one):
dcausse@stat1004:~/flink-1.12.0-wdqs$ kafkacat -b kafka-main1003.eqiad.wmnet:9092 -t eqiad.rdf-streaming-updater.mutation -o 12080890 -c 1 | jq .event_time
% Auto-selecting Consumer mode (use -P or -C to override)
"2021-01-30T17:22:00Z"
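The lookup set_offsets.py presumably performs follows Kafka's `offsets_for_times` semantics: the earliest offset whose message timestamp is at or after the requested time. Since the script itself is not shown here, a pure-function sketch of that rule (over hypothetical `(offset, timestamp_ms)` pairs):

```python
def offset_for_time(messages, target_ts):
    """Mimic Kafka's offsets_for_times on a list of (offset, timestamp_ms)
    pairs: return the earliest offset whose timestamp is >= target_ts,
    or None if no message is that recent."""
    for offset, ts in sorted(messages):
        if ts >= target_ts:
            return offset
    return None
```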
Useful dashboard: https://grafana-rw.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater?orgId=1