
Analytics/Systems/Manual maintenance

=== Monthly ===

* Mediawiki history Druid data source switch
* Check for newly created wikis
* [[Analytics/Systems/Manual_maintenance/Refined flags script|Add _REFINED flags]] for events that contribute to the <code>wmf.wikidata_item_page_link</code> dataset. (This is not even documented anywhere outside of email; see the first sketch after this list.)
* We run the [[Analytics/Systems/Dealing_with_data_loss_alarms#Check_dataloss_False_positives|false positive checker]] for webrequest loss roughly once a month or more. This could be partially automated: if the script finds that all instances of loss are false positives, the job could be rerun automatically. If we automate this, we could also update the webrequest_sequence_stats table with the results, allowing trend tracking on top of that table. Currently, analyzing data loss over time turns up lots of noise with high % loss due to host restarts, etc. (See the second sketch after this list.)
* We re-run sanitization. Updating the command is painful because you often have to change a property file nested in another property file nested in the command. Docs [[Analytics/Systems/EventLogging/Backfilling#Backfilling_sanitization|are here]]; we should build a rerun command that takes just a list of schemas plus <code>since</code> and <code>until</code> parameters. (See the third sketch after this list.)
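
A minimal sketch of the _REFINED flagging step, assuming the usual Refine layout where each refined partition directory carries an empty <code>_REFINED</code> flag file in HDFS. The base path and table list below are placeholders, since the real list of contributing event tables only exists in email:

<syntaxhighlight lang="python">
#!/usr/bin/env python3
# Hypothetical sketch: touch empty _REFINED flag files on event partitions
# so the job building wmf.wikidata_item_page_link sees them as refined.
# BASE and TABLES are assumptions, not the real (email-only) list.
import subprocess
from datetime import datetime

BASE = '/wmf/data/event'          # assumed HDFS base path
TABLES = ['mediawiki_page_move']  # placeholder table list

def flag_partition(table, dt):
    part = (f'{BASE}/{table}/year={dt.year}/month={dt.month}/'
            f'day={dt.day}/hour={dt.hour}')
    # -touchz creates a zero-length file, which is all the flag is.
    subprocess.run(['hdfs', 'dfs', '-touchz', f'{part}/_REFINED'], check=True)

if __name__ == '__main__':
    hour = datetime(2021, 11, 1, 0)
    for table in TABLES:
        flag_partition(table, hour)
</syntaxhighlight>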

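A minimal sketch of the proposed false-positive automation, assuming the checker can report per-host loss rows and whether a host restart explains them. The function names and the restart heuristic here are illustrative, not the real checker's API:

<syntaxhighlight lang="python">
#!/usr/bin/env python3
# Hypothetical sketch: auto-rerun the webrequest load job when every loss
# row is a false positive, and record the verdict so trends stay clean.
# loss_rows() stands in for a Hive/Spark query against
# webrequest_sequence_stats; the restart heuristic is illustrative.

def loss_rows(partition):
    """Placeholder for querying per-host loss for the given partition."""
    return [{'hostname': 'cp1234', 'percent_lost': 0.4, 'host_restarted': True}]

def is_false_positive(row):
    # A host restart resets sequence numbers, so the apparent loss is noise.
    return row['host_restarted']

def handle_alarm(partition):
    rows = loss_rows(partition)
    if all(is_false_positive(r) for r in rows):
        # All clear: rerun the job and tag the rows as false positives so
        # later trend analysis over webrequest_sequence_stats can skip them.
        print(f'would rerun load job for {partition} and mark rows')
        return 'rerun'
    return 'needs-human'

print(handle_alarm('webrequest_text/2021-11-01T00'))
</syntaxhighlight>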

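A minimal sketch of the wished-for sanitization rerun wrapper, assuming the nested property plumbing can be collapsed into a few flags. The entry point and option names below are assumptions, not the current job's real interface:

<syntaxhighlight lang="python">
#!/usr/bin/env python3
# Hypothetical sketch: a rerun command that takes only schemas plus
# since/until, and expands them into the full sanitization invocation.
# The refine_sanitize entry point and flag names are assumptions.
import argparse

def build_command(schemas, since, until):
    return [
        'spark2-submit', 'refine_sanitize.py',  # assumed entry point
        '--table_whitelist=' + ','.join(schemas),
        f'--since={since}',
        f'--until={until}',
    ]

if __name__ == '__main__':
    p = argparse.ArgumentParser(description='Re-run EventLogging sanitization')
    p.add_argument('--schema', action='append', required=True,
                   help='schema/table to sanitize; repeatable')
    p.add_argument('--since', required=True, help='e.g. 2021-11-01T00')
    p.add_argument('--until', required=True, help='e.g. 2021-11-02T00')
    args = p.parse_args()
    # Print rather than execute, since the real job wiring is assumed.
    print(' '.join(build_command(args.schema, args.since, args.until)))
</syntaxhighlight>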
=== Annual ===


* [[Analytics/Data Lake/Edits/Geoeditors/Public#Country Protection List|Update country protection list]]
