
WMDE New Editors and Banner Campaigns Dashboard

From Wikitech-static
Revision as of 08:39, 12 April 2019 by imported>GoranSMilovanovic (→‎Overview)

Introduction

The WMDE New Editors/Banner Campaigns Dashboard is an RStudio Shiny dashboard developed by WMDE to track, visualize, and compare two sources of data: (a) the editor activity (user registrations and edits) of spontaneously registered editors of the German Wikipedia, and (b) the editor activity of those German Wikipedia users who registered through one of WMDE's Banner Campaigns. It provides an overview of user edits on the German Wikipedia from the encyclopedia's inception to the present, as well as for a reference period beginning in 2017, when WMDE started to run its Banner Campaigns.

Overview

The following chart depicts the operational workflow from (a) the dashboard's back-end update engine, currently running on stat1007, through (b) the production of the public datasets used, to (c) the Shiny dashboard that runs in the CloudVPS environment.

WMDE New Editors/Banner Campaigns Dashboard

We briefly outline the operational workflow:

  • The wmdeCampaignEvaluator_PRODUCTION.R script:
    • orchestrates HiveQL calls to the wmf.mediawiki_history table, SQL calls to dewiki.revision, and SQL calls to the log database,
    • matches the campaign codes and campaign banner tags with the user registration and user edits data,
    • produces aggregated data and anonymizes user IDs in the datasets.
  • The data produced by wmdeCampaignEvaluator_PRODUCTION.R are copied into the following public directory:
  • The data are downloaded by the CloudVPS dashboard update component and stored locally, where
  • the WMDE New Editors/Banner Campaigns Shiny dashboard uses them to produce the respective visualizations and tables.
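
The download step of the CloudVPS update component can be sketched as follows. This is an illustration, not the production code: the function name is ours and publicDir is a placeholder for the real public directory; the file names are the public datasets listed in the Public Datasets section below.

```r
# Sketch of the CloudVPS update component's download step.
# ASSUMPTION: publicDir is a placeholder argument, not the real
# public directory; the file names are the public datasets
# produced by the update engine.
dashboardFiles <- c("prod_edits10_Update.csv",
                    "prod_edits50_Update.csv",
                    "prod_monthlyEdits_Reference.csv",
                    "prod_editClassAggregate.csv",
                    "prod_userRegistrationDatesDaily_dewiki.csv")

downloadDashboardData <- function(publicDir, files, localDir = "data") {
  # - make sure the local target directory exists
  dir.create(localDir, showWarnings = FALSE, recursive = TRUE)
  # - fetch each public dataset into localDir
  for (f in files) {
    download.file(url = paste0(publicDir, f),
                  destfile = file.path(localDir, f),
                  quiet = TRUE)
  }
}
```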

The Update Engine

We review only the most essential procedures here.

To collect the CET dates of each user's 10th edit, the script orchestrates HiveQL calls from R to wmf.mediawiki_history, where projects[i] is any project (dewiki, for example) and snapshot is the current table snapshot:

### --- run HiveQL scripts
for (i in 1:length(projects)) {
  
  # - toReport
  print(paste0("REPORT: beeline -  When do the dewiki users reach their 10th edit?", 
               Sys.time(), "."))
  
  ### --- HiveQL query
  hiveQL <- paste("SELECT event_user_id, substr(from_utc_timestamp(event_timestamp, 'CET'), 1, 10) FROM (
                  SELECT *,
                  row_number() over (partition by event_user_id order by event_timestamp) rownum
                  FROM wmf.mediawiki_history where wiki_db = '",
                  projects[i],
                  "' AND event_entity = 'revision'
                  AND event_type = 'create'
                  AND event_user_is_bot_by_name = false 
                  AND NOT ARRAY_CONTAINS(event_user_groups, 'bot') 
                  AND NOT ARRAY_CONTAINS(event_user_groups_historical, 'bot')
                  AND event_user_revision_count >= 10
                  AND snapshot = '", snapshot,
                  "') tab1 WHERE rownum = 1;",
                  sep = "")
  
  # - write hql
  write(hiveQL, paste0(localDir, 'newEds10.hql'))
  
  ### --- output filename
  filename <- paste('newUsers10_', projects[i],".tsv", sep = "")
  
  ### --- execute hql script:
  hiveArgs <- '/usr/local/bin/beeline --silent --incremental=true --verbose=false -f'
  hiveInput <- paste(paste(localDir, 'newEds10.hql', sep = ""),
                     " > ",
                     filename,
                     sep = "")
  # - command:
  hiveCommand <- paste(hiveArgs, hiveInput)
  system(command = hiveCommand, wait = TRUE)
}
### --- END run HiveQL scripts
file.remove(paste0(localDir, 'newEds10.hql'))

A similar procedure is used to obtain the dates for the 50th edit.

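Since the 10th- and 50th-edit queries differ only in the event_user_revision_count threshold, the query construction can be factored into a small helper. This is a sketch; buildMilestoneQuery is our name for it, not a function in the production script:

```r
# Build the HiveQL query for the CET date of a user's n-th edit on a
# project. Relative to the 10th-edit query above, the only moving
# parts are the revision-count threshold (n), the project, and the
# snapshot.
buildMilestoneQuery <- function(project, snapshot, n) {
  paste0("SELECT event_user_id, ",
         "substr(from_utc_timestamp(event_timestamp, 'CET'), 1, 10) FROM (",
         " SELECT *, row_number() over",
         " (partition by event_user_id order by event_timestamp) rownum",
         " FROM wmf.mediawiki_history WHERE wiki_db = '", project, "'",
         " AND event_entity = 'revision' AND event_type = 'create'",
         " AND event_user_is_bot_by_name = false",
         " AND NOT ARRAY_CONTAINS(event_user_groups, 'bot')",
         " AND NOT ARRAY_CONTAINS(event_user_groups_historical, 'bot')",
         " AND event_user_revision_count >= ", n,
         " AND snapshot = '", snapshot, "'",
         ") tab1 WHERE rownum = 1;")
}
```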

To collect all campaign registered user IDs, one first needs to list all the ServerSideAccountCreation schemata available in the log database, and then iterate simple SQL queries over them (note: event_campaign_tags are the event_campaign tags used in the WMDE Banner Campaigns that the code searches for):

### ------------------------------------------------------
### Collect all campaign registered users
### ------------------------------------------------------
# - data.table provides fread() and rbindlist(), used below
library(data.table)
# - toReport
print(paste0("REPORT: Collect all campaign registered users", 
             Sys.time(), "."))
event_campaign_tags <- read.csv(paste0(params$inputDir, 
                                       'event_campaign_tags_PRODUCTION.csv'), 
                                header = T, 
                                check.names = F, 
                                stringsAsFactors = F)
event_campaign_tags <- event_campaign_tags$event_campaign
# - collect event_userIds from 
event_campaign_tags_Condition <- paste0(
  "(",
  paste(paste0("(event_campaign like '%", event_campaign_tags, "%')"), collapse = " or "),
  ")"
)
# - Enlist all ServerSideAccountCreation schemata
# - toReport
print(paste0("REPORT: Enlist all ServerSideAccountCreation schemata", 
             Sys.time(), "."))
sqlLogIn <- 'mysql --defaults-file=/etc/mysql/conf.d/analytics-research-client.cnf -h analytics-slave.eqiad.wmnet -A -e'
query <- '"use log; show tables from log like \'%ServerSideAccountCreation%\';"'
outFile <- paste0('> ', localDir, 'ssCreationSchemata.tsv')
sQuery <- paste(sqlLogIn, query, outFile, sep = " ")
system(sQuery, wait = T)
sscSchemata <- read.delim(paste0(localDir, 'ssCreationSchemata.tsv'), sep = "\t")
sscSchemata <- as.character(sscSchemata[, 1])
# - Determine which ServerSideAccountCreation schema has the event_campaign field
for (i in 1:length(sscSchemata)) {
  qCommand <- paste0("mysql --defaults-file=/etc/mysql/conf.d/analytics-research-client.cnf -h analytics-slave.eqiad.wmnet -A -e ", 
                     "\"use log; describe ", sscSchemata[i], ";\" > ", localDir,
                     "campaignEdits_", paste0(sscSchemata[i], "_description.tsv")
  )
  system(command = qCommand, wait = TRUE)
}
schemaFields <- lapply(
  paste0(localDir, "campaignEdits_", sscSchemata, "_description.tsv"), 
  function(x) {
    fread(x, sep = "\t")
  }
)
schemaFields <- sapply(schemaFields, function(x) {
  'event_campaign' %in% x$Field
})
sscSchemata <- sscSchemata[schemaFields]
# - Collect all event_userId, event_campaign data:
# - toReport
print(paste0("REPORT: Collect all event_userId, event_campaign data", 
             Sys.time(), "."))
for (i in 1:length(sscSchemata)) {
  qCommand <- paste0("mysql --defaults-file=/etc/mysql/conf.d/analytics-research-client.cnf -h analytics-slave.eqiad.wmnet -A -e ", 
                     "\"select event_userId, event_campaign from log.", sscSchemata[i], " where ((webHost = 'de.wikipedia.org') and ", 
                     event_campaign_tags_Condition, ");\" > ", localDir,
                     "campaignEdits_", sscSchemata[i], ".tsv"
  )
  system(command = qCommand, wait = TRUE)
}
# - load schemata
sscSchemata <- paste0(localDir, "campaignEdits_", sscSchemata, ".tsv")
c <- 0
userIds <- list()
for (i in 1:length(sscSchemata)) {
  if (file.info(sscSchemata[i])$size > 0 ) {
    c <- c + 1
    userIds[[c]] <- tryCatch({
      fread(sscSchemata[i], sep = "\t")
    }, 
    error = function(condition) {
      NULL
    })
  }
}
userIds <- rbindlist(userIds)
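
The matching and anonymization steps described in the Overview can then be applied to the collected data. A rough illustration with toy inputs and illustrative column names (the production script's exact fields may differ):

```r
# Toy stand-ins for the collected data:
# - userIds: event_userId + event_campaign from ServerSideAccountCreation
# - edits10: user ID + CET date of the 10th edit from mediawiki_history
userIds <- data.frame(event_userId = c(101L, 102L),
                      event_campaign = c("WMDE_spring", "WMDE_autumn"),
                      stringsAsFactors = FALSE)
edits10 <- data.frame(event_user_id = c(101L, 103L),
                      edit10_date = c("2018-03-01", "2018-04-15"),
                      stringsAsFactors = FALSE)
# - left-join campaign registrations onto the 10th-edit milestones;
#   users without a campaign registration get NA for event_campaign
matched <- merge(edits10, userIds,
                 by.x = "event_user_id", by.y = "event_userId",
                 all.x = TRUE)
# - anonymize: replace raw user IDs with u_1, u_2, ... labels
matched$event_user_id <- paste0("u_", seq_len(nrow(matched)))
```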

Public Datasets

The following public datasets are produced by the dashboard’s update engine:

  • prod_edits10_Update.csv – this dataset encompasses
    • (1) anonymized user IDs (e.g. u_1, u_2, etc.),
    • (2) CET date when the user made their 10th edit,
    • (3) the date of the user registration,
    • (4) campaign code and campaign banner tag if the user registered via any of the WMDE’s banner campaigns,
    • (5) time (in days) since user registration and until the 10th edit, and
    • (6) data on Training Modules (started, completed) if the user took any Training Modules;
  • prod_edits50_Update.csv – this dataset has the same structure as the previous one, except that the criterion is the 50th edit;
  • prod_monthlyEdits_Reference.csv – monthly aggregated edit counts for all campaigns, by campaign banner tag, since January 2017;
  • prod_editClassAggregate.csv – counts of users who made
    • (1) 1 edit,
    • (2) 2 – 4 edits,
    • (3) 5 – 9 edits,
    • (4) 10 – 49 edits, and
    • (5) 50 and more edits, by campaign and campaign banner tag, since January 2017;
  • prod_userRegistrationDatesDaily_dewiki.csv – daily (year-month-day, as separate fields) counts of user registrations, per campaign and campaign banner tag.
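
The five edit classes behind prod_editClassAggregate.csv can be reproduced with base R's cut(); a minimal sketch with toy per-user edit counts:

```r
# Bin per-user edit counts into the five classes used by
# prod_editClassAggregate.csv: 1, 2-4, 5-9, 10-49, and 50+.
editCounts <- c(1, 3, 7, 12, 80, 2, 50)
editClass <- cut(editCounts,
                 breaks = c(0, 1, 4, 9, 49, Inf),
                 labels = c("1", "2-4", "5-9", "10-49", "50+"))
# - counts per class (as aggregated in the public dataset)
table(editClass)
```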

The Dashboard (CloudVPS Component)

The Dashboard (CloudVPS Component) itself is thoroughly documented on Meta and in the Documentation section of the dashboard itself. The technology is RStudio Shiny. The visualizations are produced with the R packages {dygraphs} and {ggplot2}. The dashboard's CloudVPS update component runs on an hourly schedule, checking for changes in the public data directory, and updates the dashboard as soon as any changes are detected. The dashboard has a SQL back-end to access the WMDE Campaign Archive, stored across several tables of the u16664__wdcm_p database. The database itself is fed manually, following the archiving procedures for each finished WMDE Banner Campaign separately.
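
The hourly change check can be sketched as comparing checksums of the downloaded files against those recorded at the previous run. This illustrates one possible mechanism, not necessarily the one the update component actually uses:

```r
library(tools)

# Recompute MD5 sums of the local dataset copies and return the paths
# of any files that changed since the previous run; a non-empty result
# would trigger a dashboard data reload.
checkForUpdates <- function(files, previousSums) {
  currentSums <- md5sum(files)
  changed <- currentSums != previousSums[names(currentSums)]
  changed[is.na(changed)] <- TRUE  # new or unreadable files count as changed
  names(currentSums)[changed]
}
```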