Wikidata Concepts Monitor


The Wikidata Concepts Monitor (WDCM) is a system that analyzes and visualizes Wikidata usage across the Wikimedia projects (see: WDCM Dashboards; source code on Gerrit and Diffusion).

WDCM is developed and maintained (mainly) by Goran S. Milovanovic, Data Scientist, WMDE; any suggestions, contributions, and questions are welcome and should be directed to him.

Introduction

This page presents the technical documentation and the important aspects of the system design of the Wikidata Concepts Monitor (WDCM). The WDCM data product is a set of Shiny dashboards, fully developed in R, that provide analytical insight into Wikidata usage across its client projects. In deployment, WDCM resides on the open-source version of the RStudio Shiny Server. The WDCM dashboards are hosted on the wikidataconcepts Labs instance, relying on a MariaDB back-end that supports their immediate functionality; however, the WDCM system as a whole also depends on ETL procedures that are run from production (stat1005) and supported by Apache Sqoop and Hadoop, as well as on a set of SPARQL queries that extract pre-defined sets of Wikidata items for further analysis. This document explains the modular design of WDCM and documents the critical procedures; the public code repositories containing these procedures are found on Gerrit and Diffusion.

Note: the WDCM Dashboards user manuals are found in the Description section on the respective dashboards (WDCM Overview, WDCM Usage, WDCM Semantics).

Approach

While Wikidata itself is a semantic ontology with pre-defined and evolving normative rules of description and inference, Wikidata usage is essentially a social, behavioral phenomenon, suitable for study by means of machine learning in the field of distributional semantics: the analysis and modeling of statistical patterns of occurrence and co-occurrence of Wikidata item and property usage across the client projects (e.g. enwiki, frwiki, ruwiki, etc.). Beyond providing elementary descriptive statistics of Wikidata usage, WDCM thus employs various statistical methodologies (e.g. topic modeling, clustering, dimensionality reduction) to describe and provide insights from the observable Wikidata usage statistics.

Wikidata Usage Patterns. The “golden line” that connects the reasoning behind all WDCM functions can be described non-technically in the following way. Imagine observing the number of times each item in a set of N particular Wikidata items was used across some project (enwiki, for example). Imagine having the same data for other projects as well: for example, if 200 projects are under analysis, then we have 200 counts for the N items in the set, and the data can be described by an N x 200 matrix (items x projects). Each column of counts, representing the frequency of occurrence of all Wikidata entities under consideration across one of the 200 projects under discussion - a vector, obviously - represents a particular Wikidata usage pattern. By statistically inspecting and modeling the usage pattern matrix - a matrix that encompasses all such usage patterns across the projects, or the derived covariance/correlation matrix - many insights into the similarities between Wikimedia projects (or, more precisely, the similarities between their usage patterns) can be found.
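
To make the matrix idea concrete, here is a minimal, purely illustrative R sketch with toy counts (the real usage-pattern matrix spans tens of millions of items and hundreds of projects):

  # Toy usage-pattern matrix: N = 5 hypothetical items, 3 client projects.
  set.seed(42)
  items <- paste0("Q", 1:5)
  projects <- c("enwiki", "frwiki", "ruwiki")
  usage <- matrix(rpois(length(items) * length(projects), lambda = 10),
                  nrow = length(items),
                  dimnames = list(items, projects))
  # Each column is one project's Wikidata usage pattern (a vector of counts).
  # The derived correlation matrix quantifies project similarity:
  cor(usage)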

In essence, the technology and mathematics behind WDCM rely on the same set of practical tools and ideas that support the development of semantic search engines and recommendation systems, only applied to a specific dataset that encompasses the usage patterns of tens of millions of Wikidata entities across its client projects.

Motivation

The data obtained in this way, when analyzed properly, allow for inferences about how different communities use Wikidata to build their specific projects, or about the ways in which semantically related collections of entities are used across some set of projects. By knowing this, it becomes possible to develop suggestions on what cooperation among the communities would be fruitful and mutually beneficial in terms of enhancing Wikidata usage on the respective projects. On the other hand, communities that are focused on particular semantic topics, categories (sets), sub-ontologies, etc. can advance by recognizing the similarity in their approaches and efforts. Thus, a whole new level of collaborative development around Wikipedia could be achieved. This goal motivates the development of the WDCM system, beyond the obvious possibility of assessing data of fundamental scientific importance - for cognitive and data scientists, sociologists of knowledge, AI engineers, ontologists, pure enthusiasts, and many others.

Definitions

Wikidata usage

Wikidata usage analytics

By Wikidata usage analytics we mean all important and interesting statistics, summaries of statistical models and tests, visualizations, and reports on how Wikidata is used across the Wikimedia projects. The end goal of WDCM is to deliver consistent, high-quality Wikidata usage analytics.

Wikidata usage (statistics)

Consider a set of sister projects (e.g. enwiki, dewiki, frwiki, zhwiki, ruwiki, etc.; from the viewpoint of Wikidata usage, we also call them client projects). Statistical count data that represent the frequency of usage of particular Wikidata entities over any given set of client projects are considered Wikidata usage (statistics) in the context of WDCM.

NOTE on Wikidata usage definition

The following discussion relies on an understanding of the Wikibase schema, especially the wbc_entity_usage table schema (a more thorough explanation of Wikidata item usage tracking in the wbc_entity_usage tables is provided on Phabricator). A methodological discussion of the development of Wikidata usage tracking in relation to this schema is also found on Phabricator.

A strict, working, operational definition of Wikidata usage data is still under development. The problem with its development is of a technical nature and related to the current logic of the wbc_entity_usage table schema. This table is found on the MariaDB replicas in the database of any project that has client-side Wikidata usage tracking enabled.

The “S”, “T”, “O”, and “X” usage aspects

The problematic field in the current wbc_entity_usage schema is eu_aspect. Under its current definition, this field allows selecting in a non-redundant way only the “S”, “O”, and “T” entity usage aspects; meaning: only “S”, “O”, and “T” occurrences of any given Wikidata entity on any given sister project that maintains client-side Wikidata usage tracking signal one and only one entity usage in the respective aspect on that project (i.e. these aspects are non-overlapping in their registration of Wikidata usage). However, while “S”, “O”, and “T” do not overlap with each other, they may overlap with the “X” usage aspect. Excluding the “X” aspect from the definition is again not possible: ignoring it would imply that the majority of relevant usage, e.g. usage in infoboxes, is not tracked (accessing statement data via Lua is typically tracked as “X”).
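
As an illustration only (not the WDCM production code), the following R sketch shows how the distribution of eu_aspect values could be inspected on a single client project's replica; the connection parameters are placeholders:

  # Hypothetical connection to a MariaDB replica of one client project.
  library(DBI)
  con <- dbConnect(RMySQL::MySQL(), dbname = "enwiki", host = "<replica-host>")
  # Collapse modified aspects (e.g. "L.de") to their base letter and count rows.
  aspects <- dbGetQuery(con, "
    SELECT SUBSTRING_INDEX(eu_aspect, '.', 1) AS aspect, COUNT(*) AS n_rows
    FROM wbc_entity_usage
    GROUP BY aspect;")
  dbDisconnect(con)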

The “L” aspects problem: tracking the fallback mechanism

The “L” aspects, usually modified by a specific language modifier (e.g. “L.de”, “L.en”, and similar), currently cannot be counted in a non-redundant way. This is a consequence of the way the wbc_entity_usage table is produced with respect to the possible triggering of the language fallback mechanism. To explain the language fallback mechanism briefly: let the language fallback chain for a particular language be “L.de-ch” → “L.de” → “L.en”. That implies the following: if the usage of an item label in Swiss German (“L.de-ch”) was attempted, and no label in Swiss German was found, an attempt to use the German label (“L.de”) would be made, and an attempt at the English label (“L.en”) would be made in the end if the previous attempt fails. However, if the language fallback mechanism is triggered on a particular entity usage occasion, all “L” aspects in that fallback chain will be registered in the wbc_entity_usage table as if they were used simultaneously. From the viewpoint of Wikidata usage, it would be interesting to track (a) the attempted – i.e. the user-intended – “L” aspect, or at least (b) the actually used “L” aspect for a given entity usage. However, the current design of the wbc_entity_usage table does not provide for an assessment of either of these possibilities.
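
A minimal R sketch of how such a fallback chain resolves, purely illustrative of the mechanism (this is not how MediaWiki implements it):

  # Resolve a label along the fallback chain "de-ch" -> "de" -> "en".
  resolve_label <- function(labels, chain = c("de-ch", "de", "en")) {
    for (lang in chain) {
      if (!is.null(labels[[lang]])) {
        return(c(language = lang, label = labels[[lang]]))
      }
    }
    NA  # no label found anywhere along the chain
  }
  # An item with no Swiss German label: the German label is actually used,
  # yet wbc_entity_usage would register the "L" aspects of the whole chain.
  resolve_label(list(de = "Douglas Adams", en = "Douglas Adams"))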

Finally, there are other uncertainties related to the current design of the wbc_entity_usage table. For example, imagine an editor action that results in the presence of a particular item, with a sitelink, instantiating a label in a particular language at the same time. How many item usage counts do we have: one, two, or more (one “S” aspect count for the sitelink, and at least one more for a specific “L” aspect)?

In conclusion, if Wikidata usage statistics are to encompass all different ways in which an item usage could be defined, by mapping onto all possible editor actions in instantiating a particular item on a particular page, the design of the wbc_entity_usage table would have to undergo a thorough revision, or a new Wikidata usage tracking mechanism would have to be developed from scratch. The wbc_entity_usage table was never designed for analytical purposes in the first place; however, it is the only source of Wikidata usage statistics that we can currently rely on.

A proposal for an initial solution:

- [NOTE] This is the current Wikidata usage definition in the context of WDCM.

From the existing wbc_entity_usage table schema, it seems possible to rely on the following definition. For the initial version of the WDCM system, we use a simplified definition of Wikidata usage that excludes the multiple item usage per-page cases; in effect (a query sketch follows the lists below):

  • count the number of pages on which a particular Wikidata item occurs in a given project;
  • take that count as the per-project Wikidata usage statistic;
  • ignore usage aspects completely until proper tracking of usage per page is enabled in the future.

By "proper tracking of usage per-page" the following is meant:

  • a methodology that counts exactly how many usage cases of a particular item there are on a particular page in a particular project.
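
Under this simplified definition, the per-project statistic boils down to a distinct-page count per item. A hedged sketch of the corresponding Hive query follows (eu_entity_id and eu_page_id are columns of the wbc_entity_usage schema; the wiki_db column identifying the source project in the sqooped table is an assumption):

  # Illustrative Hive query string, to be run against the sqooped usage table.
  query <- "
    SELECT eu_entity_id, wiki_db, COUNT(DISTINCT eu_page_id) AS usage_count
    FROM goransm.wdcm_clients_wb_entity_usage
    GROUP BY eu_entity_id, wiki_db;"
  # e.g. system(paste0('hive -e \"', query, '\"'), intern = TRUE)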

WDCM Taxonomy

The WDCM Taxonomy is a curated, human choice of specific categories and items from the Wikidata ontology that are submitted to WDCM for analytics.

Currently, only one WDCM Taxonomy is specified (Lydia Pintscher, 05/03/2017, Berlin).

The fact that the WDCM relies on a specific choice of taxonomy implies that not all Wikidata items are necessarily tracked and analyzed by the system.

Users of WDCM can specify any imaginable taxonomy that constitutes a proper subset of Wikidata; no component of the WDCM system depends upon the characteristics of any particular choice of taxonomy.

Once defined, the WDCM taxonomy is translated into a set of (typically very simple) SPARQL queries that are used to collect the respective item IDs; only the collected items will be tracked and analyzed by the system.
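
For example, a taxonomy category such as “human” might translate into a query like the following R sketch, assuming the WikidataQueryServiceR package (the actual queries in WDCM_Collect_Items.R may differ):

  library(WikidataQueryServiceR)
  # Hypothetical taxonomy entry: all items that are instances (P31) of human (Q5).
  # LIMIT keeps the illustration small; production collection would page through.
  items <- query_wikidata("SELECT ?item WHERE { ?item wdt:P31 wd:Q5 . } LIMIT 100")
  head(items$item)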

WDCM Schema

The WDCM schema is simply the RDBMS schema that provides immediate support to the WDCM Shiny dashboards (i.e. the dashboards fetch data directly from this database).

The WDCM database is currently hosted on labsdb and accessed from the wikidataconcepts Cloud VPS instance (where the WDCM Dashboards are hosted).

The current database is u16664__wdcm_p.

The tables in u16664__wdcm_p represent Wikidata usage statistics that are pre-processed in various ways, suitable for data presentation and analytics on the WDCM Shiny Dashboards.

The tables in u16664__wdcm_p are produced by the WDCM_Process.R script, run from the wikidataconcepts Cloud VPS instance; the details of the database design are found in section WDCM Database Design.

Wikidata Concepts Monitor Modules (WDCM system)

The WDCM system encompasses the following modules:

WDCM Search

The WDCM Search module:

  1. contacts the Wikidata Query Service to collect a list of item IDs from the currently relevant WDCM Taxonomy, by running the WDCM_Collect_Items.R script;
  2. utilizes Sqoop to transfer the data from the wbc_entity_usage tables on the m2 MariaDB replicas to the Hadoop/Hive wdcm_clients_wb_entity_usage table in the goransm database, by running the WDCM_Sqoop_Clients.R script (a command sketch follows this list); and then
  3. accesses the wdcm_clients_wb_entity_usage Hive table in the goransm database from production to search for and fetch the Wikidata usage statistics of these items across the client projects, by running the WDCM_Search_Clients.R script.
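
A hedged sketch of the Sqoop transfer in step 2, as it might be issued from an R script; all hosts and credentials are placeholders, and the per-wiki looping of the real WDCM_Sqoop_Clients.R is omitted:

  # Import one client project's wbc_entity_usage table into Hive via Sqoop.
  wiki <- "enwiki"
  sqoop_cmd <- paste(
    "sqoop import",
    paste0("--connect jdbc:mysql://<m2-replica-host>/", wiki),
    "--username <user> --password-file <hdfs-path>",
    "--table wbc_entity_usage",
    "--hive-import --hive-table goransm.wdcm_clients_wb_entity_usage",
    "-m 4")  # number of parallel map tasks
  system(sqoop_cmd)
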
WDCM Process

The Process module runs the pre-processing procedures in R/SQL in production (currently stat1005: WDCM_Pre-Process.R) and on Labs (WDCM_Process.R), across the data fetched by the WDCM Search module, to create or update the WDCM Database.

The Search and Process modules run consecutively, in that order, for each WDCM Database creation/update call. The order in which the WDCM R scripts run is the following: WDCM_Collect_Items.R → WDCM_Sqoop_Clients.R → WDCM_Search_Clients.R → WDCM_Process.R. More about the workflow is explained below.

WDCM Dashboards

The Dashboards module is a set of RStudio Shiny dashboards that serve the Wikidata usage analytics to its end-users.

This set of Shiny dashboards relies on the WDCM Database (u16664__wdcm_p on tools.labsdb) to serve Wikidata usage analytics; the database is populated directly by the WDCM_Process.R script.
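
In outline, a dashboard's server-side code might fetch its data like the following sketch (the table name is a placeholder; only the database name and host are as documented here):

  library(DBI)
  con <- dbConnect(RMySQL::MySQL(), dbname = "u16664__wdcm_p", host = "tools.labsdb")
  dat <- dbGetQuery(con, "SELECT * FROM <some_wdcm_table> LIMIT 1000;")
  dbDisconnect(con)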

Currently, the WDCM System runs three dashboards:

  • WDCM Overview, providing an elementary overview - the "big picture" - of Wikidata usage;
  • WDCM Usage, providing detailed usage statistics; and
  • WDCM Semantics, providing insights from the topic models derived from the usage data.

WDCM System Operation Workflow

The following diagram depicts the order of operation of the various WDCM modules and their component R scripts. Additional details about the WDCM Process module and the WDCM Database design are provided in the following sections.

[Figure: Wikidata Concepts Monitor (WDCM) System Operation Workflow.]

WDCM Usage Statistics/Statistical Modeling: WDCM_Pre-Process.R Script

This section provides a more detailed description of the operation of the WDCM_Pre-Process.R script (currently run in production from stat1005) and the WDCM_Process.R script (run from the wikidataconcepts Labs instance); they are part of the WDCM Process module, which produces all of the WDCM Wikidata usage statistics, runs machine learning algorithms to provide the WDCM statistical models, and prepares some of the visualization datasets for the WDCM Dashboards.

WDCM_Pre-Process.R

WDCM_Pre-Process.R works with the wdcm_maintable Hive table on HDFS (database: goransm). Its results are stored locally as .tsv files on production (currently stat1005.eqiad.wmnet) and migrate to the wikidataconcepts.wmflabs.org Cloud VPS instance, where they are further processed by WDCM_Process.R. All heavy computations are performed by this script; WDCM_Process.R provides additional data wrangling and merely populates the WDCM database on labsdb.

Given that the Wikidata item usage distributions are highly skewed, i.e. that the bulk of item usage in a given semantic category is due to a relatively small number of items, we take only 5,000 Wikidata items per category to produce the term-frequency matrices that undergo topic modeling with Latent Dirichlet Allocation (LDA). The estimation of topic models is performed by WDCM_Pre-Process.R and relies on a rapid MAP estimation procedure provided by the {maptpx} R package. Note: this will most probably change in the near future, because (1) the {maptpx} package will not be receiving any further updates, and (2) it would probably be better to rely on distributed processing in our Analytics cluster and utilize an online LDA estimation procedure provided by Apache Spark's MLlib.
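
In outline, the {maptpx} estimation step looks like the following sketch, run on a toy term-frequency matrix rather than the real WDCM category matrices:

  library(maptpx)
  set.seed(1)
  # Toy term-frequency matrix: 20 client projects x 50 items.
  tf <- matrix(rpois(20 * 50, lambda = 3), nrow = 20)
  lda_fit <- topics(tf, K = 5)  # rapid MAP estimation of a 5-topic LDA model
  str(lda_fit$omega)  # project-by-topic proportions
  str(lda_fit$theta)  # item-by-topic loadings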

This script also performs t-SNE 2D dimensionality reduction of the LDA topic models (i.e. reduction to 2D from the high-dimensional topic space spanned by the client projects). The dimensionality reduction step relies on the {Rtsne} package, and the results are used directly on the WDCM Semantics dashboard.
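
A sketch of that reduction step with {Rtsne}, continuing from the toy LDA fit above (the perplexity value is arbitrary and would be tuned for real data):

  library(Rtsne)
  proj2d <- Rtsne(lda_fit$omega, dims = 2, perplexity = 5, check_duplicates = FALSE)
  plot(proj2d$Y, xlab = "t-SNE 1", ylab = "t-SNE 2")  # 2D map of project topics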

Finally, the script prepares the data structures for the visualization of higher-dimensional client project topic models with the {visNetwork} package. These datasets are produced but not yet used on any of the WDCM dashboards; the plan is to integrate them into the WDCM Semantics dashboard.
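
A minimal {visNetwork} sketch of the intended kind of visualization; the node and edge data here are placeholders, not the actual WDCM data structures:

  library(visNetwork)
  nodes <- data.frame(id = 1:3, label = c("enwiki", "frwiki", "ruwiki"))
  edges <- data.frame(from = c(1, 1), to = c(2, 3))
  visNetwork(nodes, edges)  # renders an interactive network in the browser/Shiny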

WDCM Database Design: WDCM_Process.R Script

[Expand this section]

WDCM Puppetization

The puppetization of WDCM is ongoing: