Analytics/Systems/DataHub
[[File:Datahub screenshot.png|thumb|468x468px|DataHub User Interface]]


== Overview ==
We run an instance of DataHub which acts as a centralized data catalog, intended to facilitate the following:


* <u>Discovery</u> by potential users of the various data stores operated by WMF.
* <u>Documentation</u> of the data structures, formats, access rights, and other associated details.
* <u>Governance</u> of these data stores, including details of retention, sanitization, and the recording of changes over time.
Currently our DataHub should be considered at an [[:en:Minimum_viable_product|MVP]] stage, so functionality and content may change rapidly as the service develops.
== Accessing DataHub ==
[[File:Location of shell account.png|thumb|Ascertaining your shell login name]]
The URL for the web interface for DataHub is: '''https://datahub.wikimedia.org'''
This requires a [[Help:Create a Wikimedia developer account|Wikimedia developer account]] and access is currently limited to members of the <code>wmf</code> or <code>nda</code> LDAP groups.
When logging in, enter your '''shell login name''' and your '''wikitech password''' in order to gain access.
If you are in any doubt as to your shell login name, you can go to [[Special:Preferences#mw-prefsection-personal|Special:Preferences]], as shown in the accompanying screenshot.
n.b. We intend to use your ''Developer Account username'' in due course, but currently there is [https://github.com/datahub-project/datahub/issues/4915 a bug] preventing this from working.
The DataHub Metadata Service is: '''https://datahub-gms.discovery.wmnet:30443'''
This service is not public-facing and is only available from our private networks. We have not yet enabled authentication on this interface, although it is planned.
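As a quick way to confirm that the metadata service is reachable from a host on our private networks, something like the following should work. This is a sketch rather than a documented procedure: the <code>/config</code> endpoint is what the DataHub CLI queries to discover the server version, but verify the output against the version we are actually running.
<pre>
# From a host on the private network (e.g. a stats server), query the GMS
# config endpoint; it should return a JSON document including the server version.
curl -s https://datahub-gms.discovery.wmnet:30443/config
</pre>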
== Service Overview ==
=== DataHub Components ===
The DataHub instance is composed of several components, each built from the same codebase:
* a '''metadata server''' (or GMS)
* a '''frontend''' web application
* an '''mce consumer''' (metadata change event)
* an '''mae consumer''' (metadata audit event)
All of these components are stateless and currently run on the [[Kubernetes/Clusters#wikikube (aka eqiad/codfw)|Wikikube]] Kubernetes clusters.
Their containers are built using the [[Deployment pipeline]], and the configuration for this is in the [[gerrit:plugins/gitiles/analytics/datahub/+/refs/heads/wmf/.pipeline/|wmf branch]] of our fork of the datahub repository.
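For reference, the sketch below shows one way to see these components running on Wikikube from a deployment server. It assumes the service is deployed under a namespace named <code>datahub</code>; adjust the name to match the actual deployment.
<pre>
# On a deployment server, select the service's namespace and cluster
# (kube_env sets up kubectl credentials for the given namespace).
kube_env datahub eqiad

# List the pods; there should be pods for the GMS, the frontend,
# and the two consumers (mce/mae).
kubectl get pods
</pre>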
=== Backend Data Tiers ===
The stateful components of the system are:
* a MariaDB database hosted on the [[Analytics/Systems/Cluster/Mysql Meta|analytics-meta]] database instance
* an Opensearch cluster running on three VMs named <code>datahubsearch100[1-3].eqiad.wmnet</code>
* an instance of Karapace, which acts as a schema registry, running on <code>karapace1001.eqiad.wmnet</code>
* several Kafka topics hosted on the [[Kafka#jumbo (eqiad)|kafka-jumbo]] cluster
Our Opensearch cluster fulfils two roles:
* a search index
* a graph database
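To check on the health of this Opensearch cluster, something like the following minimal sketch should work. It assumes the standard Opensearch HTTP port (9200) and that the API is served over TLS on the private network, so adjust the scheme and port to match the actual configuration.
<pre>
# Query the Opensearch cluster health API on one of the nodes.
# A healthy cluster reports "status": "green".
curl -sk https://datahubsearch1001.eqiad.wmnet:9200/_cluster/health?pretty
</pre>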
The design document for the DataHub service is [https://docs.google.com/document/d/1EDXwh4WPDp-nYzV-Rvy01x8s1fb9drxLnzihRrgNAeM/edit here] (restricted to WMF staff).
We had previously carried out a [[Data Catalog Application Evaluation]] and subsequently the decision was taken to use DataHub and to implement an MVP deployment.
== Metadata Sources ==
We have several key sources of metadata:
* [[Analytics/Systems/Cluster/Hive|Hive]]
* [[Kafka]]
* [[Analytics/Systems/Druid|Druid]]
* [[Analytics/Systems/Superset|Superset]]
== Ingestion ==
Currently, ingestion can be performed from any machine on our private networks, including the [[Analytics/Systems/Clients|stats servers]].
=== Automated Ingestion ===
We are moving to automated and regularly scheduled metadata ingestion using [[Analytics/Systems/Airflow|Airflow]]. Please check back soon for updated documentation on this topic.
=== Manual Ingestion Example ===
The following procedure should help you get started with manual ingestion.
# Select a [[Analytics/Systems/Clients|stats server]] for your use.
# Activate a stacked [[Analytics/Systems/Anaconda|anaconda]] environment.
# Configure the [[HTTP proxy]] servers in your shell.
# Install the necessary Python modules.
<pre>
ssh stat1005.eqiad.wmnet
source conda-activate-stacked my-cool-env
export http_proxy=http://webproxy:8080
export https_proxy=http://webproxy:8080
export HTTP_PROXY=http://webproxy:8080
export HTTPS_PROXY=http://webproxy:8080
export no_proxy=127.0.0.1,::1,localhost,.wmnet,.wikimedia.org,.wikipedia.org
export NO_PROXY=127.0.0.1,::1,localhost,.wmnet,.wikimedia.org,.wikipedia.org
# It's very important to install the same version of the CLI as the server
# that is running, otherwise ingestion will not work.
pip install 'acryl-datahub[datahub]'~=0.8.32.0
datahub version
# 'datahub init' prompts for the server URL; enter the GMS address shown below.
datahub init
server: https://datahub-gms.discovery.wmnet:30443
</pre>
Until this bug is fixed, we will also need to use:
<pre>
export REQUESTS_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt
</pre>
Then create a recipe file, installing more plugins if required, and run <code>datahub ingest -c recipe.yaml</code>.

Some examples of recipes, including Hive, Kafka, and Druid, are available on this ticket.
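As an illustration of what such a recipe can look like, here is a minimal, hypothetical sketch that ingests Kafka topic metadata from the kafka-jumbo cluster and pushes it to our GMS over REST. The <code>source</code> and <code>sink</code> sections follow the standard DataHub recipe layout, but the broker address and other connection details shown here are placeholders, so adapt them before use.
<pre>
# Write a minimal recipe file (hypothetical example; the Kafka source
# requires the kafka plugin, e.g. pip install 'acryl-datahub[kafka]',
# and the broker address below is a placeholder to adjust).
cat > recipe.yaml <<'EOF'
source:
  type: kafka
  config:
    connection:
      bootstrap: kafka-jumbo1001.eqiad.wmnet:9092

sink:
  type: datahub-rest
  config:
    server: https://datahub-gms.discovery.wmnet:30443
EOF

# Run the ingestion using the recipe.
datahub ingest -c recipe.yaml
</pre>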