Revision as of 13:49, 29 August 2019 by imported>Bearloga (→‎DataGrip)

Command line access

You can access Hadoop and Hive on the stat machines stat1007 and stat1004. For information on getting access, see Analytics/Data access and production shell access.

HTTP access

Access to HTTP GUIs in the Analytics Cluster is currently very restricted. You must have a shell account on the analytics nodes.

A Hue (Hadoop User Experience) GUI is available. Log in using your UNIX shell username and Wikimedia developer account (Wikitech) password. If you already have cluster access but can't log into Hue, your account probably needs to be synced manually. Ask an Analytics Opsen (ottomata or elukey) or file a Phabricator task for help.

Admin Instructions to sync a Hue account

When a new Hadoop user is added, an admin should give them a Hue account. Once T127850 is resolved, this process should be automatic.

  1. Log into Hue.
  2. In the upper right, click on your username and select Manage Users. (You can only do this if you are a Hue admin; another admin can make you one.)
  3. Click 'Add/Sync LDAP User'
  4. Fill in the form with their UNIX shell username (not their Wikimedia developer account username), deselect both 'Distinguished name' and 'Create home directory', and click 'Add/Sync user'


SSH tunnel

If you are in the wmf LDAP group (open to every WMF employee/contractor) and you care only about the Yarn Resource Manager UI, you can log in to it directly, without an SSH tunnel.

Otherwise, to send HTTP requests to an internal analytics server, use an SSH tunnel. For example, to access the Hadoop ResourceManager job browser, run:

 ssh -N -L 8088:analytics1001.eqiad.wmnet:8088

And then navigate to http://localhost:8088/cluster in your browser. The FairScheduler interface will be at http://localhost:8088/cluster/scheduler.
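
The -L argument in the command above reads as local port, then internal host, then port on that host. The following is only a sketch of how such a command is put together: tunnel_cmd is a hypothetical helper, and jump.example.org is a placeholder for whichever machine you have shell access to (it is not a real hostname from this page). The function prints the command rather than running it, so you can review it first.

```shell
# Print (without running) an ssh command that forwards a local port to an
# internal host through a machine you can already log in to.
# tunnel_cmd is a hypothetical helper name, not an existing tool.
tunnel_cmd() {
  jump_host="$1"          # machine you have shell access to (placeholder value below)
  target_host="$2"        # internal host serving the web UI
  target_port="$3"        # port on the internal host
  local_port="${4:-$3}"   # local port; defaults to the same number as target_port
  printf 'ssh -N %s -L %s:%s:%s\n' \
    "$jump_host" "$local_port" "$target_host" "$target_port"
}

# Example: same forwarding as above, with a placeholder jump host.
tunnel_cmd jump.example.org analytics1001.eqiad.wmnet 8088
```

Passing a fourth argument (for example 9000) picks a different local port, which helps if 8088 is already in use on your machine; you would then browse to http://localhost:9000/cluster instead.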

For more information see Proxy access to cluster.


DataGrip

Figure 1: Hive driver settings in DataGrip
Figure 2: Hive connection details in DataGrip

JetBrains provides a license to the Foundation which can be used to install the professional versions of the IntelliJ IDEA and PyCharm IDEs. For more information, refer to this page on Office wiki. (Contractors need to get in touch with a full-time staffer for those details.) One of the IDEs available through this license is DataGrip, which can connect to MariaDB, Hive, and many other databases. Once you have downloaded and installed DataGrip, follow these instructions to connect to Hive:

  1. Download a copy of the JDBC driver from the cluster: scp stat1007.eqiad.wmnet:/usr/lib/hive/lib/hive-jdbc-1.1.0-cdh5.16.1-standalone.jar ~/Downloads/
  2. Open an SSH tunnel to an-coord1001 by running ssh -N stat1007.eqiad.wmnet -L 10000:an-coord1001.eqiad.wmnet:10000 in Terminal
  3. In DataGrip, choose File > New > Data Source > Apache Hive to open the "Data Sources and Drivers" window.
  4. In the Drivers section of this window, go to Apache Hive and click + under "Driver files", then "Custom JARs…", and select the JAR file you just downloaded from stat1007. Note: if you have previously downloaded a JetBrains-provided Hive driver, you will need to remove it (-) to ensure DataGrip uses the correct driver. Refer to Fig. 1 for confirmation.
  5. In the Project Data Sources section of this window, where a new Hive data source has been created called @localhost, specify a name – e.g. "Hive (via SSH tunnel)" – and the following in the General tab:
    • Host and Port: leave as is (localhost and 10000)
    • User: your shell username
    • Schema: default
    • URL: will be automatically filled and should look like jdbc:hive2://localhost:10000/default
  6. Click TEST CONNECTION to check that it works. If it asks you for a password, leave that field empty and click OK to proceed. Refer to Fig. 2 for confirmation.
  7. In the same window, go to the Schemas section. DataGrip will then connect to Hive and fetch the list of databases (schemas). Once the list has been populated you can select the ones you want to see in DataGrip (e.g. wmf, event, event_sanitized) and click OK to be done.
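
The URL that DataGrip fills in at step 5 follows the pattern jdbc:hive2://host:port/schema. As a rough sketch, the hypothetical helper below (hive_jdbc_url is not part of Hive or DataGrip) rebuilds that string from its parts, which can be handy for checking the value by eye if you forwarded a different local port.

```shell
# Compose a HiveServer2 JDBC URL from host, port, and schema.
# hive_jdbc_url is a hypothetical helper; the defaults mirror the
# values listed in step 5 (localhost, 10000, default).
hive_jdbc_url() {
  host="${1:-localhost}"
  port="${2:-10000}"
  schema="${3:-default}"
  printf 'jdbc:hive2://%s:%s/%s\n' "$host" "$port" "$schema"
}

# With no arguments this prints the URL expected in the General tab.
hive_jdbc_url
```

For instance, hive_jdbc_url localhost 10001 wmf would give the URL to use if the tunnel listened on local port 10001 and you wanted the wmf schema.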

You can now use DataGrip as a local alternative to Hue. You will need to open an SSH tunnel to an-coord1001 (like you did in step 2) every time you want to connect with DataGrip. Yes, there is an SSH/SSL tab in the connection details window but it doesn't support our Bastion setup. Adding the following to your ~/.bash_profile will make that part easy because then you can just type data-grip to open the tunnel:

alias data-grip="ssh -N stat1007.eqiad.wmnet -L 10000:an-coord1001.eqiad.wmnet:10000"
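
If you sometimes tunnel through a different stat machine, a shell function can stand in for the alias. This is only a sketch under assumptions: data_grip_tunnel and the DRY_RUN variable are hypothetical names chosen for illustration, and the default host is the one used in step 2.

```shell
# Open the Hive tunnel through a stat machine of your choice.
# data_grip_tunnel and DRY_RUN are hypothetical names, not existing tools.
data_grip_tunnel() {
  stat_host="${1:-stat1007.eqiad.wmnet}"   # default matches step 2
  # Reuse the positional parameters to hold the full ssh command.
  set -- ssh -N "$stat_host" -L 10000:an-coord1001.eqiad.wmnet:10000
  if [ -n "${DRY_RUN:-}" ]; then
    echo "$*"    # only show the command that would run
  else
    "$@"         # actually open the tunnel (blocks until interrupted)
  fi
}

# Show the command without connecting.
DRY_RUN=1 data_grip_tunnel
```

Running data_grip_tunnel stat1004.eqiad.wmnet (without DRY_RUN) would open the same tunnel via stat1004 instead.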