== Command line access ==
You can access Hadoop and Hive on most of the [[Analytics/Systems/Clients|Analytics Clients]]. For information on getting access, see [[Analytics/Data access]].
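
For example, once your access is set up, a typical session looks like this (a minimal sketch; stat1007 and the HDFS path are illustrative, and on a kerberized cluster you will also need a Kerberos ticket, see below):

<syntaxhighlight lang="bash">
# Log into one of the Analytics Clients
ssh stat1007.eqiad.wmnet

# List a directory in HDFS, then start an interactive Hive session
hdfs dfs -ls /wmf/data
hive
</syntaxhighlight>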


== HTTP access ==
Access to HTTP GUIs in the Analytics Cluster is currently very restricted. You must have a shell account on the analytics nodes.


== SSH tunnel ==
{{See|For the main article on creating a tunnel, see [[Proxy access to cluster]]}}
If you are in the wmf LDAP group (open to every WMF employee/contractor) and you only need the Yarn ResourceManager UI, you can log in directly at [https://yarn.wikimedia.org/ yarn.wikimedia.org].


Otherwise, if you need to reach specific UIs or services within the cluster (i.e. services with internal IPs), see [[Proxy access to cluster]].
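
For a single service, a plain SSH port forward is often enough (a sketch only; the internal hostname and port come from an older version of this page and may have changed, so treat [[Proxy access to cluster]] as authoritative):

<syntaxhighlight lang="bash">
# Forward local port 8088 to the Yarn ResourceManager through a client host you can ssh to,
# then browse to http://localhost:8088/cluster
ssh -N stat1007.eqiad.wmnet -L 8088:analytics1001.eqiad.wmnet:8088
</syntaxhighlight>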


== Kerberos ==
Unfortunately, DataGrip does not seem to work with Hadoop and Kerberos; see [https://phabricator.wikimedia.org/T241170 T241170].
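
For context, command-line tools on the cluster hosts authenticate with a Kerberos ticket (a minimal sketch; see the Analytics Kerberos documentation for the authoritative steps):

<syntaxhighlight lang="bash">
# Obtain a Kerberos ticket (prompts for your Kerberos password)
kinit

# Verify the ticket; Hive and HDFS commands will now authenticate with it
klist
</syntaxhighlight>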


== DataGrip ==
[[File:Hive driver in DataGrip.png|thumb|'''Figure 1''': Hive driver settings in DataGrip]]
[[File:Hive connection details in DataGrip.png|thumb|'''Figure 2''': Hive connection details in DataGrip]]
[[:en:JetBrains|JetBrains]] provides the Foundation with a license that can be used to install the professional versions of the [[:en:IntelliJ IDEA|IntelliJ IDEA]] and [[:en:PyCharm|PyCharm]] IDEs. For more information, refer to [https://office.wikimedia.org/wiki/JetBrains this page] on Office wiki. (Contractors need to get in touch with a full-time staffer for those details.) One of the IDEs available through this license is [https://www.jetbrains.com/datagrip/ DataGrip], which can connect to MariaDB, Hive, and many other databases. Once you have downloaded and installed DataGrip, follow these instructions to connect to Hive:


# Download a copy of the JDBC driver from the cluster: <code>scp stat1007.eqiad.wmnet:/usr/lib/hive/lib/hive-jdbc-1.1.0-cdh5.16.1-standalone.jar ~/Downloads/</code>
# Open an SSH tunnel in a terminal: <code>ssh -N stat1007.eqiad.wmnet -L 10000:analytics-hive.eqiad.wmnet:10000</code>
# In DataGrip, go to File > New > Data Source > Apache Hive to open the "Data Sources and Drivers" window.
# In the ''Drivers'' section of this window, go to Apache Hive and click '''+''' under "Driver files", then "Custom JARs…", and select the JAR file you just downloaded from stat1007. '''Note''': if you have previously downloaded a JetBrains-provided Hive driver, you will need to remove it ('''-''') to ensure DataGrip uses the correct driver. Refer to Fig. 1 for confirmation.
# In the ''Project Data Sources'' section of this window, where a new Hive data source has been created called <code>@localhost</code>, specify a name – e.g. "Hive (via SSH tunnel)" – and the following in the General tab:
#* '''Host''' and '''Port''': leave as is (<code>localhost</code> and 10000)
#* '''User''': your shell username
#* '''Schema''': <code>default</code>
#* '''URL''': will be automatically filled and should look like <code>jdbc:hive2://localhost:10000/default</code>
# Click TEST CONNECTION to check that it works. If it asks you for a password, leave that field empty and click OK to proceed. Refer to Fig. 2 for confirmation.
# In the same window, go to the Schemas section. DataGrip will then connect to Hive and fetch the list of databases (schemas). Once the list has been populated you can select the ones you want to see in DataGrip (e.g. <code>wmf</code>, <code>event</code>, <code>event_sanitized</code>) and click OK to be done.


You can now use DataGrip as a local alternative to Hue. You will need to open an SSH tunnel to analytics-hive (as you did in step 2) every time you want to connect with DataGrip. Yes, there is an SSH/SSL tab in the connection details window, but it doesn't support our bastion setup. Adding the following to your '''~/.bash_profile''' will make that part easy, because then you can just type <code>data-grip</code> to open the tunnel:<syntaxhighlight lang="bash">
alias data-grip="ssh -N stat1007.eqiad.wmnet -L 10000:analytics-hive.eqiad.wmnet:10000"
</syntaxhighlight>
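
If DataGrip's connection test fails, first check that the tunnel itself is up (a quick sketch; <code>nc</code> ships with most systems):

<syntaxhighlight lang="bash">
# With the data-grip tunnel running in another terminal,
# this should report that localhost port 10000 is open
nc -vz localhost 10000
</syntaxhighlight>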
 