Kerberos Authentication is not enabled in the Analytics Hadoop cluster yet, stay tuned!
Purpose of Kerberos
Hadoop by default does not ship with a strong authentication mechanism for users and daemons. In order to enable its "Secure mode", an external authentication service must be plugged in, and the only compatible one is Kerberos.
When enabled, as in the Hadoop test cluster, users and daemons will need to authenticate to our Kerberos service before being able to use Hadoop. Please read the next sections for more info about what to do.
High level overview
The picture depicts a high-level overview of how Kerberos authentication affects users. First of all, note that the Hadoop cluster is the only part of the infrastructure that will be configured to use Kerberos. The red lines show the parties that will need to authenticate to Kerberos in order to use Hadoop:
- Druid, since its deep storage is Hadoop HDFS. Please note that this does not mean that Druid will require Kerberos authentication from its users, only that Druid itself will need to authenticate before fetching data from HDFS. Superset and Turnilo dashboards will keep working as before, without changes.
- Users logging in on stat100[4,5,7] and notebook100[3,4]. Anybody who wants to use a tool that interacts with Hadoop (Oozie, Hive, Spark, Notebooks, etc.) will need to authenticate via Kerberos.
How do I...
Authenticate via Kerberos
Run the kinit command, enter your password, and then execute any command (spark, hive, etc.). This is very important: if you don't do it, you'll see horrible error messages from basically anything you use. The kinit command grants you a so-called Kerberos TGT (Ticket Granting Ticket), which is used to authenticate you to the various services/hosts. The ticket lasts 24 hours, so you won't need to run kinit before every command, just once a day. You can inspect the status of your ticket via klist.
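As a minimal sketch, a typical session on a client host looks like this (the klist output varies by host and principal):

```shell
# Obtain a Kerberos TGT; you will be prompted for your Kerberos password.
kinit

# Inspect the current ticket: principal, issue time and expiry time.
klist

# Optionally destroy the ticket explicitly (it expires on its own after 24h).
kdestroy
```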
Get a password for Kerberos
Please request an identity in https://phabricator.wikimedia.org/T237605. If you have any doubts, feel free to ask the Analytics team on Freenode in #wikimedia-analytics or via email. You'll receive an email containing a temporary password, which you'll be required to change during your first authentication (see the section above).
- This is really annoying, can't we just use LDAP or something similar to avoid another password?
- We tried really hard but for a lot of technical reasons, the integration would be complicated and cumbersome to maintain for Analytics and SRE. There might be some changes in the future, but for now we'll have to deal with another password to remember.
Run a recurring job via Cron or similar without running kinit every day
The option currently available is a Kerberos keytab: a file, readable only by its owner, holding the credentials needed to authenticate to Kerberos without typing a password. We use keytabs for daemons/services, and we plan to provide them to users who need to run periodic jobs. The major drawbacks are:
- It lowers our security bar a bit, since ssh access to a host becomes sufficient to access HDFS (as opposed to also knowing a password). This is more or less the current scheme, so not a big deal, but we have to keep it in mind.
- A keytab needs to be generated for every host that runs such automation, and it also needs to be regenerated and re-deployed whenever the user changes their password (this doesn't happen for daemons, of course). This is currently not automated and requires a ping to Analytics every time.
To summarize: we'll work on a solution, possibly shaped by user feedback, but for the first iteration we won't provide keytabs to all users (only deploying them selectively where needed). If you think your use case needs one, please ping Analytics.
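As an illustrative sketch only (the keytab path and principal below are hypothetical; real keytabs are generated and deployed by Analytics case by case), a cron job could authenticate non-interactively from a keytab before touching HDFS:

```shell
# Hypothetical keytab path and principal, for illustration only.
# kinit -kt reads the key from the keytab instead of prompting for a password.
kinit -kt /etc/security/keytabs/myuser.keytab myuser@WIKIMEDIA

# The job can now talk to HDFS until the ticket expires (24h).
hdfs dfs -ls /user/myuser
```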
Know what datasets I can use
In the Hadoop test cluster we have only a few datasets available:
- webrequest (sampled)
- pageviews (sampled)
If you need more, please ping Luca or the Analytics team :)
Check the Yarn Resource Manager's UI
Nothing has changed for the Yarn UI!
Use Hue
Nothing has changed for Hue!
Use Hive
The hive CLI is compatible with Kerberos, even though it uses an older protocol (connecting directly to the Hive Metastore and HDFS). The beeline command line tool uses the Hive 2 server via JDBC and is also compatible with Kerberos. You just need to authenticate as described above and then run either tool on an-tool1006.eqiad.wmnet.
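For example, after running kinit, beeline can connect to the Hive 2 server by passing the server principal in the JDBC URL (the URL and principal here are the ones used for the Hive 2 server on an-coord1001):

```shell
# Authenticate first, then connect via JDBC.
# The 'principal' parameter is the Hive service principal, not your own user.
kinit
beeline -u "jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default;principal=hive/an-coord1001.eqiad.wmnet@WIKIMEDIA"
```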
Use Spark 2
On stat100[4,5,7] and notebook100[3,4] authenticate via kinit and then use the spark shell as you are used to. There are currently some limitations:
spark2-thriftserver requires the Hive keytab, which is only present on an-coord1001, so running it on client nodes will return the following error:
org.apache.hive.service.ServiceException: Unable to login to kerberos with given principal/keytab
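A minimal sketch of a Spark session on one of those hosts (the spark2-* command names follow the naming used above; my_job.py is a placeholder for your own script):

```shell
# Authenticate once a day, then use the usual Spark 2 entry points.
kinit
spark2-shell --master yarn              # interactive Scala shell on YARN
spark2-submit --master yarn my_job.py   # submit a batch job (placeholder script)
```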
Use Jupyterhub (SWAP replica)
You can authenticate to Kerberos running kinit in the Terminal window. Please remember that it will be needed only once every 24h, not every time.
Use Hive2 actions in Oozie
+    <credentials>
+        <credential name='my-hive-creds' type='hive2'>
+            <property>
+                <name>hive2.server.principal</name>
+                <value>hive/an-coord1001.eqiad.wmnet@WIKIMEDIA</value>
+            </property>
+            <property>
+                <name>hive2.jdbc.url</name>
+                <value>jdbc:hive2://an-coord1001.eqiad.wmnet:10000/default</value>
+            </property>
+        </credential>
+    </credentials>

-    <action name="add_partition">
+    <action name="add_partition" cred="my-hive-creds">
How do I... (Analytics admins version)
Check the status of the HDFS Namenodes and Yarn Resource Managers
Most of the commands are the same, but of course to authenticate as the hdfs user you'll need to use a keytab:
sudo -u hdfs kerberos-run-command hdfs /usr/bin/yarn rmadmin -getServiceState an-master1001-eqiad-wmnet
sudo -u hdfs kerberos-run-command hdfs /usr/bin/hdfs haadmin -getServiceState an-master1002-eqiad-wmnet