
Difference between revisions of "Analytics/Systems/Cluster/AMD GPU"


Revision as of 11:28, 13 September 2019

The Analytics team added a GPU to the stat1005 host in

The model is an AMD Radeon Pro WX 9100 16GB. AMD was chosen because it is currently the only vendor releasing its software stack as open source:

Use the Debian packages

See profile::statistics::gpu or the amd_rocm module in operations/puppet.

Use the GPU on the host

You need to be in the gpu-testers POSIX group, which is defined in operations/puppet. This is a workaround that forces the users in that group into the render POSIX group (available on Debian), which grants access to the GPU. Eventually all Analytics groups will get access automatically, but for the moment gpu-testers is the only one.
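One way to confirm that the workaround took effect is to check whether your session carries the render group. The following is a minimal sketch using only the Python standard library; the helper names `current_group_names` and `can_use_gpu` are made up for illustration and are not part of any Wikimedia tooling.

```python
import grp
import os


def current_group_names():
    """Return the names of the groups the current process belongs to."""
    names = set()
    for gid in os.getgroups():
        try:
            names.add(grp.getgrgid(gid).gr_name)
        except KeyError:
            # This gid has no entry in the group database; skip it.
            continue
    return names


def can_use_gpu(group_names, required="render"):
    """True if the required POSIX group (render, per the text above) is present."""
    return required in group_names
```

Calling `can_use_gpu(current_group_names())` on the host should return True once the membership is active. Group changes only apply to new login sessions, so log out and back in before checking.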

Use tensorflow

The easiest solution is to create a Python 3 virtual environment on stat1005 and then pip3 install tensorflow-rocm. Please remember that every version of the package is linked against a specific version of ROCm, so newer versions of tensorflow-rocm may not run on our hosts, since we don't have an up-to-date version of ROCm deployed yet.

An example that we learnt from

  • tensorflow-rocm 1.13.4 and later are supported only by ROCm 2.6
  • if you use ROCm 2.5, then you can run up to tensorflow-rocm 1.13.3. Newer versions trigger ABI compatibility issues (such as symbols not found in libraries).

Upstream suggested to follow and check every time what combination of tensorflow-rocm and ROCm is supported.

With the current version of ROCm, 2.7.1, only tensorflow-rocm 1.14.1 is supported.
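The combinations mentioned above can be captured in a small lookup, which may help when deciding which tensorflow-rocm version to pin in a virtual environment. This is only a sketch that encodes the statements on this page; it is not an upstream compatibility matrix, and the function names are hypothetical. Anything not stated on this page returns None (unknown).

```python
def parse_version(text):
    """Turn '1.13.4' into (1, 13, 4) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def tensorflow_rocm_supported(rocm, tensorflow_rocm):
    """Return True/False for combinations listed on this page, None otherwise."""
    rocm_v = parse_version(rocm)
    tf_v = parse_version(tensorflow_rocm)
    if rocm_v[:2] == (2, 5):
        # ROCm 2.5 runs tensorflow-rocm up to 1.13.3; newer versions hit ABI issues.
        return tf_v <= (1, 13, 3)
    if rocm_v[:2] == (2, 6):
        # 1.13.4 and later need ROCm 2.6; the page says nothing about older versions.
        return True if tf_v >= (1, 13, 4) else None
    if rocm_v == (2, 7, 1):
        # With ROCm 2.7.1 only tensorflow-rocm 1.14.1 is supported.
        return tf_v == (1, 14, 1)
    return None  # combination not covered by this page
```

For example, `tensorflow_rocm_supported("2.7.1", "1.14.1")` returns True, while `tensorflow_rocm_supported("2.5", "1.13.4")` returns False.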

Check the version of ROCm deployed on a host

elukey@stat1005:~/test$ dpkg -l rocm-dev
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name           Version      Architecture Description
ii  rocm-dev       2.7.22       amd64        Radeon Open Compute (ROCm) Runtime software stack
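If the version is needed inside a script, the `dpkg -l` output shown above can be parsed. This is a minimal sketch, assuming the column layout shown in the example; `parse_dpkg_list` and `installed_version` are illustrative names, not existing tools.

```python
import subprocess


def parse_dpkg_list(output, package):
    """Return the version of `package` from `dpkg -l` output, or None."""
    for line in output.splitlines():
        fields = line.split()
        # Installed packages are listed as: ii <name> <version> <arch> <description>
        if len(fields) >= 3 and fields[0] == "ii":
            # dpkg may print the name as <name>:<arch>; compare the bare name only.
            if fields[1].split(":")[0] == package:
                return fields[2]
    return None


def installed_version(package="rocm-dev"):
    """Run `dpkg -l <package>` on the local host and parse the result."""
    result = subprocess.run(
        ["dpkg", "-l", package],
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=False,
    )
    return parse_dpkg_list(result.stdout, package)
```

On a host with the package installed, `installed_version()` would return a string such as the `2.7.22` shown above; it returns None if the package is not in the `ii` state.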

Changelog in

Check usage of the GPU

On the host:

elukey@stat1005:~$ /opt/rocm/bin/rocm-smi --showuse

========================ROCm System Management Interface========================
GPU[1] 		: Current GPU use: 0%
==============================End of ROCm SMI Log ==============================
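The same information can be extracted programmatically from the `rocm-smi --showuse` output. The sketch below assumes the line format shown above (`GPU[n] : Current GPU use: X%`); other rocm-smi versions may print something different, so treat the pattern as an assumption.

```python
import re

# Matches lines such as "GPU[1]    : Current GPU use: 0%"
USE_LINE = re.compile(r"GPU\[(\d+)\]\s*:\s*Current GPU use:\s*(\d+)%")


def parse_gpu_use(output):
    """Map GPU index to utilisation percentage from `rocm-smi --showuse` output."""
    return {int(m.group(1)): int(m.group(2)) for m in USE_LINE.finditer(output)}
```

Feeding it the output above yields `{1: 0}`, that is, GPU 1 at 0% use.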

In Grafana:

Code available in:

Outstanding issues

Upgrade the Debian packages

We import the Debian packages released by AMD for Ubuntu Xenial into the amd-rocm component in wikimedia-buster. As of July 2019, one Debian package released by AMD is not open source: hsa-ext-rocr-dev. It contains binary libraries for better image support in OpenCL, and we don't use it because it is not open source. Sadly, the package is required by other packages, and upstream still hasn't made it optional (

The solution found in was to create a dummy package via Debian equivs to satisfy dependencies and keep the apt install process happy. This means that every time a new ROCm release is out, the following procedure needs to be followed:

1) Check and see if a new version is out. If so, create a new component like:

Name: amd-rocmXX
Suite: xenial
Components: main>thirdparty/amd-rocmXX
Architectures: amd64
VerifyRelease: 9386B48A1A693C5C
ListShellHook: grep-dctrl -e -S '^([..cut..])$' || [ $? -eq 1 ]

Replace the XX wildcards with the version number of course.
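To avoid missing one of the wildcards, the stanza can be rendered from a template. This is a sketch only: the exact form of the suffix (for example `26`) is an assumption, the `[..cut..]` part of the ListShellHook is left elided exactly as on this page, and `render_stanza` is a hypothetical helper.

```python
# Stanza copied from the page; the [..cut..] elision is kept as-is.
STANZA_TEMPLATE = """\
Name: amd-rocmXX
Suite: xenial
Components: main>thirdparty/amd-rocmXX
Architectures: amd64
VerifyRelease: 9386B48A1A693C5C
ListShellHook: grep-dctrl -e -S '^([..cut..])$' || [ $? -eq 1 ]
"""


def render_stanza(suffix):
    """Replace every XX wildcard with the given version suffix."""
    if not suffix:
        raise ValueError("version suffix must not be empty")
    return STANZA_TEMPLATE.replace("XX", suffix)
```

For instance, `print(render_stanza("26"))` prints the stanza with `amd-rocm26` in both the Name and Components fields.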

2) ssh to install1002, run puppet and check for updates (remember to replace the XX wildcards):

root@install1002:/srv/wikimedia# reprepro --noskipold --ignore=forbiddenchar --component thirdparty/amd-rocmXX checkupdate buster-wikimedia
Calculating packages to get...
Updates needed for 'buster-wikimedia|thirdparty/amd-rocm|amd64':
'hsa-rocr-dev': newly installed as '1.1.9-87-g1566fdd' (from 'amd-rocm'):
 files needed: pool/thirdparty/amd-rocm/h/hsa-rocr-dev/hsa-rocr-dev_1.1.9-87-g1566fdd_amd64.deb
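The version needed in the next step can be read from the checkupdate output. A minimal sketch, assuming the `newly installed as` wording shown above; lines with any other wording are ignored, and `new_versions` is an illustrative name.

```python
import re

# Matches lines such as: 'hsa-rocr-dev': newly installed as '1.1.9-87-g1566fdd' (...)
NEWLY_INSTALLED = re.compile(
    r"^'([^']+)': newly installed as '([^']+)'", re.MULTILINE
)


def new_versions(output):
    """Map package name to new version from `reprepro checkupdate` output."""
    return dict(NEWLY_INSTALLED.findall(output))
```

With the output above, `new_versions(output)["hsa-rocr-dev"]` gives `1.1.9-87-g1566fdd`.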

3) Find the new version of hsa-rocr-dev (shown in the checkupdate output above), since it is the only package in ROCm that requires a precise version of the hsa-ext-rocr-dev package (namely, the same version as its own). Then create a control file like the following on boron:

### Commented entries have reasonable defaults.
### Uncomment to edit them.
# Source: <source package name; defaults to package name>
Section: devel
Priority: optional
# Homepage: <enter URL here; no default>
Standards-Version: 3.9.2

Package: hsa-ext-rocr-dev
Version: 1.1.9-87-g1566fdd
Maintainer: Luca Toscano <>
# Pre-Depends: <comma-separated list of packages>
# Depends: <comma-separated list of packages>
# Recommends: <comma-separated list of packages>
# Suggests: <comma-separated list of packages>
# Provides: <comma-separated list of packages>
# Replaces: <comma-separated list of packages>
Architecture: amd64
# Multi-Arch: <one of: foreign|same|allowed>
# Copyright: <copyright file; defaults to GPL2>
# Changelog: <changelog file; defaults to a generic changelog>
# Readme: <README.Debian file; defaults to a generic one>
# Extra-Files: <comma-separated list of additional files for the doc directory>
# Files: <pair of space-separated paths; First is file to include, second is destination>
#  <more pairs, if there's more than one file to include. Notice the starting space>
Description: dummy package to satisfy dependencies for hsa-rocr-dev
 hsa-ext-rocr-dev contains binary only and non open-source libraries

Make sure the Version is the new one of hsa-rocr-dev and save.
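Since only the Version field changes between releases, the control file can be generated rather than edited by hand. The sketch below keeps only the uncommented fields from the file above; `render_control` is a hypothetical helper, and the maintainer string is passed in by the caller because the address is not shown on this page.

```python
CONTROL_TEMPLATE = """\
Section: devel
Priority: optional
Standards-Version: 3.9.2

Package: hsa-ext-rocr-dev
Version: {version}
Maintainer: {maintainer}
Architecture: amd64
Description: dummy package to satisfy dependencies for hsa-rocr-dev
 hsa-ext-rocr-dev contains binary only and non open-source libraries
"""


def render_control(version, maintainer):
    """Return an equivs control file whose Version matches hsa-rocr-dev."""
    if not version or any(ch.isspace() for ch in version):
        raise ValueError("version must be a non-empty string without whitespace")
    return CONTROL_TEMPLATE.format(version=version, maintainer=maintainer)
```

Writing the result to a file named `control` makes it usable with the build command in the next step.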

4) build the package with equivs-build control

5) upload the package to reprepro (remember to replace the XX wildcards):

reprepro -C thirdparty/amd-rocmXX includedeb buster-wikimedia /home/elukey/hsa-ext-rocr-dev_1.1.9-87-g1566fdd_amd64.deb

6) Update the thirdparty/amd-rocmXX component (remember to replace the XX wildcards):

reprepro --noskipold --ignore=forbiddenchar --component thirdparty/amd-rocmXX update buster-wikimedia
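The two reprepro commands from steps 5 and 6 differ only in the component suffix and the .deb path, so they can be generated to avoid leaving an XX in place. A sketch under the same assumption about the suffix format; `reprepro_commands` is an illustrative name and the function only builds the command strings, it does not run anything.

```python
import shlex


def reprepro_commands(suffix, deb_path):
    """Return the includedeb and update command lines for one ROCm component."""
    component = "thirdparty/amd-rocm" + suffix
    include = ["reprepro", "-C", component, "includedeb", "buster-wikimedia", deb_path]
    update = [
        "reprepro", "--noskipold", "--ignore=forbiddenchar",
        "--component", component, "update", "buster-wikimedia",
    ]
    # Quote each argument so the strings are safe to paste into a shell.
    return [" ".join(shlex.quote(arg) for arg in argv) for argv in (include, update)]
```

Printing the two returned strings gives commands that can be reviewed and then pasted on the repository host.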

7) Update the versions supported by the amd_rocm module in operations/puppet.

8) On the host that you want to upgrade:

sudo apt autoremove -y rocm_bandwidth_test rocminfo hsakmt-roct hsa-rocr-dev rocm-cmake hsa-ext-rocr-dev rocm-device-libs hip_base hip_samples

And then run puppet to install the new packages. Some quick tests to see if the GPU is properly recognized:

elukey@stat1005:~$ /opt/rocm/bin/rocminfo

elukey@stat1005:~$ /opt/rocm/opencl/bin/x86_64/clinfo

elukey@stat1005:~$ virtualenv -p python3 test

elukey@stat1005:~$ source test/bin/activate

(test) elukey@stat1005:~$ pip3 install tensorflow-rocm

(test) elukey@stat1005:~$ cat
import tensorflow as tf
# Creates a graph.
with tf.device('/device:GPU:0'):
  a = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[2, 3], name='a')
  b = tf.constant([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], shape=[3, 2], name='b')
  c = tf.matmul(a, b)
# Creates a session with log_device_placement set to True.
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
# Runs the op.
print(sess.run(c))

(test) elukey@stat1005:~$ python3
[[22. 28.]
 [49. 64.]]
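The expected result of the test can be verified without a GPU. The following plain-Python sketch multiplies the same two matrices, so the values printed by the tensorflow run can be compared against it; `matmul` here is a local helper, not the tensorflow function.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [
        [sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
        for row in a
    ]


# Same values and shapes as the tensorflow test: a is 2x3, b is 3x2.
a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
b = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(matmul(a, b))  # → [[22.0, 28.0], [49.0, 64.0]]
```

If the tensorflow run prints different values, the GPU stack is likely misbehaving rather than merely being slow.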