For a long time I have been using a good old nvidia GeForce GTX 1050 for my display and deep learning needs. I reported a few times how to get Tensorflow running on Debian/Sid, see here and here. Later on I switched to an AMD GPU in the hope that an open source approach to both the GPU driver and deep learning (ROCm) would improve the general experience. Unfortunately, it turned out that AMD GPUs are generally not ready for deep learning usage.
The problems with AMD and ROCm are far and wide. First of all, it seems that for anything more complicated than simple stuff, AMD’s flagship RX 5700(XT) and all GFX10 (Navi) based cards are not(!!!) supported in ROCm. Yes, you read that correctly … AMD does not support 5700(XT) cards in the ROCm stack. Some simple stuff works, but nothing for real computations.
Then, even IF these cards were supported, ROCm as distributed is currently a huge pain in the butt. The source code is a huge mess, and building usable packages from it is probably possible, but quite painful (I am a member of the ROCm packaging team in Debian, and have spent many hours trying). And the packages provided by AMD are not installable on Debian/sid due to library incompatibilities.
So that left me with a bit of a problem: for work I need to train quite a few neural networks, do model selection, etc. Doing this on a CPU is a bit of a burden. So in the end I decided to put the nVidia card back into the computer (well, after moving it to a bigger case – but that is a different story to tell). Here are the steps I took to get both cards working for their respective targets: the AMD GPU for driving the console and X (and games!), and the nVidia card for the deep learning stuff (tensorflow using the GPU).
The starting point was a working AMD GPU installation. The AMD GPU is also the first GPU card (top slot) and thus the one that is used by the BIOS and the Linux console. If you want the video output on the second card you need to resort to tricks, and you probably won’t have console output, etc. So not a solution for me.
Installing libcuda1 and the nvidia kernel drivers
Next step was installing the libcuda1 package. This installs a lot of stuff, including the nvidia drivers, GLX libraries, alternatives setup, and the update-glx tool and package.
The kernel module should be built and installed automatically for your kernel.
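To confirm that the module really got loaded, a quick look at /proc/modules is enough. The following is only a minimal sketch and not part of the steps I ran: the helper name nvidia_modules is made up for illustration, and it assumes the usual /proc/modules layout where the module name is the first field of each line.

```python
from pathlib import Path


def nvidia_modules(proc_modules_text):
    """Return the names of loaded kernel modules that start with 'nvidia'."""
    names = []
    for line in proc_modules_text.splitlines():
        fields = line.split()
        # The first whitespace-separated field of each line is the module name.
        if fields and fields[0].startswith("nvidia"):
            names.append(fields[0])
    return names


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        found = nvidia_modules(proc_modules.read_text())
        print("nvidia modules loaded:", ", ".join(found) if found else "none")
    else:
        print("/proc/modules is not available on this system")
```

An empty result after a reboot usually means the module was not built for the running kernel or was not loaded.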
Follow more or less the instructions here and do
wget -O- https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub | sudo tee /etc/apt/trusted.gpg.d/nvidia-cuda.asc
echo "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /" | sudo tee /etc/apt/sources.list.d/nvidia-cuda.list
sudo apt-get update
sudo apt-get install cuda-libraries-10-1
Warning! At the moment Tensorflow packages require CUDA 10.1, so don’t install the 10.0 version. This might change in the future!
This will install lots of libs into /usr/local/cuda-10.1 and add the respective directory to the ld.so path by creating a corresponding configuration file.
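Whether the dynamic linker actually picked up the new directory can be checked by searching the linker cache. This is a rough sketch under assumptions: the helper name cuda_libs_in_cache is invented here, libcudart is assumed to be among the installed libraries, and ldconfig is assumed to live at /sbin/ldconfig as on Debian.

```python
import subprocess


def cuda_libs_in_cache(ldconfig_output, needle="libcudart"):
    """Return the paths of cached libraries whose entry contains `needle`."""
    paths = []
    for line in ldconfig_output.splitlines():
        # Cache entries look like: "<name> (<flags>) => <path>"
        if needle in line and "=>" in line:
            paths.append(line.split("=>", 1)[1].strip())
    return paths


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["/sbin/ldconfig", "-p"], capture_output=True, text=True, check=True
        )
        output = result.stdout
    except (OSError, subprocess.CalledProcessError):
        output = ""
    for path in cuda_libs_in_cache(output):
        print(path)
```

If nothing is printed, running ldconfig as root once more, or checking the ld.so configuration, would be the next thing to try.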
Install CUDA CuDNN
One dependency that is difficult to satisfy is the CuDNN libraries. In our case we need the version 7 library for CUDA 10.1. To download these files one needs an NVIDIA developer account, which is quick and painless to create. After that, go to the CuDNN page, select Archived releases, then Download cuDNN v7.N.N (xxxx NN, YYYY), for CUDA 10.1, and then cuDNN Runtime Library for Ubuntu18.04 (Deb).
At the moment this will download a file libcudnn7_7.N.N.N-1+cuda10.1_amd64.deb (with the version digits of the release you selected), which needs to be installed with
dpkg -i libcudnn7_7.N.N.N-1+cuda10.1_amd64.deb.
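A quick way to see whether the runtime library is usable after the dpkg run is to load it and ask for its version. The snippet below is a sketch, not something from my original notes: the function name cudnn_version is made up, and the soname libcudnn.so.7 is an assumption derived from the libcudnn7 package name. cudnnGetVersion() is the cuDNN call that reports the library version as a single integer.

```python
import ctypes


def cudnn_version(soname="libcudnn.so.7"):
    """Return the integer from cudnnGetVersion(), or None if loading fails."""
    try:
        lib = ctypes.CDLL(soname)
    except OSError:
        # Library (or one of its own dependencies) could not be found or loaded.
        return None
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return lib.cudnnGetVersion()


if __name__ == "__main__":
    version = cudnn_version()
    if version is None:
        print("libcudnn could not be loaded")
    else:
        print("cuDNN version reported by the library:", version)
```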
Updating the GLX setting
Here now comes the very interesting part – one needs to set up the GLX libraries. Start by reading the output of update-glx --help and then the output of update-glx --list glx:
$ update-glx --help
update-glx is a wrapper around update-alternatives supporting only configuration
of the 'glx' and 'nvidia' alternatives. After updating the alternatives, it takes
care to trigger any follow-up actions that may be required to complete the switch.

It can be used to switch between the main NVIDIA driver version and the legacy
drivers (eg: the 304 series, the 340 series, etc).

For users with Optimus-type laptops it can be used to enable running the discrete
GPU via bumblebee.

Usage: update-glx <command>

Commands:
  --auto <name>            switch the master link <name> to automatic mode.
  --display <name>         display information about the <name> group.
  --query <name>           machine parseable version of --display <name>.
  --list <name>            display all targets of the <name> group.
  --config <name>          show alternatives for the <name> group and ask the
                           user to select which one to use.
  --set <name> <path>      set <path> as alternative for <name>.

<name> is the master name for this link group.
  Only 'nvidia' and 'glx' are supported.
<path> is the location of one of the alternative target files.
  (e.g. /usr/lib/nvidia)

$ update-glx --list glx
/usr/lib/mesa-diverted
/usr/lib/nvidia
I was tempted to use
update-glx --config glx /usr/lib/mesa-diverted
because in the end the Mesa GLX libraries should be used to drive the display via the AMD GPU.
Unfortunately, with this setting the nvidia kernel module was not loaded, and nvidia-persistenced couldn’t run because the library libnvidia-cfg1 wasn’t found (not sure it was needed at all…), and with that there was also no way to run tensorflow on the GPU.
So instead I tried the opposite setting (which is the same as update-glx --config glx /usr/lib/nvidia), rebooted, and decided to check afterwards what was broken.
To my big surprise, the AMD GPU still worked out of the box, including direct rendering, and the games I tried (Overload, Supraland via Wine) all worked without a hitch.
Not that I really understand why the GLX libraries that are seemingly now in use are from nvidia but everything works the same (if anyone has an explanation, that would be great!), but since I haven’t had any problems so far, I am content.
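To see which alternative is currently selected without going through the interactive dialog, the --query command from the help output above can be parsed. The following is only a sketch under assumptions: since update-glx is a wrapper around update-alternatives, I assume its --query output uses the same "Key: value" stanza format (with fields such as Status and Value), and the function name parse_alternatives_query is invented for this example.

```python
import subprocess


def parse_alternatives_query(text):
    """Parse the first stanza of update-alternatives --query style output."""
    info = {}
    for line in text.splitlines():
        if not line.strip():
            break  # a blank line ends the first stanza
        if line[0].isspace() or ":" not in line:
            continue  # skip indented continuation lines (e.g. the slave list)
        key, value = line.split(":", 1)
        info[key.strip()] = value.strip()
    return info


if __name__ == "__main__":
    try:
        result = subprocess.run(
            ["update-glx", "--query", "glx"],
            capture_output=True, text=True, check=True,
        )
        info = parse_alternatives_query(result.stdout)
        print("status:", info.get("Status"), "selected:", info.get("Value"))
    except (OSError, subprocess.CalledProcessError):
        print("update-glx is not available or the query failed")
```

With the setup described here, the selected value would be expected to be /usr/lib/nvidia.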
Checking GPU usage in tensorflow
Make sure that you remove tensorflow-rocm and reinstall tensorflow with GPU support:
pip3 uninstall tensorflow-rocm
pip3 install --upgrade tensorflow-gpu
After that a simple
$ python3 -c "import tensorflow as tf;print(tf.reduce_sum(tf.random.normal([1000, 1000])))"
....(lots of output)
2020-09-02 11:57:04.673096: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3581 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:05:00.0, compute capability: 6.1)
tf.Tensor(1093.4915, shape=(), dtype=float32)
$
should indicate that the GPU is used by tensorflow!
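If you prefer a check that does not require reading through the log output, TensorFlow can also be asked directly which GPUs it sees. This is a minimal sketch assuming a TensorFlow 2.x version that provides tf.config.list_physical_devices; the helper name visible_gpus is made up, and it returns None when TensorFlow cannot be imported at all.

```python
def visible_gpus():
    """Return the names of GPUs TensorFlow can see, or None without TensorFlow."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Each entry is a PhysicalDevice with a name such as "/physical_device:GPU:0".
    return [device.name for device in tf.config.list_physical_devices("GPU")]


if __name__ == "__main__":
    gpus = visible_gpus()
    if gpus is None:
        print("tensorflow is not installed")
    elif gpus:
        print("GPUs visible to tensorflow:", ", ".join(gpus))
    else:
        print("tensorflow runs, but only on the CPU")
```

An empty list here, despite a loaded nvidia kernel module, usually points to missing CUDA or CuDNN libraries.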
The R Keras package should also work out of the box and pick up the system-wide tensorflow, which in turn picks up the GPU; see this post for example code to run as a test.
All in all it was easier than expected, despite the dances one has to do for nvidia to get the correct libraries. What still puzzles me is the selection option in update-glx, which might need better support for secondary nvidia GPU cards.