How we fixed CUDA Error 101: invalid device ordinal ... torch._C._cuda_getDeviceCount() > 0 🤯

by Saravana Rathinam

In an AI developer's life, few things are more unsettling than strange CUDA errors (except maybe unsolicited Windows updates 🔥). But I digress. Our story begins on an otherwise ordinary day when our trusty GPU server, a behemoth designed to house 8 GPUs, decided to throw us a curveball.

A routine check with nvidia-smi revealed a peculiar anomaly: one of our GPUs had decided to play hide and seek, leaving us with just 7 visible GPUs. The output was stark, showing us a lineup of 7 GPU attendees, with the 8th conspicuously absent 🏖️.

+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.161.07             Driver Version: 535.161.07   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  Tesla P40                      Off | 00000000:05:00.0 Off |                  Off |
| N/A   33C    P0              50W / 250W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  Tesla P40                      Off | 00000000:08:00.0 Off |                  Off |
| N/A   40C    P0              52W / 250W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   2  Tesla P40                      Off | 00000000:09:00.0 Off |                  Off |
| N/A   39C    P0              51W / 250W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   3  Tesla P40                      Off | 00000000:84:00.0 Off |                  Off |
| N/A   35C    P0              52W / 250W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   4  Tesla P40                      Off | 00000000:85:00.0 Off |                  Off |
| N/A   36C    P0              50W / 250W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   5  Tesla P40                      Off | 00000000:88:00.0 Off |                  Off |
| N/A   34C    P0              50W / 250W |      0MiB / 24576MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   6  Tesla P40                      Off | 00000000:89:00.0 Off |                  Off |
| N/A   33C    P0              49W / 250W |      0MiB / 24576MiB |      1%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
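
For routine monitoring, a small script can flag this kind of silent disappearance before it bites. Here is a minimal sketch of that idea (the expected count of 8 is an assumption hard-coded for this server; we only eyeballed nvidia-smi ourselves):

import subprocess

EXPECTED_GPUS = 8  # how many GPUs this server is supposed to expose

def visible_gpus():
    # One CSV line per GPU that the driver successfully initialized
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=index,name,pci.bus_id", "--format=csv,noheader"],
        text=True,
    )
    return [line.strip() for line in out.splitlines() if line.strip()]

gpus = visible_gpus()
print(f"{len(gpus)} of {EXPECTED_GPUS} GPUs visible")
for gpu in gpus:
    print("  ", gpu)
if len(gpus) < EXPECTED_GPUS:
    print("Warning: a GPU is missing -- check dmesg for NVRM errors")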

Our journey into the rabbit hole deepened when we discovered that despite the presence of these 7 GPUs, CUDA was throwing a tantrum, refusing to acknowledge any of them 😡:

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
/home/joy/.venv/lib/python3.10/site-packages/torch/cuda/__init__.py:141: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 101: invalid device ordinal (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:108.)
  return torch._C._cuda_getDeviceCount() > 0
False
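
If you want to see the raw error without PyTorch's wrapper, you can poke the CUDA driver API directly through ctypes. This is a diagnostic sketch we put together after the fact, not part of the original session; the library and function names are the standard CUDA driver API:

import ctypes

# Load the CUDA driver library (the same one PyTorch talks to underneath)
cuda = ctypes.CDLL("libcuda.so.1")

rc = cuda.cuInit(0)
print("cuInit returned", rc)  # 0 means CUDA_SUCCESS

count = ctypes.c_int(0)
rc = cuda.cuDeviceGetCount(ctypes.byref(count))
print("cuDeviceGetCount returned", rc, "device count =", count.value)
# PyTorch's warning quoted error 101 ("invalid device ordinal"); a non-zero
# return code here is the same class of driver-level initialization failure.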

Digging through kernel messages with sudo dmesg, we stumbled upon a clue: a power connector on one of the GPUs was loose 🔌.

[ 5362.995401] NVRM: GPU 0000:04:00.0: GPU does not have the necessary power cables connected.
[ 5362.995812] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x24:0x1c:1435)
[ 5362.995852] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 5368.246428] NVRM: GPU 0000:04:00.0: GPU does not have the necessary power cables connected.
[ 5368.246782] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x24:0x1c:1435)
[ 5368.246833] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 5368.603123] NVRM: GPU 0000:04:00.0: GPU does not have the necessary power cables connected.
[ 5368.603470] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x24:0x1c:1435)
[ 5368.603520] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
[ 5373.879295] NVRM: GPU 0000:04:00.0: GPU does not have the necessary power cables connected.
[ 5373.879683] NVRM: GPU 0000:04:00.0: RmInitAdapter failed! (0x24:0x1c:1435)
[ 5373.879726] NVRM: GPU 0000:04:00.0: rm_init_adapter failed, device minor number 0
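
On a box with this many GPUs it can be handy to pull the failing bus addresses out of the log programmatically rather than scrolling. A quick sketch, with the NVRM line format taken from the output above (it assumes dmesg can be run via sudo non-interactively, or that the script itself runs as root):

import re
import subprocess

# Same data as `sudo dmesg`
log = subprocess.check_output(["sudo", "dmesg"], text=True)

# Lines look like: "NVRM: GPU 0000:04:00.0: GPU does not have the necessary power cables connected."
pattern = re.compile(r"NVRM: GPU (\S+): (.+)")

failures = {}
for address, message in pattern.findall(log):
    failures.setdefault(address, set()).add(message)

for address, messages in failures.items():
    print(address)
    for message in sorted(messages):
        print("   ", message)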

So, to use the rest of the GPUs, we needed to disable this one device somehow. Our first attempt was to whitelist the healthy GPUs: add the following line via sudo nano /etc/modprobe.d/nvidia.conf, run sudo update-initramfs -u, and reboot the machine. It did not work.

options nvidia NVreg_AssignGpus="pci:0000:05:00.0,pci:0000:08:00.0,pci:0000:09:00.0,pci:0000:84:00.0,pci:0000:85:00.0,pci:0000:88:00.0,pci:0000:89:00.0"
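
As a side note, that whitelist line is tedious to build by hand. If you go this route on a different machine, a few lines of Python can generate it from the addresses of the healthy cards (the list below simply mirrors our 7 survivors; adjust it for your own topology):

# PCI addresses of the GPUs that should stay visible (everything except 0000:04:00.0)
healthy = [
    "0000:05:00.0", "0000:08:00.0", "0000:09:00.0", "0000:84:00.0",
    "0000:85:00.0", "0000:88:00.0", "0000:89:00.0",
]

# Build the NVreg_AssignGpus whitelist for /etc/modprobe.d/nvidia.conf
value = ",".join(f"pci:{addr}" for addr in healthy)
print(f'options nvidia NVreg_AssignGpus="{value}"')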

Another option was to hide the GPU from the driver at boot time by editing GRUB with sudo vim /etc/default/grub and adding

GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci_stub.ids=10de:1b30"

Followed by sudo update-grub && sudo reboot. But this seemed too complicated (and dangerous), and since pci_stub.ids matches on the vendor:device ID, which identical cards share, it would probably have stubbed out all of our GPUs, not just the faulty one 🧨.

But there was a much easier and faster way, with no reboot required. With a nudge from ChatGPT, we simply unbound the faulty device from the NVIDIA driver.

# Confirm which driver currently owns the faulty device
cd /sys/bus/pci/devices/0000:04:00.0/
readlink driver
# Detach it from the nvidia driver (takes effect immediately, no reboot needed)
echo -n "0000:04:00.0" | sudo tee /sys/bus/pci/drivers/nvidia/unbind

Here, 0000:04:00.0 is the PCI bus address of the faulty GPU, which you can find with lspci -k or in the sudo dmesg output.
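
If you end up doing this more than once, the same unbind is easy to script. Here is a sketch of the idea as a small Python helper (run as root; the function name is ours, not an official API):

from pathlib import Path

def unbind_from_nvidia(pci_address: str) -> None:
    """Detach one PCI device from its driver via sysfs (requires root)."""
    device = Path(f"/sys/bus/pci/devices/{pci_address}")
    driver_link = device / "driver"
    if not driver_link.exists():
        print(f"{pci_address} is not bound to any driver")
        return
    print(f"{pci_address} is bound to {driver_link.resolve().name}, unbinding...")
    # Writing the address to <driver>/unbind detaches the device,
    # same effect as the echo | sudo tee command above
    (driver_link / "unbind").write_text(pci_address)

unbind_from_nvidia("0000:04:00.0")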

And that's it, everything was working ✅!

Python 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.is_available()
True
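
A quick loop over the devices confirms all seven healthy P40s are back. A minimal check along these lines:

import torch

# After the unbind, CUDA initializes cleanly and sees the 7 healthy GPUs
print("CUDA available:", torch.cuda.is_available())
print("Device count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    print(f"  cuda:{i}  {props.name}  {props.total_memory // 2**20} MiB")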

Our workaround did the trick: although we still need to physically reseat that loose power connector, the server is operational once again. This experience highlights how, sometimes, the solution doesn't require massive changes, just the right command at the right time.

If you're navigating your own GPU issues, remember that the answer often lies in a simple, direct approach, aided perhaps by a nudge in the right direction from AI or a timely piece of advice.

I appreciate you reading through this journey. May your computing be seamless and your hardware cooperative 🖖.