I was training a GCN model on my Linux server and suddenly got this error.
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
PyTorch version: 1.10.1+cu102
OS: Linux
Python version: Python 3.8.10
CUDA Version: 11.2
No, it doesn’t return any errors:
NVIDIA-SMI 450.57, Driver Version: 450.57, CUDA Version: 11.2
I have restarted it many times, but the problem persists.
I didn't do any updates. I installed PyTorch and it installs successfully.
I got:
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:112.)
return torch._C._cuda_getDeviceCount() > 0
False
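For context, the warning and the False above came from a minimal check like this (nothing else was run in the session):

```python
import torch

# Calling is_available() triggers lazy CUDA initialization; this is where the
# cudaGetDeviceCount() error surfaces and is turned into the UserWarning above.
print(torch.cuda.is_available())  # prints the warning, then False
```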
Based on this issue, other users ran into the same error message if:
- their setup was broken due to a driver/library mismatch (rebooting seemed to solve the issue), or
- their installed drivers didn't match the user-mode driver inside a Docker container (and forward compatibility failed due to the usage of non-server GPUs).
Was your setup working before and if so, what changed?
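A rough sketch (just an illustration, assuming a Linux host with the NVIDIA driver installed) for gathering the version information that usually reveals a driver/library mismatch:

```python
import subprocess
import torch

# Versions PyTorch was built against vs. what the system driver reports.
print("torch:", torch.__version__)             # e.g. 1.10.1+cu102
print("built with CUDA:", torch.version.cuda)  # CUDA runtime PyTorch ships with

# nvidia-smi shows the user-space driver and the max CUDA version it supports.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

# The loaded kernel module should match the user-space driver above; a mismatch
# (e.g. after an unattended driver update) can cause exactly this kind of
# initialization error until the node is rebooted or the module is reloaded.
with open("/proc/driver/nvidia/version") as f:
    print(f.read())
```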
I might have a very similar issue, shown below. I'm running DDP training on HPC clusters (with SLURM or LSF), where each node has 4 V100 GPUs. Without any changes to my code or environment (PyTorch 1.9.0, CUDA 10.2), I have started hitting this issue recently. As a result, torch.cuda.is_available() returns False on a few nodes, while CUDA is available on most nodes. It's hard to reboot the cluster; do you have any suggestions for further debugging?
[UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0]
Do you receive this message randomly?
If so, I would guess your system encountered some kind of issue and might have dropped the GPU.
Based on the error message I would have guessed you are running into a setup issue, but this wouldn’t explain the randomness (assuming you are indeed seeing these errors randomly). Are you seeing any Xids in dmesg?
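As an illustration only (assuming dmesg is readable without extra privileges on your nodes), pulling the Xid lines out of the kernel log from Python could look like this:

```python
import subprocess

# NVIDIA driver faults are reported as "Xid" messages in the kernel log.
log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if "Xid" in line:
        print(line)
```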
Hi ptrblck,
Thanks for the quick reply. I think you are right about the "randomness", and sorry for the confusion. I think the "randomness" comes from the compute nodes being randomly assigned each time a job is submitted. If it is a setup issue with PyTorch, CUDA, and the driver, how could most nodes work well while a few fail, assuming each compute node is set up identically?
By running dmesg, I do find some Xids; examples are below. But I don't think these Xids make torch.cuda unavailable, since I can successfully run DDP training while these Xids are present.
[screenshots of dmesg output showing Xid entries]

JiamingLiu-Jeremy:
I think the "randomness" comes from the compute nodes being randomly assigned each time a job is submitted.
Does this mean that the same node causes the issue if it’s selected in your env?
If so, check the node's health status, as it seems to have trouble with the driver.
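As a sketch only (a hypothetical pre-flight step at job start, not part of PyTorch or SLURM), each node could log its own CUDA status so the unhealthy host is easy to identify and drain:

```python
import socket
import torch

# Run at the start of every job step: record which host fails CUDA init.
host = socket.gethostname()
if torch.cuda.is_available():
    print(f"[{host}] CUDA OK, {torch.cuda.device_count()} device(s) visible")
else:
    # Fail fast so the scheduler logs point at the unhealthy node.
    raise RuntimeError(f"CUDA initialization failed on {host}; check driver health")
```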
Just to confirm, same here. I met this error at random: in my last run the code was running properly, then it suddenly started producing this error. I tried rebooting, but that didn't work. I checked the CUDA runtimes and everything looks good.
I would really appreciate some help.