I was training a GCN model on my Linux server and suddenly got this error.
RuntimeError: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW
PyTorch version: 1.10.1+cu102
OS: Linux
Python version: Python 3.8.10
CUDA Version: 11.2
No, it doesn’t return any errors:
NVIDIA-SMI 450.57, Driver Version: 450.57, CUDA Version: 11.2
I have restarted it many times, but the problem persists.
I didn't do any updates. I installed PyTorch and it installs successfully.
I got:
/usr/local/lib/python3.8/dist-packages/torch/cuda/__init__.py:80: UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 804: forward compatibility was attempted on non supported HW (Triggered internally at ../c10/cuda/CUDAFunctions.cpp:112.)
return torch._C._cuda_getDeviceCount() > 0
False
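For context, the warning and the False above came from a minimal check like this (nothing else was run in the session):

```python
import torch

# Calling is_available() triggers lazy CUDA initialization; this is where the
# cudaGetDeviceCount() error surfaces and is turned into the UserWarning above.
print(torch.cuda.is_available())  # prints the warning, then False
```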
Based on this issue, other users ran into the same error message if:
- their setup was broken due to a driver/library mismatch (rebooting seemed to solve the issue), or
- their installed drivers didn't match the user-mode driver inside a Docker container (and forward compatibility failed due to the usage of non-server GPUs).
Was your setup working before and if so, what changed?
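A rough sketch (just an illustration, assuming a Linux host with the NVIDIA driver installed) for gathering the version information that usually reveals a driver/library mismatch:

```python
import subprocess
import torch

# Versions PyTorch was built against vs. what the system driver reports.
print("torch:", torch.__version__)             # e.g. 1.10.1+cu102
print("built with CUDA:", torch.version.cuda)  # CUDA runtime PyTorch ships with

# nvidia-smi shows the user-space driver and the max CUDA version it supports.
print(subprocess.run(["nvidia-smi"], capture_output=True, text=True).stdout)

# The loaded kernel module should match the user-space driver above; a mismatch
# (e.g. after an unattended driver update) can cause exactly this kind of
# initialization error until the node is rebooted or the module is reloaded.
with open("/proc/driver/nvidia/version") as f:
    print(f.read())
```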
I might have a very similar issue, shown below. I'm running DDP training on HPC clusters (with SLURM or LSF), where each node has 4 V100 GPUs. Without any changes to my code or environment (PyTorch 1.9.0, CUDA 10.2), I have started hitting this issue recently. As a result, torch.cuda.is_available() returns False on a few nodes, while CUDA is available on most nodes. It's hard to reboot the cluster; do you have any suggestions for further debugging?
[UserWarning: CUDA initialization: Unexpected error from cudaGetDeviceCount(). Did you run some cuda functions before calling NumCudaDevices() that might have already set an error? Error 803: system has unsupported display driver / cuda driver combination (Triggered internally at …/c10/cuda/CUDAFunctions.cpp:115.)
return torch._C._cuda_getDeviceCount() > 0]
Do you receive this message randomly?
If so, I would guess your system encountered some kind of issue and might have dropped the GPU.
Based on the error message I would have guessed you are running into a setup issue, but this wouldn’t explain the randomness (assuming you are indeed seeing these errors randomly). Are you seeing any Xids in dmesg?
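As an illustration only (assuming dmesg is readable without extra privileges on your nodes), pulling the Xid lines out of the kernel log from Python could look like this:

```python
import subprocess

# NVIDIA driver faults are reported as "Xid" messages in the kernel log.
log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
for line in log.splitlines():
    if "Xid" in line:
        print(line)
```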
Hi ptrblck,
Thanks for the quick reply. I think you are right about the "randomness", and sorry for the confusion. I think the "randomness" comes from the compute nodes being randomly assigned each time a job is submitted. If it is a setup issue with PyTorch, CUDA, and the driver, how could most nodes work well while a few fail, assuming each compute node is set up identically?
By running dmesg, I do find some Xids; examples are below. But I don't think these Xids make torch.cuda unavailable, since I can successfully run DDP training while these Xids are present.
[screenshots of dmesg output showing Xid entries]

JiamingLiu-Jeremy:
I think the "randomness" comes from the compute nodes being randomly assigned each time a job is submitted.
Does this mean that the same node causes the issue if it’s selected in your env?
If so, check the node's health status, as it seems to have trouble with the driver.
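As a sketch only (a hypothetical pre-flight step at job start, not part of PyTorch or SLURM), each node could log its own CUDA status so the unhealthy host is easy to identify and drain:

```python
import socket
import torch

# Run at the start of every job step: record which host fails CUDA init.
host = socket.gethostname()
if torch.cuda.is_available():
    print(f"[{host}] CUDA OK, {torch.cuda.device_count()} device(s) visible")
else:
    # Fail fast so the scheduler logs point at the unhealthy node.
    raise RuntimeError(f"CUDA initialization failed on {host}; check driver health")
```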
Just to confirm, same here. I met this error at random: in my last run the code was running properly, then it suddenly started producing this error. I tried rebooting, but that didn't work. I checked the CUDA runtimes and everything looks good.
I would really appreciate some help.