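I recently bought a 银河W5 and put a reference-design RTX 4090 in it. When running YOLO, the NVIDIA GPU dies after one iteration. The symptom is that nvidia-smi hangs for more than ten minutes and then prints No devices were found, yet lspci | grep -i nvidia still shows all four GPUs sitting on the bus. A reboot should clear this state, but after rebooting, the same program makes the GPU drop off again once it has run for a while.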
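The symptom above can be turned into a small health check. The following is only a minimal Python sketch, not the tooling used in this post: the function names, the state labels, and the 30-second timeout are all assumptions made for illustration. It flags the case where nvidia-smi hangs or prints No devices were found while lspci still lists NVIDIA devices.

```python
import subprocess


def run(cmd, timeout):
    """Run a command and return (combined output, timed_out).

    Caveat: if the process is stuck in uninterruptible sleep inside the
    driver, killing it after the timeout may itself take a long time.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        return proc.stdout + proc.stderr, False
    except subprocess.TimeoutExpired:
        return "", True


def classify(smi_output, smi_timed_out, lspci_output):
    """Classify the GPU state from nvidia-smi and lspci results.

    The labels "gpu-dropped", "no-gpu" and "ok" are hypothetical names.
    """
    on_bus = any("nvidia" in line.lower() for line in lspci_output.splitlines())
    driver_lost = smi_timed_out or "No devices were found" in smi_output
    if driver_lost and on_bus:
        return "gpu-dropped"  # visible on PCIe, but the driver cannot reach it
    if driver_lost:
        return "no-gpu"
    return "ok"


def check_gpus(timeout=30):
    """Query the live system; needs nvidia-smi and lspci on the PATH."""
    smi_output, timed_out = run(["nvidia-smi"], timeout)
    lspci_output, _ = run(["lspci"], timeout)
    return classify(smi_output, timed_out, lspci_output)
```

Calling check_gpus() from a cron job or a training wrapper would give an early signal that the card has dropped, instead of waiting for nvidia-smi to time out by hand.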

After some analysis, the cause turned out to be that the GPU's Persistence Mode was not enabled.

From the NVIDIA document Driver Persistence (https://docs.nvidia.com/pdf/Driver_Persistence.pdf):

Under Linux systems where X runs by default on the target GPU the kernel mode driver will generally be initialized and kept alive from machine startup to shutdown, courtesy of the X process. On headless systems or situations where no long-lived X-like client maintains a handle to the target GPU, the kernel mode driver will initialize and deinitialize the target GPU each time a target GPU application starts and stops. In HPC environments this situation is quite common. Since it is often desirable to keep the GPU initialized in these cases, NVIDIA provides two options for changing driver behavior: Persistence Mode (Legacy) and the Persistence Daemon.

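On a typical machine with a GPU installed, the GPU driver is initialized when the machine boots and torn down when it shuts down. On Linux systems without a display (headless systems), which are very common in HPC, the driver instead initializes the GPU when a GPU program starts and deinitializes it when the program exits. NVIDIA provides two ways to keep the GPU initialized; we use Persistence Mode here: sudo nvidia-smi -pm 1.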
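To check which GPUs still need the setting before running the command, something like the following Python sketch could be used. It assumes nvidia-smi's CSV query output for the persistence_mode field looks like "0, Enabled"; the helper names are hypothetical, and enabling the mode requires root.

```python
import subprocess

# Query every GPU's index and persistence mode as headerless CSV.
QUERY = ["nvidia-smi", "--query-gpu=index,persistence_mode", "--format=csv,noheader"]


def parse_persistence(csv_text):
    """Map GPU index -> True if persistence mode is reported as Enabled."""
    state = {}
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        index, mode = (field.strip() for field in line.split(","))
        state[int(index)] = (mode == "Enabled")
    return state


def gpus_needing_persistence(csv_text):
    """Return the sorted indices of GPUs whose persistence mode is off."""
    return sorted(i for i, on in parse_persistence(csv_text).items() if not on)


def enable_command(index):
    """Build the command that enables persistence mode on one GPU."""
    # -i selects a single GPU, -pm 1 turns persistence mode on.
    return ["sudo", "nvidia-smi", "-i", str(index), "-pm", "1"]


def ensure_persistence():
    """Enable persistence mode on every GPU that does not have it yet."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    for index in gpus_needing_persistence(out):
        subprocess.run(enable_command(index), check=True)
```

The setting made with nvidia-smi -pm 1 does not survive a reboot, so a helper like ensure_persistence() would have to run at every boot. The Persistence Daemon mentioned in the NVIDIA text is the other option.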

With this mode enabled the GPU responds faster, at the cost of slightly higher idle power draw.

Before enabling Persistence Mode, nvidia-smi needs four to five seconds to load the driver after the command is entered:

Wed Apr 12 13:16:12 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 4090         On | 00000000:01:00.0 Off |                  Off |
| 46%   69C    P2              386W / 450W|  21517MiB / 24564MiB |     99%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2273      G   /usr/lib/xorg/Xorg                          145MiB |
|    0   N/A  N/A      2493    C+G   ...libexec/gnome-remote-desktop-daemon      390MiB |
|    0   N/A  N/A      2583      G   /usr/bin/gnome-shell                        151MiB |
|    0   N/A  N/A      6441      C   python                                    20816MiB |
+---------------------------------------------------------------------------------------+

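After enabling Persistence Mode, nvidia-smi prints its output immediately, and the Persistence-M field changes from Off to On. The trade-off is that power draw rises by a few watts when no program is running. To turn the mode off again, run: sudo nvidia-smi -pm 0.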
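The difference in response time can be measured rather than eyeballed. Below is a minimal Python sketch, assuming only that wall-clock time around the command is a good enough measure; time_command is a hypothetical helper name, and no measured numbers are implied.

```python
import subprocess
import time


def time_command(cmd, repeats=3):
    """Run cmd several times and return the wall-clock duration of each run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations
```

Running time_command(["nvidia-smi"]) once with the mode off and once with it on gives a direct before/after comparison on the same machine.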

Why does the GPU drop off when Persistence Mode is not enabled?

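As far as I can tell, without Persistence Mode the GPU driver has to be loaded and unloaded over and over as applications run. On one hand this costs a lot of performance. On the other hand, with the driver being loaded and unloaded so often and the GPU being re-initialized each time, CPU accesses to the PCIe config registers take too long and trigger a soft lockup, which hangs the GPU. (Source: https://bbs.gpuworld.cn/index.php?topic=10353.0)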
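One way to look for evidence of this explanation is to scan the kernel log after a drop. The Python sketch below is an assumption-laden illustration: it assumes that soft lockups and NVIDIA driver errors appear in dmesg with the substrings "soft lockup" and "NVRM: Xid", and the helper names are hypothetical. It does not prove the root cause; it only surfaces lines worth reading.

```python
import re
import subprocess

# Substrings assumed to mark kernel soft lockups and NVIDIA driver Xid errors.
FAULT_PATTERN = re.compile(r"soft lockup|NVRM: Xid", re.IGNORECASE)


def find_gpu_faults(log_text):
    """Return the log lines that match one of the assumed fault patterns."""
    return [line for line in log_text.splitlines() if FAULT_PATTERN.search(line)]


def read_kernel_log():
    """Read the kernel ring buffer; dmesg may need root on some systems."""
    return subprocess.run(["dmesg"], capture_output=True, text=True).stdout
```

For example, find_gpu_faults(read_kernel_log()) run right after nvidia-smi reports No devices were found would list any matching kernel messages from around the time the GPU dropped.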

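Since enabling Persistence Mode the machine has been running normally and the GPU has not dropped off again, so the problem appears to be solved.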

