Observing GPU memory and/or CPU OS memory leaks with `use_persistent_mapping` enabled in gdrdrv during multi-process termination #313
**Impacted platform**

All server-side products; first observed on a Grace Hopper based system.
**Impacted gdrcopy versions**

2.4.1, 2.4.2, 2.4.3
**Impacted gdrcopy configs**

The gdrdrv driver loaded with the module parameter `use_persistent_mapping=1` set.
**Scenarios**

If gdrcopy persistent mapping mode is enabled, leaks can arise in the following two scenarios.

If one process opens a connection to the driver (via `gdr_open`) and intends to explicitly share that connection (e.g., over a UNIX domain socket) with one or more other processes, then cleanup of the driver resources (via `gdr_close`) may be executed by one of the non-owning processes. Such a close is silently ignored, leading to CPU and GPU memory leaks. A sketch of this sharing pattern follows.
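A minimal sketch of the first scenario, under the assumption that the connection is shared as a raw file descriptor to the gdrdrv character device; the direct `open("/dev/gdrdrv", O_RDWR)` and the `send_fd`/`recv_fd` helpers are illustrative, since the gdrcopy API only exposes the opaque `gdr_t` handle:

```c
/*
 * Sketch of the first scenario: process A opens a gdrdrv connection and
 * passes the descriptor to process B over a UNIX domain socket
 * (SCM_RIGHTS); the last close then happens in B, a non-owning process.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/wait.h>
#include <unistd.h>

/* Pass one file descriptor across a connected UNIX domain socket. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type  = SCM_RIGHTS;
    c->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor from a connected UNIX domain socket. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    char ctrl[CMSG_SPACE(sizeof(int))];
    struct msghdr msg = { .msg_iov = &iov, .msg_iovlen = 1,
                          .msg_control = ctrl, .msg_controllen = sizeof(ctrl) };
    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    if (c == NULL || c->cmsg_type != SCM_RIGHTS)
        return -1;
    int fd;
    memcpy(&fd, CMSG_DATA(c), sizeof(int));
    return fd;
}

int main(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return 1;

    if (fork() == 0) {
        /* Process B (non-owner): receives and eventually closes the
         * descriptor. If this is the last reference, the driver's
         * release path runs in B's context; with
         * use_persistent_mapping=1 the cleanup is silently skipped. */
        int fd = recv_fd(sv[1]);
        if (fd >= 0)
            close(fd);
        _exit(0);
    }

    /* Process A (owner): opens the driver connection and shares it. */
    int gdrdrv_fd = open("/dev/gdrdrv", O_RDWR);
    if (gdrdrv_fd < 0) {
        perror("open /dev/gdrdrv");
        return 1;
    }
    send_fd(sv[0], gdrdrv_fd);
    close(gdrdrv_fd);   /* A drops its reference; B's close may be last */
    wait(NULL);
    return 0;
}
```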
If a parent process A forks one or more child processes B (plain Linux `fork` without `exec`), then connections opened by A may be closed by B during ungraceful termination of the processes via signals (`SIGSEGV` or `SIGKILL`), resulting in OS and GPU memory leaks, as in the sketch below.
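A minimal sketch of the second scenario, using the public `gdr_open` API from `gdrapi.h`; the parent exiting without `gdr_close` and the child raising `SIGSEGV` are illustrative stand-ins for ungraceful termination:

```c
/*
 * Sketch of the second scenario: fork without exec. The child inherits
 * a duplicate of the parent's gdrdrv file descriptor; because the
 * parent exits first, the final close happens in the child's
 * (non-owning) context when it is killed by a signal.
 */
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>
#include "gdrapi.h"

int main(void)
{
    gdr_t g = gdr_open();      /* connection owned by the parent */
    if (g == NULL) {
        fprintf(stderr, "gdr_open failed\n");
        return EXIT_FAILURE;
    }

    pid_t pid = fork();        /* fork WITHOUT exec */
    if (pid < 0)
        return EXIT_FAILURE;

    if (pid > 0) {
        /* Parent: terminates without gdr_close(); the child still
         * holds a duplicated descriptor, so the connection stays
         * open from the driver's point of view. */
        _exit(EXIT_SUCCESS);
    }

    /* Child: dies ungracefully; the kernel closes the inherited
     * descriptor on its behalf. That final close drives the driver's
     * release path in the child's context, which is silently ignored
     * when use_persistent_mapping=1, leaking the resources. */
    sleep(1);
    raise(SIGSEGV);
    return EXIT_SUCCESS;
}
```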
By default, persistent mode is disabled; in that case, under both scenarios, GPU resource cleanup is performed through an independent workflow in the CUDA driver, and hence dropping the request to close the connection is benign. Irrespective of persistent mode, however, this bug may lead to small CPU kernel memory leaks.
**Signature of the defect**

The `nvidia-persistenced` service may hang, requiring a reboot of the machine.

**Known mitigations**
Turn persistent mapping off by setting the driver module parameter `use_persistent_mapping=0` and reloading the driver. A quick way to verify the active setting is sketched below.
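As a sanity check after reloading, the active value is typically visible under sysfs if the module exposes the parameter as readable; the path below is an assumption based on the standard `/sys/module/<name>/parameters/` layout:

```c
/*
 * Check the active gdrdrv persistent-mapping setting via sysfs.
 * Assumes the module exposes the parameter as readable; the path
 * follows the standard /sys/module/<name>/parameters/ layout.
 */
#include <stdio.h>

int main(void)
{
    const char *path =
        "/sys/module/gdrdrv/parameters/use_persistent_mapping";
    FILE *f = fopen(path, "r");
    if (f == NULL) {
        perror(path);   /* module not loaded or parameter not exposed */
        return 1;
    }
    int ch = fgetc(f);
    fclose(f);
    printf("use_persistent_mapping=%c\n", ch);
    /* Bool module parameters read back as 'N'/'Y'; integer ones as 0/1. */
    return (ch == '0' || ch == 'N') ? 0 : 2;
}
```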
**Fixed gdrcopy version**

2.4.4