Not able to see GPU memory consumption as part of system metrics in aim stack. #3020

Closed
dushyantbehl opened this issue Sep 28, 2023 · 4 comments
Labels: type / bug (something isn't working), type / question

Comments

@dushyantbehl
Contributor

❓Question

I have been using Aim version 3.17.5 and am unable to see any GPU memory consumption for my runs.

The dashboard shows GPU utilization % and GPU temperature, but not the memory used. Is there a way to track down what is going on?

I am happy to share any information about my environment that you may need. Thanks in advance.
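For reference, a minimal sketch of how system metric tracking is typically enabled in Aim (the tracking interval and metric name below are illustrative, not taken from my actual setup):

    from aim import Run

    # System metrics (CPU, memory, GPU utilization, etc.) are collected in the
    # background every `system_tracking_interval` seconds while the run is open.
    run = Run(system_tracking_interval=10)

    for step in range(100):
        # Regular metric logging; GPU stats are gathered independently by
        # Aim's resource tracker.
        run.track(step * 0.1, name="loss", step=step)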

@dushyantbehl added the type / question label on Sep 28, 2023
@alberttorosyan added the type / bug label on Oct 2, 2023
@dushyantbehl
Contributor Author

Hi @alberttorosyan, thanks for marking this as a bug.
Could I be of any help here to fix things or dig deeper? Please let me know; I'd be happy to help.

@alberttorosyan
Member

@dushyantbehl, here's the code snippet which extracts the GPU information before passing it to Aim tracking methods:

# From Aim's resource stat collector; this runs inside the per-GPU loop,
# where `i` is the device index, `nvml` is the NVML Python binding, and
# `round10e5` is a rounding helper from the surrounding module.
gpu_info = dict()
handle = nvml.nvmlDeviceGetHandleByIndex(i)
try:
    # GPU utilization percent
    util = nvml.nvmlDeviceGetUtilizationRates(handle)
    gpu_info["gpu"] = round10e5(util.gpu)
except nvml.NVMLError_NotSupported:
    pass
try:
    # Device memory usage as a percentage of total memory
    memory = nvml.nvmlDeviceGetMemoryInfo(handle)
    # 'memory_used': round10e5(memory.used / 1024 / 1024),
    gpu_info["gpu_memory_percent"] = round10e5(memory.used * 100 / memory.total)
except nvml.NVMLError_NotSupported:
    pass
try:
    # Device temperature in degrees Celsius
    nvml_tmp = nvml.NVML_TEMPERATURE_GPU
    temp = nvml.nvmlDeviceGetTemperature(handle, nvml_tmp)
    gpu_info["gpu_temp"] = round10e5(temp)
except nvml.NVMLError_NotSupported:
    pass
try:
    # Power usage in watts and as a percentage of the enforced power limit
    power_watts = nvml.nvmlDeviceGetPowerUsage(handle) / 1000
    power_cap_watts = nvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000
    power_percent = power_watts / power_cap_watts * 100
    gpu_info["gpu_power_watts"] = round10e5(power_watts)
    # gpu_info["gpu_power_percent"] = round10e5(power_percent)
except nvml.NVMLError_NotSupported:
    pass

Each call to the nvml API is wrapped in a try/except block. If you don't see the power consumption, temperature, etc., it means that the specific call failed because the device does not support it.
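To rule out a device-support issue, you can query NVML directly, outside of Aim. A minimal check using the pynvml bindings (the exact NVML wrapper Aim uses may differ) could look like this:

    import pynvml as nvml

    nvml.nvmlInit()
    try:
        handle = nvml.nvmlDeviceGetHandleByIndex(0)
        try:
            # If this succeeds, the device reports memory info and a missing
            # gpu_memory_percent metric is not a device-support problem.
            memory = nvml.nvmlDeviceGetMemoryInfo(handle)
            print("used MiB:", memory.used / 1024 / 1024)
            print("total MiB:", memory.total / 1024 / 1024)
        except nvml.NVMLError_NotSupported:
            print("memory info is not supported on this device")
    finally:
        nvml.nvmlShutdown()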

@ChanderG
Contributor

@alberttorosyan It was not a device support problem, since using the nvml APIs directly worked. After some debugging, I found the cause. I have opened a PR here: #3044

@dushyantbehl
Contributor Author

Fix merged here - #3044
