Not able to see GPU memory consumption as part of system metrics in aim stack. #3020

Closed
dushyantbehl opened this issue Sep 28, 2023 · 4 comments
Labels: type / bug (something isn't working), type / question

Comments

@dushyantbehl
Contributor

❓Question

I have been using Aim version 3.17.5 and am unable to see any GPU memory consumption for my runs.

The dashboard shows GPU utilization % and GPU temperature, but not the memory used. Is there a way to track down what is going on?

I am happy to share any information about my environment that you may need. Thanks in advance.
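For reference, a minimal sketch of how system metric tracking is typically enabled in Aim (the tracking interval and metric name below are illustrative, not taken from my actual setup):

    from aim import Run

    # System metrics (CPU, memory, GPU utilization, etc.) are collected in the
    # background every `system_tracking_interval` seconds while the run is open.
    run = Run(system_tracking_interval=10)

    for step in range(100):
        # Regular metric logging; GPU stats are gathered independently by
        # Aim's resource tracker.
        run.track(step * 0.1, name="loss", step=step)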

@dushyantbehl added the type / question label on Sep 28, 2023
@alberttorosyan added the type / bug label on Oct 2, 2023
@dushyantbehl
Contributor Author

Hi @alberttorosyan, thanks for marking this as a bug.
Could I be of any help here to fix things or dig deeper? Please let me know; I'd be happy to help.

@alberttorosyan
Member

@dushyantbehl, here's the code snippet which extracts the GPU information before passing it to Aim tracking methods:

# From Aim's resource stat collector; this runs inside the per-GPU loop,
# where `i` is the device index, `nvml` is the NVML Python binding, and
# `round10e5` is a rounding helper from the surrounding module.
gpu_info = dict()
handle = nvml.nvmlDeviceGetHandleByIndex(i)
try:
    # GPU utilization percent
    util = nvml.nvmlDeviceGetUtilizationRates(handle)
    gpu_info["gpu"] = round10e5(util.gpu)
except nvml.NVMLError_NotSupported:
    pass
try:
    # Device memory usage as a percentage of total memory
    memory = nvml.nvmlDeviceGetMemoryInfo(handle)
    # 'memory_used': round10e5(memory.used / 1024 / 1024),
    gpu_info["gpu_memory_percent"] = round10e5(memory.used * 100 / memory.total)
except nvml.NVMLError_NotSupported:
    pass
try:
    # Device temperature in degrees Celsius
    nvml_tmp = nvml.NVML_TEMPERATURE_GPU
    temp = nvml.nvmlDeviceGetTemperature(handle, nvml_tmp)
    gpu_info["gpu_temp"] = round10e5(temp)
except nvml.NVMLError_NotSupported:
    pass
try:
    # Power usage in watts and as a percentage of the enforced power limit
    power_watts = nvml.nvmlDeviceGetPowerUsage(handle) / 1000
    power_cap_watts = nvml.nvmlDeviceGetEnforcedPowerLimit(handle) / 1000
    power_percent = power_watts / power_cap_watts * 100
    gpu_info["gpu_power_watts"] = round10e5(power_watts)
    # gpu_info["gpu_power_percent"] = round10e5(power_percent)
except nvml.NVMLError_NotSupported:
    pass

Each call to the nvml API is wrapped in a try/except block. If you don't see the power consumption, temperature, etc., it means that the specific call failed because the device does not support it.
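To rule out a device-support issue, you can query NVML directly, outside of Aim. A minimal check using the pynvml bindings (the exact NVML wrapper Aim uses may differ) could look like this:

    import pynvml as nvml

    nvml.nvmlInit()
    try:
        handle = nvml.nvmlDeviceGetHandleByIndex(0)
        try:
            # If this succeeds, the device reports memory info and a missing
            # gpu_memory_percent metric is not a device-support problem.
            memory = nvml.nvmlDeviceGetMemoryInfo(handle)
            print("used MiB:", memory.used / 1024 / 1024)
            print("total MiB:", memory.total / 1024 / 1024)
        except nvml.NVMLError_NotSupported:
            print("memory info is not supported on this device")
    finally:
        nvml.nvmlShutdown()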

@ChanderG
Contributor

@alberttorosyan It was not a device support problem, since using the nvml APIs directly worked. After some debugging, I found the cause. I have opened a PR here: #3044

@dushyantbehl
Contributor Author

Fix merged here - #3044
