Add System Monitor Class for Enhanced Debugging (CPU and net i/o) #1242
Conversation
In my latest commit, I renamed the CpuMonitor class to SystemMonitor to better reflect its enhanced functionality. It now monitors data sent and received on the user's machine using psutil, covering all I/O operations, even those that do not involve Lithops storage, such as boto3 usage. This monitoring includes the user's data plus a minimal container-related overhead of approximately 0.01 to 0.025 MB, as verified in testing. Additionally, users can now monitor their runtime's object storage read and write operations.
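For illustration, here is a minimal sketch of how this kind of network accounting can be done with psutil; the class shape and method names below are assumptions for the example and not necessarily identical to the code in the PR:

import psutil

class SystemMonitor:
    # Illustrative sketch: track bytes sent and received since start()

    def start(self):
        # Snapshot the machine-wide network counters at the start of the task
        self._net_start = psutil.net_io_counters()

    def get_network_io(self):
        # Return (sent, received) bytes accumulated since start()
        net_end = psutil.net_io_counters()
        sent = net_end.bytes_sent - self._net_start.bytes_sent
        recv = net_end.bytes_recv - self._net_start.bytes_recv
        return sent, recv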
I have a few comments:
Force-pushed from ce48f17 to bca73d3.
Added start and end methods to SystemMonitor to avoid the use of threads, and no interval is specified now. According to the psutil documentation, cpu_times should be called once and its result ignored; the second call then returns the CPU usage accumulated since the previous call.
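A hedged sketch of that start/end pattern, assuming psutil's non-blocking mode (interval=None), where the first call only primes the internal counters and is deliberately ignored. The method names mirror the ones mentioned in this review, but the bodies are illustrative:

import psutil

class SystemMonitor:
    # Sketch of the thread-free start/end measurement pattern

    def start(self):
        # First non-blocking call primes psutil's counters; its return
        # value is meaningless and is ignored on purpose.
        psutil.cpu_percent(interval=None, percpu=True)
        self._times_start = psutil.cpu_times()

    def calculate_cpus_values(self):
        # Second call reports per-CPU usage accumulated since start()
        cpu_usage = psutil.cpu_percent(interval=None, percpu=True)
        times_end = psutil.cpu_times()
        cpu_system_time = times_end.system - self._times_start.system
        cpu_user_time = times_end.user - self._times_start.user
        return cpu_usage, cpu_system_time, cpu_user_time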
lithops/worker/handler.py
Outdated
# Get and log System statistics if log level is DEBUG
if task.log_level == logging.DEBUG:
    cpu_usage, cpu_system_time, cpu_user_time = sys_monitor.calculate_cpus_values()

    call_status.add('worker_func_cpu_usage', cpu_usage)
    call_status.add('worker_func_cpu_system_time', round(cpu_system_time, 8))
    call_status.add('worker_func_cpu_user_time', round(cpu_user_time, 8))
    call_status.add('worker_func_cpu_total_time', round(cpu_system_time + cpu_user_time, 8))

    net_io = sys_monitor.get_network_io()
    call_status.add('worker_func_sent_net_io', net_io[0])
    call_status.add('worker_func_recv_net_io', net_io[1])
Looks better now, thanks for the changes.
Since it no longer runs every second, I was thinking about the possibility of removing the if task.log_level == logging.DEBUG: check and always returning these stats, provided the overhead of this code is negligible. So, can you check what the overhead of this block is in seconds? I think you only need to add a logger.info("---to be removed---") at line 233 and compare its timestamp with that of the logger.debug('JobRunner process finished') in the function logs.
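As an aside, a simple alternative way to measure that overhead (a sketch, not part of the PR) is to time the stats block directly:

import logging
import time

logger = logging.getLogger(__name__)

t0 = time.perf_counter()
# ... block that collects the CPU, memory and network stats ...
elapsed = time.perf_counter() - t0
logger.info('---to be removed--- stats collection took %.5f seconds', elapsed)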
Hello, since we're considering both network and CPU usage, could memory usage also be an addition? Currently I am wrapping my pipeline around Lithops workers to extract that information. Thanks.
Hi @JosepSampe, the added monitoring introduces an overhead of approximately 0.00516 seconds to the execution time.
Hi @abourramouss, I've added the key memory metrics to our stats for a better understanding of resource use. Could you check whether these metrics look good to you?
Does this overhead include the memory stats? If so, or if it is similar, it is negligible, so you can remove the if task.log_level == logging.DEBUG: check.
Yes, the time measurements we're capturing include memory, CPU stats, and everything else. The maximum overhead I've seen so far has been 0.01 seconds.
@JosepSampe @Usamasource Can you give me your opinion? I think the current approach for non-debugging purposes is excellent, as it provides an overview of what's happening at each worker with a really small overhead.

I want to propose an approach for debugging purposes; it's something I have been using in a pipeline and might be useful for Lithops. In my pipeline (currently about radio interferometry) I wrapped the Lithops futures in my "custom" profiling framework, which consists of a monitoring process that is spawned alongside the Lithops process. The two processes communicate through a pipe, and the monitoring process constantly polls statistics (each second) from the Lithops process. When the Lithops process finishes, it sends a signal to the monitoring process to stop the collection of statistics, which are then sent back to the Lithops process and forwarded to the client. (And now that I think about it, we could send them in a streaming fashion, where data is sent constantly to the client via a socket?)

Apart from that, I added a decorator to check the execution times of several functions within a step, which are returned to the client. This results in two types of plots: a Gantt plot showing the execution time of each function within a step, and the SystemMonitor stats. Sadly, there is no automatic way (that I know of) to get the time each function took within each worker, so I had to rely on decorators. The plots show the average usage of resources and times for each worker; obviously the user would be able to plot whatever they want, such as the aggregated resource usage.

An issue with this approach is that if the polling interval is small and the duration is long enough, the logs can be huge.
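A minimal sketch of the pipe-based polling scheme described above, using plain multiprocessing and psutil; the names and the one-second interval are illustrative, not the actual pipeline code:

import os
import time
import psutil
from multiprocessing import Pipe, Process


def monitor(conn, target_pid, interval=1.0):
    # Poll the target process every `interval` seconds until a stop message arrives
    proc = psutil.Process(target_pid)
    samples = []
    while not conn.poll(interval):
        with proc.oneshot():
            samples.append({
                'time': time.time(),
                # first cpu_percent sample is 0.0 and can be discarded
                'cpu_percent': proc.cpu_percent(interval=None),
                'rss': proc.memory_info().rss,
            })
    conn.send(samples)   # ship the collected stats back through the pipe
    conn.close()


if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=monitor, args=(child_conn, os.getpid()))
    p.start()
    # ... run the actual workload here ...
    parent_conn.send('stop')       # tell the monitor to stop polling
    stats = parent_conn.recv()     # receive the recorded samples
    p.join()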
@Usamasource Have you checked that the new stats are correct? For example, running a call_async() function that reads 100MB of data from S3, do the network stats match that volume? Why does worker_func_cpu_usage in stats.md show 16 cores? Also, how are you capturing the memory stats, and have you verified that they are accurate?
@abourramouss I understand the first Gantt diagram is more related to your app and any user can generate their own, but the metrics you show in the 2nd diagram look nice and general purpose. Are you using the same library as here (psutil)? How does this differ from what is being added here? Does it make sense to include both functionalities? If so, can you submit a separate PR? Probably it would make sense to extend the
Yes, the metrics in the 2nd diagram are generic and do not need any decorators.
So the difference is that the current PR is more of a global overview of what's happening inside each worker; the great thing about it is that it has very little overhead, and it's a good idea to have these stats always. Yes, everything is psutil. I think it does make sense to include both functionalities, but since intuition tells me the second one adds a lot of overhead, it might be a better idea to use it only in debugging mode, i.e. when the user doesn't know what's happening inside the worker, or maybe a process doesn't scale well enough. Then you can have a light monitoring for everyday use (this PR) and a heavy one (that probes the Lithops process every second) to get detailed statistics when debugging.
Hi there, I have been checking the CPU monitoring utility lately and found an observation worth mentioning (at least in its documentation). I checked its behaviour with the following CPU-bound function (with s=5000):

import numpy as np

def calculation(s):
    # Create two large matrices
    matrix1 = np.random.rand(s, s)
    matrix2 = np.random.rand(s, s)
    # Perform dot product
    result = np.dot(matrix1, matrix2)
    return result

Executed locally on a single CPU, this gives a CPU usage of almost 99%. When executed on AWS Lambda (at 1769MB of runtime memory, which should provide the equivalent of a single vCPU), the result was essentially the same as the local one, but split into two.
Which makes sense, as Lambda at 1769MB of runtime memory seems to give you the capacity of a single CPU through two non-concurrent vCPUs. Maybe the CPU monitoring results should be pairwise-summed, or at least a note on result interpretation should be added to the documentation.
Yes, it has been confirmed that the network traffic statistics match exactly the volume of data handled, with a minimal overhead of 0.01 to 0.025 MB. This was verified using various functions such as call_async and map.
The display of 16 cores in stats.md for worker_func_cpu_usage is due to a test conducted on a local machine with 16 cores. I will update stats.md with a test on a machine with fewer cores to simplify the example.
Using the get_total_memory_usage function, I measured the total memory usage, including both the main process and its child processes. This method yielded a usage of 59.87109375 MB. Comparing this result with the worker_func_rss value, which was 59.875 MB and represents the physical memory used, shows a close match. This confirms the accuracy of the memory measurement.
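For reference, a hedged sketch of this kind of cross-check: get_total_memory_usage is the name used in the comment above, but the implementation below is an assumption, simply summing the RSS of the worker process and its children with psutil:

import os
import psutil


def get_total_memory_usage():
    # Sum the resident set size (RSS) of this process and all of its children, in MB
    parent = psutil.Process(os.getpid())
    total = parent.memory_info().rss
    for child in parent.children(recursive=True):
        total += child.memory_info().rss
    return total / 1024 ** 2


print(f'Total memory usage: {get_total_memory_usage()} MB')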
@Usamasource Thanks for the verification. Is the PR ready now? Any comments on German's post above?
Hi @JosepSampe, it's ready now, but I'll continue working on the monitoring part to monitor the nodes. As for @GEizaguirre's post, in this Lambda case summing the CPU usages is the best option. However, in a local environment with 8 cores, summing in pairs would produce 4 values and hide the presence of 8 cores. Would it be better to detail this difference in the documentation, or is it preferable to calculate the average CPU usage and consolidate it into a single value? I reported per-CPU values to give more detail on how the load is distributed among CPUs and to verify whether the machine's resources are being fully utilized.
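A small sketch of the two interpretations being discussed, starting from the per-CPU percentages psutil reports (the example values in the comments follow the observations above and are illustrative):

import psutil

# Per-CPU usage over a one-second window, e.g. two values adding up to ~99%
# on Lambda at 1769MB, or eight values on a local 8-core machine
per_cpu = psutil.cpu_percent(interval=1, percpu=True)

# Option discussed for Lambda: sum pairs of vCPUs that share one physical CPU's capacity
pairwise = [per_cpu[i] + per_cpu[i + 1] for i in range(0, len(per_cpu) - 1, 2)]

# Alternative: consolidate everything into a single average value
average = sum(per_cpu) / len(per_cpu)

print(per_cpu, pairwise, average)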
Description:
Key Features:
Apologies for the previous pull request; it seems I had an issue with Git while using WSL. I removed the unused imports, so I think my code is now clean. Thank you.