Add System Monitor Class for Enhanced Debugging (CPU and net i/o) #1242
Conversation
In my latest commit, I renamed the CpuMonitor class to SystemMonitor to better reflect its enhanced functionality. It now monitors data sent and received on the user's machine using psutil, covering all I/O operations, even those that do not involve Lithops storage, such as boto3 usage. This monitoring includes the user's data plus a minimal container-related overhead of approximately 0.01 to 0.025 MB, as verified in testing. Additionally, users can now monitor their runtime's object storage read and write operations.
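For illustration, here is a minimal sketch of how this kind of network accounting can be done with psutil; the class shape and method names below are assumptions for the example and not necessarily identical to the code in the PR:

import psutil

class SystemMonitor:
    # Illustrative sketch: track bytes sent and received since start()

    def start(self):
        # Snapshot the machine-wide network counters at the start of the task
        self._net_start = psutil.net_io_counters()

    def get_network_io(self):
        # Return (sent, received) bytes accumulated since start()
        net_end = psutil.net_io_counters()
        sent = net_end.bytes_sent - self._net_start.bytes_sent
        recv = net_end.bytes_recv - self._net_start.bytes_recv
        return sent, recv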
I have a few comments:
Force-pushed from ce48f17 to bca73d3.
Added start and end methods to SystemMonitor to avoid the use of threads, and no interval is specified now. According to the psutil documentation, cpu_times should be called once and its result ignored; the second call then returns the CPU usage accumulated since the previous call.
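A hedged sketch of that start/end pattern, assuming psutil's non-blocking mode (interval=None), where the first call only primes the internal counters and is deliberately ignored. The method names mirror the ones mentioned in this review, but the bodies are illustrative:

import psutil

class SystemMonitor:
    # Sketch of the thread-free start/end measurement pattern

    def start(self):
        # First non-blocking call primes psutil's counters; its return
        # value is meaningless and is ignored on purpose.
        psutil.cpu_percent(interval=None, percpu=True)
        self._times_start = psutil.cpu_times()

    def calculate_cpus_values(self):
        # Second call reports per-CPU usage accumulated since start()
        cpu_usage = psutil.cpu_percent(interval=None, percpu=True)
        times_end = psutil.cpu_times()
        cpu_system_time = times_end.system - self._times_start.system
        cpu_user_time = times_end.user - self._times_start.user
        return cpu_usage, cpu_system_time, cpu_user_time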
lithops/worker/handler.py
Outdated
# Get and log System statistics if log level is DEBUG
if task.log_level == logging.DEBUG:
    cpu_usage, cpu_system_time, cpu_user_time = sys_monitor.calculate_cpus_values()

    call_status.add('worker_func_cpu_usage', cpu_usage)
    call_status.add('worker_func_cpu_system_time', round(cpu_system_time, 8))
    call_status.add('worker_func_cpu_user_time', round(cpu_user_time, 8))
    call_status.add('worker_func_cpu_total_time', round(cpu_system_time + cpu_user_time, 8))

    net_io = sys_monitor.get_network_io()
    call_status.add('worker_func_sent_net_io', net_io[0])
    call_status.add('worker_func_recv_net_io', net_io[1])
Looks better now, thanks for the changes.
Since it no longer runs every second, I was thinking about the possibility of removing the if task.log_level == logging.DEBUG: check and always returning these stats, provided the overhead of this code is negligible. So, can you check what the overhead of this block is in seconds? I think you only need to add a logger.info("---to be removed---") at line 233 and compare its timestamp with that of the logger.debug('JobRunner process finished') in the function logs.
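As an aside, a simple alternative way to measure that overhead (a sketch, not part of the PR) is to time the stats block directly:

import logging
import time

logger = logging.getLogger(__name__)

t0 = time.perf_counter()
# ... block that collects the CPU, memory and network stats ...
elapsed = time.perf_counter() - t0
logger.info('---to be removed--- stats collection took %.5f seconds', elapsed)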
Hello, since we're considering both network and CPU usage, could memory usage also be an addition? Currently I am wrapping my pipeline around Lithops workers to extract that information. Thanks.
Hi @JosepSampe, the added monitoring introduces an overhead of approximately 0.00516 seconds to the execution time.
Hi @abourramouss, I've added the key memory metrics to our stats for a better understanding of resource use. Could you check whether these metrics look good to you?
Does this overhead include the memory stats? If so, or if it is similar, it is negligible, so you can remove the if task.log_level == logging.DEBUG: check.
Yes, the time measurements we're capturing include memory, CPU stats, and everything else. The maximum overhead I've seen so far has been 0.01 seconds.
@JosepSampe @Usamasource Can you give me your opinion? I think the current approach for non-debugging purposes is excellent, as it provides an overview of what's happening at each worker with a really small overhead.

I want to propose an approach for debugging purposes; it's something I have been using in a pipeline and might be useful for Lithops. In my pipeline (currently about radio interferometry) I wrapped the Lithops futures in my "custom" profiling framework, which consists of a monitoring process that is spawned alongside the Lithops process. The two processes communicate through a pipe, and the monitoring process constantly polls statistics (each second) from the Lithops process. When the Lithops process finishes, it sends a signal to the monitoring process to stop the collection of statistics, which are then sent back to the Lithops process and forwarded to the client. (And now that I think about it, we could send them in a streaming fashion, where data is sent constantly to the client via a socket?)

Apart from that, I added a decorator to check the execution times of several functions within a step, which are returned to the client. This results in two types of plots: a Gantt plot showing the execution time of each function within a step, and the SystemMonitor stats. Sadly, there is no automatic way (that I know of) to get the time each function took within each worker, so I had to rely on decorators. The plots show the average usage of resources and times for each worker; obviously the user would be able to plot whatever they want, such as the aggregated resource usage.

An issue with this approach is that if the polling interval is small and the duration is long enough, the logs can be huge.
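A minimal sketch of the pipe-based polling scheme described above, using plain multiprocessing and psutil; the names and the one-second interval are illustrative, not the actual pipeline code:

import os
import time
import psutil
from multiprocessing import Pipe, Process


def monitor(conn, target_pid, interval=1.0):
    # Poll the target process every `interval` seconds until a stop message arrives
    proc = psutil.Process(target_pid)
    samples = []
    while not conn.poll(interval):
        with proc.oneshot():
            samples.append({
                'time': time.time(),
                # first cpu_percent sample is 0.0 and can be discarded
                'cpu_percent': proc.cpu_percent(interval=None),
                'rss': proc.memory_info().rss,
            })
    conn.send(samples)   # ship the collected stats back through the pipe
    conn.close()


if __name__ == '__main__':
    parent_conn, child_conn = Pipe()
    p = Process(target=monitor, args=(child_conn, os.getpid()))
    p.start()
    # ... run the actual workload here ...
    parent_conn.send('stop')       # tell the monitor to stop polling
    stats = parent_conn.recv()     # receive the recorded samples
    p.join()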
@Usamasource Have you checked that the new stats are correct? For example, running a call_async() function that reads 100MB of data from S3, do the network stats match that volume? Why does worker_func_cpu_usage in stats.md show 16 cores? Also, how are you capturing the memory stats, and have you verified that they are accurate?
@abourramouss I understand the first Gantt diagram is more related to your app and any user can generate their own, but the metrics you show in the 2nd diagram look nice and general purpose. Are you using the same library as here (psutil)? How does this differ from what is being added here? Does it make sense to include both functionalities? If so, can you submit a separate PR? Probably it would make sense to extend the
Yes, the metrics in the 2nd diagram are generic and do not need any decorators.
So the difference is that the current PR is more of a global overview of what's happening inside each worker; the great thing about it is that it has very little overhead, and it's a good idea to have these stats always. Yes, everything is psutil. I think it does make sense to include both functionalities, but since intuition tells me the second one adds a lot of overhead, it might be a better idea to use it only in debugging mode, i.e. when the user doesn't know what's happening inside the worker, or maybe a process doesn't scale well enough. Then you can have a light monitoring for everyday use (this PR) and a heavy one (that probes the Lithops process every second) to get detailed statistics when debugging.
Hi there, I have been checking the CPU monitoring utility lately and found an observation worth mentioning (at least in its documentation). I checked its behaviour with the following CPU-bound function (with s=5000):

import numpy as np

def calculation(s):
    # Create two large matrices
    matrix1 = np.random.rand(s, s)
    matrix2 = np.random.rand(s, s)
    # Perform dot product
    result = np.dot(matrix1, matrix2)
    return result

Executed locally on a single CPU, this gives a CPU usage of almost 99%. When executed on AWS Lambda (at 1769MB of runtime memory, which should provide the equivalent of a single vCPU), the result was essentially the same as the local one, but split into two.
Which makes sense, as Lambda at 1769MB of runtime memory seems to give you the capacity of a single CPU through two non-concurrent vCPUs. Maybe the CPU monitoring results should be pairwise-summed, or at least a note on result interpretation should be added to the documentation.
Yes, it has been confirmed that the network traffic statistics match exactly the volume of data handled, with a minimal overhead of 0.01 to 0.025 MB. This was verified using various functions such as call_async and map.
The display of 16 cores in stats.md for worker_func_cpu_usage is due to a test conducted on a local machine with 16 cores. I will update stats.md with a test on a machine with fewer cores to simplify the example.
Using the get_total_memory_usage function, I measured the total memory usage, including both the main process and its child processes. This method yielded a usage of 59.87109375 MB. Comparing this result with the worker_func_rss value, which was 59.875 MB and represents the physical memory used, shows a close match. This confirms the accuracy of the memory measurement.
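For reference, a hedged sketch of this kind of cross-check: get_total_memory_usage is the name used in the comment above, but the implementation below is an assumption, simply summing the RSS of the worker process and its children with psutil:

import os
import psutil


def get_total_memory_usage():
    # Sum the resident set size (RSS) of this process and all of its children, in MB
    parent = psutil.Process(os.getpid())
    total = parent.memory_info().rss
    for child in parent.children(recursive=True):
        total += child.memory_info().rss
    return total / 1024 ** 2


print(f'Total memory usage: {get_total_memory_usage()} MB')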
@Usamasource Thanks for the verification. Is the PR ready now? Any comments on German's post above?
Hi @JosepSampe, it's ready now, but I'll continue working on the monitoring part to monitor the nodes. As for @GEizaguirre's post, in this Lambda case summing the CPU usages is the best option. However, in a local environment with 8 cores, summing in pairs would produce 4 values and hide the presence of 8 cores. Would it be better to detail this difference in the documentation, or is it preferable to calculate the average CPU usage and consolidate it into a single value? I reported per-CPU values to give more detail on how the load is distributed among CPUs and to verify whether the machine's resources are being fully utilized.
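A small sketch of the two interpretations being discussed, starting from the per-CPU percentages psutil reports (the example values in the comments follow the observations above and are illustrative):

import psutil

# Per-CPU usage over a one-second window, e.g. two values adding up to ~99%
# on Lambda at 1769MB, or eight values on a local 8-core machine
per_cpu = psutil.cpu_percent(interval=1, percpu=True)

# Option discussed for Lambda: sum pairs of vCPUs that share one physical CPU's capacity
pairwise = [per_cpu[i] + per_cpu[i + 1] for i in range(0, len(per_cpu) - 1, 2)]

# Alternative: consolidate everything into a single average value
average = sum(per_cpu) / len(per_cpu)

print(per_cpu, pairwise, average)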
Description:
Key Features:
Apologies for the previous pull request; it seems I had an issue with Git while using WSL. I removed the unused imports, so I think my code is now clean. Thank you.