MacOS support #119

Merged: 35 commits merged into master from use-less-semaphores on Feb 16, 2024
Conversation

sybrenjansen (Owner)

  • Added support for macOS

    • Fixes memory leaks on macOS
    • Reduces the number of semaphores used
    • Issues a warning when cpu_ids is used on macOS

Fixes #27
Fixes #79
Fixes #91

@harmenwassenaar most of the work can be reviewed commit by commit; I think that will make it much easier to process. The main things were to:

  • Reduce the number of semaphores used by the WorkerComms (which helps when using it on macOS). Event and Condition objects use a lot of semaphores, so I'm using different types now. For 10 workers, this reduces the number of semaphores from 310 to 190, for example (see the sketch after this list).
  • Fix the semaphore leaks. The leaks came from the tqdm utils module, where I created a lock and an event globally. The dashboard also had leaks, which I fixed as well.
  • Refactor all of the manager-related code, because after reading more about managers I realized I was using them incorrectly. They were related to the leaks as well.
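
To make the first bullet concrete, here is a minimal sketch of the general idea only, not mpire's actual code; the names (shutdown_event, shutdown_flag, worker_loop) are made up. On CPython, a multiprocessing Event is built from a Condition plus a Semaphore, and a Condition itself wraps a lock and several more semaphores, so a flag that is only ever polled can be much cheaper as a raw shared Value:

```python
import time
from multiprocessing import Event, Process, Value


def worker_loop(flag):
    """Toy worker that polls a shared flag instead of waiting on an Event."""
    while not flag.value:
        time.sleep(0.1)  # do a chunk of work, then check the flag again


if __name__ == "__main__":
    # Semaphore-heavy variant: every Event costs multiple OS semaphores.
    shutdown_event = Event()

    # Lighter variant: a raw, unsynchronized shared byte costs no semaphore
    # and is enough for a flag that is only ever polled.
    shutdown_flag = Value('b', 0, lock=False)

    p = Process(target=worker_loop, args=(shutdown_flag,))
    p.start()
    shutdown_flag.value = 1  # signal shutdown without touching any semaphore
    p.join()
```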

@sybrenjansen (Owner, Author)

I'll fix the failed build tomorrow :)

@harmenwassenaar (Collaborator) left a comment

I don't really have a good overview of how the communication between the main process and the workers works. Is there some documentation for that, or is studying the code the only way to figure it out?
For example, are all of those shared variables only written by the main process? And which ones are actually used/read by the workers? Do the workers keep track of whether they time out, or does the main process do that? I'd kind of expect the latter, but then why is it shared memory?

Review threads on mpire/comms.py and mpire/insights.py (resolved)
@sybrenjansen (Owner, Author)

Ugh... the leaked semaphores are back :/ Will do some additional digging

@sybrenjansen (Owner, Author)

Yay! The leaked semaphores are all gone now :D

@sybrenjansen (Owner, Author) commented Feb 6, 2024

I don't really have a good overview of how the communication between the main process and the workers works. Is there some documentation for that, or is studying the code the only way to figure it out?

I'm going to add this to the PR:

General overview:

  • When map or imap is used, the workers need to return the idx of the task they just completed. This is
    needed to return the results in order. This is communicated by using the _keep_order boolean value.
  • The main process assigns tasks to the workers by using their respective task queue (_task_queues). When no
    tasks have been completed yet, the main process assigns tasks in order. To determine which worker to assign the
    next task to, the main process uses the _task_idx counter. When tasks have been completed, the main process
    assigns the next task to the worker that completed the last task. This is communicated by using the
    _last_completed_task_worker_id deque.
  • Each worker keeps track of whether it is running a task by using the _worker_running_task boolean value. This
    is used by the main process in case a worker needs to be interrupted (due to an exception somewhere else).
    When a worker is not busy with any task at the moment, the worker will exit itself because of the
    _exception_thrown event that is set in such cases. However, when it is running a task, we want to interrupt
    it. The lock object _worker_running_task_locks is used to ensure that there are no race conditions when
    accessing the _worker_running_task boolean value.
  • Each worker also keeps track of which job it is working on by using the _worker_working_on_job array. This is
    needed because, when a task times out, we need to know which job to set to failed.
  • The workers communicate their results to the main process by using the results queue (_results_queue). Each
    worker keeps track of how many results it has added to the queue (_results_added), and the main process
    keeps track of how many results it has received from each worker (_results_received). This is used by the
    workers to know when they can safely exit.
  • Workers can request a restart when a maximum lifespan is configured and reached. This is done by setting the
    _worker_restart_array boolean array. The main process listens to this array and restarts the worker when
    needed. The _worker_restart_condition is used to signal the main process that a worker needs to be
    restarted.
  • The _workers_dead array is used to keep track of which workers are alive and which are not. Sometimes, a
    worker can be terminated by the OS (e.g., OOM), which we want to pick up on. The main process checks regularly
    whether a worker is still alive according to the OS and according to the worker itself. If the OS says it's
    dead, but the value in _workers_dead is still False, we know something went wrong.
  • The _workers_time_task_started array is used to keep track of when a worker started a task. This is used by
    the main process to check whether a worker times out.
  • Exceptions are communicated by using the _exception_thrown event. Both the main process and the workers can set
    this event. The main process will set it when, for example, a timeout has been reached while running a map
    task. The source of the exception is stored in the _exception_job_id value, which is used by the main
    process to obtain the exception and raise accordingly.
  • The workers communicate every 0.1 seconds how many tasks they have completed. This is used by the main process to
    update the progress bar. The workers communicate this by using the _tasks_completed_array array. The
    _progress_bar_last_updated datetime object is used to keep track of when the last update was sent. The
    _progress_bar_shutdown boolean value is used to signal the progress bar handler thread to shut down. The
    _progress_bar_complete event is used to signal the main process and workers that the progress bar is
    complete and that it's safe to exit.
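
The progress bar mechanism in the last bullet boils down to a simple throttling pattern. Below is a hypothetical sketch of it, not mpire's actual implementation; the names (tasks_completed, get_and_run_task, update_progress_bar) are invented for illustration:

```python
import time
from multiprocessing import Array

PROGRESS_UPDATE_INTERVAL = 0.1  # seconds, the interval described above


def make_tasks_completed_array(n_jobs):
    """One slot per worker, protected by the array's built-in lock."""
    return Array('L', n_jobs)


def worker_progress_loop(worker_id, tasks_completed, get_and_run_task):
    """Worker side: batch completed-task counts and publish them at most every 0.1 s."""
    done_since_update = 0
    last_updated = time.monotonic()
    while get_and_run_task():  # assumed to return False when there is no more work
        done_since_update += 1
        now = time.monotonic()
        if now - last_updated >= PROGRESS_UPDATE_INTERVAL:
            with tasks_completed.get_lock():
                tasks_completed[worker_id] += done_since_update
            done_since_update = 0
            last_updated = now
    if done_since_update:  # flush the remainder before exiting
        with tasks_completed.get_lock():
            tasks_completed[worker_id] += done_since_update


def progress_bar_handler(tasks_completed, total_tasks, update_progress_bar):
    """Main side: poll the shared array and feed a progress bar until everything is in."""
    shown = 0
    while shown < total_tasks:
        time.sleep(PROGRESS_UPDATE_INTERVAL)
        with tasks_completed.get_lock():
            shown = sum(tasks_completed)
        update_progress_bar(shown)
```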

For example, are all of those shared variables only written by the main process?

No, the WorkerComms has all the communication primitives needed to communicate between the main process and the workers. They can all read from and write to these variables. The workers write a lot and the main process reads a lot, but the main process also writes to some of them, and the workers read some of them.

And which ones are actually used/read by the workers?

Most of them (descriptions of what they do are in comms.py):

  • _keep_order: written by main, read by workers.
  • _task_queues: written by main, read by workers
  • _task_idx: only used by main
  • _worker_running_task_locks: used by main and workers
  • _worker_running_task: written by workers and also main sometimes, read by main
  • _results_queue: read by main, written by workers
  • _results_added: main only
  • _results_received: written by main, read by workers
  • _worker_restart_array: written by main and workers
  • _worker_restart_condition: used by main and workers
  • _workers_dead: read by main, written by workers
  • _workers_time_task_started: read by main, written by workers
  • _exception_lock: used by main and workers
  • _exception_thrown: used by main and workers
  • _exception_job_id: written by main and workers, read by main
  • _kill_signal_received: read by main, written by workers
  • _tasks_completed_array: read by main, written by workers
  • _progress_bar_last_updated: used by workers only
  • _progress_bar_shutdown: main only (but is used to communicate with Progress bar handler, which lives in a thread). I can actually make this a threading.Value from the looks of it
  • _progress_bar_complete: written by main, read by workers
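
To illustrate one of these main/worker interactions more concretely, here is a hedged sketch of the restart signalling pattern (the shared boolean array written by the workers plus the condition used to wake up the main process). It is not mpire's actual code; the function and variable names are made up:

```python
import multiprocessing as mp


def make_restart_primitives(n_jobs, ctx=None):
    """Shared boolean array (one slot per worker) plus a condition to wake up main."""
    ctx = ctx or mp.get_context()
    restart_array = ctx.Array('b', n_jobs)
    restart_condition = ctx.Condition(ctx.Lock())
    return restart_array, restart_condition


def request_restart(worker_id, restart_array, restart_condition):
    """Worker side: flag yourself for a restart and notify the main process."""
    with restart_condition:
        restart_array[worker_id] = True
        restart_condition.notify()


def wait_for_restart_requests(restart_array, restart_condition):
    """Main side: block until at least one worker asked for a restart, then clear the flags."""
    with restart_condition:
        restart_condition.wait_for(lambda: any(restart_array))
        worker_ids = [i for i, flagged in enumerate(restart_array) if flagged]
        for worker_id in worker_ids:
            restart_array[worker_id] = False
    return worker_ids
```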

Do the workers keep track of whether they time out, or does the main process do that? I'd kind of expect the latter, but then why is it shared memory?

No, the main process does that within one of its threads. This is more reliable than doing it in a worker, because a worker can deadlock or fail to release the GIL, and you would need some way of switching threads within the worker to check for a timeout. You would also end up with n_jobs monitoring threads instead of just one.
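
For completeness, here is a rough sketch of what such a main-process timeout check could look like, assuming a shared array of start timestamps written by the workers (0.0 meaning idle) and a callback for handling a timed-out worker. This is illustrative only and not mpire's actual implementation:

```python
import threading
import time


def timeout_monitor(workers_time_task_started, task_timeout, on_timeout, stop_event,
                    poll_interval=0.1):
    """Runs in a main-process thread; the workers only write their own start timestamps."""
    while not stop_event.is_set():
        now = time.time()
        for worker_id, started_at in enumerate(workers_time_task_started):
            if started_at and now - started_at > task_timeout:
                on_timeout(worker_id)  # e.g., mark the job as failed and interrupt the worker
        stop_event.wait(poll_interval)


def start_timeout_monitor(workers_time_task_started, task_timeout, on_timeout):
    """Start the monitoring thread from the main process; returns the thread and its stop event."""
    stop_event = threading.Event()
    thread = threading.Thread(
        target=timeout_monitor,
        args=(workers_time_task_started, task_timeout, on_timeout, stop_event),
        daemon=True,
    )
    thread.start()
    return thread, stop_event
```

A real implementation would also need to avoid handling the same timed-out worker repeatedly; in the mechanism described above, that is what setting _exception_thrown takes care of.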

harmenwassenaar previously approved these changes Feb 7, 2024
@harmenwassenaar (Collaborator) left a comment

I'll probably spend some time trying to understand all the communication a bit better (and your comment should help a lot with that).
But I don't think there's a problem with merging this improvement in the meantime.

Review threads on mpire/dashboard/dashboard.py and mpire/comms.py (resolved)
@sybrenjansen (Owner, Author)

Any new comments on your side, @harmenwassenaar? Otherwise I can start making a new release :)

@harmenwassenaar (Collaborator) left a comment

No new comments :)

sybrenjansen merged commit 5835015 into master on Feb 16, 2024
17 checks passed
sybrenjansen deleted the use-less-semaphores branch on February 16, 2024 at 11:36