Getting warning message "Using fork() can cause Polars to deadlock in the child process" #20255

Closed
francescomandruvs opened this issue Dec 10, 2024 · 8 comments · Fixed by #20309
Labels: accepted (Ready for implementation), documentation (Improvements or additions to documentation), reference (Reference issue for recurring topics)

Comments

@francescomandruvs

francescomandruvs commented Dec 10, 2024

This is a request to make this message a bit more explicit. We are gradually switching a complex ML pipeline from pandas to Polars. At the moment we are testing the switch on a subset of the feature computation. This has worked perfectly fine!
However, at the start of the processing pipeline (so much earlier than the new Polars block), we got the following warning:

/usr/local/lib/python3.11/multiprocessing/popen_fork.py:66: RuntimeWarning: Using fork() can cause Polars to deadlock in the child process.
In addition, using fork() with Python in general is a recipe for mysterious
deadlocks and crashes.

The most likely reason you are seeing this error is because you are using the
multiprocessing module on Linux, which uses fork() by default. This will be
fixed in Python 3.14. Until then, you want to use the "spawn" context instead.

See https://docs.pola.rs/user-guide/misc/multiprocessing/ for details.

If you really know what your doing, you can silence this warning with the warning module
or by setting POLARS_ALLOW_FORKING_THREAD=1.

  self.pid = os.fork()

We wonder what exactly we should do about that, whether we can ignore it, and what we have to avoid doing. We see a lot of discussion about this specific warning, but to the best of our knowledge there is no clear answer on how to proceed. To me it is not clear what the consequences of switching from fork to spawn are, so any detailed explanation is welcome!

We found out that we should only take care to avoid using multiprocessing and Polars together. If that is the case, I would highlight it in the message. If there are other things we should take care of, we would like to know about them.

We see this specific warning only on Linux.

francescomandruvs added the enhancement (New feature or an improvement of an existing feature) label Dec 10, 2024
@ritchie46
Member

Fork is outright dangerous, as it assumes no other library holds a lock. It is an insane default, and Python 3.14 changes it: the default start method on Linux becomes forkserver instead of fork.

This isn't Polars-specific. NumPy also has some parallelism in BLAS and can also be corrupted (though with a lower probability).

You can ignore it.
Maybe it works for you, but if you get deadlocks, you must spawn.
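A minimal sketch of what "you must spawn" can look like with the standard library. The helper names `spawn_context` and `sqrt_all` are made up for illustration, and `math.sqrt` only stands in for real work; nothing here is taken from the Polars docs.

```python
import math
import multiprocessing as mp


def spawn_context():
    """Return a context whose workers start as fresh interpreters, not forks."""
    return mp.get_context("spawn")


def sqrt_all(values):
    # math.sqrt stands in for the real per-item work (hypothetical here).
    # With "spawn" the worker function must be importable by the child.
    with spawn_context().Pool(processes=2) as pool:
        return pool.map(math.sqrt, values)


if __name__ == "__main__":
    # The guard matters with "spawn": children re-import the main module,
    # and must not start new pools while doing so.
    print(sqrt_all([1.0, 4.0, 9.0]))
```

Using a context object keeps the choice local to this pool. Calling `multiprocessing.set_start_method("spawn")` once at program start is the global alternative.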

Btw, Polars should not be parallelized by the users. It handles it itself.

https://pytorch.org/docs/stable/notes/multiprocessing.html#avoiding-and-fighting-deadlocks

https://docs.pola.rs/user-guide/misc/multiprocessing/#the-problem-with-the-default-multiprocessing-config

nameexhaustion changed the title from "Using fork() can cause Polars to deadlock in the child process" to "Getting warning message "Using fork() can cause Polars to deadlock in the child process"" Dec 11, 2024
nameexhaustion added the documentation (Improvements or additions to documentation) label Dec 11, 2024
@francescomandruvs
Author

Btw, Polars should not be parallelized by the users. It handles it itself.

  1. If this warning is telling me to avoid using Polars in a user-defined parallel function, that's fine, but I would be a bit more explicit about that.
  2. If instead there are a couple of different situations where I can face a deadlock, I would like to better understand what to avoid.

From your answer I guess we are in (1), so I just need to avoid any multiprocessing / multithreading external library working with Polars.

@ritchie46
Member

If this warning is telling me to avoid using Polars in a user-defined parallel function, that's fine, but I would be a bit more explicit about that.

We're quite explicit about the fact that we're multithreaded. (And this whole warning is also quite explicit, isn't it :) )

If instead there are a couple of different situations where I can face a deadlock, I would like to better understand what to avoid.

Forking a process. It can deadlock any process that holds a mutex. Multithreading is fine. Multiprocessing with fork is dangerous. It is not something we can fix.
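The mutex problem can be reproduced with the standard library alone. The sketch below is an illustration, not Polars code: a helper thread holds a lock while the main thread forks, and the child then tries to take its copy of that lock. The function name and the timeout are assumptions made for the demo; the timeout exists only so that the child reports failure instead of hanging forever. It needs a POSIX system, since `os.fork()` is not available on Windows.

```python
import os
import threading
import warnings


def child_can_take_lock_after_fork(timeout: float = 0.5) -> bool:
    """Fork while another thread holds a lock; report whether the child got it."""
    lock = threading.Lock()
    held = threading.Event()
    release = threading.Event()

    def hold_lock() -> None:
        with lock:
            held.set()      # tell the main thread the lock is taken
            release.wait()  # keep holding it across the fork

    holder = threading.Thread(target=hold_lock)
    holder.start()
    held.wait()

    with warnings.catch_warnings():
        # Recent Python versions warn about fork() in a multi-threaded process.
        warnings.simplefilter("ignore", DeprecationWarning)
        pid = os.fork()

    if pid == 0:
        # Child: only the forking thread was copied, so no thread exists here
        # that could ever release the lock. Without the timeout this blocks.
        acquired = lock.acquire(timeout=timeout)
        os._exit(0 if acquired else 1)

    _, status = os.waitpid(pid, 0)
    release.set()
    holder.join()
    return os.waitstatus_to_exitcode(status) == 0
```

In the parent the lock is released as soon as the helper thread finishes, so the same program keeps working there. Only the forked child is stuck, which is why the failure shows up as a hang in a worker process.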

From your answer I guess we are in (1), so I just need to avoid any multiprocessing / multithreading external library working with Polars.

With any library that does anything multithreaded. Even numpy:

https://stackoverflow.com/questions/51093970/multiprocessing-code-works-using-numpy-but-deadlocked-using-pytorch

Don't fork after threads are created: python/cpython#96971 (comment)

@niccolopetti

I'm also facing the same issue but can't figure out what causes it. The only part that seems related to explicit multithreading is the call to collect_all. What should we do about this warning?

@ritchie46
Member

The error message is pretty clear:

The most likely reason you are seeing this error is because you are using the
multiprocessing module on Linux, which uses fork() by default. This will be
fixed in Python 3.14. Until then, you want to use the "spawn" context instead.

See https://docs.pola.rs/user-guide/misc/multiprocessing/ for details.

If you really know what you're doing, you can silence this warning with the warning module
or by setting POLARS_ALLOW_FORKING_THREAD=1.

E.g. silence it via the warnings module or set POLARS_ALLOW_FORKING_THREAD=1.

If you get a deadlock, you know what caused it.
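A short sketch of the two ways to silence it, assuming the message text quoted above. The function name and the regex constant are illustrative. The thread does not say when the environment variable is read, so setting it before importing Polars is an assumption on the safe side. Either option is enough on its own.

```python
import os
import warnings

# Regex matched against the start of the warning text quoted in this thread.
POLARS_FORK_WARNING = r"Using fork\(\) can cause Polars to deadlock"


def silence_polars_fork_warning() -> None:
    # Option 1: ignore only this message, leaving other RuntimeWarnings visible.
    warnings.filterwarnings(
        "ignore", message=POLARS_FORK_WARNING, category=RuntimeWarning
    )
    # Option 2: the environment variable named in the warning itself.
    # Assumption: set it before Polars is imported and before any fork.
    os.environ["POLARS_ALLOW_FORKING_THREAD"] = "1"
```

Silencing the warning does not make fork safe. It only hides the hint that explains a later deadlock.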

nameexhaustion added the reference (Reference issue for recurring topics) label and removed the enhancement (New feature or an improvement of an existing feature) label Dec 13, 2024
MarcoGorelli mentioned this issue Dec 13, 2024
@glemaitre

The warning could be raised in a situation that I think is a false positive.

joblib uses loky, which uses fork/exec, and as far as I understand that is a safe way to start the processes.

What makes the warning more invasive is that the fork hook is registered when polars is imported, so any parallel processing using joblib (as done internally in scikit-learn) will raise the warning even if there is no processing involving polars. While a user could silence the warning with the environment variable, the message is confusing because they are not in charge of handling the parallel processing at their level (the internals of scikit-learn should do that).
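To illustrate why an import-time hook fires so broadly, here is a sketch using `os.register_at_fork`. Whether Polars registers its hook this way is an assumption; `hypothetical_lib` and both function names are invented for the example. The point is only that such a hook runs on every later `fork()` in the process, whoever calls it. It needs a POSIX system.

```python
import os
import warnings


def _warn_before_fork() -> None:
    # Stand-in for a library's fork warning (hypothetical message).
    warnings.warn(
        "fork() called after hypothetical_lib was imported", RuntimeWarning
    )


# This line plays the role of the library's import-time side effect.
os.register_at_fork(before=_warn_before_fork)


def fork_and_collect_warnings() -> list[str]:
    """Fork once without touching the library and return the warnings seen."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        pid = os.fork()
        if pid == 0:
            os._exit(0)  # the child does nothing at all
        os.waitpid(pid, 0)
    return [str(w.message) for w in caught]
```

The child here does no work with the library, yet the parent still sees the warning, which matches the joblib and scikit-learn situation described above.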

@ritchie46
Member

will raise the warning even if there is no processing involving polars

Yeah, I agree that this is too broad of a warning for Polars. I will revert it.

@glemaitre

Yeah, I agree that this is too broad of a warning for Polars. I will revert it.

Thanks for considering this input.
