Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

migrate from dill to cloudpickle for advanced serialization #7870

Closed
jrwalk opened this issue Mar 25, 2020 · 13 comments
Closed

migrate from dill to cloudpickle for advanced serialization #7870

jrwalk opened this issue Mar 25, 2020 · 13 comments
Labels
airflow3.0:breaking Candidates for Airflow 3.0 that contain breaking changes airflow3.0:candidate Potential candidates for Airflow 3.0 area:core-operators Operators, Sensors and hooks within Core Airflow area:serialization good first issue kind:feature Feature Requests

Comments

@jrwalk
Copy link

jrwalk commented Mar 25, 2020

Description

Usage of dill for optional serialization in PythonVirtualenvOperator may be replaced with cloudpickle as its serialization library. This should be a mostly drop-in replacement.

Use case / motivation

Currently, the PythonVirtualenvOperator optionally uses dill in place of stock pickle to serialize advanced types. However, most major distributed compute frameworks have opted to shift to cloudpickle, meaning using dill for Airflow can introduce redundant dependencies for calling out to other distributed compute (e.g., farming compute-heavy tasks out to a remote dask cluster), and can interfere with serialization of tasks for those tools.

Since both dill and cloudpickle are largely drop-in replacements for pickle, the migration should be fairly minor.

Related Issues

kubeflow/pipelines#1387

dask/distributed#3606

piskvorky/gensim#558 (comment)

uqfoundation/multiprocess#22 (comment)

@jrwalk jrwalk added the kind:feature Feature Requests label Mar 25, 2020
@boring-cyborg
Copy link

boring-cyborg bot commented Mar 25, 2020

Thanks for opening your first issue here! Be sure to follow the issue template!

@ashb
Copy link
Member

ashb commented Mar 26, 2020

Is it worth making the pickle module a config setting, or just always using cloudpickle instead?

@jrwalk
Copy link
Author

jrwalk commented Mar 27, 2020

so dill is optional to use (set by a flag in the PythonVirtualenvOperator) but is always imported and treated as a required dependency for Airflow, right? Keeping that pattern it should be super easy to cut over to cloudpickle.

I'd generally advocate for allowing fewer dependencies though, so it could be nice to make cloudpickle a proper extra-dependency and put some checks around the import. That would let people keep minimal environments if they choose/need to. I've also run into situations where it wasn't being used (no virtualenv operators in my dag) but because it was imported dill was still getting captured by serializers like cloudpickle since they scoop up a fair amount of runtime environment information.

@jrwalk
Copy link
Author

jrwalk commented Mar 27, 2020

either way it would probably also necessitate reworking some of the tests, since (for example) cloudpickle could conceivably serialize lambda functions for the python_callable inputs

@sumeshpremraj
Copy link
Contributor

sumeshpremraj commented Jul 12, 2023

Hi @jrwalk, can I work on this?
dill and cloudpickle seem to be fairly similar, is there anything else I should consider (except for the tests your mentioned)?

@potiuk
Copy link
Member

potiuk commented Jul 12, 2023

Assigned you. The issue is pretty old - we also have ExternalPythonOperator using same approach now. I think just focusing on tests for those might be enough

@Felix-neko
Copy link

Felix-neko commented Nov 8, 2023

#35529 -- maybe using cloudpickle could cure this problem?

@potiuk
Copy link
Member

potiuk commented Nov 8, 2023

#35529 -- maybe using cloudpickle could cure this problem?

If you would like to take a stab on it and attempt to try it - fee free to verify that hypothesis @Felix-neko - PRs are always most welcome.

@Felix-neko
Copy link

#35529 -- maybe using cloudpickle could cure this problem?

If you would like to take a stab on it and attempt to try it - fee free to verify that hypothesis @Felix-neko - PRs are always most welcome.

Heh, it looks like you're overestimating my power (and my current knowledge in cloudpickle and airflow)
; )

@Taragolis
Copy link
Contributor

It might help. There is also limitations for cloudpickle exists:

Cloudpickle can only be used to send objects between the exact same version of Python.

@Felix-neko
Copy link

Cloudpickle can only be used to send objects between the exact same version of Python.

That won't be a big problem. Especially if simple dill will still be supported.

@VladaZakharova
Copy link
Contributor

@sumeshpremraj
Hi there! May I ask how is the progress going on this issue?

@kaxil kaxil added airflow3.0:candidate Potential candidates for Airflow 3.0 airflow3.0:breaking Candidates for Airflow 3.0 that contain breaking changes labels Jul 18, 2024
@gyli
Copy link
Contributor

gyli commented Jul 20, 2024

With #39270 completed, argument serializer was added to support choosing from serializers. Looks like this issue can be closed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
airflow3.0:breaking Candidates for Airflow 3.0 that contain breaking changes airflow3.0:candidate Potential candidates for Airflow 3.0 area:core-operators Operators, Sensors and hooks within Core Airflow area:serialization good first issue kind:feature Feature Requests
Projects
None yet
Development

Successfully merging a pull request may close this issue.