How to best handle external files #27396
nicholasd-ff asked this question in Q&A
Hi Dagster community.
I've looked at Dagster University and a bunch of the examples, read about sensors and external assets, and tried implementing our initial use case in several different ways, and I'm a bit bewildered by how complicated it seems to be to handle external files.
We want to feed a bunch of CSVs from S3 into a Jupyter notebook. This seems like a natural fit for external assets and a sensor, but suppose we define them along the lines of the sketch below.
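(A simplified sketch of what I mean, if I've understood the API correctly: the bucket name, prefix, and asset key are placeholders, and I'm declaring the external asset with AssetSpec and reporting materializations from the sensor via SensorResult.)

```python
import boto3
from dagster import (
    AssetMaterialization,
    AssetSpec,
    Definitions,
    MetadataValue,
    SensorEvaluationContext,
    SensorResult,
    sensor,
)

# External asset representing CSV files that land in S3 outside of Dagster.
raw_csvs = AssetSpec("raw_csvs", description="CSVs dropped into s3://my-bucket/incoming/")


@sensor()
def s3_csv_sensor(context: SensorEvaluationContext):
    """Poll S3 and report a materialization for each CSV we haven't seen yet."""
    s3 = boto3.client("s3")
    seen = set(filter(None, (context.cursor or "").split(",")))
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="incoming/")
    new_keys = [
        obj["Key"]
        for obj in resp.get("Contents", [])
        if obj["Key"].endswith(".csv") and obj["Key"] not in seen
    ]
    context.update_cursor(",".join(seen | set(new_keys)))
    return SensorResult(
        asset_events=[
            AssetMaterialization(
                asset_key="raw_csvs",
                metadata={"s3_key": MetadataValue.text(key)},
            )
            for key in new_keys
        ]
    )


defs = Definitions(assets=[raw_csvs], sensors=[s3_csv_sensor])
```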
Then what do I do next? Accessing the metadata of an external asset is extremely awkward and seems to require something like the following:
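(Again a sketch, assuming the raw_csvs asset and the s3_key metadata entry from the sensor above; the downstream asset has to go through the instance and the event log just to get the S3 key back out, and it only sees the latest event.)

```python
from dagster import AssetExecutionContext, AssetKey, asset


@asset(deps=["raw_csvs"])
def notebook_inputs(context: AssetExecutionContext) -> list[str]:
    """Dig the S3 key of the most recent CSV back out of the event log."""
    event = context.instance.get_latest_materialization_event(AssetKey("raw_csvs"))
    if event is None or event.asset_materialization is None:
        raise Exception("raw_csvs has never been materialized")
    # Metadata comes back wrapped in MetadataValue objects and has to be unwrapped,
    # and only the latest materialization is visible this way.
    s3_key_value = event.asset_materialization.metadata["s3_key"]
    return [s3_key_value.value]
```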
Is this really the right way to approach it? I also thought about representing the metadata for an external S3 object as a regular Python class, something like:
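(Sketch with made-up names: an ordinary asset whose value is just a small record describing each file, so the IO manager pickles the metadata while the bytes stay in S3.)

```python
from dataclasses import dataclass
from datetime import datetime

import boto3
from dagster import asset


@dataclass
class S3CsvFile:
    """Plain record describing a CSV that lives in S3; the contents stay in S3."""

    bucket: str
    key: str
    etag: str
    last_modified: datetime


@asset
def incoming_csvs() -> list[S3CsvFile]:
    """List the CSVs currently under the prefix and return their metadata."""
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="incoming/")
    return [
        S3CsvFile(
            bucket="my-bucket",
            key=obj["Key"],
            etag=obj["ETag"],
            last_modified=obj["LastModified"],
        )
        for obj in resp.get("Contents", [])
        if obj["Key"].endswith(".csv")
    ]
```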
But then it becomes awkward to materialize the asset from a sensor: I'd have to fetch the metadata twice, once in the sensor to decide whether to trigger the materialization, and again in the asset body, because there appears to be no way other than run metadata or the run / partition key to pass information from the sensor (see the sketch below). I feel like I'm missing something really obvious here.
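(For example, the closest I've come is stuffing the keys into run config; sketch with placeholder names, and this is exactly the double-handling I mean.)

```python
import json

import boto3
from dagster import (
    Config,
    RunRequest,
    SensorEvaluationContext,
    asset,
    define_asset_job,
    sensor,
)


class CsvBatchConfig(Config):
    # JSON-encoded list of S3 keys, because run config is the only channel I can
    # see for getting what the sensor found into the asset body.
    s3_keys_json: str


@asset
def notebook_csvs(config: CsvBatchConfig) -> list[str]:
    # Second fetch / hand-off: the asset just re-reads whatever the sensor saw.
    return json.loads(config.s3_keys_json)


csv_job = define_asset_job("csv_job", selection=[notebook_csvs])


@sensor(job=csv_job)
def new_csv_sensor(context: SensorEvaluationContext):
    # First fetch: list the bucket just to decide whether to kick off a run.
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="incoming/")
    keys = [o["Key"] for o in resp.get("Contents", []) if o["Key"].endswith(".csv")]
    if keys:
        yield RunRequest(
            run_key=",".join(sorted(keys)),
            run_config={
                "ops": {"notebook_csvs": {"config": {"s3_keys_json": json.dumps(keys)}}}
            },
        )
```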
I get that Dagster wants every asset to be a Python object so it can be pickled, and we have other / future workflows where that would be true, but there must be plenty of people for whom it's not always the best fit?
I found this question: #18211, but it doesn't really seem to be coming from the same place. Eventually we might want to run this on Dagster+ or on separate executors to materialize outputs in parallel, so I don't really want to materialize each CSV to a hard-coded location under dagster_home, as most of the Dagster University examples do.