How to best handle external files #27396
nicholasd-ff asked this question in Q&A
Hi Dagster community.
I've looked at Dagster University and a bunch of the examples, read about sensors and external assets, and tried implementing our initial use case in several different ways, and I'm a bit bewildered by how complicated it seems to be to handle external files.
We want to feed a bunch of CSVs from S3 into a Jupyter notebook. This seems like a natural fit for external assets and a sensor, but suppose we define them along the lines of the sketch below.
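(A simplified sketch of what I mean, if I've understood the API correctly: the bucket name, prefix, and asset key are placeholders, and I'm declaring the external asset with AssetSpec and reporting materializations from the sensor via SensorResult.)

```python
import boto3
from dagster import (
    AssetMaterialization,
    AssetSpec,
    Definitions,
    MetadataValue,
    SensorEvaluationContext,
    SensorResult,
    sensor,
)

# External asset representing CSV files that land in S3 outside of Dagster.
raw_csvs = AssetSpec("raw_csvs", description="CSVs dropped into s3://my-bucket/incoming/")


@sensor()
def s3_csv_sensor(context: SensorEvaluationContext):
    """Poll S3 and report a materialization for each CSV we haven't seen yet."""
    s3 = boto3.client("s3")
    seen = set(filter(None, (context.cursor or "").split(",")))
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="incoming/")
    new_keys = [
        obj["Key"]
        for obj in resp.get("Contents", [])
        if obj["Key"].endswith(".csv") and obj["Key"] not in seen
    ]
    context.update_cursor(",".join(seen | set(new_keys)))
    return SensorResult(
        asset_events=[
            AssetMaterialization(
                asset_key="raw_csvs",
                metadata={"s3_key": MetadataValue.text(key)},
            )
            for key in new_keys
        ]
    )


defs = Definitions(assets=[raw_csvs], sensors=[s3_csv_sensor])
```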
Then what do I do next? Accessing the metadata of an external asset is extremely awkward and seems to require something like the following:
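(Again a sketch, assuming the raw_csvs asset and the s3_key metadata entry from the sensor above; the downstream asset has to go through the instance and the event log just to get the S3 key back out, and it only sees the latest event.)

```python
from dagster import AssetExecutionContext, AssetKey, asset


@asset(deps=["raw_csvs"])
def notebook_inputs(context: AssetExecutionContext) -> list[str]:
    """Dig the S3 key of the most recent CSV back out of the event log."""
    event = context.instance.get_latest_materialization_event(AssetKey("raw_csvs"))
    if event is None or event.asset_materialization is None:
        raise Exception("raw_csvs has never been materialized")
    # Metadata comes back wrapped in MetadataValue objects and has to be unwrapped,
    # and only the latest materialization is visible this way.
    s3_key_value = event.asset_materialization.metadata["s3_key"]
    return [s3_key_value.value]
```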
Is this really the right way to approach it? I also thought about representing the metadata for an external S3 object as a regular Python class, something like:
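(Sketch with made-up names: an ordinary asset whose value is just a small record describing each file, so the IO manager pickles the metadata while the bytes stay in S3.)

```python
from dataclasses import dataclass
from datetime import datetime

import boto3
from dagster import asset


@dataclass
class S3CsvFile:
    """Plain record describing a CSV that lives in S3; the contents stay in S3."""

    bucket: str
    key: str
    etag: str
    last_modified: datetime


@asset
def incoming_csvs() -> list[S3CsvFile]:
    """List the CSVs currently under the prefix and return their metadata."""
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="incoming/")
    return [
        S3CsvFile(
            bucket="my-bucket",
            key=obj["Key"],
            etag=obj["ETag"],
            last_modified=obj["LastModified"],
        )
        for obj in resp.get("Contents", [])
        if obj["Key"].endswith(".csv")
    ]
```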
But then it becomes awkward to materialize the asset from a sensor: I'd have to fetch the metadata twice, once in the sensor to decide whether to trigger the materialization, and again in the asset body, because there appears to be no way other than run metadata or the run / partition key to pass information from the sensor (see the sketch below). I feel like I'm missing something really obvious here.
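(For example, the closest I've come is stuffing the keys into run config; sketch with placeholder names, and this is exactly the double-handling I mean.)

```python
import json

import boto3
from dagster import (
    Config,
    RunRequest,
    SensorEvaluationContext,
    asset,
    define_asset_job,
    sensor,
)


class CsvBatchConfig(Config):
    # JSON-encoded list of S3 keys, because run config is the only channel I can
    # see for getting what the sensor found into the asset body.
    s3_keys_json: str


@asset
def notebook_csvs(config: CsvBatchConfig) -> list[str]:
    # Second fetch / hand-off: the asset just re-reads whatever the sensor saw.
    return json.loads(config.s3_keys_json)


csv_job = define_asset_job("csv_job", selection=[notebook_csvs])


@sensor(job=csv_job)
def new_csv_sensor(context: SensorEvaluationContext):
    # First fetch: list the bucket just to decide whether to kick off a run.
    s3 = boto3.client("s3")
    resp = s3.list_objects_v2(Bucket="my-bucket", Prefix="incoming/")
    keys = [o["Key"] for o in resp.get("Contents", []) if o["Key"].endswith(".csv")]
    if keys:
        yield RunRequest(
            run_key=",".join(sorted(keys)),
            run_config={
                "ops": {"notebook_csvs": {"config": {"s3_keys_json": json.dumps(keys)}}}
            },
        )
```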
I get that Dagster wants every asset to be a Python object so it can be pickled, and we have other / future workflows where that would be true, but there must be plenty of people for whom it's not always the best fit?
I found this question: #18211, but it doesn't really seem to be coming from the same place. Eventually we might want to run this on Dagster+ or on separate executors to materialize outputs in parallel, so I don't really want to materialize each CSV to a hard-coded location under dagster_home, as most of the Dagster University examples do.