
updating and appending pipeline #321

Open
kyle-messier opened this issue Mar 31, 2024 · 15 comments
Labels
Production issues related to pipeline or production

Comments

@kyle-messier
Collaborator

kyle-messier commented Mar 31, 2024

Approaches and discussion on how we implement an updatable pipeline when new AQS data becomes available

@kyle-messier kyle-messier converted this from a draft issue Mar 31, 2024
@kyle-messier kyle-messier added the Production issues related to pipeline or production label Mar 31, 2024
@kyle-messier
Collaborator Author

@sigmafelix @eva0marques @mitchellmanware @dzilber @dawranadeep @larapclark

I've been exploring how we can implement pipeline updates when new data become available, while also being smart enough to run only the new data. For example, when calculating covariates with temporality, we only need to calculate the new 6 months and then append.

The targets package has a discussion here. As you can see, someone was trying to do this and the developer said it is out of scope for targets. While thinking about how we could construct lists of targets that use various targets or tarchetypes functions to handle the checking, running, and appending, I also came across the documentation of tar_meta here. In the subsection on storage access you'll see that it is strongly advised not to use things like tar_read or tar_meta within targets, since they are designed to update or be updated.

So where does that leave us? I think we need to implement the check, run, and append steps in our own R functions in the beethoven code base. The approach I suggest, and I think we were already including this information in portions of the pipeline data objects with a site_index, is to create time_index and location_index variables or a metadata object that accompanies each created target object. We could reformat the site_index so that the space and time information is discernible, or make new, separate indices. A target object could then easily check the requested times and locations. If additional times or locations are requested by the pipeline, the functions called by targets read the old data, run the function on the new data, and append the old and new data.
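As a very rough sketch of what such a beethoven-side function could look like (everything here is hypothetical, including the attribute name and helpers):

# Hypothetical sketch (not existing beethoven functions): check a cached
# object's time/location index against the requested index, compute only the
# missing portion, and append.
update_or_append <- function(requested_index, saved_path, calc_fun) {
  if (!file.exists(saved_path)) {
    # Nothing cached yet: compute everything requested
    out <- calc_fun(requested_index)
    attr(out, "time_location_index") <- requested_index
  } else {
    saved <- qs::qread(saved_path)
    # Assumes the cached object carries its space-time index as an attribute
    saved_index <- attr(saved, "time_location_index")
    missing_index <- setdiff(requested_index, saved_index)
    if (length(missing_index) == 0) {
      return(saved)
    }
    # Compute only the missing space-time entries and append to the cache
    new_part <- calc_fun(missing_index)
    out <- data.table::rbindlist(list(saved, new_part), fill = TRUE)
    attr(out, "time_location_index") <- union(saved_index, missing_index)
  }
  qs::qsave(out, saved_path)
  out
}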

@kyle-messier
Collaborator Author

  • Look into approach for updating - one target in between covariates and model fitting that does the combining
  • Model updating, option for not updating model parameters

@kyle-messier
Collaborator Author

kyle-messier commented Apr 2, 2024

@kyle-messier kyle-messier moved this from 📋 Backlog to 🏗 In progress in beethoven Apr 2, 2024
@sigmafelix
Collaborator

[Hand-drawn figure: overall structure of the pipeline]

I want to clarify what I said in today's discussion. The picture above depicts the overall structure of the pipeline I'm working on (sorry for the bad handwriting). In my understanding, the pipeline is supposed to represent a static procedure of data analytics; any dynamic components (i.e., periodic updates in raw data) should be declared externally, which is where my configuration CSV file (aka punchcard.csv) comes in: it invalidates nodes when the configuration is read with meta_run(). This practice can bypass the potential issue of pointing to a fixed file path that would not update the pipeline (as discussed in ropensci/targets#136).

One challenge is that we want to update the feature data and the models periodically, which is difficult to do with targets' static pipeline. Given that we save a file to another directory (represented as the "external directory" in the figure), the pipeline can be run for the new period just by editing the configuration file's start and end dates. The new feature data object will be appended to whatever already exists in the directory where we keep previous feature files. I think this can be done at the function level, for example:

#' Append previously saved feature data to the present features
post_append <- function(present_features, path_present_features, dir_features) {
  # list.files() takes `path`, not `dir`, and returns character(0) when empty
  nfiles <- list.files(path = dir_features, full.names = TRUE)
  if (length(nfiles) > 0) {
    # Read previous feature files and add the present features as one element
    feats_prev <- lapply(nfiles, qs::qread)
    feats_prev <- append(feats_prev, list(present_features))
    feats_up <- data.table::rbindlist(feats_prev, fill = TRUE)
    return(feats_up)
  } else {
    # No previous features: save the present features and return them unchanged
    qs::qsave(present_features, path_present_features)
    return(present_features)
  }
}

Then, how do we know the saved file is from a previous pipeline run? To deal with this, we could adopt a simple naming convention for the saved files (e.g., include the start and end dates in the file name). We could add a function to beethoven to set up the feature cache directory and validate the naming convention, etc.
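For instance, a small helper along these lines could validate and parse such a convention (the features_<start>_<end>.qs pattern below is just an assumption, not an agreed format):

# Hypothetical convention: features_<start>_<end>.qs, e.g. features_20180101_20180630.qs
parse_feature_filename <- function(path) {
  pattern <- "^features_(\\d{8})_(\\d{8})\\.qs$"
  fname <- basename(path)
  if (!grepl(pattern, fname)) {
    stop("File does not follow the feature cache naming convention: ", fname)
  }
  dates <- regmatches(fname, regexec(pattern, fname))[[1]][2:3]
  list(
    start = as.Date(dates[1], format = "%Y%m%d"),
    end   = as.Date(dates[2], format = "%Y%m%d")
  )
}

# e.g., list the periods already cached before deciding what to append:
# lapply(list.files("output/feature_cache", full.names = TRUE), parse_feature_filename)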

The idea I brought up above is half-baked, and I believe similar solutions exist within the targets framework. I will look at the manual and discussions regarding this issue.

@kyle-messier
Collaborator Author

Thanks @sigmafelix. BTW, I think it is a beautiful hand-drawn figure. The solution you present here could work. That said, I'd also like to keep looking at tar_files and tar_files_raw, as I think they may deal with this issue. To be continued...

@mitchellmanware
Collaborator

Thank you for the visual, Insang, it is very helpful. I see the pipeline the same way, where external editing of the configuration file's dates/years triggers the rest of the pipeline to run.

I think it would be difficult to implement dynamic branching for each of the covariate download/processing/calculation steps, but using it at the dt_combined target may alleviate the need for an external directory at the appending step. The way I understand it, the dynamic branches can be triggered by a target meeting a certain "new features" condition, which we can set with a new function. If this "new features" condition is met, the dynamic branch runs the subsequent targets, which can include reading the already existing features onto which the new features are appended. This would not violate the advice against tar_read() within functions because the function would only trigger the branch to run.

Not sure if I have interpreted the documentation correctly, but I am still reading and working on an example.
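In the meantime, the rough shape I have in mind is something like this (detect_new_periods, calculate_features, append_features, load_previous_features, and the path variables are all placeholder names, not existing functions):

list(
  # Hypothetical check target: returns identifiers for periods not yet covered
  tar_target(
    new_periods,
    detect_new_periods(punchcard_path, feature_cache_dir)
  ),
  # Dynamic branching: one branch per new period; zero new periods, zero branches
  tar_target(
    new_features,
    calculate_features(new_periods),
    pattern = map(new_periods)
  ),
  # Combine: append the new branches to the previously saved features, read
  # directly from disk rather than with tar_read() inside the function
  tar_target(
    dt_combined,
    append_features(load_previous_features(feature_cache_dir), new_features)
  )
)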

@eva0marques
Collaborator

eva0marques commented Apr 2, 2024

Thank you @insang.
I would personally add an option in the configuration file so the user can choose between the following at the model step (a rough dispatch sketch follows the list):

  • Use model: use an existing model to simply predict on new data (this option is only possible if a model already exists; will we share our 5-year-trained model in the end?)
  • Update model: update the previously trained models (again, I am not sure this is feasible for all of our base learners because of the algorithms themselves; I am not a specialist on this question, but I have only heard of pre-training for NN architectures)
  • Create model: completely retrain the whole model.
    (I'm sure you'll find better names for this option variable.)
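For illustration only, dispatching on such an option could look roughly like this (run_model_step, update_base_learners, and fit_base_learners are made-up names for a hypothetical "model_mode" configuration value):

run_model_step <- function(model_mode = c("use", "update", "create"),
                           new_data, model_path) {
  model_mode <- match.arg(model_mode)
  switch(
    model_mode,
    use    = predict(readRDS(model_path), new_data),              # existing model, predict only
    update = update_base_learners(readRDS(model_path), new_data), # incremental update, if feasible
    create = fit_base_learners(new_data)                          # full retrain from scratch
  )
}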

@kyle-messier
Collaborator Author

  • We need to check in with OSC and make sure all of the necessary packages are installed on GEO
  • targets
  • crew
  • crew.cluster

@kyle-messier
Collaborator Author

@sigmafelix Here is a code snippet for a two-target combination that reads and saves a SpatRaster object in targets, based off some discussion here

  # Raw OLM clay content file paths
  tar_target(
    olm_clay_files,
    unlist(
      list.files(
        "/Volumes/set/Projects/PrestoGP_Pesticides/input/OpenLandMapData/Clay_Content/",
        pattern = "\\.tif$",
        full.names = TRUE
      )
    )
  ),
  # Cropped SpatRaster; a custom format keeps the terra C++ pointers valid
  tar_target(
    name = olm_clay_crop,
    command = olm_read_crop(olm_clay_files),
    format = tar_format(
      read = function(path) terra::rast(path),
      write = function(object, path) {
        terra::writeRaster(
          x = object, filename = path, filetype = "GTiff", overwrite = TRUE
        )
      },
      marshal = function(object) terra::wrap(object),
      unmarshal = function(object) terra::unwrap(object)
    )
  )

If you try to use a regular format = "file" approach, the C++ pointers get messed up and errors occur. I'm not entirely sold on the unlist(list.files(.)) approach I've implemented here. Obviously for beethoven you have the punchcard, so no hard-coded paths like I have here.

@sigmafelix
Collaborator

@kyle-messier Thank you for pointing me to the file-based workflow. geotargets supports an interface for terra objects, which we could use for projects with single files. For beethoven, we use multiple files per target in most cases; that said, I think keeping different recipes/punchcards/blueprints for different periods with format = "file" via tarchetypes::tar_files_input (similar to what we discussed two weeks ago) is a great option for our case.
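A minimal sketch of that option (the punchcard paths are placeholders):

library(targets)
library(tarchetypes)

list(
  # Track period-specific punchcards as input files; editing or adding a
  # punchcard invalidates only the branches that depend on it
  tar_files_input(
    punchcard_files,
    c(
      "inst/targets/punchcard_2018_2020.csv",
      "inst/targets/punchcard_2021_2022.csv"
    )
  ),
  # One dynamic branch per punchcard downstream
  tar_target(
    punchcard_period,
    read.csv(punchcard_files),
    pattern = map(punchcard_files)
  )
)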

@kyle-messier
Collaborator Author

@sigmafelix geotargets appears to be made exactly for what I mentioned. It also got me wondering more about the functionality of a SpatRasterCollection. That could be useful.

@eva0marques
Collaborator

I didn't follow the conversation, so I apologize if I am off-topic, but I did write functions to convert to and from SpatRasterDataset, which was better suited to adding a time dimension than SpatRasterCollection.

@kyle-messier
Collaborator Author

@eva0marques no worries - I'm not as familiar with the specifics of the SpatRasterDataset vs the SpatRasterCollection. Perhaps we can discuss briefly at the group meeting today.

@eva0marques
Collaborator

eva0marques commented Apr 25, 2024

Sure :)
This is what is explained on the terra website:

A SpatRasterDataset is a collection of sub-datasets, where each is a SpatRaster for the same area (extent) and coordinate reference system, but possibly with a different resolution. Sub-datasets are often used to capture variables (e.g. temperature and precipitation), or a fourth dimension (e.g. height, depth or time) if the sub-datasets already have three dimensions (multiple layers).

A SpatRasterCollection is a collection of SpatRasters with no restriction in the extent or other geometric parameters.
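A toy example of the difference (made-up rasters):

library(terra)

# Two rasters with the same extent and CRS: valid sub-datasets of a SpatRasterDataset
r1 <- rast(nrows = 10, ncols = 10, vals = 1)
r2 <- rast(nrows = 10, ncols = 10, vals = 2)
sds_obj <- sds(r1, r2)
names(sds_obj) <- c("temperature", "precipitation")

# A raster with a different extent: allowed in a SpatRasterCollection,
# which places no restriction on extent or other geometric parameters
r3 <- rast(nrows = 5, ncols = 5, xmin = 0, xmax = 5, ymin = 0, ymax = 5, vals = 3)
sprc_obj <- sprc(r1, r3)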

@sigmafelix
Collaborator

sigmafelix commented May 1, 2024

TODO

  • Write a manual on changing/managing argument list objects and RDS files
  • Options for cluster configurations (related to future.batchtools functions)
  • (Long term) Test crew and crew.cluster
