
updating and appending pipeline #321

Open
kyle-messier opened this issue Mar 31, 2024 · 15 comments
Labels
Production issues related to pipeline or production

Comments

@kyle-messier
Collaborator

kyle-messier commented Mar 31, 2024

Approaches and discussion on how we implement an updatable pipeline when new AQS data becomes available

@kyle-messier kyle-messier converted this from a draft issue Mar 31, 2024
@kyle-messier kyle-messier added the Production issues related to pipeline or production label Mar 31, 2024
@kyle-messier
Collaborator Author

@sigmafelix @eva0marques @mitchellmanware @dzilber @dawranadeep @larapclark

I've been exploring how we can implement pipeline updates when new data become available, while also being smart enough to run only the new data. For example, when calculating covariates with temporality, we only need to calculate the new 6 months and then append.

The targets package has a discussion here. As you can see, someone was trying to do this and the developer said it is out of scope for targets. While thinking about how we could construct lists of targets that use various targets or tarchetypes functions to handle the checking, running, and appending, I also came across the documentation of tar_meta here. In the subsection on storage access you'll see that it is strongly advised not to use things like tar_read or tar_meta within targets, since they are designed to update or be updated.

So where does that leave us? I think we need to implement the check, run, and append steps in our own R functions in the beethoven code base. The approach I suggest, and I think we were already including this information in portions of the pipeline data objects with a site_index, is to create time_index and location_index variables or a metadata object that accompanies each created target object. We could reformat the site_index so that the space and time information is discernible, or make new, separate indices. A target object could then easily check the requested times and locations. If additional times or locations are requested by the pipeline, the functions called by targets read the old data, run the function on the new data, and append the old and new data.
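As a very rough sketch of what such a beethoven-side function could look like (everything here is hypothetical, including the attribute name and helpers):

# Hypothetical sketch (not existing beethoven functions): check a cached
# object's time/location index against the requested index, compute only the
# missing portion, and append.
update_or_append <- function(requested_index, saved_path, calc_fun) {
  if (!file.exists(saved_path)) {
    # Nothing cached yet: compute everything requested
    out <- calc_fun(requested_index)
    attr(out, "time_location_index") <- requested_index
  } else {
    saved <- qs::qread(saved_path)
    # Assumes the cached object carries its space-time index as an attribute
    saved_index <- attr(saved, "time_location_index")
    missing_index <- setdiff(requested_index, saved_index)
    if (length(missing_index) == 0) {
      return(saved)
    }
    # Compute only the missing space-time entries and append to the cache
    new_part <- calc_fun(missing_index)
    out <- data.table::rbindlist(list(saved, new_part), fill = TRUE)
    attr(out, "time_location_index") <- union(saved_index, missing_index)
  }
  qs::qsave(out, saved_path)
  out
}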

@kyle-messier
Collaborator Author

  • Look into approach for updating - one target in between covariates and model fitting that does the combining
  • Model updating, option for not updating model parameters

@kyle-messier
Collaborator Author

kyle-messier commented Apr 2, 2024

@kyle-messier kyle-messier moved this from 📋 Backlog to 🏗 In progress in beethoven Apr 2, 2024
@sigmafelix
Collaborator

[Hand-drawn figure: overall structure of the pipeline]

I want to clarify what I said in today's discussion. The picture above depicts the overall structure of the pipeline I'm working on (sorry for the bad handwriting). In my understanding, the pipeline is supposed to represent a static procedure of data analytics; any dynamic components (i.e., periodic updates in raw data) should be declared externally, which is where my configuration CSV file (aka punchcard.csv) comes in: it invalidates nodes when the configuration is read with meta_run(). This practice can bypass the potential issue of pointing to a fixed file path that would not update the pipeline (as discussed in ropensci/targets#136).

One challenge is that we want to update the feature data and the models periodically, which is difficult to do with targets' static pipeline. Given that we save a file to another directory (represented as the "external directory" in the figure), the pipeline can be run for the new period just by editing the configuration file's start and end dates. The new feature data object will be appended to whatever already exists in the directory where we keep previous feature files. I think this can be done at the function level, for example:

#' Append previously saved feature data to the present features
post_append <- function(present_features, path_present_features, dir_features) {
  # list.files() takes `path`, not `dir`, and returns character(0) when empty
  nfiles <- list.files(path = dir_features, full.names = TRUE)
  if (length(nfiles) > 0) {
    # Read previous feature files and add the present features as one element
    feats_prev <- lapply(nfiles, qs::qread)
    feats_prev <- append(feats_prev, list(present_features))
    feats_up <- data.table::rbindlist(feats_prev, fill = TRUE)
    return(feats_up)
  } else {
    # No previous features: save the present features and return them unchanged
    qs::qsave(present_features, path_present_features)
    return(present_features)
  }
}

Then, how do we know the saved file is from a previous pipeline run? To deal with this, we could adopt a simple naming convention for the saved files (e.g., include the start and end dates in the file name). We could add a function to beethoven to set up the feature cache directory and validate the naming convention, etc.
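For instance, a small helper along these lines could validate and parse such a convention (the features_<start>_<end>.qs pattern below is just an assumption, not an agreed format):

# Hypothetical convention: features_<start>_<end>.qs, e.g. features_20180101_20180630.qs
parse_feature_filename <- function(path) {
  pattern <- "^features_(\\d{8})_(\\d{8})\\.qs$"
  fname <- basename(path)
  if (!grepl(pattern, fname)) {
    stop("File does not follow the feature cache naming convention: ", fname)
  }
  dates <- regmatches(fname, regexec(pattern, fname))[[1]][2:3]
  list(
    start = as.Date(dates[1], format = "%Y%m%d"),
    end   = as.Date(dates[2], format = "%Y%m%d")
  )
}

# e.g., list the periods already cached before deciding what to append:
# lapply(list.files("output/feature_cache", full.names = TRUE), parse_feature_filename)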

The idea I brought up above is half-baked, and I believe similar solutions exist within the targets framework. I will look at the manual and discussions regarding this issue.

@kyle-messier
Collaborator Author

Thanks @sigmafelix. BTW, I think it is a beautiful hand-drawn figure. The solution you present here could work. That said, I'd also like to keep looking at tar_files and tar_files_raw, as I think they may deal with this issue. To be continued...

@mitchellmanware
Collaborator

Thank you for the visual, Insang, it is very helpful. I see the pipeline the same way, where external editing of the configuration file's dates/years triggers the rest of the pipeline to run.

I think it would be difficult to implement dynamic branching for each of the covariate download/processing/calculation steps, but using it at the dt_combined target may alleviate the need for an external directory at the appending step. The way I understand it, the dynamic branches can be triggered by a target meeting a certain "new features" condition, which we can set with a new function. If this "new features" condition is met, the dynamic branch runs the subsequent targets, which can include reading the already existing features onto which the new features are appended. This would not violate the advice against tar_read() within functions because the function would only trigger the branch to run.

Not sure if I have interpreted the documentation correctly, but I am still reading and working on an example.
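In the meantime, the rough shape I have in mind is something like this (detect_new_periods, calculate_features, append_features, load_previous_features, and the path variables are all placeholder names, not existing functions):

list(
  # Hypothetical check target: returns identifiers for periods not yet covered
  tar_target(
    new_periods,
    detect_new_periods(punchcard_path, feature_cache_dir)
  ),
  # Dynamic branching: one branch per new period; zero new periods, zero branches
  tar_target(
    new_features,
    calculate_features(new_periods),
    pattern = map(new_periods)
  ),
  # Combine: append the new branches to the previously saved features, read
  # directly from disk rather than with tar_read() inside the function
  tar_target(
    dt_combined,
    append_features(load_previous_features(feature_cache_dir), new_features)
  )
)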

@eva0marques
Collaborator

eva0marques commented Apr 2, 2024

Thank you @insang.
I would personally add an option in the configuration file so the user can choose between the following at the model step (a rough dispatch sketch follows the list):

  • Use model: use an existing model to simply predict on new data (this option is only possible if a model already exists; will we share our 5-year-trained model in the end?)
  • Update model: update the previously trained models (again, I am not sure this is feasible for all of our base learners because of the algorithms themselves; I am not a specialist on this question, but I have only heard of pre-training for NN architectures)
  • Create model: completely retrain the whole model.
    (I'm sure you'll find better names for this option variable.)
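For illustration only, dispatching on such an option could look roughly like this (run_model_step, update_base_learners, and fit_base_learners are made-up names for a hypothetical "model_mode" configuration value):

run_model_step <- function(model_mode = c("use", "update", "create"),
                           new_data, model_path) {
  model_mode <- match.arg(model_mode)
  switch(
    model_mode,
    use    = predict(readRDS(model_path), new_data),              # existing model, predict only
    update = update_base_learners(readRDS(model_path), new_data), # incremental update, if feasible
    create = fit_base_learners(new_data)                          # full retrain from scratch
  )
}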

@kyle-messier
Collaborator Author

  • We need to check in with OSC and make sure all of the necessary packages are installed on GEO
  • targets
  • crew
  • crew.cluster

@kyle-messier
Collaborator Author

@sigmafelix Here is a code snippet for a two-target combination that reads and saves a SpatRaster object in targets, based off some discussion here

  # Raw OLM clay content file paths
  tar_target(
    olm_clay_files,
    unlist(
      list.files(
        "/Volumes/set/Projects/PrestoGP_Pesticides/input/OpenLandMapData/Clay_Content/",
        pattern = "\\.tif$",
        full.names = TRUE
      )
    )
  ),
  # Cropped SpatRaster; a custom format keeps the terra C++ pointers valid
  tar_target(
    name = olm_clay_crop,
    command = olm_read_crop(olm_clay_files),
    format = tar_format(
      read = function(path) terra::rast(path),
      write = function(object, path) {
        terra::writeRaster(
          x = object, filename = path, filetype = "GTiff", overwrite = TRUE
        )
      },
      marshal = function(object) terra::wrap(object),
      unmarshal = function(object) terra::unwrap(object)
    )
  )

If you try to use a regular format = "file" approach, the C++ pointers get messed up and errors occur. I'm not entirely sold on the unlist(list.files(.)) approach I've implemented here. Obviously for beethoven you have the punchcard, so no hard-coded paths like I have here.

@sigmafelix
Collaborator

@kyle-messier Thank you for pointing me to the file-based workflow. geotargets supports an interface for terra objects, which we could use for projects with single files. For beethoven, we use multiple files per target in most cases; that said, I think keeping different recipes/punchcards/blueprints for different periods with format = "file" via tarchetypes::tar_files_input (similar to what we discussed two weeks ago) is a great option for our case.
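A minimal sketch of that option (the punchcard paths are placeholders):

library(targets)
library(tarchetypes)

list(
  # Track period-specific punchcards as input files; editing or adding a
  # punchcard invalidates only the branches that depend on it
  tar_files_input(
    punchcard_files,
    c(
      "inst/targets/punchcard_2018_2020.csv",
      "inst/targets/punchcard_2021_2022.csv"
    )
  ),
  # One dynamic branch per punchcard downstream
  tar_target(
    punchcard_period,
    read.csv(punchcard_files),
    pattern = map(punchcard_files)
  )
)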

@kyle-messier
Collaborator Author

@sigmafelix geotargets appears to be made exactly for what I mentioned. It also got me wondering more about the functionality of a SpatRasterCollection. That could be useful.

@eva0marques
Collaborator

I didn't follow the conversation, so I apologize if I am off-topic, but I did write functions to convert to and from SpatRasterDataset, which was better suited to adding a time dimension than SpatRasterCollection.

@kyle-messier
Collaborator Author

@eva0marques no worries - I'm not as familiar with the specifics of the SpatRasterDataset vs the SpatRasterCollection. Perhaps we can discuss briefly at the group meeting today.

@eva0marques
Collaborator

eva0marques commented Apr 25, 2024

Sure :)
This is what is explained on the terra website:

A SpatRasterDataset is a collection of sub-datasets, where each is a SpatRaster for the same area (extent) and coordinate reference system, but possibly with a different resolution. Sub-datasets are often used to capture variables (e.g. temperature and precipitation), or a fourth dimension (e.g. height, depth or time) if the sub-datasets already have three dimensions (multiple layers).

A SpatRasterCollection is a collection of SpatRasters with no restriction in the extent or other geometric parameters.
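A toy example of the difference (made-up rasters):

library(terra)

# Two rasters with the same extent and CRS: valid sub-datasets of a SpatRasterDataset
r1 <- rast(nrows = 10, ncols = 10, vals = 1)
r2 <- rast(nrows = 10, ncols = 10, vals = 2)
sds_obj <- sds(r1, r2)
names(sds_obj) <- c("temperature", "precipitation")

# A raster with a different extent: allowed in a SpatRasterCollection,
# which places no restriction on extent or other geometric parameters
r3 <- rast(nrows = 5, ncols = 5, xmin = 0, xmax = 5, ymin = 0, ymax = 5, vals = 3)
sprc_obj <- sprc(r1, r3)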

@sigmafelix
Collaborator

sigmafelix commented May 1, 2024

TODO

  • Write a manual on changing/managing argument list objects and RDS files
  • Options for cluster configurations (related to future.batchtools functions)
  • (Long term) Test crew and crew.cluster
