updating and appending pipeline #321
@sigmafelix @eva0marques @mitchellmanware @dzilber @dawranadeep @larapclark I've been exploring how we can implement pipeline updates when new data become available while also being smart enough to only run the new data. For example, when calculating covariates with temporality, we only need to calculate the new 6 months and then append them. So where does that leave us? I think we need to implement the checks, run, and append steps in our own R functions in the …
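A minimal sketch of the "check, run, append" idea described above; the function name, the file-naming scheme, and calc_fun are hypothetical, not taken from the repository:

# Hypothetical sketch: compute covariates only for periods with no saved
# output yet, then append them to the results of earlier runs.
update_features <- function(all_periods, dir_features, calc_fun) {
  done_files <- list.files(dir_features, pattern = "\\.qs$", full.names = TRUE)
  done_periods <- gsub("^features_|\\.qs$", "", basename(done_files))
  new_periods <- setdiff(all_periods, done_periods)
  new_feats <- lapply(new_periods, calc_fun)     # run only the new periods
  prev_feats <- lapply(done_files, qs::qread)    # load what already exists
  data.table::rbindlist(c(prev_feats, new_feats), fill = TRUE)
}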
Documenting helpful discussion of tar_files and tar_files_raw.
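For reference, a minimal sketch of the tarchetypes::tar_files() pattern under discussion, with illustrative paths and target names:

# _targets.R sketch: tar_files() creates one target that lists the raw files
# and a second that tracks their contents, so downstream dynamic branches
# rerun only for files that are new or have changed.
library(targets)
library(tarchetypes)

list(
  tar_files(
    raw_aqs_files,
    list.files("input/aqs", pattern = "\\.csv$", full.names = TRUE)
  ),
  tar_target(
    aqs_data,
    read.csv(raw_aqs_files),
    pattern = map(raw_aqs_files)
  )
)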
I want to clarify what I said in today's discussion. The picture above depicts the overall structure of the pipeline I'm working on (sorry for the bad handwriting). In my understanding, the pipeline is supposed to represent a static procedure of data analytics; any dynamic components (i.e., periodic updates in raw data) should be declared externally, where my configuration csv file (aka punchcard.csv) works for invalidating nodes by reading the configuration file with … One challenge is that we want to update the feature data and the models periodically, which is difficult to do with …

#' Append previous data
post_append <- function(present_features, path_present_features, dir_features) {
  # Files saved by previous pipeline runs
  nfiles <- list.files(path = dir_features, full.names = TRUE)
  if (length(nfiles) > 0) {
    # Read previously saved feature tables and append the current features
    # (present_features is assumed to be a single table, so wrap it in list())
    feats_prev <- lapply(nfiles, qs::qread)
    feats_prev <- append(feats_prev, list(present_features))
    feats_up <- data.table::rbindlist(feats_prev, fill = TRUE)
    return(feats_up)
  } else {
    # First run: nothing to append, just save the current features
    qs::qsave(present_features, path_present_features)
    return(present_features)
  }
}
Then, how do we know the saved file is from the previous pipeline run? To deal with this issue, we could take a simple approach of naming the file to be saved accordingly (e.g., including the start and end dates in the file name). We could add a function to … The idea I brought up above is half-baked, and I believe there exist similar solutions in …
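A small sketch of the naming idea; the file-name pattern and helper names are hypothetical:

# Encode the covered date range in the saved file name so a later run can
# tell what the previous pipeline run produced (pattern is an assumption).
feature_file_name <- function(dir_features, date_start, date_end) {
  file.path(dir_features, sprintf("features_%s_%s.qs", date_start, date_end))
}

# Recover the start/end dates covered by earlier runs from the file names.
covered_ranges <- function(dir_features) {
  f <- list.files(dir_features, pattern = "^features_.*\\.qs$")
  do.call(rbind, strsplit(gsub("^features_|\\.qs$", "", f), "_"))
}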
Thanks @sigmafelix. BTW, I think it is a beautiful hand-drawn figure. I think the solution you present here could work, although I'd also like to continue to look at …
Thank you for the visual, Insang, it is very helpful. I see the pipeline in the same way, where external editing of the configuration file dates/years triggers the rest of the pipeline to run. I think it would be difficult to implement the dynamic branching for each of the covariate download/processing/calculation steps, but using it at the … Not sure if I have interpreted the documentation correctly, but I am still reading and working on an example.
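As an illustration of the configuration-file trigger described above, a sketch that branches over the rows of punchcard.csv; the column names and calculate_features() are assumptions:

# _targets.R sketch: one dynamic branch per row of the configuration file,
# so adding a new period to punchcard.csv only creates and runs a new branch.
library(targets)
library(tarchetypes)

list(
  tar_file(punchcard_file, "punchcard.csv"),
  tar_target(punchcard, read.csv(punchcard_file)),
  tar_target(
    feat_period,
    calculate_features(punchcard$date_start, punchcard$date_end),
    pattern = map(punchcard)
  )
)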
Thank you @insang.
@sigmafelix Here is a code snippet for a 2-target combination that reads and saves a SpatRaster object in targets, based off some discussion here:

tar_target(
  olm_clay_files,
  unlist(list.files(
    "/Volumes/set/Projects/PrestoGP_Pesticides/input/OpenLandMapData/Clay_Content/",
    pattern = "\\.tif$",
    full.names = TRUE
  ))
),
tar_target(
  # These targets are the raw OLM files
  name = olm_clay_crop,
  command = olm_read_crop(olm_clay_files),
  format = tar_format(
    read = function(path) terra::rast(path),
    write = function(object, path) {
      terra::writeRaster(x = object, filename = path, filetype = "GTiff", overwrite = TRUE)
    },
    marshal = function(object) terra::wrap(object),
    unmarshal = function(object) terra::unwrap(object)
  )
)
If you try to use a regular …
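For context, a SpatRaster holds an external pointer to C++ data, so it cannot go through the default RDS serialization; the wrap()/unwrap() marshal functions above convert it to and from a packed form that can. olm_read_crop() itself is not shown in the thread; a hypothetical stand-in, with the crop extent as an assumption:

# Hypothetical stand-in for olm_read_crop(), which is not defined in this
# thread: read the OLM clay rasters and crop them to a study-area extent.
olm_read_crop <- function(files, extent = terra::ext(-126, -66, 23, 51)) {
  r <- terra::rast(files)   # multi-layer SpatRaster from the .tif files
  terra::crop(r, extent)    # assumed CONUS-ish bounding box
}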
@kyle-messier Thank you for pointing me to the file-based workflow. geotargets supports an interface for …
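A sketch of what the geotargets route could look like, assuming its tar_terra_rast() target factory; details beyond the name/command arguments are not verified here:

# Sketch: geotargets wraps the read/write/marshal boilerplate from the
# tar_format() example above into a single target factory for SpatRaster.
library(targets)
library(geotargets)

list(
  tar_target(
    olm_clay_files,
    list.files(
      "/Volumes/set/Projects/PrestoGP_Pesticides/input/OpenLandMapData/Clay_Content/",
      pattern = "\\.tif$",
      full.names = TRUE
    )
  ),
  tar_terra_rast(
    olm_clay_crop,
    olm_read_crop(olm_clay_files)
  )
)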
@sigmafelix …
I didn't follow the conversation, so I apologize if I am off-topic, but I did write functions to convert to and from …
@eva0marques no worries - I'm not as familiar with the specifics of the …
Sure :) …
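The object classes in the two comments above are cut off; purely to illustrate the general "convert to a plain, serializable structure and back" idea, a sketch assuming a SpatRaster/data.frame round trip:

# Illustrative round trip only (the actual classes discussed above are cut
# off): a plain data.frame serializes with the default targets formats,
# and the raster can be rebuilt from it afterwards.
library(terra)

rast_to_df <- function(r) {
  as.data.frame(r, xy = TRUE, na.rm = FALSE)
}

df_to_rast <- function(df, crs) {
  rast(df, type = "xyz", crs = crs)
}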
TODO
Approaches and discussion on how we implement an updatable pipeline when new AQS data becomes available