Load/Save - Take 2 #1864
Conversation
I think this is close, but needs a few more tweaks - though maybe most can be done in `gsm.template`.
R/RunWorkflow.R
Outdated
```r
if (is.null(lData) && !is.null(lConfig)) {
  cli::cli_alert("No data provided. Attempting to load data from `lConfig`.")
  lData <- LoadData(lWorkflow, lConfig)
if (!is.null(lConfig)) {
```
I think I'd like to make this more explicit. Maybe add `bReadData` and `bWriteData` parameters that explicitly say whether Save/Load should be run (and then throw a warning if `lConfig` isn't found).

I can imagine a scenario where you want to do a 'test' run where we load data, but don't write out data. Or the opposite, where we manually build inputs in R, but then write out the results to disk.
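A minimal sketch of how those flags might wire into `RunWorkflow()` (the parameter names come from the comment above; the signature, warning text, and control flow are assumptions, not the actual implementation):

```r
RunWorkflow <- function(lWorkflow, lData = NULL, lConfig = NULL,
                        bReadData = TRUE, bWriteData = TRUE) {
  # Load inputs only when explicitly requested.
  if (bReadData && is.null(lData)) {
    if (is.null(lConfig)) {
      warning("`bReadData` is TRUE but `lConfig` was not provided; skipping load.")
    } else {
      lData <- LoadData(lWorkflow, lConfig)
    }
  }

  # ... run workflow steps ...

  # Write outputs only when explicitly requested.
  if (bWriteData) {
    if (is.null(lConfig)) {
      warning("`bWriteData` is TRUE but `lConfig` was not provided; skipping save.")
    } else {
      SaveData(lWorkflow, lConfig)
    }
  }
}
```

This supports both scenarios above: `bWriteData = FALSE` for a test run that only loads, and `bReadData = FALSE` when inputs are built manually in R but results should still be written to disk.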
```r
# Add all columns to the spec if '_all' is present.
if ('_all' %in% names(columnSpecs)) {
  missingColumnSpecs <- setdiff(names(dfSource), names(columnSpecs))
```
I think this is ok, but it means we really don't want to use `_all` for raw-mapped data specs.
R/util-ApplySpec.R
Outdated
```r
# Apply data types to each column in [ dfMapped ].
for (col in names(dfMapped)) {
  if (col %in% names(columnSpecs)) {
    spec <- columnSpecs[[ col ]]
    if ('type' %in% names(spec)) {
      # TODO: handle character NA values
      dfMapped[[ col ]] <- as(dfMapped[[ col ]], spec$type)
    }
  }
}
```
I think we should probably remove this for now. We'd want to add some pretty robust testing to make sure we're not having issues coercing between types.

For now, maybe we just throw a warning if we see a type mismatch?
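A sketch of the warn-instead-of-coerce idea (assumes the same `dfMapped`/`columnSpecs` shapes as the loop above; the message wording is an assumption):

```r
# Instead of coercing with as(), compare each column's current class to the
# spec and warn on mismatch, leaving the data untouched.
for (col in intersect(names(dfMapped), names(columnSpecs))) {
  spec <- columnSpecs[[ col ]]
  if (!is.null(spec$type) && !is(dfMapped[[ col ]], spec$type)) {
    warning(sprintf(
      "Column `%s` has class `%s` but the spec expects `%s`.",
      col, class(dfMapped[[ col ]])[1], spec$type
    ))
  }
}
```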
R/util-LoadData.R
Outdated
```r
path <- domain_config$table %>% gsub('\\{.*}', '**', .)

connection <- DBI::dbConnect(duckdb::duckdb())
lData[[ domain ]] <- connection %>%
  dplyr::tbl(glue::glue("read_csv('{path}', all_varchar = true)")) %>%
```
Hmm, guessing this is driving the type-coercion above. Let's discuss.
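One alternative worth discussing, assuming the same duckdb/dplyr setup as above: drop `all_varchar = true` and let DuckDB's CSV sniffer infer column types at read time, which could remove the need for the coercion loop entirely (a sketch, not the current implementation):

```r
connection <- DBI::dbConnect(duckdb::duckdb())
lData[[ domain ]] <- connection %>%
  # Without all_varchar, DuckDB auto-detects column types from the file.
  dplyr::tbl(glue::glue("read_csv('{path}')")) %>%
  dplyr::collect()
```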
```yaml
clindata: clindata
domains:
  Raw_STUDY:
    db: clindata
```
can we just pull this from line 9 above?
```yaml
clindata: clindata
local: ./data
s3: s3://AA-AA-000-0000/data
```
I think maybe we just specify what is in use for the study here. For this example, probably:

```yaml
db:
  type: local
  path: ./data
```
We can add a vignette (in gsm.template) outlining the options.
Actually, I guess there can be multiple relevant dbs, so might need to keep it as is.
```yaml
schemas:
  Raw:
    db:
      clindata: clindata
    domains:
      Raw_STUDY:
        db: clindata
```
Wonder if we can simplify by just using the info for the current study:

```yaml
schemas:
  Raw:
    db: clindata
    domains:
      - Raw_Study: ctms_study
      - Raw_Site: ctms_site
      - ...
```
```yaml
db:
  local: ./data/2024-09-26/Mapped/{ID}.csv
  s3: s3://AA-AA-000-0000/data/2024-09-26/Mapped/{ID}.parquet
```
probably just need one of these for any given study, right?
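If only one target is in use per study, the mapped-output block could collapse to something like this (path copied from above; the `type`/`path` key names are an assumption mirroring the earlier suggestion):

```yaml
db:
  type: local
  path: ./data/2024-09-26/Mapped/{ID}.csv
```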
```yaml
modules:
  - id: report_kri_site
    repo: Gilead-BioStats/gsm
    path: inst/workflow/reports/report_kri_site.yaml
    version: dev
    package: gsm
    file: workflow/reports/report_kri_site.yaml
metrics:
  - id: kri0001
    repo: Gilead-BioStats/gsm
    path: inst/workflow/metrics/kri0001.yaml
```
I wonder if we really need this. It feels like we could have almost all of this in the workflow yaml header, and then just use standard paths ... Maybe add it in the future if we need customization.
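For discussion, a hedged sketch of a workflow yaml header that carries its own metadata, so the module/metric listing above becomes redundant (all field names here are assumptions, not an existing gsm schema):

```yaml
meta:
  ID: kri0001
  package: gsm
  file: workflow/metrics/kri0001.yaml   # standard path derived from package + ID
```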
```yaml
package: gsm
file: workflow/metrics/kri0004.yaml
SnapshotDate: '2024-09-26'
domains:
```
`domains` feels duplicative of some of the stuff above ...
R/RunWorkflow.R
Outdated
```r
@@ -90,6 +94,14 @@ RunWorkflow <- function(

lWorkflow$lData[[step$output]] <- result
lWorkflow$lResult <- result

if (!is.null(step$save) && step$save) {
```
I think we should get rid of `step$save`.

Instead, I think we can probably support saving a list of data by adding some basic iteration in `SaveData()`. Something along the lines of:

```r
if (is.list(lWorkflow$result)) {
  lWorkflow$result %>% purrr::map(SaveData)
} else if (is.data.frame(lWorkflow$result)) {
  # existing code goes here
} else {
  cli::cli_warn("data format not supported, not saving")
}
```
So does this mean we are putting the last step of the metric yamls back in, where we write all tables out as a list? In the current version, that step has been removed.
Alternatively, we just set `bReturnResult` to FALSE when running metrics locally and keep the `save` field, but add some logic so that it only triggers a `SaveData()` when `lConfig` is not null.
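That alternative could look roughly like this inside the step loop (a sketch built from the diff above; gating on `lConfig` is the new part, and the `SaveData()` arguments are assumptions):

```r
# Keep the `save` field, but only actually write when a config is available.
if (!is.null(step$save) && step$save) {
  if (is.null(lConfig)) {
    cli::cli_warn("`save` requested but no `lConfig` provided; skipping SaveData().")
  } else {
    SaveData(result, lConfig)
  }
}
```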
@jwildfire, @lauramaxwell - load/save has been removed from this branch so it's ready to review.

Tests and checks are passing now! I think everything looks good, but I'll give it another look before merging.
This looks good!
Overview
Redo of #1861
Test Notes/Sample Code
Connected Issues