
Load/Save - Take 2 #1864

Merged: 26 commits merged into dev on Oct 1, 2024
Conversation

@jwildfire (Contributor)

Overview

Redo of #1861

Test Notes/Sample Code

Connected Issues

  • Closes #XXX

@jwildfire (Contributor, Author) left a comment:

I think this is close, but needs a few more tweaks, though maybe most can be done in gsm.template.

R/RunWorkflow.R (Outdated)

```r
if (is.null(lData) && !is.null(lConfig)) {
  cli::cli_alert("No data provided. Attempting to load data from `lConfig`.")
  lData <- LoadData(lWorkflow, lConfig)
if (!is.null(lConfig)) {
```
@jwildfire:

I think I'd like to make this more explicit. Maybe add bReadData and bWriteData parameters that explicitly say whether Save/Load should be run (and then throw a warning if lConfig isn't found).

I can imagine a scenario where you want to do a 'test' run where we load data but don't write out data. Or the opposite, where we manually build inputs in R but then write out the results to disk.
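That flag logic could look something like the sketch below. This is purely illustrative: `RunWorkflowIO()` is a hypothetical wrapper, and `LoadData()`, `SaveData()`, and `RunWorkflowSteps()` here are trivial stand-ins, not the real gsm functions.

```r
# Stand-ins for the real gsm functions, just so the sketch runs:
LoadData <- function(lWorkflow, lConfig) lConfig$data
SaveData <- function(lResult, lConfig) invisible(lResult)
RunWorkflowSteps <- function(lWorkflow, lData) lData  # placeholder pipeline

# Hypothetical signature: explicit read/write flags, warning when a flag
# is set but no `lConfig` is available.
RunWorkflowIO <- function(lWorkflow, lData = NULL, lConfig = NULL,
                          bReadData = is.null(lData),
                          bWriteData = !is.null(lConfig)) {
  if (bReadData) {
    if (is.null(lConfig)) {
      warning("`bReadData` is TRUE but no `lConfig` was provided; skipping load.")
    } else {
      lData <- LoadData(lWorkflow, lConfig)
    }
  }
  lResult <- RunWorkflowSteps(lWorkflow, lData)
  if (bWriteData) {
    if (is.null(lConfig)) {
      warning("`bWriteData` is TRUE but no `lConfig` was provided; skipping save.")
    } else {
      SaveData(lResult, lConfig)
    }
  }
  lResult
}

# The 'test run' scenario: load data from config, but don't write results out.
res <- RunWorkflowIO(
  lWorkflow = list(),
  lConfig = list(data = list(dfEnrolled = data.frame(n = 1:3))),
  bWriteData = FALSE
)
```

Defaulting `bReadData` to `is.null(lData)` and `bWriteData` to `!is.null(lConfig)` would preserve the current implicit behavior while still letting callers override either direction explicitly.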

Comment on lines +2 to +5

```r
# Add all columns to the spec if '_all' is present.
if ('_all' %in% names(columnSpecs)) {
  missingColumnSpecs <- setdiff(names(dfSource), names(columnSpecs))
```

@jwildfire:

I think this is ok, but it means we really don't want to use `_all` for raw-mapped data specs.
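For reference, a runnable sketch of what the truncated `_all` expansion above presumably does: every column in dfSource without an explicit spec inherits the `_all` spec. The function name and the inheritance details are guesses from the visible lines, not the shipped implementation.

```r
# Illustrative sketch; ExpandAllSpec() is a hypothetical name, not a gsm function.
ExpandAllSpec <- function(columnSpecs, dfSource) {
  if ('_all' %in% names(columnSpecs)) {
    missingColumnSpecs <- setdiff(names(dfSource), names(columnSpecs))
    # Columns without an explicit spec inherit the '_all' spec wholesale.
    for (col in missingColumnSpecs) {
      columnSpecs[[col]] <- columnSpecs[['_all']]
    }
    columnSpecs[['_all']] <- NULL
  }
  columnSpecs
}

specs <- ExpandAllSpec(
  columnSpecs = list(subjid = list(type = 'character'), `_all` = list()),
  dfSource = data.frame(subjid = '001', age = 30)
)
```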

Comment on lines 57 to 67

```r
# Apply data types to each column in [ dfMapped ].
for (col in names(dfMapped)) {
  if (col %in% names(columnSpecs)) {
    spec <- columnSpecs[[ col ]]
    if ('type' %in% names(spec)) {
      # TODO: handle character NA values
      dfMapped[[ col ]] <- as(dfMapped[[ col ]], spec$type)
    }
  }
}
```

@jwildfire:

I think we should probably remove this for now. We'd want to add some pretty robust testing to make sure we're not having issues coercing between types.

For now, maybe we just throw a warning if we see a type mismatch?
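The "warn, don't coerce" idea could be as simple as the sketch below. `CheckColumnTypes()` is a hypothetical helper, not a gsm function, and the class comparison is deliberately naive (it only compares the first class name against the spec's `type`).

```r
# Hypothetical sketch: warn on a spec/data type mismatch instead of
# coercing with as(). Not the shipped gsm implementation.
CheckColumnTypes <- function(dfMapped, columnSpecs) {
  for (col in intersect(names(dfMapped), names(columnSpecs))) {
    expected <- columnSpecs[[col]]$type
    actual <- class(dfMapped[[col]])[1]
    if (!is.null(expected) && !identical(actual, expected)) {
      warning(sprintf(
        "Column `%s` is <%s> but the spec says <%s>; leaving it uncoerced.",
        col, actual, expected
      ))
    }
  }
  invisible(dfMapped)
}
```

This keeps the data untouched, so any downstream coercion decision stays explicit rather than silent.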

Comment on lines 41 to 45

```r
path <- domain_config$table %>% gsub('\\{.*}', '**', .)

connection <- DBI::dbConnect(duckdb::duckdb())
lData[[ domain ]] <- connection %>%
  dplyr::tbl(glue::glue("read_csv('{path}', all_varchar = true)")) %>%
```
@jwildfire:

Hmm, guessing this is driving the type-coercion above. Let's discuss.

```yaml
clindata: clindata
domains:
  Raw_STUDY:
    db: clindata
```
@jwildfire:

can we just pull this from line 9 above?

Comment on lines 3 to 5

```yaml
clindata: clindata
local: ./data
s3: s3://AA-AA-000-0000/data
```
@jwildfire:

I think maybe we just specify what is in use for the study here. For this example, probably:

```yaml
db:
  type: local
  path: ./data
```

@jwildfire:

We can add a vignette (in gsm.template) outlining the options.

@jwildfire:

Actually, I guess there can be multiple relevant dbs, so we might need to keep it as is.

Comment on lines 6 to 12

```yaml
schemas:
  Raw:
    db:
      clindata: clindata
    domains:
      Raw_STUDY:
        db: clindata
```
@jwildfire:

Wonder if we can simplify by just using the info for the current study:

```yaml
schemas:
  Raw:
    db: clindata
    domains:
      - Raw_Study: ctms_study
      - Raw_Site: ctms_site
      - ...
```

Comment on lines 48 to 50

```yaml
db:
  local: ./data/2024-09-26/Mapped/{ID}.csv
  s3: s3://AA-AA-000-0000/data/2024-09-26/Mapped/{ID}.parquet
```
@jwildfire:

probably just need one of these for any given study, right?

Comment on lines 132 to 142
modules:
- id: report_kri_site
repo: Gilead-BioStats/gsm
path: inst/workflow/reports/report_kri_site.yaml
version: dev
package: gsm
file: workflow/reports/report_kri_site.yaml
metrics:
- id: kri0001
repo: Gilead-BioStats/gsm
path: inst/workflow/metrics/kri0001.yaml
@jwildfire:

I wonder if we really need this. It feels like we could have almost all of this in the workflow yaml header and then just use standard paths. Maybe add it in the future if we need customization.

```yaml
package: gsm
file: workflow/metrics/kri0004.yaml
SnapshotDate: '2024-09-26'
domains:
```
@jwildfire:

domains feels duplicative of some of the stuff above...

R/RunWorkflow.R (Outdated)

```r
@@ -90,6 +94,14 @@ RunWorkflow <- function(

lWorkflow$lData[[step$output]] <- result
lWorkflow$lResult <- result

if (!is.null(step$save) && step$save) {
```
@jwildfire:

I think we should get rid of step$save.

Instead, I think we can probably support saving a list of data by adding some basic iteration in SaveData(). Something along the lines of:

```r
# Check data.frame first: a data.frame is itself a list, so the order matters.
if (is.data.frame(lWorkflow$lResult)) {
  # existing code goes here
} else if (is.list(lWorkflow$lResult)) {
  lWorkflow$lResult %>% purrr::map(SaveData)
} else {
  cli::cli_warn("data format not supported, not saving")
}
```

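A runnable version of that dispatch idea, with a stub standing in for the real SaveData(); `SaveResult()` and `SaveDataStub()` are hypothetical names for illustration only. Note that the data.frame branch has to be checked before the list branch, since every data.frame is also a list.

```r
# Sketch only: SaveDataStub() stands in for the real SaveData(), and
# SaveResult() is a hypothetical wrapper, not current gsm code.
saved <- list()
SaveDataStub <- function(df) {
  saved[[length(saved) + 1]] <<- df  # record what 'would' be written out
  invisible(df)
}

SaveResult <- function(lResult) {
  # data.frame check must come first: a data.frame is also a list.
  if (is.data.frame(lResult)) {
    SaveDataStub(lResult)
  } else if (is.list(lResult)) {
    lapply(lResult, SaveResult)  # recurse so nested lists of tables work too
  } else {
    warning("data format not supported, not saving")
  }
  invisible(lResult)
}

SaveResult(list(dfSummary = data.frame(x = 1), dfFlagged = data.frame(y = 2)))
```

Recursing rather than mapping once would also cover workflows whose result is a list of lists of tables.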
A contributor replied:

So does this mean we are putting the last step of the metric yamls back in, where we write all tables out as a list? In the current version, this step has been removed.

@lauramaxwell (Contributor) replied on Oct 1, 2024:

Alternatively, we just set bReturnResult to FALSE when running metrics locally and keep the save field, but add some logic so that it only triggers SaveData() when lConfig is not null.

@samussiah (Contributor):

@jwildfire, @lauramaxwell: load/save has been removed from this branch, so it's ready to review.

@lauramaxwell (Contributor):

Tests and checks are passing now! I think everything looks good, but I'll give it another look before merging.

@lauramaxwell (Contributor) left a comment:

This looks good!

@lauramaxwell lauramaxwell merged commit 97d7c18 into dev Oct 1, 2024
5 checks passed
@lauramaxwell lauramaxwell deleted the template-refactor-take2 branch October 1, 2024 19:26
3 participants