Load/Save - Take 2 #1864
Conversation
I think this is close, but needs a few more tweaks - though maybe most can be done in `gsm.template`.
R/RunWorkflow.R
Outdated
```r
if (is.null(lData) && !is.null(lConfig)) {
  cli::cli_alert("No data provided. Attempting to load data from `lConfig`.")
  lData <- LoadData(lWorkflow, lConfig)
if (!is.null(lConfig)) {
```
I think I'd like to make this more explicit. Maybe add `bReadData` and `bWriteData` parameters that explicitly say whether Save/Load should be run (and then throw a warning if `lConfig` isn't found).

I can imagine a scenario where you want to do a 'test' run where we load data, but don't write out data. Or the opposite, where we manually build inputs in R, but then write out the results to disk.
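A minimal sketch of how those flags might wire into `RunWorkflow()` (the parameter names come from the comment above; the signature, warning text, and control flow are assumptions, not the actual implementation):

```r
RunWorkflow <- function(lWorkflow, lData = NULL, lConfig = NULL,
                        bReadData = TRUE, bWriteData = TRUE) {
  # Load inputs only when explicitly requested.
  if (bReadData && is.null(lData)) {
    if (is.null(lConfig)) {
      warning("`bReadData` is TRUE but `lConfig` was not provided; skipping load.")
    } else {
      lData <- LoadData(lWorkflow, lConfig)
    }
  }

  # ... run workflow steps ...

  # Write outputs only when explicitly requested.
  if (bWriteData) {
    if (is.null(lConfig)) {
      warning("`bWriteData` is TRUE but `lConfig` was not provided; skipping save.")
    } else {
      SaveData(lWorkflow, lConfig)
    }
  }
}
```

This supports both scenarios above: `bWriteData = FALSE` for a test run that only loads, and `bReadData = FALSE` when inputs are built manually in R but results should still be written to disk.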
```r
# Add all columns to the spec if '_all' is present.
if ('_all' %in% names(columnSpecs)) {
  missingColumnSpecs <- setdiff(names(dfSource), names(columnSpecs))
```
I think this is ok, but it means we really don't want to use `_all` for raw-mapped data specs.
R/util-ApplySpec.R
Outdated
```r
# Apply data types to each column in [ dfMapped ].
for (col in names(dfMapped)) {
  if (col %in% names(columnSpecs)) {
    spec <- columnSpecs[[ col ]]
    if ('type' %in% names(spec)) {
      # TODO: handle character NA values
      dfMapped[[ col ]] <- as(dfMapped[[ col ]], spec$type)
    }
  }
}
```
I think we should probably remove this for now. We'd want to add some pretty robust testing to make sure we're not having issues coercing between types.

For now, maybe we just throw a warning if we see a type mismatch?
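A sketch of the warn-instead-of-coerce idea (assumes the same `dfMapped`/`columnSpecs` shapes as the loop above; the message wording is an assumption):

```r
# Instead of coercing with as(), compare each column's current class to the
# spec and warn on mismatch, leaving the data untouched.
for (col in intersect(names(dfMapped), names(columnSpecs))) {
  spec <- columnSpecs[[ col ]]
  if (!is.null(spec$type) && !is(dfMapped[[ col ]], spec$type)) {
    warning(sprintf(
      "Column `%s` has class `%s` but the spec expects `%s`.",
      col, class(dfMapped[[ col ]])[1], spec$type
    ))
  }
}
```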
R/util-LoadData.R
Outdated
```r
path <- domain_config$table %>% gsub('\\{.*}', '**', .)

connection <- DBI::dbConnect(duckdb::duckdb())
lData[[ domain ]] <- connection %>%
  dplyr::tbl(glue::glue("read_csv('{path}', all_varchar = true)")) %>%
```
Hmm, guessing this is driving the type-coercion above. Let's discuss.
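One alternative worth discussing, assuming the same duckdb/dplyr setup as above: drop `all_varchar = true` and let DuckDB's CSV sniffer infer column types at read time, which could remove the need for the coercion loop entirely (a sketch, not the current implementation):

```r
connection <- DBI::dbConnect(duckdb::duckdb())
lData[[ domain ]] <- connection %>%
  # Without all_varchar, DuckDB auto-detects column types from the file.
  dplyr::tbl(glue::glue("read_csv('{path}')")) %>%
  dplyr::collect()
```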
```yaml
clindata: clindata
domains:
  Raw_STUDY:
    db: clindata
```
can we just pull this from line 9 above?
```yaml
clindata: clindata
local: ./data
s3: s3://AA-AA-000-0000/data
```
I think maybe we just specify what is in use for the study here. For this example, probably:

```yaml
db:
  type: local
  path: ./data
```
We can add a vignette (in gsm.template) outlining the options.
Actually, I guess there can be multiple relevant dbs, so might need to keep it as is.
```yaml
schemas:
  Raw:
    db:
      clindata: clindata
    domains:
      Raw_STUDY:
        db: clindata
```
Wonder if we can simplify by just using the info for the current study:

```yaml
schemas:
  Raw:
    db: clindata
    domains:
      - Raw_Study: ctms_study
      - Raw_Site: ctms_site
      - ...
```
```yaml
db:
  local: ./data/2024-09-26/Mapped/{ID}.csv
  s3: s3://AA-AA-000-0000/data/2024-09-26/Mapped/{ID}.parquet
```
probably just need one of these for any given study, right?
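If only one target is in use per study, the mapped-output block could collapse to something like this (path copied from above; the `type`/`path` key names are an assumption mirroring the earlier suggestion):

```yaml
db:
  type: local
  path: ./data/2024-09-26/Mapped/{ID}.csv
```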
```yaml
modules:
  - id: report_kri_site
    repo: Gilead-BioStats/gsm
    path: inst/workflow/reports/report_kri_site.yaml
    version: dev
    package: gsm
    file: workflow/reports/report_kri_site.yaml
metrics:
  - id: kri0001
    repo: Gilead-BioStats/gsm
    path: inst/workflow/metrics/kri0001.yaml
```
I wonder if we really need this. It feels like we could have almost all of this in the workflow yaml header, and then just use standard paths ... Maybe add it in the future if we need customization.
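For discussion, a hedged sketch of a workflow yaml header that carries its own metadata, so the module/metric listing above becomes redundant (all field names here are assumptions, not an existing gsm schema):

```yaml
meta:
  ID: kri0001
  package: gsm
  file: workflow/metrics/kri0001.yaml   # standard path derived from package + ID
```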
```yaml
package: gsm
file: workflow/metrics/kri0004.yaml
SnapshotDate: '2024-09-26'
domains:
```
`domains` feels duplicative of some of the stuff above ...
R/RunWorkflow.R
Outdated
```r
@@ -90,6 +94,14 @@ RunWorkflow <- function(

lWorkflow$lData[[step$output]] <- result
lWorkflow$lResult <- result

if (!is.null(step$save) && step$save) {
```
I think we should get rid of `step$save`.

Instead, I think we can probably support saving a list of data by adding some basic iteration in `SaveData()`. Something along the lines of:

```r
if (is.list(lWorkflow$result)) {
  lWorkflow$result %>% purrr::map(SaveData)
} else if (is.data.frame(lWorkflow$result)) {
  # existing code goes here
} else {
  cli::cli_warn("data format not supported, not saving")
}
```
So does this mean we are putting the last step of the metric yamls back in, where we write all tables out as a list? In the current version, that step has been removed.
Alternatively, we just set `bReturnResult` to FALSE when running metrics locally and keep the `save` field, but add some logic so that it only triggers a `SaveData()` when `lConfig` is not null.
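That alternative could look roughly like this inside the step loop (a sketch built from the diff above; gating on `lConfig` is the new part, and the `SaveData()` arguments are assumptions):

```r
# Keep the `save` field, but only actually write when a config is available.
if (!is.null(step$save) && step$save) {
  if (is.null(lConfig)) {
    cli::cli_warn("`save` requested but no `lConfig` provided; skipping SaveData().")
  } else {
    SaveData(result, lConfig)
  }
}
```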
@jwildfire, @lauramaxwell - load/save has been removed from this branch so it's ready to review.

Tests and checks are passing now! I think everything looks good, but I'll give it another look before merging.
This looks good!
Overview
Redo of #1861
Test Notes/Sample Code
Connected Issues