Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move vars_dict out of this package to a shared location #2

Closed
dfsnow opened this issue Jul 18, 2023 · 4 comments · Fixed by #6
Closed

Move vars_dict out of this package to a shared location #2

dfsnow opened this issue Jul 18, 2023 · 4 comments · Fixed by #6
Assignees

Comments

@dfsnow
Copy link
Member

dfsnow commented Jul 18, 2023

The vars_dict dataset in this package is used heavily by ccao-data/model-res-avm and ccao-data/model-condo-avm. However, iterating with the model requires us to constantly update this dictionary (and thus the package). Additionally, keeping the dictionary up-to-date is tedious, manual process.

I propose we automate the creation and maintenance of this dictionary and move it out of this package on to S3. We can use Glue and other AWS APIs to construct it on a schedule, then store it in a public bucket.

@jeancochrane
Copy link
Contributor

Here's a brief proposal for how we might do this!

Defining the variables

Variable definitions will be stored as metadata on our dbt model definitions, with a meta entry containing var_* attributes for each column in the columns key. (Example to follow shortly). We'll need to make a few schema changes in order to support mapping the rows in vars_dict.csv to a dbt meta object, and vice versa:

  • The notes column of the vars dict will be moved into the description attribute for the column (outside of meta) and used to document the column.
  • Rows will be aggregated such that there is only one variable object per column in the dbt model; in particular, categorical variables that span multiple rows in vars_dict.csv (one for each variable code) will be refactored such that the var_code, var_value, and var_value_short attributes are combined into a single var_codes object, where each var_code value is its own key with a var_value and var_value_short attribute.

Below is an example of the proposed definition of the default.vw_card_res_char.char_air feature. Here's what the original rows look like in vars_dict.csv:

var_from_source var_from_table var_from_ctas var_from_view var_name_hie var_name_iasworld var_name_athena var_name_model var_name_publish var_name_pretty var_type var_data_type var_model_type var_is_published var_is_predictor var_code var_value var_value_short var_notes
iasWorld iasworld.dweldat default.vw_card_res_char qu_air user7 char_air char_air central_air Central Air Conditioning char categorical res TRUE TRUE 1 Central A/C YES Whether the unit has central air conditioning.
iasWorld iasworld.dweldat default.vw_card_res_char qu_air user7 char_air char_air central_air Central Air Conditioning char categorical res TRUE TRUE 2 No Central A/C NO Whether the unit has central air conditioning.

Here's what that would look like converted to a dbt meta object in dbt/models/default/schema.yml:

diff --git a/dbt/models/default/schema.yml b/dbt/models/default/schema.yml
index c031e78..22d6aa1 100644
--- a/dbt/models/default/schema.yml
+++ b/dbt/models/default/schema.yml
@@ -66,6 +66,32 @@ models:
             error_if: ">266758"
   - name: default.vw_card_res_char
     description: '{{ doc("vw_card_res_char") }}'
+    columns:
+      - name: char_air
+        description: Whether the unit has central air conditioning.
+        meta:
+          var_from_source: iasWorld
+          var_from_table: iasworld.dweldat
+          var_from_ctas: null
+          var_from_view: default.vw_card_res_char
+          var_name_hie: qu_air
+          var_name_iasworld: user7
+          var_name_athena: char_air
+          var_name_model: char_air
+          var_name_publish: central_air
+          var_name_pretty: Central Air Conditioning
+          var_type: char
+          var_data_type: categorical
+          var_model_type: res
+          var_is_published: true
+          var_is_predictor: true
+          var_codes:
+            1:
+              var_value: Central A/C
+              var_value_short: "YES"
+            2:
+              var_value: No Central A/C
+              var_value_short: "NO"
     tests:
       # Unique by card and year
       - dbt_utils.unique_combination_of_columns:

It will probably be most efficient to write a script to do this for the 500+ rows in vars_dict.csv.

Compiling the variables into the vars dict

Whenever dbt compiles the DAG, it produces a new manifest.json file with the full DAG definition; per the dbt docs, any resource metadata stored in a meta field will get compiled into the manifest as well. I confirmed this locally by running:

$ jq '.nodes."model.athena.default.vw_card_res_char".columns.char_air.meta' target/manifest.json

And the output:

{
  "var_from_source": "iasWorld",
  "var_from_table": "iasworld.dweldat",
  "var_from_ctas": null,
  "var_from_view": "default.vw_card_res_char",
  "var_name_hie": "qu_air",
  "var_name_iasworld": "user7",
  "var_name_athena": "char_air",
  "var_name_model": "char_air",
  "var_name_publish": "central_air",
  "var_name_pretty": "Central Air Conditioning",
  "var_type": "char",
  "var_data_type": "categorical",
  "var_model_type": "res",
  "var_is_published": true,
  "var_is_predictor": true,
  "var_codes": {
    "1": {
      "var_value": "Central A/C",
      "var_value_short": "YES"
    },
    "2": {
      "var_value": "No Central A/C",
      "var_value_short": "NO"
    }
  }
}

We can write a simple script that parses a manifest.json file to parse these metadata objects and compile them into output rows matching the format of vars_dict.csv. This will essentially be the reverse of the operation described in Defining the variables above, so it might make sense to write one conversion script that can perform both operations.

Deployment

Once we have metadata compiled to the dbt manifest and a script to convert it into the vars dict, we can write a GitHub workflow that runs on each commit to the main branch and pushes the dict to S3 remote storage in the ccao-data-public-us-east-1 bucket under the vars_dict prefix.

We would like to version the vars dict according to the git commit that generated it. As such, the workflow will push two copies of the dict to S3: one using the short commit hash (e.g. ccao-data-public-us-east-1/vars_dict/3798bfa.csv) and one using the reserved latest key (e.g. ccao-data-public-us-east-1/vars_dict/latest.csv). This way, data consumers can pin to a specific version of the dict if they so desire.

To avoid unnecessary versions, the workflow will only push a new version if the CSV produced by the conversion script differs from latest.csv in S3.

Incorporating the remote CSV into the ccao package

We will update the ccao package so that it pulls vars_dict from the remote S3 bucket. (This is the step that gives me the least confidence, since I haven't yet written an R package!)

We will define a new get_vars_dict function that accepts an optional version argument. Data consumers will be able to use this function to retrieve a specific version of the vars dict according to the short hash of the git commit that generated it. If no version is supplied, get_vars_dict should return the latest version.

When the package is first imported, it should call get_vars_dict() to pull the latest vars dict from S3 and save it to the vars_dict variable. This variable will then be available in memory for use by package internals or package consumers. We may also want to consider passing a version into this initial call so that each published version of the ccao package refers to a specific version of the vars dict by default, but I'm unsure on this point.

@jeancochrane
Copy link
Contributor

Curious what @dfsnow and @wrridgeway think about my proposal above!

@dfsnow
Copy link
Member Author

dfsnow commented Sep 7, 2023

@jeancochrane See my notes on each section:

Defining the variables

  • Re: notes. Is there a way to store separate description (a high-level description of the field) and notes (caveats, long description, etc.) columns?
  • The aggregation of rows also makes sense. However, since a lot of the stuff in vars_dict will already be represented by dbt, I think we can drop a few of the columns. See diff and comments within it:
diff --git a/dbt/models/default/schema.yml b/dbt/models/default/schema.yml
index c031e78..22d6aa1 100644
--- a/dbt/models/default/schema.yml
+++ b/dbt/models/default/schema.yml
@@ -66,6 +66,32 @@ models:
             error_if: ">266758"
   - name: default.vw_card_res_char
     description: '{{ doc("vw_card_res_char") }}'
+    columns:
+      - name: char_air
+        description: Whether the unit has central air conditioning.
+        meta:
+          var_from_source: iasWorld                   # Perhaps we can move this into dbt's loader field?
-          var_from_table: iasworld.dweldat            # All of these should be able to be imputed from dbt
-          var_from_ctas: null
-          var_from_view: default.vw_card_res_char
+          var_name_hie: qu_air
+          var_name_iasworld: user7
+          var_name_athena: char_air
+          var_name_model: char_air
+          var_name_publish: central_air
+          var_name_pretty: Central Air Conditioning
+          var_type: char                              # This maps to the prefix used when renaming  
+          var_data_type: categorical                  # This maps to the R data type, so must stay
-          var_model_type: res
-          var_is_published: true                      # No longer really used
-          var_is_predictor: true                      # Built into the model pipeline now
+          var_codes:
+            1:
+              var_value: Central A/C
+              var_value_short: "YES"
+            2:
+              var_value: No Central A/C
+              var_value_short: "NO"
     tests:
       # Unique by card and year
       - dbt_utils.unique_combination_of_columns:

Compiling the variables into the vars dict

Flattening these vars might be tricky (see my question about duplicates below), but I think it's totally doable with jq or python. Agree that we can probably fit this into one script.

Deployment

Deploying to S3 with the latest + hash makes sense. However, it would also be extremely useful to have this output accessible as an Athena table. IMO, I think the way to achieve this is to write the output to ccao-data-public-us-east-1/vars_dict/, but include the hash/version as a column in the CSV, in addition to using it in the naming schema. This makes the version accessible to Athena.

If we then define ccao-data-public-us-east-1/vars_dict/ as an Athena table with a crawler, we can access all versions of the CSV in a single table and simply use WHERE = 'latest'. This will be extremely handy for grabbing high-level variable descriptions. I would write the table to ccao.vars_dict.

Questions

  • How should we represent duplicate data? Something like char_air is stored in a raw form in dweldat and a cleaned up form in vw_res_card_char. In the former, the numeric codes are stored, in the latter, the string representations are stored.
  • Should we just dupe the definitions across both the view and table? If so, how do we represent that when parsing back to CSV?
  • Can we collapse the duplicate definitions into a single row in vars_dict? i.e. how they are currently represented?
  • One thing to consider: Athena can read and parse arbitrary JSON. We might be able to define the construction of the vars_dict table using SQL alone, meaning it could simply be a dbt build step. Not sure how well this plays with state though (would updating a documentation column alter the table state and force a rebuild?)

@jeancochrane
Copy link
Contributor

We had a productive discussion of this issue offline, resulting in some changes to the scope. Codifying that discussion here for posterity:

It feels like this issue is experiencing scope creep, so we're going to cut the scope and put the extraneous features on the backburner. Specifically, there are a few distinct goals that we were trying to accomplish with the proposed design above:

  1. Speeding up the model iteration cycle by making the variables easier to update (the original goal as stated in the issue above)
  2. Consolidating our data documentation by moving variable definitions into dbt
  3. Exposing a public interface for translating between variable names and accessing their rich descriptions (e.g. in Tableau reports)

The proposed solution (in brief: variables stored in dbt metadata, which gets compiled to a CSV, then updated to add a version column, then uploaded to S3, then crawled by Glue, then exposed in Athena, then downloaded by the ccao package as necessary to support ccao::vars_dict) feels overengineered due to our desire to solve all of these problems at once. Instead, an 80/20 solution might improve on each goal incrementally:

  1. Speed up the model iteration cycle by stripping the model variables out of the ccao variable dictionary and removing the requirement that we keep it up to date with our changes
  2. Consolidate our data documentation by moving column notes from the ccao variable dictionary to the dbt DAG
  3. Expose a public interface for column names/descriptions by designing a dedicated solution (and punting on this work in the immediate term)

The steps of this work will now be:

  1. Update the ccao variable dictionary to remove unused rows and columns and focus exclusively on supporting renaming/recoding variables
  2. Move rich column descriptions from the notes column of the ccao variable dictionary into the dbt DAG
  3. Open up an issue to work with Nicole to determine our Tableau needs for the data catalog, and scope a dedicated solution

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants