Move vars_dict out of this package to a shared location #2
Comments
Here's a brief proposal for how we might do this!

Defining the variables

Variable definitions will be stored as metadata on our dbt model definitions, with a meta block defined for each column.

Below is an example of the proposed definition of the char_air variable, converted to a dbt schema.yml entry:

diff --git a/dbt/models/default/schema.yml b/dbt/models/default/schema.yml
index c031e78..22d6aa1 100644
--- a/dbt/models/default/schema.yml
+++ b/dbt/models/default/schema.yml
@@ -66,6 +66,32 @@ models:
error_if: ">266758"
- name: default.vw_card_res_char
description: '{{ doc("vw_card_res_char") }}'
+ columns:
+ - name: char_air
+ description: Whether the unit has central air conditioning.
+ meta:
+ var_from_source: iasWorld
+ var_from_table: iasworld.dweldat
+ var_from_ctas: null
+ var_from_view: default.vw_card_res_char
+ var_name_hie: qu_air
+ var_name_iasworld: user7
+ var_name_athena: char_air
+ var_name_model: char_air
+ var_name_publish: central_air
+ var_name_pretty: Central Air Conditioning
+ var_type: char
+ var_data_type: categorical
+ var_model_type: res
+ var_is_published: true
+ var_is_predictor: true
+ var_codes:
+ 1:
+ var_value: Central A/C
+ var_value_short: "YES"
+ 2:
+ var_value: No Central A/C
+ var_value_short: "NO"
tests:
# Unique by card and year
- dbt_utils.unique_combination_of_columns:

It will probably be most efficient to write a script to do this for the 500+ rows in vars_dict.

Compiling the variables into the vars dict

Whenever dbt compiles the DAG, it produces a new manifest file at target/manifest.json containing this column metadata. For example:

$ jq '.nodes."model.athena.default.vw_card_res_char".columns.char_air.meta' target/manifest.json

And the output:

{
"var_from_source": "iasWorld",
"var_from_table": "iasworld.dweldat",
"var_from_ctas": null,
"var_from_view": "default.vw_card_res_char",
"var_name_hie": "qu_air",
"var_name_iasworld": "user7",
"var_name_athena": "char_air",
"var_name_model": "char_air",
"var_name_publish": "central_air",
"var_name_pretty": "Central Air Conditioning",
"var_type": "char",
"var_data_type": "categorical",
"var_model_type": "res",
"var_is_published": true,
"var_is_predictor": true,
"var_codes": {
"1": {
"var_value": "Central A/C",
"var_value_short": "YES"
},
"2": {
"var_value": "No Central A/C",
"var_value_short": "NO"
}
}
}

We can write a simple script that parses the compiled manifest and converts this metadata into the vars dict; a rough sketch is at the end of this comment.

Deployment

Once we have metadata compiled to the dbt manifest and a script to convert it into the vars dict, we can write a GitHub workflow that runs on each commit to the main branch and pushes the dict to S3 remote storage.

We would like to version the vars dict according to the git commit that generated it. As such, the workflow will push two copies of the dict to S3, one of them named using the short commit hash. To avoid unnecessary versions, the workflow will only push a new version if the CSV produced by the conversion script differs from the current version in S3.

Incorporating the remote CSV into the ...
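For illustration, here is a minimal sketch of what that conversion script might look like in Python. The manifest layout follows the jq example above; the output filename vars_dict.csv and the choice to keep var_codes as a nested JSON string are placeholder assumptions, not decisions.

```python
# Rough sketch of the manifest-to-CSV conversion described above. The manifest
# layout (nodes -> columns -> meta) follows the jq output in this comment; the
# output filename is a placeholder.
import csv
import json

MANIFEST_PATH = "target/manifest.json"
OUTPUT_PATH = "vars_dict.csv"  # hypothetical output name


def extract_vars(manifest_path: str) -> list[dict]:
    """Collect the meta dict for every model column that defines one."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    rows = []
    for node in manifest["nodes"].values():
        for column in node.get("columns", {}).values():
            meta = column.get("meta")
            if not meta:
                continue
            # Keep nested code definitions as a JSON string for now; exploding
            # them into one row per code would be a separate flattening step.
            rows.append({**meta, "var_codes": json.dumps(meta.get("var_codes"))})
    return rows


def write_csv(rows: list[dict], output_path: str) -> None:
    """Write one CSV row per variable, with a union of all meta keys as columns."""
    fieldnames = sorted({key for row in rows for key in row})
    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    write_csv(extract_vars(MANIFEST_PATH), OUTPUT_PATH)
```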
Curious what @dfsnow and @wrridgeway think about my proposal above!
@jeancochrane See my notes on each section:

Defining the variables
diff --git a/dbt/models/default/schema.yml b/dbt/models/default/schema.yml
index c031e78..22d6aa1 100644
--- a/dbt/models/default/schema.yml
+++ b/dbt/models/default/schema.yml
@@ -66,6 +66,32 @@ models:
error_if: ">266758"
- name: default.vw_card_res_char
description: '{{ doc("vw_card_res_char") }}'
+ columns:
+ - name: char_air
+ description: Whether the unit has central air conditioning.
+ meta:
+ var_from_source: iasWorld # Perhaps we can move this into dbt's loader field?
- var_from_table: iasworld.dweldat # All of these should be able to be imputed from dbt
- var_from_ctas: null
- var_from_view: default.vw_card_res_char
+ var_name_hie: qu_air
+ var_name_iasworld: user7
+ var_name_athena: char_air
+ var_name_model: char_air
+ var_name_publish: central_air
+ var_name_pretty: Central Air Conditioning
+ var_type: char # This maps to the prefix used when renaming
+ var_data_type: categorical # This maps to the R data type, so must stay
- var_model_type: res
- var_is_published: true # No longer really used
- var_is_predictor: true # Built into the model pipeline now
+ var_codes:
+ 1:
+ var_value: Central A/C
+ var_value_short: "YES"
+ 2:
+ var_value: No Central A/C
+ var_value_short: "NO"
tests:
# Unique by card and year
- dbt_utils.unique_combination_of_columns:

Compiling the variables into the vars dict

Flattening these vars might be tricky (see my question about duplicates below), but I think it's totally doable with jq or python; a rough sketch is at the end of this comment. Agree that we can probably fit this into one script.

Deployment

Deploying to S3 with the ... If we then define ...

Questions
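To illustrate the flattening concern above, here is a rough sketch (in Python rather than jq) of exploding the nested var_codes into one row per code value. The flattened column names (var_code, var_value, var_value_short) are assumptions about the target shape, not something settled in this thread.

```python
# Sketch of flattening nested var_codes into one row per code value.
# Assumes each column's meta looks like the char_air example above; the
# flattened column names are a guess at the target shape, not a confirmed spec.
def flatten_var(meta: dict) -> list[dict]:
    """Explode one column's meta dict into one row per code value."""
    base = {k: v for k, v in meta.items() if k != "var_codes"}
    codes = meta.get("var_codes") or {}
    if not codes:
        # Numeric/free-form variables have no codes; emit a single row.
        return [{**base, "var_code": None, "var_value": None, "var_value_short": None}]
    return [
        {
            **base,
            "var_code": code,
            "var_value": detail.get("var_value"),
            "var_value_short": detail.get("var_value_short"),
        }
        for code, detail in codes.items()
    ]
```

Note that this doesn't address the duplicates question: variables defined on multiple models would still need a dedup or precedence rule on top of this.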
We had a productive discussion of this issue offline, resulting in some changes to the scope. Codifying that discussion here for posterity:

It feels like this issue is experiencing scope creep, so we're going to cut the scope and put the extraneous features on the backburner. Specifically, there are a few distinct goals that we were trying to accomplish with the proposed design above:
The proposed solution (in brief: variables stored in dbt metadata, which gets compiled to a CSV, then updated to add a version column, then uploaded to S3, then crawled by Glue, then exposed in Athena, then downloaded by the ...)
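To make the version-column and conditional-upload steps concrete, here is a minimal sketch. The bucket name, key layout, and the idea of comparing content hashes to decide whether anything changed are all illustrative assumptions; the real workflow might use the AWS CLI or a different dedup rule entirely.

```python
# Sketch only: bucket/key names and the hash-comparison approach are
# illustrative assumptions, not decisions made in this thread.
import csv
import hashlib
import io
import subprocess

import boto3

BUCKET = "example-assessment-data"             # hypothetical bucket name
LATEST_KEY = "vars_dict/latest/vars_dict.csv"  # hypothetical key layout


def add_version_column(raw_csv: bytes, version: str) -> bytes:
    """Append a 'version' column (the short git hash) to every row."""
    rows = list(csv.DictReader(io.StringIO(raw_csv.decode())))
    fieldnames = (list(rows[0].keys()) if rows else []) + ["version"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "version": version})
    return out.getvalue().encode()


def upload_if_changed(raw_csv: bytes, version: str) -> None:
    """Push the dict to S3 only when the unversioned CSV differs from the last push."""
    s3 = boto3.client("s3")
    new_hash = hashlib.sha256(raw_csv).hexdigest()
    try:
        latest = s3.get_object(Bucket=BUCKET, Key=LATEST_KEY)
        # Compare against the hash of the previous unversioned CSV (stored as
        # object metadata) so the per-commit version column doesn't force a push.
        if latest["Metadata"].get("content-hash") == new_hash:
            return
    except s3.exceptions.NoSuchKey:
        pass  # First upload
    body = add_version_column(raw_csv, version)
    meta = {"content-hash": new_hash}
    s3.put_object(Bucket=BUCKET, Key=f"vars_dict/{version}/vars_dict.csv",
                  Body=body, Metadata=meta)
    s3.put_object(Bucket=BUCKET, Key=LATEST_KEY, Body=body, Metadata=meta)


if __name__ == "__main__":
    sha = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    with open("vars_dict.csv", "rb") as f:
        upload_if_changed(f.read(), sha)
```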
The steps of this work will now be:
The vars_dict dataset in this package is used heavily by ccao-data/model-res-avm and ccao-data/model-condo-avm. However, iterating on the model requires us to constantly update this dictionary (and thus the package). Additionally, keeping the dictionary up to date is a tedious, manual process.

I propose we automate the creation and maintenance of this dictionary and move it out of this package onto S3. We can use Glue and other AWS APIs to construct it on a schedule, then store it in a public bucket.