Move vars_dict out of this package to a shared location #2
Comments
Here's a brief proposal for how we might do this!

Defining the variables

Variable definitions will be stored as metadata on our dbt model definitions, with a meta block defined for each column.

Below is an example of the proposed definition of the char_air variable, converted to a dbt schema.yml entry:

diff --git a/dbt/models/default/schema.yml b/dbt/models/default/schema.yml
index c031e78..22d6aa1 100644
--- a/dbt/models/default/schema.yml
+++ b/dbt/models/default/schema.yml
@@ -66,6 +66,32 @@ models:
error_if: ">266758"
- name: default.vw_card_res_char
description: '{{ doc("vw_card_res_char") }}'
+ columns:
+ - name: char_air
+ description: Whether the unit has central air conditioning.
+ meta:
+ var_from_source: iasWorld
+ var_from_table: iasworld.dweldat
+ var_from_ctas: null
+ var_from_view: default.vw_card_res_char
+ var_name_hie: qu_air
+ var_name_iasworld: user7
+ var_name_athena: char_air
+ var_name_model: char_air
+ var_name_publish: central_air
+ var_name_pretty: Central Air Conditioning
+ var_type: char
+ var_data_type: categorical
+ var_model_type: res
+ var_is_published: true
+ var_is_predictor: true
+ var_codes:
+ 1:
+ var_value: Central A/C
+ var_value_short: "YES"
+ 2:
+ var_value: No Central A/C
+ var_value_short: "NO"
tests:
# Unique by card and year
- dbt_utils.unique_combination_of_columns:

It will probably be most efficient to write a script to do this for the 500+ rows in vars_dict.

Compiling the variables into the vars dict

Whenever dbt compiles the DAG, it produces a new manifest file at target/manifest.json containing this column metadata. For example:

$ jq '.nodes."model.athena.default.vw_card_res_char".columns.char_air.meta' target/manifest.json

And the output:

{
"var_from_source": "iasWorld",
"var_from_table": "iasworld.dweldat",
"var_from_ctas": null,
"var_from_view": "default.vw_card_res_char",
"var_name_hie": "qu_air",
"var_name_iasworld": "user7",
"var_name_athena": "char_air",
"var_name_model": "char_air",
"var_name_publish": "central_air",
"var_name_pretty": "Central Air Conditioning",
"var_type": "char",
"var_data_type": "categorical",
"var_model_type": "res",
"var_is_published": true,
"var_is_predictor": true,
"var_codes": {
"1": {
"var_value": "Central A/C",
"var_value_short": "YES"
},
"2": {
"var_value": "No Central A/C",
"var_value_short": "NO"
}
}
}

We can write a simple script that parses the compiled manifest and converts this metadata into the vars dict; a rough sketch is at the end of this comment.

Deployment

Once we have metadata compiled to the dbt manifest and a script to convert it into the vars dict, we can write a GitHub workflow that runs on each commit to the main branch and pushes the dict to S3 remote storage.

We would like to version the vars dict according to the git commit that generated it. As such, the workflow will push two copies of the dict to S3, one of them named using the short commit hash. To avoid unnecessary versions, the workflow will only push a new version if the CSV produced by the conversion script differs from the current version in S3.

Incorporating the remote CSV into the ...
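For illustration, here is a minimal sketch of what that conversion script might look like in Python. The manifest layout follows the jq example above; the output filename vars_dict.csv and the choice to keep var_codes as a nested JSON string are placeholder assumptions, not decisions.

```python
# Rough sketch of the manifest-to-CSV conversion described above. The manifest
# layout (nodes -> columns -> meta) follows the jq output in this comment; the
# output filename is a placeholder.
import csv
import json

MANIFEST_PATH = "target/manifest.json"
OUTPUT_PATH = "vars_dict.csv"  # hypothetical output name


def extract_vars(manifest_path: str) -> list[dict]:
    """Collect the meta dict for every model column that defines one."""
    with open(manifest_path) as f:
        manifest = json.load(f)

    rows = []
    for node in manifest["nodes"].values():
        for column in node.get("columns", {}).values():
            meta = column.get("meta")
            if not meta:
                continue
            # Keep nested code definitions as a JSON string for now; exploding
            # them into one row per code would be a separate flattening step.
            rows.append({**meta, "var_codes": json.dumps(meta.get("var_codes"))})
    return rows


def write_csv(rows: list[dict], output_path: str) -> None:
    """Write one CSV row per variable, with a union of all meta keys as columns."""
    fieldnames = sorted({key for row in rows for key in row})
    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    write_csv(extract_vars(MANIFEST_PATH), OUTPUT_PATH)
```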
Curious what @dfsnow and @wrridgeway think about my proposal above!
@jeancochrane See my notes on each section:

Defining the variables
diff --git a/dbt/models/default/schema.yml b/dbt/models/default/schema.yml
index c031e78..22d6aa1 100644
--- a/dbt/models/default/schema.yml
+++ b/dbt/models/default/schema.yml
@@ -66,6 +66,32 @@ models:
error_if: ">266758"
- name: default.vw_card_res_char
description: '{{ doc("vw_card_res_char") }}'
+ columns:
+ - name: char_air
+ description: Whether the unit has central air conditioning.
+ meta:
+ var_from_source: iasWorld # Perhaps we can move this into dbt's loader field?
- var_from_table: iasworld.dweldat # All of these should be able to be imputed from dbt
- var_from_ctas: null
- var_from_view: default.vw_card_res_char
+ var_name_hie: qu_air
+ var_name_iasworld: user7
+ var_name_athena: char_air
+ var_name_model: char_air
+ var_name_publish: central_air
+ var_name_pretty: Central Air Conditioning
+ var_type: char # This maps to the prefix used when renaming
+ var_data_type: categorical # This maps to the R data type, so must stay
- var_model_type: res
- var_is_published: true # No longer really used
- var_is_predictor: true # Built into the model pipeline now
+ var_codes:
+ 1:
+ var_value: Central A/C
+ var_value_short: "YES"
+ 2:
+ var_value: No Central A/C
+ var_value_short: "NO"
tests:
# Unique by card and year
- dbt_utils.unique_combination_of_columns:

Compiling the variables into the vars dict

Flattening these vars might be tricky (see my question about duplicates below), but I think it's totally doable with jq or python; a rough sketch is at the end of this comment. Agree that we can probably fit this into one script.

Deployment

Deploying to S3 with the ... If we then define ...

Questions
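To illustrate the flattening concern above, here is a rough sketch (in Python rather than jq) of exploding the nested var_codes into one row per code value. The flattened column names (var_code, var_value, var_value_short) are assumptions about the target shape, not something settled in this thread.

```python
# Sketch of flattening nested var_codes into one row per code value.
# Assumes each column's meta looks like the char_air example above; the
# flattened column names are a guess at the target shape, not a confirmed spec.
def flatten_var(meta: dict) -> list[dict]:
    """Explode one column's meta dict into one row per code value."""
    base = {k: v for k, v in meta.items() if k != "var_codes"}
    codes = meta.get("var_codes") or {}
    if not codes:
        # Numeric/free-form variables have no codes; emit a single row.
        return [{**base, "var_code": None, "var_value": None, "var_value_short": None}]
    return [
        {
            **base,
            "var_code": code,
            "var_value": detail.get("var_value"),
            "var_value_short": detail.get("var_value_short"),
        }
        for code, detail in codes.items()
    ]
```

Note that this doesn't address the duplicates question: variables defined on multiple models would still need a dedup or precedence rule on top of this.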
We had a productive discussion of this issue offline, resulting in some changes to the scope. Codifying that discussion here for posterity:

It feels like this issue is experiencing scope creep, so we're going to cut the scope and put the extraneous features on the backburner. Specifically, there are a few distinct goals that we were trying to accomplish with the proposed design above:
The proposed solution (in brief: variables stored in dbt metadata, which gets compiled to a CSV, then updated to add a version column, then uploaded to S3, then crawled by Glue, then exposed in Athena, then downloaded by the ...)
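To make the version-column and conditional-upload steps concrete, here is a minimal sketch. The bucket name, key layout, and the idea of comparing content hashes to decide whether anything changed are all illustrative assumptions; the real workflow might use the AWS CLI or a different dedup rule entirely.

```python
# Sketch only: bucket/key names and the hash-comparison approach are
# illustrative assumptions, not decisions made in this thread.
import csv
import hashlib
import io
import subprocess

import boto3

BUCKET = "example-assessment-data"             # hypothetical bucket name
LATEST_KEY = "vars_dict/latest/vars_dict.csv"  # hypothetical key layout


def add_version_column(raw_csv: bytes, version: str) -> bytes:
    """Append a 'version' column (the short git hash) to every row."""
    rows = list(csv.DictReader(io.StringIO(raw_csv.decode())))
    fieldnames = (list(rows[0].keys()) if rows else []) + ["version"]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "version": version})
    return out.getvalue().encode()


def upload_if_changed(raw_csv: bytes, version: str) -> None:
    """Push the dict to S3 only when the unversioned CSV differs from the last push."""
    s3 = boto3.client("s3")
    new_hash = hashlib.sha256(raw_csv).hexdigest()
    try:
        latest = s3.get_object(Bucket=BUCKET, Key=LATEST_KEY)
        # Compare against the hash of the previous unversioned CSV (stored as
        # object metadata) so the per-commit version column doesn't force a push.
        if latest["Metadata"].get("content-hash") == new_hash:
            return
    except s3.exceptions.NoSuchKey:
        pass  # First upload
    body = add_version_column(raw_csv, version)
    meta = {"content-hash": new_hash}
    s3.put_object(Bucket=BUCKET, Key=f"vars_dict/{version}/vars_dict.csv",
                  Body=body, Metadata=meta)
    s3.put_object(Bucket=BUCKET, Key=LATEST_KEY, Body=body, Metadata=meta)


if __name__ == "__main__":
    sha = subprocess.check_output(
        ["git", "rev-parse", "--short", "HEAD"], text=True
    ).strip()
    with open("vars_dict.csv", "rb") as f:
        upload_if_changed(f.read(), sha)
```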
The steps of this work will now be:
The vars_dict dataset in this package is used heavily by ccao-data/model-res-avm and ccao-data/model-condo-avm. However, iterating on the model requires us to constantly update this dictionary (and thus the package). Additionally, keeping the dictionary up to date is a tedious, manual process.

I propose we automate the creation and maintenance of this dictionary and move it out of this package onto S3. We can use Glue and other AWS APIs to construct it on a schedule, then store it in a public bucket.