Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a data dictionary #315

Merged

Conversation

jeancochrane
Copy link
Contributor

@jeancochrane jeancochrane commented Jan 8, 2025

This PR turns the "Features Used" table in the README into a proper data dictionary by making two changes:

  • Add a new column Variable Name that lists the name of the variable as it appears in the model code
  • Save a version of the table to CSV in docs/data-dict.csv

We also rename the Notes column to Description for clarity, and move it to the left in the table so that it's more prominent.

Note that this PR does not create historical dictionaries for past models. My expectation is that we will keep docs/data-dict.csv up to date with the most recent version of the parameter file, and then in the future we can back out the data dict that we used for past models by referencing the version of docs/data-dict.csv that existed at the time of the yearly model tag.

If this change looks good, I'll go ahead and copy it to the condo model to address ccao-data/model-condo-avm#72.

Closes #300.

@jeancochrane jeancochrane linked an issue Jan 8, 2025 that may be closed by this pull request
README.Rmd Show resolved Hide resolved
README.Rmd Show resolved Hide resolved
README.Rmd Show resolved Hide resolved
Comment on lines +353 to +355
# Write machine-readable version of the table to file
param_tbl_fmt %>%
write_csv("docs/data-dict.csv")
Copy link
Contributor Author

@jeancochrane jeancochrane Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seemed to me like the simplest way to keep the data dict up to date: Any time we render the README, we'll save the data dict to the file. If model parameters haven't changed, the data dict file won't change, and there won't be a diff; otherwise, there will be a diff and the code author will be prompted to commit it. Not the most airtight system, but I figure it's probably a good enough starting place. Let me know if you have other ideas!

Perhaps out of scope for now, but we could also consider adding a pre-commit check similar to readme-rmd-rendered that compares the params in params.yml to the params in this file to make sure they match. I'm happy to take a crack at that now if you think it's a good idea.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pre-commit hook was pretty straightforward so I went ahead and implemented it in 46c163b.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love that we've ended up with a system where we need to sync four separate things: ccao::vars_dict, params.yaml, docs/data-dict.csv, and the README. I agree this is a good simple solution for now though. Let's roll with it and worry about something better/more long-term once 2025 modeling is finished.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, it's confusing and brittle to maintain. I opened #324 to keep track of this work so that we can pick it up once we're done with modeling.

README.md Outdated Show resolved Hide resolved
README.md Show resolved Hide resolved
@jeancochrane jeancochrane marked this pull request as ready for review January 8, 2025 21:48
@wrridgeway
Copy link
Member

This is awesome. Question for anyone - if someone wanted to grab training data and use it to feed a python model, would they have the info they need to recode features/apply variable labels currently?

@jeancochrane
Copy link
Contributor Author

if someone wanted to grab training data and use it to feed a python model, would they have the info they need to recode features/apply variable labels currently?

Yes, the development version of the Python ccao package has both vars_recode and vars_rename. However, it's under active development and we haven't used it in production yet, so I wouldn't recommend this path to anyone quite yet.

@jeancochrane jeancochrane marked this pull request as draft January 10, 2025 17:34
@jeancochrane
Copy link
Contributor Author

Currently blocked by ccao-data/data-architecture#704. Once that comes in I'll do a final pass at rendering the README and data dict and re-request review.

@jeancochrane
Copy link
Contributor Author

Now blocked by ccao-data/actions#36.

@jeancochrane jeancochrane marked this pull request as ready for review January 13, 2025 15:51
@jeancochrane
Copy link
Contributor Author

Alright @dfsnow, this should finally be ready for review!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking): Love the simplicity of this hook! My only suggestion would be to add a call to action on failure i.e. run the README.rmd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, done in dc1ce6e!

Data dictionary must be up to date with params file.......................Failed
- hook id: check-data-dict
- exit code: 1

Error: Params are not present in both params.yaml and docs/data-dict.csv: foo, bar. Did you forget to reknit README.Rmd after updating params.yaml?

README.Rmd Outdated
The residential model uses a variety of individual and aggregate features to determine a property's assessed value. We've tested a long list of possible features over time, including [walk score](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_walkscore.html), [crime rate](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/chicago_crimerate.html), [school districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_school_boundaries_mean_encoded.html), and many others. The features in the table below are the ones that made the cut. They're the right combination of easy to understand and impute, powerfully predictive, and well-behaved. Most of them are in use in the model as of `r Sys.Date()`.
The residential model uses a variety of individual and aggregate features to determine a property's assessed value. We've tested a long list of possible features over time, including [walk score](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_walkscore.html), [crime rate](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/chicago_crimerate.html), [school districts](https://gitlab.com/ccao-data-science---modeling/models/ccao_res_avm/-/blob/9407d1fae1986c5ce1f5434aa91d3f8cf06c8ea1/output/test_new_variables/county_school_boundaries_mean_encoded.html), and many others. The features in the table below are the ones that made the cut. They're the right combination of easy to understand and impute, powerfully predictive, and well-behaved.

For a machine-readable version of this data dictionary, see [`docs/data-dict.csv`](./docs/data-dict.csv).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion (non-blocking): Let's add some callouts here to other data documentation e.g. ccao::vars_dict and the dbt catalog.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable, I expanded this section in e3eca6c and moved it below the table.

README.Rmd Outdated
)) %>%
pivot_wider(
id_cols = `Feature Name`:`Notes`,
id_cols = `feature_name`:`category`,
names_from = row,
values_from = var_value
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: We may want to use var_code here, since those are the values actually recorded in the data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I went ahead and added a new column for this in b8684fb to make it clear that the encoded values are different from the semantic values. Let me know if you think there's a clearer way to express this!

Comment on lines +353 to +355
# Write machine-readable version of the table to file
param_tbl_fmt %>%
write_csv("docs/data-dict.csv")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love that we've ended up with a system where we need to sync four separate things: ccao::vars_dict, params.yaml, docs/data-dict.csv, and the README. I agree this is a good simple solution for now though. Let's roll with it and worry about something better/more long-term once 2025 modeling is finished.

@jeancochrane jeancochrane force-pushed the jeancochrane/300-missing-data-dictionary-in-readme branch from 551e49b to e3eca6c Compare January 13, 2025 21:24
@jeancochrane
Copy link
Contributor Author

I made some major changes based on your comments @dfsnow, mind taking one more look?

@jeancochrane jeancochrane requested a review from dfsnow January 13, 2025 22:44
Copy link
Member

@dfsnow dfsnow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me @jeancochrane! Can you put up a PR for condos as well?

@dfsnow dfsnow merged commit d0cf3e0 into 2025-assessment-year Jan 14, 2025
4 checks passed
@dfsnow dfsnow deleted the jeancochrane/300-missing-data-dictionary-in-readme branch January 14, 2025 16:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Missing Data Dictionary in readme
3 participants