Create Population Synthesizer Benchmark #1

Merged 26 commits on Nov 9, 2021
69ea715
initial commit
DavidOry Jan 8, 2021
06fcd3b
minor progress
DavidOry Jan 13, 2021
5ded40d
progress on taz marginals
DavidOry Jan 14, 2021
67d316c
beta version
DavidOry Jan 15, 2021
2980828
tidy up, add readme
DavidOry Jan 16, 2021
75cf9f0
readme fix
DavidOry Jan 16, 2021
3d9a960
readme update
DavidOry Jan 16, 2021
08ed7db
fix readme
DavidOry Jan 16, 2021
f8d7c46
remove data from scripts, update script
DavidOry May 1, 2021
aeaae7a
Use numeric (not string) variables, standardize variable names, repla…
lmz May 11, 2021
944827b
Initial commit of PopulationSim setup scripts and notes
lmz May 12, 2021
7dd0144
NOP - comments and spacing
lmz May 13, 2021
3f0b929
Add source information and update seed_households.csv to have gqtype
lmz May 18, 2021
8b3ecb1
Keep gqtype in household_seed
lmz May 18, 2021
75ff0ab
Update seed_households.csv for buggy last commit
lmz May 18, 2021
f250834
Reran (and renamed household seed's gqtype to hhgqtype)
lmz May 18, 2021
2bc9642
Fixed links
lmz May 18, 2021
5e4f311
Merge branch 'develop-pop-syn' of https://github.com/ZephyrTransport/…
lmz May 18, 2021
6665d7f
Fix data dictionary error
lmz Sep 14, 2021
e78d71e
Update for last run
lmz Sep 14, 2021
e375d9b
Update outputs from commit 6665d7f
lmz Sep 15, 2021
4fa03b3
Update testing notes
lmz Sep 15, 2021
835b417
Update importance weights and non-gq controls expression
lmz Sep 15, 2021
3d46f85
Rename variable to control field
lmz Sep 15, 2021
dff7886
Add county level summary (calculate)
lmz Nov 9, 2021
0092b29
Remove extraneous notes
lmz Nov 9, 2021
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
.Rproj.user
.Rhistory
.RData
.Ruserdata
.DS_Store
**/*.twbr
250 changes: 250 additions & 0 deletions population-synthesis/san-francisco-bay-area/PopulationSim.md
@@ -0,0 +1,250 @@
```shell
setopt interactivecomments
# The above makes it so zsh is ok with my comments
# I am on macOS 12.0.1 Monterey

# add R to my path
export PATH=/Library/Frameworks/R.framework/Resources:$PATH

# this is the location of the SF Bay Area population synthesis example in software-badging-benchmarks
export zephyr_sf_dir=/Users/lmz/Documents/GitHub/software-badging-benchmarks/population-synthesis/san-francisco-bay-area

# Following the Installation instructions on https://activitysim.github.io/populationsim/getting_started.html
conda create -n popsim python=3.8
conda activate popsim
# Now in popsim environment...

conda install pytables

# Here, we'll diverge a bit from the instructions in order to install (in editable mode) populationsim from a clone of my fork
# e.g. https://github.com/lmz/populationsim
# So my working directory is /Users/lzorn/Documents/GitHub/populationsim
pip install -e .

# Set up the example with the benchmark data
mkdir example_zephyr_sfbayarea
cd example_zephyr_sfbayarea
# Set up the standard subdirectories, pivoting from example_calm
mkdir configs
mkdir data
mkdir output
# Collect validation results across runs here
mkdir validation
cp $zephyr_sf_dir/scripts/validation.twb validation

# start with config from example_calm
cp ../example_calm/run_populationsim.py .
cp ../example_calm/configs/logging.yaml configs
# this one gets manual edits
cp ../example_calm/configs/settings.yaml configs
vim configs/settings.yaml
# Saved this into $zephyr_sf_dir/PopulationSim.md

# pull data from zephyr's software-badging-benchmarks repo
cp $zephyr_sf_dir/geographies.csv data
cp $zephyr_sf_dir/household_seed.csv data
cp $zephyr_sf_dir/person_seed.csv data

# create configs/controls.csv and the control totals (data/control_totals_[county,taz].csv)
# from the marginals
Rscript --vanilla $zephyr_sf_dir/scripts/setup_PopulationSim.R
# create a suffix for saving results
export suffix=$(date +%y%m%d_%H%M)
cp configs/controls.csv validation/controls_$suffix.csv

# run it!
python run_populationsim.py
# output looks like this:
# Configured logging using basicConfig
# INFO:activitysim:Configured logging using basicConfig
# INFO - Read logging configuration from: configs/logging.yaml
# INFO - SETTING configs_dir: configs
# INFO - SETTING settings_file_name: settings.yaml
#
# ...
#
# INFO - trace_memory_info pipeline.run after write_synthetic_population rss: 0.85GB used: 4.28 GB percent: 65.3%
# INFO - trace_memory_info #MEM pipeline.run after run_models rss: 0.85GB used: 4.28 GB percent: 65.3%
# INFO - Time to execute run_model (13 models) : 1645.948 seconds (27.4 minutes)
# INFO - Time to execute all models : 1646.103 seconds (27.4 minutes)

Rscript --vanilla $zephyr_sf_dir/scripts/combine_PopulationSim_summaries_to_long.R
cp output/final_summary_long.csv validation/final_summary_long_$suffix.csv
```

Some things I found:
* PopulationSim doesn't seem to like string variables; the variables used in controls should be numeric.
* PopulationSim doesn't handle a variable called "size"; I'm guessing it collides with the `pandas.DataFrame.size` attribute.
* Importance isn't purely relative: its absolute value matters even when there is only one control. If you run PopulationSim with just the total household control and importance=1, it will validate poorly. If you make importance=1000, it will validate perfectly.
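
The first two findings can be reproduced in pandas alone. The sketch below is not PopulationSim code: the toy table, the `hhsize_label` column, and the `hhsize_code` column are invented for illustration, and it assumes (as guessed above) that control expressions reach columns through attribute access.

```python
import pandas as pd

# Toy seed table; values are invented for illustration only.
households = pd.DataFrame({
    "size": [1, 2, 4],                            # problematic column name
    "hhsize_label": ["one", "two", "four_plus"],  # string-coded category
})

# Attribute access returns the DataFrame.size property (rows * columns),
# not the column named "size".
shadowed = households.size

# Renaming the column avoids the clash with the built-in attribute.
households = households.rename(columns={"size": "hhsize"})

# Recode the string category to integers so controls can compare numbers.
label_to_code = {"one": 1, "two": 2, "four_plus": 4}
households["hhsize_code"] = households["hhsize_label"].map(label_to_code)

print(shadowed)                          # cell count of the original table
print(households.hhsize.tolist())        # the renamed column is reachable
print(households["hhsize_code"].dtype)   # integer dtype after recoding
```

Renaming at the source (here, to `hhsize`) is the simpler choice because it keeps attribute-style and bracket-style access consistent.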

The following is the settings.yaml file used above.
```yaml
####################################################################
# PopulationSim Properties
####################################################################


# Algorithm/Software Configuration
# ------------------------------------------------------------------
INTEGERIZE_WITH_BACKSTOPPED_CONTROLS: True
SUB_BALANCE_WITH_FLOAT_SEED_WEIGHTS: False
GROUP_BY_INCIDENCE_SIGNATURE: True
USE_SIMUL_INTEGERIZER: True
USE_CVXPY: False
max_expansion_factor: 50
MAX_BALANCE_ITERATIONS_SIMULTANEOUS: 1000

# Geographic Settings
# ------------------------------------------------------------------
geographies: [county, PUMA, taz]
seed_geography: PUMA


# Tracing
# ------------------------------------------------------------------
trace_geography:
  taz: 62

# Data Directory
# ------------------------------------------------------------------
data_dir: data

# Input Data Tables
# ------------------------------------------------------------------
# input_pre_processor input_table_list
input_table_list:
  - tablename: households
    filename : household_seed.csv
    index_col: unique_hh_id
    column_map:
      hhnum: unique_hh_id
  - tablename: persons
    filename : person_seed.csv
    column_map:
      hhnum: unique_hh_id
      SPORDER: SPORDER
    # drop mixed type fields that appear to have been incorrectly generated
    drop_columns:
  - tablename: geo_cross_walk
    filename : geographies.csv
  - tablename: taz_control_data
    filename : control_totals_taz.csv
  - tablename: county_control_data
    filename : control_totals_county.csv

# Reserved Column Names
# ------------------------------------------------------------------
household_weight_col: WGTP
household_id_col: unique_hh_id
total_hh_control: number


# Control Specification File Name
# ------------------------------------------------------------------
control_file_name: controls.csv

# Output Tables
# ------------------------------------------------------------------
# output_tables can specify either a list of output tables to include or to skip
# if neither is specified, then no tables will be written

output_tables:
  action: include
  tables:
    - summary_taz
    - summary_taz_PUMA
    - summary_county_1
    - summary_county_2
    - summary_county_3
    - summary_county_4
    - summary_county_5
    - summary_county_6
    - summary_county_7
    - summary_county_8
    - summary_county_9
    - expanded_household_ids
    - trace_taz_weights

# Synthetic Population Output Specification
# ------------------------------------------------------------------
#

output_synthetic_population:
  household_id: household_id
  households:
    filename: synthetic_households.csv
    columns:
      # keep all the columns in household_seed.csv
      - RT
      - SERIALNO
      - DIVISION
      - PUMA
      - REGION
      - ST
      - ADJINC
      - WGTP
      - NP
      - TYPE
      - BLD
      - HHT
      - HINCP
      - HUPAC
      - NPF
      - TEN
      - VEH
      - county_index
      - unique_hh_id
      - weight
      - hhsize
      - incomeQ
      - workers

  persons:
    filename: synthetic_persons.csv
    # keep all the columns in person_seed.csv
    columns:
      - RT
      - SERIALNO
      - SPORDER
      - PUMA
      - ST
      - PWGTP
      - AGEP
      - COW
      - MAR
      - MIL
      - RELP
      - SCHG
      - SCHL
      - SEX
      - WKHP
      - WKW
      - ESR
      - HISP
      - PINCP
      - POWPUMA
      - unique_hh_id
      - county_index
      - age
      - occupation
      - gqtype

# Model steps for base mode
# ------------------------------------------------------------------
models:
  - input_pre_processor
  - setup_data_structures
  - initial_seed_balancing
  - meta_control_factoring
  - final_seed_balancing
  - integerize_final_seed_weights
  - sub_balancing.geography=taz
  - sub_balancing.geography=county
  - expand_households
  - write_data_dictionary
  - summarize
  - write_tables
  - write_synthetic_population

resume_after:
```
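
Because `output_synthetic_population` lists every seed column by name, a quick pre-flight check may catch a misspelled or missing column before a long run. The following is only a sketch: `missing_columns` is a hypothetical helper, not part of PopulationSim, and the inline sample stands in for the first lines of a seed file. The `column_map` argument assumes the setting renames old names to new names.

```python
import csv
import io


def missing_columns(csv_text, wanted, column_map=None):
    """Return the names in `wanted` that are absent from the CSV header.

    `column_map` mimics a table's `column_map` setting (old name -> new
    name). This helper is hypothetical and only reads the header row.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    renamed = {(column_map or {}).get(name, name) for name in header}
    return [name for name in wanted if name not in renamed]


# Stand-in for the top of a seed file (illustrative subset of columns only).
sample = "hhnum,PUMA,WGTP,hhsize,workers\n1,100,25,2,1\n"
wanted = ["unique_hh_id", "PUMA", "WGTP", "hhsize", "incomeQ"]

# Lists the requested columns that the header does not provide after renaming.
print(missing_columns(sample, wanted, {"hhnum": "unique_hh_id"}))
```

In practice, `csv_text` would be read from the seed file and `wanted` from the `columns` lists in the settings above.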
9 changes: 9 additions & 0 deletions population-synthesis/san-francisco-bay-area/README.md
@@ -0,0 +1,9 @@
# Population Synthesis
## San Francisco Bay Area

### Files (written as CSV and [Feather](https://arrow.apache.org/docs/python/feather.html))
1. `household_seed` -- List of households from [PUMS](https://www.census.gov/programs-surveys/acs/technical-documentation/pums/documentation.html)
2. `person_seed` -- List of persons in each household from PUMS
3. `marginals` -- Expansion targets by geography and person/household category
4. `geographies` -- Mapping among PUMAs, counties, and travel analysis zones (TAZs)
5. `data_dictionary` -- Identification of variables used in above files
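
As a rough illustration of how the two seed files relate, the sketch below joins persons to households on `unique_hh_id` and compares `hhsize` with the number of person records. The rows are invented stand-ins, not records from the benchmark files; with the real data, the frames would come from `pd.read_csv` or `pd.read_feather` instead.

```python
import pandas as pd

# Invented stand-ins for a few household_seed and person_seed rows.
household_seed = pd.DataFrame({
    "unique_hh_id": [1, 2],
    "hhsize": [2, 1],
    "weight": [20, 35],
})
person_seed = pd.DataFrame({
    "unique_hh_id": [1, 1, 2],
    "age": [34, 36, 71],
})

# Count person records per household and attach the count to each household.
person_counts = (
    person_seed.groupby("unique_hh_id").size().rename("n_persons").reset_index()
)
checked = household_seed.merge(person_counts, on="unique_hh_id", how="left")

# Households whose stated size disagrees with their person records.
mismatched = checked[checked["hhsize"] != checked["n_persons"]]
print(len(mismatched))
```

A left join keeps households that have no person records, so they surface as mismatches rather than silently dropping out.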
18 changes: 18 additions & 0 deletions population-synthesis/san-francisco-bay-area/data_dictionary.csv
@@ -0,0 +1,18 @@
file,variable,definition
household_seed,hhsize,Number of persons in the household
household_seed,incomeQ,Income quantile category
household_seed,weight,PUMS expansion factor for household record
household_seed,workers,Number of workers in the household
"household_seed, person_seed",[hh]gqtype,"Non-institutional group quarters living arrangement (1=university, 2=military, 3=other non-institutional, 0=NA - in household)"
"household_seed, person_seed",county_index,Integer index for County consistent with geography_index
"household_seed, person_seed",unique_hh_id,Unique household identification number
"household_seed, person_seed",VARIABLES IN CAPS,"Standard PUMS variables, see: https://www2.census.gov/programs-surveys/acs/tech_docs/pums/data_dict/PUMS_Data_Dictionary_2014-2018.csv"
marginals,age_is_[category],Person age category
marginals,gqtype_is_[category],Category of group quarters living arrangement
marginals,hhsize_is_[number of persons],"Number of persons in the household, from 1 to 4+"
marginals,incomeQ_[category],"Household income quantile category based on $2000 (1=less than $30k, 2=$30k-$60k, 3=$60k-$100k, 4=over $100k, 0=NA - not a household)"
marginals,number_households,"Number of households, including group quarters"
marginals,occupation_is_[occupation],Persons by occupation category
marginals,worker_[category],"Number of workers in the household, from 1 to 3+"
person_seed,age,Age in years
person_seed,occupation,"Occupation category (1=management, 2=professional, 3=services, 4=retail, 5=manual, 6=military, 0=NA - not employed)"
Binary file not shown.