-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow non-binary incidence #123
Allow non-binary incidence #123
Conversation
This looks pretty straightforward to me – though I have to confess that most of what I knew when I wrote the code has leaked out of my brain in the intervening years. Is this a straight-up improvement, or should it be an option? (Or did I get the math wrong when I wrote the code originally and you are just being nice by not saying so?) |
Hi @jamiecook , @toliwaga , Now, household size is a special case which can be specified both as a household level control or a person level control. As a household level control, the controls would be segmented by household size groups. As a person level control, the control total is total population and incidence values are number of persons in each household. Have you tried any other household level variable with non-binary incidence? |
Thanks @jamiecook. Can you also fill out the Contribution Review Criteria so we can make sure the PR is complete? Here is an example from a previous PR. |
also @jamiecook please PR to develop. Master is our release branch, which we only push to from develop, after develop has passed the tests. |
994567c
to
86f17d5
Compare
@bstabler - updated to develop branch, added review criteria @toliwaga - it's pretty much the same as your impl, just slightly more generalised. Previously you were allowing incidence = {0,1+}, which would was equivalent to updating by {gamma ^0=1, gamma ^1=gamma} respectively. You did this by selecting rows where incidence > 0 and *= gamma. So basically this PR handles the broader case of incidence 1+. @binnympaul - as far as I could see when I tried to set up population as a person level control, it was implemented by counting persons and aggregating them up to the household level where they were treated as you describe. i.e. the incidence table had hh_size = {0,1,2,3,...}. However because of the treatment of this in the code (see comment to @toliwaga) the weights for hh_size = 1 were updated equally to the weights for hh_size = 2. Meaning bad convergence and eventual relaxation of the control to a point where it didn't fit well. If, as in the example_calm project, the hh_size was instead controlled by 5 segments (0,1,2,3,4+), this is fine. I believe that this fixes the special case where population is specified at the person level, and also any other control that is calculated by counting and aggregating persons them to the household level. |
@bstabler - is there anything else you need from me on this PR? |
@jamiecook - thanks again. It looks there are two more things to resolve:
https://travis-ci.com/github/ActivitySim/populationsim/jobs/338254226#L1067 |
@jamiecook - @binnympaul and I discussed the PR and everything looks good except for one issue - the results appear to be different than before. You can see this here, which says the expected TAZ 100 HH count is different. You can see the original test code here. We're not sure why this would be since we understand your change to not change the mathematics when run with example_calm and the same configuration files. Can you run example_calm with the existing and suggested revisions and confirm that the resulting tables are exactly the same? If not, can you investigate the differences and report back? Thanks for working with us to get this finalized for contribution! |
@bstabler - I have been looking into this and trying to get the tests running locally to debug. Is there a reason why there are so many
In the test results? |
@jamiecook not really. I think @toliwaga fixed this recently for ActivitySim? We should really do the same for PopulationSim. |
Okay, looking into this - the change here won't modify results if you have binary incidence values. Which i believe most of calm's controls are. But crucially, it's most, not all of the controls. I believe that the person controls relating to students or occupation groups
When these are tallied up to the household level, if there are more than one such person in the household, the incidence will be > 1. And if you have a control total for that which you need to hit, it makes sense (to me) that you ascribe a higher rate of change to the weights attached to households with more of that type of person in them. This is the debug logging I used to find this:
So in the old code it would have factored up the weights for the 1 person household at the same rate (+4%) as the households with 2 or 3 of that type of person. Under this change it would increase the 1 person household by 4%, the 2 person hh by 8.8% and the 3 person hh by 13.5%. This seems to have some positive impact on the number of iterations to converge in the test data (not that this is going to be a realistic data set to measure this).
I'm not sure of a good way to prove that this improves convergence in all cases, but as I said in the description, it allows us to converge in cases where we encode the hh_size directly rather than as a series of dummy variables. I've also done a quick validation that python's pow method isn't doing anything funny and if you have ONLY binary incidences, that it is equivalent.
|
@jamiecook - thanks for tracking this down. The difference is now explainable so please update the expected test results so the tests will pass. Also, please run the validation notebook before and after the change to see how much the results changed. We're likely good with the change, but just to make sure, we'd like to see the differences before accepting the PR. |
@jamiecook - thanks for checking this! I reviewed the original paper, PopSyn3 Java code and original PopulationSim technical specification memo. Your are right about the specification that balancing factor (gamma) must be exponentiated by the incidence value. As you reported, this would perhaps lead to faster convergence. I hope the validation would improve too, especially for person level controls. |
@binnympaul - I don't believe any other changes (to incidence creation) are required for this to be incorporated, as you point out count incidence are already created for person level controls, this change leverages that fact. |
On the validation notebook, my intuition is telling me that this is a positive change across the board. Would it be useful to have a separate "validation_comparison.ipynb" notebook? |
@@ -93,7 +93,7 @@ | |||
" packages = 'pyyaml pandas numpy matplotlib'\n", | |||
" ret = os.system(f'conda install {packages}')\n", | |||
" if ret != 0:\n", | |||
" os.system(f'pip install {packages}')" | |||
" os.system(f'{sys.executable} -m pip install {packages}')" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you shouldn't call pip or python directly in case the user path picks a different python than the one running the notebook.
@@ -840,7 +840,7 @@ | |||
} | |||
], | |||
"source": [ | |||
"def meta_geog_df(meta_geog):\n", | |||
"def meta_geog_df(summary_df, meta_geog):\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modify the methods to not use global variables and instead pass requisite data as arguments.
" stats.append(s)\n", | ||
" \n", | ||
" ax.set_title(f\"{params['geography']} - {params['name']}\")\n", | ||
" ax.set_ylabel('Frequency'); ax.set_xlabel('Difference: control vs. result')\n", | ||
" ax.scatter(f.index, f)\n", | ||
" ax.hist(diff, bins=10, range=(-5,5), alpha=0.5)\n", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a histogram plot, rather than a scatter plot, to show the histogram.
@bstabler - I have updated tests, added some other small quality of life changes which I've documented in the "files changed" section. |
hey @jamiecook thanks for the validation notebook improvements. would you be able to include the updated output cells in this PR (for the CALM data)? I'm mostly interested in the github render reflecting the histogram change. |
thanks @jamiecook, @blakerosenthal. Let's get the notebook issue fixed and then we'll pull this. Thanks for the great work @jamiecook! |
good point @blakerosenthal - I found a couple of typos I introduced as well so it was definitely a good idea to do a full run through with it. |
much better -- this looks great, thanks @jamiecook! cc @bstabler this looks good to go :) |
* update package version number as well * Allow non-binary incidence (#123) * Allow non-binary incidence * style * update tests to pass * add some progress indication * tidy up validation script, use histogram for a histogram * fix render and some typos Co-authored-by: bstabler <benstabler@yahoo.com>
* update package version number as well * Allow non-binary incidence (#123) * Allow non-binary incidence * style * update tests to pass * add some progress indication * tidy up validation script, use histogram for a histogram * fix render and some typos * increment version Co-authored-by: Jamie Cook <jamie.cook@veitchlister.com.au>
@bstabler I just ran the latest version with our existing config & control files and the results are much better for total persons. Instead of being about 60,000 short we are now only short by 80. The target is about 4 million. We did not change our categorical HH size variables. Could this issue/code change be the reason? Thanks! |
Excellent news @stefancoe, this change shouldn't directly change your concordance to HH size if you are using categorical variables, but if you have population expressed as household level constraint it will definitely help. Which sounds like what you have there. Really awesome to hear that this is helping. |
@jamiecook , @stefancoe : This is good news! As we were hoping, this fix seem to be improving the validation of person level controls. I will test this on one of our older implementations to confirm. |
@jamiecook @binnympaul Yes, we are very excited about this! I was going to test the old code/new code using the calm example to see if it had a similar impact, but I did not see total persons as a person level control. In case you are interested, here is a link to our controls. We have since added tenure, which I will put up soon. Thanks! |
* update package version number as well * Allow non-binary incidence (#123) * Allow non-binary incidence * style * update tests to pass * add some progress indication * tidy up validation script, use histogram for a histogram * fix render and some typos * increment version * deprecate py2.7 * Multiprocess (#130) * [Bugfix] Allow seed and meta geography to be the same (#139) * Fixes bug where if the seed geography is the same as the meta_geography, pandas has a small panic attack and the run will fail. * add cytoolz to the "requirements" * fix another activitysim change * Absolute bounds (#136) * adding upper/lower bounds to weighting use case * #137, #134, #133, #131 Co-authored-by: Jamie Cook <jamie.cook@veitchlister.com.au> Co-authored-by: Blake Rosenthal <blake.rosenthal@rsginc.com> Co-authored-by: Ben Stabler <bstabler@users.noreply.github.com> Co-authored-by: Leah Flake <leah.flake@rsginc.com>
This PR updates the list_balancer and simultaneous_list_balancer to allow for non-binary incidences.
In particular it changes the household weight updating to use a power in line with Equation 13 in Peter Vovsha's 2014 paper which uses a similar methodology.
https://pdfs.semanticscholar.org/f765/07c2b1f534dcb4ca281681080ba2f15140d8.pdf
With this change, household size can be directly included as a control total rather than encoding via a series of dummy variables
Testing with the above controls (and some others) gave the following highly accurate correspondence
Review Criteria