Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow non-binary incidence #123

Merged

Conversation

jamiecook
Copy link
Contributor

@jamiecook jamiecook commented May 19, 2020

This PR updates the list_balancer and simultaneous_list_balancer to allow for non-binary incidences.

In particular it changes the household weight updating to use a power in line with Equation 13 in Peter Vovsha's 2014 paper which uses a similar methodology.

image

https://pdfs.semanticscholar.org/f765/07c2b1f534dcb4ca281681080ba2f15140d8.pdf

With this change, household size can be directly included as a control total rather than encoding via a series of dummy variables

target,geography,seed_table,importance,control_field,expression
num_opd,TZ16,households,100000000,num_opd,(households.weight > 0) & (households.weight < np.inf)
pop_opd,TZ16,persons,100000000000,pop_opd,persons.REGION == 1

Testing with the above controls (and some others) gave the following highly accurate correspondence

image

Review Criteria

  1. Does it contain all the required elements, including a runnable example, documentation, and tests? *
  2. Does it implement good methods (i.e. is it consistent with good practices in travel modeling)? No methods were changed
  3. Are the runtimes reasonable and does it provide documentation justifying this claim? Yes, Yes
  4. Does it include non-Python code, such as C/C++? If so, does it compile on any OS and are compilation instructions included? No it does not
  5. Is it licensed with the ActivitySim license that allows the code to be freely distributed and modified and includes attribution so that the ‘provenance’ of the code can be tracked? Does it include an official release of ownership from the funding agency if applicable? Yes; N/A
  6. Does it appropriately interact with the data pipeline (i.e. it doesn't create new ways of managing data)? N/A
  7. Does it include regression tests to enable checking that consistent results will be returned when updates are made to the framework? Minor change which is consistent with existing tests
  8. Does it include sufficient test coverage and test data for existing and proposed features? All existing tests pass
  9. Any other comments or suggestions for improving the developer experience? No

@toliwaga
Copy link
Contributor

This looks pretty straightforward to me – though I have to confess that most of what I knew when I wrote the code has leaked out of my brain in the intervening years.

Is this a straight-up improvement, or should it be an option?

(Or did I get the math wrong when I wrote the code originally and you are just being nice by not saying so?)

@binnympaul
Copy link
Collaborator

Hi @jamiecook , @toliwaga ,
We do allow non-binary incidence for person level controls, where the incidence value is the actual count of persons in the households belonging to a control group (e,g., age_18_35). However, for household level controls the incidence can only be binary.

Now, household size is a special case which can be specified both as a household level control or a person level control. As a household level control, the controls would be segmented by household size groups. As a person level control, the control total is total population and incidence values are number of persons in each household.

Have you tried any other household level variable with non-binary incidence?

@bstabler
Copy link
Contributor

Thanks @jamiecook. Can you also fill out the Contribution Review Criteria so we can make sure the PR is complete? Here is an example from a previous PR.

@bstabler
Copy link
Contributor

also @jamiecook please PR to develop. Master is our release branch, which we only push to from develop, after develop has passed the tests.

@jamiecook jamiecook changed the base branch from master to develop May 20, 2020 22:38
@jamiecook jamiecook force-pushed the jamiecook/allow_non_binary_incidence branch from 994567c to 86f17d5 Compare May 20, 2020 22:40
@jamiecook
Copy link
Contributor Author

@bstabler - updated to develop branch, added review criteria

@toliwaga - it's pretty much the same as your impl, just slightly more generalised. Previously you were allowing incidence = {0,1+}, which would was equivalent to updating by {gamma ^0=1, gamma ^1=gamma} respectively. You did this by selecting rows where incidence > 0 and *= gamma. So basically this PR handles the broader case of incidence 1+.

@binnympaul - as far as I could see when I tried to set up population as a person level control, it was implemented by counting persons and aggregating them up to the household level where they were treated as you describe. i.e. the incidence table had hh_size = {0,1,2,3,...}. However because of the treatment of this in the code (see comment to @toliwaga) the weights for hh_size = 1 were updated equally to the weights for hh_size = 2. Meaning bad convergence and eventual relaxation of the control to a point where it didn't fit well. If, as in the example_calm project, the hh_size was instead controlled by 5 segments (0,1,2,3,4+), this is fine.

I believe that this fixes the special case where population is specified at the person level, and also any other control that is calculated by counting and aggregating persons them to the household level.

@jamiecook
Copy link
Contributor Author

also - CI tests seem to be failing with an open pipeline error?

image

I don't believe this is related to the PR

@jamiecook
Copy link
Contributor Author

@bstabler - is there anything else you need from me on this PR?
Really interested to hear if you consider this a good improvement

@bstabler
Copy link
Contributor

@jamiecook - thanks again. It looks there are two more things to resolve:

  • pycodestyle requirements
populationsim/simul_balancer.py:241:46: E231 missing whitespace after ','
The command "pycodestyle populationsim" exited with 1.
  • the results for the test example changed?

https://travis-ci.com/github/ActivitySim/populationsim/jobs/338254226#L1067

@bstabler
Copy link
Contributor

bstabler commented Jun 4, 2020

@jamiecook - @binnympaul and I discussed the PR and everything looks good except for one issue - the results appear to be different than before. You can see this here, which says the expected TAZ 100 HH count is different. You can see the original test code here. We're not sure why this would be since we understand your change to not change the mathematics when run with example_calm and the same configuration files. Can you run example_calm with the existing and suggested revisions and confirm that the resulting tables are exactly the same? If not, can you investigate the differences and report back? Thanks for working with us to get this finalized for contribution!

@jamiecook
Copy link
Contributor Author

@bstabler - I have been looking into this and trying to get the tests running locally to debug. Is there a reason why there are so many

RuntimeError: Pipeline is already open!

In the test results?

@bstabler
Copy link
Contributor

bstabler commented Jun 4, 2020

@jamiecook not really. I think @toliwaga fixed this recently for ActivitySim? We should really do the same for PopulationSim.

@jamiecook
Copy link
Contributor Author

Okay, looking into this - the change here won't modify results if you have binary incidence values. Which i believe most of calm's controls are. But crucially, it's most, not all of the controls. I believe that the person controls relating to students or occupation groups

students_by_housing_type,TAZ,persons,1000,OSUFAM,persons.OSUTAG == 1

When these are tallied up to the household level, if there are more than one such person in the household, the incidence will be > 1. And if you have a control total for that which you need to hit, it makes sense (to me) that you ascribe a higher rate of change to the weights attached to households with more of that type of person in them.

This is the debug logging I used to find this:

c=5 
incidence[c] = [0 2 3 1 0 0]
incidence[c] > 0 = [False  True  True  True False False]  
gamma[c] = 1.043136949226374
pow(gamma[c], incidence[c]) = [1.0, 1.0881346948413069, 1.1350735059241321, 1.043136949226374, 1.0, 1.0]

So in the old code it would have factored up the weights for the 1 person household at the same rate (+4%) as the households with 2 or 3 of that type of person. Under this change it would increase the 1 person household by 4%, the 2 person hh by 8.8% and the 3 person hh by 13.5%.

This seems to have some positive impact on the number of iterations to converge in the test data (not that this is going to be a realistic data set to measure this).

# with the change
{'converged': True, 'iter': 669, 'delta': 7.309728748220815e-10, 'max_gamma_dif': 9.634573139294389e-10}
#  without the change
{'converged': True, 'iter': 726, 'delta': 2.9239007511468646e-10, 'max_gamma_dif': 9.491205599232444e-10}

I'm not sure of a good way to prove that this improves convergence in all cases, but as I said in the description, it allows us to converge in cases where we encode the hh_size directly rather than as a series of dummy variables.

I've also done a quick validation that python's pow method isn't doing anything funny and if you have ONLY binary incidences, that it is equivalent.

pow(gamma[c], 0) = 1.043136949226374 
gamma[c] = 1.043136949226374 
pow(gamma[c], 0) - gamma[c] = 0.0

@bstabler
Copy link
Contributor

bstabler commented Jun 6, 2020

@jamiecook - thanks for tracking this down. The difference is now explainable so please update the expected test results so the tests will pass. Also, please run the validation notebook before and after the change to see how much the results changed. We're likely good with the change, but just to make sure, we'd like to see the differences before accepting the PR.

@binnympaul
Copy link
Collaborator

binnympaul commented Jun 6, 2020

@jamiecook - thanks for checking this! I reviewed the original paper, PopSyn3 Java code and original PopulationSim technical specification memo. Your are right about the specification that balancing factor (gamma) must be exponentiated by the incidence value. As you reported, this would perhaps lead to faster convergence. I hope the validation would improve too, especially for person level controls.
However, it is my understanding that this fix would not have any bearing on how the incidence is created for a control. By design, binary incidence (0 or 1) is created for a household-level control and count incidence (number of persons) is created for person-level controls.

@jamiecook
Copy link
Contributor Author

@binnympaul - I don't believe any other changes (to incidence creation) are required for this to be incorporated, as you point out count incidence are already created for person level controls, this change leverages that fact.

@jamiecook
Copy link
Contributor Author

On the validation notebook, my intuition is telling me that this is a positive change across the board.
I've modified the validation notebook to do a side by side histogram of the differences (control-actual) and across all the categories I am getting smaller differences.

example_calm_validation_comparison

Would it be useful to have a separate "validation_comparison.ipynb" notebook?

@jamiecook
Copy link
Contributor Author

image

@jamiecook
Copy link
Contributor Author

Same as above plot, but zoomed
image

@@ -93,7 +93,7 @@
" packages = 'pyyaml pandas numpy matplotlib'\n",
" ret = os.system(f'conda install {packages}')\n",
" if ret != 0:\n",
" os.system(f'pip install {packages}')"
" os.system(f'{sys.executable} -m pip install {packages}')"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you shouldn't call pip or python directly in case the user path picks a different python than the one running the notebook.

@@ -840,7 +840,7 @@
}
],
"source": [
"def meta_geog_df(meta_geog):\n",
"def meta_geog_df(summary_df, meta_geog):\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

modify the methods to not use global variables and instead pass requisite data as arguments.

" stats.append(s)\n",
" \n",
" ax.set_title(f\"{params['geography']} - {params['name']}\")\n",
" ax.set_ylabel('Frequency'); ax.set_xlabel('Difference: control vs. result')\n",
" ax.scatter(f.index, f)\n",
" ax.hist(diff, bins=10, range=(-5,5), alpha=0.5)\n",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a histogram plot, rather than a scatter plot, to show the histogram.

@jamiecook
Copy link
Contributor Author

@bstabler - I have updated tests, added some other small quality of life changes which I've documented in the "files changed" section.
Is my interpretation of the validation the same as yours?
Is there anything else you need to close this out?

@blakerosenthal
Copy link
Contributor

hey @jamiecook thanks for the validation notebook improvements. would you be able to include the updated output cells in this PR (for the CALM data)? I'm mostly interested in the github render reflecting the histogram change.

@bstabler
Copy link
Contributor

thanks @jamiecook, @blakerosenthal. Let's get the notebook issue fixed and then we'll pull this. Thanks for the great work @jamiecook!

@jamiecook
Copy link
Contributor Author

good point @blakerosenthal - I found a couple of typos I introduced as well so it was definitely a good idea to do a full run through with it.

@blakerosenthal
Copy link
Contributor

much better -- this looks great, thanks @jamiecook!

cc @bstabler this looks good to go :)

@bstabler bstabler merged commit 319c669 into ActivitySim:develop Jun 19, 2020
@lmz lmz mentioned this pull request Sep 24, 2020
bstabler added a commit that referenced this pull request Oct 20, 2020
* update package version number as well

* Allow non-binary incidence (#123)

* Allow non-binary incidence

* style

* update tests to pass

* add some progress indication

* tidy up validation script, use histogram for a histogram

* fix render and some typos

Co-authored-by: bstabler <benstabler@yahoo.com>
bstabler added a commit that referenced this pull request Oct 20, 2020
* update package version number as well

* Allow non-binary incidence (#123)

* Allow non-binary incidence

* style

* update tests to pass

* add some progress indication

* tidy up validation script, use histogram for a histogram

* fix render and some typos

* increment version

Co-authored-by: Jamie Cook <jamie.cook@veitchlister.com.au>
@stefancoe
Copy link

@bstabler I just ran the latest version with our existing config & control files and the results are much better for total persons. Instead of being about 60,000 short we are now only short by 80. The target is about 4 million. We did not change our categorical HH size variables. Could this issue/code change be the reason? Thanks!

@jamiecook
Copy link
Contributor Author

Excellent news @stefancoe, this change shouldn't directly change your concordance to HH size if you are using categorical variables, but if you have population expressed as household level constraint it will definitely help. Which sounds like what you have there. Really awesome to hear that this is helping.

@binnympaul
Copy link
Collaborator

@jamiecook , @stefancoe : This is good news! As we were hoping, this fix seem to be improving the validation of person level controls. I will test this on one of our older implementations to confirm.

@stefancoe
Copy link

@jamiecook @binnympaul Yes, we are very excited about this! I was going to test the old code/new code using the calm example to see if it had a similar impact, but I did not see total persons as a person level control.

In case you are interested, here is a link to our controls. We have since added tenure, which I will put up soon.
https://github.com/psrc/populationsim/blob/master/psrc/configs/controls.csv

Thanks!

bstabler added a commit that referenced this pull request Aug 26, 2021
* update package version number as well

* Allow non-binary incidence (#123)

* Allow non-binary incidence

* style

* update tests to pass

* add some progress indication

* tidy up validation script, use histogram for a histogram

* fix render and some typos

* increment version

* deprecate py2.7

* Multiprocess (#130)

* [Bugfix] Allow seed and meta geography to be the same (#139)

* Fixes bug where if the seed geography is the same as the meta_geography, pandas has a small panic attack and the run will fail.

* add cytoolz to the "requirements"

* fix another activitysim change

* Absolute bounds (#136)


* adding upper/lower bounds to weighting use case

* #137, #134, #133, #131

Co-authored-by: Jamie Cook <jamie.cook@veitchlister.com.au>
Co-authored-by: Blake Rosenthal <blake.rosenthal@rsginc.com>
Co-authored-by: Ben Stabler <bstabler@users.noreply.github.com>
Co-authored-by: Leah Flake <leah.flake@rsginc.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants