Allow non-binary incidence #123

jamiecook · 2020-05-19T05:08:46Z

This PR updates the list_balancer and simultaneous_list_balancer to allow for non-binary incidences.

In particular it changes the household weight updating to use a power in line with Equation 13 in Peter Vovsha's 2014 paper which uses a similar methodology.

https://pdfs.semanticscholar.org/f765/07c2b1f534dcb4ca281681080ba2f15140d8.pdf

With this change, household size can be directly included as a control total rather than encoding via a series of dummy variables

target,geography,seed_table,importance,control_field,expression
num_opd,TZ16,households,100000000,num_opd,(households.weight > 0) & (households.weight < np.inf)
pop_opd,TZ16,persons,100000000000,pop_opd,persons.REGION == 1

Testing with the above controls (and some others) gave the following highly accurate correspondence

Review Criteria

Does it contain all the required elements, including a runnable example, documentation, and tests? *
Does it implement good methods (i.e. is it consistent with good practices in travel modeling)? No methods were changed
Are the runtimes reasonable and does it provide documentation justifying this claim? Yes, Yes
Does it include non-Python code, such as C/C++? If so, does it compile on any OS and are compilation instructions included? No it does not
Is it licensed with the ActivitySim license that allows the code to be freely distributed and modified and includes attribution so that the ‘provenance’ of the code can be tracked? Does it include an official release of ownership from the funding agency if applicable? Yes; N/A
Does it appropriately interact with the data pipeline (i.e. it doesn't create new ways of managing data)? N/A
Does it include regression tests to enable checking that consistent results will be returned when updates are made to the framework? Minor change which is consistent with existing tests
Does it include sufficient test coverage and test data for existing and proposed features? All existing tests pass
Any other comments or suggestions for improving the developer experience? No

toliwaga · 2020-05-19T16:40:19Z

This looks pretty straightforward to me – though I have to confess that most of what I knew when I wrote the code has leaked out of my brain in the intervening years.

Is this a straight-up improvement, or should it be an option?

(Or did I get the math wrong when I wrote the code originally and you are just being nice by not saying so?)

binnympaul · 2020-05-19T17:01:06Z

Hi @jamiecook , @toliwaga ,
We do allow non-binary incidence for person level controls, where the incidence value is the actual count of persons in the households belonging to a control group (e,g., age_18_35). However, for household level controls the incidence can only be binary.

Now, household size is a special case which can be specified both as a household level control or a person level control. As a household level control, the controls would be segmented by household size groups. As a person level control, the control total is total population and incidence values are number of persons in each household.

Have you tried any other household level variable with non-binary incidence?

bstabler · 2020-05-19T18:11:01Z

Thanks @jamiecook. Can you also fill out the Contribution Review Criteria so we can make sure the PR is complete? Here is an example from a previous PR.

bstabler · 2020-05-19T20:37:01Z

also @jamiecook please PR to develop. Master is our release branch, which we only push to from develop, after develop has passed the tests.

jamiecook · 2020-05-20T23:02:04Z

@bstabler - updated to develop branch, added review criteria

@toliwaga - it's pretty much the same as your impl, just slightly more generalised. Previously you were allowing incidence = {0,1+}, which would was equivalent to updating by {gamma ^0=1, gamma ^1=gamma} respectively. You did this by selecting rows where incidence > 0 and *= gamma. So basically this PR handles the broader case of incidence 1+.

@binnympaul - as far as I could see when I tried to set up population as a person level control, it was implemented by counting persons and aggregating them up to the household level where they were treated as you describe. i.e. the incidence table had hh_size = {0,1,2,3,...}. However because of the treatment of this in the code (see comment to @toliwaga) the weights for hh_size = 1 were updated equally to the weights for hh_size = 2. Meaning bad convergence and eventual relaxation of the control to a point where it didn't fit well. If, as in the example_calm project, the hh_size was instead controlled by 5 segments (0,1,2,3,4+), this is fine.

I believe that this fixes the special case where population is specified at the person level, and also any other control that is calculated by counting and aggregating persons them to the household level.

jamiecook · 2020-05-20T23:05:01Z

also - CI tests seem to be failing with an open pipeline error?

I don't believe this is related to the PR

jamiecook · 2020-05-27T11:32:12Z

@bstabler - is there anything else you need from me on this PR?
Really interested to hear if you consider this a good improvement

bstabler · 2020-05-29T00:39:53Z

@jamiecook - thanks again. It looks there are two more things to resolve:

pycodestyle requirements

populationsim/simul_balancer.py:241:46: E231 missing whitespace after ','
The command "pycodestyle populationsim" exited with 1.

the results for the test example changed?

https://travis-ci.com/github/ActivitySim/populationsim/jobs/338254226#L1067

bstabler · 2020-06-04T01:01:41Z

@jamiecook - @binnympaul and I discussed the PR and everything looks good except for one issue - the results appear to be different than before. You can see this here, which says the expected TAZ 100 HH count is different. You can see the original test code here. We're not sure why this would be since we understand your change to not change the mathematics when run with example_calm and the same configuration files. Can you run example_calm with the existing and suggested revisions and confirm that the resulting tables are exactly the same? If not, can you investigate the differences and report back? Thanks for working with us to get this finalized for contribution!

jamiecook · 2020-06-04T22:50:24Z

@bstabler - I have been looking into this and trying to get the tests running locally to debug. Is there a reason why there are so many

RuntimeError: Pipeline is already open!

In the test results?

bstabler · 2020-06-04T23:39:21Z

@jamiecook not really. I think @toliwaga fixed this recently for ActivitySim? We should really do the same for PopulationSim.

jamiecook · 2020-06-05T00:29:13Z

Okay, looking into this - the change here won't modify results if you have binary incidence values. Which i believe most of calm's controls are. But crucially, it's most, not all of the controls. I believe that the person controls relating to students or occupation groups

students_by_housing_type,TAZ,persons,1000,OSUFAM,persons.OSUTAG == 1

When these are tallied up to the household level, if there are more than one such person in the household, the incidence will be > 1. And if you have a control total for that which you need to hit, it makes sense (to me) that you ascribe a higher rate of change to the weights attached to households with more of that type of person in them.

This is the debug logging I used to find this:

c=5 
incidence[c] = [0 2 3 1 0 0]
incidence[c] > 0 = [False  True  True  True False False]  
gamma[c] = 1.043136949226374
pow(gamma[c], incidence[c]) = [1.0, 1.0881346948413069, 1.1350735059241321, 1.043136949226374, 1.0, 1.0]

So in the old code it would have factored up the weights for the 1 person household at the same rate (+4%) as the households with 2 or 3 of that type of person. Under this change it would increase the 1 person household by 4%, the 2 person hh by 8.8% and the 3 person hh by 13.5%.

This seems to have some positive impact on the number of iterations to converge in the test data (not that this is going to be a realistic data set to measure this).

# with the change
{'converged': True, 'iter': 669, 'delta': 7.309728748220815e-10, 'max_gamma_dif': 9.634573139294389e-10}
#  without the change
{'converged': True, 'iter': 726, 'delta': 2.9239007511468646e-10, 'max_gamma_dif': 9.491205599232444e-10}

I'm not sure of a good way to prove that this improves convergence in all cases, but as I said in the description, it allows us to converge in cases where we encode the hh_size directly rather than as a series of dummy variables.

I've also done a quick validation that python's pow method isn't doing anything funny and if you have ONLY binary incidences, that it is equivalent.

pow(gamma[c], 0) = 1.043136949226374 
gamma[c] = 1.043136949226374 
pow(gamma[c], 0) - gamma[c] = 0.0

bstabler · 2020-06-06T00:05:11Z

@jamiecook - thanks for tracking this down. The difference is now explainable so please update the expected test results so the tests will pass. Also, please run the validation notebook before and after the change to see how much the results changed. We're likely good with the change, but just to make sure, we'd like to see the differences before accepting the PR.

binnympaul · 2020-06-06T18:07:35Z

@jamiecook - thanks for checking this! I reviewed the original paper, PopSyn3 Java code and original PopulationSim technical specification memo. Your are right about the specification that balancing factor (gamma) must be exponentiated by the incidence value. As you reported, this would perhaps lead to faster convergence. I hope the validation would improve too, especially for person level controls.
However, it is my understanding that this fix would not have any bearing on how the incidence is created for a control. By design, binary incidence (0 or 1) is created for a household-level control and count incidence (number of persons) is created for person-level controls.

jamiecook · 2020-06-08T08:01:05Z

@binnympaul - I don't believe any other changes (to incidence creation) are required for this to be incorporated, as you point out count incidence are already created for person level controls, this change leverages that fact.

jamiecook · 2020-06-09T08:25:42Z

On the validation notebook, my intuition is telling me that this is a positive change across the board.
I've modified the validation notebook to do a side by side histogram of the differences (control-actual) and across all the categories I am getting smaller differences.

Would it be useful to have a separate "validation_comparison.ipynb" notebook?

jamiecook · 2020-06-09T08:36:31Z

jamiecook · 2020-06-09T08:38:31Z

Same as above plot, but zoomed

jamiecook · 2020-06-11T06:15:49Z

scripts/validation.ipynb

@@ -93,7 +93,7 @@
    "    packages = 'pyyaml pandas numpy matplotlib'\n",
    "    ret = os.system(f'conda install {packages}')\n",
    "    if ret != 0:\n",
-    "        os.system(f'pip install {packages}')"
+    "        os.system(f'{sys.executable} -m pip install {packages}')"


you shouldn't call pip or python directly in case the user path picks a different python than the one running the notebook.

jamiecook · 2020-06-11T06:16:13Z

scripts/validation.ipynb

@@ -840,7 +840,7 @@
    }
   ],
   "source": [
-    "def meta_geog_df(meta_geog):\n",
+    "def meta_geog_df(summary_df, meta_geog):\n",


modify the methods to not use global variables and instead pass requisite data as arguments.

jamiecook · 2020-06-11T06:16:35Z

scripts/validation.ipynb

    "    stats.append(s)\n",
    "    \n",
    "    ax.set_title(f\"{params['geography']} - {params['name']}\")\n",
    "    ax.set_ylabel('Frequency'); ax.set_xlabel('Difference: control vs. result')\n",
-    "    ax.scatter(f.index, f)\n",
+    "    ax.hist(diff, bins=10, range=(-5,5), alpha=0.5)\n",


Use a histogram plot, rather than a scatter plot, to show the histogram.

jamiecook · 2020-06-11T06:17:56Z

@bstabler - I have updated tests, added some other small quality of life changes which I've documented in the "files changed" section.
Is my interpretation of the validation the same as yours?
Is there anything else you need to close this out?

blakerosenthal · 2020-06-12T00:29:13Z

hey @jamiecook thanks for the validation notebook improvements. would you be able to include the updated output cells in this PR (for the CALM data)? I'm mostly interested in the github render reflecting the histogram change.

bstabler · 2020-06-12T00:32:15Z

thanks @jamiecook, @blakerosenthal. Let's get the notebook issue fixed and then we'll pull this. Thanks for the great work @jamiecook!

jamiecook · 2020-06-12T00:42:59Z

good point @blakerosenthal - I found a couple of typos I introduced as well so it was definitely a good idea to do a full run through with it.

blakerosenthal · 2020-06-12T02:03:54Z

much better -- this looks great, thanks @jamiecook!

cc @bstabler this looks good to go :)

* update package version number as well * Allow non-binary incidence (#123) * Allow non-binary incidence * style * update tests to pass * add some progress indication * tidy up validation script, use histogram for a histogram * fix render and some typos Co-authored-by: bstabler <benstabler@yahoo.com>

* update package version number as well * Allow non-binary incidence (#123) * Allow non-binary incidence * style * update tests to pass * add some progress indication * tidy up validation script, use histogram for a histogram * fix render and some typos * increment version Co-authored-by: Jamie Cook <jamie.cook@veitchlister.com.au>

stefancoe · 2021-02-23T16:22:02Z

@bstabler I just ran the latest version with our existing config & control files and the results are much better for total persons. Instead of being about 60,000 short we are now only short by 80. The target is about 4 million. We did not change our categorical HH size variables. Could this issue/code change be the reason? Thanks!

jamiecook · 2021-02-24T00:26:40Z

Excellent news @stefancoe, this change shouldn't directly change your concordance to HH size if you are using categorical variables, but if you have population expressed as household level constraint it will definitely help. Which sounds like what you have there. Really awesome to hear that this is helping.

binnympaul · 2021-02-24T03:06:02Z

@jamiecook , @stefancoe : This is good news! As we were hoping, this fix seem to be improving the validation of person level controls. I will test this on one of our older implementations to confirm.

stefancoe · 2021-02-24T03:35:43Z

@jamiecook @binnympaul Yes, we are very excited about this! I was going to test the old code/new code using the calm example to see if it had a similar impact, but I did not see total persons as a person level control.

In case you are interested, here is a link to our controls. We have since added tenure, which I will put up soon.
https://github.com/psrc/populationsim/blob/master/psrc/configs/controls.csv

Thanks!

* update package version number as well * Allow non-binary incidence (#123) * Allow non-binary incidence * style * update tests to pass * add some progress indication * tidy up validation script, use histogram for a histogram * fix render and some typos * increment version * deprecate py2.7 * Multiprocess (#130) * [Bugfix] Allow seed and meta geography to be the same (#139) * Fixes bug where if the seed geography is the same as the meta_geography, pandas has a small panic attack and the run will fail. * add cytoolz to the "requirements" * fix another activitysim change * Absolute bounds (#136) * adding upper/lower bounds to weighting use case * #137, #134, #133, #131 Co-authored-by: Jamie Cook <jamie.cook@veitchlister.com.au> Co-authored-by: Blake Rosenthal <blake.rosenthal@rsginc.com> Co-authored-by: Ben Stabler <bstabler@users.noreply.github.com> Co-authored-by: Leah Flake <leah.flake@rsginc.com>

jamiecook changed the base branch from master to develop May 20, 2020 22:38

Allow non-binary incidence

86f17d5

jamiecook force-pushed the jamiecook/allow_non_binary_incidence branch from 994567c to 86f17d5 Compare May 20, 2020 22:40

style

8d76b0b

jamiecook added 3 commits June 11, 2020 14:49

update tests to pass

7b57863

add some progress indication

0990085

tidy up validation script, use histogram for a histogram

6fbbbed

jamiecook commented Jun 11, 2020

View reviewed changes

fix render and some typos

a24e528

bstabler merged commit 319c669 into ActivitySim:develop Jun 19, 2020

lmz mentioned this pull request Sep 24, 2020

Non-binary incidence #127

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow non-binary incidence #123

Allow non-binary incidence #123

jamiecook commented May 19, 2020 •

edited

Loading

toliwaga commented May 19, 2020

binnympaul commented May 19, 2020

bstabler commented May 19, 2020

bstabler commented May 19, 2020

jamiecook commented May 20, 2020

jamiecook commented May 20, 2020

jamiecook commented May 27, 2020

bstabler commented May 29, 2020

bstabler commented Jun 4, 2020

jamiecook commented Jun 4, 2020

bstabler commented Jun 4, 2020 •

edited

Loading

jamiecook commented Jun 5, 2020

bstabler commented Jun 6, 2020

binnympaul commented Jun 6, 2020 •

edited

Loading

jamiecook commented Jun 8, 2020

jamiecook commented Jun 9, 2020

jamiecook commented Jun 9, 2020

jamiecook commented Jun 9, 2020

jamiecook Jun 11, 2020

jamiecook Jun 11, 2020

jamiecook Jun 11, 2020

jamiecook commented Jun 11, 2020

blakerosenthal commented Jun 12, 2020

bstabler commented Jun 12, 2020

jamiecook commented Jun 12, 2020

blakerosenthal commented Jun 12, 2020

stefancoe commented Feb 23, 2021

jamiecook commented Feb 24, 2021

binnympaul commented Feb 24, 2021

stefancoe commented Feb 24, 2021

Allow non-binary incidence #123

Allow non-binary incidence #123

Conversation

jamiecook commented May 19, 2020 • edited Loading

Review Criteria

toliwaga commented May 19, 2020

binnympaul commented May 19, 2020

bstabler commented May 19, 2020

bstabler commented May 19, 2020

jamiecook commented May 20, 2020

jamiecook commented May 20, 2020

jamiecook commented May 27, 2020

bstabler commented May 29, 2020

bstabler commented Jun 4, 2020

jamiecook commented Jun 4, 2020

bstabler commented Jun 4, 2020 • edited Loading

jamiecook commented Jun 5, 2020

bstabler commented Jun 6, 2020

binnympaul commented Jun 6, 2020 • edited Loading

jamiecook commented Jun 8, 2020

jamiecook commented Jun 9, 2020

jamiecook commented Jun 9, 2020

jamiecook commented Jun 9, 2020

jamiecook Jun 11, 2020

Choose a reason for hiding this comment

jamiecook Jun 11, 2020

Choose a reason for hiding this comment

jamiecook Jun 11, 2020

Choose a reason for hiding this comment

jamiecook commented Jun 11, 2020

blakerosenthal commented Jun 12, 2020

bstabler commented Jun 12, 2020

jamiecook commented Jun 12, 2020

blakerosenthal commented Jun 12, 2020

stefancoe commented Feb 23, 2021

jamiecook commented Feb 24, 2021

binnympaul commented Feb 24, 2021

stefancoe commented Feb 24, 2021

jamiecook commented May 19, 2020 •

edited

Loading

bstabler commented Jun 4, 2020 •

edited

Loading

binnympaul commented Jun 6, 2020 •

edited

Loading