Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update resolve_bill method #149

Merged
merged 7 commits into from
Oct 18, 2024
Merged

Update resolve_bill method #149

merged 7 commits into from
Oct 18, 2024

Conversation

alexobaseki
Copy link
Contributor

Update resolve_bill: find a session that matches the incoming entity using the entity date. If a unique session is not found, then use the session with the latest "start_date". This does increase the number of calls to the database significantly. However, the result in this case was a 100% match for all event : bill relationship.

Copy link
Contributor

@jessemortenson jessemortenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, one tiny recommendation and some thoughts

date = datetime.fromisoformat(date)
new_date = date - timedelta(days=1)
legislative_session = LegislativeSession.objects.filter(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be legislative_sessions (plural) since it can return more than one

@@ -169,15 +169,28 @@ def resolve_bill(self, bill_id: str, *, date: str) -> typing.Optional[_ID]:
if bill_transform_func:
bill_id = bill_transform_func(bill_id)

# move the start_date up a bit in case the event is on the last day of a session to compare with end_date
# Some steps here to first find the session that match the incoming event using the event date
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You make a good point about increasing queries: in the past lots of queries in resolve_bill() has caused some real DB issues. So I think it could indeed be worth a bit more refactor to move this logic somewhere else.

The Importer base class has a property already called session_cache (see here). Here's what I'm thinking:

  • We could add a new method to base importer get_all_sessions() but there is a possible bad scenario where get_session() (which already exists and is used with Bills and Votes) loads a single session or two into session_cache. That creates a situation where it is impossible to know if we have all of the sessions or not. So we don't want to set up a scenario where get_all_sessions() logic is set up to fail by get_session()
  • So then maybe make a change to get_session() (here) so that it loads all sessions into session_cache instead of just the one requested session (but it would still return the requested session). Basically it is not any less performant to query for 16 sessions than it is to query for one session, so why not load them all anytime some client code wants a session loaded from the DB?
  • resolve_bill() calls get_session() or some new method get_all_sessions(), meaning that it is going to get back all the sessions as a list of dicts, and there will be only one DB query to get them.
  • then instead the logic here loops through the list of sessions to make comparisons and find the right one (instead of db query with WHERE clauses).

OR - leave the logic here but implement a new self.current_session_id_cache property to save it for use by subsequent events.

All that said - there are generally many fewer events returned by an events scraper (vs. bills/votes). And I think only Event importer uses resolve_bill(). So we definitely could just merge this in and wait to see if performance issues appear before making these kind of changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point, thank you for the additional insight. I will take a good.

Copy link
Contributor

@jessemortenson jessemortenson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, two small changes recommended

@@ -128,6 +128,7 @@ def __init__(self, jurisdiction_id: str, do_postimport=True) -> None:
self.pseudo_id_cache: typing.Dict[str, typing.Optional[_ID]] = {}
self.person_cache: typing.Dict[_PersonCacheKey, typing.Optional[str]] = {}
self.session_cache: typing.Dict[str, LegislativeSession] = {}
self.all_sessions_cache: typing.List[LegislativeSession] = []
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Might be worth a comment here explaining the difference between these two

@@ -139,6 +140,12 @@ def __init__(self, jurisdiction_id: str, do_postimport=True) -> None:
if settings.IMPORT_TRANSFORMERS.get(self._type):
self.cached_transformers = settings.IMPORT_TRANSFORMERS[self._type]

def get_all_sessions(self) -> None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this method should return self.all_sessions_cache (rather than ONLY load a value to that property) - that would make it consistent with get_session() which returns the requested session

@alexobaseki alexobaseki merged commit 2819bc2 into main Oct 18, 2024
1 check passed
@alexobaseki alexobaseki deleted the update-resolve-bill-method branch October 18, 2024 20:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants