Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update resolve_bill method #149

Merged
merged 7 commits into from
Oct 18, 2024
Merged
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 19 additions & 6 deletions openstates/importers/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -169,15 +169,28 @@ def resolve_bill(self, bill_id: str, *, date: str) -> typing.Optional[_ID]:
if bill_transform_func:
bill_id = bill_transform_func(bill_id)

# move the start_date up a bit in case the event is on the last day of a session to compare with end_date
# Some steps here to first find the session that match the incoming event using the event date
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You make a good point about increasing queries: in the past lots of queries in resolve_bill() has caused some real DB issues. So I think it could indeed be worth a bit more refactor to move this logic somewhere else.

The Importer base class has a property already called session_cache (see here). Here's what I'm thinking:

  • We could add a new method to base importer get_all_sessions() but there is a possible bad scenario where get_session() (which already exists and is used with Bills and Votes) loads a single session or two into session_cache. That creates a situation where it is impossible to know if we have all of the sessions or not. So we don't want to set up a scenario where get_all_sessions() logic is set up to fail by get_session()
  • So then maybe make a change to get_session() (here) so that it loads all sessions into session_cache instead of just the one requested session (but it would still return the requested session). Basically it is not any less performant to query for 16 sessions than it is to query for one session, so why not load them all anytime some client code wants a session loaded from the DB?
  • resolve_bill() calls get_session() or some new method get_all_sessions(), meaning that it is going to get back all the sessions as a list of dicts, and there will be only one DB query to get them.
  • then instead the logic here loops through the list of sessions to make comparisons and find the right one (instead of db query with WHERE clauses).

OR - leave the logic here but implement a new self.current_session_id_cache property to save it for use by subsequent events.

All that said - there are generally many fewer events returned by an events scraper (vs. bills/votes). And I think only Event importer uses resolve_bill(). So we definitely could just merge this in and wait to see if performance issues appear before making these kind of changes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point, thank you for the additional insight. I will take a good.

# If a unique session is not found, then use the session with the latest "start_date"
date = datetime.fromisoformat(date)
new_date = date - timedelta(days=1)
legislative_session = LegislativeSession.objects.filter(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be legislative_sessions (plural) since it can return more than one

Q(end_date__gte=date) | Q(end_date=""),
start_date__lte=date,
jurisdiction_id=self.jurisdiction_id,
)
session_ids = {each.id for each in legislative_session}

if len(session_ids) == 1:
session_id = session_ids.pop()
else:
legislative_session = (
LegislativeSession.objects.filter(jurisdiction_id=self.jurisdiction_id)
.order_by("-start_date")
.first()
)
session_id = legislative_session.id

objects = Bill.objects.filter(
Q(legislative_session__end_date__gte=new_date)
| Q(legislative_session__end_date=""),
legislative_session__start_date__lte=date,
legislative_session__jurisdiction_id=self.jurisdiction_id,
legislative_session__id=session_id,
identifier=bill_id,
)
ids = {each.id for each in objects}
Expand Down
Loading