Store NO_MATCH_RESOURCE citations #4920

Open
grossir opened this issue Jan 14, 2025 · 3 comments
@grossir
Contributor

grossir commented Jan 14, 2025

We can store the unmatched citations in order to

  • know how many we are missing
  • update the resolutions as we fill the gaps

We need to:

  1. Create a model for storing them under the citations app, maybe UnmatchedCitation?
from django.db import models

from cl.search.models import Citation, Opinion


class UnmatchedCitation(models.Model):
    citing_opinion_id = models.ForeignKey(
        Opinion,
        help_text="The opinion citing this citation",
        # on_delete is required by Django; CASCADE is an assumption here
        on_delete=models.CASCADE,
    )
    # We currently don't match FullJournalCitation or FullLawCitation. It may
    # be interesting to track which are referenced most, so as to decide
    # which to implement first.
    resource_type = models.TextField(help_text="The eyecite resource type")
    citation_string = models.TextField(
        help_text=(
            "The unparsed citation string in case it doesn't match "
            "the regular citation model below"
        ),
    )

    # Below is the same structure as the search.Citation model
    volume = models.SmallIntegerField(
        help_text="The volume of the reporter", null=True
    )
    reporter = models.TextField(
        help_text="The abbreviation for the reporter",
    )
    page = models.TextField(
        help_text=(
            "The 'page' of the citation in the reporter. Unfortunately, "
            "this is not an integer, but is a string-type because "
            "several jurisdictions do funny things with the so-called "
            "'page'. For example, we have seen Roman numerals in "
            "Nebraska, 13301-M in Connecticut, and 144M in Montana."
        ),
    )
    type = models.SmallIntegerField(
        help_text="The type of citation that this is.",
        choices=Citation.CITATION_TYPES,
    )
  2. Update the ingestion processes so that when a new citation is ingested, it is looked up in the UnmatchedCitation table; if it exists, all the Opinions citing it have their annotated HTML updated
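
As a rough illustration of the lookup step, here is a minimal sketch in plain Python. It uses a dataclass instead of a Django model so it runs on its own; `UnmatchedRow`, `opinions_to_reannotate`, and the sample rows are hypothetical names for illustration, not existing CourtListener code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnmatchedRow:
    """Hypothetical stand-in for one row of the proposed UnmatchedCitation table."""

    citing_opinion_id: int
    volume: int
    reporter: str
    page: str  # kept as a string, as in the proposed model


def opinions_to_reannotate(new_citation, unmatched_rows):
    """Return the sorted ids of opinions that cite the newly ingested citation.

    new_citation is a (volume, reporter, page) tuple.
    """
    return sorted(
        {
            row.citing_opinion_id
            for row in unmatched_rows
            if (row.volume, row.reporter, row.page) == new_citation
        }
    )


rows = [
    UnmatchedRow(citing_opinion_id=10, volume=550, reporter="U.S.", page="544"),
    UnmatchedRow(citing_opinion_id=11, volume=550, reporter="U.S.", page="544"),
    UnmatchedRow(citing_opinion_id=12, volume=1, reporter="Neb.", page="xiv"),
]

# Opinions 10 and 11 would have their annotated HTML regenerated.
print(opinions_to_reannotate((550, "U.S.", "544"), rows))
```

In a real implementation the filtering would presumably be a database query rather than a Python loop.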
@flooie flooie moved this to General Backlog in Case Law Sprint Jan 14, 2025
@mlissner
Member

mlissner commented Jan 14, 2025

That table looks about right to me. I think we'll want an index on some of the fields so that lookups in the table can be fast, and I guess we'll want a unique_together on the opinion and citation fields? citing_opinion_id can just be called citing_opinion.
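
To make the index and uniqueness idea concrete, here is a small sketch using sqlite3 from the standard library rather than Django, so that it runs on its own. The table name, index names, and choice of columns are assumptions for illustration; in Django the same intent would presumably be expressed through the model's Meta options.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE unmatched_citation (
        id INTEGER PRIMARY KEY,
        citing_opinion_id INTEGER NOT NULL,
        volume INTEGER,
        reporter TEXT NOT NULL,
        page TEXT NOT NULL
    );
    -- At most one row per (opinion, citation) pair.
    CREATE UNIQUE INDEX uniq_opinion_citation
        ON unmatched_citation (citing_opinion_id, volume, reporter, page);
    -- Speeds up "which opinions cite this volume/reporter/page?" lookups.
    CREATE INDEX idx_citation_lookup
        ON unmatched_citation (volume, reporter, page);
    """
)

insert = (
    "INSERT OR IGNORE INTO unmatched_citation "
    "(citing_opinion_id, volume, reporter, page) VALUES (?, ?, ?, ?)"
)
sample = [
    (10, 550, "U.S.", "544"),
    (10, 550, "U.S.", "544"),  # same opinion, same citation: ignored
    (11, 550, "U.S.", "544"),
]
for row in sample:
    conn.execute(insert, row)

count = conn.execute("SELECT COUNT(*) FROM unmatched_citation").fetchone()[0]
print(count)
```

The duplicate insert is silently dropped by the unique index, so only two rows are stored.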

I'm not sure that your second step is the right approach. Like, it might be better when adding citations just to flag the ones that are no longer Unmatched, and to have a second script that comes around to re-run the citation finder on those opinions.

One thing I don't like about your approach is that if you add n citations that a single opinion lacked, then the citation finder will be run n times on that opinion. Like, if an opinion is missing four citations and you add all four in a batch, it'll get extracted four times, when once would have been enough.
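
A tiny sketch of that cost, in plain Python with made-up data: grouping a batch of newly added citations by citing opinion shows how many finder runs the per-citation approach would trigger versus one run per affected opinion. `plan_reruns` and `unmatched_index` are hypothetical names.

```python
from collections import defaultdict


def plan_reruns(newly_added, unmatched_index):
    """Map each affected opinion id to the citations that were just filled in.

    newly_added: iterable of citation strings added in one batch.
    unmatched_index: dict of citation string -> list of citing opinion ids.
    """
    per_opinion = defaultdict(list)
    for citation in newly_added:
        for opinion_id in unmatched_index.get(citation, []):
            per_opinion[opinion_id].append(citation)
    return dict(per_opinion)


# Opinion 42 is missing four citations; opinion 43 is missing one of them.
unmatched_index = {
    "1 U.S. 1": [42],
    "2 U.S. 2": [42],
    "3 U.S. 3": [42],
    "4 U.S. 4": [42, 43],
}
batch = ["1 U.S. 1", "2 U.S. 2", "3 U.S. 3", "4 U.S. 4"]

plan = plan_reruns(batch, unmatched_index)

# Per-citation triggering: one finder run for every (citation, opinion) pair.
naive_runs = sum(len(unmatched_index.get(c, [])) for c in batch)
# Grouped: one finder run per affected opinion.
grouped_runs = len(plan)
print(naive_runs, grouped_runs)
```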

A possibly-better way to do this is to add a column to the UnmatchedCitation table for status. It could have values like:

  • unmatched -> We don't have it in the Citation table
  • found -> It's in the citation table, but we haven't tried to fix the text of the decision yet
  • fixed -> We ran the citation finder and it worked
  • failed_ambiguous -> We ran the citation finder after thinking we had the citation and it still failed b/c the citation was ambiguous
  • failed -> We ran the citation finder, but it failed for some other reason

Then, we could have a second script that we run periodically and that looks for rows with a status of found. When it finds one, it re-runs the citation finder, which flips the value from found to either fixed or one of the failed statuses. This might be over-engineered, but it'd help us with certain citations that we can't seem to match up, maybe?
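
The status flow could look something like the sketch below. It is plain Python, not the actual task code: the `Status` enum mirrors the values listed above, while `rerun_found` and the `resolve` callable (which returns how many candidate matches a citation has) are hypothetical stand-ins for the second script and the citation finder.

```python
from enum import Enum


class Status(Enum):
    UNMATCHED = "unmatched"
    FOUND = "found"
    FIXED = "fixed"
    FAILED_AMBIGUOUS = "failed_ambiguous"
    FAILED = "failed"


def rerun_found(rows, resolve):
    """Re-run the finder for rows marked FOUND and record the outcome.

    rows: list of dicts with "citation" and "status" keys.
    resolve: hypothetical callable returning the number of candidate
             matches for a citation string; it may raise on errors.
    """
    for row in rows:
        if row["status"] is not Status.FOUND:
            continue  # only rows flagged as found are retried
        try:
            matches = resolve(row["citation"])
        except Exception:
            row["status"] = Status.FAILED
            continue
        if matches == 1:
            row["status"] = Status.FIXED
        elif matches > 1:
            row["status"] = Status.FAILED_AMBIGUOUS
        else:
            row["status"] = Status.FAILED
    return rows


def fake_resolve(citation):
    # Pretend lookup: one match, two matches, or none.
    return {"1 U.S. 1": 1, "2 U.S. 2": 2}.get(citation, 0)


rows = [
    {"citation": "1 U.S. 1", "status": Status.FOUND},
    {"citation": "2 U.S. 2", "status": Status.FOUND},
    {"citation": "3 U.S. 3", "status": Status.UNMATCHED},
]
rerun_found(rows, fake_resolve)
print([row["status"].value for row in rows])
```

Rows that are still unmatched are left alone, so the script can be run as often as needed.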

@flooie flooie moved this from General Backlog to Backlog Jan 13 to Jan 24 in Case Law Sprint Jan 15, 2025
@grossir grossir moved this from Backlog Jan 13 to Jan 24 to In progress in Case Law Sprint Jan 16, 2025
grossir added a commit that referenced this issue Jan 21, 2025
Solves #4920

- Add new model UnmatchedCitation on citations app
- refactor cl.search.models.Citation to create a BaseCitation abstract model to reuse on the UnmatchedCitation model
- updates cl.citations.tasks.store_opinion_citations_and_update_parentheticals to handle storing and updating unmatched citations
- updates cl.search.signals to update UnmatchedCitation status when a new Citation is saved
- add tests
@grossir
Contributor Author

grossir commented Jan 21, 2025

Some notes on changes from the proposed model while developing

  1. Discarding the resource_type field
    It was meant to store the eyecite types [FullJournalCitation, FullLawCitation, FullCaseCitation, ShortCaseCitation, etc]; but I realized it would be of little use, because:

    • eyecite doesn't parse anything for the unsupported types [FullJournalCitation, FullLawCitation] except the section token (§),
      so there is no actual citation or metadata to store
    • Unmatched IdCitation and SupraCitation types hold no metadata except for (maybe) a pincite and their position in the text.
      Thus, they are not useful for updating citation resolutions, and provide no further information
    • A short case citation comes after a full case citation. If the full case citation was matched, a short case citation not matching would be due to a typo in the text or some other unexpected error.
  2. Storing metadata
    We will only be storing unmatched FullCaseCitation objects. eyecite sometimes returns metadata taken from the opinion's context. The fields are: ['parenthetical', 'pin_cite', 'year', 'court', 'plaintiff', 'defendant', 'extra'].
    I think year and court are useful for analysis, to know in what court and in what year we may find the missing citation.

  3. Handling updates when a new citation is ingested
    I think the easiest way is using signals. Citations are saved:

    • using instance.save()
      • cl_scrape_opinions.save_everything
      • cl_back_scrape_citations.scrape_court
      • cl.scrapers.tasks.update_document_from_text
    • using Citation.objects.create
      • cl.corpus_importer.utils.add_citations_to_cluster, which is used in many files

    Both of these methods trigger the post_save signal.

grossir added a commit that referenced this issue Jan 21, 2025
Solves #4920

- Add new model UnmatchedCitation on citations app
- refactor cl.search.models.Citation to create a BaseCitation abstract model to reuse on the UnmatchedCitation model
- updates cl.citations.tasks.store_opinion_citations_and_update_parentheticals to handle storing and updating unmatched citations
- updates cl.search.signals to update UnmatchedCitation status when a new Citation is saved
- add tests
- add update_unmatched_citations command to trigger update for found citations
@mlissner
Member

Thanks for the notes. Super helpful now and in the future!
