
feat(nymisc): Add scraper for nyfam, nycity, nycounty, nysupreme, nycciv, nyccrim, nysurrogate, nydistrict, nyjustice, nyctclaims #848

Merged: 19 commits into freelawproject:main, Jan 9, 2024

Conversation

@grossir grossir (Contributor) commented Jan 5, 2024

Implements #827: almost all of the logic is inside the nymisc.py file. The inheriting classes just define a regex that we will probably need to refine when testing against the backscrapers.

This PR also modifies OpinionSiteLinear, OpinionSite and AbstractSite in a few places to add support for returning the child_court field, as discussed in #847.

This PR will need to be reviewed and tested alongside other changes in courtlistener and reporters-db, but I think it is ready for a first review as it is.
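For context, a rough sketch of what one of the inheriting classes can look like, assuming the shared logic sits in a base Site class in nymisc.py (renamed to nytrial later in this PR) and that each subclass only supplies a court-matching regex; the module path and the court_regex attribute name are illustrative, not necessarily the merged code:

# Hypothetical child scraper, e.g. for the Family Court.
# The base module name (nymisc) and the attribute name (court_regex) are assumptions.
import re

from juriscraper.opinions.united_states.state import nymisc


class Site(nymisc.Site):
    # Keep only rows whose court name matches the Family Court;
    # docket, judge and citation parsing stay in the base class.
    court_regex = re.compile(r"Fam(ily)?\s+C(our)?t")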

@grossir grossir requested a review from flooie January 5, 2024 06:22
@grossir grossir (Contributor, Author) commented Jan 5, 2024

The test failures are due to a bug in coloctapp.

@flooie flooie (Contributor) commented Jan 5, 2024

Off the bat, I think we have the wrong IDs for the parent courts. This is my fault, but we should check them against what is in the CL db and/or what is in courts_db.
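For that check, a quick lookup against courts_db could look like the sketch below. It assumes the courts_db package's find_court helper, and the court name strings are just examples, not the exact parent courts in this PR; the CL database is still the authoritative answer.

from courts_db import find_court

# Print the candidate court IDs for each name so they can be compared
# against the IDs used in the scrapers and in the CL database.
for name in ["New York Family Court", "New York Court of Claims"]:
    print(name, "->", find_court(name))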

grossir and others added 5 commits January 5, 2024 12:22
- …yct, nycivct, nycrimct, nysurct, nydistct, nyjustct, nyclaimsct
- Renamed base class to nytrial; added clean_judge_str; added nysupct_commercial to use this template
- Simplify judge and docket parsing according to code review
@@ -108,6 +109,9 @@ def _get_precedential_statuses(self):
    def _get_summaries(self):
        return None

    def _get_child_courts(self):
        return None

    def extract_from_text(self, scraped_text):
@flooie flooie (Contributor) Jan 7, 2024

What do you think about this?

# Imports these methods rely on (paths assumed; adjust to wherever
# these helpers actually live in juriscraper):
#   import re
#   from itertools import chain
#   from typing import Any, Dict
#   from juriscraper.lib.judge_parsers import normalize_judge_string
#   from juriscraper.lib.string_utils import harmonize

@staticmethod
def match(scraped_text, pattern):
    # Flatten all capture groups, drop empty strings, return the first hit
    m = re.findall(pattern, scraped_text)
    r = list(filter(None, chain.from_iterable(m)))
    return r[0].strip() if r else ""

def extract_from_text(self, scraped_text: str) -> Dict[str, Any]:
    """Extract docket number, author and citation with three regexes"""
    # Find Author Str
    pattern = r"<td[^>]*>(.*?),\s?[JS]\.</td>|Judge:\s?(.+)|([\w .,]+), J\.\s+"
    author_str = self.match(scraped_text, pattern)

    # Find Docket Number
    pattern = r"</table><br><br\s?/?>\s?(.*)\r?\n|Docket Number:\s?(.+)"
    docket_number = harmonize(self.match(scraped_text, pattern))

    # Find Misc Citation
    pattern = r"\[(?P<volume>\d+) (?P<reporter>Misc [23]d) (?P<page>.+)\]"
    cite_match = re.search(pattern, scraped_text)

    metadata = {
        "Docket": {"docket_number": docket_number},
        "Opinion": {"author_str": normalize_judge_string(author_str)[0]},
    }
    if cite_match:
        metadata["Citation"] = cite_match.groupdict("")

    return metadata
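For reference, the match helper behaves like this on the two docket-number shapes the alternation above targets (the sample strings are made up for illustration):

import re
from itertools import chain


def match(scraped_text, pattern):
    # Same helper as above: flatten all capture groups, drop empties,
    # return the first non-empty hit.
    m = re.findall(pattern, scraped_text)
    r = list(filter(None, chain.from_iterable(m)))
    return r[0].strip() if r else ""


pattern = r"</table><br><br\s?/?>\s?(.*)\r?\n|Docket Number:\s?(.+)"
print(match("</table><br><br/> Index No 123/2023\n", pattern))  # Index No 123/2023
print(match("Docket Number: CR-001234-23\n", pattern))          # CR-001234-23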

@flooie flooie (Contributor) Jan 7, 2024

I took a look at it, and I think we can accomplish all of this with just three simple regex patterns. I think it might make the code easier to maintain? But I could be wrong and would like your thoughts.

@flooie flooie (Contributor)

Also, I obviously put it in the wrong file, so apologies for the confusion, but I don't want to move it.

Comment on lines 153 to 223
def get_docket_number(self, scraped_text: str) -> str:
    """Get docket number

    Sometimes it is explicit in a table at the beginning of the document
    with the heading 'Docket Number'

    Sometimes it is explicit in the first lines of the text with headings
    such as 'Index No', 'Docket No', etc.

    Sometimes it is just a special string that may be numeric only;
    sometimes it may not exist at all

    :param scraped_text: scraped text
    :return: docket number if it exists
    """
    if "<table" in scraped_text:
        html = html_fromstring(scraped_text)
        docket_xpath = "//table[@width='75%']/following-sibling::text()"
        element = html.xpath(docket_xpath)
        docket = element[0] if element else ""
    else:
        docket_regex = r"Docket Number:(?P<docket>.+)"
        match = re.search(docket_regex, scraped_text[:500])
        docket = match.group("docket") if match else ""

    return docket.strip()

def get_judge(self, scraped_text: str) -> str:
    """Get judge from PDF or HTML text

    We delete a trailing ", J." or ", S." after judges' last names,
    which appear in HTML tables. These are honorifics.
    For "S.": https://www.nycourts.gov/reporter/3dseries/2023/2023_50144.htm

    :param scraped_text: string from HTML or PDF
    :return: judge name
    """
    if "<table" in scraped_text:
        html = html_fromstring(scraped_text)
        td = html.xpath(
            '//td[contains(text(), ", J.") or contains(text(), ", S.")]'
        )
        judge = td[0].text_content() if td else ""
    else:
        match = re.search(r"Judge:\s?(?P<judge>.+)", scraped_text)
        judge = match.group("judge") if match else ""

    judge = normalize_judge_string(clean_string(judge))[0]
    judge = re.sub(r" [JS]\.$", "", judge)

    return judge.strip()

def get_citation(self, scraped_text: str) -> Dict[str, str]:
    """Extracts volume, reporter and page of a citation that
    has the following shape: [81 Misc 3d 1211(A)]

    The citation should be searched for at the top of the document, since
    an opinion may cite other opinions in its argumentation

    Tagged as "Official Citation" on the source. For example:
    https://lrb.nycourts.gov/citator/reporter/citations/detailsview.aspx?id=2023_51315

    :param scraped_text: string from HTML or PDF
    :return: dictionary with expected citation fields
    """
    regex = r"\[(?P<volume>\d+) (?P<reporter>Misc 3d) (?P<page>.+)\]"
    match = re.search(regex, scraped_text[:1200])

    if not match:
        return {}

    return match.groupdict("")
@flooie flooie (Contributor)

I think we can drop this code if we use the regex above - but this is all very nice code.

@grossir grossir (Contributor, Author)

Great, I agree with you that simple is better and regex should be enough in this case. I will test the changes and make a new push

metadata = {}

if docket_number:
    clean_docket_number = clean_string(docket_number)
@flooie flooie (Contributor)

clean_string is called in harmonize. I think it's redundant here.
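If harmonize really does run clean_string on its input, the explicit call can simply be dropped; a hypothetical before/after (the variable names follow the snippet above, and the claim about harmonize's internals comes from the comment, not from checking the source):

# Before: the value is cleaned twice, once here and once inside harmonize
clean_docket_number = clean_string(docket_number)
docket_number = harmonize(clean_docket_number)

# After: rely on harmonize's internal cleaning
docket_number = harmonize(docket_number)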

@flooie flooie (Contributor) commented Jan 7, 2024

@grossir thanks for a great effort here.

My only real pushback is that it feels like you dotted your i's and crossed your t's more than you needed to. I understand the impulse to use LXML for the HTMLs and regex for the PDFs, but I think we can simplify the code a lot by just using regex.

I think this reduces the complexity and will make it a little easier for us in the long run. I didn't test it fully with Doctor-extracted text, so I look forward to your thoughts.

…port to get case_name_full in extract_from_text, update test cases

return judge.strip()
pattern = r"</table><br><br\s?/?>\s?(.*)\r?\n|Docket Number:\s?(.+)"
docket_number = harmonize(self.match(scraped_text, pattern))
@flooie flooie (Contributor)

does harmonize do anything here with docket numbers?

@grossir grossir (Contributor, Author) commented Jan 8, 2024

In the first group of back scraped opinions I get these duplicates. It seems like opinions are posted multiple times.

Duplication is controlled on courtlistener: I see that cl_back_scrape_opinions.py controls for duplicates. It inherits from cl_scrape_opinions.Command and does the individual record hash check (L:268-297).

The Dup Checker tolerates 5 consecutive duplicates (the threshold init argument). Currently there is no way to modify that when calling the command.

If that's not enough, we can just control duplication in juriscraper: we can keep a set() with the seen URLs and skip URLs already seen (sketched below).

On further inspection, the --fullcrawl argument passed into the backscraper makes it so the Dup Checker does not abort the scrape even if it surpasses the consecutive duplicate limit. Still, it will skip an individual record if the URL's raw content is the same.
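A minimal sketch of that in-juriscraper fallback, keeping a set of already-seen URLs (the class and method names here are illustrative, not an existing juriscraper API):

class SeenUrlFilter:
    """Track URLs already processed so repeated postings can be skipped."""

    def __init__(self):
        self.seen_urls = set()

    def is_new(self, url):
        """Return True the first time a URL is seen, False on repeats."""
        if url in self.seen_urls:
            return False
        self.seen_urls.add(url)
        return True


# Illustrative use inside a scraper's case-building loop:
# url_filter = SeenUrlFilter()
# cases = [case for case in cases if url_filter.is_new(case["url"])]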

@flooie flooie merged commit c15d259 into freelawproject:main Jan 9, 2024
6 checks passed