feat(nymisc): Add scraper for nyfam, nycity, nycounty, nysupreme, nycciv, nyccrim, nysurrogate, nydistrict, nyjustice, nyctclaims #848
Conversation
Testing failures due to a bug in …
Off the bat - I think we have the wrong IDs for the parent courts. This is my fault, but we should check against what is in the CL db and/or what is in courts_db.
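For reference, a quick local sanity check of those IDs could look something like this (a minimal sketch: it assumes courts_db exposes its loaded court list as courts, a list of dicts with an "id" key, and the candidate IDs below are just the ones from this PR's commits):

# Minimal sketch: check which candidate court IDs actually exist in courts_db.
# Assumes courts_db exposes its loaded court list as "courts"; the candidate
# IDs below come from this PR's commits and may not be the final ones.
from courts_db import courts

known_ids = {court["id"] for court in courts}
candidate_ids = [
    "nyfamct", "nycivct", "nycrimct", "nysurct",
    "nydistct", "nyjustct", "nyclaimsct",
]

for court_id in candidate_ids:
    print(court_id, "ok" if court_id in known_ids else "MISSING from courts_db")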
…yct, nycivct, nycrimct, nysurct, nydistct, nyjustct, nyclaimsct Renamed base class to nytrial; Added clean_judge_str; added nysupct_commercial to use this template
for more information, see https://pre-commit.ci
Simplify judge and docket parsing according to code review
@@ -108,6 +109,9 @@ def _get_precedential_statuses(self):
     def _get_summaries(self):
         return None
 
+    def _get_child_courts(self):
+        return None
+
     def extract_from_text(self, scraped_text):
What do you think about this?
@staticmethod
def match(scraped_text, pattern):
    m = re.findall(pattern, scraped_text)
    r = list(filter(None, chain.from_iterable(m)))
    return r[0].strip() if r else ""

def extract_from_text(self, scraped_text: str) -> Dict[str, Any]:
    """Extract author, docket number and citation from the scraped text"""
    # Find Author Str
    pattern = r"<td[^>]*>(.*?),\s?[JS]\.</td>|Judge:\s?(.+)|([\w .,]+), J\.\s+"
    author_str = self.match(scraped_text, pattern)
    # Find Docket Number
    pattern = r"</table><br><br\s?/?>\s?(.*)\r?\n|Docket Number:\s?(.+)"
    docket_number = harmonize(self.match(scraped_text, pattern))
    # Find Misc Citation
    pattern = r"\[(?P<volume>\d+) (?P<reporter>Misc [23]d) (?P<page>.+)\]"
    cite_match = re.search(pattern, scraped_text)
    metadata = {
        "Docket": {"docket_number": docket_number},
        "Opinion": {"author_str": normalize_judge_string(author_str)[0]},
    }
    if cite_match:
        metadata["Citation"] = cite_match.groupdict("")
    return metadata
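As a standalone illustration of how the match helper behaves (not code from the PR; the sample text and patterns are invented): with an alternation pattern, re.findall returns tuples, so the helper flattens them and keeps the first non-empty capture.

# Standalone illustration of the match() helper above; the sample text and
# patterns are invented for the example.
import re
from itertools import chain

def match(scraped_text, pattern):
    m = re.findall(pattern, scraped_text)
    r = list(filter(None, chain.from_iterable(m)))
    return r[0].strip() if r else ""

text = "Docket Number: 2023-1234\nJudge: Jane Doe"
print(match(text, r"Docket Number:\s?(.+)|Index No\.?\s?(.+)"))  # 2023-1234
print(match(text, r"Judge:\s?(.+)|Justice:\s?(.+)"))             # Jane Doe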
I took a look at it, and I think we can accomplish all of this with just three simple regex patterns. I think it might make the code easier to maintain? But I could be wrong and would like your thoughts.
Also - I obviously put it in the wrong file - so apologies for the confusion, but I don't want to move it.
def get_docket_number(self, scraped_text: str) -> str:
    """Get docket number

    Sometimes it is explicit in a table at the beginning of the document,
    with the heading 'Docket Number'

    Sometimes it is explicit in the first lines of the text, with headings
    such as 'Index No', 'Docket No', etc

    Sometimes it is just a special string that may be numeric only
    Sometimes it may not exist

    :param scraped_text: scraped text
    :return: docket number if it exists
    """
    if "<table" in scraped_text:
        html = html_fromstring(scraped_text)
        docket_xpath = "//table[@width='75%']/following-sibling::text()"
        element = html.xpath(docket_xpath)
        docket = element[0] if element else ""
    else:
        docket_regex = r"Docket Number:(?P<docket>.+)"
        match = re.search(docket_regex, scraped_text[:500])
        docket = match.group("docket") if match else ""

    return docket.strip()

def get_judge(self, scraped_text: str) -> str:
    """Get judge from PDF or HTML text

    We delete a trailing ", J." or ", S." after the judge's last name,
    which appears in HTML tables. These are honorifics.
    For "S.": https://www.nycourts.gov/reporter/3dseries/2023/2023_50144.htm

    :param scraped_text: string from HTML or PDF
    :return: judge name
    """
    if "<table" in scraped_text:
        html = html_fromstring(scraped_text)
        td = html.xpath(
            '//td[contains(text(), ", J.") or contains(text(), ", S.")]'
        )
        judge = td[0].text_content() if td else ""
    else:
        match = re.search(r"Judge:\s?(?P<judge>.+)", scraped_text)
        judge = match.group("judge") if match else ""

    judge = normalize_judge_string(clean_string(judge))[0]
    judge = re.sub(r" [JS]\.$", "", judge)

    return judge.strip()

def get_citation(self, scraped_text: str) -> Dict[str, str]:
    """Extract volume, reporter and page of a citation that
    has the following shape: [81 Misc 3d 1211(A)]

    The citation should be searched for at the top of the document, since an
    opinion may cite other opinions in its argumentation

    Tagged as "Official Citation" on the source. For example:
    https://lrb.nycourts.gov/citator/reporter/citations/detailsview.aspx?id=2023_51315

    :param scraped_text: string from HTML or PDF
    :return: dictionary with expected citation fields
    """
    regex = r"\[(?P<volume>\d+) (?P<reporter>Misc 3d) (?P<page>.+)\]"
    match = re.search(regex, scraped_text[:1200])

    if not match:
        return {}

    return match.groupdict("")
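For reference, a quick standalone check of the citation regex against the shape described in the docstring (the sample string is invented):

# Standalone check of the citation regex; the sample string is invented.
import re

regex = r"\[(?P<volume>\d+) (?P<reporter>Misc 3d) (?P<page>.+)\]"
sample = "Matter of Doe [81 Misc 3d 1211(A)]"
cite = re.search(regex, sample)
print(cite.groupdict(""))  # {'volume': '81', 'reporter': 'Misc 3d', 'page': '1211(A)'}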
I think we can drop this code if we use the regex above - but this is all very nice code.
Great, I agree with you that simple is better and regex should be enough in this case. I will test the changes and make a new push
metadata = {}

if docket_number:
    clean_docket_number = clean_string(docket_number)
clean_string is called in harmonize. I think it's redundant here.
@grossir thanks for a great effort here. My only real pushback is that it feels like you dotted your i's and crossed your t's more than you needed to. I understand the impulse to use LXML for the HTML and regex for the PDFs, but I think we can simplify the code a lot by just using regex. I think this reduces the complexity and will make it a little easier for us in the long run. I didn't test it fully with Doctor-extracted text, so I look forward to your thoughts.
…port to get case_name_full in extract_from_text, update test cases
pattern = r"</table><br><br\s?/?>\s?(.*)\r?\n|Docket Number:\s?(.+)"
docket_number = harmonize(self.match(scraped_text, pattern))
does harmonize do anything here with docket numbers?
…port to get case_name_full in extract_from_text, update test cases
Duplication is controlled on courtlistener. I see that cl_back_scrape_opinions.py controls for duplicates through the command it inherits from: the DupChecker tolerates 5 consecutive duplicates (the threshold init argument), and currently there is no way to modify that when calling the command.

If that's not enough, we can just control duplication in juriscraper: we can keep a set() with the seen URLs and skip URLs already seen.

On further inspection, the --fullcrawl argument passed into the backscraper makes it so the DupChecker does not abort the scrape even if it surpasses the consecutive duplicate limit. Still, it will skip an individual record if the URL's raw content is the same.
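To make the juriscraper-side option concrete, here is a rough sketch of the seen-URL set idea (all names are illustrative; this is not the actual DupChecker or Site API):

# Rough sketch: dedupe case rows by URL while paging through a backscrape.
# Function and key names are illustrative, not the actual juriscraper API.
def dedupe_by_url(pages):
    seen_urls = set()
    cases = []
    for page in pages:  # each page is a list of parsed case dicts
        for case in page:
            if case["url"] in seen_urls:
                continue  # already collected on an earlier page of this crawl
            seen_urls.add(case["url"])
            cases.append(case)
    return cases

# The second occurrence of the same URL is dropped:
pages = [
    [{"url": "https://nycourts.gov/reporter/1.htm", "name": "Matter of A"}],
    [{"url": "https://nycourts.gov/reporter/1.htm", "name": "Matter of A"}],
]
assert len(dedupe_by_url(pages)) == 1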
Implements #827: almost all the logic is inside the nymisc.py file. The inheriting classes just define a regex that we will probably need to refine when testing against the backscrapers.
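For a sense of scale, a hypothetical child scraper under this design might be as small as the sketch below (the module path, attribute name and regex are assumptions based on the PR description and commits, not the actual code):

# Illustrative only: a child court scraper that reuses the shared base class
# and supplies a court-matching regex. Module path, attribute name and regex
# are assumptions, not this PR's actual contents.
from juriscraper.opinions.united_states.state import nytrial

class Site(nytrial.Site):
    # Regex for picking out this court's opinions; it would likely need
    # refinement once tested against the backscraper, as noted above.
    court_regex = r"Fam(ily)? Ct"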
This PR also modifies OpinionSiteLinear, OpinionSite and AbstractSite in a few places to add support for returning the child_court field, as discussed in #847.
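On the scraper side, returning that field could end up looking roughly like this (the child_court key name, xpaths and example value are assumptions based on the _get_child_courts method in the diff above, not this PR's actual code):

# Illustrative only: an OpinionSiteLinear-style _process_html filling a
# child_court value per case. Key names, xpaths and the example value are
# assumptions.
def _process_html(self):
    for row in self.html.xpath("//table//tr[td]"):
        self.cases.append(
            {
                "name": row.xpath("string(td[1])").strip(),
                "url": row.xpath("td[1]/a/@href")[0],
                "date": row.xpath("string(td[2])").strip(),
                "docket": row.xpath("string(td[3])").strip(),
                "child_court": "Bronx County Family Court",  # example value
            }
        )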
This PR will need to be reviewed and tested together with related changes in courtlistener and reporters-db, but I think it is ready for a first review as it is.