feat(nymisc): Add scraper for nyfam, nycity, nycounty, nysupreme, nycciv, nyccrim, nysurrogate, nydistrict, nyjustice, nyctclaims #848
Conversation
Testing failures due to a bug in …
Off the bat - I think we have the wrong IDs for the parent courts. This is my fault, but we should check against what is in the CL db and/or what is in courts_db.
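For reference, a quick local sanity check of those IDs could look something like this (a minimal sketch: it assumes courts_db exposes its loaded court list as courts, a list of dicts with an "id" key, and the candidate IDs below are just the ones from this PR's commits):

# Minimal sketch: check which candidate court IDs actually exist in courts_db.
# Assumes courts_db exposes its loaded court list as "courts"; the candidate
# IDs below come from this PR's commits and may not be the final ones.
from courts_db import courts

known_ids = {court["id"] for court in courts}
candidate_ids = [
    "nyfamct", "nycivct", "nycrimct", "nysurct",
    "nydistct", "nyjustct", "nyclaimsct",
]

for court_id in candidate_ids:
    print(court_id, "ok" if court_id in known_ids else "MISSING from courts_db")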
…yct, nycivct, nycrimct, nysurct, nydistct, nyjustct, nyclaimsct Renamed base class to nytrial; Added clean_judge_str; added nysupct_commercial to use this template
for more information, see https://pre-commit.ci
Simplify judge and docket parsing according to code review
@@ -108,6 +109,9 @@ def _get_precedential_statuses(self):
     def _get_summaries(self):
         return None
 
+    def _get_child_courts(self):
+        return None
+
     def extract_from_text(self, scraped_text):
What do you think about this?
@staticmethod
def match(scraped_text, pattern):
    m = re.findall(pattern, scraped_text)
    r = list(filter(None, chain.from_iterable(m)))
    return r[0].strip() if r else ""

def extract_from_text(self, scraped_text: str) -> Dict[str, Any]:
    """Extract author, docket number and citation from the scraped text"""
    # Find Author Str
    pattern = r"<td[^>]*>(.*?),\s?[JS]\.</td>|Judge:\s?(.+)|([\w .,]+), J\.\s+"
    author_str = self.match(scraped_text, pattern)
    # Find Docket Number
    pattern = r"</table><br><br\s?/?>\s?(.*)\r?\n|Docket Number:\s?(.+)"
    docket_number = harmonize(self.match(scraped_text, pattern))
    # Find Misc Citation
    pattern = r"\[(?P<volume>\d+) (?P<reporter>Misc [23]d) (?P<page>.+)\]"
    cite_match = re.search(pattern, scraped_text)
    metadata = {
        "Docket": {"docket_number": docket_number},
        "Opinion": {"author_str": normalize_judge_string(author_str)[0]},
    }
    if cite_match:
        metadata["Citation"] = cite_match.groupdict("")
    return metadata
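As a standalone illustration of how the match helper behaves (not code from the PR; the sample text and patterns are invented): with an alternation pattern, re.findall returns tuples, so the helper flattens them and keeps the first non-empty capture.

# Standalone illustration of the match() helper above; the sample text and
# patterns are invented for the example.
import re
from itertools import chain

def match(scraped_text, pattern):
    m = re.findall(pattern, scraped_text)
    r = list(filter(None, chain.from_iterable(m)))
    return r[0].strip() if r else ""

text = "Docket Number: 2023-1234\nJudge: Jane Doe"
print(match(text, r"Docket Number:\s?(.+)|Index No\.?\s?(.+)"))  # 2023-1234
print(match(text, r"Judge:\s?(.+)|Justice:\s?(.+)"))             # Jane Doe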
I took a look at it, and I think we can accomplish all of this with just three simple regex patterns. I think it might make the code easier to maintain? But I could be wrong and would like your thoughts.
Also - I obviously put it in the wrong file - so apologies for the confusion, but I don't want to move it.
def get_docket_number(self, scraped_text: str) -> str:
    """Get docket number

    Sometimes it is explicit in a table at the beginning of the document,
    with the heading 'Docket Number'

    Sometimes it is explicit in the first lines of the text, with headings
    such as 'Index No', 'Docket No', etc

    Sometimes it is just a special string that may be numeric only
    Sometimes it may not exist

    :param scraped_text: scraped text
    :return: docket number if it exists
    """
    if "<table" in scraped_text:
        html = html_fromstring(scraped_text)
        docket_xpath = "//table[@width='75%']/following-sibling::text()"
        element = html.xpath(docket_xpath)
        docket = element[0] if element else ""
    else:
        docket_regex = r"Docket Number:(?P<docket>.+)"
        match = re.search(docket_regex, scraped_text[:500])
        docket = match.group("docket") if match else ""

    return docket.strip()

def get_judge(self, scraped_text: str) -> str:
    """Get judge from PDF or HTML text

    We delete a trailing ", J." or ", S." after the judge's last name,
    which appears in HTML tables. These are honorifics.
    For "S.": https://www.nycourts.gov/reporter/3dseries/2023/2023_50144.htm

    :param scraped_text: string from HTML or PDF
    :return: judge name
    """
    if "<table" in scraped_text:
        html = html_fromstring(scraped_text)
        td = html.xpath(
            '//td[contains(text(), ", J.") or contains(text(), ", S.")]'
        )
        judge = td[0].text_content() if td else ""
    else:
        match = re.search(r"Judge:\s?(?P<judge>.+)", scraped_text)
        judge = match.group("judge") if match else ""

    judge = normalize_judge_string(clean_string(judge))[0]
    judge = re.sub(r" [JS]\.$", "", judge)

    return judge.strip()

def get_citation(self, scraped_text: str) -> Dict[str, str]:
    """Extract volume, reporter and page of a citation that
    has the following shape: [81 Misc 3d 1211(A)]

    The citation should be searched for at the top of the document, since an
    opinion may cite other opinions in its argumentation

    Tagged as "Official Citation" on the source. For example:
    https://lrb.nycourts.gov/citator/reporter/citations/detailsview.aspx?id=2023_51315

    :param scraped_text: string from HTML or PDF
    :return: dictionary with expected citation fields
    """
    regex = r"\[(?P<volume>\d+) (?P<reporter>Misc 3d) (?P<page>.+)\]"
    match = re.search(regex, scraped_text[:1200])

    if not match:
        return {}

    return match.groupdict("")
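For reference, a quick standalone check of the citation regex against the shape described in the docstring (the sample string is invented):

# Standalone check of the citation regex; the sample string is invented.
import re

regex = r"\[(?P<volume>\d+) (?P<reporter>Misc 3d) (?P<page>.+)\]"
sample = "Matter of Doe [81 Misc 3d 1211(A)]"
cite = re.search(regex, sample)
print(cite.groupdict(""))  # {'volume': '81', 'reporter': 'Misc 3d', 'page': '1211(A)'}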
I think we can drop this code if we use the regex above - but this is all very nice code.
Great, I agree with you that simple is better and regex should be enough in this case. I will test the changes and make a new push
metadata = {}

if docket_number:
    clean_docket_number = clean_string(docket_number)
clean_string is called in harmonize. I think it's redundant here.
@grossir thanks for a great effort here. My only real pushback is that it feels like you dotted your i's and crossed your t's more than you needed to. I understand the impulse to use LXML for the HTML and regex for the PDFs, but I think we can simplify the code a lot by just using regex. I think this reduces the complexity and will make it a little easier for us in the long run. I didn't test it fully with Doctor-extracted text, so I look forward to your thoughts.
…port to get case_name_full in extract_from_text, update test cases
pattern = r"</table><br><br\s?/?>\s?(.*)\r?\n|Docket Number:\s?(.+)"
docket_number = harmonize(self.match(scraped_text, pattern))
does harmonize do anything here with docket numbers?
…port to get case_name_full in extract_from_text, update test cases
Duplication is controlled on courtlistener. I see that cl_back_scrape_opinions.py controls for duplicates through the command it inherits from: the DupChecker tolerates 5 consecutive duplicates (the threshold init argument), and currently there is no way to modify that when calling the command.

If that's not enough, we can just control duplication in juriscraper: we can keep a set() with the seen URLs and skip URLs already seen.

On further inspection, the --fullcrawl argument passed into the backscraper makes it so the DupChecker does not abort the scrape even if it surpasses the consecutive duplicate limit. Still, it will skip an individual record if the URL's raw content is the same.
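To make the juriscraper-side option concrete, here is a rough sketch of the seen-URL set idea (all names are illustrative; this is not the actual DupChecker or Site API):

# Rough sketch: dedupe case rows by URL while paging through a backscrape.
# Function and key names are illustrative, not the actual juriscraper API.
def dedupe_by_url(pages):
    seen_urls = set()
    cases = []
    for page in pages:  # each page is a list of parsed case dicts
        for case in page:
            if case["url"] in seen_urls:
                continue  # already collected on an earlier page of this crawl
            seen_urls.add(case["url"])
            cases.append(case)
    return cases

# The second occurrence of the same URL is dropped:
pages = [
    [{"url": "https://nycourts.gov/reporter/1.htm", "name": "Matter of A"}],
    [{"url": "https://nycourts.gov/reporter/1.htm", "name": "Matter of A"}],
]
assert len(dedupe_by_url(pages)) == 1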
Implements #827: almost all the logic is inside the nymisc.py file. The inheriting classes just define a regex that we will probably need to refine when testing against the backscrapers.
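For a sense of scale, a hypothetical child scraper under this design might be as small as the sketch below (the module path, attribute name and regex are assumptions based on the PR description and commits, not the actual code):

# Illustrative only: a child court scraper that reuses the shared base class
# and supplies a court-matching regex. Module path, attribute name and regex
# are assumptions, not this PR's actual contents.
from juriscraper.opinions.united_states.state import nytrial

class Site(nytrial.Site):
    # Regex for picking out this court's opinions; it would likely need
    # refinement once tested against the backscraper, as noted above.
    court_regex = r"Fam(ily)? Ct"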
This PR also modifies OpinionSiteLinear, OpinionSite and AbstractSite in a few places to add support for returning the child_court field, as discussed in #847.
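On the scraper side, returning that field could end up looking roughly like this (the child_court key name, xpaths and example value are assumptions based on the _get_child_courts method in the diff above, not this PR's actual code):

# Illustrative only: an OpinionSiteLinear-style _process_html filling a
# child_court value per case. Key names, xpaths and the example value are
# assumptions.
def _process_html(self):
    for row in self.html.xpath("//table//tr[td]"):
        self.cases.append(
            {
                "name": row.xpath("string(td[1])").strip(),
                "url": row.xpath("td[1]/a/@href")[0],
                "date": row.xpath("string(td[2])").strip(),
                "docket": row.xpath("string(td[3])").strip(),
                "child_court": "Bronx County Family Court",  # example value
            }
        )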
This PR will need to be reviewed and tested together with related changes in courtlistener and reporters-db, but I think it is ready for a first review as it is.