Recommended approach to scrape multiple jurisdictions at once? #70

jpmckinney · 2014-05-04T18:16:53Z

For example, if a provincial website has information for all its municipalities.

jamesturk · 2014-05-20T19:25:26Z

any idea of how you'd like to see pupa handle this?

jpmckinney · 2014-05-20T19:55:00Z

I don't have strong opinions on how the API should work, but one way is to be able to change the "active jurisdiction" so that objects are yielded to the appropriate jurisdiction. Pseudo-code:

# __init__.py
from utils import CanadianJurisdiction

class QuebecMunicipalities(CanadianJurisdiction):
    # Here you would find either:
    # * nothing, since this is a fake jurisdiction
    # * dummy variables which will be ignored by the scraper
    # * a list of all the jurisdictions if one of the above two can't be implemented
    pass

# people.py
from pupa.scrape import Scraper

class QuebecMunicipalitiesPersonScraper(Scraper):
    def scrape(self):
        municipalities = []  # get the list of municipalities
        for municipality in municipalities:
            jurisdiction = ...  # create a jurisdiction object for this municipality
            self.set_jurisdiction(jurisdiction)  # proposed method, not an existing Pupa API
            # yield a lot of people
However, I can imagine a lot of challenges in changing Pupa to work this way.

Maybe there are some Python metaprogramming tricks I can use, to make it seem like there are several thousand modules with common people.py scraper code, without requiring me to have thousands of folders of __init__.py files and small people.py files all inheriting from the same meta-scraper class.
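
To make the metaprogramming idea concrete, here is a minimal sketch that builds jurisdiction classes at import time with `type()` instead of one folder per municipality. `BaseJurisdiction`, `MUNICIPALITIES`, and the division IDs are hypothetical stand-ins for illustration, not Pupa's or scrapers-ca's actual classes or data.

```python
class BaseJurisdiction:
    """Hypothetical stand-in for a shared jurisdiction base class."""

    name = None
    division_id = None


# Illustrative placeholder data; in practice this might come from a CSV file.
MUNICIPALITIES = [
    ("Exampleville", "ocd-division/country:ca/csd:0000001"),
    ("Sampleton", "ocd-division/country:ca/csd:0000002"),
]


def make_jurisdiction(name, division_id):
    """Create a jurisdiction subclass without writing a module for it."""
    # Derive a valid class name by keeping only letters and digits.
    class_name = "".join(ch for ch in name.title() if ch.isalnum())
    attrs = {"name": name, "division_id": division_id}
    return type(class_name, (BaseJurisdiction,), attrs)


# Map each generated class name to its class.
JURISDICTIONS = {
    cls.__name__: cls
    for cls in (make_jurisdiction(n, d) for n, d in MUNICIPALITIES)
}
```

Whether Pupa's module loading could discover classes generated this way is a separate question that this sketch does not answer.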

jamesturk · 2014-05-20T19:57:48Z

the people.py files won't be needed if they're all the same, as multiple jurisdictions can point to the same scraper(s)

your proposed solution might work, I'll play with some proof of concept code


jpmckinney · 2014-05-20T20:05:15Z

Cool - how do you make multiple jurisdictions point to the same scrapers?

jamesturk · 2014-07-22T19:49:27Z

there's now an example of this in https://github.com/opencivicdata/scrapers-us-state

there's still one file per jurisdiction (maybe we can improve that, maybe this is good enough though) but they all point to the same scraper (and the jurisdictions in this case are actually auto-generated classes)
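
As a rough illustration of several jurisdictions pointing at one scraper (this is not the code in scrapers-us-state), each jurisdiction class can carry a `scrapers` mapping whose values are the same scraper class. Everything below is a self-contained stand-in: `SharedPersonScraper`, `Jurisdiction`, `TownA`, `TownB`, and `run` are hypothetical names, and the scraped record is dummy data.

```python
class SharedPersonScraper:
    """One scraper class reused by every jurisdiction (hypothetical stand-in)."""

    def __init__(self, jurisdiction):
        self.jurisdiction = jurisdiction

    def scrape(self):
        # A real scraper would fetch and parse a page here; this yields dummy data.
        yield {"name": "Jane Doe", "jurisdiction": self.jurisdiction.name}


class Jurisdiction:
    """Stand-in base class; subclasses inherit the shared scraper mapping."""

    name = None
    scrapers = {"people": SharedPersonScraper}


class TownA(Jurisdiction):
    name = "Town A"


class TownB(Jurisdiction):
    name = "Town B"


def run(jurisdiction_cls):
    """Instantiate a jurisdiction and run each of its scrapers against it."""
    jurisdiction = jurisdiction_cls()
    results = []
    for scraper_cls in jurisdiction.scrapers.values():
        results.extend(scraper_cls(jurisdiction).scrape())
    return results
```

Because the scraper receives the jurisdiction when it is constructed, the same scraping code can tag its output for whichever jurisdiction invoked it.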

jpmckinney · 2014-07-25T16:55:23Z

Thanks! In Quebec I'll have 1000 auto-generated jurisdictions, mixed in with manual jurisdictions; we scrape the big cities individually (to get email addresses), but we're happy to use a provincial directory for the smaller cities (which has one email for the entire council). It may be confusing to have this mix, so avoiding one file per jurisdiction would still be ideal.

How is Pupa 0.4 coming along? How soon can I start upgrading to the PostgreSQL version?

jamesturk · 2014-07-28T17:29:15Z

pupa 0.4 is pretty much ready. There are still rough edges, but no more than existed in the Mongo version, I believe. I was hoping to update some docs before calling it 0.4 officially, but we're using it in development now and will be releasing it as 0.4 and switching production over soon.

the 1000 jurisdiction issue still requires more work/thinking on the best way to do it. I think a different command like pupa bulkupdate might get around some of the challenges we'd face. Once things settle down here, I'll try to think of a cleaner interface for this.

jpmckinney · 2014-10-15T04:38:14Z

Pinging for any updates on how to implement common scraper code for 1000s of jurisdictions.

In the update command's handle method, I'm wondering if instead of getting a single jurisdiction from a module, it might get a list of jurisdictions instead, and then loop over them. Alternatively, there could be a bulkupdate command as mentioned earlier, which expects the module to define multiple jurisdictions.
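
A minimal sketch of the first option, assuming a hypothetical convention in which a module exposes either a `jurisdictions` list or a single `jurisdiction`. The function names and the `update_one` callback are illustrative only and do not reflect how Pupa's update command is actually written.

```python
import types


def get_jurisdictions(module):
    """Return the jurisdictions a module defines, always as a list."""
    if hasattr(module, "jurisdictions"):
        return list(module.jurisdictions)
    return [module.jurisdiction]


def handle(module, update_one):
    """Run `update_one` for each jurisdiction and collect results by name."""
    return {j.name: update_one(j) for j in get_jurisdictions(module)}


# Usage with stand-in modules built from SimpleNamespace objects.
single = types.SimpleNamespace(jurisdiction=types.SimpleNamespace(name="Big City"))
many = types.SimpleNamespace(
    jurisdictions=[
        types.SimpleNamespace(name="Town A"),
        types.SimpleNamespace(name="Town B"),
    ]
)

single_report = handle(single, lambda j: "updated")
many_report = handle(many, lambda j: "updated")
```

Normalizing to a list keeps existing single-jurisdiction modules working unchanged, which is the main appeal over a separate bulkupdate command.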

jpmckinney · 2017-02-15T22:01:45Z

My workaround is to just put all the jurisdictions into one jurisdiction, in an organization hierarchy, which is fine for my needs, but maybe not in the general case. However, as there is no other demand for the general case, I'm closing.
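
For anyone landing here later, the workaround looks roughly like the sketch below: one umbrella jurisdiction whose scraper yields a parent organization plus one child organization per municipality. The `Organization` dataclass, the classification strings, and `build_hierarchy` are simplified stand-ins for illustration, not Pupa's actual model.

```python
from dataclasses import dataclass
from typing import Iterator, List, Optional


@dataclass
class Organization:
    """Simplified stand-in for an organization record."""

    name: str
    classification: str
    parent: Optional["Organization"] = None


def build_hierarchy(umbrella_name: str, municipalities: List[str]) -> Iterator[Organization]:
    """Yield one root organization, then one child organization per municipality."""
    root = Organization(umbrella_name, "government")
    yield root
    for municipality in municipalities:
        yield Organization(municipality, "legislature", parent=root)


orgs = list(build_hierarchy("Quebec municipalities", ["Town A", "Town B"]))
```

People scraped from the provincial directory would then be attached to the child organization for their municipality rather than to a separate jurisdiction.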

jpmckinney mentioned this issue May 16, 2014

Fix aggregation scrapers opencivicdata/scrapers-ca#46

Closed

11 tasks

jamesturk added the proposed label May 20, 2014

jpmckinney mentioned this issue Oct 22, 2015

Re-introduce scrapers for multiple jurisdictions opennorth/represent-canada#95

Open

11 tasks

jpmckinney closed this as completed Feb 15, 2017
