Sample Scrapy project demonstrating the integration of Crawlera-Headless-Proxy with Scrapy Cloud through a custom Docker image. To demonstrate it, we use Selenium with Firefox through its geckodriver.
Please do not assume this is the best way to integrate Selenium within your spider. The goal here is to showcase the deployment of crawlera-headless-proxy.
Based on this KB
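For reference, a spider that drives headless Firefox through the local proxy might look roughly like the sketch below. This is only an illustrative sketch and not necessarily the project's actual spider; the spider name, start URL and the localhost:3128 proxy address are assumptions.

import scrapy
from selenium import webdriver
from selenium.webdriver.firefox.options import Options


class DemoSpider(scrapy.Spider):
    name = "demo"
    start_urls = ["https://example.com"]  # placeholder URL

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        options = Options()
        options.add_argument("-headless")
        # Point Firefox at the local crawlera-headless-proxy instance
        options.set_preference("network.proxy.type", 1)
        options.set_preference("network.proxy.http", "localhost")
        options.set_preference("network.proxy.http_port", 3128)
        options.set_preference("network.proxy.ssl", "localhost")
        options.set_preference("network.proxy.ssl_port", 3128)
        self.driver = webdriver.Firefox(options=options)

    def parse(self, response):
        # Load the page in the browser so the request goes through the proxy
        self.driver.get(response.url)
        yield {"url": response.url, "title": self.driver.title}

    def closed(self, reason):
        self.driver.quit()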
Install shub
pip install shub
Modify the scrapinghub.yml file and replace <YOUR PROJECT ID>
with your actual project ID:
project: <YOUR PROJECT ID>
requirements_file: ./requirements.txt
image: true
Deploy your project to Scrapy Cloud
$ shub login
Enter your API key from https://app.scrapinghub.com/account/apikey
API key: ********************************
Validating API key...
API key is OK, you are logged in now.
$ shub deploy
Building images.scrapinghub.com/project/<YOUR PROJECT ID>:1.0.
Steps: 100%|█████████████| 12/12
The image images.scrapinghub.com/project/<YOUR PROJECT ID>:1.0 build is completed.
Login to images.scrapinghub.com succeeded.
Pushing images.scrapinghub.com/project/<YOUR PROJECT ID>:1.0 to the registry.
b58632e02b0f: 100%|█████████████| 53.8k/53.8k [2.55kB/s]
9cf43d5c0161: 100%|█████████████| 33.8k/33.8k [1.61kB/s]
The image images.scrapinghub.com/project/<YOUR PROJECT ID>:1.0 pushed successfully.
Deploying images.scrapinghub.com/project/<YOUR PROJECT ID>:1.0
You can check deploy results later with 'shub image check --id 1'.
Progress: 100%|█████████████| 100/100
Deploy results:
{'status': 'ok', 'project': <YOUR PROJECT ID>, 'version': '1.0', 'spiders': 1}
Run the job on Scrapy Cloud, passing in your Crawlera API key using either an environment variable or a spider argument:
$ shub schedule -e CRAWLERA_APIKEY=<API KEY> <YOUR PROJECT ID>/demo
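The key set this way becomes available inside the running job. A minimal sketch of how it could be picked up in the spider follows; the CRAWLERA_APIKEY variable name comes from the command above, while the crawlera_apikey argument name is an assumption, not necessarily the project's actual code.

import os

def get_crawlera_apikey(spider):
    # Prefer the environment variable set with `shub schedule -e ...`,
    # otherwise fall back to a spider argument passed as -a crawlera_apikey=...
    return os.environ.get("CRAWLERA_APIKEY") or getattr(spider, "crawlera_apikey", None)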
Watch the log on the command line:
$ shub log -f <YOUR PROJECT ID>/1/1
or print items as they are being scraped:
$ shub items -f <YOUR PROJECT ID>/1/1
or watch it running in Scrapinghub's web interface:
https://app.scrapinghub.com/p/<YOUR PROJECT ID>/1/1
To run the project locally, first create a virtualenv
$ virtualenv .venv && source ./.venv/bin/activate
Install Scrapy and the project requirements
(.venv) $ pip install -r requirements.txt
...
Follow installation instructions for crawlera-headless-proxy on your platform
Run crawlera-headless-proxy in a dedicated terminal/shell. It needs to be running for the demo spider to connect to it. (Hit Ctrl+C to stop it and release the terminal.)
$ crawlera-headless-proxy -d -a <CRAWLERA API KEY>
# OR
$ docker run -p 3128:3128 scrapinghub/crawlera-headless-proxy -d -a <CRAWLERA API KEY>
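Before starting the spider you can sanity-check that the proxy is accepting connections, for example with a quick script along these lines (the target URL is just an example; verify=False is used because the proxy serves its own TLS certificate for intercepted HTTPS traffic):

import requests

# Route a test request through the local headless proxy.
proxies = {"http": "http://localhost:3128", "https": "http://localhost:3128"}
response = requests.get("https://example.com", proxies=proxies, verify=False)
print(response.status_code)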
Run the project
(.venv) $ scrapy crawl demo -o out.json