For this take-home exercise I chose Scrapy, a library for building fast, high-concurrency web spiders that extract and organize the data you need while also handling login, authentication, and session management.
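To give a sense of what that looks like, here is a minimal, hypothetical sketch of a Scrapy spider that logs in and then scrapes an authenticated page. The URL, form field names, and selectors are placeholders for illustration and are not the actual targets of this exercise.

```python
import scrapy


class CarrierLoginSpider(scrapy.Spider):
    name = "carrier_login_example"
    start_urls = ["https://example-carrier.test/login"]  # placeholder URL

    def parse(self, response):
        # Scrapy tracks cookies/session state automatically; FormRequest.from_response
        # also carries over hidden form fields such as CSRF tokens from the login page.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"username": "demo", "password": "demo"},  # placeholder credentials
            callback=self.after_login,
        )

    def after_login(self, response):
        # The authenticated session is reused for every request that follows.
        for row in response.css("table.policies tr"):
            yield {
                "policy_id": row.css("td.id::text").get(),
                "premium": row.css("td.premium::text").get(),
            }
```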
I'm also using the Scrapy-Splash plugin, which not only runs JavaScript so you can wait for dynamic content to load, but also lets you script the emulated browser (running in a Docker container) to do things like scroll down and trigger additional content to load.
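Concretely, that scrolling is expressed as a small Lua script that Splash executes before returning the rendered HTML. The sketch below is illustrative only; the URL and selectors are placeholders, and the spiders in this repo may structure it differently.

```python
import scrapy
from scrapy_splash import SplashRequest

# Lua script executed by the Splash browser running in the Docker container.
SCROLL_SCRIPT = """
function main(splash, args)
  splash:go(args.url)
  splash:wait(1.0)
  -- scroll to the bottom a few times to trigger lazy-loaded content
  for i = 1, 3 do
    splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
    splash:wait(1.0)
  end
  return {html = splash:html()}
end
"""


class ScrollingSpider(scrapy.Spider):
    name = "scrolling_example"

    def start_requests(self):
        # The 'execute' endpoint runs the Lua script above against the Splash HTTP API.
        yield SplashRequest(
            "https://example-carrier.test/claims",  # placeholder URL
            callback=self.parse,
            endpoint="execute",
            args={"lua_source": SCROLL_SCRIPT},
        )

    def parse(self, response):
        # The response body is the fully rendered HTML returned by the Lua script.
        for item in response.css("div.claim::text").getall():
            yield {"claim": item}
```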
First, if you don't already have Docker Desktop, download and install it from the Docker website.
On Mac and Windows, you should then be able to download the Splash image in a terminal with the following:
docker pull scrapinghub/splash
and then you should be able to run the rendering engine with the following:
docker run -it -p 8050:8050 --rm scrapinghub/splash
- Download the code
git clone git@github.com:damzam/AdaptAPI.git
- Change directory into AdaptAPI
cd AdaptAPI
- Create a virtual environment to protect your system python
python3 -m venv .env
- Activate the virtual environment
source .env/bin/activate
- Install dependencies
pip install scrapy scrapy-splash
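The repo's Scrapy settings should already have Splash wired in, so there is nothing to do here; purely for reference, the scrapy-splash documentation calls for settings along these lines, with SPLASH_URL pointing at the Docker container started earlier:

```python
# settings.py (excerpt) -- standard scrapy-splash wiring, shown for reference only.
SPLASH_URL = "http://localhost:8050"  # the Splash container started with docker run above

DOWNLOADER_MIDDLEWARES = {
    "scrapy_splash.SplashCookiesMiddleware": 723,
    "scrapy_splash.SplashMiddleware": 725,
    "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
}
SPIDER_MIDDLEWARES = {
    "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
}
DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"
```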
- cd into the scraper directory
cd scraper
- Load the seed URLs from input.json (copied from the take-home assignment) and run the respective spiders, which scrape the MOCK_INDEMNITY and PLACEHOLDER_CARRIER content and log it to the console, with the following command (a rough sketch of what the spiders do follows this step):
make
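For a rough idea of what the make target kicks off: each spider can read its seed URLs from input.json and emit a request per URL. The sketch below is an assumption about the layout, not a copy of the repo's code; the key name and file path are illustrative.

```python
import json
from pathlib import Path

import scrapy


class MockIndemnitySpider(scrapy.Spider):
    name = "mock_indemnity"

    def start_requests(self):
        # Hypothetical: read the seed URLs for this carrier from input.json.
        seeds = json.loads(Path("input.json").read_text())
        for url in seeds.get("MOCK_INDEMNITY", []):
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # Log the scraped payload to the console, as the make target does.
        self.logger.info("Scraped %s", response.url)
```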
- Remove the local repo
- Terminate the Docker process (Ctrl-C in the terminal running Splash) and remove the image
docker rmi scrapinghub/splash
And you're done!