web_scraping

Collection of scraper pipelines built for different purposes: collect and process data from various sources (plain websites, JS-rendered websites, APIs), run the scraping pipeline via Celery and a Travis cron task, and dump the scraped data to Slack.


Architecture

  • Architecture idea
  • Asynchronous tasks
    • Celery client : flask <---> Celery client <---> Celery worker. Flask connects to the Celery client, which issues the commands for the tasks.
    • Celery worker : a process that runs tasks in the background; a task can be scheduled (periodic) or asynchronous (triggered by an API call).
    • Message broker : Celery client <--Message broker--> Celery worker. The Celery client communicates with the Celery worker via the message broker; here Redis is used as the message broker. A minimal sketch of this layout follows below.
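
The sketch below is illustrative only, not the repo's actual api/worker code; it assumes Redis is reachable on localhost:6379, and the task and route names (add, /add) are placeholders.

# minimal_sketch.py : illustrative only, not the repo's actual api/worker code
# assumes Redis runs on localhost:6379; task/route names here are placeholders
from celery import Celery
from flask import Flask, jsonify, request

# the Celery client and worker share the same app definition;
# Redis acts as both the message broker and the result backend
celery_app = Celery(
    "tasks",
    broker="redis://localhost:6379/0",
    backend="redis://localhost:6379/0",
)

@celery_app.task
def add(x, y):
    # runs inside the Celery worker process, not inside Flask
    return x + y

app = Flask(__name__)

@app.route("/add", methods=["POST"])
def submit_add():
    # Flask only enqueues the task via the broker and returns immediately
    payload = request.get_json()
    result = add.delay(payload["x"], payload["y"])
    return jsonify({"task_id": result.id})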

Quick Start

Quick start via docker
# Run via docker 
$ cd ~ && git clone https://github.com/yennanliu/web_scraping
$ cd ~ && cd web_scraping &&  docker-compose -f  docker-compose.yml up 
Quick start manually
# Run manually 

# STEP 1) open one terminal and run the Celery worker locally 
$ cd ~ && cd web_scraping/celery_queue
# run tasks triggered by API calls  
$ celery -A tasks worker --loglevel=info
# run cron (periodic) task 
$ celery -A tasks beat

# STEP 2) Run the redis server locally (in another terminal)
# make sure you have already installed redis
$ redis-server

# STEP 3) Run flower (in another terminal)
$ cd ~ && cd web_scraping/celery_queue
$ celery flower -A tasks --address=127.0.0.1 --port=5555

# STEP 4) Add a sample task 
# "add" task
$ curl -X POST -d '{"args":[1,2]}' http://localhost:5555/api/task/async-apply/tasks.add

# "multiply" task
$ curl -X POST -d '{"args":[3,5]}' http://localhost:5555/api/task/async-apply/tasks.multiply

# "scrape_task" task
$ curl -X POST   http://localhost:5555/api/task/async-apply/tasks.scrape_task

# "scrape_task_api" task
$ curl -X POST -d '{"args":["mlflow","mlflow"]}' http://localhost:5555/api/task/async-apply/tasks.scrape_task_api

# "indeed_scrap_task" task
$ curl -X POST  http://localhost:5555/api/task/async-apply/tasks.indeed_scrap_task

# "indeed_scrap_api_V1" task
$ curl -X POST -d '{"args":["New+York"]}' http://localhost:5555/api/task/async-apply/tasks.indeed_scrap_api_V1
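
The same tasks can also be queued directly from Python with Celery's send_task API. This is a sketch, not part of the repo: it assumes the worker and redis-server from steps 1-2 are running and that the task names match those registered in celery_queue/tasks.py.

# enqueue_sketch.py : illustrative; assumes the Celery worker and redis-server are already running
from celery import Celery

app = Celery("tasks", broker="redis://localhost:6379/0", backend="redis://localhost:6379/0")

# send_task queues a task by name, so this script does not need to import tasks.py itself
result = app.send_task("tasks.add", args=[1, 2])
print(result.get(timeout=10))  # prints 3 once the worker has processed the task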

File structure

├── Dockerfile
├── README.md
├── api.                  : Celery API (Flask job accepter plus Celery broker config)
│   ├── Dockerfile        : Dockerfile that builds the celery api image 
│   ├── app.py            : Flask server that accepts job requests (API)
│   ├── requirements.txt
│   └── worker.py         : Celery broker and backend configuration (Redis)
├── celery-queue          : Runs the main web scraping jobs (via Celery)
│   ├── Dockerfile        : Dockerfile that builds the celery-queue image
│   ├── IndeedScrapper    : Scraper that scrapes Indeed.com 
│   ├── requirements.txt
│   └── tasks.py          : Celery tasks that run the scraping jobs 
├── cron_indeed_scrapping_test.py
├── cron_test.py
├── docker-compose.yml    : docker-compose file that builds the whole system : api, celery-queue, redis, and flower (Celery job monitor)
├── legacy_project        
├── logs                  : Saves running logs 
├── output                : Saves scraped data 
├── requirements.txt
└── travis_push_github.sh : Script that auto-pushes output to GitHub via Travis 
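
As a rough idea of how the api and celery-queue services share one Celery app (a sketch only; the actual worker.py in the repo may differ, and the environment variable names below are assumptions), the broker/backend URLs are typically read from the environment that docker-compose provides.

# sketch of a worker.py-style Celery setup : illustrative, not the repo's exact code
import os
from celery import Celery

# docker-compose typically injects these; the variable names here are assumptions
CELERY_BROKER_URL = os.environ.get("CELERY_BROKER_URL", "redis://redis:6379/0")
CELERY_RESULT_BACKEND = os.environ.get("CELERY_RESULT_BACKEND", "redis://redis:6379/0")

celery_app = Celery("tasks", broker=CELERY_BROKER_URL, backend=CELERY_RESULT_BACKEND)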

Development

# Run Unit test # 1 
$ pytest -v tests/
# ================================== test session starts ==================================
# platform darwin -- Python 3.6.4, pytest-5.0.1, py-1.5.2, pluggy-0.12.0 -- /Users/jerryliu/anaconda3/envs/yen_dev/bin/python
# cachedir: .pytest_cache
# rootdir: /Users/jerryliu/web_scraping
# plugins: cov-2.7.1, celery-4.3.0
# collected 10 items                                                                      
# tests/unit_test.py::test_get_soup PASSED                                          [ 10%]
# tests/unit_test.py::test_extract_company PASSED                                   [ 20%]
# tests/unit_test.py::test_extract_salary PASSED                                    [ 30%]
# tests/unit_test.py::test_extract_location PASSED                                  [ 40%]
# tests/unit_test.py::test_extract_job_title PASSED                                 [ 50%]
# tests/unit_test.py::test_extract_summary PASSED                                   [ 60%]
# tests/unit_test.py::test_extract_link PASSED                                      [ 70%]
# tests/unit_test.py::test_extract_date PASSED                                      [ 80%]
# tests/unit_test.py::test_extract_fulltext PASSED                                  [ 90%]
# tests/unit_test.py::test_get_full_job_link_ PASSED                                [100%]

# Run Unit test # 2 
$ python tests/unit_test_celery.py  -v
# test_addition (__main__.TestAddTask) ... ok
# test_task_state (__main__.TestAddTask) ... ok
# test_multiplication (__main__.TestMultiplyTask) ... ok
# test_task_state (__main__.TestMultiplyTask) ... ok
# ----------------------------------------------------------------------
# Ran 4 tests in 0.131s
# OK
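
Celery tasks can also be unit-tested without a running worker by calling them eagerly with .apply(); the sketch below is illustrative, not the repo's actual test file, and assumes an add task like the one exercised by tests/unit_test_celery.py.

# test_sketch.py : illustrative unit test, not the repo's actual test file
import unittest

from tasks import add  # assumed import path for the "add" Celery task

class TestAddTaskEager(unittest.TestCase):
    def test_addition(self):
        # .apply() runs the task synchronously in-process, so no broker or worker is needed
        result = add.apply(args=[1, 2])
        self.assertEqual(result.get(), 3)
        self.assertEqual(result.status, "SUCCESS")

if __name__ == "__main__":
    unittest.main(verbosity=2)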

Tech

  • Celery : task queue for running Python jobs in parallel or as single-threaded background tasks (Celery broker/worker)
  • Redis : key-value DB used as the broker/backend that stores task data
  • Flower : web UI for monitoring Celery tasks
  • Flask : lightweight Python web framework, used as the project's backend server
  • Docker : builds the app environment

Todo

### Project level

1. Deploy to the Heroku cloud and expose the scraper as an API service 
2. Dockerize the project 
3. Run the scraping (cron/parallel) jobs via Celery 
4. Add tests (unit/integration tests) 
5. Design a DB model that saves scraped data systematically 

### Programming level 

1. Add utility scripts that can get the XPath of every object in an HTML page
2. Build a workflow that automates the whole process
3. Job management 
	- Multiprocessing
	- Asynchronous
	- Queue 
4. Scraping tutorial 
5. Scrapy, PhantomJS 

### Others 

1. Web scraping 101 tutorial 

