Task Manager's task implementation

Tingxuan Gu edited this page Sep 5, 2019 · 2 revisions

This page shows how to construct a task for the Task Manager to run.

Description

Typically, a runnable/task is supported by three parts:

  • crawler - crawls data from the selected website
  • extractor - parses the downloaded data and extracts the information you need
  • dumper - writes the extracted information into the database

These parts can change if the task serves a different purpose, e.g. classification.
The task file itself usually contains only a run function, which the Task Manager calls.
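
The structure described above can be sketched as follows. This is a minimal illustration, not the project's actual code: the function names crawl, extract and dump, the fetch parameter, the "key=value" input format and the items table are all assumptions made for the example, and the three parts are shown in one file only for brevity.

```python
import sqlite3
from typing import Callable, Iterable, List, Tuple


def crawl(url: str, fetch: Callable[[str], str]) -> str:
    """Crawler: return the raw content for one URL.

    `fetch` is a hypothetical callable that does the real download.
    """
    return fetch(url)


def extract(raw: str) -> List[Tuple[str, str]]:
    """Extractor: assumes one 'key=value' pair per line; other lines are skipped."""
    rows = []
    for line in raw.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            rows.append((key.strip(), value.strip()))
    return rows


def dump(rows: List[Tuple[str, str]], conn: sqlite3.Connection) -> None:
    """Dumper: write the extracted rows into a hypothetical `items` table."""
    conn.execute("CREATE TABLE IF NOT EXISTS items (key TEXT, value TEXT)")
    conn.executemany("INSERT INTO items VALUES (?, ?)", rows)
    conn.commit()


def run(urls: Iterable[str], fetch: Callable[[str], str],
        conn: sqlite3.Connection) -> int:
    """Entry point the Task Manager would call; returns the number of rows stored."""
    total = 0
    for url in urls:
        rows = extract(crawl(url, fetch))
        dump(rows, conn)
        total += len(rows)
    return total
```

In a real task, crawl, extract and dump would more likely live in their own modules, with only run left in the task file.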

Implementation

Normally, you run the pipeline one file at a time.
For each file you want from the website:

  • first, crawl (download) the file, e.g. using wget or requests
  • then extract and dump it
  • then delete the file to save disk space
  • finally, move on to the next file
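
The per-file loop above might look roughly like the sketch below. The names process_files, download and extract_and_dump are hypothetical, and download_with_urllib uses the standard library's urllib.request.urlretrieve as a stand-in for wget or requests so that the example has no third-party dependencies.

```python
import os
import tempfile
import urllib.request
from typing import Callable, Iterable, Optional


def download_with_urllib(url: str, path: str) -> None:
    """Stand-in downloader; a real task might call wget or requests instead."""
    urllib.request.urlretrieve(url, path)


def process_files(urls: Iterable[str],
                  download: Callable[[str, str], None],
                  extract_and_dump: Callable[[str], None],
                  workdir: Optional[str] = None) -> int:
    """Handle one file at a time so only one download is on disk at once."""
    workdir = workdir or tempfile.mkdtemp()
    done = 0
    for index, url in enumerate(urls):
        path = os.path.join(workdir, f"file_{index}")
        try:
            download(url, path)      # 1. crawl the file
            extract_and_dump(path)   # 2. extract and dump it
            done += 1
        finally:
            # 3. delete the file to save space, even if a step failed
            if os.path.exists(path):
                os.remove(path)
    return done                      # 4. the loop moves on to the next file
```

Deleting the file in a finally block is a design choice of this sketch: it keeps the working directory empty even when extraction or dumping raises.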

Remember to add logging so that any exceptions raised in the task are caught and recorded.
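
One possible way to do this with Python's standard logging module is sketched below. The logger name and the run_safely helper are assumptions for illustration. Whether a failed item should be skipped, as here, or should stop the whole task depends on the task.

```python
import logging
from typing import Callable, Iterable, List

# Hypothetical logger name; use whatever naming the project already follows.
logger = logging.getLogger("task_manager.example_task")


def run_safely(items: Iterable[str], handle: Callable[[str], None]) -> List[str]:
    """Process each item, logging failures instead of aborting the whole run.

    Returns the list of items that failed.
    """
    failed = []
    for item in items:
        try:
            handle(item)
            logger.info("processed %s", item)
        except Exception:
            # logger.exception records the message together with the traceback
            logger.exception("failed to process %s", item)
            failed.append(item)
    return failed
```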