pipeline
The news-please pipeline offers several modules for processing, filtering and storing the results of the crawlers. This section explains the different pipeline modules and their configuration.
- Module path: `newscrawler.pipeline.pipelines.ArticleMasterExtractor`
- Functionality: The ArticleMasterExtractor bundles several tools into one pipeline module to extract metadata from raw articles. Based on the HTML response of the processed pipeline item, it extracts:
    - author
    - date the article was published
    - article title
    - article description
    - article text
    - top image
    - language of the article
- Configuration: The module works with the default settings, but the tools used in the extraction process can be reconfigured in the ArticleMasterExtractor section of the config file. More detailed information about the module and the incorporated extractors can be found here.
###Date filter
- Module path: `newscrawler.pipeline.pipelines.DateFilter`
- Functionality: This module filters the extracted articles by their publishing date. It keeps only articles published after a start date and/or before an end date. It also implements a strict mode that drops all articles without an extracted publishing date.
- Requirements: Because it needs metadata (the publishing date), the module only works if it is placed after a suitable extractor in the pipeline.
- Configuration: The configuration is done in the DateFilter section of `newscrawler.cfg`:

    ```ini
    [DateFilter]
    start_date = '1999-01-01 00:00:00'
    end_date = '2999-12-31 00:00:00'
    strict_mode = False
    ```

    Dates can be either None or a date string in the following format: `'yyyy-mm-dd hh:mm:ss'`
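The filtering rule described above can be sketched in a few lines of Python. This is only an illustration, not the module's actual code: the function name `keep_article` and its parameters are hypothetical, and treating both boundaries as inclusive is an assumption.

```python
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # Python equivalent of 'yyyy-mm-dd hh:mm:ss'


def keep_article(publish_date, start_date=None, end_date=None, strict_mode=False):
    """Return True if an article passes the date filter (hypothetical helper).

    All dates are strings in DATE_FORMAT, or None.
    """
    if publish_date is None:
        # Strict mode drops articles without an extracted publishing date.
        return not strict_mode
    published = datetime.strptime(publish_date, DATE_FORMAT)
    # Assumption: an article dated exactly on a boundary is kept.
    if start_date is not None and published < datetime.strptime(start_date, DATE_FORMAT):
        return False
    if end_date is not None and published > datetime.strptime(end_date, DATE_FORMAT):
        return False
    return True
```

With the default settings shown in the configuration, every article with a publishing date between 1999 and 2999 passes, and articles without a date pass as well because `strict_mode` is `False`.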
###HTML code handling
- Module path: `newscrawler.pipeline.pipelines.HTMLCodeHandling`
- Functionality: This module checks the server response and drops the processed page if the request was not accepted. As of 22.06.16 this module is not active, but serves as an example pipeline module.
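To illustrate what such an example pipeline module might look like, here is a minimal sketch written as a Scrapy-style item pipeline. It is not the code shipped with the crawler: the class name `StatusCodeFilter` is hypothetical, and both the `spider_response` item key and the reading of "not accepted" as "any status other than 200" are assumptions.

```python
try:
    from scrapy.exceptions import DropItem  # Scrapy's exception for discarding items
except ImportError:
    # Fallback so the sketch also runs without Scrapy installed.
    class DropItem(Exception):
        pass


class StatusCodeFilter:
    """Hypothetical pipeline module that drops pages whose request was not accepted."""

    def process_item(self, item, spider):
        # Assumption: the item carries the server response under 'spider_response'.
        response = item["spider_response"]
        if response.status != 200:
            raise DropItem("%s: unexpected status %d" % (response.url, response.status))
        # Returning the item hands it on to the next pipeline module.
        return item
```

A pipeline module either returns the item, so that later modules can process it, or raises `DropItem` to stop processing it.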
##Storage
###Local storage
- Module path: `newscrawler.pipeline.pipelines.LocalStorage`
- Functionality:
###Elasticsearch storage
- Module path: `newscrawler.pipeline.pipelines.ElasticsearchStorage`
- Functionality: This module stores the extracted data in a given Elasticsearch database. It manages two separate indices, one for current articles and one to archive previous versions of updated articles. Both indices use the following default mapping to store the articles and extracted metadata:

    ```python
    mapping = {
        'url': {'type': 'string', 'index': 'not_analyzed'},
        'sourceDomain': {'type': 'string', 'index': 'not_analyzed'},
        'pageTitle': {'type': 'string'},
        'rss_title': {'type': 'string'},
        'localpath': {'type': 'string', 'index': 'not_analyzed'},
        'ancestor': {'type': 'string'},
        'descendant': {'type': 'string'},
        'version': {'type': 'long'},
        'downloadDate': {'type': 'date', 'format': 'yyyy-MM-dd HH:mm:ss'},
        'modifiedDate': {'type': 'date', 'format': 'yyyy-MM-dd HH:mm:ss'},
        'publish_date': {'type': 'date', 'format': 'yyyy-MM-dd HH:mm:ss'},
        'title': {'type': 'string'},
        'description': {'type': 'string'},
        'text': {'type': 'string'},
        'author': {'type': 'string'},
        'image': {'type': 'string', 'index': 'not_analyzed'},
        'language': {'type': 'string', 'index': 'not_analyzed'}
    }
    ```

- Configuration: To use this module, enter the address, the port and, if needed, your user credentials in the Elasticsearch section of `newscrawler.cfg`. There you can also change the names of the indices and the mapping used to store the article data.
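The three date fields in the default mapping only accept strings in the `yyyy-MM-dd HH:mm:ss` format. As a rough sketch of how a document could be prepared to satisfy that constraint, the helper below renders Python `datetime` values accordingly. The function `to_es_document` is hypothetical and not part of the module; only the field names and the date format come from the mapping.

```python
from datetime import datetime

ES_DATE_FORMAT = "%Y-%m-%d %H:%M:%S"  # Python equivalent of "yyyy-MM-dd HH:mm:ss"
DATE_FIELDS = ("downloadDate", "modifiedDate", "publish_date")


def to_es_document(article):
    """Return a copy of `article` whose datetime values match the mapping's date format."""
    doc = dict(article)
    for field in DATE_FIELDS:
        value = doc.get(field)
        if isinstance(value, datetime):
            doc[field] = value.strftime(ES_DATE_FORMAT)
    return doc
```

Values that are already strings, or missing altogether, are left untouched, so the helper can be applied to partially extracted articles.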
###MySQL storage
- Module path: `newscrawler.pipeline.pipelines.MySQLStorage`
- Functionality: This module stores the extracted data in a given MySQL or MariaDB database. It manages two separate tables, one for current articles and one to archive previous versions of updated articles.
- Configuration: To use this module, enter the address, the port and, if needed, your user credentials in the MySQL section of `newscrawler.cfg`. There is also a setup script, `init-db.sql`, for conveniently creating the tables used.
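The two-table scheme (one table for current articles, one for archived versions) can be illustrated with a small sketch. To keep it runnable without a database server, it uses Python's built-in `sqlite3` as a stand-in for MySQL/MariaDB. The table names, columns and the functions `create_tables` and `store_article` are all hypothetical; the real schema is defined by `init-db.sql`.

```python
import sqlite3


def create_tables(conn):
    """Create the two illustrative tables (hypothetical names and columns)."""
    for table in ("current_articles", "archived_articles"):
        conn.execute(
            "CREATE TABLE IF NOT EXISTS %s "
            "(url TEXT, version INTEGER, title TEXT, text TEXT)" % table
        )


def store_article(conn, url, title, text):
    """Store an article as the current version and archive any previous version."""
    cur = conn.cursor()
    cur.execute("SELECT version, title, text FROM current_articles WHERE url = ?", (url,))
    row = cur.fetchone()
    version = 1
    if row is not None:
        old_version, old_title, old_text = row
        # Move the previous version into the archive table.
        cur.execute(
            "INSERT INTO archived_articles (url, version, title, text) VALUES (?, ?, ?, ?)",
            (url, old_version, old_title, old_text),
        )
        cur.execute("DELETE FROM current_articles WHERE url = ?", (url,))
        version = old_version + 1
    cur.execute(
        "INSERT INTO current_articles (url, version, title, text) VALUES (?, ?, ?, ?)",
        (url, version, title, text),
    )
    conn.commit()
    return version
```

Storing the same URL twice leaves exactly one row in the current table and moves the earlier version into the archive table.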
###RSS crawl compare
- Module path: `newscrawler.pipeline.pipelines.RSSCrawlCompare`
- Functionality: Like the MySQL storage module, this module works with MySQL or MariaDB databases. Unlike the MySQL module, however, it only works with articles returned by the RSS crawler. For every passed article, the module looks for an older version in the database and updates the fields if a certain time has passed since the last update/download. This module does not save new articles; it is only meant to keep the database up to date.
- Configuration: To use this module, enter the address, the port and, if needed, your user credentials in the MySQL section of `newscrawler.cfg`. To set up the tables used, execute the provided setup script `init-db.sql`. You can also change the interval at which articles are updated with the `hours_to_pass_for_redownload_by_rss_crawler` parameter in the Crawler section.
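The time check behind `hours_to_pass_for_redownload_by_rss_crawler` might look roughly like the sketch below. The function `needs_redownload` is hypothetical, and whether the actual module treats the boundary as inclusive is an assumption.

```python
from datetime import datetime, timedelta


def needs_redownload(last_download, now, hours_to_pass):
    """Return True once at least `hours_to_pass` hours have elapsed since `last_download`."""
    # Assumption: an article exactly `hours_to_pass` hours old is already due.
    return now - last_download >= timedelta(hours=hours_to_pass)
```

For instance, with a 24-hour interval, an article downloaded 23 hours ago would be skipped, while one downloaded a full day ago would be updated.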