Python version: 3.6
Installation of pdftotext: OS-level dependencies are listed at https://pypi.org/project/pdftotext/
Installation of Python packages:
pip install -r requirements.txt
python crawl.py <CRAWL_TYPE>
Additional crawl configuration is located in cfp_crawl/config.py, which also specifies the crawl log and data save directories.
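The configuration module might look like the following; every name and path here is a hypothetical illustration of what cfp_crawl/config.py centralizes, not the repo's actual contents:

```python
# Hypothetical sketch of cfp_crawl/config.py -- the real names may differ.
import os

# Directories for crawl logs and scraped data.
LOG_DIR = os.path.join("crawl", "logs")
DATA_DIR = os.path.join("crawl", "data")

# SQLite database holding basic conference info and crawled HTML.
DB_FILEPATH = os.path.join(DATA_DIR, "wikicfp.db")

# Path to the Selenium chromedriver executable.
CHROMEDRIVER_PATH = "./chromedriver"
```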
<CRAWL_TYPE> options:
- wikicfp_latest: crawls details of the most recent conferences listed on the wikicfp homepage at http://www.wikicfp.com
- wikicfp_all: traverses and scrapes information from every conference series on wikicfp, starting from http://www.wikicfp.com/cfp/series?t=c&i=A
- conf_crawl: assumes a database already populated with basic conference information by wikicfp_latest or wikicfp_all, then crawls each conference site and stores its HTML in the directory specified in cfp_crawl/config.py.
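A minimal sketch of how crawl.py might validate and dispatch on <CRAWL_TYPE>; the function name and dispatch logic are assumptions, not the script's actual code:

```python
import argparse

# The three crawl types described above.
CRAWL_TYPES = ["wikicfp_latest", "wikicfp_all", "conf_crawl"]

def main(argv=None):
    parser = argparse.ArgumentParser(description="WikiCFP crawler")
    parser.add_argument("crawl_type", choices=CRAWL_TYPES,
                        metavar="CRAWL_TYPE",
                        help="one of: " + ", ".join(CRAWL_TYPES))
    args = parser.parse_args(argv)
    # The real script would hand off to the matching crawler here.
    return args.crawl_type
```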
A Selenium chromedriver is needed to better simulate organic access to conference sites (e.g. waiting for JavaScript elements to load). The chromedriver should match your Chrome version and can be downloaded from https://chromedriver.chromium.org/. Move the executable into this repo, or to the path specified in cfp_crawl/config.py.
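Waiting for JavaScript-rendered content before saving a page could be sketched as below; the function names and driver path are illustrative, and the driver construction follows the Selenium 3 API that matched the chromedriver era (Selenium 4 passes the path via a Service object instead):

```python
def chrome_flags(headless=True):
    """Chrome command-line flags (pure helper, importable without Selenium)."""
    flags = ["--disable-gpu", "--window-size=1280,1024"]
    if headless:
        flags.append("--headless")
    return flags

def fetch_rendered_html(url, driver_path="./chromedriver", timeout=10):
    # Selenium imports are local so chrome_flags stays usable without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    for flag in chrome_flags():
        options.add_argument(flag)
    driver = webdriver.Chrome(executable_path=driver_path, options=options)
    try:
        driver.get(url)
        # Block until the page body exists, i.e. initial JS loading is done.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, "body")))
        return driver.page_source
    finally:
        driver.quit()
```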
The post-processing pipeline is run with python run.py <DB_FILEPATH> from ./post_processing.
Includes:
- Extraction of page lines from the database
- Generation of vocab and model training (ensure the database has labelled data)
- Prediction of all page lines
- (To be updated) Line Named Entity Recognition training to improve extraction
- Extraction of <Person/Affiliation/Role-label> tuples
- (To be updated) Name disambiguation using dblp
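The first stage, extracting page lines from the database, could look roughly like this; the table and column names are assumptions, not the project's actual schema:

```python
import sqlite3

def extract_page_lines(conn):
    """Split each stored page's text into individual lines for labelling.

    Assumes a table pages(id, text); the real schema may differ.
    """
    lines = []
    for page_id, text in conn.execute("SELECT id, text FROM pages"):
        for n, line in enumerate(text.splitlines()):
            line = line.strip()
            if line:  # skip blank lines, keep original line numbers
                lines.append((page_id, n, line))
    return lines

# Tiny demonstration with an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (id INTEGER, text TEXT)")
conn.execute("INSERT INTO pages VALUES (1, 'Keynote\n\nJane Doe, MIT')")
conn.commit()
```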