Originally designed as part of Virginia Tech's Crisis & Tragedy Recovery Network (CTRnet), this project crawls the internet and collects webpages related to a given topic, often for archival purposes.
- Driver class for this project
- Responsible for creating the configuration and classifier objects and invoking the crawler
- Crawler class responsible for collecting and exploring new URLs to find relevant pages
- Given a priority queue and a scoring class with a calculate_score(text) method
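The crawler's best-first strategy can be sketched as a loop over a priority queue: pop the highest-scoring URL, score its text with the scorer's calculate_score(text), and enqueue outgoing links when the page is relevant. This is an illustrative sketch, not the project's actual code; the `fetch(url) -> (text, links)` callable and the threshold are assumptions standing in for the real downloader and configuration.

```python
import heapq

def focused_crawl(seed_urls, scorer, fetch, threshold=0.5, max_pages=100):
    # fetch(url) -> (text, links) is a hypothetical stand-in for the
    # real downloader/parser; scorer exposes calculate_score(text).
    frontier = [(-1.0, url) for url in seed_urls]  # seeds enter with top priority
    heapq.heapify(frontier)
    visited, relevant = set(), []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)  # highest-priority URL first
        if url in visited:
            continue
        visited.add(url)
        text, links = fetch(url)
        score = scorer.calculate_score(text)
        if score >= threshold:
            relevant.append((url, score))
            # enqueue outgoing links, prioritized by the parent page's score
            for link in links:
                if link not in visited:
                    heapq.heappush(frontier, (-score, link))
    return relevant
```

Scores are negated on push because Python's heapq is a min-heap and the crawler wants the best-scoring URL first.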
- Parent class of classifiers (non-VSM), including NaiveBayesClassifier and SVMClassifier
- Contains code for tokenization and vectorization of document text using sklearn
- Child classes only have to assign self.model
- Subclass of Classifier, representing a Naïve Bayes classifier
- Subclass of Classifier, representing an SVM classifier
- Parent class of scorers, which are non-classifier models, typically VSM
- Subclass of Scorer, representing a tf-idf vector space model
- Subclass of Scorer, representing an LSI vector space model
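A scorer of this kind can be illustrated with a hand-rolled tf-idf model that rates new text by cosine similarity against the seed documents. This is a minimal stdlib sketch of the idea, not the project's implementation, which would typically use a library; the constructor signature and smoothing are assumptions.

```python
import math
from collections import Counter

class TfIdfScorer:
    """Illustrative tf-idf vector space scorer: builds a centroid vector
    from seed documents and scores text by cosine similarity against it."""

    def __init__(self, seed_documents):
        docs = [doc.lower().split() for doc in seed_documents]
        n = len(docs)
        vocab = {term for doc in docs for term in doc}
        # smoothed idf (an assumption; exact weighting schemes vary)
        self.idf = {
            term: math.log(n / sum(1 for d in docs if term in d)) + 1.0
            for term in vocab
        }
        # centroid of the seed documents' tf-idf vectors
        self.centroid = Counter()
        for doc in docs:
            tf = Counter(doc)
            for term, count in tf.items():
                self.centroid[term] += (count / len(doc)) * self.idf[term]

    def calculate_score(self, text):
        tokens = text.lower().split()
        if not tokens:
            return 0.0
        tf = Counter(tokens)
        vec = {t: (c / len(tokens)) * self.idf.get(t, 0.0) for t, c in tf.items()}
        dot = sum(v * self.centroid.get(t, 0.0) for t, v in vec.items())
        norm_a = math.sqrt(sum(v * v for v in vec.values()))
        norm_b = math.sqrt(sum(v * v for v in self.centroid.values()))
        if norm_a == 0.0 or norm_b == 0.0:
            return 0.0
        return dot / (norm_a * norm_b)
```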
- Configuration file for focused crawler in INI format
- Class responsible for reading the configuration file using ConfigParser
- Adds all configuration options to its internal dictionary (e.g. config["seedFile"])
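Flattening every INI option into one dictionary can be done with the standard library's configparser; the sketch below is an assumption about how that lookup style is achieved (the section and option names are made up for illustration).

```python
import configparser

def load_config(path):
    parser = configparser.ConfigParser()
    # preserve option case so keys like "seedFile" survive as written
    parser.optionxform = str
    parser.read(path)
    # flatten all sections into one dict for config["option"] lookups
    config = {}
    for section in parser.sections():
        for option, value in parser.items(section):
            config[option] = value
    return config
```

Overriding optionxform matters here: by default ConfigParser lowercases option names, which would turn "seedFile" into "seedfile".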
- Contains various utility functions relating to reading files and sanitizing/tokenizing text
- Contains URLs to relevant pages from which the focused crawler starts
- Default name, but can be modified in config.ini
- Simple implementation of a priority queue using a heap
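A heap-backed priority queue in Python is a thin wrapper over heapq. The sketch below is illustrative (method names are assumptions); it negates scores so the highest-scoring URL pops first, and uses an insertion counter as a tie-breaker so entries with equal scores never compare by payload.

```python
import heapq

class PriorityQueue:
    """Simple max-priority queue over Python's min-heap heapq module."""

    def __init__(self):
        self._heap = []
        self._counter = 0  # tie-breaker for equal scores

    def push(self, score, item):
        # negate the score: heapq pops the smallest tuple first
        heapq.heappush(self._heap, (-score, self._counter, item))
        self._counter += 1

    def pop(self):
        neg_score, _, item = heapq.heappop(self._heap)
        return -neg_score, item

    def __len__(self):
        return len(self._heap)
```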
- Uses BeautifulSoup and nltk to extract webpage text
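The extraction step can be sketched with BeautifulSoup alone; the nltk tokenization the module also performs is omitted here, and the function name is an assumption.

```python
from bs4 import BeautifulSoup

def extract_text(html):
    soup = BeautifulSoup(html, "html.parser")
    # drop script/style elements so only human-visible text remains
    for tag in soup(["script", "style"]):
        tag.decompose()
    # collapse whitespace into single spaces
    return " ".join(soup.get_text().split())
```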
For the full technical report, please visit: https://docs.google.com/file/d/0B436PtOU57sJZkc5anMyNDZPaHM/edit?usp=sharing