This is my cornerstone project for the University of Michigan's Python for Everybody course, completed as part of the course's honors track.
Professor: Dr. Charles 'Chuck' Severance
Class: Python for Everybody
References:
- Coursera: Python for Everybody (University of Michigan)
- Dr. Chuck's Website: www.dr-chuck.com
- Free Python Materials: Python for Everybody
Websites used for research:
- Dr. Chuck's Projects Website
- Google.com
Description: This project was built using the files Dr. Chuck provides in his course; spider.py was written by hand.
Usage:
- Install all dependencies listed in the provided lock file.
- Run spider.py (a minimal sketch of the crawl step follows this list)
Requests via command line:
- URL to be spidered
- Enable the exception list
- Exception list text file (example of an exception: https://www.google.com/search... skips all Google URLs containing google.com/search)
- Enable saving of settings for easy setup
- When restarted, it asks whether you want to use a new URL or provide an updated exception list text file
- Crawls the designated URL, adding newly found URLs to a spider.sqlite DB (auto-creates the DB)
- Crawls the next URL in the sqlite DB
- Records the HTML (if found), the error code (if provided), and the number of attempts on the site (if it was unable to access it, with a max of 3 attempts)
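Below is a minimal sketch of one crawl step as described above, written against an assumed Pages(url, html, error, attempts, rank) schema. The table layout, function names, and prefix-matching exception check are illustrative assumptions, not the actual project code.

```python
# Hypothetical sketch of one crawl step; schema and names are assumptions.
import sqlite3
import urllib.error
import urllib.request

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()
# Auto-create the table on first run, as described above.
cur.execute('''CREATE TABLE IF NOT EXISTS Pages
    (url TEXT UNIQUE, html TEXT, error INTEGER,
     attempts INTEGER DEFAULT 0, rank REAL)''')

def is_excepted(url, exceptions):
    # An exception entry such as https://www.google.com/search skips
    # every URL that starts with that prefix.
    return any(url.startswith(prefix) for prefix in exceptions)

def crawl_one(url, exceptions, max_attempts=3):
    if is_excepted(url, exceptions):
        return
    cur.execute('INSERT OR IGNORE INTO Pages (url) VALUES (?)', (url,))
    cur.execute('SELECT attempts FROM Pages WHERE url = ?', (url,))
    row = cur.fetchone()
    if row and row[0] >= max_attempts:
        return  # give up on a URL after three failed attempts
    cur.execute('UPDATE Pages SET attempts = attempts + 1 WHERE url = ?', (url,))
    try:
        html = urllib.request.urlopen(url).read().decode(errors='replace')
        cur.execute('UPDATE Pages SET html = ? WHERE url = ?', (html, url))
        # (Extracting newly found links from html into the DB is omitted here.)
    except urllib.error.HTTPError as err:
        # Record the error code so failed sites can be retried later.
        cur.execute('UPDATE Pages SET error = ? WHERE url = ?', (err.code, url))
    conn.commit()
```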
- Run sprank.py (a rough sketch of the ranking pass follows this list)
Requests via command line:
- Number of iterations used to calculate the ranking of the URLs collected so far (a URL must have been visited by the crawler, not just collected)
- Cycles through the visited sites and ranks each one against all the other visited sites in the spider.sqlite DB
- Adds the ranking to the "rank" column
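The ranking step amounts to a simplified PageRank-style update repeated for the requested number of iterations. The sketch below assumes links between visited pages live in a Links(from_id, to_id) table and pages in Pages(id, rank); those names and the undamped update rule are assumptions for illustration.

```python
# Simplified, undamped PageRank-style pass; schema names are assumptions.
import sqlite3

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()

# Only pages the crawler actually visited take part in the ranking.
cur.execute('SELECT DISTINCT from_id FROM Links')
nodes = [row[0] for row in cur.fetchall()]
ranks = {node: 1.0 for node in nodes}

iterations = 10  # the count requested on the command line
for _ in range(iterations):
    next_ranks = {node: 0.0 for node in nodes}
    for node in nodes:
        cur.execute('SELECT to_id FROM Links WHERE from_id = ?', (node,))
        targets = [row[0] for row in cur.fetchall() if row[0] in ranks]
        if not targets:
            next_ranks[node] += ranks[node]  # keep rank if no outbound links
            continue
        share = ranks[node] / len(targets)  # split rank across outbound links
        for target in targets:
            next_ranks[target] += share
    ranks = next_ranks

# Write the final rank of each visited page back to the "rank" column.
for node, rank in ranks.items():
    cur.execute('UPDATE Pages SET rank = ? WHERE id = ?', (rank, node))
conn.commit()
```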
- Run spjson.py (a short sketch of the export step follows this list)
- Pulls the ranking and URL of each page from the DB
- Creates spider.js for force.html to use for its nodes
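The export is a small query-and-dump. In this sketch the spiderJson variable name and the node shape are assumptions modeled on common force-layout examples; the real force.html may expect a different structure.

```python
# Hypothetical export of ranked pages to spider.js for force.html.
import json
import sqlite3

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()
cur.execute('SELECT url, rank FROM Pages WHERE rank IS NOT NULL')

# One node per ranked page; force.html draws these as graph nodes.
nodes = [{'url': url, 'rank': rank} for url, rank in cur.fetchall()]
with open('spider.js', 'w') as out:
    out.write('spiderJson = ' + json.dumps({'nodes': nodes}) + ';\n')
```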
- Open force.html in a browser or web engine
Files:
- spider.py
- sprank.py
- spjson.py