This is my cornerstone project for the University of Michigan's Python for Everybody course, completed as part of the course's honors track.
Professor: Dr. Charles 'Chuck' Severance
Class: Python for Everybody
References:
- Coursera: Python for Everybody (University of Michigan)
- Dr. Chuck's Website: www.dr-chuck.com
- Free Python Materials: Python for Everybody
Websites used for research:
- Dr. Chuck's Projects Website
- Google.com
Description: This project was built using the files Dr. Chuck provides in his course; spider.py was written by hand.
Usage:
- Install all dependencies listed in the provided lock file.
- Run spider.py (a minimal sketch of the crawl step follows this list)
Requests via command line:
- URL to be spidered
- Enable the exception list
- Exception list text file (example of an exception: https://www.google.com/search... skips all Google URLs containing google.com/search)
- Enable saving of settings for easy setup
- When restarted, it asks whether you want to use a new URL or provide an updated exception list text file
- Crawls the designated URL, adding newly found URLs to a spider.sqlite DB (auto-creates the DB)
- Crawls the next URL in the sqlite DB
- Records the HTML (if found), the error code (if provided), and the number of attempts on the site (if it was unable to access it, with a max of 3 attempts)
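Below is a minimal sketch of one crawl step as described above, written against an assumed Pages(url, html, error, attempts, rank) schema. The table layout, function names, and prefix-matching exception check are illustrative assumptions, not the actual project code.

```python
# Hypothetical sketch of one crawl step; schema and names are assumptions.
import sqlite3
import urllib.error
import urllib.request

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()
# Auto-create the table on first run, as described above.
cur.execute('''CREATE TABLE IF NOT EXISTS Pages
    (url TEXT UNIQUE, html TEXT, error INTEGER,
     attempts INTEGER DEFAULT 0, rank REAL)''')

def is_excepted(url, exceptions):
    # An exception entry such as https://www.google.com/search skips
    # every URL that starts with that prefix.
    return any(url.startswith(prefix) for prefix in exceptions)

def crawl_one(url, exceptions, max_attempts=3):
    if is_excepted(url, exceptions):
        return
    cur.execute('INSERT OR IGNORE INTO Pages (url) VALUES (?)', (url,))
    cur.execute('SELECT attempts FROM Pages WHERE url = ?', (url,))
    row = cur.fetchone()
    if row and row[0] >= max_attempts:
        return  # give up on a URL after three failed attempts
    cur.execute('UPDATE Pages SET attempts = attempts + 1 WHERE url = ?', (url,))
    try:
        html = urllib.request.urlopen(url).read().decode(errors='replace')
        cur.execute('UPDATE Pages SET html = ? WHERE url = ?', (html, url))
        # (Extracting newly found links from html into the DB is omitted here.)
    except urllib.error.HTTPError as err:
        # Record the error code so failed sites can be retried later.
        cur.execute('UPDATE Pages SET error = ? WHERE url = ?', (err.code, url))
    conn.commit()
```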
- Run sprank.py (a rough sketch of the ranking pass follows this list)
Requests via command line:
- Number of iterations used to calculate the ranking of the URLs collected so far (a URL must have been visited by the crawler, not just collected)
- Cycles through the visited sites and ranks each one against all the other visited sites in the spider.sqlite DB
- Adds the ranking to the "rank" column
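The ranking step amounts to a simplified PageRank-style update repeated for the requested number of iterations. The sketch below assumes links between visited pages live in a Links(from_id, to_id) table and pages in Pages(id, rank); those names and the undamped update rule are assumptions for illustration.

```python
# Simplified, undamped PageRank-style pass; schema names are assumptions.
import sqlite3

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()

# Only pages the crawler actually visited take part in the ranking.
cur.execute('SELECT DISTINCT from_id FROM Links')
nodes = [row[0] for row in cur.fetchall()]
ranks = {node: 1.0 for node in nodes}

iterations = 10  # the count requested on the command line
for _ in range(iterations):
    next_ranks = {node: 0.0 for node in nodes}
    for node in nodes:
        cur.execute('SELECT to_id FROM Links WHERE from_id = ?', (node,))
        targets = [row[0] for row in cur.fetchall() if row[0] in ranks]
        if not targets:
            next_ranks[node] += ranks[node]  # keep rank if no outbound links
            continue
        share = ranks[node] / len(targets)  # split rank across outbound links
        for target in targets:
            next_ranks[target] += share
    ranks = next_ranks

# Write the final rank of each visited page back to the "rank" column.
for node, rank in ranks.items():
    cur.execute('UPDATE Pages SET rank = ? WHERE id = ?', (rank, node))
conn.commit()
```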
- Run spjson.py (a short sketch of the export step follows this list)
- Pulls the ranking and URL of each page from the DB
- Creates spider.js for force.html to use for its nodes
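The export is a small query-and-dump. In this sketch the spiderJson variable name and the node shape are assumptions modeled on common force-layout examples; the real force.html may expect a different structure.

```python
# Hypothetical export of ranked pages to spider.js for force.html.
import json
import sqlite3

conn = sqlite3.connect('spider.sqlite')
cur = conn.cursor()
cur.execute('SELECT url, rank FROM Pages WHERE rank IS NOT NULL')

# One node per ranked page; force.html draws these as graph nodes.
nodes = [{'url': url, 'rank': rank} for url, rank in cur.fetchall()]
with open('spider.js', 'w') as out:
    out.write('spiderJson = ' + json.dumps({'nodes': nodes}) + ';\n')
```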
- Open force.html in a browser or web engine
Files:
- spider.py
- sprank.py
- spjson.py