farkaskid/WebCrawler


WebCrawler

A simple web crawler. It uses Go's lightweight goroutines to fetch many pages concurrently, which keeps crawling fast. Just try

crawl -url http://www.google.com

and enjoy.

An OSX binary is available in the release. You can download it and run it with the usual ./crawlerOSX syntax.

More Options

  • The -bound flag restricts the crawler to the domain of the given URL. Example: crawl -url http://www.netflix.com -bound
  • The -maxWorkers flag sets the maximum number of simultaneous goroutines the crawler will spawn. It defaults to 1000. Example: crawl -url http://www.netflix.com -maxWorkers 10000

Reporting

The crawler reports its results in plain files. A CrawlReport is generated for each crawled URL and contains:

  • URL: The URL that this report describes.
  • HttpStatus: The status of the response, if the HTTP GET request was successful.
  • Err: The error, if any, with the technical details of the failure.
  • ConnectedURLs: A list of URLs that were found in the response body of this URL.
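
The fields above might map onto a Go struct along these lines. Only the field names come from this README; the field types are assumptions. In particular, Err is modeled as a string here so that it serializes straightforwardly, and HttpStatus is assumed to hold a status line such as "200 OK".

```go
package main

import "fmt"

// CrawlReport mirrors the four fields listed above.
// All field types are assumed, not taken from the crawler's source.
type CrawlReport struct {
	URL           string   // the URL this report describes
	HttpStatus    string   // response status, set when the GET succeeded
	Err           string   // failure details, empty when there was no error
	ConnectedURLs []string // URLs found in the response body
}

func main() {
	r := CrawlReport{
		URL:           "http://www.netflix.com",
		HttpStatus:    "200 OK",
		ConnectedURLs: []string{"http://www.netflix.com/browse"},
	}
	fmt.Printf("%+v\n", r)
}
```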

All reports are written to the reports folder in the current working directory, which is created if it does not already exist. A single report file contains a batch of CrawlReports in the selected encoding. Currently, gob and JSON encodings are supported, with gob as the default. Pass the -json flag to encode the CrawlReports as JSON instead.

The number of CrawlReports per report file can be changed with the -reportSize flag (for example, -reportSize 300). It defaults to 500. Note that a very high value can cause the crawler to consume large amounts of memory, since more reports are held before a file is written.
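
The link between report size and memory use can be seen in a minimal batching sketch: reports accumulate in memory until the batch reaches the configured size, and only then are they handed off to be written. The batcher type and its flush callback are hypothetical and stand in for whatever the crawler actually does when it writes a report file.

```go
package main

import "fmt"

// CrawlReport is reduced to one field; the full struct is not needed here.
type CrawlReport struct {
	URL string
}

// batcher buffers reports and hands them to flush once size is reached.
// Hypothetical type illustrating the effect of -reportSize.
type batcher struct {
	size  int
	buf   []CrawlReport
	flush func([]CrawlReport)
}

// add appends a report and flushes when the batch is full.
func (b *batcher) add(r CrawlReport) {
	b.buf = append(b.buf, r)
	if len(b.buf) >= b.size {
		b.emit()
	}
}

// emit flushes any buffered reports, including a final partial batch.
func (b *batcher) emit() {
	if len(b.buf) == 0 {
		return
	}
	b.flush(b.buf)
	b.buf = nil // start a fresh slice so the flushed batch is not overwritten
}

func main() {
	files := 0
	b := &batcher{
		size: 2,
		flush: func(batch []CrawlReport) {
			files++
			fmt.Printf("file %d: %d reports\n", files, len(batch))
		},
	}
	for i := 0; i < 5; i++ {
		b.add(CrawlReport{URL: fmt.Sprintf("http://example.com/%d", i)})
	}
	b.emit() // write out the last, partially filled batch
}
```

With a size of 2 and five reports, this produces three batches of 2, 2, and 1 reports. A larger size means fewer files but a larger in-memory buffer between writes.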

Example: crawl -url http://www.netflix.com -bound -json -reportSize 100