WebCrawler

A simple web crawler CLI application.

The aim of the application is to enumerate all of the links on each page of a given domain. The result of the crawl is output to a JSON file in the results folder, named after the crawled domain, e.g. https://wiprodigital.com -> /results/wiprodigital.com.json (a sample has been included).
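
For illustration only (the helper name resultsFilePath is an assumption, not taken from the codebase), the domain-to-file mapping could be derived with Node's built-in url and path modules along these lines:

    const path = require('path');
    const { URL } = require('url');

    // Illustrative sketch: derive the results file path from the start URL,
    // e.g. https://wiprodigital.com -> results/wiprodigital.com.json
    function resultsFilePath(startUrl, resultsDir = 'results') {
      const { hostname } = new URL(startUrl);
      return path.join(resultsDir, `${hostname}.json`);
    }

    console.log(resultsFilePath('https://wiprodigital.com')); // results/wiprodigital.com.json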

There are some caveats:

  • The application will automatically exclude JS / CSS URLs
  • The application will not crawl external URLs
  • The application will not crawl sub-domain URLs e.g. test.wiprodigital.com
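
A minimal sketch of how those rules might be expressed with the WHATWG URL parser (illustrative only, not the project's actual filtering logic):

    const { URL } = require('url');

    // Decide whether a discovered link should be crawled, given the start URL.
    // Only illustrates the caveats above; the real crawler may differ.
    function shouldCrawl(link, startUrl) {
      const target = new URL(link, startUrl);  // resolve relative links against the start URL
      const start = new URL(startUrl);

      if (/\.(js|css)$/i.test(target.pathname)) return false;  // exclude JS / CSS assets
      if (target.hostname !== start.hostname) return false;    // exclude external & sub-domain URLs

      return true;
    }

    shouldCrawl('https://wiprodigital.com/about', 'https://wiprodigital.com');  // true
    shouldCrawl('https://test.wiprodigital.com/', 'https://wiprodigital.com');  // false
    shouldCrawl('/assets/app.js', 'https://wiprodigital.com');                  // false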

Installing

Node v7.6+ required

npm i

Running

npm start <domain>

Note - <domain> may also be set via the START_URL environment variable; if both are provided, the CLI argument takes precedence
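
A minimal sketch of how that precedence could be read inside the entry script (assuming the domain arrives as the first script argument; not the project's actual code):

    // CLI argument takes precedence over the START_URL environment variable.
    const startUrl = process.argv[2] || process.env.START_URL;

    if (!startUrl) {
      console.error('Usage: npm start <domain> (or set START_URL)');
      process.exit(1);
    }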

Testing

npm test

Notes

  • Applied the Single Responsibility Principle (SRP) by separating scraping (HttpPage) from traversal (WebCrawler)
  • Made use of Map to dedupe links
  • Used recursion to traverse pages
  • Took the decision not to create an "exporter" or "serializer" class, given Node's native support for JSON serialization and file writing. However, if the application needed to support various export types, introducing a common interface would perhaps be a good approach.
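
A simplified sketch of the recursive traversal and Map-based dedupe described above (fetchLinks is a hypothetical stand-in for the HttpPage scrape, not the project's real API):

    // Hypothetical stand-in for the HttpPage scrape: returns the in-scope links on a page.
    async function fetchLinks(url) {
      return []; // a real implementation would fetch the page and extract anchor hrefs
    }

    // Recursive traversal; the Map keys dedupe already-visited pages.
    async function crawl(url, visited = new Map()) {
      if (visited.has(url)) return visited;
      const links = await fetchLinks(url);
      visited.set(url, links);               // page URL -> links found on that page

      for (const link of links) {
        await crawl(link, visited);          // recurse into each in-scope link
      }
      return visited;
    }

    // Exporting is then plain JSON serialization plus a file write, e.g.:
    // fs.writeFileSync(file, JSON.stringify([...visited], null, 2));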

TODO

  • Improve crawl performance and speed, e.g. run scrapes in parallel or use pm2 to scale out (although we would need to be wary of race conditions and would likely need a mutex of some kind)
  • Include additional processing options e.g. max page depth, rate-limiting (protect against 429 errors)
  • Decouple HTML parsing from HttpPage class (maybe down the line we want to move away from cheerio)
  • Move results serialization / file exporting to separate classes
  • Avoid crawling CMS-related URLs (/xmlrpc.php, /wp-json etc.)
  • Better handling of erroneous but valid URLs, e.g. http://domain.com//a/b/c - the crawler would currently treat //a as a domain in its own right (see the sketch after this list)
  • Better handling of fragment (#) URLs (although the page is the same, they may pull dynamic content)
  • Better file name validation
  • Include stats e.g. total links found, pages crawled, crawl times etc.
  • Include more unit tests (happy-day, edge-case, error scenarios)
  • Include integration tests (validate against a real URL)
  • Implement Babel to leverage newer ECMAScript syntax (e.g. yield, Object.fromEntries)
  • Improve parameter validation (or better yet, use TypeScript)
  • Improve instrumentation, utilise remote services like Loggly, Prometheus or equivalent
  • Perf tests against readily available libs like crawler, to make sure we are reinventing the wheel for good reason
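
As a small illustration of the double-slash item above, the WHATWG URL parser already treats http://domain.com//a/b/c as belonging to domain.com, which could replace any naive host extraction (sketch only):

    const { URL } = require('url');

    // http://domain.com//a/b/c is a valid URL whose host is domain.com;
    // the double slash belongs to the path, not the authority.
    const u = new URL('http://domain.com//a/b/c');
    console.log(u.hostname); // domain.com
    console.log(u.pathname); // //a/b/c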
