A simple web crawler CLI application.
The aim of the application is to enumerate all the links for each page on a given domain. The result of the crawl is output to a JSON file in the `/results` directory, named after the crawled domain, e.g. https://wiprodigital.com -> `/results/wiprodigital.com.json` (a sample has been included).
There are some caveats:
- The application will automatically exclude JS / CSS URLs
- The application will not crawl external URLs
- The application will not crawl sub-domain URLs, e.g. test.wiprodigital.com
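These exclusion rules boil down to a simple extension-and-host check. A minimal sketch of how such filtering might look (the `shouldCrawl` helper and the extension list are illustrative, not the application's actual code):

```js
const { URL } = require('url');

// Sketch only: illustrates the caveats above, not the crawler's actual implementation.
const EXCLUDED_EXTENSIONS = ['.js', '.css'];

function shouldCrawl(link, startUrl) {
  const url = new URL(link, startUrl); // resolve relative links against the start URL
  const start = new URL(startUrl);

  // Exclude JS / CSS assets
  if (EXCLUDED_EXTENSIONS.some(ext => url.pathname.endsWith(ext))) {
    return false;
  }

  // Exclude external URLs and sub-domains: the host must match exactly
  return url.hostname === start.hostname;
}
```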
Node v7.6+ required
`npm i`

`npm start <domain>`
Note: `<domain>` may also be set via the `START_URL` environment variable; if both values are supplied, the CLI argument takes precedence.
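That precedence could be resolved along these lines (a sketch only, assuming the domain arrives as the script's first CLI argument; the project's actual argument handling may differ):

```js
// Sketch: the CLI argument wins over the START_URL environment variable.
const startUrl = process.argv[2] || process.env.START_URL;

if (!startUrl) {
  console.error('Usage: npm start <domain>  (or set START_URL)');
  process.exit(1);
}
```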
`npm test`
- Applied SRP for scraping (`HttpPage`) and traversing (`WebCrawler`)
- Made use of `Map` to dedupe links
- Used recursion to traverse pages
- Took the decision not to create an "exporter" or "deserializer" class, given Node's native support for JSON serialization and file exporting. However, if the application needed to support various export types, introducing a common interface would perhaps be a good approach.
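As a rough illustration of the `Map`-based de-duplication and recursive traversal described above (the `crawl`, `scrape` and `exportResults` names are hypothetical; the real `HttpPage` and `WebCrawler` classes are organised differently):

```js
const fs = require('fs');

// Sketch only: recursive traversal with a Map keyed by URL to dedupe pages.
// `scrape` is any function that returns the links found on a page
// (the real application uses a cheerio-backed HttpPage class for that).
async function crawl(url, scrape, visited = new Map()) {
  if (visited.has(url)) return visited; // Map lookup skips already-seen pages
  const links = await scrape(url);
  visited.set(url, links);
  for (const link of links) {
    await crawl(link, scrape, visited); // recurse into each discovered page
  }
  return visited;
}

// Native JSON serialization and fs cover the export step without a dedicated exporter class.
function exportResults(visited, file) {
  const results = {};
  for (const [page, links] of visited) results[page] = links;
  fs.writeFileSync(file, JSON.stringify(results, null, 2));
}
```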
- Improve the performance and speed of the crawl, e.g. run scrapes in parallel or use pm2 to scale out (although we would need to be wary of race conditions, likely requiring a mutex of some kind)
- Include additional processing options, e.g. max page depth and rate-limiting (to protect against 429 errors)
- Decouple HTML parsing from the `HttpPage` class (maybe down the line we want to move away from cheerio)
- Move results deserialization / file exporting to separate classes
- Avoid crawling CMS-related URLs (`/xmlrpc.php`, `/wp-json`, etc.)
- Better handling of erroneous but valid URLs, e.g. http://domain.com//a/b/c; the crawler would currently treat //a as a domain in its own right (see the normalization sketch after this list)
- Better handling of hash (#fragment) URLs (although the page is the same, fragments may pull in dynamic content)
- Better file name validation
- Include stats, e.g. total links found, pages crawled, crawl times, etc.
- Include more unit tests (happy-path, edge-case and error scenarios)
- Include integration tests (validate against a real URL)
- Implement Babel to leverage newer ECMAScript syntax (e.g. `yield`, `Object.fromEntries`)
- Improve parameter validation (or better yet, use TypeScript)
- Improve instrumentation; utilise remote services such as Loggly, Prometheus or equivalent
- Run performance tests against readily available libraries such as `crawler`, to make sure the wheel is being reinvented for good reason
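For the URL-handling items above, Node's WHATWG `URL` class already disambiguates doubled slashes and fragments. A possible normalization step (a sketch only, assuming links are resolved against the start URL; this is not the crawler's current behaviour):

```js
const { URL } = require('url');

// Sketch: normalize a link before deciding whether to crawl it.
function normalize(link, base) {
  const url = new URL(link, base);
  url.hash = ''; // drop #fragments (though, as noted above, they can drive dynamic content)
  url.pathname = url.pathname.replace(/\/{2,}/g, '/'); // collapse accidental double slashes
  return url.href;
}

// The WHATWG parser keeps the host intact even with a doubled slash in the path:
//   new URL('http://domain.com//a/b/c').hostname === 'domain.com'
// but a protocol-relative href really does resolve to a different host:
//   new URL('//a/b/c', 'http://domain.com').hostname === 'a'
```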