rorygibson/linkin

linkin

A minimal, async Clojure web crawling library.

("linkin" from Linkin Park's "Crawling", and because links)

Features

  • Uses http-kit and core.async for async fetching
  • Uses Jsoup to reliably extract links from scraped pages
  • Allows the user to pass in their own function to handle the body content of scraped pages
  • Respects the robots.txt of the target website (allow & disallow rules)
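To make the allow/disallow behaviour concrete, here is a minimal sketch of how such rules can be evaluated. It is not linkin's actual implementation: `parse-rules` and `allowed?` are hypothetical names introduced for illustration, only the `User-agent: *` group is read, wildcards in paths are not supported, and the "longest matching prefix wins" rule is an assumption borrowed from common robots.txt practice.

```clojure
(require '[clojure.string :as str])

(defn parse-rules
  "Hypothetical helper: returns a vector of [kind path] pairs, where kind is
  :allow or :disallow, read from the `User-agent: *` groups of a robots.txt body."
  [robots-txt]
  (:rules
   (reduce
    (fn [{:keys [applies? in-agents?] :as state} raw]
      (let [line          (str/trim (first (str/split raw #"#" 2))) ; drop comments
            [field value] (map str/trim (str/split line #":" 2))
            field         (str/lower-case (or field ""))
            value         (or value "")]
        (case field
          ;; consecutive User-agent lines belong to the same group
          "user-agent" (assoc state
                              :in-agents? true
                              :applies? (or (and in-agents? applies?)
                                            (= value "*")))
          ("allow" "disallow")
          (cond-> (assoc state :in-agents? false)
            ;; an empty value (e.g. `Disallow:`) adds no rule
            (and applies? (seq value))
            (update :rules conj [(keyword field) value]))
          ;; blank lines and unknown fields leave the state unchanged
          state)))
    {:applies? false :in-agents? false :rules []}
    (str/split-lines robots-txt))))

(defn allowed?
  "Hypothetical helper: true when `path` may be fetched under `rules`.
  The longest matching prefix wins, :allow wins a tie, no match means allowed."
  [rules path]
  (let [matches (filter (fn [[_ prefix]] (str/starts-with? path prefix)) rules)]
    (if (empty? matches)
      true
      (let [longest (apply max (map (comp count second) matches))
            best    (filter #(= longest (count (second %))) matches)]
        (boolean (some #(= :allow (first %)) best))))))

(def example-rules
  (parse-rules "User-agent: *\nDisallow: /private/\nAllow: /private/public/\n"))

(allowed? example-rules "/index.html")            ;; => true
(allowed? example-rules "/private/secret.html")   ;; => false
(allowed? example-rules "/private/public/a.html") ;; => true
```

A crawler would typically fetch `/robots.txt` once per host, parse it, and check each candidate URL's path before queueing it.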

Usage

linkin is not yet on Clojars, so you'll need to build and install it locally using a method of your choice (lein-localrepo is a good option).

Then include the following dependency in your project.clj:

[linkin "0.1.0-SNAPSHOT"]

Then:

(require 'linkin.core)
(linkin.core/crawl "http://example.com" linkin.core/simple-body-parser)

Todo

  • Throttling (i.e. don't DoS target sites...)
  • Control throttling using the Crawl-delay directive in robots.txt
  • Control of max depth / number of pages crawled
  • Ability to spider across domains
  • Pass options through to http-kit (e.g. following redirects)
  • Filtering by content type
  • Stats while running
  • Better URL normalization (for detecting URLs we've seen before) - see http://en.wikipedia.org/wiki/URL_normalization
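As an illustration of the URL normalization item, the sketch below shows one way to canonicalize URLs so that equivalent forms compare equal when checking whether a page has been seen. It is only a sketch, not part of linkin: `normalize-url` is a hypothetical name, the input is assumed to be an absolute http or https URL, and the chosen steps (lowercasing scheme and host, dropping default ports and fragments, resolving dot segments, defaulting an empty path to `/`) are a subset of those described in the Wikipedia article.

```clojure
(require '[clojure.string :as str])
(import 'java.net.URI)

(defn normalize-url
  "Hypothetical helper: returns a canonical string for an absolute http(s) URL."
  [url]
  (let [uri      (.normalize (URI. url))           ; resolves "." and ".." segments
        scheme   (str/lower-case (.getScheme uri))
        host     (str/lower-case (.getHost uri))
        port     (.getPort uri)                    ; -1 when no port is given
        default? (or (= port -1)
                     (and (= scheme "http") (= port 80))
                     (and (= scheme "https") (= port 443)))
        raw-path (.getRawPath uri)
        path     (if (str/blank? raw-path) "/" raw-path)
        query    (.getRawQuery uri)]
    ;; the fragment is dropped: it never reaches the server
    (str scheme "://" host
         (when-not default? (str ":" port))
         path
         (when query (str "?" query)))))

(normalize-url "HTTP://Example.COM:80/a/./b/../c#top")
;; => "http://example.com/a/c"
```

Storing the normalized string in a set of visited URLs lets `http://example.com` and `HTTP://EXAMPLE.com:80/#frag` be recognized as the same page.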

License

Copyright © 2014 Rory Gibson

Distributed under the Eclipse Public License version 1.0.
