Light, fast and scalable web scraper for structured data.
- Structured data
- Reads scraping jobs from rabbitmq
- Respects robotx.txt
- Renders javascript
- Regex matching option
- Re-schedules jobs when url is unreachable
- Allowes grouping of elements Parent -> child
- Config based
- Concurrency limiter (variable - default 10)
- In memory cache (variable - default 15 minutes)
$ git clone git@github.com:kamilernerd/busyboi.git
$ cd busyboi
$ make
Usage of ./busyboi:
-concurreny_limit int
Limits how many crawling jobs can run simultaneously (default 10)
-queue_host string
Hostname for rabbitmq (default "localhost")
-queue_name string
Queue name for rabbitmq (default "busyboi")
-queue_password string
Password for rabbitmq (default "guest")
-queue_port string
Port for rabbitmq (default "5672")
-queue_user string
User for rabbitmq (default "guest")
-cache_ttl int
How long should cache be stored (in seconds). Default is 15 minutes.
{
"collection": "some_random",
"url": "somerandomurl.com/some/directory/index.html",
"fields": [
{
"name": "some html element",
"selector": "body > p[class=\"hello_world\"]"
},
{
"name": "some html a element",
"selector": "body > a[class=\"hello_world\"]",
"attr": "href"
},
{
"name": "a group of elements within a parent",
"selector": ".some-parent-element",
"children": [
{
"name": "a link within a parent element",
"selector": "a[class=\"some_class"\]",
"attr": "href"
}
]
},
{
"name": "a group of elements within a parent",
"selector": ".some-parent-element",
"children": [
{
"name": "a link within a parent element",
"selector": "a[class=\"some_class"\]",
"attr": "href",
"regex": "(http|s)+"
}
]
},
{
"name": "a group of elements within a parent",
"selector": ".some-parent-element",
"regex": "(http|s)+",
"children": [
{
"name": "a link within a parent element",
"selector": "a[class=\"some_class"\]",
"attr": "href"
}
]
}
]
}
- Some long term storage support (MAYBE!)