Download distribution zip (or tar.gz)
Full Changelog | Javadoc | Maven Central
New Features
- Groovy crawl configs (experimental): Groovy Bean Definition DSL can now be used as an experimental alternative to Spring XML. This enables more terse and human-readable job configuration with inline scripting capabilities. There is no user interface for it in this release. For now, you must manually create a crawler-beans.groovy file in your job directory. #632
- ExtractorHTML obeyRelNofollow: This option skips extraction of links marked rel=nofollow. This is useful for avoiding crawler traps on some sites. #638
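As a rough illustration of the two features above, a fragment of a crawler-beans.groovy file might look like the sketch below. It assumes the standard Spring Groovy Bean Definition DSL syntax (a top-level beans block, with each bean declared as name(Class) followed by a property closure) and reuses the extractorHtml bean id from the stock XML configuration. It is not a complete crawl configuration; a real job needs the rest of the beans that crawler-beans.cxml normally defines.

```groovy
// Hypothetical fragment of <job directory>/crawler-beans.groovy.
// Only the ExtractorHTML bean is shown; all other required beans are omitted.
import org.archive.modules.extractor.ExtractorHTML

beans {
    // Bean id chosen to match the one used in the stock XML config (assumption).
    extractorHtml(ExtractorHTML) {
        // Skip extraction of links marked rel=nofollow (#638).
        obeyRelNofollow = true
    }
}
```

Because there is no user interface for Groovy configs in this release, the file has to be created by hand in the job directory.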
Fixes
- Cookie rejected warning: The slf4j change in 3.6.0 inadvertently caused a previously hidden warning to be logged to job.log when a server sends a Set-Cookie header with a disallowed domain value. This warning is now suppressed since it occurs frequently and does not require any action from the crawl operator. #640
Changes
- Removed fastutil: A small number of usages of fastutil were replaced with standard library equivalents in webarchive-commons and Heritrix. This reduced the Heritrix distribution size from 51 MB to 34 MB. iipc/webarchive-commons#101
Dependency Upgrades
- amqp-client 5.24.0
- commons-codec 1.17.2
- ftpserver-core 1.2.1
- freemarker 2.3.34
- jetty 9.4.57.v20241219
- jsch 0.2.22
- restlet 2.5.0
- spring 6.1.16
- webarchive-commons 1.3.0