Add Per-Host Rate Limiting and Caching #1605

mre · 2025-01-06T14:37:11Z

Currently, lychee struggles with rate limiting and cache effectiveness when checking links, particularly when many requests go to the same host. This leads to several issues that need to be addressed:

Current Problems

Multiple concurrent requests to the same host trigger rate limits (429 errors) (See Add custom delay inbetween requests (prevent ban) #989)
Cache is ineffective with high concurrency due to race conditions (See The cache is ineffective with the default concurrency, for links in a website's theme #1593 (comment))
Global concurrency settings are too coarse-grained
Different hosts have different rate limit requirements
Headers are applied to all hosts, causing potential security issues (Security: restrict custom HTTP request headers to specific URL patterns #1298 and custom Header not sent #1441 (comment))

Proposed Solution

We should implement a smart per-host rate limiting and caching system that would:

Track rate limits per host using a concurrent HashMap:

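// Note: std's HashMap is not concurrent on its own; shared access
// needs a lock around it or a concurrent map type.
use std::collections::HashMap;
use std::time::Duration;
use time::OffsetDateTime;

// Per-host rate-limit state; every field is optional.
struct HostConfig {
    // When the host's current rate-limit window resets, if known.
    rate_limit_reset: Option<OffsetDateTime>,
    // Minimum delay between two requests to this host.
    request_delay: Option<Duration>,
    // Upper bound on in-flight requests to this host.
    max_concurrent_requests: Option<u32>,
}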
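A minimal sketch of how per-host tracking could be shared between request tasks, assuming a plain `Mutex` around std's `HashMap` (a concurrent map crate would work as well). `HostRegistry`, `HostState`, and `reserve_slot` are hypothetical names, not existing lychee APIs, and the sketch uses `std::time::Instant` instead of `OffsetDateTime` to stay dependency-free.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical per-host state: the earliest instant at which the next
/// request to this host may be sent.
struct HostState {
    next_allowed: Instant,
}

/// Hypothetical registry shared by all request tasks.
struct HostRegistry {
    delay: Duration,
    hosts: Mutex<HashMap<String, HostState>>,
}

impl HostRegistry {
    fn new(delay: Duration) -> Self {
        Self { delay, hosts: Mutex::new(HashMap::new()) }
    }

    /// Reserves the next send slot for `host` and returns how long the
    /// caller should wait, measured from `now`, before sending.
    fn reserve_slot(&self, host: &str, now: Instant) -> Duration {
        let mut hosts = self.hosts.lock().unwrap();
        let state = hosts
            .entry(host.to_string())
            .or_insert(HostState { next_allowed: now });
        // Never schedule a slot in the past.
        let slot = state.next_allowed.max(now);
        state.next_allowed = slot + self.delay;
        slot.duration_since(now)
    }
}
```

With a 100ms delay, three back-to-back reservations for the same host would be told to wait 0ms, 100ms, and 200ms, while a different host gets its first slot immediately.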

Implement smarter caching:

Maintain separate cache states per host
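One way separate per-host cache states could look, as a rough sketch only. `PerHostCache`, `for_host`, and `get_or_check` are hypothetical names, and the cached value is simplified to a bare status code; lychee's real cache stores more than that.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical cached result: the HTTP status code of an earlier check.
type Status = u16;

/// One cache per host, each behind its own lock, so lookups for
/// different hosts do not contend with each other.
#[derive(Default)]
struct PerHostCache {
    hosts: Mutex<HashMap<String, Arc<Mutex<HashMap<String, Status>>>>>,
}

impl PerHostCache {
    /// Returns the cache for `host`, creating it on first use.
    fn for_host(&self, host: &str) -> Arc<Mutex<HashMap<String, Status>>> {
        let mut hosts = self.hosts.lock().unwrap();
        hosts.entry(host.to_string()).or_default().clone()
    }

    /// Returns the cached status for `url`, or runs `check` and stores
    /// its result. The host's lock is held while `check` runs, so two
    /// callers cannot both miss the cache for the same URL.
    fn get_or_check(&self, host: &str, url: &str, check: impl FnOnce() -> Status) -> Status {
        let cache = self.for_host(host);
        let mut entries = cache.lock().unwrap();
        let status = *entries.entry(url.to_string()).or_insert_with(check);
        status
    }
}
```

Holding the host's lock during the check serializes all checks for that host. That trades throughput for the guarantee that a URL is only checked once; a finer-grained design would lock per URL instead.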

Stretch goal: Add configuration options per host:

lychee --max-concurrency-per-host github.com=10 --delay-per-host github.com=100ms
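Parsing the proposed `host=value` arguments could be sketched like this. The flag names come from the proposal above; the helper functions are hypothetical, and only the `ms` and `s` delay suffixes are handled.

```rust
use std::time::Duration;

/// Splits a `host=value` argument at the first `=`.
fn split_host_value(arg: &str) -> Option<(&str, &str)> {
    let (host, value) = arg.split_once('=')?;
    if host.is_empty() || value.is_empty() {
        return None;
    }
    Some((host, value))
}

/// Parses a delay such as `100ms` or `1s`. Only these two suffixes are
/// handled in this sketch.
fn parse_delay(value: &str) -> Option<Duration> {
    if let Some(ms) = value.strip_suffix("ms") {
        ms.parse::<u64>().ok().map(Duration::from_millis)
    } else if let Some(s) = value.strip_suffix('s') {
        s.parse::<u64>().ok().map(Duration::from_secs)
    } else {
        None
    }
}

/// Parses a value for the proposed `--max-concurrency-per-host` flag.
fn parse_max_concurrency(arg: &str) -> Option<(String, u32)> {
    let (host, value) = split_host_value(arg)?;
    Some((host.to_string(), value.parse::<u32>().ok()?))
}

/// Parses a value for the proposed `--delay-per-host` flag.
fn parse_delay_per_host(arg: &str) -> Option<(String, Duration)> {
    let (host, value) = split_host_value(arg)?;
    Some((host.to_string(), parse_delay(value)?))
}
```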

Stretch goal II: Add support for per-host headers

The idea would be to maintain a HeaderMap.
See #1297 for details.
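A rough sketch of the per-host header idea, under the assumption that headers are looked up by exact host name. `HostHeaders` is a hypothetical type; a real implementation would likely keep an `http::HeaderMap` per host, and plain strings are used here only to keep the example dependency-free.

```rust
use std::collections::HashMap;

/// Hypothetical per-host header table.
#[derive(Default)]
struct HostHeaders {
    by_host: HashMap<String, Vec<(String, String)>>,
}

impl HostHeaders {
    /// Registers a header for `host`. Host names are compared
    /// case-insensitively, so they are stored in lowercase.
    fn insert(&mut self, host: &str, name: &str, value: &str) {
        self.by_host
            .entry(host.to_ascii_lowercase())
            .or_default()
            .push((name.to_string(), value.to_string()));
    }

    /// Returns only the headers configured for exactly this host, so
    /// credentials for one host are never sent to another.
    fn headers_for(&self, host: &str) -> &[(String, String)] {
        self.by_host
            .get(&host.to_ascii_lowercase())
            .map(Vec::as_slice)
            .unwrap_or(&[])
    }
}
```

Requests to hosts without an entry get an empty slice, which is the behaviour the security issues referenced above ask for.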

Implementation Notes

Use the existing rate-limits crate, which is mostly useful for APIs
Handle 429 responses with proper backoff using response headers when available
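The 429 handling could be sketched as follows. `backoff_delay` is a hypothetical helper, and the base delay and the cap are assumed values, not lychee defaults. Only the delay-in-seconds form of `Retry-After` is parsed; the HTTP-date form falls back to exponential backoff.

```rust
use std::time::Duration;

/// Picks the wait before retrying a request that got a 429 response.
///
/// `retry_after` is the raw `Retry-After` header value, if the server
/// sent one. `attempt` counts retries already made, starting at 0.
fn backoff_delay(retry_after: Option<&str>, attempt: u32) -> Duration {
    const BASE: Duration = Duration::from_millis(500); // assumed default
    const MAX: Duration = Duration::from_secs(60); // assumed cap

    // Prefer the server's own hint when it is a plain number of seconds.
    if let Some(secs) = retry_after.and_then(|v| v.trim().parse::<u64>().ok()) {
        return Duration::from_secs(secs).min(MAX);
    }
    // Otherwise use exponential backoff: BASE * 2^attempt, capped at MAX.
    let factor = 2u32.saturating_pow(attempt);
    BASE.saturating_mul(factor).min(MAX)
}
```

Capping a server-provided `Retry-After` at `MAX` is a design choice made for this sketch; a checker could just as well give up on the host when the requested wait is too long.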

Benefits

Prevents IP bans from aggressive checking
More efficient resource usage
Better compliance with API rate limits
Improved cache effectiveness. Since the cache is per host, there would be no synchronization issues
Faster overall execution by avoiding unnecessary retries

Examples

[hosts."github.com"]
max_concurrent_requests = 10
request_delay = "100ms"
headers = { Authorization = "token ghp_xxxx", "User-Agent" = "my-bot" }

[hosts."api.example.com"]
max_concurrent_requests = 1
request_delay = "1s"
headers = { "X-API-Key" = "secret", Accept = "application/json" }

CLI usage example:

lychee --max-concurrency-per-host github.com=10 --delay-per-host github.com=100ms

And when adding headers:

lychee \
  --max-concurrency-per-host github.com=10 \
  --delay-per-host github.com=100ms \
  --headers-per-host 'github.com=Authorization:token ghp_xxxx,User-Agent:my-bot'
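The `host=Name:value,Name:value` format used by the proposed `--headers-per-host` flag could be parsed roughly like this. `parse_headers_per_host` is a hypothetical helper, and the format itself is still open per the proposal.

```rust
/// Parses `host=Name:value,Name:value` into the host and its headers.
/// Caveat: header values that contain `,` cannot be expressed in this
/// format as written.
fn parse_headers_per_host(arg: &str) -> Option<(String, Vec<(String, String)>)> {
    let (host, spec) = arg.split_once('=')?;
    if host.is_empty() {
        return None;
    }
    let mut headers = Vec::new();
    for pair in spec.split(',') {
        // Split at the first `:` only, so values may contain colons.
        let (name, value) = pair.split_once(':')?;
        headers.push((name.trim().to_string(), value.trim().to_string()));
    }
    Some((host.to_string(), headers))
}
```

The comma caveat might be a reason to prefer a repeatable flag (one header per occurrence) over a comma-separated list.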

This is just a proposal. I'm not 100% certain about the naming yet.

Related issues: #989, #1593


mre · 2025-01-06T14:41:04Z

Also see:

mre added this to the v1.0 milestone Jan 6, 2025

This was referenced Jan 6, 2025

Add custom delay inbetween requests (prevent ban) #989 (Open)

The cache is ineffective with the default concurrency, for links in a website's theme #1593 (Open)

Test leapfrog for caching #556 (Closed)

mre added the enhancement and request-for-comments labels Jan 6, 2025

mre mentioned this issue Jan 8, 2025

Implement recursion #1603 (Draft, 7 tasks)

mre added the bug-bounty label Jan 9, 2025
