Add Per-Host Rate Limiting and Caching #1605

Open
mre opened this issue Jan 6, 2025 · 1 comment

mre commented Jan 6, 2025

Currently, lychee faces challenges with rate limiting and cache effectiveness when checking links, particularly when dealing with multiple requests to the same hosts. This leads to several issues that need to be addressed:

Current Problems

Proposed Solution

We should implement a smart per-host rate limiting and caching system that would:

  1. Track rate limits per host using a concurrent HashMap:
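use std::collections::HashMap;
use std::time::Duration;
use time::OffsetDateTime;

/// Rate-limit state and limits tracked for a single host.
struct HostConfig {
    /// When the host's current rate-limit window resets, if known.
    rate_limit_reset: Option<OffsetDateTime>,
    /// Minimum delay between consecutive requests to this host.
    request_delay: Option<Duration>,
    /// Upper bound on in-flight requests to this host.
    max_concurrent_requests: Option<u32>,
}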
  2. Implement smarter caching:
  • Maintain separate cache states per host
  3. Stretch goal: Add configuration options per host:
lychee --max-concurrency-per-host github.com=10 --delay-per-host github.com=100ms
  4. Stretch goal II: Add support for per-host headers

The idea would be to maintain a HeaderMap per host.
See #1297 for details.
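
The per-host tracking from item 1 could be sketched roughly as follows, using only the standard library. This is a sketch under assumptions, not lychee's implementation: `HostState`, `HostRegistry`, `set_delay`, and `reserve` are hypothetical names, `Instant` stands in for `OffsetDateTime` to avoid the `time` dependency, and a `Mutex<HashMap>` stands in for whichever concurrent map is eventually chosen.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical per-host state; not part of lychee today.
#[derive(Debug, Default)]
struct HostState {
    /// Minimum spacing between two requests to this host.
    request_delay: Option<Duration>,
    /// Earliest instant at which the next request may start.
    next_allowed: Option<Instant>,
}

/// Hypothetical registry keyed by host name (e.g. "github.com").
#[derive(Debug, Default)]
struct HostRegistry {
    hosts: Mutex<HashMap<String, HostState>>,
}

impl HostRegistry {
    /// Sets the delay between requests for one host.
    fn set_delay(&self, host: &str, delay: Duration) {
        let mut hosts = self.hosts.lock().unwrap();
        hosts.entry(host.to_owned()).or_default().request_delay = Some(delay);
    }

    /// Reserves the next request slot for `host` and returns how long
    /// the caller should wait (relative to `now`) before sending.
    fn reserve(&self, host: &str, now: Instant) -> Duration {
        let mut hosts = self.hosts.lock().unwrap();
        let state = hosts.entry(host.to_owned()).or_default();
        // The request starts at `now` or at the reserved slot, whichever is later.
        let start = match state.next_allowed {
            Some(t) if t > now => t,
            _ => now,
        };
        // Push the next slot back by the configured delay, if any.
        state.next_allowed = Some(start + state.request_delay.unwrap_or(Duration::ZERO));
        start - now
    }
}
```

A caller would sleep for the returned duration before issuing the request. Hosts without a configured delay always get a zero wait, so unconfigured hosts behave as they do today.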

Implementation Notes

  • Use the existing rate-limits crate, which is mostly useful for APIs
  • Handle 429 responses with proper backoff using response headers when available
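
As an illustration of the 429 handling, here is a minimal sketch of choosing a backoff delay. It does not use the rate-limits crate; `backoff_after_429` is a hypothetical helper, and the 500ms base and 60s cap are assumed values. Only the delay-in-seconds form of `Retry-After` is parsed; anything else falls back to exponential backoff.

```rust
use std::time::Duration;

/// Hypothetical helper: decides how long to wait after an HTTP 429.
///
/// `retry_after` is the raw `Retry-After` header value, if the server sent one.
/// Only the delay-in-seconds form is handled here; the HTTP-date form is
/// ignored and falls through to exponential backoff.
fn backoff_after_429(retry_after: Option<&str>, attempt: u32) -> Duration {
    const BASE: Duration = Duration::from_millis(500); // assumed starting delay
    const MAX: Duration = Duration::from_secs(60); // assumed upper bound

    // Prefer the server's own hint when it parses as whole seconds.
    if let Some(secs) = retry_after.and_then(|v| v.trim().parse::<u64>().ok()) {
        return Duration::from_secs(secs).min(MAX);
    }

    // Otherwise double the delay per attempt: 500ms, 1s, 2s, ... capped at MAX.
    let factor = 2u32.saturating_pow(attempt);
    BASE.saturating_mul(factor).min(MAX)
}
```

The returned delay could feed the per-host `rate_limit_reset` field, so that all pending requests to that host wait, not only the one that received the 429.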

Benefits

  • Prevents IP bans from aggressive checking
  • More efficient resource usage
  • Better compliance with API rate limits
  • Improved cache effectiveness: because each host has its own cache, there would be no synchronization issues between hosts
  • Faster overall execution by avoiding unnecessary retries
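
To make the per-host cache idea concrete, here is a rough sketch of a cache partitioned by host. `CachedStatus` and `PerHostCache` are hypothetical types, and the caller is assumed to have extracted the host from the URL already. Sharing this across threads would still need a lock per partition or a per-host owner task, which the sketch leaves out.

```rust
use std::collections::HashMap;

/// Hypothetical cached outcome of a link check.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum CachedStatus {
    Ok,
    Broken,
}

/// Hypothetical cache partitioned by host: host -> (URL -> status).
#[derive(Debug, Default)]
struct PerHostCache {
    hosts: HashMap<String, HashMap<String, CachedStatus>>,
}

impl PerHostCache {
    /// Records the result for `url` under its host's partition.
    fn insert(&mut self, host: &str, url: &str, status: CachedStatus) {
        self.hosts
            .entry(host.to_owned())
            .or_default()
            .insert(url.to_owned(), status);
    }

    /// Looks up a cached result; only the partition of `host` is consulted.
    fn get(&self, host: &str, url: &str) -> Option<CachedStatus> {
        self.hosts.get(host)?.get(url).copied()
    }

    /// Drops everything cached for one host without touching the others.
    fn invalidate_host(&mut self, host: &str) {
        self.hosts.remove(host);
    }
}
```

Because each partition is independent, invalidating one host after a rate-limit incident leaves the results for every other host intact.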

Examples

Configuration file example:

[hosts."github.com"]
max_concurrent_requests = 10
request_delay = "100ms"
headers = { Authorization = "token ghp_xxxx", "User-Agent" = "my-bot" }

[hosts."api.example.com"]
max_concurrent_requests = 1
request_delay = "1s"
headers = { "X-API-Key" = "secret", Accept = "application/json" }
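
The `request_delay` values above are strings such as "100ms" and "1s", so they would need to be converted into a `Duration`. A minimal sketch of such a conversion follows; `parse_delay` is a hypothetical helper, and the supported suffixes ("ms", "s", "m") are an assumption of this sketch, not a decided format.

```rust
use std::time::Duration;

/// Hypothetical parser for delay values such as "100ms" or "1s".
/// Supported suffixes (an assumption of this sketch): "ms", "s", "m".
fn parse_delay(input: &str) -> Result<Duration, String> {
    let input = input.trim();
    // Split at the first non-digit character: "100ms" -> ("100", "ms").
    let split = input
        .find(|c: char| !c.is_ascii_digit())
        .ok_or_else(|| format!("missing unit in {:?}", input))?;
    let (number, unit) = input.split_at(split);
    let value: u64 = number
        .parse()
        .map_err(|_| format!("invalid number in {:?}", input))?;
    match unit {
        "ms" => Ok(Duration::from_millis(value)),
        "s" => Ok(Duration::from_secs(value)),
        "m" => Ok(Duration::from_secs(value.saturating_mul(60))),
        other => Err(format!("unknown unit {:?} in {:?}", other, input)),
    }
}
```

Requiring an explicit unit avoids ambiguity about whether a bare number means seconds or milliseconds.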

CLI usage example:

lychee --max-concurrency-per-host github.com=10 --delay-per-host github.com=100ms

And when adding headers:

lychee \
  --max-concurrency-per-host github.com=10 \
  --delay-per-host github.com=100ms \
  --headers-per-host 'github.com=Authorization:token ghp_xxxx,User-Agent:my-bot'
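
The `HOST=VALUE` arguments in the proposed flags could be parsed along these lines. This is only a sketch: the flags themselves are part of the proposal and do not exist yet, and `split_host_arg` and `parse_headers_per_host` are hypothetical helper names. Header values that contain a comma are not handled here.

```rust
/// Splits a `HOST=VALUE` argument such as "github.com=10".
fn split_host_arg(arg: &str) -> Result<(&str, &str), String> {
    match arg.split_once('=') {
        Some((host, value)) if !host.is_empty() && !value.is_empty() => Ok((host, value)),
        _ => Err(format!("expected HOST=VALUE, got {:?}", arg)),
    }
}

/// Parses "github.com=Authorization:token ghp_xxxx,User-Agent:my-bot"
/// into the host and a list of (name, value) header pairs.
fn parse_headers_per_host(arg: &str) -> Result<(String, Vec<(String, String)>), String> {
    let (host, raw) = split_host_arg(arg)?;
    let mut headers = Vec::new();
    for pair in raw.split(',') {
        // Split on the first ':' only, so values may themselves contain colons.
        let (name, value) = pair
            .split_once(':')
            .ok_or_else(|| format!("expected NAME:VALUE, got {:?}", pair))?;
        headers.push((name.trim().to_owned(), value.trim().to_owned()));
    }
    Ok((host.to_owned(), headers))
}
```

The comma limitation is one reason the TOML form may be the safer way to specify headers, with the CLI flag kept for simple cases.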

This is just a proposal. I'm not 100% certain about the naming yet.

Related issues: #989, #1593
