Experiment/better crawler #104

Draft
wants to merge 6 commits into
base: master

JoshAshby
Member

This is an experiment in building a better crawler setup that lets me iterate on crawler versions without breaking the interface. It targets only the crawling step. The general idea is to produce a Document object that holds all of the page data and its dependencies as a tree, which can then be processed into an offline cache, a WARC archive, or better logging/debugging packages.

The design also keeps the future in mind: being able to produce WARC
files, and having a better set of logs and information around requests
and responses during caching to aid in debugging.

It also considers extensibility, so that newer crawlers can be
introduced while keeping the same data structures.
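
To make the idea concrete, here is a minimal sketch in Python of what a Document tree and a stable crawler interface could look like. It is not the code in this PR: the field names (`url`, `status`, `content_type`, `body`, `dependencies`), the `Crawler` protocol, the `StubCrawler` class, and the `cache_manifest` helper are all hypothetical names chosen for illustration, and the stub returns canned data rather than making real requests.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator, Protocol


@dataclass
class Document:
    """One fetched resource plus the resources it depends on (hypothetical shape)."""

    url: str
    status: int
    content_type: str
    body: bytes = b""
    dependencies: list[Document] = field(default_factory=list)

    def walk(self) -> Iterator[Document]:
        """Yield this document, then every dependency, depth-first."""
        yield self
        for dep in self.dependencies:
            yield from dep.walk()


class Crawler(Protocol):
    """Assumed interface: any crawler version only has to return a Document tree."""

    def crawl(self, url: str) -> Document: ...


class StubCrawler:
    """Stand-in crawler that returns canned data instead of making requests."""

    def crawl(self, url: str) -> Document:
        css = Document(url + "/style.css", 200, "text/css", b"body{}")
        img = Document(url + "/logo.png", 200, "image/png", b"\x89PNG")
        return Document(url, 200, "text/html", b"<html></html>", [css, img])


def cache_manifest(root: Document) -> list[str]:
    """Example consumer: flatten the tree into the URLs an offline cache would store."""
    return [doc.url for doc in root.walk()]
```

Under these assumptions, a consumer such as an offline cache, a WARC writer, or a debug logger depends only on `Document`, so a newer crawler can replace `StubCrawler` without any change on the consuming side.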