    Repositories list

    • Formasaurus tells you the type of an HTML form and its fields using machine learning
      HTML
      48116122 Updated Jun 18, 2024
    • scikit-learn inspired API for CRFsuite
      Python
      2154263412 Updated Sep 25, 2023
    • agnostic

      Public
      Agnostic Database Migrations
      Python
      MIT License
      185281 Updated Aug 10, 2023
    • soft404

      Public
      A classifier for detecting soft 404 pages
      Jupyter Notebook
      145635 Updated Jul 6, 2023
    • Log TensorBoard events without touching TensorFlow
      Python
      MIT License
      5063094 Updated Dec 26, 2022
    • arachnado

      Public
      Web Crawling UI and HTTP API, based on Scrapy and Tornado
      Python
      65161165 Updated Nov 4, 2022
    • Scrapy middleware that allows crawling only new content
      Python
      MIT License
      237933 Updated Oct 31, 2022
    • autologin

      Public
      A project that attempts to log in to a website automatically, given a single seed
      Python
      Apache License 2.0
      4312395 Updated Jul 29, 2022
    • Use multiple proxies with Scrapy
      Python
      MIT License
      158738458 Updated May 20, 2022
    • eli5

      Public
      A library for debugging/inspecting machine learning classifiers and explaining their predictions
      Jupyter Notebook
      MIT License
      3332.8k14519 Updated May 1, 2022
    • Show summary of a large number of URLs in a Jupyter Notebook
      Python
      MIT License
      91701 Updated Jun 8, 2021
    • Site Hound (previously THH) is a Domain Discovery Tool
      HTML
      Apache License 2.0
      132324 Updated Jun 1, 2021
    • autopager

      Public
      Detect and classify pagination links
      HTML
      259860 Updated Sep 9, 2020
    • html-text

      Public
      Extract text from HTML
      HTML
      MIT License
      24130132 Updated Jul 22, 2020
    • A rotating SOCKS proxy using Tor, Delegate and HAProxy
      Dockerfile
      161410 Updated Dec 19, 2019
    • aquarium

      Public
      Splash + HAProxy + Docker Compose
      Python
      MIT License
      41198240 Updated Nov 29, 2018
    • Read JSON lines (jl) files, including gzipped and broken
      Python
      MIT License
      93420 Updated Nov 21, 2018
    • Scrapy extension which writes crawled items to Kafka
      Python
      MIT License
      93020 Updated Nov 8, 2018
    • Item definition and utils for storing items in CDR format for Scrapy
      Python
      MIT License
      6700 Updated Oct 29, 2018
    • Sitehound's backend
      HTML
      Apache License 2.0
      5600 Updated Oct 17, 2018
    • Scrapy middleware for autologin
      Python
      153741 Updated May 29, 2018
    • A generic crawler
      Python
      2578170 Updated May 29, 2018
    • Broad crawler for domain discovery
      Python
      MIT License
      101920 Updated May 29, 2018
    • Simple heuristic for measuring web page similarity (& data set)
      HTML
      188910 Updated May 29, 2018
    • Headless Horseman Page Classifier service
      Python
      MIT License
      5700 Updated May 29, 2018
    • deep-deep

      Public
      Adaptive crawler which uses Reinforcement Learning methods
      Jupyter Notebook
      3617000 Updated May 29, 2018
    • A collection of example Lua scripts and JS utilities
      JavaScript
      4700 Updated May 29, 2018
    • MaybeDont

      Public
      A component that tries to avoid downloading duplicate content
      Python
      MIT License
      142720 Updated May 29, 2018
    • sitehound

      Public
      This is the facade for installation and access to the individual components
      Shell
      Apache License 2.0
      81600 Updated May 29, 2018
    • A simple tool to add a new user with OpenSSH keys.
      Python
      MIT License
      1200 Updated May 29, 2018