Clean, filter and sample URLs to optimize data collection – Python & command-line – Deduplication, spam, content and language filters
Updated Dec 30, 2024 - Python
A minimal yet powerful crawler for extracting all the internal, external, and fuzzable links from a website.
A simple Git URL parser.
A type to represent, query, and manipulate a Uniform Resource Identifier.
Web scraping | Website cloner
A website URL scraper built with Python.
Check whether the URLs in a Markdown file are reachable or down.
Extract information from URLs inside shell scripts
WebBriefs is an intelligent webpage summarizer API that extracts and condenses content into concise, readable Markdown. Perfect for quickly getting the gist of any website.
A command-line URL parser, written in Python
Simple URL builder
Crawl websites and extract meaningful information from HTML and site content
Bot to generate useful links to increase the ranking of products sold on Amazon
ImageSpace is a Python application that downloads images from web pages, filters out certain types of images, and stores the valid images in a SQLite database. It utilizes the FastAPI framework for providing an API endpoint to process web pages and extract images.
A simple Python web crawler that processes URLs from web pages, handles redirects, and skips non-HTML content. It supports HTTP/HTTPS, calculates same-domain link ratios, avoids duplicate URLs, and saves results in a TSV file. Designed for easy scalability and future extensions.
A real spider at work scraping a website.
A Python library that resolves a URL to its IP address and country.
UrlShortner maps long URLs to shorter ones. The app is written entirely in Python and uses a PostgreSQL database to store the URL mappings.