Clean, cached web content for agents: Markdown + citations, robots-aware, ETag/304 caching.
Current Version: 1.0.7
See MCP Web Scrape in action! These demos show real-time extraction and processing:
- Transform messy HTML into clean, agent-ready Markdown with automatic citations
- Extract and categorize all links from any webpage with filtering options
- Get comprehensive page metadata including title, description, author, and keywords
- AI-powered content summarization for quick insights and key points
```bash
# Extract content from any webpage
npx mcp-web-scrape@1.0.7

# Example: Extract from a news article
> extract_content https://news.ycombinator.com
✅ Extracted 1,247 words with 5 citations
Clean Markdown ready for your AI agent
```
```bash
# Extract all forms from a webpage
> extract_forms https://example.com/contact
✅ Found 3 forms with 12 input fields

# Parse tables into structured data
> extract_tables https://example.com/data --format json
✅ Extracted 5 tables with 247 rows

# Find social media profiles
> extract_social_media https://company.com
✅ Found Twitter, LinkedIn, Facebook profiles

# Analyze sentiment of content
> sentiment_analysis https://blog.example.com/article
✅ Sentiment: Positive (0.85), Emotional tone: Optimistic

# Extract named entities
> extract_entities https://news.example.com/article
✅ Found 12 people, 8 organizations, 5 locations

# Check for security vulnerabilities
> scan_vulnerabilities https://mysite.com
✅ No XSS vulnerabilities found, 2 header improvements suggested

# Analyze competitor SEO
> analyze_competitors ["https://competitor1.com", "https://competitor2.com"]
✅ Competitor analysis complete: keyword gaps identified

# Monitor uptime and performance
> monitor_uptime https://mysite.com --interval 300
✅ Uptime: 99.9%, Average response: 245ms

# Generate comprehensive report
> generate_reports https://website.com --metrics ["seo", "performance", "security"]
✅ Generated 15-page analysis report
```
```bash
# Install globally
npm install -g mcp-web-scrape@1.0.7

# Try it instantly (latest version)
npx mcp-web-scrape@latest

# Try specific version
npx mcp-web-scrape@1.0.7

# Or start HTTP server
node dist/http.js
```
Add to your `~/Library/Application Support/ChatGPT/config.json`:

```json
{
  "mcpServers": {
    "web-scrape": {
      "command": "npx",
      "args": ["mcp-web-scrape@1.0.7"]
    }
  }
}
```
Add to your `~/Library/Application Support/Claude/claude_desktop_config.json`:

```json
{
  "mcpServers": {
    "web-scrape": {
      "command": "npx",
      "args": ["mcp-web-scrape@1.0.7"]
    }
  }
}
```
| Tool | Description |
|---|---|
| extract_content | Convert HTML to clean Markdown with citations |
| summarize_content | AI-powered content summarization |
| get_page_metadata | Extract title, description, author, keywords |
| extract_links | Get all links with filtering options |
| extract_images | Extract images with alt text and dimensions |
| search_content | Search within page content |
| check_url_status | Verify URL accessibility |
| validate_robots | Check robots.txt compliance |
| extract_structured_data | Parse JSON-LD, microdata, RDFa |
| compare_content | Compare two pages for changes |
| batch_extract | Process multiple URLs efficiently |
| get_cache_stats | View cache performance metrics |
| clear_cache | Manage cached content |
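The snippet below is a minimal sketch of calling one of these tools programmatically with the official MCP TypeScript SDK over stdio. The argument name (`url`) and the exact result shape are assumptions about this server's tool schema, not documented API; list the tools first to confirm.

```typescript
// Minimal sketch: call extract_content through the MCP TypeScript SDK over stdio.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

async function main() {
  // Launch the server as a child process and talk to it over stdio.
  const transport = new StdioClientTransport({
    command: "npx",
    args: ["mcp-web-scrape@1.0.7"],
  });

  const client = new Client({ name: "example-client", version: "1.0.0" });
  await client.connect(transport);

  // Discover the tools the server advertises and their input schemas.
  const { tools } = await client.listTools();
  console.log(tools.map((t) => t.name));

  // Call extract_content; the `url` parameter name is assumed, not documented here.
  const result = await client.callTool({
    name: "extract_content",
    arguments: { url: "https://news.ycombinator.com" },
  });
  console.log(JSON.stringify(result, null, 2));

  await client.close();
}

main().catch(console.error);
```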
| Tool | Description |
|---|---|
| extract_forms | Extract form elements, fields, and validation rules |
| extract_tables | Parse HTML tables with headers and structured data |
| extract_social_media | Find social media links and profiles |
| extract_contact_info | Discover emails, phone numbers, and addresses |
| extract_headings | Analyze heading structure (H1-H6) for content hierarchy |
| extract_feeds | Discover and parse RSS/Atom feeds |

| Tool | Description |
|---|---|
| convert_to_pdf | Convert web pages to PDF format with customizable settings |
| extract_text_only | Extract plain text content without formatting or HTML |
| generate_word_cloud | Generate word frequency analysis and word cloud data |
| translate_content | Translate web page content to different languages |
| extract_keywords | Extract important keywords and phrases from content |

| Tool | Description |
|---|---|
| analyze_readability | Analyze text readability using various metrics (Flesch, Gunning-Fog, etc.) |
| detect_language | Detect the primary language of web page content |
| extract_entities | Extract named entities (people, places, organizations) |
| sentiment_analysis | Analyze sentiment and emotional tone of content |
| classify_content | Classify content into categories and topics |

| Tool | Description |
|---|---|
| analyze_competitors | Analyze competitor websites for SEO and content insights |
| extract_schema_markup | Extract and validate schema.org structured data |
| check_broken_links | Check for broken links and redirects on pages |
| analyze_page_speed | Analyze page loading speed and performance metrics |
| generate_meta_tags | Generate optimized meta tags for SEO |

| Tool | Description |
|---|---|
| scan_vulnerabilities | Scan pages for common security vulnerabilities |
| check_ssl_certificate | Check SSL certificate validity and security details |
| analyze_cookies | Analyze cookies and tracking mechanisms |
| detect_tracking | Detect tracking scripts and privacy concerns |
| check_privacy_policy | Analyze privacy policy compliance and coverage |

| Tool | Description |
|---|---|
| monitor_uptime | Monitor website uptime and availability |
| track_changes_detailed | Advanced change tracking with similarity analysis |
| analyze_traffic_patterns | Analyze website traffic patterns and trends |
| benchmark_performance | Benchmark performance against competitors |
| generate_reports | Generate comprehensive analysis reports |

| Tool | Description |
|---|---|
| monitor_changes | Track content changes over time with similarity analysis |
| analyze_performance | Measure page performance, SEO, and accessibility metrics |
| generate_sitemap | Crawl websites to generate comprehensive sitemaps |
| validate_html | Validate HTML structure, accessibility, and SEO compliance |

- **Deterministic Results** – Same URL always returns identical content
- **Smart Citations** – Every fact links back to its source
- **Robots Compliant** – Respects robots.txt and rate limits
- **Lightning Fast** – ETag/304 caching + persistent storage (see the sketch below)
- **Agent-Optimized** – Clean Markdown instead of messy HTML
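The ETag/304 point is standard HTTP revalidation: store each page's ETag, send it back as `If-None-Match`, and an unchanged page answers `304 Not Modified` with no body, so the cached copy is reused. A minimal sketch of the idea (an illustration, not this project's internal code):

```typescript
// Illustration of ETag/304 revalidation with the built-in fetch (Node 18+).
const cache = new Map<string, { etag: string; body: string }>();

async function fetchWithEtag(url: string): Promise<string> {
  const cached = cache.get(url);
  const res = await fetch(url, {
    // Revalidate against the stored ETag if we have one.
    headers: cached ? { "If-None-Match": cached.etag } : {},
  });

  if (res.status === 304 && cached) {
    // Not modified: reuse the cached body without downloading it again.
    return cached.body;
  }

  const body = await res.text();
  const etag = res.headers.get("etag");
  if (etag) cache.set(url, { etag, body });
  return body;
}
```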
- ✅ Respects robots.txt by default
- ✅ Rate limiting prevents server overload
- ✅ No paywall bypass - ethical scraping only
- ✅ User-Agent identification for transparency
```bash
# Install specific version
npm install -g mcp-web-scrape@1.0.7

# Or use directly (latest)
npx mcp-web-scrape@latest

# Or use specific version
npx mcp-web-scrape@1.0.7

# Environment variables
export MCP_WEB_SCRAPE_CACHE_DIR="./cache"
export MCP_WEB_SCRAPE_USER_AGENT="MyBot/1.0"
export MCP_WEB_SCRAPE_RATE_LIMIT="1000"
```
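If you prefer to keep these settings next to the client configuration rather than in your shell, clients such as Claude Desktop also accept a per-server `env` block; a sketch reusing the example values above:

```json
{
  "mcpServers": {
    "web-scrape": {
      "command": "npx",
      "args": ["mcp-web-scrape@1.0.7"],
      "env": {
        "MCP_WEB_SCRAPE_CACHE_DIR": "./cache",
        "MCP_WEB_SCRAPE_USER_AGENT": "MyBot/1.0",
        "MCP_WEB_SCRAPE_RATE_LIMIT": "1000"
      }
    }
  }
}
```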
- STDIO (default): `mcp-web-scrape`
- HTTP/SSE: `node dist/http.js --port 3000`
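With the HTTP transport running, an MCP client connects over SSE; a sketch with the TypeScript SDK is below. The `/sse` path is an assumption to verify against the route `dist/http.js` actually exposes.

```typescript
// Sketch: connect to the HTTP/SSE transport with the MCP TypeScript SDK.
// The "/sse" endpoint path is assumed; confirm it against the running server.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { SSEClientTransport } from "@modelcontextprotocol/sdk/client/sse.js";

const transport = new SSEClientTransport(new URL("http://localhost:3000/sse"));
const client = new Client({ name: "http-example", version: "1.0.0" });

await client.connect(transport);
console.log(await client.listTools());
await client.close();
```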
Access cached content as MCP resources:
- `cache://news.ycombinator.com/path` – Cached page content
- `cache://stats` – Cache statistics
- `cache://robots/news.ycombinator.com` – Robots.txt status
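These URIs go through the standard MCP resource API. A sketch with the TypeScript SDK, assuming the server advertises its cache entries via `resources/list`:

```typescript
// Sketch: list and read cached resources through the MCP TypeScript SDK.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StdioClientTransport } from "@modelcontextprotocol/sdk/client/stdio.js";

const client = new Client({ name: "cache-reader", version: "1.0.0" });
await client.connect(
  new StdioClientTransport({ command: "npx", args: ["mcp-web-scrape@1.0.7"] })
);

// See which cache:// resources are currently available.
const { resources } = await client.listResources();
console.log(resources.map((r) => r.uri));

// Read the cache statistics resource shown above.
const stats = await client.readResource({ uri: "cache://stats" });
for (const item of stats.contents) {
  console.log(item.uri, "text" in item ? item.text : "(binary data)");
}

await client.close();
```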
We love contributions! See CONTRIBUTING.md for guidelines.
Good First Issues:
- Add new content extractors
- Improve error handling
- Write more tests
- Enhance documentation
MIT © Mahipal
Built with ❤️ for the Model Context Protocol ecosystem