Parallel web scraper in C++20 with Intel TBB for books.toscrape.com: 4-stage pipeline, concurrent data structures, automatic page discovery, robust error handling, and end-to-end analysis (extraction, statistics, report generation).


🚀 Parallel Web Scraper with Intel TBB


A high-performance parallel web scraper built with Intel Threading Building Blocks (TBB) for analyzing book data from books.toscrape.com.

Features • Installation • Usage • Architecture • Performance • API


🎯 Overview

This project implements a sophisticated parallel web scraper specifically designed to analyze book data from the books.toscrape.com website. Built with modern C++20 and Intel Threading Building Blocks (TBB), it demonstrates the power of parallel processing for web scraping tasks.

Key Highlights

  • 🔥 Parallel Processing: Utilizes Intel TBB for efficient parallel execution
  • 📊 Performance Analysis: Compares serial vs parallel execution with detailed metrics
  • 🛡️ Robust Error Handling: Comprehensive error handling and retry mechanisms
  • 📈 Scalable Architecture: Modular design with clear separation of concerns
  • ⚡ High Performance: Achieves up to 10x speedup over serial implementation

✨ Features

🚀 Parallel Processing

  • Intel TBB Integration: Leverages TBB's parallel algorithms and data structures
  • Pipeline Processing: Implements a 4-stage parallel pipeline for optimal throughput
  • Concurrent Data Structures: Uses tbb::concurrent_vector and tbb::concurrent_hash_map
  • Thread-Safe Operations: All operations are designed for concurrent execution

📊 Comprehensive Analysis

  • Book Data Extraction: Extracts titles, prices, ratings, and descriptions
  • Statistical Analysis: Calculates price statistics, rating distributions, and word counts
  • Text Processing: Performs word frequency analysis and unique word counting
  • Performance Metrics: Tracks throughput, execution time, and speedup ratios
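The text-processing step boils down to tokenizing page text and counting term frequencies. A standard-library sketch of that idea (the project's actual Analyzer interface may differ; these helper names are illustrative):

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Count word frequencies in a blob of text, case-insensitively,
// ignoring punctuation. Unique-word count is just freq.size().
inline std::unordered_map<std::string, std::size_t>
word_frequencies(const std::string& text) {
    std::unordered_map<std::string, std::size_t> freq;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        // Normalize: lowercase, keep only alphanumeric characters.
        std::string clean;
        for (unsigned char c : word)
            if (std::isalnum(c)) clean.push_back(static_cast<char>(std::tolower(c)));
        if (!clean.empty()) ++freq[clean];
    }
    return freq;
}

// Most frequent terms first, truncated to k entries -- the shape
// stored in Analysis::top_terms.
inline std::vector<std::pair<std::string, std::size_t>>
top_terms(const std::string& text, std::size_t k) {
    auto freq = word_frequencies(text);
    std::vector<std::pair<std::string, std::size_t>> terms(freq.begin(), freq.end());
    std::sort(terms.begin(), terms.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    if (terms.size() > k) terms.resize(k);
    return terms;
}
```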

πŸ› οΈ Advanced Capabilities

  • Auto-Discovery: Automatically discovers all pages on the target website
  • URL Filtering: Validates and filters URLs for security
  • Flexible Input: Supports both file-based and auto-generated URL lists
  • Multiple Output Formats: Generates detailed results and summary reports

🔧 Developer-Friendly

  • Modern C++20: Uses latest C++ features and best practices
  • CMake Build System: Cross-platform build configuration
  • Modular Design: Clean separation of concerns with dedicated classes
  • Comprehensive Documentation: Well-documented code with clear interfaces

🚀 Installation

Prerequisites

  • C++20 Compatible Compiler (GCC 10+, Clang 12+, MSVC 2019+)
  • CMake 3.20+
  • Intel TBB (Threading Building Blocks)
  • libcurl (for HTTP requests)

Windows Installation

# Install vcpkg (if not already installed)
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
.\bootstrap-vcpkg.bat

# Install dependencies
.\vcpkg install curl tbb

# Install Intel TBB (if using Intel oneAPI)
# Download from: https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html

Linux Installation

# Ubuntu/Debian
sudo apt update
sudo apt install build-essential cmake libcurl4-openssl-dev libtbb-dev

# CentOS/RHEL
sudo yum install gcc-c++ cmake libcurl-devel tbb-devel

# Or using vcpkg
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg install curl tbb

macOS Installation

# Using Homebrew
brew install cmake curl tbb

# Or using vcpkg
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg install curl tbb

💻 Usage

Basic Usage

# Build the project
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release

# Run with default settings
./build/parallel-web-scraper-tbb

# Run with custom parameters
./build/parallel-web-scraper-tbb data/urls.txt output/results.txt 8 --auto

Command Line Arguments

./parallel-web-scraper-tbb [input_file] [output_file] [threads] [--auto]
| Parameter   | Description                | Default                             |
|-------------|----------------------------|-------------------------------------|
| input_file  | Path to URLs file          | data/urls.txt                       |
| output_file | Path to output file        | output/results.txt                  |
| threads     | Number of parallel threads | std::thread::hardware_concurrency() |
| --auto      | Auto-discover all pages    | false                               |
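Argument handling matching the defaults above might look like this (a sketch, not the project's actual `main`; the `Options` struct and `parse_args` helper are hypothetical):

```cpp
#include <string>
#include <thread>
#include <vector>

// Hypothetical options mirroring the documented defaults.
struct Options {
    std::string input_file    = "data/urls.txt";
    std::string output_file   = "output/results.txt";
    unsigned    threads       = std::thread::hardware_concurrency();
    bool        auto_discover = false;
};

// Positional arguments in order; --auto is accepted anywhere.
inline Options parse_args(const std::vector<std::string>& args) {
    Options opt;
    std::size_t pos = 0;
    for (const auto& a : args) {
        if (a == "--auto") { opt.auto_discover = true; continue; }
        switch (pos++) {
            case 0: opt.input_file  = a; break;
            case 1: opt.output_file = a; break;
            case 2: opt.threads = static_cast<unsigned>(std::stoul(a)); break;
        }
    }
    return opt;
}
```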

Example Commands

# Auto-discover all pages with 16 threads
./parallel-web-scraper-tbb data/urls.txt output/results.txt 16 --auto

# Use custom URL list with 8 threads
./parallel-web-scraper-tbb my_urls.txt my_output.txt 8

# Use all available CPU cores
./parallel-web-scraper-tbb --auto

Output Files

The scraper generates three output files:

  1. results-serial.txt - Serial execution results
  2. results-parallel.txt - Parallel execution results
  3. results-summary.txt - Performance comparison summary

πŸ—οΈ Architecture

System Design

The scraper follows a 4-stage parallel pipeline architecture:

graph LR
    A[URL Generator] --> B[Downloader]
    B --> C[Parser]
    C --> D[Analyzer]
    D --> E[Storage]
    
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec

Core Components

🔗 Downloader (Downloader.hpp/cpp)

  • Purpose: Handles HTTP requests with connection pooling
  • Features:
    • Multi-connection support
    • Timeout handling
    • Retry mechanisms
    • libcurl integration

πŸ“ Parser (Parser.hpp/cpp)

  • Purpose: Extracts structured data from HTML
  • Features:
    • HTML parsing and cleaning
    • Product data extraction
    • Text content extraction
    • Error handling

📊 Analyzer (Analyzer.hpp/cpp)

  • Purpose: Performs statistical analysis on parsed data
  • Features:
    • Price statistics calculation
    • Rating distribution analysis
    • Word frequency analysis
    • Text metrics computation

💾 Storage (Storage.hpp/cpp)

  • Purpose: Saves results and generates reports
  • Features:
    • Concurrent data storage
    • Summary report generation
    • Performance metrics tracking

πŸ› οΈ Utils (Utils.hpp/cpp)

  • Purpose: Utility functions for text processing
  • Features:
    • File I/O operations
    • Text manipulation
    • HTML stripping
    • String utilities

Data Flow

sequenceDiagram
    participant M as Main
    participant D as Downloader
    participant P as Parser
    participant A as Analyzer
    participant S as Storage
    
    M->>D: URL List
    D->>D: Parallel Download
    D->>P: HTML Content
    P->>P: Parse & Extract
    P->>A: Parsed Data
    A->>A: Analyze & Summarize
    A->>S: Analysis Results
    S->>S: Save & Generate Reports

📈 Performance

Benchmark Results

Based on real-world testing with the books.toscrape.com dataset:

| Metric          | Serial       | Parallel     | Improvement  |
|-----------------|--------------|--------------|--------------|
| Execution Time  | 57.32 s      | 5.62 s       | 10.2x faster |
| Throughput      | 0.87 pages/s | 8.90 pages/s | 10.2x higher |
| CPU Utilization | ~12.5%       | ~100%        | 8x better    |
| Memory Usage    | Low          | Moderate     | Acceptable   |

Performance Characteristics

  • Scalability: Near-linear scaling with thread count, up to an optimal point
  • Memory Efficiency: Concurrent data structures minimize memory overhead
  • I/O Optimization: Connection pooling reduces network latency
  • CPU Utilization: Near-perfect parallelization of CPU-bound tasks

Optimization Features

  • Pipeline Processing: Overlaps I/O and CPU operations
  • Connection Pooling: Reuses HTTP connections
  • Concurrent Collections: TBB containers with fine-grained, largely lock-free synchronization
  • Batch Processing: Processes multiple URLs simultaneously

📚 API Documentation

Core Classes

Downloader

class Downloader {
public:
    explicit Downloader(std::size_t max_connections);
    HttpResponse get(const std::string& url) const;
};

Parser

class Parser {
public:
    ParsedPage parse(const std::string& url, const std::string& html) const;
};

Analyzer

class Analyzer {
public:
    Analysis summarize(const ParsedPage& page) const;
};

Storage

class Storage {
public:
    explicit Storage(std::string output_file);
    void save(const tbb::concurrent_vector<Analysis>& results, 
              const GlobalSummary& summary) const;
};

Data Structures

Analysis

struct Analysis {
    std::string url;
    std::string title;
    std::size_t word_count;
    std::size_t unique_words;
    std::size_t item_count;
    double avg_price;
    double min_price;
    double max_price;
    std::array<int,6> stars_hist;
    std::vector<std::pair<std::string,std::size_t>> top_terms;
};

GlobalSummary

struct GlobalSummary {
    std::size_t urls_total;
    std::size_t urls_unique;
    std::size_t pages_downloaded;
    std::size_t total_products;
    std::array<long long,6> stars_hist;
    double avg_price_all;
    double min_price_all;
    double max_price_all;
    double seconds;
    double throughput;
};
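The derived fields relate to the counters directly: throughput is pages downloaded divided by elapsed seconds, and the reported speedup is the ratio of serial to parallel wall time. Assuming the benchmark run covered books.toscrape.com's 50 catalogue pages, this reproduces the table's figures (50 / 57.32 s ≈ 0.87 pages/s, 50 / 5.62 s ≈ 8.90 pages/s, 57.32 / 5.62 ≈ 10.2x). A sketch with hypothetical helper names:

```cpp
#include <cstddef>

// throughput = pages_downloaded / seconds (as stored in GlobalSummary).
inline double throughput(std::size_t pages, double seconds) {
    return seconds > 0.0 ? static_cast<double>(pages) / seconds : 0.0;
}

// speedup = serial wall time / parallel wall time.
inline double speedup(double serial_s, double parallel_s) {
    return parallel_s > 0.0 ? serial_s / parallel_s : 0.0;
}
```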

πŸ“ Project Structure

parallel-web-scraper-tbb/
├── 📄 CMakeLists.txt           # Build configuration
├── 📄 main.cpp                 # Main application entry point
├── 📄 README.md                # This file
├── 📄 LICENSE                  # MIT License
│
├── 📁 include/                 # Header files
│   ├── 📄 Analyzer.hpp         # Analysis engine interface
│   ├── 📄 Downloader.hpp       # HTTP client interface
│   ├── 📄 Parser.hpp           # HTML parser interface
│   ├── 📄 Storage.hpp          # Data storage interface
│   └── 📄 Utils.hpp            # Utility functions interface
│
├── 📁 src/                     # Source files
│   ├── 📄 Analyzer.cpp         # Analysis engine implementation
│   ├── 📄 Downloader.cpp       # HTTP client implementation
│   ├── 📄 Parser.cpp           # HTML parser implementation
│   ├── 📄 Storage.cpp          # Data storage implementation
│   └── 📄 Utils.cpp            # Utility functions implementation
│
├── 📁 data/                    # Input data
│   └── 📄 urls.txt             # Sample URL list
│
├── 📁 output/                  # Generated output
│   ├── 📄 results-serial.txt   # Serial execution results
│   ├── 📄 results-parallel.txt # Parallel execution results
│   └── 📄 results-summary.txt  # Performance summary
│
└── 📁 build/                   # Build artifacts (generated)
    ├── 📄 parallel-web-scraper-tbb.exe
    └── 📁 Release/

🔧 Dependencies

Required Dependencies

| Dependency   | Version | Purpose                       |
|--------------|---------|-------------------------------|
| Intel TBB    | 2021.1+ | Parallel processing framework |
| libcurl      | 7.60+   | HTTP client library           |
| CMake        | 3.20+   | Build system                  |
| C++ Compiler | C++20   | Language support              |

Optional Dependencies

| Dependency   | Purpose                   |
|--------------|---------------------------|
| vcpkg        | Package manager for C++   |
| Intel oneAPI | Complete TBB installation |

🔨 Building

CMake Configuration

# Basic build
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release

# With vcpkg (Windows)
cmake -B build -S . -G "Visual Studio 17 2022" -A x64 ^
  -DCMAKE_BUILD_TYPE=Release ^
  -DCMAKE_TOOLCHAIN_FILE=C:\vcpkg\scripts\buildsystems\vcpkg.cmake

# With Intel TBB path
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release \
  -DTBB_DIR="/path/to/tbb/lib/cmake/tbb"

Compilation

# Build the project
cmake --build build --config Release

# Build with specific number of jobs
cmake --build build --config Release -j 8

# Clean build artifacts
cmake --build build --config Release --target clean

Build Options

| Option             | Description                | Default     |
|--------------------|----------------------------|-------------|
| CMAKE_BUILD_TYPE   | Build type (Debug/Release) | Release     |
| CMAKE_CXX_STANDARD | C++ standard               | 20          |
| TBB_DIR            | TBB installation path      | Auto-detect |

🤝 Contributing

We welcome contributions! Please follow these guidelines:

Development Setup

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes following the existing code style
  4. Add tests for new functionality
  5. Update documentation as needed
  6. Commit your changes: git commit -m 'Add amazing feature'
  7. Push to the branch: git push origin feature/amazing-feature
  8. Open a Pull Request

Code Style

  • Follow C++20 best practices
  • Use meaningful variable names
  • Add comprehensive comments
  • Maintain consistent formatting
  • Write self-documenting code

Testing

  • Test with different thread counts
  • Verify performance improvements
  • Check memory usage
  • Validate output correctness

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2025 Bogdan Ljubinković

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

πŸ™ Acknowledgments

  • Intel TBB Team for the excellent parallel processing framework
  • libcurl Team for the robust HTTP client library
  • books.toscrape.com for providing a great testing dataset
  • CMake Community for the cross-platform build system

⭐ Star this repository if you found it helpful!

Made with ❤️ by Bogdan Ljubinković
