Parallel web scraper in C++20 with Intel TBB for books.toscrape.com: 4-stage pipeline, concurrent data structures, automatic page discovery, robust error handling, and end-to-end analysis (extraction, statistics, report generation).


🚀 Parallel Web Scraper with Intel TBB


A high-performance parallel web scraper built with Intel Threading Building Blocks (TBB) for analyzing book data from books.toscrape.com.

Features • Installation • Usage • Architecture • Performance • API


🎯 Overview

This project implements a sophisticated parallel web scraper specifically designed to analyze book data from the books.toscrape.com website. Built with modern C++20 and Intel Threading Building Blocks (TBB), it demonstrates the power of parallel processing for web scraping tasks.

Key Highlights

  • 🔥 Parallel Processing: Utilizes Intel TBB for efficient parallel execution
  • 📊 Performance Analysis: Compares serial vs parallel execution with detailed metrics
  • 🛡️ Robust Error Handling: Comprehensive error handling and retry mechanisms
  • 📈 Scalable Architecture: Modular design with clear separation of concerns
  • ⚡ High Performance: Achieves up to 10x speedup over serial implementation

✨ Features

🚀 Parallel Processing

  • Intel TBB Integration: Leverages TBB's parallel algorithms and data structures
  • Pipeline Processing: Implements a 4-stage parallel pipeline for optimal throughput
  • Concurrent Data Structures: Uses tbb::concurrent_vector and tbb::concurrent_hash_map
  • Thread-Safe Operations: All operations are designed for concurrent execution

📊 Comprehensive Analysis

  • Book Data Extraction: Extracts titles, prices, ratings, and descriptions
  • Statistical Analysis: Calculates price statistics, rating distributions, and word counts
  • Text Processing: Performs word frequency analysis and unique word counting
  • Performance Metrics: Tracks throughput, execution time, and speedup ratios
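The text-processing step boils down to tokenizing page text and counting term frequencies. A standard-library sketch of that idea (the project's actual Analyzer interface may differ; these helper names are illustrative):

```cpp
#include <algorithm>
#include <cctype>
#include <sstream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Count word frequencies in a blob of text, case-insensitively,
// ignoring punctuation. Unique-word count is just freq.size().
inline std::unordered_map<std::string, std::size_t>
word_frequencies(const std::string& text) {
    std::unordered_map<std::string, std::size_t> freq;
    std::istringstream in(text);
    std::string word;
    while (in >> word) {
        // Normalize: lowercase, keep only alphanumeric characters.
        std::string clean;
        for (unsigned char c : word)
            if (std::isalnum(c)) clean.push_back(static_cast<char>(std::tolower(c)));
        if (!clean.empty()) ++freq[clean];
    }
    return freq;
}

// Most frequent terms first, truncated to k entries -- the shape
// stored in Analysis::top_terms.
inline std::vector<std::pair<std::string, std::size_t>>
top_terms(const std::string& text, std::size_t k) {
    auto freq = word_frequencies(text);
    std::vector<std::pair<std::string, std::size_t>> terms(freq.begin(), freq.end());
    std::sort(terms.begin(), terms.end(),
              [](const auto& a, const auto& b) { return a.second > b.second; });
    if (terms.size() > k) terms.resize(k);
    return terms;
}
```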

πŸ› οΈ Advanced Capabilities

  • Auto-Discovery: Automatically discovers all pages on the target website
  • URL Filtering: Validates and filters URLs for security
  • Flexible Input: Supports both file-based and auto-generated URL lists
  • Multiple Output Formats: Generates detailed results and summary reports

🔧 Developer-Friendly

  • Modern C++20: Uses latest C++ features and best practices
  • CMake Build System: Cross-platform build configuration
  • Modular Design: Clean separation of concerns with dedicated classes
  • Comprehensive Documentation: Well-documented code with clear interfaces

🚀 Installation

Prerequisites

  • C++20 Compatible Compiler (GCC 10+, Clang 12+, MSVC 2019+)
  • CMake 3.20+
  • Intel TBB (Threading Building Blocks)
  • libcurl (for HTTP requests)

Windows Installation

# Install vcpkg (if not already installed)
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
.\bootstrap-vcpkg.bat

# Install dependencies
.\vcpkg install curl tbb

# Install Intel TBB (if using Intel oneAPI)
# Download from: https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html

Linux Installation

# Ubuntu/Debian
sudo apt update
sudo apt install build-essential cmake libcurl4-openssl-dev libtbb-dev

# CentOS/RHEL
sudo yum install gcc-c++ cmake libcurl-devel tbb-devel

# Or using vcpkg
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg install curl tbb

macOS Installation

# Using Homebrew
brew install cmake curl tbb

# Or using vcpkg
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg install curl tbb

💻 Usage

Basic Usage

# Build the project
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release

# Run with default settings
./build/parallel-web-scraper-tbb

# Run with custom parameters
./build/parallel-web-scraper-tbb data/urls.txt output/results.txt 8 --auto

Command Line Arguments

./parallel-web-scraper-tbb [input_file] [output_file] [threads] [--auto]
| Parameter   | Description                | Default                             |
|-------------|----------------------------|-------------------------------------|
| input_file  | Path to URLs file          | data/urls.txt                       |
| output_file | Path to output file        | output/results.txt                  |
| threads     | Number of parallel threads | std::thread::hardware_concurrency() |
| --auto      | Auto-discover all pages    | false                               |
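Argument handling matching the defaults above might look like this (a sketch, not the project's actual `main`; the `Options` struct and `parse_args` helper are hypothetical):

```cpp
#include <string>
#include <thread>
#include <vector>

// Hypothetical options mirroring the documented defaults.
struct Options {
    std::string input_file    = "data/urls.txt";
    std::string output_file   = "output/results.txt";
    unsigned    threads       = std::thread::hardware_concurrency();
    bool        auto_discover = false;
};

// Positional arguments in order; --auto is accepted anywhere.
inline Options parse_args(const std::vector<std::string>& args) {
    Options opt;
    std::size_t pos = 0;
    for (const auto& a : args) {
        if (a == "--auto") { opt.auto_discover = true; continue; }
        switch (pos++) {
            case 0: opt.input_file  = a; break;
            case 1: opt.output_file = a; break;
            case 2: opt.threads = static_cast<unsigned>(std::stoul(a)); break;
        }
    }
    return opt;
}
```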

Example Commands

# Auto-discover all pages with 16 threads
./parallel-web-scraper-tbb data/urls.txt output/results.txt 16 --auto

# Use custom URL list with 8 threads
./parallel-web-scraper-tbb my_urls.txt my_output.txt 8

# Use all available CPU cores
./parallel-web-scraper-tbb --auto

Output Files

The scraper generates three output files:

  1. results-serial.txt - Serial execution results
  2. results-parallel.txt - Parallel execution results
  3. results-summary.txt - Performance comparison summary

πŸ—οΈ Architecture

System Design

The scraper follows a 4-stage parallel pipeline architecture:

graph LR
    A[URL Generator] --> B[Downloader]
    B --> C[Parser]
    C --> D[Analyzer]
    D --> E[Storage]
    
    style A fill:#e1f5fe
    style B fill:#f3e5f5
    style C fill:#e8f5e8
    style D fill:#fff3e0
    style E fill:#fce4ec

Core Components

🔗 Downloader (Downloader.hpp/cpp)

  • Purpose: Handles HTTP requests with connection pooling
  • Features:
    • Multi-connection support
    • Timeout handling
    • Retry mechanisms
    • libcurl integration

πŸ“ Parser (Parser.hpp/cpp)

  • Purpose: Extracts structured data from HTML
  • Features:
    • HTML parsing and cleaning
    • Product data extraction
    • Text content extraction
    • Error handling

📊 Analyzer (Analyzer.hpp/cpp)

  • Purpose: Performs statistical analysis on parsed data
  • Features:
    • Price statistics calculation
    • Rating distribution analysis
    • Word frequency analysis
    • Text metrics computation

💾 Storage (Storage.hpp/cpp)

  • Purpose: Saves results and generates reports
  • Features:
    • Concurrent data storage
    • Summary report generation
    • Performance metrics tracking

πŸ› οΈ Utils (Utils.hpp/cpp)

  • Purpose: Utility functions for text processing
  • Features:
    • File I/O operations
    • Text manipulation
    • HTML stripping
    • String utilities

Data Flow

sequenceDiagram
    participant M as Main
    participant D as Downloader
    participant P as Parser
    participant A as Analyzer
    participant S as Storage
    
    M->>D: URL List
    D->>D: Parallel Download
    D->>P: HTML Content
    P->>P: Parse & Extract
    P->>A: Parsed Data
    A->>A: Analyze & Summarize
    A->>S: Analysis Results
    S->>S: Save & Generate Reports

📈 Performance

Benchmark Results

Based on real-world testing with the books.toscrape.com dataset:

| Metric          | Serial       | Parallel     | Improvement  |
|-----------------|--------------|--------------|--------------|
| Execution Time  | 57.32 s      | 5.62 s       | 10.2x faster |
| Throughput      | 0.87 pages/s | 8.90 pages/s | 10.2x higher |
| CPU Utilization | ~12.5%       | ~100%        | 8x better    |
| Memory Usage    | Low          | Moderate     | Acceptable   |

Performance Characteristics

  • Scalability: Near-linear scaling with thread count, up to an optimal point
  • Memory Efficiency: Concurrent data structures minimize memory overhead
  • I/O Optimization: Connection pooling reduces network latency
  • CPU Utilization: Near-perfect parallelization of CPU-bound tasks

Optimization Features

  • Pipeline Processing: Overlaps I/O and CPU operations
  • Connection Pooling: Reuses HTTP connections
  • Concurrent Collections: TBB containers with fine-grained, largely lock-free synchronization
  • Batch Processing: Processes multiple URLs simultaneously

📚 API Documentation

Core Classes

Downloader

class Downloader {
public:
    explicit Downloader(std::size_t max_connections);
    HttpResponse get(const std::string& url) const;
};

Parser

class Parser {
public:
    ParsedPage parse(const std::string& url, const std::string& html) const;
};

Analyzer

class Analyzer {
public:
    Analysis summarize(const ParsedPage& page) const;
};

Storage

class Storage {
public:
    explicit Storage(std::string output_file);
    void save(const tbb::concurrent_vector<Analysis>& results, 
              const GlobalSummary& summary) const;
};

Data Structures

Analysis

struct Analysis {
    std::string url;
    std::string title;
    std::size_t word_count;
    std::size_t unique_words;
    std::size_t item_count;
    double avg_price;
    double min_price;
    double max_price;
    std::array<int,6> stars_hist;
    std::vector<std::pair<std::string,std::size_t>> top_terms;
};

GlobalSummary

struct GlobalSummary {
    std::size_t urls_total;
    std::size_t urls_unique;
    std::size_t pages_downloaded;
    std::size_t total_products;
    std::array<long long,6> stars_hist;
    double avg_price_all;
    double min_price_all;
    double max_price_all;
    double seconds;
    double throughput;
};
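The derived fields relate to the counters directly: throughput is pages downloaded divided by elapsed seconds, and the reported speedup is the ratio of serial to parallel wall time. Assuming the benchmark run covered books.toscrape.com's 50 catalogue pages, this reproduces the table's figures (50 / 57.32 s ≈ 0.87 pages/s, 50 / 5.62 s ≈ 8.90 pages/s, 57.32 / 5.62 ≈ 10.2x). A sketch with hypothetical helper names:

```cpp
#include <cstddef>

// throughput = pages_downloaded / seconds (as stored in GlobalSummary).
inline double throughput(std::size_t pages, double seconds) {
    return seconds > 0.0 ? static_cast<double>(pages) / seconds : 0.0;
}

// speedup = serial wall time / parallel wall time.
inline double speedup(double serial_s, double parallel_s) {
    return parallel_s > 0.0 ? serial_s / parallel_s : 0.0;
}
```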

πŸ“ Project Structure

parallel-web-scraper-tbb/
├── 📄 CMakeLists.txt           # Build configuration
├── 📄 main.cpp                 # Main application entry point
├── 📄 README.md                # This file
├── 📄 LICENSE                  # MIT License
│
├── 📁 include/                 # Header files
│   ├── 📄 Analyzer.hpp         # Analysis engine interface
│   ├── 📄 Downloader.hpp       # HTTP client interface
│   ├── 📄 Parser.hpp           # HTML parser interface
│   ├── 📄 Storage.hpp          # Data storage interface
│   └── 📄 Utils.hpp            # Utility functions interface
│
├── 📁 src/                     # Source files
│   ├── 📄 Analyzer.cpp         # Analysis engine implementation
│   ├── 📄 Downloader.cpp       # HTTP client implementation
│   ├── 📄 Parser.cpp           # HTML parser implementation
│   ├── 📄 Storage.cpp          # Data storage implementation
│   └── 📄 Utils.cpp            # Utility functions implementation
│
├── 📁 data/                    # Input data
│   └── 📄 urls.txt             # Sample URL list
│
├── 📁 output/                  # Generated output
│   ├── 📄 results-serial.txt   # Serial execution results
│   ├── 📄 results-parallel.txt # Parallel execution results
│   └── 📄 results-summary.txt  # Performance summary
│
└── 📁 build/                   # Build artifacts (generated)
    ├── 📄 parallel-web-scraper-tbb.exe
    └── 📁 Release/

🔧 Dependencies

Required Dependencies

| Dependency   | Version | Purpose                       |
|--------------|---------|-------------------------------|
| Intel TBB    | 2021.1+ | Parallel processing framework |
| libcurl      | 7.60+   | HTTP client library           |
| CMake        | 3.20+   | Build system                  |
| C++ Compiler | C++20   | Language support              |

Optional Dependencies

| Dependency   | Purpose                   |
|--------------|---------------------------|
| vcpkg        | Package manager for C++   |
| Intel oneAPI | Complete TBB installation |

🔨 Building

CMake Configuration

# Basic build
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release

# With vcpkg (Windows)
cmake -B build -S . -G "Visual Studio 17 2022" -A x64 ^
  -DCMAKE_BUILD_TYPE=Release ^
  -DCMAKE_TOOLCHAIN_FILE=C:\vcpkg\scripts\buildsystems\vcpkg.cmake

# With Intel TBB path
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release \
  -DTBB_DIR="/path/to/tbb/lib/cmake/tbb"

Compilation

# Build the project
cmake --build build --config Release

# Build with specific number of jobs
cmake --build build --config Release -j 8

# Clean build artifacts
cmake --build build --config Release --target clean

Build Options

| Option             | Description                | Default     |
|--------------------|----------------------------|-------------|
| CMAKE_BUILD_TYPE   | Build type (Debug/Release) | Release     |
| CMAKE_CXX_STANDARD | C++ standard               | 20          |
| TBB_DIR            | TBB installation path      | Auto-detect |

🤝 Contributing

We welcome contributions! Please follow these guidelines:

Development Setup

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Make your changes following the existing code style
  4. Add tests for new functionality
  5. Update documentation as needed
  6. Commit your changes: git commit -m 'Add amazing feature'
  7. Push to the branch: git push origin feature/amazing-feature
  8. Open a Pull Request

Code Style

  • Follow C++20 best practices
  • Use meaningful variable names
  • Add comprehensive comments
  • Maintain consistent formatting
  • Write self-documenting code

Testing

  • Test with different thread counts
  • Verify performance improvements
  • Check memory usage
  • Validate output correctness

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

MIT License

Copyright (c) 2025 Bogdan Ljubinković

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

πŸ™ Acknowledgments

  • Intel TBB Team for the excellent parallel processing framework
  • libcurl Team for the robust HTTP client library
  • books.toscrape.com for providing a great testing dataset
  • CMake Community for the cross-platform build system

⭐ Star this repository if you found it helpful!

Made with ❤️ by Bogdan Ljubinković
