A high-performance parallel web scraper built with Intel Threading Building Blocks (TBB) for analyzing book data from toscrape.com
Features • Installation • Usage • Architecture • Performance • API
- Overview
- Features
- Installation
- Usage
- Architecture
- Performance
- API Documentation
- Project Structure
- Dependencies
- Building
- Contributing
- License
This project implements a sophisticated parallel web scraper specifically designed to analyze book data from the books.toscrape.com website. Built with modern C++20 and Intel Threading Building Blocks (TBB), it demonstrates the power of parallel processing for web scraping tasks.
- Parallel Processing: Utilizes Intel TBB for efficient parallel execution
- Performance Analysis: Compares serial vs. parallel execution with detailed metrics
- Robust Error Handling: Comprehensive error handling and retry mechanisms
- Scalable Architecture: Modular design with clear separation of concerns
- High Performance: Achieves up to 10x speedup over the serial implementation
- Intel TBB Integration: Leverages TBB's parallel algorithms and data structures
- Pipeline Processing: Implements a 4-stage parallel pipeline for optimal throughput
- Concurrent Data Structures: Uses `tbb::concurrent_vector` and `tbb::concurrent_hash_map`
- Thread-Safe Operations: All operations are designed for concurrent execution
- Book Data Extraction: Extracts titles, prices, ratings, and descriptions
- Statistical Analysis: Calculates price statistics, rating distributions, and word counts
- Text Processing: Performs word frequency analysis and unique word counting
- Performance Metrics: Tracks throughput, execution time, and speedup ratios
- Auto-Discovery: Automatically discovers all pages on the target website
- URL Filtering: Validates and filters URLs for security
- Flexible Input: Supports both file-based and auto-generated URL lists
- Multiple Output Formats: Generates detailed results and summary reports
- Modern C++20: Uses latest C++ features and best practices
- CMake Build System: Cross-platform build configuration
- Modular Design: Clean separation of concerns with dedicated classes
- Comprehensive Documentation: Well-documented code with clear interfaces
- C++20 Compatible Compiler (GCC 10+, Clang 12+, MSVC 2019+)
- CMake 3.20+
- Intel TBB (Threading Building Blocks)
- libcurl (for HTTP requests)
# Install vcpkg (if not already installed)
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
.\bootstrap-vcpkg.bat
# Install dependencies
.\vcpkg install curl tbb
# Install Intel TBB (if using Intel oneAPI)
# Download from: https://www.intel.com/content/www/us/en/developer/tools/oneapi/toolkits.html
# Ubuntu/Debian
sudo apt update
sudo apt install build-essential cmake libcurl4-openssl-dev libtbb-dev
# CentOS/RHEL
sudo yum install gcc-c++ cmake libcurl-devel tbb-devel
# Or using vcpkg
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg install curl tbb
# Using Homebrew
brew install cmake curl tbb
# Or using vcpkg
git clone https://github.com/Microsoft/vcpkg.git
cd vcpkg
./bootstrap-vcpkg.sh
./vcpkg install curl tbb
# Build the project
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release
# Run with default settings
./build/parallel-web-scraper-tbb
# Run with custom parameters
./build/parallel-web-scraper-tbb data/urls.txt output/results.txt 8 --auto
./parallel-web-scraper-tbb [input_file] [output_file] [threads] [--auto]
Parameter | Description | Default |
---|---|---|
`input_file` | Path to URLs file | `data/urls.txt` |
`output_file` | Path to output file | `output/results.txt` |
`threads` | Number of parallel threads | `std::thread::hardware_concurrency()` |
`--auto` | Auto-discover all pages | `false` |
# Auto-discover all pages with 16 threads
./parallel-web-scraper-tbb data/urls.txt output/results.txt 16 --auto
# Use custom URL list with 8 threads
./parallel-web-scraper-tbb my_urls.txt my_output.txt 8
# Use all available CPU cores
./parallel-web-scraper-tbb --auto
The scraper generates three output files:
- `results-serial.txt`: Serial execution results
- `results-parallel.txt`: Parallel execution results
- `results-summary.txt`: Performance comparison summary
The scraper follows a 4-stage parallel pipeline architecture:
graph LR
A[URL Generator] --> B[Downloader]
B --> C[Parser]
C --> D[Analyzer]
D --> E[Storage]
style A fill:#e1f5fe
style B fill:#f3e5f5
style C fill:#e8f5e8
style D fill:#fff3e0
style E fill:#fce4ec
Downloader
- Purpose: Handles HTTP requests with connection pooling
- Features:
- Multi-connection support
- Timeout handling
- Retry mechanisms
- libcurl integration
Parser
- Purpose: Extracts structured data from HTML
- Features:
- HTML parsing and cleaning
- Product data extraction
- Text content extraction
- Error handling
Analyzer
- Purpose: Performs statistical analysis on parsed data
- Features:
- Price statistics calculation
- Rating distribution analysis
- Word frequency analysis
- Text metrics computation
Storage
- Purpose: Saves results and generates reports
- Features:
- Concurrent data storage
- Summary report generation
- Performance metrics tracking
Utils
- Purpose: Utility functions for text processing
- Features:
- File I/O operations
- Text manipulation
- HTML stripping
- String utilities
sequenceDiagram
participant M as Main
participant D as Downloader
participant P as Parser
participant A as Analyzer
participant S as Storage
M->>D: URL List
D->>D: Parallel Download
D->>P: HTML Content
P->>P: Parse & Extract
P->>A: Parsed Data
A->>A: Analyze & Summarize
A->>S: Analysis Results
S->>S: Save & Generate Reports
Based on real-world testing with the books.toscrape.com dataset:
Metric | Serial | Parallel | Improvement |
---|---|---|---|
Execution Time | 57.32s | 5.62s | 10.2x faster |
Throughput | 0.87 pages/s | 8.90 pages/s | 10.2x higher |
CPU Utilization | ~12.5% | ~100% | 8x better |
Memory Usage | Low | Moderate | Acceptable |
- Scalability: Linear scaling with thread count (up to optimal point)
- Memory Efficiency: Concurrent data structures minimize memory overhead
- I/O Optimization: Connection pooling reduces network latency
- CPU Utilization: Near-perfect parallelization of CPU-bound tasks
- Pipeline Processing: Overlaps I/O and CPU operations
- Connection Pooling: Reuses HTTP connections
- Concurrent Collections: Lock-free data structures
- Batch Processing: Processes multiple URLs simultaneously
class Downloader {
public:
explicit Downloader(std::size_t max_connections);
HttpResponse get(const std::string& url) const;
};
class Parser {
public:
ParsedPage parse(const std::string& url, const std::string& html) const;
};
class Analyzer {
public:
Analysis summarize(const ParsedPage& page) const;
};
class Storage {
public:
explicit Storage(std::string output_file);
void save(const tbb::concurrent_vector<Analysis>& results,
const GlobalSummary& summary) const;
};
struct Analysis {
std::string url;
std::string title;
std::size_t word_count;
std::size_t unique_words;
std::size_t item_count;
double avg_price;
double min_price;
double max_price;
std::array<int,6> stars_hist;
std::vector<std::pair<std::string,std::size_t>> top_terms;
};
struct GlobalSummary {
std::size_t urls_total;
std::size_t urls_unique;
std::size_t pages_downloaded;
std::size_t total_products;
std::array<long long,6> stars_hist;
double avg_price_all;
double min_price_all;
double max_price_all;
double seconds;
double throughput;
};
parallel-web-scraper-tbb/
├── CMakeLists.txt              # Build configuration
├── main.cpp                    # Main application entry point
├── README.md                   # This file
├── LICENSE                     # MIT License
│
├── include/                    # Header files
│   ├── Analyzer.hpp            # Analysis engine interface
│   ├── Downloader.hpp          # HTTP client interface
│   ├── Parser.hpp              # HTML parser interface
│   ├── Storage.hpp             # Data storage interface
│   └── Utils.hpp               # Utility functions interface
│
├── src/                        # Source files
│   ├── Analyzer.cpp            # Analysis engine implementation
│   ├── Downloader.cpp          # HTTP client implementation
│   ├── Parser.cpp              # HTML parser implementation
│   ├── Storage.cpp             # Data storage implementation
│   └── Utils.cpp               # Utility functions implementation
│
├── data/                       # Input data
│   └── urls.txt                # Sample URL list
│
├── output/                     # Generated output
│   ├── results-serial.txt      # Serial execution results
│   ├── results-parallel.txt    # Parallel execution results
│   └── results-summary.txt     # Performance summary
│
└── build/                      # Build artifacts (generated)
    ├── parallel-web-scraper-tbb.exe
    └── Release/
Dependency | Version | Purpose |
---|---|---|
Intel TBB | 2021.1+ | Parallel processing framework |
libcurl | 7.60+ | HTTP client library |
CMake | 3.20+ | Build system |
C++ Compiler | C++20 | Language support |
Dependency | Purpose |
---|---|
vcpkg | Package manager for C++ |
Intel oneAPI | Complete TBB installation |
# Basic build
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
# With vcpkg (Windows)
cmake -B build -S . -G "Visual Studio 17 2022" -A x64 \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_TOOLCHAIN_FILE=C:\vcpkg\scripts\buildsystems\vcpkg.cmake
# With Intel TBB path
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release \
-DTBB_DIR="/path/to/tbb/lib/cmake/tbb"
# Build the project
cmake --build build --config Release
# Build with specific number of jobs
cmake --build build --config Release -j 8
# Clean build
cmake --build build --config Release --target clean
Option | Description | Default |
---|---|---|
`CMAKE_BUILD_TYPE` | Build type (Debug/Release) | `Release` |
`CMAKE_CXX_STANDARD` | C++ standard | `20` |
`TBB_DIR` | TBB installation path | Auto-detect |
We welcome contributions! Please follow these guidelines:
- Fork the repository
- Create a feature branch:
git checkout -b feature/amazing-feature
- Make your changes following the existing code style
- Add tests for new functionality
- Update documentation as needed
- Commit your changes:
git commit -m 'Add amazing feature'
- Push to the branch:
git push origin feature/amazing-feature
- Open a Pull Request
- Follow C++20 best practices
- Use meaningful variable names
- Add comprehensive comments
- Maintain consistent formatting
- Write self-documenting code
- Test with different thread counts
- Verify performance improvements
- Check memory usage
- Validate output correctness
This project is licensed under the MIT License - see the LICENSE file for details.
MIT License
Copyright (c) 2025 Bogdan Ljubinković
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
- Intel TBB Team for the excellent parallel processing framework
- libcurl Team for the robust HTTP client library
- books.toscrape.com for providing a great testing dataset
- CMake Community for the cross-platform build system
⭐ Star this repository if you found it helpful!
Made with ❤️ by Bogdan Ljubinković