
Spark to MongoDB #12

Closed · wants to merge 8 commits from the spark-to-mongodb branch
Conversation

MacMat01 (Collaborator)

This pull request includes significant improvements and refactoring across multiple files, enhancing configuration management, data processing, and logging. The most important changes are the refactoring of the ConfigurationManager class, updates to the GeoJSON files, and an improved data aggregation and storage process.

Configuration Management:

  • config/configuration_manager.py: Refactored the ConfigurationManager class to simplify the configuration loading and data path setup. Removed the singleton pattern and added methods load_config and configure_data_paths for better modularity.
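
Taken together with the thread-safety note in the commit messages below, the refactor presumably looks something like this minimal sketch; the config path, the JSON format, and the key names are illustrative assumptions, not taken from the repository:

```python
import json
import threading
from pathlib import Path


class ConfigurationManager:
    """Plain class (the SingletonMeta pattern was removed); loading and
    path configuration are guarded by a lock for thread safety."""

    def __init__(self, config_path: str = "config/config.json"):  # assumed path
        self._lock = threading.Lock()
        self.config_path = Path(config_path)
        self.config: dict = {}

    def load_config(self) -> dict:
        """Load the JSON config file under the lock."""
        with self._lock:
            with open(self.config_path) as f:  # "r" is the default mode
                self.config = json.load(f)
            return self.config

    def configure_data_paths(self, base_dir: str) -> None:
        """Derive data directories from a base directory (assumed keys)."""
        with self._lock:
            base = Path(base_dir)
            self.config["raw_data_path"] = str(base / "raw")
            self.config["processed_data_path"] = str(base / "processed")
```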

GeoJSON Updates:

Data Aggregation and Storage:

  • src/analysis/analyzer.py: Enhanced data aggregation by adding aggregate_single_dataframe and aggregate_by_minute_window functions. Introduced store_geojson_to_db and store_single_geojson_to_db for saving aggregated data to MongoDB. Improved logging throughout the analysis process.
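
As a rough illustration of what aggregate_by_minute_window and store_single_geojson_to_db could look like, here is a hedged sketch; the column names, connection URI, and database name are assumptions, not taken from the repository:

```python
import logging

from pymongo import MongoClient
from pyspark.sql import DataFrame, functions as F

logger = logging.getLogger(__name__)


def aggregate_by_minute_window(df: DataFrame) -> DataFrame:
    """Bucket rows into 1-minute windows and average a numeric column
    (assumes a TimestampType "timestamp" column and a numeric "value")."""
    logger.info("Aggregating dataframe by 1-minute windows")
    return (
        df.groupBy(F.window("timestamp", "1 minute"))
          .agg(F.avg("value").alias("avg_value"))
    )


def store_single_geojson_to_db(geojson: dict, collection_name: str) -> None:
    """Insert the features of one GeoJSON document into MongoDB."""
    client = MongoClient("mongodb://localhost:27017")  # assumed URI
    collection = client["analysis"][collection_name]   # assumed database name
    result = collection.insert_many(geojson["features"])
    logger.info("Stored %d features in %s", len(result.inserted_ids), collection_name)
```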

Logging Improvements:

  • main.py: Improved logging setup and refactored the main execution flow into modular functions for better readability and error handling.
  • src/database/db_config.py: Added logging for MongoDB connection attempts and errors.
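
A plausible shape for this logging, covering both the main.py setup and the db_config connection path, is sketched below; the URI, timeout, and helper name are illustrative assumptions (the commits confirm only that `print` calls became `logger` calls):

```python
import logging

from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)


def get_mongo_client(uri: str = "mongodb://localhost:27017") -> MongoClient:
    """Connect to MongoDB, logging the attempt, success, and any failure."""
    logger.info("Attempting MongoDB connection to %s", uri)
    try:
        client = MongoClient(uri, serverSelectionTimeoutMS=5000)
        client.admin.command("ping")  # round-trip to verify the connection
        logger.info("MongoDB connection established")
        return client
    except ConnectionFailure:
        logger.exception("MongoDB connection failed")
        raise
```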

Type Annotations and Code Cleanup:

  • Introduced type hints, modernized list comprehensions, and dropped the redundant explicit read mode when opening the config file (detailed in the commit messages below).

MacMat01 and others added 8 commits March 8, 2025 11:54
Replaced the SingletonMeta pattern with a standard class implementation. Introduced thread-safe methods for loading, saving, and configuring data paths. This simplifies the design and improves readability while maintaining concurrency safety.
Introduced logging to provide detailed information on MongoDB connection success or failure. Replaced `print` statements with `logger` to standardize output and improve traceability. This enhances debugging and monitoring capabilities for the database connection process.
Reorganized and encapsulated functionality into the `DataIdMapper` class for better maintainability and clarity. Removed redundant functions and replaced print statements with logging for consistent logging practices. Updated the invoker to use the refactored `DataIdMapper` methods for data normalization.
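
The commit does not show the class body, but a `DataIdMapper` of the kind described might look like this sketch; the field name and normalization rules are invented for illustration:

```python
import logging

logger = logging.getLogger(__name__)


class DataIdMapper:
    """Encapsulates identifier normalization that previously lived in
    free functions (hypothetical internals)."""

    def __init__(self, id_field: str = "id"):  # assumed field name
        self.id_field = id_field

    def normalize(self, records: list[dict]) -> list[dict]:
        """Strip and lower-case each record's identifier, logging anomalies
        instead of printing them."""
        normalized = []
        for record in records:
            raw_id = record.get(self.id_field)
            if raw_id is None:
                logger.warning("Record missing %r field, skipped", self.id_field)
                continue
            record[self.id_field] = str(raw_id).strip().lower()
            normalized.append(record)
        logger.info("Normalized %d of %d records", len(normalized), len(records))
        return normalized
```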
Broke down the main functions into smaller, reusable modules for better readability and maintainability. Introduced error handling, logging improvements, type hints, and modernized list comprehensions. Simplified Spark data loading, preprocessing, aggregation, and configuration setup with enhanced structure.
Eliminated the explicit read mode ("r") when opening the config file, as it is the default behavior in Python. This simplifies the code without altering its functionality.
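
Concretely, the change is equivalent to the following (file name illustrative):

```python
# Before: explicit read mode
with open("config/config.json", "r") as f:
    ...

# After: "r" is Python's default mode for open()
with open("config/config.json") as f:
    ...
```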
@MacMat01 MacMat01 added the labels documentation (Improvements or additions to documentation), data processing (Using Spark), and data warehouse (Using MongoDB for processed data) on Mar 13, 2025
@MacMat01 MacMat01 linked an issue Mar 13, 2025 that may be closed by this pull request
@MacMat01 MacMat01 closed this Mar 13, 2025
@MacMat01 MacMat01 deleted the spark-to-mongodb branch March 13, 2025 11:23
@MacMat01 MacMat01 restored the spark-to-mongodb branch March 13, 2025 11:27
@MacMat01 MacMat01 deleted the spark-to-mongodb branch March 13, 2025 11:27
Successfully merging this pull request may close these issues.

Update GeoJSON with aggregated data