
Spark to MongoDB #12

Closed · wants to merge 8 commits from the spark-to-mongodb branch
Conversation

MacMat01 (Collaborator)

This pull request includes significant improvements and refactoring across multiple files, enhancing configuration management, data processing, and logging. The most important changes are the refactoring of the ConfigurationManager class, updates to the GeoJSON files, and an improved data aggregation and storage process.

Configuration Management:

  • config/configuration_manager.py: Refactored the ConfigurationManager class to simplify the configuration loading and data path setup. Removed the singleton pattern and added methods load_config and configure_data_paths for better modularity.
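
Taken together with the thread-safety note in the commit messages below, the refactor presumably looks something like this minimal sketch; the config path, the JSON format, and the key names are illustrative assumptions, not taken from the repository:

```python
import json
import threading
from pathlib import Path


class ConfigurationManager:
    """Plain class (the SingletonMeta pattern was removed); loading and
    path configuration are guarded by a lock for thread safety."""

    def __init__(self, config_path: str = "config/config.json"):  # assumed path
        self._lock = threading.Lock()
        self.config_path = Path(config_path)
        self.config: dict = {}

    def load_config(self) -> dict:
        """Load the JSON config file under the lock."""
        with self._lock:
            with open(self.config_path) as f:  # "r" is the default mode
                self.config = json.load(f)
            return self.config

    def configure_data_paths(self, base_dir: str) -> None:
        """Derive data directories from a base directory (assumed keys)."""
        with self._lock:
            base = Path(base_dir)
            self.config["raw_data_path"] = str(base / "raw")
            self.config["processed_data_path"] = str(base / "processed")
```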

GeoJSON Updates:

Data Aggregation and Storage:

  • src/analysis/analyzer.py: Enhanced data aggregation by adding aggregate_single_dataframe and aggregate_by_minute_window functions. Introduced store_geojson_to_db and store_single_geojson_to_db for saving aggregated data to MongoDB. Improved logging throughout the analysis process.
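
As a rough illustration of what aggregate_by_minute_window and store_single_geojson_to_db could look like, here is a hedged sketch; the column names, connection URI, and database name are assumptions, not taken from the repository:

```python
import logging

from pymongo import MongoClient
from pyspark.sql import DataFrame, functions as F

logger = logging.getLogger(__name__)


def aggregate_by_minute_window(df: DataFrame) -> DataFrame:
    """Bucket rows into 1-minute windows and average a numeric column
    (assumes a TimestampType "timestamp" column and a numeric "value")."""
    logger.info("Aggregating dataframe by 1-minute windows")
    return (
        df.groupBy(F.window("timestamp", "1 minute"))
          .agg(F.avg("value").alias("avg_value"))
    )


def store_single_geojson_to_db(geojson: dict, collection_name: str) -> None:
    """Insert the features of one GeoJSON document into MongoDB."""
    client = MongoClient("mongodb://localhost:27017")  # assumed URI
    collection = client["analysis"][collection_name]   # assumed database name
    result = collection.insert_many(geojson["features"])
    logger.info("Stored %d features in %s", len(result.inserted_ids), collection_name)
```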

Logging Improvements:

  • main.py: Improved logging setup and refactored the main execution flow into modular functions for better readability and error handling.
  • src/database/db_config.py: Added logging for MongoDB connection attempts and errors.
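
A plausible shape for this logging, covering both the main.py setup and the db_config connection path, is sketched below; the URI, timeout, and helper name are illustrative assumptions (the commits confirm only that `print` calls became `logger` calls):

```python
import logging

from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s: %(message)s",
)
logger = logging.getLogger(__name__)


def get_mongo_client(uri: str = "mongodb://localhost:27017") -> MongoClient:
    """Connect to MongoDB, logging the attempt, success, and any failure."""
    logger.info("Attempting MongoDB connection to %s", uri)
    try:
        client = MongoClient(uri, serverSelectionTimeoutMS=5000)
        client.admin.command("ping")  # round-trip to verify the connection
        logger.info("MongoDB connection established")
        return client
    except ConnectionFailure:
        logger.exception("MongoDB connection failed")
        raise
```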

Type Annotations and Code Cleanup:

  • Introduced type hints, modernized list comprehensions, and dropped the redundant explicit read mode when opening the config file (detailed in the commit messages below).

MacMat01 and others added 8 commits March 8, 2025 11:54
Replaced the SingletonMeta pattern with a standard class implementation. Introduced thread-safe methods for loading, saving, and configuring data paths. This simplifies the design and improves readability while maintaining concurrency safety.
Introduced logging to provide detailed information on MongoDB connection success or failure. Replaced `print` statements with `logger` to standardize output and improve traceability. This enhances debugging and monitoring capabilities for the database connection process.
Reorganized and encapsulated functionality into the `DataIdMapper` class for better maintainability and clarity. Removed redundant functions and replaced print statements with logging for consistent logging practices. Updated the invoker to use the refactored `DataIdMapper` methods for data normalization.
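
The commit does not show the class body, but a `DataIdMapper` of the kind described might look like this sketch; the field name and normalization rules are invented for illustration:

```python
import logging

logger = logging.getLogger(__name__)


class DataIdMapper:
    """Encapsulates identifier normalization that previously lived in
    free functions (hypothetical internals)."""

    def __init__(self, id_field: str = "id"):  # assumed field name
        self.id_field = id_field

    def normalize(self, records: list[dict]) -> list[dict]:
        """Strip and lower-case each record's identifier, logging anomalies
        instead of printing them."""
        normalized = []
        for record in records:
            raw_id = record.get(self.id_field)
            if raw_id is None:
                logger.warning("Record missing %r field, skipped", self.id_field)
                continue
            record[self.id_field] = str(raw_id).strip().lower()
            normalized.append(record)
        logger.info("Normalized %d of %d records", len(normalized), len(records))
        return normalized
```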
Broke down the main functions into smaller, reusable modules for better readability and maintainability. Introduced error handling, logging improvements, type hints, and modernized list comprehensions. Simplified Spark data loading, preprocessing, aggregation, and configuration setup with enhanced structure.
Eliminated the explicit read mode ("r") when opening the config file, as it is the default behavior in Python. This simplifies the code without altering its functionality.
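
Concretely, the change is equivalent to the following (file name illustrative):

```python
# Before: explicit read mode
with open("config/config.json", "r") as f:
    ...

# After: "r" is Python's default mode for open()
with open("config/config.json") as f:
    ...
```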
@MacMat01 MacMat01 added the labels documentation (Improvements or additions to documentation), data processing (Using Spark), and data warehouse (Using MongoDB for processed data) on Mar 13, 2025
@MacMat01 MacMat01 linked an issue Mar 13, 2025 that may be closed by this pull request
@MacMat01 MacMat01 closed this Mar 13, 2025
@MacMat01 MacMat01 deleted the spark-to-mongodb branch March 13, 2025 11:23
@MacMat01 MacMat01 restored the spark-to-mongodb branch March 13, 2025 11:27
@MacMat01 MacMat01 deleted the spark-to-mongodb branch March 13, 2025 11:27
Successfully merging this pull request may close these issues.

Update GeoJSON with aggregated data