This project is designed to create an advanced data architecture for analyzing traffic accident data from three major U.S. cities: Austin, Chicago, and New York City. The primary objective is to process and integrate large datasets from these cities using an ETL (Extract, Transform, Load) workflow and to model the data in a dimensional schema for in-depth analysis. The goal is to build a comprehensive data warehouse that can provide valuable insights into traffic accident trends, contributing factors, and accident hotspots, ultimately improving traffic safety, informing policy decisions, and optimizing resource allocation.
The project focuses on integrating data from multiple cities, addressing data quality issues such as missing values, and ensuring consistency through careful data cleansing. A star schema dimensional model was developed, including key dimension tables like Date, Location, Vehicle, and Contributing Factors, and fact tables such as Accident Mishaps and Contributing Factors. This structure enables powerful analysis of accident patterns, injury and fatality rates, and geographic accident hotspots.
The insights generated from this data are valuable for several stakeholders. Traffic authorities can use the findings to enhance road safety measures and improve infrastructure. Data analysts will benefit from the ability to explore accident trends and generate actionable reports. Government agencies can leverage these insights to shape effective traffic safety policies, while public safety teams can identify critical areas for intervention. By integrating and analyzing traffic accident data, this project aims to contribute to safer traffic management and provide data-driven insights to guide future decision-making.
The first step in this project was to perform data profiling on the traffic accident datasets from Austin, Chicago, and New York City. Each dataset contains a rich array of variables, including accident details, contributing factors, geographical information, and more. However, before conducting any analysis or generating insights, it was essential to assess the quality of the data, identify any issues such as missing values or inconsistencies, and determine the necessary cleaning and preprocessing steps.
The Austin Crash Dataset consists of 147,750 observations and 54 variables, with significant gaps in the data—approximately 21.6% of the cells were missing. Despite this, the dataset had no duplicate rows, ensuring that each record represented a unique traffic collision. The data types varied widely, with numeric, boolean, datetime, categorical, and text fields, requiring different types of preprocessing. The missing data, particularly in critical variables, was addressed through imputation methods, depending on the nature of the data and its relevance to the analysis. Despite the challenges posed by missing data, the dataset's comprehensiveness made it suitable for further analysis, offering a solid foundation for exploring trends and factors influencing accidents.
The Chicago Crash Dataset is another large dataset, containing 817,723 rows and a similar proportion of missing data (21.1%). Like the Austin dataset, it contained no duplicate entries, simplifying the preliminary data cleaning steps. The dataset's variables included a mix of text, boolean, datetime, and categorical fields, as well as numeric data. The significant missing data in critical areas required a careful approach to imputation, particularly for fields such as vehicle information and geographical identifiers. Nevertheless, the absence of duplicates and its manageable size made the dataset relatively easier to handle, and once missing values were addressed, it was ready for deeper analysis and transformation.
The New York City Crash Dataset is the largest of the three, with over 2 million observations and 29.5% missing data. This dataset posed the most significant challenge due to its size and the high percentage of missing information, especially in geographical identifiers like borough and zip code, as well as detailed vehicle data. Despite these challenges, the absence of duplicate rows was a major advantage, as it simplified the data cleaning process. Given the large size of the dataset, it required efficient preprocessing techniques, including optimized data extraction and transformation methods. Handling the missing data was crucial to ensure that machine learning models or statistical studies built on this dataset would yield meaningful results.
Across all three datasets, the missing data presented a substantial challenge. In particular, missing geographical information, such as street names and accident locations, could have skewed spatial analysis and made it difficult to identify accident hotspots. Additionally, the presence of various data types and inconsistencies in formatting required attention to ensure compatibility for merging and analysis. Despite these challenges, the lack of duplicate rows across all datasets ensured the uniqueness of each record, which simplified the data cleansing process.
The core objective of this project was to build an efficient dimensional data model that would facilitate powerful analysis and reporting on traffic accident data. Using a Star Schema approach, we created a set of dimension tables and fact tables to structure the accident data in a way that allows for detailed and efficient querying.
Dimension tables contain descriptive attributes related to different aspects of the traffic accident data. These tables provide the contextual information needed for analyzing accidents across multiple dimensions, such as time, location, and contributing factors. Here are the key dimension tables used in the model:
-
Dim_Date: This table is critical for temporal analysis, capturing various time-related attributes such as the day, week, month, quarter, and year. This enables the analysis of traffic accidents over different time frames, supporting trend analysis and seasonal reporting.
-
Dim_Location: This table contains information about the accident location, including street names, coordinates, and city data (e.g., borough or district). With location data, it is possible to perform spatial analysis, identifying accident hotspots and assessing the impact of specific areas on accident frequency.
-
Dim_Time: The Dim_Time table provides detailed information about the hours and minutes during which accidents occurred. It complements the Dim_Date table and enables more granular time-based analysis, such as peak accident hours or accidents occurring at different times of the day (morning, afternoon, evening).
-
Dim_Vehicle: This dimension contains data on the types of vehicles involved in accidents, including their make, model, year, and other related attributes. This allows for analysis of vehicle-specific factors such as the most common vehicle types involved in accidents or their relationship with accident severity.
-
Dim_Contribute: The Dim_Contribute table tracks the contributing factors to each accident, such as weather conditions, road conditions, driver behavior (e.g., speeding, alcohol consumption), and vehicle-related issues. Using Slowly Changing Dimensions (SCD) Type 2, this table preserves historical changes in contributing factors, ensuring that any updates or modifications to contributing factors are tracked over time.
-
Dim_Source: This table provides details about the source of the accident data, such as the reporting agency or database system. This ensures data traceability and allows for analysis of the data’s origins, ensuring transparency in reporting.
Fact tables contain quantitative data and metrics that are tied to the dimension tables. These tables track the measurable aspects of traffic accidents, such as the number of accidents, fatalities, injuries, and other key performance indicators. Here are the essential fact tables used in this project:
-
Fact_Mishaps: This is the central fact table in the model, recording detailed accident metrics such as the number of injuries, fatalities, vehicle types involved, accident severity, and associated time and location. This fact table serves as the foundation for accident analysis and provides a comprehensive view of the accidents, linking them to the corresponding dimension tables (e.g., time, location, contributing factors).
-
Fact_ContributingFactors: This table stores the multiple contributing factors for each accident. Since an accident may have several contributing factors (such as weather, road conditions, and driver behavior), this fact table allows for a many-to-many relationship between accidents and contributing factors. This provides the flexibility to perform detailed analyses on how different factors interact and influence accident outcomes.
-
Fact_UnitsInvolved: This table tracks the number and types of vehicles (or other units) involved in each accident. For example, a single accident could involve multiple vehicles, bicycles, pedestrians, or other units. This fact table enables analysis on accident frequency by vehicle type, pedestrian involvement, and other unit-related aspects.
For certain dimension tables, such as Dim_Contribute (for contributing factors), we used Slowly Changing Dimensions Type 2 (SCD2). This technique preserves the history of changes in the data by keeping both current and historical records. For example, if a contributing factor (such as "weather condition") changes over time (e.g., "snowy" to "clear"), SCD2 ensures that both the old and new values are stored, along with timestamps for when each value was valid. This approach is crucial for tracking the evolution of contributing factors and enables accurate historical analysis.
The dimensional model provides several key advantages:
- Ease of Analysis: The star schema design simplifies complex queries, allowing analysts to easily aggregate accident data across different dimensions (time, location, vehicle type, contributing factors).
- Performance Optimization: By structuring the data in a dimensional model, the system optimizes the performance of querying large datasets, enabling faster data retrieval and reporting.
- Flexibility: The model is flexible, supporting various types of analysis, including trend analysis, geographical analysis, and causal analysis (e.g., identifying contributing factors for accidents).
- Scalability: As new data becomes available (e.g., new accidents or updated contributing factors), the model can easily be expanded by adding new records to the dimension and fact tables without disrupting existing analyses.
The dimensional model developed in this project provides a robust framework for analyzing traffic accident data across multiple cities. By organizing data into dimension and fact tables, we ensure that the data is not only easy to query and analyze but also structured to support complex reporting and analytical needs. The model will be a powerful tool for uncovering insights into traffic safety, contributing factors, accident trends, and geographical hotspots, ultimately helping to inform data-driven decisions aimed at improving road safety and traffic management.
The ETL (Extract, Transform, Load) process was a critical component of this project, enabling the integration, cleaning, and transformation of raw traffic accident data into a structured, analysis-ready format. The ETL process was carried out using Talend, a powerful data integration tool, which facilitated the extraction of data from multiple sources, its transformation into the desired format, and the loading of clean data into the data warehouse. The process ensured that the traffic accident datasets from Austin, Chicago, and New York City were properly prepared and stored in a way that supports detailed analysis and reporting.
The extraction phase involved reading the raw traffic accident data from external sources and importing it into the system. The sources for this data included databases and flat files that contained accident records from the three cities. The key steps in the extraction process included:
- Ingesting Raw Data: Data was pulled from various external sources, including CSV files, databases, and APIs, depending on the format in which the data was available.
- Initial Data Validation: After extracting the data, initial validation checks were performed to ensure that the data was complete, without any immediate errors (e.g., missing fields or data type mismatches). This was essential before proceeding to the next stages of transformation.
The transformation phase was the most intensive part of the ETL process, involving data cleansing, standardization, and enrichment. During this stage, various transformations were applied to ensure the data conformed to the dimensional model's requirements and that the data was ready for loading into the database. Key steps included:
-
Handling Missing Data: Missing values in the dataset, which were prevalent across all three cities’ traffic accident records, were addressed using appropriate methods such as imputation or exclusion. The strategy for handling missing data varied depending on the variable and its importance. For example, geographical data with missing street names was imputed using nearby known locations, while other fields were excluded if they were deemed non-critical for the analysis.
-
Removing Duplicates: Duplicate rows in the dataset were identified and removed to maintain the uniqueness of each record. This was especially important since the datasets had no duplicate rows initially, ensuring data integrity across all three cities.
-
Date and Time Formatting: Dates and times were standardized across all datasets to ensure consistency. The date format was transformed into a consistent structure (e.g., MM/DD/YYYY), and times were standardized for analysis based on hour intervals to enable better time-based insights.
-
Geographical Data Normalization: Location data, such as street names and coordinates, were normalized and cleaned to ensure consistency in naming conventions and data formats. Geocoding was performed to convert latitude and longitude coordinates into standardized location names, helping with spatial analysis later on.
-
Categorical Data Encoding: Certain categorical data, such as accident types or contributing factors (e.g., weather conditions, road conditions), were encoded into consistent values or codes to facilitate easier analysis. This also involved creating mappings for categorical values, ensuring that each category was appropriately represented.
-
Type Conversion: Some data types were converted to match the target schema. For example, certain fields like IDs and numerical values that were stored as text were converted to integers or floating-point numbers, depending on the nature of the data.
-
Handling Null Values: A systematic approach was employed to handle null values, particularly in critical fields such as accident location or vehicle type. For instance, missing vehicle details were flagged as "unknown" in the dataset, ensuring that such records could still be analyzed while acknowledging their incomplete nature.
-
Contributing Factor Merge: In some cases, multiple fields related to contributing factors (such as weather, road conditions, and driver behavior) were merged into a single field. This was done using custom expressions in Talend’s tMap component. For example, a new field called "mergedContribute" was created by concatenating all non-null contributing factors. If any field was missing, the value "N/A" was used to indicate the absence of data.
-
ID Assignment and Sequencing: Unique identifiers were created for each record during the transformation process to maintain data integrity. For example, an ID field was created for each record, and sequential numbers were assigned to each accident, ensuring that every entry could be distinctly identified.
- Geocoding: One of the key transformations involved geocoding the latitude and longitude information to produce meaningful location names (e.g., street names, borough names). This was initially done using Google’s Geocoding API, but was later switched to Microsoft Power Query due to better integration capabilities and lower costs. This transformation enriched the datasets by providing more descriptive location data for accident analysis.
After the data was transformed into the desired format, the final step was to load it into the target data warehouse. This phase involved populating the dimensional model with clean, standardized data. The loading process was carried out as follows:
-
Loading into Staging Tables: Initially, transformed data was loaded into staging tables. These tables served as temporary holding areas, allowing for final validation and checks before loading the data into the permanent dimension and fact tables.
-
Data Validation: The staging tables were subjected to additional checks to verify the integrity and accuracy of the transformed data. This step ensured that the data met the business rules and constraints defined in the dimensional model.
-
Final Loading: After validation, the data was loaded into the final dimension and fact tables. These tables are structured to support fast querying and efficient reporting. The dimension tables store descriptive data, while the fact tables hold quantitative metrics related to the accidents, such as the number of fatalities, injuries, and contributing factors.
-
Logging: The ETL process also incorporated logging components to track any errors or issues during the transformation and loading stages. If any rows failed to meet the defined criteria (e.g., missing required fields), they were logged for further review. This helped maintain data integrity and ensured the accuracy of the final loaded dataset.
The overall ETL workflow facilitated a smooth transition of raw traffic accident data from external sources into a well-organized, clean, and standardized database structure. By applying various transformations such as data cleansing, geocoding, and encoding, the dataset was transformed into an analysis-ready format. The use of Talend as the ETL tool enabled efficient data processing and helped maintain high-quality data throughout the process. The resulting data warehouse is optimized for reporting and analytical purposes, supporting the generation of business insights related to traffic safety and accident trends.
Once the data was integrated, transformed, and loaded into the data warehouse, the next step was to utilize the data for business intelligence (BI) and reporting. The goal was to generate actionable insights from the processed traffic accident data, enabling stakeholders to make informed decisions regarding traffic safety, urban planning, and policy interventions. Through the use of data visualizations, dashboards, and analytical reports, this section of the project aimed to answer key business questions and identify critical trends in traffic accidents across Austin, Chicago, and New York City.
The analysis and reporting focused on answering several key business questions related to traffic accidents, which could guide decision-making and enhance safety measures. These questions include:
-
How many accidents occurred in each city?
-
Which areas in the three cities experienced the most accidents?
- This analysis aimed to identify accident hotspots by mapping accident occurrences to geographic locations (e.g., streets, intersections, or neighborhoods). This analysis helps in identifying areas that may require traffic safety interventions such as better signage, road redesigns, or increased law enforcement presence.
-
How many accidents resulted only in injuries (not fatalities)?
-
How often are pedestrians involved in accidents?
-
When do most accidents occur (Seasonality Report)?
-
How many motorists are injured or killed in accidents (by city)?
-
Which top 5 areas in each city have the highest number of fatal accidents?
-
Time-based analysis of accidents (Time of the day, day of the week, weekdays or weekends)?
-
Fatality Analysis
-
What are the most common contributing factors involved in accidents?
- This report analyzed the most frequent causes of traffic accidents, such as weather conditions, driver behavior (e.g., speeding, alcohol consumption), and road conditions. By identifying the major contributing factors, policymakers and urban planners can design targeted interventions to reduce accident rates.
The BI and reporting phase also involved the creation of various data visualizations and dashboards that would allow stakeholders to easily interpret the results of the analysis. By presenting the data visually, users can quickly identify trends, patterns, and correlations that may not be immediately obvious from raw data. Key visualizations and dashboards included:
-
Geospatial Mapping: Using the location data from the Dim_Location table, heat maps and incident maps were generated to visualize accident hotspots across the cities. This provided a clear view of high-risk areas, helping city planners focus their safety interventions.
-
Time Series Analysis: Line charts and bar graphs were created to show the trends in accident frequency over time, such as accidents by day, month, or year. This helped identify seasonal trends, whether accidents were more frequent during specific months or on certain days.
-
Severity Breakdown: A set of pie charts or stacked bar charts was used to break down the severity of accidents—whether they resulted in injuries or fatalities—by different factors such as location, time of day, and vehicle type.
-
Contributing Factor Analysis: Bar charts and word clouds were used to display the most common contributing factors to accidents. This allowed users to easily see the prevalence of specific factors (e.g., weather, driver errors) and understand their role in accident occurrences.
-
City Comparison Dashboards: A set of dashboards compared accident rates, fatalities, and contributing factors across the three cities, providing comparative insights that could inform city-specific policies or initiatives.
To support the visualization and reporting needs of stakeholders, the following tools were used:
- Power BI: A powerful tool for creating interactive dashboards and visualizations that enabled real-time analysis of traffic accident data.
- SQL Queries: Custom SQL queries were created to extract and aggregate the necessary data from the fact and dimension tables for reporting purposes. These queries supported both simple counts (e.g., number of accidents) and more complex aggregations (e.g., fatal accidents by contributing factors).
- Excel/Other Tools: For basic analysis and data exploration, Excel was used to conduct initial analysis before integrating findings into Power BI dashboards.
The analysis and reports generated from this project provided key insights into the traffic safety challenges faced by Austin, Chicago, and New York City. These insights are useful for:
- City Traffic Authorities: Helping them prioritize safety interventions and allocate resources effectively to high-risk areas.
- Urban Planners and Policymakers: Informing decisions on infrastructure investments, safety measures, and regulations aimed at reducing traffic accidents.
- Public Safety Teams: Enabling them to respond quickly to high-risk locations or times based on accident patterns.
The Business Intelligence and Reporting phase of the project transformed raw accident data into actionable insights. By answering critical business questions and visualizing the data, this phase provided a comprehensive understanding of accident trends, contributing factors, and safety concerns. The reports and dashboards created during this phase serve as a vital resource for decision-makers looking to improve traffic safety and reduce accident-related injuries and fatalities.
The Challenges and Solutions phase of the project addressed several hurdles encountered during the processing, transformation, and analysis of traffic accident data. These challenges were primarily related to data quality, system limitations, and ensuring that the ETL pipeline was scalable and efficient enough to handle large datasets across multiple cities. Each challenge was met with a tailored solution to maintain the accuracy and usability of the data while ensuring smooth processing throughout the ETL pipeline.
One of the most significant challenges encountered in this project was the high percentage of missing data across all three datasets—Austin, Chicago, and New York City. Missing data, particularly in critical fields such as accident location, contributing factors, and vehicle details, posed a risk to the completeness and reliability of the analysis.
Solution:
-
Imputation and Exclusion: Different strategies were applied to handle missing data. For some variables, such as accident location, imputation techniques were employed, filling in missing values with nearby location data or default values like “Unknown” for categorical fields. For other variables, particularly those with a large proportion of missing values, exclusion was the preferred method, removing rows or columns where the missing data was deemed non-critical or too pervasive to fill accurately.
-
Data Profiling and Validation: Extensive data profiling was conducted to identify patterns in the missing data. This helped to assess whether the missing values were random or indicative of underlying issues in the data collection process. Based on this profiling, more informed decisions were made regarding the handling of missing data, ensuring that the final dataset was as complete and accurate as possible.
Since the datasets were sourced from different cities with varying formats and structures, data inconsistencies were another major hurdle. For example, some datasets used different naming conventions for the same fields, such as varying date formats, different codes for vehicle types, and inconsistent location identifiers (e.g., street names, borough names).
Solution:
-
Data Standardization: A series of data transformation rules were applied during the ETL process to standardize data across all three cities. For example, all dates were converted to the MM/DD/YYYY format, and location names were standardized. Additionally, vehicle codes and contributing factors were standardized into consistent categories using mapping documents and data dictionaries to ensure uniformity across the datasets.
-
Normalization: Certain fields, such as contributing factors, were normalized during transformation. Multiple contributing factors (e.g., weather, road conditions, driver behavior) that appeared in separate columns were merged into a single field using concatenation, while "N/A" values were used for missing or unknown factors. This helped to simplify the data structure and ensure consistency across accident records.
The New York City Crash Dataset was particularly large, containing over 2 million observations. This posed challenges in terms of processing speed, storage, and ensuring that the ETL pipeline could efficiently handle such a large volume of data without performance degradation.
Solution:
-
Optimized ETL Pipelines: To handle the large dataset volume efficiently, the ETL pipeline was optimized for speed and scalability. This included techniques such as batch processing to break the dataset into smaller chunks, thereby improving data load times. Talend's parallel processing capabilities were also leveraged to perform multiple operations simultaneously, reducing the overall processing time.
-
Data Partitioning: Data partitioning techniques were used to divide the dataset into manageable subsets based on time, geography, or accident severity. This made it easier to process, analyze, and visualize large datasets without overwhelming the system.
-
Resource Scaling: Cloud-based infrastructure and more powerful hardware were utilized to scale up the processing resources. This ensured that the system could handle large volumes of data and support fast processing without running into memory or performance bottlenecks.
Another key challenge was geospatial data integration, particularly the need to convert latitude and longitude coordinates into meaningful location names (such as street names or boroughs). Initially, Google’s Geocoding API was used for this task, but it quickly became clear that this service was not ideal due to system limitations and budget constraints.
Solution:
-
Switch to Microsoft Power Query: Due to the integration issues with Google’s Geocoding API and its high cost for large datasets, the project team switched to using Microsoft Power Query for geocoding. Power Query proved to be more cost-effective, seamlessly integrating into the existing ETL pipeline, and provided scalable geospatial processing for translating coordinates into addresses.
-
Batch Geocoding: To manage the large volume of geocoding requests, batch processing techniques were used to send requests in groups, thus improving the efficiency and reducing the cost per query. This also ensured that the geocoding process could be handled within the project’s budget constraints.
The initial setup used Google’s Geocoding API and basic computing resources, which did not scale well for the large datasets involved in this project. The geocoding process required substantial computational power, and the system was prone to integration issues.
Solution:
-
Cloud Integration: To overcome the limitations of on-premise infrastructure, the system was moved to the cloud, providing access to more powerful computing resources. This allowed the ETL pipeline to process large datasets faster and more efficiently.
-
Improved Toolset: The use of Talend as an ETL tool and Power BI for data visualization provided a more robust and scalable solution for the project. Talend’s integration capabilities, especially with big data sources and cloud platforms, helped mitigate earlier system limitations, allowing seamless data processing and reporting.
Ensuring the accuracy and integrity of the data was essential, particularly when combining data from different sources. Inaccurate, incomplete, or outdated data could lead to misleading insights and poor decision-making.
Solution:
-
Automated Data Validation: During the ETL process, several automated validation rules were set up to check for anomalies, errors, or inconsistencies in the data (e.g., invalid dates, incorrect values). This ensured that only high-quality data was loaded into the dimensional model.
-
Logging and Monitoring: Comprehensive logging and error handling mechanisms were implemented to track issues during extraction, transformation, and loading. This allowed the project team to quickly detect and address any data integrity issues before they impacted the final analysis.
Throughout the project, the team overcame a series of challenges related to data quality, system limitations, and large dataset volumes. By employing targeted solutions such as data imputation, standardized transformations, cloud resources, and more efficient geospatial processing, the project was able to successfully integrate and prepare traffic accident data for analysis. These efforts ensured that the resulting data warehouse and reports were accurate, reliable, and ready for actionable insights to inform traffic safety policies and urban planning.
-
Data Extraction:
- Data from three cities (Austin, Chicago, NYC) was ingested from external sources using Talend's input components.
-
Data Transformation:
- Applied various ETL transformations to clean, normalize, and structure the data (e.g., handling null values, concatenating fields, converting data types).
-
Data Loading:
- Loaded processed data into staging tables for further validation, then into dimension and fact tables for comprehensive analysis.
-
Data Analysis and Reporting:
- Generated insights using SQL queries and business intelligence tools to visualize accident trends, factors, and hotspots in different cities.
- ETL Tool: Talend
- Database: MySQL for staging and dimension/fact tables
- Data Processing: Talend’s tMap for data transformation
- Geospatial Processing: Microsoft Power Query (used for geocoding)
- Visualization Tools: Power BI or other BI tools (for data analysis and reporting)
- Missing Data: High percentages of missing data in accident records were handled by imputation or exclusion methods.
- Geospatial Integration: Initially using Google's Geocoder API, the project switched to Microsoft Power Query to address system integration issues and reduce costs.
- Scalability: Despite the large volume of data (millions of rows), the ETL processes were optimized for performance, making use of modern computing resources.
This project demonstrates the application of advanced data architecture and ETL workflows for the analysis of traffic accidents. By integrating and processing large datasets from multiple cities, it provides actionable insights that can contribute to safer city traffic management. The final dimensional model supports detailed analysis, with capabilities for trend forecasting and contributing factor identification.
- Predictive Modeling: Using the cleaned datasets for machine learning models to predict accident hotspots and potential contributing factors.
- Real-time Data Integration: Integrating live accident data feeds into the current model for more timely insights.
- Advanced Geospatial Analysis: Enhance the geospatial capabilities to provide real-time location-based insights.