Data Engineering (LTAT.02.007)
This project investigates the relationship between suicide rates and the availability of mental health services (number of professionals and beds for care) in conjunction with GDP.
- Topic: Whether mental health service availability (number of professionals and beds) and GDP correlate with suicide rates.
- Goal: To understand if and how mental health infrastructure and economic factors influence suicide rates.
- Docker: Ensure Docker is installed on your system.
- Docker Compose: Ensure Docker Compose is installed and available in your PATH.
- Clone the Repository
docker-compose up airflow-init
docker-compose up -d
- Access the Airflow Web Interface http://localhost:8080
Use the following default credentials to log in:
- Username: airflow
- Password: airflow
- Access the Minio Interface http://localhost:9001/login
Use the following default credentials to log in:
- Username: minioadmin
- Password: minioadmin
- Create a bucket named warehouse.
- Access the Streamlit + Geopandas Interface http://localhost:8501
- LLM Query Interface for Data Engineering Course http://localhost:8009/
  - Set your OpenAI API key in llm\app.py (openai.api_key), then start the app:
cd llm
uvicorn app:app --port 8089
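Once the containers and the LLM app are started, a quick way to see which interfaces are reachable is a small port check. The following is a minimal sketch, not part of the repository: the service labels and the helper names `is_port_open` and `check_services` are illustrative, and the LLM port is taken from the uvicorn command above.

```python
import socket

# Ports as listed in the setup steps; the labels are illustrative only.
SERVICES = {
    "Airflow": 8080,
    "MinIO console": 9001,
    "Streamlit": 8501,
    "LLM query app (uvicorn)": 8089,  # port from the uvicorn command
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(host: str = "localhost") -> dict[str, bool]:
    """Map each service label to whether its port accepts connections."""
    return {name: is_port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'not reachable'}")
```

An open port only shows that something is listening; it does not confirm that the service behind it finished initializing.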
- Suicide Rate
  - Source: WHO Suicide Rates
- Beds for Mental Health
  - General hospitals (per 100,000): WHO Beds in General Hospitals
  - Community residential facilities (per 100,000): WHO Beds in Community Residential Facilities
  - Mental hospitals (per 100,000): WHO Beds in Mental Hospitals
- Human Resources for Mental Health
  - Psychiatrists (per 100,000): WHO Psychiatrists
  - Nurses (per 100,000): WHO Nurses
  - Psychologists (per 100,000): WHO Psychologists
- GDP Data
  - Gross Domestic Product (in USD): World Bank GDP Data
- Mental Health
  - Mental illnesses prevalence: Dataset
- Is a higher number of beds for mental health patients associated with lower suicide rates?
- Is there a correlation between GDP and mental health service availability?
- Does the number of mental health professionals influence suicide rates? Are some roles more impactful than others?
- Further exploratory questions may arise during data analysis.
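The questions above are correlation questions, so one possible starting point is the Pearson coefficient between two per-country series. The sketch below is only an illustration: the function name `pearson` and the variable names are hypothetical, and the numbers are made up, not taken from the WHO or World Bank datasets.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up illustrative values (one entry per country), NOT real data.
beds_per_100k = [10.0, 20.0, 30.0, 40.0]
suicide_rate_per_100k = [14.0, 12.0, 11.0, 9.0]

r = pearson(beds_per_100k, suicide_rate_per_100k)
print(f"Pearson r = {r:.3f}")
```

A value of r near -1 or 1 indicates a strong linear association between the two series; it does not by itself show that one factor influences the other, which is why GDP is included as an additional variable.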
- Data Orchestration: Apache Airflow
- Database: DuckDB
- Data Transformation: dbt
- Versioning: Apache Iceberg
- Visualization: Streamlit (for dashboards) and GeoPandas (for geospatial visualizations)
- Additional: LLM - ask questions that are transformed into SQL queries and executed in DuckDB against the project datasets
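The question-to-SQL flow in the last item can be sketched as follows. This is an assumption-laden illustration, not the code in llm\app.py: the model call is replaced by a hypothetical stub (`fake_llm_to_sql`), the standard-library `sqlite3` module stands in for DuckDB so the example runs without extra dependencies, and the table `suicide_rates` with its values is invented. The sketch also adds a coarse guard that only lets a single SELECT statement through, which is one possible precaution when executing generated SQL.

```python
import sqlite3


def is_read_only_select(sql: str) -> bool:
    """Coarse check: accept only a single statement that starts with SELECT."""
    stripped = sql.strip().rstrip(";").strip()
    if ";" in stripped:  # reject multiple statements (conservative)
        return False
    return stripped.lower().startswith("select")


def run_generated_sql(conn: sqlite3.Connection, sql: str) -> list[tuple]:
    """Execute generated SQL only if it passes the read-only check."""
    if not is_read_only_select(sql):
        raise ValueError("only single SELECT statements are executed")
    return conn.execute(sql).fetchall()


def fake_llm_to_sql(question: str) -> str:
    """Hypothetical stand-in for the LLM call that turns a question into SQL."""
    return "SELECT country, rate FROM suicide_rates ORDER BY rate DESC LIMIT 1"


# In-memory database with an invented table, standing in for DuckDB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE suicide_rates (country TEXT, rate REAL)")
conn.executemany(
    "INSERT INTO suicide_rates VALUES (?, ?)",
    [("A", 10.5), ("B", 7.2)],
)

rows = run_generated_sql(conn, fake_llm_to_sql("Which country has the highest rate?"))
print(rows)
```

The prefix check is deliberately simple and would also reject legitimate queries, for example ones that begin with a WITH clause; opening the database in read-only mode is a stronger safeguard where the engine supports it.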
Below is our star schema diagram:
UML Data Schema link