Python-based analysis project on the topic "Socio-economic factors associated with the number of suicides in the world"
- The datasets folder contains the datasets used in the study.
- Data preprocessing is documented in the Jupyter notebook through markdown cells.
- The "models" folder contains the saved trained ML models, their data, and a dictionary for decoding categorical variables.
This project was motivated by the goal of preventing, or at least reducing, suicide in the world. It was done as the final project for the HSE University course "Data Analysis on Python".
Presentation, full version
- Data processing (pandas)
- Interactive plots (plotly)
- Static plots (matplotlib)
- Statistical tests (scipy.stats)
- Linear regression and multiple comparisons (statsmodels)
- Machine learning (Decision Tree model and validation) (sklearn)
- Built model serialization (pickle)
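
As an illustration of the serialization step, here is a minimal sketch of saving a model together with a decoding dictionary using pickle. The file name `model_bundle.pkl`, the bundle layout and the placeholder objects are assumptions made for this example, not the actual contents of the "models" folder.

```python
import pickle
import tempfile
from pathlib import Path

# Hypothetical stand-ins: in the project these would be a fitted sklearn
# model and the real dictionary for decoding categorical variables.
model = {"kind": "placeholder for a fitted estimator"}
decoder = {"gender": {0: "female", 1: "male"}}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model_bundle.pkl"  # assumed file name

    # Save the model and its decoder side by side in one file.
    with path.open("wb") as f:
        pickle.dump({"model": model, "decoder": decoder}, f)

    # Load them back; only unpickle files from a trusted source.
    with path.open("rb") as f:
        bundle = pickle.load(f)

# Decode an encoded category back to its label.
decoded = bundle["decoder"]["gender"][1]
print(decoded)
```

Keeping the decoder next to the model means that encoded predictor values can be translated back to readable labels wherever the model is loaded.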
Three datasets are used, joined by country and year of observation:
- Suicide Rates Overview (1985 to 2021) - main dataset. Its suicides_per_100k variable is used as the target variable.
- Global Trends in Mental Health Disorder. Variables for different mental disorders are used, mainly depression and alcoholism rates.
- Inflation, Interest and Unemployment Rate. Mainly the unemployment and inflation rates are used.
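
The join by country and year can be sketched with pandas as follows. The tiny frames are made up, and every column name except `suicides_per_100k` is an assumption for illustration. The inner join is also an assumption, since the join type used in the notebook is not stated here.

```python
import pandas as pd

# Made-up rows standing in for the three datasets.
suicides = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2001, 2000],
    "suicides_per_100k": [10.0, 11.0, 5.0],
})
mental = pd.DataFrame({
    "country": ["A", "A", "B"],
    "year": [2000, 2001, 2000],
    "depression_rate": [3.1, 3.2, 4.0],      # assumed column name
})
economy = pd.DataFrame({
    "country": ["A", "B"],
    "year": [2000, 2000],
    "unemployment_rate": [7.5, 4.2],         # assumed column name
})

# Join on the shared (country, year) key; an inner join keeps only
# observations present in all three sources.
keys = ["country", "year"]
merged = (
    suicides
    .merge(mental, on=keys, how="inner")
    .merge(economy, on=keys, how="inner")
)
print(merged)
```

With an inner join, country-year pairs missing from any one dataset are dropped, so the merged table can be noticeably smaller than the main dataset.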
- The suicide rate differs statistically significantly across age groups. ✅
- The suicide rate differs statistically significantly across generational groups. ✅
- The suicide rate differs statistically significantly by gender. ✅
- The suicide rate differs statistically significantly across wealth groups in the country. ✅
- The suicide rate is negatively statistically significantly associated with the human development index (HDI). ⁉️
- The suicide rate is positively statistically significantly associated with the rates of psychiatric disorders in the country. ✅
- The level of GDP per capita is negatively statistically significantly associated with the suicide rate and with the rates of psychiatric disorders in the country. ❌
- The suicide rate in rich countries is greater than or equal to that in poor countries. ✅
- The suicide rate is positively statistically significantly associated with inflation and unemployment rates. ⁉️
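
Hypotheses of this kind can be checked with scipy.stats. The sketch below runs on synthetic data and uses the Kruskal-Wallis test for group differences and the Spearman correlation for associations. Both choices are assumptions for illustration, since the tests actually used are described in the notebook rather than here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic suicide rates (per 100k) for three hypothetical age groups.
young = rng.normal(loc=5.0, scale=1.0, size=50)
middle = rng.normal(loc=10.0, scale=1.0, size=50)
old = rng.normal(loc=15.0, scale=1.0, size=50)

# Kruskal-Wallis H-test: do the groups share the same distribution?
h_stat, p_groups = stats.kruskal(young, middle, old)

# Synthetic predictor that is positively related to the rate by construction.
depression = rng.normal(size=100)
rate = 2.0 * depression + rng.normal(scale=0.5, size=100)

# Spearman rank correlation: sign and significance of a monotonic association.
rho, p_assoc = stats.spearmanr(depression, rate)

alpha = 0.05  # assumed significance level
print("groups differ:", p_groups < alpha)
print("positive association:", rho > 0 and p_assoc < alpha)
```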
min_samples_split - The minimum number of samples required to split an internal node
max_depth - The maximum depth of the tree
The hyperparameters were tuned with a halving grid search during model fitting.
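
A minimal sketch of such a search with scikit-learn's `HalvingGridSearchCV` and a `DecisionTreeRegressor` follows. The synthetic data, the grid values and the `factor`, `cv` and `random_state` settings are assumptions for illustration, not the project's actual configuration.

```python
import numpy as np
# HalvingGridSearchCV is experimental and must be enabled explicitly.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingGridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

# Synthetic regression data: the target depends mostly on the first feature.
X = rng.uniform(size=(400, 3))
y = 10.0 * X[:, 0] + rng.normal(scale=0.1, size=400)

# Assumed candidate values for the two tuned hyperparameters.
param_grid = {
    "min_samples_split": [2, 10, 50],
    "max_depth": [3, 5, 10],
}

# Successive halving: all candidates start on a small sample, and only the
# best 1/factor of them move on to the next round with more samples.
search = HalvingGridSearchCV(
    DecisionTreeRegressor(random_state=0),
    param_grid,
    factor=3,
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Compared with an exhaustive grid search, halving spends most of the fitting budget on the promising parameter combinations.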
Metrics of the final regression model:
Predictors | R² | MAE | | min_samples_split | max_depth |
---|---|---|---|---|---|
age, generation, gender, country income level, alcoholism rate, depression rate | 0.86158 | 1.10141 | 0.30066 | 103 | 20 |
The final model explains 86% of the variability in the data and predicts the suicide rate (per 100,000 population) with a mean absolute error of about 1.1.
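
For reference, the two metrics quoted above can be written out in plain Python. This is a sketch of the standard definitions on made-up numbers. The helper names `mae` and `r2` are hypothetical, and the project itself may have used the equivalent functions from sklearn.metrics.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average size of the prediction error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: share of variance explained."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Made-up observed and predicted suicide rates (per 100k).
y_true = [2.0, 4.0, 6.0, 8.0]
y_pred = [3.0, 4.0, 5.0, 8.0]

print("MAE:", mae(y_true, y_pred))
print("R2:", r2(y_true, y_pred))
```

On these toy values the MAE is 0.5 and R² is 0.9: the predictions are off by half a point on average and explain 90% of the variance.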
Including variables whose relationship with the target variable was statistically confirmed (particularly the alcoholism and depression rates) improved the predictive power of the model.
Values of the generation predictor:
Generation | Birth years |
---|---|
Generation Z | 1997-2012 |
Millennials | 1981-1996 |
Generation X | 1965-1980 |
Boomers | 1946-1964 |
Silent | 1928-1945 |
G.I. Generation | 1901-1927 |
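
The table above can be turned into a lookup from birth year to generation label. The helper `generation_for` is hypothetical and only illustrates the mapping. In the dataset the generation value may already be provided for each observation.

```python
# (label, first birth year, last birth year), taken from the table above.
GENERATIONS = [
    ("Generation Z", 1997, 2012),
    ("Millennials", 1981, 1996),
    ("Generation X", 1965, 1980),
    ("Boomers", 1946, 1964),
    ("Silent", 1928, 1945),
    ("G.I. Generation", 1901, 1927),
]


def generation_for(birth_year):
    """Return the generation label for a birth year, or None if out of range."""
    for label, first, last in GENERATIONS:
        if first <= birth_year <= last:
            return label
    return None


print(generation_for(1990))
```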
- Project goals were achieved:
  - Statistically significant socio-economic factors associated with the suicide rate (per 100,000 population) were found.
  - A model was built that predicts the target variable with high accuracy (R² ≈ 0.86, MAE ≈ 1.1).
- Interesting observations:
  - Men are the highest-risk group.
  - Younger groups have a lower risk of suicide.
  - Rates of psychiatric disorders help improve the prediction of the suicide rate.
- A basis for further research has been obtained: each subdivision of the observations (by age, sex and generation) can be analyzed in detail.