DengueAI@drivendata

General Intro

This is a machine learning project aimed at solving the 'DengAI: Predicting Disease Spread' competition at DrivenData (see: https://www.drivendata.org/competitions/44/dengai-predicting-disease-spread/page/82/).

The aim is to predict total_cases (the weekly number of dengue cases) in San Juan (Puerto Rico) and Iquitos (Peru). This readme describes the data, the preprocessing and feature engineering done so far, two of the models we tried, and the results and DrivenData scores we obtained, followed by a discussion of these results and their implications.

Data

The available features and their descriptions are as follows:

city – City abbreviations: sj for San Juan and iq for Iquitos
week_start_date – Date given in yyyy-mm-dd format
station_max_temp_c – Maximum temperature
station_min_temp_c – Minimum temperature
station_avg_temp_c – Average temperature
station_precip_mm – Total precipitation
station_diur_temp_rng_c – Diurnal temperature range
precipitation_amt_mm – Total precipitation
reanalysis_sat_precip_amt_mm – Total precipitation
reanalysis_dew_point_temp_k – Mean dew point temperature
reanalysis_air_temp_k – Mean air temperature
reanalysis_relative_humidity_percent – Mean relative humidity
reanalysis_specific_humidity_g_per_kg – Mean specific humidity
reanalysis_precip_amt_kg_per_m2 – Total precipitation
reanalysis_max_air_temp_k – Maximum air temperature
reanalysis_min_air_temp_k – Minimum air temperature
reanalysis_avg_temp_k – Average air temperature
reanalysis_tdtr_k – Diurnal temperature range
ndvi_se – Pixel southeast of city centroid
ndvi_sw – Pixel southwest of city centroid
ndvi_ne – Pixel northeast of city centroid
ndvi_nw – Pixel northwest of city centroid

Preprocessing

The raw dataset is quite gappy, as shown below:

col name	type	value count	missing values
city	object	2	0
year	int64	21	0
weekofyear	int64	53	0
week_start_date	object	1049	0
ndvi_ne	float64	1214	194
ndvi_nw	float64	1365	52
ndvi_se	float64	1395	22
ndvi_sw	float64	1388	22
precipitation_amt_mm	float64	1157	13
reanalysis_air_temp_k	float64	1176	10
reanalysis_avg_temp_k	float64	600	10
reanalysis_dew_point_temp_k	float64	1180	10
reanalysis_max_air_temp_k	float64	141	10
reanalysis_min_air_temp_k	float64	117	10
reanalysis_precip_amt_kg_per_m2	float64	1039	10
reanalysis_relative_humidity_percent	float64	1370	10
reanalysis_sat_precip_amt_mm	float64	1157	13
reanalysis_specific_humidity_g_per_kg	float64	1171	10
reanalysis_tdtr_k	float64	519	10
station_avg_temp_c	float64	492	43
station_diur_temp_rng_c	float64	470	43
station_max_temp_c	float64	73	20
station_min_temp_c	float64	73	14
station_precip_mm	float64	663	22
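
A summary of this kind can be produced with a few pandas calls. The following is a minimal sketch on a tiny made-up frame (the values are invented for illustration and are not taken from the competition data); `summarize` is a hypothetical helper name, not a function from this repository.

```python
import numpy as np
import pandas as pd


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, number of distinct values, and missing count per column."""
    return pd.DataFrame(
        {
            "type": df.dtypes.astype(str),
            "value count": df.nunique(),  # distinct values; NaN is not counted
            "missing values": df.isna().sum(),
        }
    )


# Toy stand-in for the raw features (invented values).
toy = pd.DataFrame(
    {
        "city": ["sj", "sj", "iq", "iq"],
        "weekofyear": [18, 19, 26, 27],
        "ndvi_ne": [0.12, np.nan, 0.19, np.nan],
        "station_max_temp_c": [29.4, 31.7, 32.5, np.nan],
    }
)

summary = summarize(toy)
print(summary)
```

Because `nunique` ignores missing values by default, a column with many gaps can show a low value count even when the underlying variable is continuous.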

After preprocessing (see feature_eng.py), this is what the features looked like:

col name	type	value count
city	int64	2
weekofyear	int64	52
ndvi_ne	float64	1394
ndvi_nw	float64	1405
ndvi_se	float64	1407
ndvi_sw	float64	1402
reanalysis_dew_point_temp_k	float64	1180
reanalysis_precip_amt_kg_per_m2	float64	1039
reanalysis_relative_humidity_percent	float64	1370
reanalysis_specific_humidity_g_per_kg	float64	1171
station_avg_temp_c	float64	521
station_diur_temp_rng_c	float64	503
station_max_temp_c	float64	82
station_min_temp_c	float64	77
station_precip_mm	float64	673
total_cases	int64	134
population_x	float64	30
temp_dew	bool	2
temp_dew_l4	bool	2
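
The actual preprocessing lives in feature_eng.py and is not reproduced here. As an illustration only, one common way to fill gaps in weekly weather series is to interpolate within each city, so that values never leak from one city into the other. The sketch below assumes linear interpolation and uses a hypothetical helper `fill_gaps` on invented values; the project may well use a different method.

```python
import numpy as np
import pandas as pd


def fill_gaps(df: pd.DataFrame, cols: list[str]) -> pd.DataFrame:
    """Linearly interpolate the given columns separately for each city (assumed approach)."""
    out = df.copy()
    out[cols] = out.groupby("city")[cols].transform(
        # limit_direction="both" also fills gaps at the start and end of a series
        lambda s: s.interpolate(limit_direction="both")
    )
    return out


# Toy frame with gaps in both cities (invented values).
toy = pd.DataFrame(
    {
        "city": ["sj"] * 4 + ["iq"] * 4,
        "station_avg_temp_c": [26.0, np.nan, 28.0, np.nan, np.nan, 27.0, np.nan, 29.0],
    }
)

filled = fill_gaps(toy, ["station_avg_temp_c"])
print(filled)
```

Grouping by city before interpolating matters because the two series are simply stacked in one table: without it, the last San Juan row would influence the first Iquitos row.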

Model

The models, fitted hyperparameters (if available) and the GridSearch parameters can be found in models.py. At this stage, the following models are defined:

sklearn.linear_model.LinearRegression
sklearn.tree.DecisionTreeRegressor
sklearn.ensemble.RandomForestRegressor
sklearn.ensemble.GradientBoostingRegressor
xgboost.XGBRegressor
xgboost.XGBRFRegressor

In the following, we report the results obtained by Gradient Boosting Regression and Random Forest Regression.

Gradient Boosting Regression (GBR)

After an iterative search for the best parameters using the GridSearchCV function of sklearn (see FPE.py), the sub-test (test split of the train data) results did not look promising. The spikes, which are arguably the most critical characteristic, are not captured at all, as can be seen in the pseudo-time series plot (it shows the total cases in each row of the test split of the train data, with the two cities appended to each other). As a result, the R2 and RMSE metrics are quite poor.
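
As a rough illustration of this kind of search, the sketch below runs GridSearchCV over a small grid for a GradientBoostingRegressor and then computes R2 and RMSE on a held-out split. Everything specific here is an assumption made for the example: the data are synthetic, the grid values are arbitrary, and the unshuffled 75/25 split is only one possible way to build a sub-test. The grids actually used are in FPE.py and models.py.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the engineered features and weekly case counts.
X = rng.normal(size=(300, 5))
y = rng.poisson(lam=np.exp(1.0 + 0.5 * X[:, 0]))

# Hold out the last 25% of rows as the "sub-test" split (no shuffling).
X_train, X_sub, y_train, y_sub = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

# Hypothetical grid, kept small so the example runs quickly.
param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [2, 3],
    "learning_rate": [0.05, 0.1],
}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_absolute_error",  # MAE is the competition metric
    cv=3,
)
search.fit(X_train, y_train)

# Evaluate the refitted best model on the sub-test split.
pred = search.predict(X_sub)
r2 = r2_score(y_sub, pred)
rmse = float(np.sqrt(mean_squared_error(y_sub, pred)))
print(search.best_params_, round(r2, 3), round(rmse, 3))
```

Scoring the search with MAE while reporting R2 and RMSE afterwards keeps the tuning aligned with the leaderboard metric and still exposes how badly large errors, such as missed spikes, are handled.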

However, when this model was applied to the test data and the results file (submission.csv) was submitted to the competition, it obtained a mean absolute error (MAE, the metric chosen by the competition) of 24.6106, placing us 897th, within the top 8% of roughly 12K competitors at the time of writing.

Random Forest Regression (RFR)

After a similar iterative search for the best parameters with GridSearchCV, the Random Forest Regression model gave noticeably better-looking results on the sub-test, with a clearly higher R2 and a better ability to predict the spike corresponding to a severe dengue outbreak.

Interestingly, the submission file generated with this model received a much worse score on the competition test data.

Discussion

Based on these results, a few observations can be made. Perhaps the competition metric is poorly chosen: arguably what matters most is a model's ability to capture the large spikes, but MAE does little to penalize models that miss them. It is equally possible that the test data does not exhibit the behavior observed in the train data, whether for reasons present in the dataset (e.g., interactions between variables that we did not consider) or absent from it. In the latter case, predicting the spikes might be counterproductive in terms of score, and we could remove them from the training data. A final possibility is that our RFR model is overfitted in a way that our chosen train-test splits do not reveal.
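
The point about the metric can be made concrete with a toy calculation (invented numbers, not competition data). Forecast A is accurate in ordinary weeks but misses the outbreak entirely; forecast B hits the outbreak but is off by 20 cases in every other week. The helper names `mae` and `rmse` are defined here for the example only.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


# Eight weeks of invented case counts with one outbreak week.
actual = [10, 12, 11, 150, 13, 10, 9, 11]

# A: exact in ordinary weeks, misses the spike completely.
forecast_a = [10, 12, 11, 12, 13, 10, 9, 11]
# B: captures the spike, but is 20 cases too high in every other week.
forecast_b = [30, 32, 31, 150, 33, 30, 29, 31]

print("MAE  A:", mae(actual, forecast_a), " B:", mae(actual, forecast_b))
print("RMSE A:", round(rmse(actual, forecast_a), 2), " B:", round(rmse(actual, forecast_b), 2))
```

With these numbers MAE ranks A slightly ahead of B (17.25 against 17.5), while RMSE ranks B far ahead of A (about 18.71 against about 48.79). A single missed spike costs only linearly under MAE, so a model that ignores outbreaks can still score well on it.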

Note: LightGBM has an option for Poisson regression (which should be more appropriate given that the target, total_cases, is a count). Neither LightGBM nor XGBoost provides an implementation for binomial. As a next step, it would be interesting to see how something like Prophet and/or ETSformer would perform.
