Test assignments for the data science school - Roonyx
- Choose a data set from the Kaggle repo. The data set should not have been analysed already in the available tutorials.
- Explore the main data set features and target labels. (+ 0.5 score per unique method. Min score 2, Max score 4)
- Engineer new features from the existing parameters (+ 1 score per unique feature. Min score 2, Max score 4)
- Choose and display statistics for the observations. In this step you need to formulate statistical hypotheses and test them. Each hypothesis should be meaningful and reveal a pattern in the data set (+ 3 score per hypothesis. Min score 6, Max score 12)
- Visualise the explored features and hypotheses (+ 1 score per plot. Min score 6, Max score 10)
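One way to test a hypothesis of the form "group A has a higher mean than group B" is a permutation test. The sketch below is only an illustration of the idea, not a required method: the function name `permutation_test`, the seed, and the two small samples are assumptions made up for the example and should be replaced with columns from the chosen data set.

```python
import random
import statistics


def permutation_test(group_a, group_b, n_permutations=5000, seed=17):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = statistics.fmean(group_a) - statistics.fmean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Under the null hypothesis the group labels are exchangeable,
        # so shuffle the pooled values and split them again.
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Made-up illustrative values, not taken from any real data set.
group_a = [12.1, 13.4, 11.8, 14.2, 13.9, 12.7, 13.1, 14.0]
group_b = [10.2, 11.1, 10.8, 11.9, 10.5, 11.4, 10.9, 11.6]

observed_diff, p_value = permutation_test(group_a, group_b)
print(f"difference in means: {observed_diff:.3f}, p-value: {p_value:.4f}")
```

A small p-value suggests the observed difference would be unlikely if the group labels carried no information. A parametric test (for example a t-test) is an equally valid choice when its assumptions hold.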
With the selected data set:
- Calculate the entropy of the full data set and of 2 selected groups. What is the information gain of this split? (+2 score)
- Calculate the Gini index for the same groups and compare the results (+2 score)
- Train a decision tree (DecisionTreeClassifier, random_state=17) (+2 score)
- Find the optimal maximum depth using 5-fold cross-validation (GridSearchCV) (+2 score)
- Display the final tree as an image (+2 score)
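The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the toy label lists `left` and `right`, the helper names `entropy`, `gini` and `weighted`, the depth grid 1 to 10, and the use of scikit-learn's bundled iris data as a stand-in for the selected Kaggle data set are all choices made for the example.

```python
from collections import Counter
from math import log2

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, export_graphviz


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def weighted(impurity, groups):
    """Size-weighted average impurity over the groups of a split."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * impurity(g) for g in groups)


# Toy labels (assumption): two groups produced by some candidate split.
left = [1] * 8 + [0] * 2
right = [1] * 2 + [0] * 8
full = left + right

information_gain = entropy(full) - weighted(entropy, [left, right])
gini_gain = gini(full) - weighted(gini, [left, right])
print(f"information gain: {information_gain:.4f}, Gini decrease: {gini_gain:.4f}")

# Stand-in data set (assumption); swap in the selected Kaggle data here.
X, y = load_iris(return_X_y=True)

# 5-fold cross-validated search over the maximum tree depth.
search = GridSearchCV(
    DecisionTreeClassifier(random_state=17),
    param_grid={"max_depth": list(range(1, 11))},
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)
best_tree = search.best_estimator_
print("best max_depth:", search.best_params_["max_depth"])

# DOT source for the fitted tree; render it to an image with Graphviz,
# for example `dot -Tpng tree.dot -o tree.png`.
dot_source = export_graphviz(best_tree, out_file=None, filled=True)
```

`sklearn.tree.plot_tree` is an alternative for the last step when matplotlib is available and installing Graphviz is not convenient.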
With the selected data set (if applicable; otherwise switch to one from the school repo):
- Create and train BaggingClassifier (+2 score)
- Create and train RandomForestClassifier (+2 score)
- Create and train a linear classifier (+2 score)
- Create and train a k-nearest neighbors classifier (+2 score)
- Compare the accuracy of the models
- Create an ensemble of the models and estimate its classification accuracy
- Display different accuracy metrics for the models (+ 1 score per metric. Min score 2, Max score 4)
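A possible shape for this assignment is sketched below. Several things here are assumptions rather than requirements: scikit-learn's bundled breast cancer data stands in for the selected data set, `LogisticRegression` is used as the linear classifier, soft voting via `VotingClassifier` is one of several ways to build the ensemble, and the variable names are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data set (assumption); replace with the selected data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=17
)

models = {
    "bagging": BaggingClassifier(random_state=17),
    "random_forest": RandomForestClassifier(random_state=17),
    # Scale-sensitive models get a StandardScaler in front.
    "linear": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
}

# Fit each model and record its accuracy on the held-out test set.
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))

# Soft voting averages the predicted class probabilities of all models.
ensemble = VotingClassifier(estimators=list(models.items()), voting="soft")
ensemble.fit(X_train, y_train)
predicted = ensemble.predict(X_test)
probability = ensemble.predict_proba(X_test)[:, 1]

ensemble_metrics = {
    "accuracy": accuracy_score(y_test, predicted),
    "precision": precision_score(y_test, predicted),
    "recall": recall_score(y_test, predicted),
    "f1": f1_score(y_test, predicted),
    "roc_auc": roc_auc_score(y_test, probability),
}

for name, value in accuracy.items():
    print(f"{name}: accuracy={value:.3f}")
for name, value in ensemble_metrics.items():
    print(f"ensemble {name}={value:.3f}")
```

Accuracy alone can be misleading on imbalanced classes, which is why precision, recall, F1 and ROC AUC are reported next to it.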
With the selected data set (if applicable; otherwise switch to one from the school repo):
- Create and train BaggingRegressor (+2 score)
- Create and train RandomForestRegressor (+2 score)
- Create and train a Linear Regression model (+2 score)
- Create and train a k-nearest neighbors regressor (+2 score)
- Compare the prediction error of the models
- Create an ensemble of the models and estimate its regression error
- Display different error metrics for the models (+ 1 score per metric. Min score 2, Max score 4)
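The regression variant can be sketched in the same way. The assumptions here: scikit-learn's bundled diabetes data stands in for the selected data set, `LinearRegression` is used for the linear model because the target is continuous, `VotingRegressor` (which averages the predictions) is one possible ensemble, and MAE, RMSE and R² are example metrics.

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data set (assumption); replace with the selected data.
X, y = load_diabetes(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=17
)

models = {
    "bagging": BaggingRegressor(random_state=17),
    "random_forest": RandomForestRegressor(random_state=17),
    "linear": LinearRegression(),
    "knn": make_pipeline(StandardScaler(), KNeighborsRegressor()),
}

# Fit each model and record its root mean squared error on the test set.
rmse_by_model = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    rmse_by_model[name] = mean_squared_error(y_test, model.predict(X_test)) ** 0.5

# VotingRegressor averages the predictions of the individual models.
ensemble = VotingRegressor(estimators=list(models.items()))
ensemble.fit(X_train, y_train)
predicted = ensemble.predict(X_test)

ensemble_metrics = {
    "mae": mean_absolute_error(y_test, predicted),
    "rmse": mean_squared_error(y_test, predicted) ** 0.5,
    "r2": r2_score(y_test, predicted),
}

for name, value in rmse_by_model.items():
    print(f"{name}: RMSE={value:.2f}")
for name, value in ensemble_metrics.items():
    print(f"ensemble {name}={value:.3f}")
```

Lower MAE and RMSE are better, while R² is better when closer to 1, so the models should be ranked per metric rather than by a single "accuracy" number.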
With the selected data set (if applicable; otherwise switch to one from the school repo):
- Create and train AdaBoostClassifier (+2 score)
- Create and train an XGBoost classifier (XGBClassifier) (+2 score)
- Create and train a LightGBM classifier (LGBMClassifier) (+2 score)
- Create and train a CatBoostClassifier (+2 score)
- Compare the accuracy of the models (+2 score)
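A hedged sketch of the boosting comparison follows. It assumes the `xgboost`, `lightgbm` and `catboost` packages are installed and simply skips any that are missing. The stand-in breast cancer data, the shared setting of 200 boosting rounds, and the variable names are assumptions for the example, not part of the assignment.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Stand-in data set (assumption); replace with the selected data.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=17
)

candidates = {"adaboost": AdaBoostClassifier(n_estimators=200, random_state=17)}

# The three libraries below are optional third-party packages;
# each one is skipped if it is not installed.
try:
    from xgboost import XGBClassifier

    candidates["xgboost"] = XGBClassifier(n_estimators=200, random_state=17)
except ImportError:
    pass

try:
    from lightgbm import LGBMClassifier

    candidates["lightgbm"] = LGBMClassifier(
        n_estimators=200, random_state=17, verbose=-1
    )
except ImportError:
    pass

try:
    from catboost import CatBoostClassifier

    candidates["catboost"] = CatBoostClassifier(
        iterations=200, random_seed=17, verbose=0, allow_writing_files=False
    )
except ImportError:
    pass

# Fit every available model and compare accuracy on the same test set.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

for name, value in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: accuracy={value:.3f}")
```

A single train/test split gives a noisy ranking. Cross-validated scores would make the comparison more reliable at the cost of longer training time.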
- Marketing data (one data topic per group):
- Upwork analysis
- Facebook CTF analysis
- Face recognition task
- Emotion recognition
- Age recognition and gender recognition
- Pose estimation and motion extraction
- Sequence models
- Voice timbre detection
- Voice script recognition
- ??
- Kaggle competition. Join one of the open competitions and create a kernel.