This project is part of the OpenClassrooms Data Scientist path.
Our aim is to develop a tag prediction algorithm for questions asked on Stack Overflow.
The steps are as follows:
- Merge the data files extracted from Stack Exchange
- Clean the merged dataset
- Process the text from the title and body of each question
- Train supervised classifiers for the main tags
- Predict the tags of an input by keeping the classifier predictions with the highest confidence
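The text-processing and final prediction steps can be sketched as below. This is only an illustration under assumptions, not the code in modules.py: the function names `preprocess` and `top_tags`, the parameters `k` and `threshold`, and the confidence values are all hypothetical, and the scores stand in for what one binary classifier per tag might output.

```python
import html
import re


def preprocess(title: str, body: str) -> list[str]:
    """Merge title and body, strip HTML markup, and return lowercase tokens."""
    text = f"{title} {body}"
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags such as <p> or <code>
    text = html.unescape(text).lower()    # decode entities such as &amp;
    # Keep '+' and '#' so that tokens like "c++" and "c#" survive.
    return re.findall(r"[a-z0-9+#]+", text)


def top_tags(confidences: dict[str, float], k: int = 3, threshold: float = 0.5) -> list[str]:
    """Return up to k tags, most confident first, whose score reaches the threshold."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return [tag for tag, score in ranked[:k] if score >= threshold]


tokens = preprocess(
    "How to merge two DataFrames?",
    "<p>I use <code>pandas</code> in Python.</p>",
)
# Hypothetical per-tag confidences (assumed values, not real model output).
scores = {"python": 0.92, "pandas": 0.81, "java": 0.07, "c#": 0.55, "sql": 0.30}

print(tokens)
print(top_tags(scores))
```

Combining a cap on the number of tags (`k`) with a minimum confidence (`threshold`) is one possible design choice: it avoids returning weak tags for a question that no classifier recognises well.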
Files of interest:
- modules.py: Gathers the functions used by the programs, and more
- main.py: Train and predict on a dataframe
- training.py: Train classifiers with hard-coded options
- predicting.py: Predict on an input dataframe using previously saved training material
- cleaning_exploring.ipynb: Data cleaning and exploration notebook
- tags_predicting.ipynb: Text processing and data modeling notebook