In this project, I build a model that identifies potential fraudsters from financial and e-mail data. The work proceeds in the following steps:
- data exploration (learning about the data, cleaning and preparing the data)
- feature selection and engineering (selecting the most significant features and creating new ones)
- reducing the dimensionality of the data using principal component analysis
- selecting and tuning a supervised machine learning algorithm
- validating the algorithm to ensure acceptable performance of the model
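
The steps above can be sketched as a single scikit-learn pipeline. This is only a minimal illustration, not the code in poi_id.py: the synthetic data, the choice of `SelectKBest`, `GaussianNB`, the F1 metric, and the parameter grid are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in data (assumption: NOT the project dataset).
X, y = make_classification(
    n_samples=150, n_features=12, n_informative=5,
    weights=[0.85, 0.15], random_state=42,
)

# Scaling -> univariate feature selection -> PCA -> classifier.
# The classifier is an illustrative choice, not necessarily the final one.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(score_func=f_classif)),
    ("pca", PCA(random_state=42)),
    ("clf", GaussianNB()),
])

# Hypothetical grid; n_components never exceeds the number of selected features.
param_grid = {
    "select__k": [6, 10],
    "pca__n_components": [2, 4],
}

# Repeated stratified splits keep the class ratio in every train/test split,
# which matters when the positive class is rare.
cv = StratifiedShuffleSplit(n_splits=20, test_size=0.3, random_state=42)
search = GridSearchCV(pipeline, param_grid, scoring="f1", cv=cv)
search.fit(X, y)

best_params = search.best_params_
best_f1 = search.best_score_
print(best_params, round(best_f1, 3))
```

Putting selection and PCA inside the pipeline means they are refit on each training split, so the validation score is not inflated by information from the held-out data.
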
The results are saved in the Jupyter notebook file in the repository.
The following additional files can be found in the repository:
- Enron_final.html: results in HTML format.
- final_project_dataset.pkl: dataset in pickle (.pkl) format.
- final_project_dataset_modified.pkl, my_classifier.pkl, my_dataset.pkl, my_feature_list.pkl: files generated while running the project.
- poi_id.py: script containing the Python code referred to in the results file, including the final classifier.
- tester.py: script used to test the classifier.
- tools folder: scripts used for data processing.
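
The .pkl files listed above can be read with Python's standard `pickle` module. The sketch below assumes the dataset is a dictionary mapping each person to a dictionary of features with a boolean `poi` entry; that layout is an assumption, and `load_dataset` and `count_pois` are hypothetical helpers that are not part of the repository. A toy file stands in for the real one so the example runs on its own.

```python
import os
import pickle
import tempfile


def load_dataset(path):
    """Load a pickled object from `path` (hypothetical helper)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def count_pois(data_dict):
    """Count records flagged as persons of interest (assumes a 'poi' key)."""
    return sum(1 for record in data_dict.values() if record.get("poi"))


# Toy stand-in with the assumed layout; names and values are made up.
toy = {
    "PERSON A": {"salary": 100000, "poi": False},
    "PERSON B": {"salary": "NaN", "poi": True},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_dataset.pkl")
    with open(path, "wb") as f:
        pickle.dump(toy, f)
    loaded = load_dataset(path)

n_records = len(loaded)
n_pois = count_pois(loaded)
print(n_records, n_pois)
```

To inspect the real data, pass the path of final_project_dataset.pkl to `load_dataset` instead. Only unpickle files from a source you trust, since loading a pickle can execute arbitrary code.
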