jiayuanz3/EODP-Real-world-Datasets

Explore Real-world Datasets

This is a piece of work from Project 2 of COMP20008 Elements of Data Processing (University of Melbourne).

We completed this work as a group in 09/2019, using Python (other team members: Yingjun Lu, Junjie Wu).

We implement the linkage between two datasets and measure its performance, then comment on the choice of similarity functions, the method of deriving a final score, and the threshold for deciding whether a pair is a match. We also implement a blocking method for linking the amazon.csv and google.csv datasets, and report on the proposed method and the quality of the blocking results. Afterwards, we build and assess a classifier that predicts the cellular localization sites of proteins in yeast from particular attributes.
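The linkage and blocking steps described above can be sketched roughly as follows. This is a minimal illustration, not the code in code.ipynb: the similarity function (`difflib.SequenceMatcher`), the word-token blocking keys, the 0.8 threshold, and the sample records are all assumptions chosen for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] between two lowercased strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def blocking_keys(name: str) -> set:
    """One block key per word token, so records sharing a word share a block."""
    return set(name.lower().split())


def link(left: dict, right: dict, threshold: float = 0.8) -> list:
    """Return (left_id, right_id, score) for candidate pairs scoring >= threshold."""
    # Index the right-hand records by block key.
    blocks = {}
    for rid, name in right.items():
        for key in blocking_keys(name):
            blocks.setdefault(key, set()).add(rid)

    matches = []
    for lid, name in left.items():
        # Only compare against records that share at least one block key.
        candidates = set()
        for key in blocking_keys(name):
            candidates |= blocks.get(key, set())
        for rid in sorted(candidates):
            score = similarity(name, right[rid])
            if score >= threshold:
                matches.append((lid, rid, score))
    return matches


# Made-up records for illustration; not rows from amazon.csv or google.csv.
amazon = {"a1": "Adobe Photoshop Elements 5", "a2": "Norton Antivirus 2007"}
google = {"g1": "adobe photoshop elements 5.0", "g2": "quickbooks pro 2007"}
matches = link(amazon, google)
```

In this toy run, blocking still pairs `a2` with `g2` because both names contain "2007", but the similarity score falls below the threshold, so only `a1` and `g1` are linked. Raising the threshold trades recall for precision, which is the kind of trade-off the report comments on.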

Preprocessing (impute missing values and scale the features).
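As a rough sketch of this preprocessing step, the snippet below fills missing entries and then rescales one feature column. Mean imputation and zero-mean, unit-variance scaling are assumptions here; the README does not say which imputation or scaling method the notebook uses, and the sample column is invented.

```python
from statistics import mean, pstdev


def impute_mean(column):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def standardise(column):
    """Rescale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no spread to rescale.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


# Invented feature column with one missing value.
raw = [1.0, None, 3.0, 5.0]
filled = impute_mean(raw)
scaled = standardise(filled)
```

When a model is evaluated on held-out data, the fill value, mean, and standard deviation would normally be computed on the training split only and then reused for the test split.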

Compare three classification algorithms and select the best one (k-NN with k=5, k-NN with k=10, decision tree).
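The comparison can be illustrated with a small sketch that scores k-NN at k=5 and k=10 on a held-out split and keeps the better one. It covers only the two k-NN candidates; a decision tree would be scored the same way. The synthetic points, the train/test split, and the helper names are assumptions for the example, not the yeast data or the notebook's code.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k):
    """Majority label among the k training points closest to `query`."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k):
    """Fraction of test points whose predicted label equals the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)


# Synthetic, well-separated points in two classes; not the yeast data.
data = [((i * 0.1, i * 0.2), "A") for i in range(20)]
data += [((10 + i * 0.1, 10 + i * 0.2), "B") for i in range(20)]

# Hold out every fourth point for testing.
train = [row for i, row in enumerate(data) if i % 4 != 0]
test = [row for i, row in enumerate(data) if i % 4 == 0]

scores = {k: accuracy(train, test, k) for k in (5, 10)}
best_k = max(scores, key=scores.get)
```

On data this cleanly separated both settings score perfectly, so the choice of k only starts to matter on noisier data such as a real multi-class problem.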

Feature engineering and performance interpretation (implement feature generation, select features by mutual information).
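Selecting features by mutual information can be sketched as ranking each feature by how much it tells us about the class label. The version below handles discrete values only and uses invented feature names and labels; continuous features, as in a real dataset, would first need binning or a dedicated estimator.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """I(X;Y) in bits for two equal-length sequences of discrete values."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


# Invented labels and features for illustration.
labels = ["c1", "c1", "c2", "c2"]
features = {
    "noise": [0, 1, 0, 1],        # independent of the label
    "informative": [0, 0, 1, 1],  # determines the label exactly
}

mi = {name: mutual_information(values, labels) for name, values in features.items()}
ranked = sorted(mi, key=mi.get, reverse=True)
```

Here `informative` carries one full bit about the label and `noise` carries none, so a top-1 selection would keep only `informative`.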

code.ipynb is the code file; the other files are datasets.
