Macromolecule classification

Introduction

The study of amino acid sequences is vital in the life sciences. We used several word embedding techniques from natural language processing (NLP) to represent each sequence as a vector. Our main goal was to classify sequences into four classes: DNA, RNA, protein, and hybrid. After several tests, we achieved almost 99% accuracy on both the training and test sets. We experimented with CNN, LSTM, bidirectional LSTM, and GRU models.
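To make the embedding idea concrete, here is a minimal sketch of how a sequence might be turned into integer ids that an embedding layer can consume. The README does not say which tokenization was used, so overlapping 3-mers, the helper names (`kmer_tokens`, `build_vocab`, `encode`), and the padding length are all assumptions made for illustration.

```python
def kmer_tokens(sequence, k=3):
    """Split a sequence into overlapping k-mers, the 'words' of the sequence."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocab(sequences, k=3):
    """Assign an integer id to every k-mer seen in the data.

    Id 0 is reserved for padding and id 1 for unknown k-mers (an assumption).
    """
    vocab = {"<pad>": 0, "<unk>": 1}
    for seq in sequences:
        for token in kmer_tokens(seq, k):
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sequence, vocab, k=3, max_len=8):
    """Convert a sequence to a fixed-length list of k-mer ids."""
    ids = [vocab.get(token, vocab["<unk>"]) for token in kmer_tokens(sequence, k)]
    ids = ids[:max_len]                                   # truncate long sequences
    return ids + [vocab["<pad>"]] * (max_len - len(ids))  # pad short ones


# Toy sequences, not taken from the dataset.
sequences = ["ACGTAC", "MKVLAAGIV"]
vocab = build_vocab(sequences)
encoded = [encode(seq, vocab) for seq in sequences]
```

Each id would then index a row of a trainable embedding matrix, in the same way word ids do in a text model.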

Dataset

Initially, the dataset had 10 classes and was not normalized. Most of the data belonged to the protein class, as shown in the figure below.

Preprocessing

In the preprocessing step, we reduced the number of classes to four. All hybrid protein macromolecules were relabelled simply as hybrid, regardless of the kind of hybrid structure, while the protein label was kept for protein only. Figure 3 shows the class distribution of the new data.
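The label reduction described above can be sketched as follows. The original ten class names are not listed in this README, so the raw labels below are hypothetical, and the rule "anything that is not exactly DNA, RNA, or protein becomes hybrid" is an assumption about how the merge was done. The actual logic lives in `preprocess.py` and may differ.

```python
from collections import Counter

# Labels that are kept as their own class (keys are lower-cased for matching).
PURE_CLASSES = {"dna": "DNA", "rna": "RNA", "protein": "Protein"}


def reduce_label(raw_label):
    """Map a raw class label to one of: DNA, RNA, Protein, Hybrid."""
    key = raw_label.strip().lower()
    # Assumption: every label that is not a pure class is treated as a hybrid.
    return PURE_CLASSES.get(key, "Hybrid")


# Hypothetical raw labels, for illustration only.
raw_labels = [
    "Protein", "Protein", "DNA", "RNA",
    "Protein#DNA", "DNA/RNA Hybrid", "Protein#RNA",
]
reduced = [reduce_label(label) for label in raw_labels]
counts = Counter(reduced)  # class distribution after the merge
```

Inspecting `counts` after the merge is a quick way to check how imbalanced the four classes still are.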

Model

Several models were developed and tested. The architecture of one of them and its results are shown here.

Model Loss and Accuracy

Model Results:

Overall Results

Experiments were carried out on four different models. The table below compares the accuracy, precision, recall, and F1-score of all of them. The CNN model has so far achieved the highest accuracy and precision. In terms of recall and F1-score, the CNN-GRU model outperforms the CNN model.

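| Model    | Accuracy | Precision | Recall | F1-score |
|----------|----------|-----------|--------|----------|
| CNN      | 0.98     | 0.91      | 0.875  | 0.895    |
| Bi-LSTM  | 0.77     | 0.77      | 0.78   | 0.76     |
| CNN-LSTM | 0.79     | 0.78      | 0.79   | 0.78     |
| CNN-GRU  | 0.90     | 0.90      | 0.92   | 0.92     |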
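For readers who want to reproduce this kind of comparison, the sketch below computes the four metrics for a multi-class problem. The README does not state how precision, recall, and F1-score were averaged across classes, so macro averaging is an assumption here, and the labels are toy data, not results from the notebooks.

```python
def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1-score."""
    pairs = list(zip(y_true, y_pred))
    classes = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(1 for t, p in pairs if t == c and p == c)  # true positives
        fp = sum(1 for t, p in pairs if t != c and p == c)  # false positives
        fn = sum(1 for t, p in pairs if t == c and p != c)  # false negatives
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(classes)
    return {
        "accuracy": sum(1 for t, p in pairs if t == p) / len(pairs),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy labels for illustration only.
y_true = ["Protein"] * 4 + ["DNA"] * 2 + ["RNA"] * 2
y_pred = ["Protein", "Protein", "Protein", "DNA", "DNA", "DNA", "RNA", "Protein"]
metrics = macro_metrics(y_true, y_pred)
```

With macro averaging, the F1-score is the mean of the per-class F1 values, so it need not equal the harmonic mean of the averaged precision and recall.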

engrfaisal90/Macromolecule_classification
