
Classification of macromolecule sequences into four classes with machine learning.


engrfaisal90/Macromolecule_classification


Macromolecule classification

Introduction

The study of amino acid sequences is vital in the life sciences. We used different word embedding techniques from natural language processing (NLP) to represent the sequences as vectors. Our main goal was to classify sequences into four classes: DNA, RNA, protein, and hybrid. After several tests, we achieved almost 99% train and test accuracy. We experimented with CNN, LSTM, bidirectional LSTM, and GRU models.
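
Before a sequence can be passed to an embedding layer, it has to be split into tokens and mapped to integer ids. The README does not say how the sequences were tokenized, so the following is only a minimal sketch that assumes overlapping k-mers as "words". The function names `kmer_tokens`, `build_vocab`, and `encode`, and the choice of `k=3`, are illustrative assumptions and not taken from the repository.

```python
from collections import Counter


def kmer_tokens(sequence, k=3):
    """Split a sequence into overlapping k-mers, treated as 'words'."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocab(token_lists):
    """Map each distinct token to an integer id; id 0 is reserved for padding."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    # Sort by descending frequency, then alphabetically, so ids are deterministic.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {tok: idx for idx, tok in enumerate(ordered, start=1)}


def encode(tokens, vocab, max_len):
    """Convert tokens to ids, then truncate or right-pad with 0 to max_len.

    Tokens missing from the vocabulary also map to 0 in this sketch.
    """
    ids = [vocab.get(tok, 0) for tok in tokens][:max_len]
    return ids + [0] * (max_len - len(ids))


if __name__ == "__main__":
    tokens = kmer_tokens("ATGATG", k=3)
    vocab = build_vocab([tokens])
    print(tokens)
    print(encode(tokens, vocab, max_len=6))
```

The fixed-length integer lists produced by `encode` are the kind of input an embedding layer typically expects; the embedding itself would then be learned or loaded by the model.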

Dataset

Initially the dataset had 10 classes and wasn't normalized. Most of the data came from the protein class, as shown in the figure below.
Screenshot


Preprocessing

In the preprocessing step, we reduced the number of classes to four. All hybrid protein macromolecules were labelled simply as hybrid, regardless of the kind of hybrid structure, and proteins were labelled as protein only. Figure 3 shows the details of the new data.

Screenshot
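
The class reduction described above can be sketched as a small label-mapping function. The original ten label names are not listed in this README, so the raw labels used here (such as `"Protein#DNA"`) and the `#` separator are hypothetical; only the rule itself (any combination becomes hybrid, single types keep their own class) comes from the text.

```python
from collections import Counter

# Canonical names for the single-type classes kept after preprocessing.
SINGLE_CLASSES = {"dna": "DNA", "rna": "RNA", "protein": "Protein"}


def collapse_label(raw_label, sep="#"):
    """Map a raw label to one of: DNA, RNA, Protein, hybrid.

    Assumption: hybrid macromolecules are written as component types joined
    by `sep` (hypothetical format, e.g. "Protein#DNA").
    """
    parts = {p.strip().lower() for p in raw_label.split(sep) if p.strip()}
    if not parts:
        raise ValueError("empty label")
    if len(parts) > 1:
        return "hybrid"  # any mix of types is treated as one class
    (single,) = parts
    if single not in SINGLE_CLASSES:
        raise ValueError(f"unknown label: {raw_label!r}")
    return SINGLE_CLASSES[single]


if __name__ == "__main__":
    raw = ["Protein", "Protein#DNA", "DNA", "Protein#RNA", "RNA", "Protein"]
    print(Counter(collapse_label(label) for label in raw))
```

Counting the collapsed labels, as in the usage lines, is a quick way to check how imbalanced the four resulting classes are.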

Model

Different models were developed and tested. One of the model architectures and its results are shown here.

Screenshot
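
The exact architecture is only available in the figure, so it is not reproduced here. As a rough illustration of what the convolutional part of a CNN sequence classifier computes, the sketch below applies a single 1D filter to an embedded sequence and then takes a global maximum. The function names and the toy numbers are assumptions for illustration only; a real model would use a deep learning framework with many learned filters.

```python
def conv1d_valid(embedded, kernel, bias=0.0):
    """Slide one filter over an embedded sequence ('valid' padding) with ReLU.

    embedded: list of L token vectors, each of dimension D
    kernel:   list of W weight vectors, each of dimension D
    Returns a list of L - W + 1 activations.
    """
    width = len(kernel)
    activations = []
    for start in range(len(embedded) - width + 1):
        total = bias
        for offset in range(width):
            for x, w in zip(embedded[start + offset], kernel[offset]):
                total += x * w
        activations.append(max(0.0, total))  # ReLU non-linearity
    return activations


def global_max_pool(activations):
    """Keep the strongest response of the filter anywhere in the sequence."""
    return max(activations)


if __name__ == "__main__":
    embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, 2-dim embeddings
    kernel = [[1.0, 0.0], [0.0, 1.0]]                # filter spanning 2 tokens
    feature_map = conv1d_valid(embedded, kernel)
    print(feature_map, global_max_pool(feature_map))
```

Each filter acts as a detector for a short pattern of neighbouring tokens, and the pooled values from all filters form the feature vector that a final dense layer would classify.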

Model Loss and Accuracy

Screenshot Screenshot

Model Results:

Overall Results


Experiments were carried out on four different models. The table below compares their accuracy, precision, recall, and F1-score. The CNN model has so far achieved the highest accuracy and precision. In terms of recall and F1-score, the CNN-GRU model outperforms the CNN model (0.92 vs. 0.875 recall, 0.92 vs. 0.895 F1-score).
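
| Model | Accuracy | Precision | Recall | F1-score |
|----------|------|------|-------|-------|
| CNN | 0.98 | 0.91 | 0.875 | 0.895 |
| Bi-LSTM | 0.77 | 0.77 | 0.78 | 0.76 |
| CNN-LSTM | 0.79 | 0.78 | 0.79 | 0.78 |
| CNN-GRU | 0.90 | 0.90 | 0.92 | 0.92 |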
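
For reference, the four metrics reported above can be computed from true and predicted labels as in the sketch below. The README does not state how precision, recall, and F1-score were averaged across classes, so macro averaging is an assumption here, and the function names are illustrative. On an imbalanced dataset, accuracy can be noticeably higher than macro-averaged recall, which may be one reason the CNN row shows 0.98 accuracy but 0.875 recall.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} using one-vs-rest counts."""
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[label] = (precision, recall, f1)
    return metrics


def macro_average(metrics):
    """Unweighted mean of (precision, recall, f1) over all classes."""
    n = len(metrics)
    return tuple(sum(m[i] for m in metrics.values()) / n for i in range(3))


if __name__ == "__main__":
    y_true = ["DNA", "DNA", "RNA", "Protein"]
    y_pred = ["DNA", "RNA", "RNA", "Protein"]
    per_class = per_class_metrics(y_true, y_pred, ["DNA", "RNA", "Protein"])
    print(accuracy(y_true, y_pred), macro_average(per_class))
```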
