The study of amino acid sequences is vital in the life sciences. We used several word embedding techniques from natural language processing (NLP) to represent amino acid sequences as vectors. Our main goal was to classify sequences into four classes: DNA, RNA, protein, and hybrid. After several tests, we achieved almost 99% training and test accuracy. We experimented with CNN, LSTM, bidirectional LSTM, and GRU models.
Initially, the dataset had 10 classes and was not normalized. Most of the data belonged to the protein class, as shown in the figure below.
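
To illustrate how a sequence can be treated like a sentence before an embedding is learned, here is a minimal sketch. It assumes overlapping k-mers as the "words"; the tokenization actually used in this project is not stated, so `kmers`, `build_vocab`, `encode`, and the parameters `k` and `max_len` are all hypothetical names chosen for the example.

```python
from collections import Counter


def kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers (assumed tokenization)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_vocab(sequences, k=3):
    """Assign an integer id to every k-mer seen in the training sequences."""
    counts = Counter(tok for seq in sequences for tok in kmers(seq, k))
    # ids 0 and 1 are reserved for padding and unknown tokens
    vocab = {"<pad>": 0, "<unk>": 1}
    for tok, _ in counts.most_common():
        vocab[tok] = len(vocab)
    return vocab


def encode(sequence, vocab, k=3, max_len=8):
    """Turn a sequence into a fixed-length list of token ids."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in kmers(sequence, k)]
    ids = ids[:max_len]                                   # truncate long inputs
    return ids + [vocab["<pad>"]] * (max_len - len(ids))  # pad short inputs


# Toy sequences, not taken from the dataset
train = ["MKVLAAGIV", "MKVLSP"]
vocab = build_vocab(train)
print(encode("MKVLAX", vocab))
```

The resulting id lists are what an embedding layer (or a pretrained word-embedding lookup) would map to dense vectors before the CNN or recurrent layers.
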
## Preprocessing

In the preprocessing step, we reduced the number of classes to four. All hybrid protein macromolecules were labelled simply as hybrid, regardless of the kind of hybrid structure, and protein-only entries were labelled as protein. Figure 3 shows the details of the new data.
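
The class reduction described above can be sketched as a small mapping function. The raw label strings are not listed in this document, so the format assumed here (component names joined by `#`, or the word "hybrid" in the name) and the function name `collapse_label` are hypothetical.

```python
def collapse_label(raw_label):
    """Map a raw macromolecule label to DNA, RNA, Protein, or Hybrid."""
    # Assumption: a raw label lists its components separated by '#',
    # e.g. "Protein#DNA"; a label containing "hybrid" also counts as hybrid.
    parts = {p.strip().lower() for p in raw_label.split("#") if p.strip()}
    if len(parts) > 1 or any("hybrid" in p for p in parts):
        return "Hybrid"
    (only,) = parts  # exactly one component remains
    # Unknown labels raise KeyError so they are noticed, not silently kept
    return {"protein": "Protein", "dna": "DNA", "rna": "RNA"}[only]


# Hypothetical raw labels, for illustration only
for raw in ["Protein", "DNA", "Protein#RNA", "DNA/RNA Hybrid"]:
    print(raw, "->", collapse_label(raw))
```

Raising on unknown labels is a deliberate choice in this sketch: it makes any class that does not fit the four-way scheme visible during preprocessing.
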
Several models were developed and tested. One of the model architectures and its results are shown here.
Experiments were carried out on four different models. The table below compares the accuracy, precision, recall, and F1-score of all models. The CNN model has so far achieved the highest accuracy (0.98) and precision (0.91). In terms of recall, CNN-GRU (0.92) outperforms the CNN model (0.875).

| Model | Accuracy | Precision | Recall | F1-score |
|---|---|---|---|---|
| CNN | 0.98 | 0.91 | 0.875 | 0.895 |
| Bi-LSTM | 0.77 | 0.77 | 0.78 | 0.76 |
| CNN-LSTM | 0.79 | 0.78 | 0.79 | 0.78 |
| CNN-GRU | 0.90 | 0.90 | 0.92 | 0.92 |
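
For reference, precision, recall, and F1-score for a multi-class problem can be computed per class and then averaged. The sketch below uses macro averaging, which is an assumption: this document does not say how the reported scores were averaged. The function names and the toy labels are illustrative only.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} from two label lists."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f1)
    return scores


def macro_average(scores):
    """Unweighted mean of each metric over all classes (assumed averaging)."""
    n = len(scores)
    return tuple(sum(s[i] for s in scores.values()) / n for i in range(3))


# Toy labels, not results from this project
y_true = ["Protein", "Protein", "DNA", "RNA"]
y_pred = ["Protein", "DNA", "DNA", "RNA"]
scores = per_class_metrics(y_true, y_pred, ["Protein", "DNA", "RNA"])
print(macro_average(scores))
```

Because most samples belong to the protein class, accuracy can stay high while macro-averaged recall is noticeably lower, which may explain the gap between the CNN's accuracy and its recall.
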