This project tested three different models for protein secondary structure prediction. The project was guided by ChatGPT, which also wrote most of the code. The code was written as Jupyter notebooks in Google Colab, and much of it was run on the CPU.
The dataset used is this one: https://www.kaggle.com/datasets/kirkdco/protein-secondary-structure-2022/data
Four datasets were generated from it with the code in the file data_to_proj. The models were tested on datasets created with the following settings:
Dataset, Training set (#sequences), Test set (#sequences), Start length, End length
Dataset 1, 4000, 1000, 20, 50
Dataset 2, 8000, 2000, 20, 50
Dataset 3, 1600, 4000, 15, 60
Dataset 4, 1600, 4000, 10, 80
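The settings above can be read as a length filter followed by a train/test split. The sketch below illustrates that idea only; it is not the code in data_to_proj. The function name make_subset, the (sequence, structure) record layout, the inclusive length bounds, and the fixed random seed are all assumptions made for the example.

```python
import random


def make_subset(records, n_train, n_test, start_len, end_len, seed=0):
    """Keep records whose sequence length lies in [start_len, end_len]
    (bounds assumed inclusive), then split them into train and test lists."""
    eligible = [r for r in records if start_len <= len(r[0]) <= end_len]
    if len(eligible) < n_train + n_test:
        raise ValueError("not enough sequences in the requested length range")
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(eligible)
    return eligible[:n_train], eligible[n_train:n_train + n_test]


# Toy records standing in for the Kaggle data: (sequence, structure) pairs.
toy_records = [("A" * n, "C" * n) for n in range(5, 100)]

# Same length range as Dataset 1, with much smaller set sizes.
train, test = make_subset(toy_records, n_train=20, n_test=5,
                          start_len=20, end_len=50)
```

With the real data, the call for Dataset 1 would use n_train=4000, n_test=1000, start_len=20 and end_len=50.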
All the code is organized into functions that take every model parameter as an argument, together with the dataset, the training set, the name of the output report (containing the F1 score and accuracy), and the name of the predictions document (containing the model's predictions).
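As a rough illustration of what the report and the predictions document could contain, the sketch below computes accuracy and an F1 score for per-residue labels and writes both files. The function names, the file layout, and the choice of macro-averaged F1 are assumptions; the notebooks may compute and store these values differently.

```python
def accuracy(true_labels, pred_labels):
    """Fraction of positions where the prediction matches the true label."""
    correct = sum(t == p for t, p in zip(true_labels, pred_labels))
    return correct / len(true_labels)


def macro_f1(true_labels, pred_labels):
    """Unweighted mean of the per-class F1 scores (macro averaging assumed)."""
    pairs = list(zip(true_labels, pred_labels))
    classes = sorted(set(true_labels) | set(pred_labels))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def format_report(true_labels, pred_labels):
    """Build the text of the report (hypothetical layout)."""
    return (f"accuracy: {accuracy(true_labels, pred_labels):.4f}\n"
            f"macro F1: {macro_f1(true_labels, pred_labels):.4f}\n")


def write_outputs(true_labels, pred_labels, report_name, predictions_name):
    """Write the report and the raw predictions to two separate files."""
    with open(report_name, "w") as report_file:
        report_file.write(format_report(true_labels, pred_labels))
    with open(predictions_name, "w") as predictions_file:
        predictions_file.write("".join(pred_labels) + "\n")


# Toy example: H = helix, E = strand, C = coil.
true_ss = list("HHHEECCC")
pred_ss = list("HHEEECCH")
acc = accuracy(true_ss, pred_ss)
f1 = macro_f1(true_ss, pred_ss)
```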
The random forest code can be found in: Random_Forest.ipynb
It consists of a function that must be run before any tests can be performed; the function calls follow it in the notebook. Testing is first done on the first two datasets. Optimization is then performed, and the resulting values are inserted into the function call to test the outcome. The same is done for the remaining datasets.
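The optimize-then-retest workflow described above can be sketched as a small grid search. This is an illustration only: the notebook may use a different search method, and the parameter names n_estimators and max_depth, the grid values, and the toy_evaluate stand-in are hypothetical.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid` and return the best-scoring one.
    `evaluate` takes the parameters as keyword arguments and returns a score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_evaluate(n_estimators, max_depth):
    """Stand-in for 'train the model and return its validation score'.
    By construction it peaks at n_estimators=200, max_depth=10."""
    return -abs(n_estimators - 200) / 1000 - abs(max_depth - 10) / 100


grid = {"n_estimators": [100, 200, 400], "max_depth": [5, 10, 20]}
best_params, best_score = grid_search(toy_evaluate, grid)
```

The values in best_params would then be passed to the test function in place of the initial settings.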
The CNN model code can be found in: CNN.ipynb
The notebook starts with the model with one convolutional layer, followed by the function calls that run it with the chosen parameters. All calls appear directly below the function they use. Next comes the model with two convolutional layers, which goes through the same tests and then a parameter optimization. The optimization part is commented at the top and consists of three code blocks: a pip install, the function, and the actual testing. After this, the two-layer model (same function as before) is tested with the new parameters in the function call. Finally, a broad optimization was performed, followed by fine-tuning and then testing of these new parameters.
The hybrid CNN-RNN model code can be found in: Hybrid CNN_RNN.ipynb
The notebook starts with the model with one convolutional layer, followed by the function calls that run it with the chosen parameters. All calls appear directly below the function they use. Next comes the model with two convolutional layers, which goes through the same tests and then a parameter optimization. The optimization part is commented at the top and consists of three code blocks: a pip install, the function, and the actual testing. After this, the two-layer model (same function as before) is tested with the new parameters in the function call.