Final_Project

Protein Signature and Classifier Discovery: The Forest Through the Trees

Identifying a classifier and proteins that significantly change in a simulated Olink dataset using random forest machine learning and ANOVA statistically testing. The simulated study includes three clinical variables: response status, timepoint, and sample collection site.

CWRU Final Project (Group 2) Chris Bock, Debra Fenty, Ben Snyder, Frankie Wong

Site: https://chrisbock76.github.io/Final_Project/
Presentation: https://docs.google.com/presentation/d/1oF25TyMGBnpUAAlSOjVGa522W7w3KbH9M3rzxzRgc7Q/edit?usp=sharing
Trello board (WIP): https://trello.com/b/xhnChis2/cwrufinalprojectgroup2

Background

“Olink Proteomics is a Swedish company (with a US-based laboratory located in Watertown, MA) dedicated to innovation, quality, rigor and transparency, providing outstanding products and services for human protein biomarker discovery. Our groundbreaking Olink panels for precision proteomics help scientists make research decisions more quickly and confidently through robust, multiplex biomarker analysis. Our high quality multiplex immunoassay panels help bring new insights into disease processes, improve disease detection, and contribute to a better understanding of biology.” (source: http://www.olink.com/)

Olink uses their proprietary Proximity Extension Assay (PEA) to detect over 1,100 unique proteins in human biological fluids, typically plasma or serum, where the output of the assay is reported in a relative protein concentration unit called NPX. NPX is a log2 transformation of DNA copy number generated from a quantitative polymerase chain reaction (qPCR) measurement. The NPX value is a direct translation of DNA copy number to amount in the biological sample. We will use a simulated dataset that was created by Olink to train other, experienced Data Scientists on how to analyze PEA generated data.

Proposed Project

The simulated study that we have received contains two individual datasets -- that can be analyzed separately or together via a set of ‘bridge’ samples -- and an annotation file for each respective dataset. Each dataset has the same number of unique subjects, but the second dataset contains the ‘bridge’ samples that were analyzed in the first dataset. Each annotation file contains the following data: SampleID, Subject, Treatment, Site, Time, Project (indicating the respective dataset). Dataset Details:

Clinical variables
- Three timepoints (Baseline; Week 6; Week 12)
- Site information (from sites A-E)
- Response status (Responder or Non-responder)
The first dataset has 52 subjects, and 156 overall samples
The second dataset has another 52 subjects, but also has 16 ‘bridge’ samples from the first run (and thus has 172 samples)

We have been tasked to identify some or all of the following via analysis of the data:

Identify what factors or groupings have significant effects on protein expression
- Site-to-site variations
- Time points
- Batch-to-batch variations
Use Random Forest to find a classifier for response ○ Test and train on one dataset ○ Does the model give favorable results using the second dataset?
Determine whether the two datasets can be compared together
(Maybe) Do the two datasets agree with one another in their results.
(If possible, can you combine the two datasets and look at the analysis again)

Machine Learning Model

We used a Random Forest Classifier model to take the provided NPX values, timeframe of treatment, and sites and predict whether or not a given sample is a Responder or Non-responder.

Technique

Data Cleanup of provided sample data to get rid of unnecessary lines/data
Create machine learning model using cleaned data
Run statistical tests (ANOVA) to look for statistical significance for certain proteins based on model results
Test second dataset using the model for accuracy

Resources

Jupyter Notebooks

Anova_by_proteins.ipynb - Data cleanup notebook
BS_ML_test.ipynb - Machine learning testing notebook. Contains KMeans and Random Forest models
NewANOVAxProtein.ipynb - ANOVA Test Notebook.
data_merge.ipynb/data2_merge.ipynb - Data cleanup notebook

Files

RandomForest_Model - binary file of random forest algorithm (Saved via Pickle)

Name		Name	Last commit message	Last commit date
Latest commit History 88 Commits
Images		Images
Resources		Resources
anova_data		anova_data
assets		assets
.DS_Store		.DS_Store
ANOVA_CB_Test.ipynb		ANOVA_CB_Test.ipynb
ANOVA_merged.csv		ANOVA_merged.csv
Anova_by_proteins.ipynb		Anova_by_proteins.ipynb
BS_ML_test.ipynb		BS_ML_test.ipynb
CAM_ANOVA_Treat.ipynb		CAM_ANOVA_Treat.ipynb
CB_ML_test.ipynb		CB_ML_test.ipynb
FinalProject_Group2_ProjectProposal.docx		FinalProject_Group2_ProjectProposal.docx
NPX_data1.xlsx		NPX_data1.xlsx
NPX_data1_info.xlsx		NPX_data1_info.xlsx
NPX_data2.xlsx		NPX_data2.xlsx
NPX_data2_info.xlsx		NPX_data2_info.xlsx
NewANOVAxProtein.ipynb		NewANOVAxProtein.ipynb
README.md		README.md
RandomForest_Model		RandomForest_Model
analysis-vis.ipynb		analysis-vis.ipynb
analysis_boxplots.ipynb		analysis_boxplots.ipynb
data2_merge.ipynb		data2_merge.ipynb
data_merge.ipynb		data_merge.ipynb
index.html		index.html
panel_vis.ipynb		panel_vis.ipynb
protein_scrape.ipynb		protein_scrape.ipynb
sample.csv		sample.csv
test_cleaned.ipynb		test_cleaned.ipynb
treated_untreated.ipynb		treated_untreated.ipynb
tree.dot		tree.dot
trees.html		trees.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Final_Project

Protein Signature and Classifier Discovery: The Forest Through the Trees

Background

Proposed Project

Machine Learning Model

Technique

Resources

Jupyter Notebooks

Files

About

Uh oh!

Releases

Packages

Languages

BenSnyder90/Final_Project

Folders and files

Latest commit

History

Repository files navigation

Final_Project

Protein Signature and Classifier Discovery: The Forest Through the Trees

Background

Proposed Project

Machine Learning Model

Technique

Resources

Jupyter Notebooks

Files

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages