Skip to content

BenSnyder90/Final_Project

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

88 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Final_Project

Protein Signature and Classifier Discovery: The Forest Through the Trees

Identifying a classifier and proteins that significantly change in a simulated Olink dataset using random forest machine learning and ANOVA statistically testing. The simulated study includes three clinical variables: response status, timepoint, and sample collection site.

CWRU Final Project (Group 2) Chris Bock, Debra Fenty, Ben Snyder, Frankie Wong

Site: https://chrisbock76.github.io/Final_Project/
Presentation: https://docs.google.com/presentation/d/1oF25TyMGBnpUAAlSOjVGa522W7w3KbH9M3rzxzRgc7Q/edit?usp=sharing
Trello board (WIP): https://trello.com/b/xhnChis2/cwrufinalprojectgroup2

Background

“Olink Proteomics is a Swedish company (with a US-based laboratory located in Watertown, MA) dedicated to innovation, quality, rigor and transparency, providing outstanding products and services for human protein biomarker discovery. Our groundbreaking Olink panels for precision proteomics help scientists make research decisions more quickly and confidently through robust, multiplex biomarker analysis. Our high quality multiplex immunoassay panels help bring new insights into disease processes, improve disease detection, and contribute to a better understanding of biology.” (source: http://www.olink.com/)

Olink uses their proprietary Proximity Extension Assay (PEA) to detect over 1,100 unique proteins in human biological fluids, typically plasma or serum, where the output of the assay is reported in a relative protein concentration unit called NPX. NPX is a log2 transformation of DNA copy number generated from a quantitative polymerase chain reaction (qPCR) measurement. The NPX value is a direct translation of DNA copy number to amount in the biological sample. We will use a simulated dataset that was created by Olink to train other, experienced Data Scientists on how to analyze PEA generated data.

Proposed Project

The simulated study that we have received contains two individual datasets -- that can be analyzed separately or together via a set of ‘bridge’ samples -- and an annotation file for each respective dataset. Each dataset has the same number of unique subjects, but the second dataset contains the ‘bridge’ samples that were analyzed in the first dataset. Each annotation file contains the following data: SampleID, Subject, Treatment, Site, Time, Project (indicating the respective dataset). Dataset Details:

  • Clinical variables
    • Three timepoints (Baseline; Week 6; Week 12)
    • Site information (from sites A-E)
    • Response status (Responder or Non-responder)
  • The first dataset has 52 subjects, and 156 overall samples
  • The second dataset has another 52 subjects, but also has 16 ‘bridge’ samples from the first run (and thus has 172 samples)

We have been tasked to identify some or all of the following via analysis of the data:

  • Identify what factors or groupings have significant effects on protein expression
    • Site-to-site variations
    • Time points
    • Batch-to-batch variations
  • Use Random Forest to find a classifier for response ○ Test and train on one dataset ○ Does the model give favorable results using the second dataset?
  • Determine whether the two datasets can be compared together
  • (Maybe) Do the two datasets agree with one another in their results.
  • (If possible, can you combine the two datasets and look at the analysis again)

Machine Learning Model

We used a Random Forest Classifier model to take the provided NPX values, timeframe of treatment, and sites and predict whether or not a given sample is a Responder or Non-responder.

Technique

  • Data Cleanup of provided sample data to get rid of unnecessary lines/data
  • Create machine learning model using cleaned data
  • Run statistical tests (ANOVA) to look for statistical significance for certain proteins based on model results
  • Test second dataset using the model for accuracy

Resources

Jupyter Notebooks

  • Anova_by_proteins.ipynb - Data cleanup notebook
  • BS_ML_test.ipynb - Machine learning testing notebook. Contains KMeans and Random Forest models
  • NewANOVAxProtein.ipynb - ANOVA Test Notebook.
  • data_merge.ipynb/data2_merge.ipynb - Data cleanup notebook

Files

  • RandomForest_Model - binary file of random forest algorithm (Saved via Pickle)

About

CWRU Final Project (Group 2)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Jupyter Notebook 97.2%
  • CSS 2.4%
  • Other 0.4%