PACA

License: GPL v3 release: v0.5.0 coverage: 100% docs: in-progress

Phenotype Aware Components Analysis (PACA) is a contrastive learning approach leveraging canonical correlation analysis to robustly capture weak sources of subphenotypic variation. PACA can be used to define de novo subtypes that are more likely to reflect molecular heterogeneity, especially in challenging cases where the phenotypic heterogeneity may be masked by a myriad of strong unrelated effects in the data.

Installation

PACA is implemented as an R (>= v4) package, which depends on the following:

  • Rcpp (>= v1.0)
  • RcppEigen (>= v3.4)
  • stats (>= v4.1)

You can install PACA using devtools:

devtools::install_github("adigorla/PACA")
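
Before installing, it can help to confirm that the dependencies listed above are available. The following is a minimal sketch using only base R; `missing_deps` is a hypothetical helper name, not part of PACA.

```r
# Hypothetical helper (not part of PACA): report which packages are not installed.
missing_deps <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages needed to build and install PACA from GitHub
needed <- c("Rcpp", "RcppEigen", "devtools")
to_install <- missing_deps(needed)

if (length(to_install) > 0) {
  message("Missing packages: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
} else {
  message("All dependencies found.")
}
```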

Please see troubleshooting at the bottom for compilation issues.

Usage

PACA

The paca command runs the basic PACA algorithm after automatically estimating the number of shared dimensions k to be removed, and returns components capturing variation unique to the cases. It chooses the k that maximizes the variation unique to cases in a given case/control dataset.

# load package
library(PACA)

# load data
X <- read.table("case_data1.txt")
Y <- read.table("control_data1.txt")

Important

All PACA functions expect the input matrices to be of shape features-by-samples (MxN). If the input data is NxM, transpose both matrices to MxN.

Xt <- t(X)
Yt <- t(Y)

The input data, X and Y, need to be of the form features-by-samples (MxN), where M > N. We assume the features are scaled as appropriate for the data type (e.g., quantile normalization for RNAseq data). The input data then needs to be standardized along the sample axis, as below.

# standardize 
Xt.std <- scale(Xt, center = T, scale = T)
Yt.std <- scale(Yt, center = T, scale = T)

# run PACA (and infer k)
set.seed(4499)
PACA.res <- paca(Xt.std, Yt.std)

# dimensions of the top 5 (default rank) unique components of the case data
print(dim(PACA.res$x)) # Nx5

# dimensions of the corrected case data
print(dim(PACA.res$xtil)) # MxN
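
The shape requirements above can be checked before calling paca. The following is a minimal sketch in base R on simulated data. `check_paca_input` is a hypothetical helper, not a PACA function. The toy dimensions are arbitrary, and the requirement that cases and controls share the same rows (features) is an assumption based on the features-by-samples layout.

```r
# Hypothetical helper (not part of PACA): validate features-by-samples inputs.
check_paca_input <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  stopifnot(
    "cases and controls must share the same features (rows)" = nrow(X) == nrow(Y),
    "expected more features than samples (M > N)" =
      nrow(X) > ncol(X) && nrow(Y) > ncol(Y)
  )
  invisible(TRUE)
}

set.seed(1)
M <- 200                                   # number of features
X.toy <- matrix(rnorm(M * 30), nrow = M)   # 30 simulated cases, MxN
Y.toy <- matrix(rnorm(M * 40), nrow = M)   # 40 simulated controls, MxN

check_paca_input(X.toy, Y.toy)

# scale() standardizes each column, i.e., each sample in the MxN layout
X.toy.std <- scale(X.toy, center = TRUE, scale = TRUE)
Y.toy.std <- scale(Y.toy, center = TRUE, scale = TRUE)
```

If the check fails on real data, the most likely cause is that the matrices are still in samples-by-features orientation and need to be transposed with t().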

Users also have the option to run the paca algorithm with a fixed k. This returns the unique components in the cases at the user-defined k. Again, the input data, X and Y, need to be of the form features-by-samples (MxN), where M > N.

# standardize 
Xt.std <- scale(Xt, center = T, scale = T)
Yt.std <- scale(Yt, center = T, scale = T)

# run PACA with fixed k
PACA.res.k10 <- paca(Xt.std, Yt.std, k = 10)

# dimensions of the top 5 (default rank) unique components of the cases, after correcting for the top 10 shared components
print(dim(PACA.res.k10$x)) # Nx5

# dimensions of the corrected case data
print(dim(PACA.res.k10$xtil)) # MxN

Please refer to the PACA man page for more detailed usage information.
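
One way the returned components might be used downstream is to cluster cases into candidate subtypes. The sketch below runs base R k-means on a simulated stand-in for `PACA.res$x` (an N x 5 matrix). The choice of k-means and of two clusters is an assumption for illustration, not a method prescribed by PACA.

```r
set.seed(2)
N <- 60

# Stand-in for PACA.res$x: N cases by 5 components, with two simulated groups
# separated along the first component.
group <- rep(c(1, 2), each = N / 2)
comps <- matrix(rnorm(N * 5), nrow = N)
comps[, 1] <- comps[, 1] + ifelse(group == 2, 6, 0)

# k-means with multiple random starts to reduce sensitivity to initialization
km <- kmeans(comps, centers = 2, nstart = 20)

# cross-tabulate cluster assignments against the simulated groups
print(table(cluster = km$cluster, simulated_group = group))
```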

Randomized (r)PACA

rpaca is a randomized extension of the basic paca algorithm. rpaca allows us to apply PACA in regimes where M << N, i.e., in cases where the number of samples is much greater than the number of features.

# load package
library(PACA)

# load data
Xt <- t(read.table("case_data1.txt"))
Yt <- t(read.table("control_data1.txt"))

The input data, X and Y, need to be of the form features-by-samples (MxN), as for the other PACA functions. While rpaca can automatically select k, we recommend users leverage domain knowledge when possible. For optimal results, consider performing a grid search with rpaca over a range of k values, selecting the one that maximizes an application-specific metric or aligns best with your field expertise.

# standardize 
Xt.std <- scale(Xt, center = T, scale = T)
Yt.std <- scale(Yt, center = T, scale = T)

# run for selected K
k.select <- 10

# run randomized PACA
set.seed(4499)
rPACA.res <- rpaca(Xt.std, Yt.std, k =  k.select, niter = 10, batch = 300, rank = 5)

# run randomized PACA with automatic selection of k
set.seed(4499)
autorPACA.res <- rpaca(Xt.std, Yt.std, niter = 10, batch = 300, rank = 5, thrsh = 10.0)


# the dimension of the returned unique components of the cases from rPACA with fixed K
print(dim(rPACA.res$x))

# the dimension of the returned unique components of the cases from auto rPACA
print(dim(autorPACA.res$x))

# print list of K selected in each iteration
print(autorPACA.res$k.iter)
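
The grid search over k suggested above could be organized as in the following sketch. `grid_search_k`, `toy_fit`, and `toy_score` are hypothetical names, and the fit and score functions are stand-ins so that the example runs without PACA or real data.

```r
# Hypothetical helper: evaluate a user-supplied score for each candidate k.
# fit_fun(k) should return a fitted result; score_fun(fit) should return a
# single number where larger is better.
grid_search_k <- function(k_grid, fit_fun, score_fun) {
  scores <- vapply(k_grid, function(k) score_fun(fit_fun(k)), numeric(1))
  list(best_k = k_grid[which.max(scores)], scores = setNames(scores, k_grid))
}

# Stand-in fit and score functions. In practice fit_fun could wrap rpaca, e.g.
#   function(k) rpaca(Xt.std, Yt.std, k = k, niter = 10, batch = 300, rank = 5)
# and score_fun would compute an application-specific metric from the result.
toy_fit <- function(k) list(k = k)
toy_score <- function(fit) -(fit$k - 10)^2   # peaks at k = 10 by construction

res <- grid_search_k(c(2, 5, 10, 20), toy_fit, toy_score)
print(res$best_k)
print(res$scores)
```

Keeping the fitting and scoring steps as separate functions makes it easy to swap in a different metric without changing the search loop.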

niter, rank, and batch are optional parameters. However, users need to make sure that batch < min(M, N) and k < batch - 1. Empirically, increasing niter and/or rank seems to improve the estimation accuracy of the randomized algorithm, at the expense of increased runtime. Please refer to the rPACA man page for more detailed usage information.
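
The constraints on batch and k can be checked up front. The following is a minimal sketch; `check_rpaca_params` is a hypothetical helper, not part of PACA, and the dimensions are made-up values.

```r
# Hypothetical helper (not part of PACA): check the batch and k constraints.
check_rpaca_params <- function(M, N, k, batch) {
  stopifnot(
    "batch must be smaller than min(M, N)" = batch < min(M, N),
    "k must be smaller than batch - 1" = k < batch - 1
  )
  invisible(TRUE)
}

# valid settings: passes silently
check_rpaca_params(M = 500, N = 2000, k = 10, batch = 300)

# batch too large for the number of features: the check stops with an error
result <- tryCatch(
  check_rpaca_params(M = 200, N = 2000, k = 10, batch = 300),
  error = function(e) conditionMessage(e)
)
print(result)
```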

Null Testing

The paca_null algorithm allows users to test for the statistical significance of phenotypic variation unique to the cases, for a given fixed k. This procedure should be able to reject the null (no subphenotypic variation) when there is sufficiently strong variation unique to the cases.

# load package
library(PACA)

# load data
Xt <- t(read.table("case_data1.txt"))     # NxM -> MxN
Yt <- t(read.table("control_data1.txt"))  # NxM -> MxN

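# standardize
Xt.std <- scale(Xt, center = TRUE, scale = TRUE)
Yt.std <- scale(Yt, center = TRUE, scale = TRUE)

# test for selected K
k.h0 <- 10
set.seed(4499)
PACA.nulltest <- paca_null(Xt.std, Yt.std, k.h0, nperm = 100)

# p-value for H0: there is no case-specific variation in PACA PC1
print(PACA.nulltest$pval)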

The input data, X and Y, need to be of the form features-by-samples (MxN), where M > N, with the features scaled as appropriate for the data type. Increasing nperm increases the precision of the pval estimate. Please refer to the PACA man page for more detailed usage information.
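
To illustrate why a larger nperm gives a more precise p-value, the sketch below computes an empirical permutation p-value with the commonly used (b + 1) / (nperm + 1) estimator, whose smallest attainable value is 1 / (nperm + 1). Whether paca_null uses exactly this estimator is an assumption; `perm_pval` is a hypothetical helper and the statistics are simulated.

```r
# Empirical permutation p-value: (b + 1) / (nperm + 1), where b counts the
# null statistics that are at least as large as the observed one.
perm_pval <- function(observed, null_stats) {
  (sum(null_stats >= observed) + 1) / (length(null_stats) + 1)
}

set.seed(3)
observed <- 4                    # simulated observed statistic
for (nperm in c(100, 1000)) {
  null_stats <- rnorm(nperm)     # simulated null distribution
  cat("nperm =", nperm,
      " p-value =", perm_pval(observed, null_stats),
      " smallest attainable =", 1 / (nperm + 1), "\n")
}
```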

🚧 Documentation Under Development 🚧

The code is currently in public beta and may contain incomplete features. We are actively working to improve the documentation and code stability.

🐛 Found a Bug? Have a suggestion or found a bug? We’d love to hear from you! Please create a new issue on our GitHub repository to report any bugs or request new features.

📣 Feedback is Welcome! Your feedback is invaluable in helping us improve!

Citation

If you use this software in your research, please cite our work as follows:

@article{gorla2023paca,
    author = {Gorla, Aditya and Sankararaman, Sriram and Burchard, Esteban and Flint, Jonathan and Zaitlen, Noah and Rahmani, Elior},
    title = {Phenotypic subtyping via contrastive learning},
    journal = {bioRxiv},
    year = {2023},
    doi = {10.1101/2023.01.05.522921},
    URL = {https://www.biorxiv.org/content/10.1101/2023.01.05.522921v1}
}

Gorla, et al. "Phenotypic subtyping via contrastive learning." bioRxiv (2023).

License and Disclaimer

PACA is publicly released under the GPL-3.0 license (full license text found here). Note however that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.

Troubleshooting

Instructions

If you are using a Mac and having installation issues, try installing Homebrew or Xcode, then reinstalling Rcpp and RcppEigen.

R >= 4.0 on M1/M2 Macs

If you are having issues compiling R/Rcpp code on the newer ARM (M1/M2) Mac hardware, make sure you have gcc (13+) and llvm installed using Homebrew.

brew install gcc && brew install llvm 

Then update the Makevars file in the ~/.R/ directory to the following:

# custom g++ makevars 
# adapted from here: https://stackoverflow.com/questions/65860439/installing-data-table-on-macos
# Keep comments on their own lines: make keeps whitespace before a trailing '#' as part of the value.

# UPDATE & CHECK path is valid
GCC_LOC = /opt/homebrew/Cellar/gcc/13.1.0
FLIBS=-L$(GCC_LOC)/lib/gcc/13 -L$(GCC_LOC)/lib -lgfortran -lm
FC=$(GCC_LOC)/bin/gfortran
F77=$(GCC_LOC)/bin/gfortran
CXX1X=$(GCC_LOC)/bin/g++-13
CXX98=$(GCC_LOC)/bin/g++-13
CXX11=$(GCC_LOC)/bin/g++-13
CXX14=$(GCC_LOC)/bin/g++-13
CXX17=$(GCC_LOC)/bin/g++-13
CXX20=$(GCC_LOC)/bin/g++-13

# UPDATE & CHECK path is valid
LLVM_LOC = /opt/homebrew/opt/llvm
CC=$(GCC_LOC)/bin/gcc-13 -fopenmp
CXX=$(GCC_LOC)/bin/g++-13 -fopenmp -llapack
CFLAGS=-g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe
CXXFLAGS=-g -O3 -Wall -pedantic -std=c++14 -mtune=native -pipe
LDFLAGS=-L$(LLVM_LOC)/lib -Wl,-rpath,$(LLVM_LOC)/lib
# UPDATE & CHECK path is valid
RARM_LOC = /opt/R/arm64
# UPDATE & CHECK path is valid
BREW_LOC = /opt/homebrew
CPPFLAGS=-I$(LLVM_LOC)/include -I$(BREW_LOC)/include -I$(RARM_LOC)/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include

Make sure that the four paths marked "UPDATE & CHECK path is valid" point to valid locations on your machine.

For all older versions of R and Intel Mac installation issues, please refer to the detailed instructions on The Coatless Professor website.
