PACA

Phenotype Aware Components Analysis (PACA) is a contrastive learning approach leveraging canonical correlation analysis to robustly capture weak sources of subphenotypic variation. PACA can be used to define de novo subtypes that are more likely to reflect molecular heterogeneity, especially in challenging cases where the phenotypic heterogeneity may be masked by a myriad of strong unrelated effects in the data.

Installation

PACA is implemented as an R (>= v4) package that depends on the following:

Rcpp (>= v1.0)
RcppEigen (>= v3.4)
stats (>= v4.1)

You can install PACA using devtools:

devtools::install_github("adigorla/PACA")

Please see troubleshooting at the bottom for compilation issues.
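Before installing, it can help to confirm that the listed dependencies are already available. The snippet below is a minimal sketch using only base R; the variable names (`deps`, `missing_deps`) are illustrative and not part of PACA.

```r
# Packages listed as dependencies above (stats ships with base R)
deps <- c("Rcpp", "RcppEigen")

# requireNamespace() returns FALSE instead of erroring when a package is absent
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing_deps <- deps[!installed]

if (length(missing_deps) > 0) {
  message("Missing dependencies: ", paste(missing_deps, collapse = ", "))
  # install.packages(missing_deps)  # uncomment to install from CRAN
} else {
  message("All listed dependencies are available.")
}
```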

Usage

PACA

The paca command runs the basic PACA algorithm after automatically estimating the number of shared dimensions, k, to be removed, and returns components capturing variation unique to the cases. It chooses the k that maximizes the variation unique to cases in a given case/control dataset.

# load package
library(PACA)

# load data
X <- read.table("case_data1.txt")
Y <- read.table("control_data1.txt")

Important

All PACA functions expect the input matrices to be of shape features-by-samples (MxN). If the input data is NxM, transpose both matrices to MxN:

Xt <- t(X)
Yt <- t(Y)
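A quick orientation check can catch a missed transpose early. The helper below is a hypothetical sketch (`check_paca_input` is not a PACA function). It assumes cases and controls are measured on the same M features, and it enforces the M > N condition that this section states for paca.

```r
# Hypothetical helper, not part of the PACA package
check_paca_input <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  # Assumption: rows are features shared between cases and controls
  if (nrow(X) != nrow(Y)) {
    stop("X and Y must have the same number of rows (features).")
  }
  # paca expects more features (M) than samples (N)
  if (nrow(X) <= max(ncol(X), ncol(Y))) {
    stop("Expected more rows (features) than columns (samples); use t() if the data are samples-by-features.")
  }
  invisible(TRUE)
}

# Toy data: 50 features, 20 cases and 15 controls
set.seed(1)
X_toy <- matrix(rnorm(50 * 20), nrow = 50)
Y_toy <- matrix(rnorm(50 * 15), nrow = 50)
check_paca_input(X_toy, Y_toy)  # passes silently
```

Passing the untransposed toy matrices, `check_paca_input(t(X_toy), t(Y_toy))`, stops with an error because the row counts no longer match.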

The input data, X and Y, need to be of the form features-by-samples (MxN), where M > N. We assume the features are scaled as appropriate for the data type (e.g., quantile normalization for RNA-seq data). The input data then need to be standardized along the sample axis, as shown below.

# standardize 
Xt.std <- scale(Xt, center = TRUE, scale = TRUE)
Yt.std <- scale(Yt, center = TRUE, scale = TRUE)

# run PACA (and infer k)
set.seed(4499)
PACA.res <- paca(Xt.std, Yt.std)

# return the top 5 (default rank) unique components of the case data
print(dim(PACA.res$x)) # Nx5

# return the corrected case data
print(dim(PACA.res$xtil)) # MxN
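To see what the standardization step does, the sketch below applies the same `scale()` call to simulated data (the toy matrix is an assumption for illustration only). Because `scale()` works column-wise and the columns of an MxN matrix are samples, each sample ends up with mean 0 and standard deviation 1.

```r
set.seed(1)
# Simulated stand-in for case data: 200 features x 10 samples
Xt_toy <- matrix(rnorm(200 * 10, mean = 5, sd = 3), nrow = 200)

# Same call as in the workflow above
Xt_toy.std <- scale(Xt_toy, center = TRUE, scale = TRUE)

# Per-sample (column) summaries after standardization
col_means <- colMeans(Xt_toy.std)
col_sds <- apply(Xt_toy.std, 2, sd)

print(dim(Xt_toy.std))            # shape is unchanged
print(max(abs(col_means)))        # numerically zero
print(max(abs(col_sds - 1)))      # numerically zero
```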

Users also have the option to run the paca algorithm with a fixed k. This returns the unique components in the cases at the user-defined k. Again, the input data, X and Y, need to be of the form features-by-samples (MxN), where M > N.

# standardize 
Xt.std <- scale(Xt, center = TRUE, scale = TRUE)
Yt.std <- scale(Yt, center = TRUE, scale = TRUE)

# run PACA with fixed k
PACA.res.k10 <- paca(Xt.std, Yt.std, k = 10)

# return the top 5 (default rank) unique components of the cases, after correcting for the top 10 shared components
print(dim(PACA.res.k10$x)) # Nx5

# the dimension of the corrected case data
print(dim(PACA.res.k10$xtil)) # MxN

Please refer to the PACA man page for more detailed usage information.
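One possible downstream step is to cluster the cases on the returned components to propose candidate subtypes. The sketch below is an illustration only: it uses a simulated matrix as a stand-in for the Nx5 output `PACA.res$x`, and the choice of k-means with two clusters is an assumption, not something prescribed by PACA.

```r
set.seed(4499)

# Simulated stand-in for PACA.res$x: 60 cases x 5 components,
# built from two well-separated groups of 30 cases each
comp <- rbind(
  matrix(rnorm(30 * 5, mean = -3), ncol = 5),
  matrix(rnorm(30 * 5, mean = 3), ncol = 5)
)

# k-means from the stats package; the number of clusters is a user choice
km <- kmeans(comp, centers = 2, nstart = 10)
subtype <- km$cluster

# Number of cases assigned to each candidate subtype
print(table(subtype))
```

With real data, `comp` would be replaced by `PACA.res$x`, and the number of clusters would need to be chosen and validated for the application at hand.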

Randomized (r)PACA

rpaca is a randomized extension of the basic paca algorithm. rpaca allows us to apply PACA in regimes where M << N, i.e., in cases where the number of samples is much greater than the number of features.

# load package
library(PACA)

# load data
Xt <- t(read.table("case_data1.txt"))
Yt <- t(read.table("control_data1.txt"))

The input data, X and Y, need to be of the form features-by-samples (MxN). While rpaca can automatically select k, we recommend that users leverage domain knowledge when possible. For optimal results, consider performing a grid search with rpaca over a range of k values, selecting the one that maximizes an application-specific metric or aligns best with your field expertise.

# standardize 
Xt.std <- scale(Xt, center = TRUE, scale = TRUE)
Yt.std <- scale(Yt, center = TRUE, scale = TRUE)

# fixed k selected by the user
k.select <- 10

# run randomized PACA with the fixed k
set.seed(4499)
rPACA.res <- rpaca(Xt.std, Yt.std, k = k.select, niter = 10, batch = 300, rank = 5)

# run randomized PACA with automatic selection of k
set.seed(4499)
autorPACA.res <- rpaca(Xt.std, Yt.std, niter = 10, batch = 300, rank = 5, thrsh = 10.0)


# the dimension of the returned unique components of the cases from rPACA with fixed K
print(dim(rPACA.res$x))

# the dimension of the returned unique components of the cases from auto rPACA
print(dim(autorPACA.res$x))

# print list of K selected in each iteration
print(autorPACA.res$k.iter)
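The grid search over k suggested above could be organized along the following lines. This is a sketch under stated assumptions: `grid_search_k`, `fit_fn`, and `score_fn` are hypothetical names, the scoring metric is left to the user, and toy stand-ins are used so the example runs without PACA installed.

```r
# Hypothetical helper: evaluate a user-supplied metric over a grid of k values.
# fit_fn(k) returns a fitted result; score_fn(fit) returns one number,
# where larger means better.
grid_search_k <- function(k_grid, fit_fn, score_fn) {
  scores <- vapply(k_grid, function(k) score_fn(fit_fn(k)), numeric(1))
  list(best_k = k_grid[which.max(scores)],
       scores = setNames(scores, k_grid))
}

# With PACA installed, fit_fn could wrap rpaca, for example:
# fit_fn <- function(k) rpaca(Xt.std, Yt.std, k = k, niter = 10, batch = 300, rank = 5)
# score_fn would then compute an application-specific metric from the fit.

# Toy stand-ins so the sketch runs on its own
toy_fit <- function(k) list(k = k)
toy_score <- function(fit) -(fit$k - 6)^2   # peaks at k = 6 by construction

res <- grid_search_k(seq(2, 12, by = 2), toy_fit, toy_score)
print(res$best_k)
print(res$scores)
```

Keeping the fit and the score as separate functions lets the same loop be reused with different metrics without re-deriving the grid logic.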

niter, rank, and batch are optional parameters. However, users need to make sure that batch < min(M, N) and k < batch - 1. Empirically, increasing niter and/or rank seems to improve the estimation accuracy of the randomized algorithm, at the expense of increased runtime. Please refer to the rPACA man page for more detailed usage information.
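The two constraints on batch and k can be checked before launching a long run. The function below is a hypothetical convenience (`rpaca_params_ok` is not part of PACA) that encodes the conditions batch < min(M, N) and k < batch - 1 stated above; k may be omitted when it is selected automatically.

```r
# Hypothetical pre-flight check, not part of the PACA package
rpaca_params_ok <- function(M, N, batch, k = NULL) {
  ok <- batch < min(M, N)
  if (!is.null(k)) {
    ok <- ok && (k < batch - 1)
  }
  ok
}

# Example: 500 features, 2000 samples
print(rpaca_params_ok(M = 500, N = 2000, batch = 300, k = 10))   # TRUE
print(rpaca_params_ok(M = 200, N = 2000, batch = 300, k = 10))   # FALSE: batch >= min(M, N)
print(rpaca_params_ok(M = 500, N = 2000, batch = 300, k = 299))  # FALSE: k >= batch - 1
```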

Null Testing

The paca_null algorithm allows users to test for the statistical significance of phenotypic variation unique to the cases, for a given fixed k. This procedure should be able to reject the null (no subphenotypic variation) when there is sufficiently strong variation unique to the cases.

# load package
library(PACA)

# load data
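Xt <- t(read.table("case_data1.txt"))     # NxM -> MxN
Yt <- t(read.table("control_data1.txt"))  # NxM -> MxN

# standardize
Xt.std <- scale(Xt, center = TRUE, scale = TRUE)
Yt.std <- scale(Yt, center = TRUE, scale = TRUE)

# test for selected k
k.h0 <- 10
set.seed(4499)
PACA.nulltest <- paca_null(Xt.std, Yt.std, k.h0, nperm = 100)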

# p-value for testing H0: there is no case-specific variation in PACA PC1
print(PACA.nulltest$pval)

The input data, X and Y, need to be of the form features-by-samples (MxN), where M > N, with the features scaled as appropriate for the data type. Increasing nperm increases the precision of the p-value estimate. Please refer to the PACA man page for more detailed usage information.
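To illustrate why nperm controls the precision of the p-value, the sketch below computes a generic permutation p-value from simulated null statistics. This is a textbook estimator shown as an assumption; it is not necessarily the one used internally by paca_null, and `perm_pval`, `obs_stat`, and the simulated null draws are all hypothetical.

```r
# Generic permutation p-value with the usual +1 correction
# (assumed estimator, not taken from the PACA source)
perm_pval <- function(obs, null_stats) {
  (1 + sum(null_stats >= obs)) / (1 + length(null_stats))
}

set.seed(4499)
obs_stat <- 3.2  # hypothetical observed test statistic

for (nperm in c(100, 1000)) {
  null_stats <- rnorm(nperm)  # hypothetical null draws, one per permutation
  cat("nperm =", nperm,
      "| p =", perm_pval(obs_stat, null_stats),
      "| smallest attainable p =", 1 / (nperm + 1), "\n")
}
```

Under this estimator the smallest attainable p-value is 1 / (nperm + 1), so nperm = 100 cannot report anything below roughly 0.0099.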

🚧 Documentation Under Development 🚧

The code is currently in public beta and may contain incomplete features. We are actively working to improve the documentation and code stability.

🐛 Found a Bug? Have a suggestion or found a bug? We’d love to hear from you! Please create a new issue on our GitHub repository to report any bugs or request new features.

📣 Feedback is Welcome! Your feedback is invaluable in helping us improve!

Citation

If you use this software in your research, please cite our work as follows:

@article{gorla2023paca,
    author = {Gorla, Aditya and Sankararaman, Sriram and Burchard, Esteban and Flint, Jonathan and Zaitlen, Noah and Rahmani, Elior},
    title = {Phenotypic subtyping via contrastive learning},
    journal = {bioRxiv},
    year = {2023},
    doi = {10.1101/2023.01.05.522921},
    URL = {https://www.biorxiv.org/content/10.1101/2023.01.05.522921v1}
}

Gorla, et al. "Phenotypic subtyping via contrastive learning." bioRxiv (2023).

License and Disclaimer

PACA is publicly released under the GPL-3.0 license (see LICENSE.md for the full license text). Note, however, that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.

Troubleshooting

Instructions

If you are using a Mac and having installation issues, try installing Homebrew or Xcode and then reinstalling Rcpp and RcppEigen.

R >= 4.0+ on M1/2 Macs

If you are having issues compiling R/Rcpp code on the newer ARM (M1/2) Mac hardware, make sure you have gcc (13+) and llvm installed using Homebrew.

brew install gcc && brew install llvm

Then update the Makevars file in the ~/.R/ directory to the following:

# custom g++ Makevars
# adapted from here: https://stackoverflow.com/questions/65860439/installing-data-table-on-macos
# comments sit on their own lines because make keeps whitespace before an inline comment as part of the value

# UPDATE & CHECK path is valid
GCC_LOC = /opt/homebrew/Cellar/gcc/13.1.0
FLIBS=-L$(GCC_LOC)/lib/gcc/13 -L$(GCC_LOC)/lib -lgfortran -lm
FC=$(GCC_LOC)/bin/gfortran
F77=$(GCC_LOC)/bin/gfortran
CXX1X=$(GCC_LOC)/bin/g++-13
CXX98=$(GCC_LOC)/bin/g++-13
CXX11=$(GCC_LOC)/bin/g++-13
CXX14=$(GCC_LOC)/bin/g++-13
CXX17=$(GCC_LOC)/bin/g++-13
CXX20=$(GCC_LOC)/bin/g++-13


# UPDATE & CHECK path is valid
LLVM_LOC = /opt/homebrew/opt/llvm
CC=$(GCC_LOC)/bin/gcc-13 -fopenmp
CXX=$(GCC_LOC)/bin/g++-13 -fopenmp -llapack
CFLAGS=-g -O3 -Wall -pedantic -std=gnu99 -mtune=native -pipe
CXXFLAGS=-g -O3 -Wall -pedantic -std=c++14 -mtune=native -pipe
LDFLAGS=-L$(LLVM_LOC)/lib -Wl,-rpath,$(LLVM_LOC)/lib
# UPDATE & CHECK path is valid
RARM_LOC = /opt/R/arm64
# UPDATE & CHECK path is valid
BREW_LOC = /opt/homebrew
CPPFLAGS=-I$(LLVM_LOC)/include -I$(BREW_LOC)/include -I$(RARM_LOC)/include -I/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include

Make sure that the four lines marked "UPDATE & CHECK path is valid" point to valid locations on your machine.

For all older versions of R and for Intel Mac installation issues, please refer to the detailed instructions on The Coatless Professor website.
