diff --git a/.github/workflows/manuscript.yaml b/.github/workflows/manuscript.yaml index 3f5b1be87..ec2bd9d37 100644 --- a/.github/workflows/manuscript.yaml +++ b/.github/workflows/manuscript.yaml @@ -57,8 +57,12 @@ jobs: uses: actions/upload-artifact@834a144ee995460fba8ed112a2fc961b36a5ec5a # v4 with: name: manuscript-${{ github.ref_name }}-${{ github.sha }} + # Include + # /tmp/git-latexdiff.*/new/reproducibility/manuscript/*.log + # to capture latexdiff log file path: | reproducibility/manuscript/manuscript.* reproducibility/manuscript/v2*.* reproducibility/manuscript/*.bib reproducibility/manuscript/*.dvc + reproducibility/manuscript/*.log diff --git a/reproducibility/manuscript/Makefile b/reproducibility/manuscript/Makefile index 2b75a9b59..8eb6fde05 100644 --- a/reproducibility/manuscript/Makefile +++ b/reproducibility/manuscript/Makefile @@ -108,6 +108,7 @@ cache-tex: copy-tex # run as make latexdiff COMPARISON_SHA1="commit-hash" # to diff with a different commit # see: https://gitlab.com/git-latexdiff/git-latexdiff +# Set `--cleanup none` to keep the tmp files including the latex log latexdiff: git-latexdiff \ --ignore-latex-errors \ diff --git a/reproducibility/manuscript/header.tex b/reproducibility/manuscript/header.tex index 0fa6ebe3f..04d22655a 100644 --- a/reproducibility/manuscript/header.tex +++ b/reproducibility/manuscript/header.tex @@ -1,5 +1,7 @@ \usepackage{nameref} \usepackage{placeins} +\usepackage[utf8]{inputenc} +\usepackage{fontspec} \usepackage[labelfont=bf]{caption} % \usepackage{showframe} % \usepackage{layout} diff --git a/reproducibility/manuscript/manuscript.qmd b/reproducibility/manuscript/manuscript.qmd index 6002915f0..c7223eebc 100644 --- a/reproducibility/manuscript/manuscript.qmd +++ b/reproducibility/manuscript/manuscript.qmd @@ -497,196 +497,136 @@ fate potency are reported in the titles. 
We assume the dynamical gene expression is determined by the RNA splicing process, and infer the unspliced and spliced gene expression level from the -differential equations proposed in velocyto [@La_Manno2018-lj] and scVelo +ordinary differential equations (ODEs) proposed in velocyto [@La_Manno2018-lj] and scVelo [@Bergen2020-pj] -\begin{align} -\frac{d u\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}} - &= \alpha^{\left(k_{cg}\right)}-\beta_g u\left(\tau^{\left(k_{cg}\right)}\right), - \label{eq-dudt}\\ -\frac{d s\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}} - &= \beta_g u\left(\tau^{\left(k_{cg}\right)}\right) - -\gamma_g s\left(\tau^{\left(k_{cg}\right)}\right). \label{eq-dsdt} -\end{align} +\begin{equation} + \label{eq-rna} +\begin{aligned} +\frac{du}{dt} &= \alpha(t) - \beta u, \quad u(0) = u_0 \\ + \frac{ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0, +\end{aligned} +\end{equation} +where $u(t), s(t)$ are the unspliced and spliced expression levels of a gene at time $t$ under a transcription rate $\alpha(t)$ with possible temporal dependence, splicing rate $\beta$, and degradation rate $\gamma$. We specialize this model to a setting that depends on cell $c$ and gene $g$ as follows: +\begin{equation} + \label{eq-rna-cg} +\begin{aligned} +\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \\ + \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} . +\end{aligned} +\end{equation} In the equation, the subscript $c$ is the cell dimension, $g$ is the gene dimension, -$\left( u\left( \tau^{(k_{cg})} \right), s\left( \tau^{(k_{cg})} \right) \right)$ +$\left( u_{cg}(t), s_{cg}(t) \right)$ are the unspliced and spliced expression functions given the -change of time per cell and gene. $\tau_{cg}$ represents the displacement of time per +change of time per cell and gene. 
+We restrict attention to piecewise-constant $\alpha_{cg}(t)$ to capture gene-specific activation and repression. We take special care to model a gene- and cell-specific switching time that marks a single transition from activation to repression by introducing a Bernoulli variable $k_{cg}$ to model the unknown transcription state. We assume that our cell-by-gene data matrix arises as Poisson count observations related to the solution of the above ODEs at unknown times $\tau_{cg}$. These times are modeled through an unknown latent time $t_c$ shared across all genes within a cell and unknown gene-specific time offsets $t_{0,g}$, so that all read counts for a single cell occur at the same unknown latent time $t_c$. These relative times are also used to parametrize the Bernoulli process for $k_{cg}$. +Importantly, we recognize that the initial conditions are, in fact, unknown. + +We propose and study two models: Model 1 assumes that spliced and unspliced concentrations are both 0 at time 0; Model 2 considers these initial conditions as unknowns with a log-Normal prior distribution. In general, the solution space of ODEs becomes much richer when considered over a domain of initial conditions (as opposed to a single point); indeed, this affords Model 2 much greater expressivity. For clarity, we first present the generative framework for both models, then provide further interpretation and intuition. 
+ +First, we introduce the generative model that describes the various unobserved times: +\begin{align} + % unit lognormal t_c + t_c &\sim \text{LogNormal}(0, 1) \\ + % gene-specific t_0 + t^{(0)}_{0,g} &\sim \text{LogNormal}(0, 1) \\ + % switching time + \Delta \textrm{switching}_g &\sim \text{LogNormal}(0, 1) \\ + % gene-specific t_1 + t^{(1)}_{0,g} &= t^{(0)}_{0,g} + \Delta \textrm{switching}_g \\ + %cell-gene-specific activation state + k_{cg} &\sim \text{Bernoulli}(\textrm{logits}=t_c - t^{(1)}_{0,g}) \\ + % cell-gene-specific latent time + \tau_{cg} &= \text{softplus}(t_c - t^{(k_{cg})}_{0,g}). +\end{align} +Here, $\tau_{cg}$ represents the displacement of time per cell and gene with \begin{align} - \tau^{(k_{cg})} &= \operatorname{softplus} \left( t_{c} - {t_{0}^{(k_{cg})}}_g \right) \\ - & = \log( 1 + \exp (t_c - {t_{0}^{(k_{cg})}}_g)), + \text{softplus}(t) := \log( 1 + e^t). \end{align} -in which $t_c$ is the shared time per cell, ${t_{0}^{(kcg)}}_g$ is the +Recall that $t_c$ is the shared time per cell, $t^{(k_{cg})}_{0,g}$ is the gene-specific switching time. Each cell and gene combination has its transcriptional state $k_{cg} \in \{ 0, 1 \}$, where $0$ indicates the activation state and $1$ indicates the expression state. Each gene has two -switching times for representing activation and repression: ${t_{0}^{(0)}}_g$ is +switching times for representing activation and repression: $t^{(0)}_{0,g}$ is the first switching time corresponding to when the gene expression starts to be -activated, ${t_0^{(1)}}_g$ is the second switching time corresponding to when -the gene expression starts to be repressed. We note that $\alpha^{(1)}$ is -shared for all the genes, while ${\alpha^{(0)}}_g$ is learned independently for -each gene. 
- -The analytic solution of the differential equations to predict -spliced and unspliced gene expression given their parameters is derived by the -authors of scVelo and a theoretical RNA velocity study -[@Bergen2020-pj;@Li2021-qa] and given in -Eqs. \ref{eq-solution-u}-\ref{eq-solution-s2}. -\begin{align} -u\left(\tau^{\left(k_{c g}\right)}\right) - &= u_0^{\left(k_{c g}\right)}{ }_g e^{-\beta_g \tau^{\left(k_{c g}\right)}} - \nonumber \\ -&\hskip -24pt + \frac{\alpha^{\left(k_{c g}\right)}} - {\beta_g}\left(1-e^{-\beta_g \tau^{\left(k_{c g}\right)}}\right) - \label{eq-solution-u}\\ -s\left(\tau^{\left(k_{c g}\right)}\right) - &= s_0^{\left(k_{c g}\right)} e^{-\gamma_g \tau^{\left(k_{c g}\right)}} - \nonumber \\ - &\hskip -24pt + \frac{\alpha^{\left(k_{c g}\right)}}{\gamma_g} - \left(1-e^{-\gamma_g \tau^{\left(k_{c g}\right)}}\right) - \nonumber\\ - &\hskip -24pt + \frac{\alpha^{\left(k_{c g}\right)}-\beta_g u_0^{\left(k_{c g}\right)}} - {\gamma_g-\beta_g}\left(e^{-\gamma_g \tau^{\left(k_{c g}\right)}} - -e^{-\beta_g \tau^{\left(k_{c g}\right)}}\right), - \nonumber \\ - &\qquad \beta \neq \gamma \label{eq-solution-s} \\ -s\left(\tau^{\left(k_{c g}\right)}\right) - &= {s_0^{\left(k_{c g}\right)}}_g e^{-\beta_g \tau^{\left(k_{c g}\right)}} \nonumber \\ - &\hskip -24pt +\frac{\alpha^{\left(k_{c g}\right)}}{\beta_g} - \left(1-e^{-\beta_g \tau^{(k c g)}}\right) - \nonumber \\ - &\hskip -24pt -\left(\alpha^{\left(k_{c g}\right)} - -\beta_g u_0^{\left(k_{c g}\right)}{ }_g\right) \tau^{\left(k_{c g}\right)} - e^{-\beta_g \tau^{\left(k_{c g}\right)}}, \nonumber \\ - &\qquad \beta = \gamma \label{eq-solution-s2} -\end{align} +activated, $t^{(1)}_{0,g}$ is the second switching time corresponding to when +the gene expression starts to be repressed, and is determined by the first +switching time and the gene-specific switching time $\Delta \text{switching}_g$. 
+The cell-gene-specific activation state $k_{cg}$ is a Bernoulli random variable +with logits equal to the difference between the cell's shared time $t_c$ and the time $t^{(1)}_{0,g}$ when the gene expression starts to be repressed. -To simplify these equations, consider the case when $k_{cg} = 0$ and -$\beta_g \neq \gamma_g$. Then, -\begin{align} -u\left(\tau^{(0)}\right) &= {u_0^{(0)}}_g e^{-\beta_g \tau^{(0)}} \nonumber \\ - & \hskip -24pt + \frac{{\alpha^{(0)}}_g}{\beta_g}\left(1-e^{-\beta_g \tau^{(0)}}\right), - \label{eq-sol-usimp} \\ -s\left(\tau^{(0)}\right) &= s_0^{(0)}{ }_g e^{-\gamma_g \tau^{(0)}} \nonumber \\ - & \hskip -24pt +\frac{{\alpha^{(0)}}_g}{\gamma_g}\left(1-e^{-\gamma_g \tau^{(0)}}\right) - \nonumber\\ - & \hskip -24pt +\frac{{\alpha^{(0)}}_g-\beta_g {u_0^{(0)}}_g}{\gamma_g-\beta_g} - \left(e^{-\gamma_g \tau^{(0)}}-e^{-\beta_g \tau^{(0)}}\right). \label{eq-sol-ssimp} -\end{align} -When $k_{cg} = 0$ and $\beta_g = \gamma_g$, then $u\left(\tau^{(0)}\right)$ -has the same solution, and $s\left(\tau^{(0)}\right)$ becomes -\begin{align} -s\left(\tau^{(0)}\right) &= s_0^{(0)}{ }_g e^{-\gamma_g \tau^{(0)}} \nonumber \\ - & \hskip -24pt +\frac{{\alpha^{(0)}}_g}{\gamma_g} - \left(1-e^{-\gamma_g \tau^{(0)}}\right) \nonumber\\ - & \hskip -24pt - \left( {\alpha^{(0)}}_g-\beta_g {u_0^{(0)}}_g \right) - \tau^{(0)} e^{-\beta_g \tau^{(0)}}. \label{eq-sol-ssimp2} -\end{align} -When $k_{cg} = 1$ and $\beta_g \neq \gamma_g$, then + +Next we introduce the priors for the splicing parameters (where the activation rate $\alpha$ depends on the activation state $k_{cg}$ from above): \begin{align} -u\left(\tau^{(1)}\right) &=u_0^{(1)} g^{e^{-\beta_g \tau^{(1)}}}, \\ -s\left(\tau^{(1)}\right) &=s_0^{(1)} e^{-\gamma_g \tau^{(1)}} \nonumber \\ - & \hskip -24pt +\frac{-\beta_g u_0^{(1)}}{\gamma_g-\beta_g} - \left(e^{-\gamma_g \tau^{(1)}}-e^{-\beta_g \tau^{(1)}}\right). 
+ \alpha^{(0)}_g &\sim \text{LogNormal}(0, 1) \\ + \beta_g &\sim \text{LogNormal}(0, 1) \\ + \gamma_g &\sim \text{LogNormal}(0, 1) \\ + \alpha_{cg} &= \begin{cases} + \alpha^{(0)}_g & \text{if } k_{cg} = 0 \\ + 0 & \text{if } k_{cg} = 1. + \end{cases} \end{align} -When $k_{cg} = 1$ and $\beta_g = \gamma_g$, then $u\left(\tau^{(1)}\right)$ -has the same solution, and $s\left(\tau^{(1)}\right)$ becomes + +Now, we describe the priors for the initial conditions, noting that this is the only difference between Model 1 and Model 2: \begin{align} -s\left(\tau^{(1)}\right)=s_0^{(1)}{ }_g e^{-\gamma_g \tau^{(1)}} - +\beta_g u_0^{(1)}{ }_g \tau^{(1)} e^{-\beta_g \tau^{(1)}}. + \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} &\sim \begin{cases} + (0, 0) & \text{Model 1} \\ + (\text{LogNormal}(0, 1), \text{LogNormal}(0, 1)) & \text{Model 2} + \end{cases} \\ + u^{(0)}_{cg}, s^{(0)}_{cg} &= \begin{cases} + \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} & \text{if } k_{cg} = 0 \\ + \textrm{ODESolve}\Big( \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg}, \alpha^{(0)}_g, \beta_g, \gamma_g; \ T_0=0, T_1=\Delta \textrm{switching}_g \Big) & \text{if } k_{cg} = 1 + \end{cases} \end{align} -We use these solutions to formulate an end-to-end probabilistic generative model -that relates prior distributions on kinetic parameters to a distribution on pairs -of observed unspliced and spliced read count matrices +We define the ODE solution at time $\tau_{cg}$ as: +\begin{equation} + \label{eq-ODE-solve} + \hat{u}_{cg}, \hat{s}_{cg} = \text{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big). 
+\end{equation} +Next, we define the observation model that gives rise to the observed counts as: \begin{align} -\alpha^{(0)}{ }_g &\sim \operatorname{LogNormal}(0,1), \\ -\beta_g &\sim \operatorname{LogNormal}(0,1), \\ -\gamma_g &\sim \operatorname{LogNormal}(0,1), \\ -&\hskip -18pt \Delta \text { switching }_g \sim \operatorname{LogNormal}(0,1), \\ -t_0^{\left(k_{c g}\right)} &= \left\{ - \begin{array}{l} - t_0^{(0)}{ }_g \sim \operatorname{Normal}(0,1), k_{c g}=0 \\ - t_0^{(1)}{ }_g=t_0^{(0)}{ }_g+\Delta \text { switching }_g, \\ - \quad k_{c g}=1 - \end{array}\right. \\ -t_c &\sim \operatorname{LogNormal}(0,1), \\ -k_{c g} &\sim \text{Bernoulli} \left( \text{logits}= t_c-t_0^{(1)} \right), \\ -\tau^{\left(k_{c g}\right)} - &= \operatorname{softplus}\left(t_c-t_0^{\left(k_{c g}\right)}{ }_g\right), \\ -u_{c g} - &= \text { Measurement }_u \left( u\left(\tau^{\left(k_{c g}\right)}\right) ; - u_{c g}^{obs}\right), \\ -s_{c g} - &= \text { Measurement }_s \left( s\left(\tau^{(k_{c g})}\right) ; - s_{c g}^{obs}\right). -\end{align} -$u\left(\tau^{\left(k_{c g}\right)}\right)$ and $s\left(\tau^{(k_{c g})}\right)$ are -are called the denoised gene expression calculated from the velocity analytic -solution input with the kinetics random variables. $u_{cg}$ and $s_{cg}$ are the spliced -and unspliced read count sampled from the Poisson models. $u_{cg}^{obs}$ and $s_{cg}^{obs}$ are -the observed spliced and unspliced read count tables. 
The generative process - -$\text{Measurement}(\cdot)$ for observed unspliced read counts given denoised unspliced -gene transcript expression level $u\left(\tau^{(k_{cg})}\right)$ -(and identical for observed spliced read counts) models the expected number of observed -reads for a given gene in a given cell as the number of transcripts times the ratio of -the cell's total reads to total transcripts -\begin{align} -u_c^{\hat{obs}} &= \sum_g u_{c g}^{obs}, \\ -\hat{u}_c &= \sum_g u\left( \tau^{(k_{c g})}\right), \\ -\eta_c^{(u)} &\sim \operatorname{Normal}\left( - u_c^{\hat{obs}_c}, - \operatorname{std} \left(u_c^{\hat{obs}}\right) - \right), \\ -\mu_{c g}^{(u)} &= \log \left(u\left(\tau^k{ }_{c g}\right)\right) - +\log \left(\eta_c^{(u)}\right)-\log \left(\hat{u}_c\right), \\ -u_{c g}^{obs} &\sim - \operatorname{Poisson}\left(\lambda=\exp \left(\mu_{c g}^{(u)}\right)\right). -\end{align} - -For the first Pyro-Velocity model (Model 1), we constrain the shared time to be strictly -larger than $t_{0}^{(0)}$ by introducing auxiliary random variables -$$ -\text{t\_constraint}_{cg} - \sim \text{Bernoulli} \left( \text{logits} = t_c - {t_{0}^{(0)}}_g \right), -$$ -and setting their values to $1$, and we set the initial condition per gene to be -\begin{align} -\left( {u_{0}^{(k_{cg})}}_g , {s_{0}^{(k_{cg})}}_g \right) &= \left\{ - \begin{array}{l} - (0,0), k_{c g}=0 \\ - \bigg( {u \left( \Delta \text { switching }_g \right)}_g,\\ - \quad {s \left( \Delta \text { switching }_g \right)}_g \bigg), \\ - \quad k_{c g}=1 - \end{array}\right. -\end{align} -For the extended Pyro -Velocity model (Model 2), we remove the shared -time constraint $\text{t\_constraint}_{cg}$, thus allowing a time lag per gene -that might be caused by delayed gene activation and set the initial condition -per gene as random variables that are strictly positive $\left({u_{0}^{(0)}}_g, -{s_{0}^{(0)}}_g\right)$, which allow genes having a basal expression level before gene -activation. 
Then, we compute the gene expression at the second switching time as -\begin{align} -({u_{0}^{(1)}}_g, {s_{0}^{(1)}}_g) &= - \bigg( {u \left( \Delta \text { switching }_g \right)}_g, \nonumber \\ -& \qquad {s \left( \Delta \text { switching }_g \right)}_g \bigg), -\end{align} -which shares the same initial condition $\left({u_{0}^{(0)}}_g, {s_{0}^{(0)}}_g\right)$ where -\begin{align} -{u_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1),\\ -{s_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1). + \mu^{(u)}_c &= \sum_{g=1}^G {u}^{\text{(obs)}}_{cg}, \quad \mu^{(s)}_c = \sum_{g=1}^G {s}^{\text{(obs)}}_{cg} \\ + \sigma^{(u)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( u_{cg}^{\text{(obs)}} - \mu^{(u)}_c \right)^2} \\ + \sigma^{(s)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( s_{cg}^{\text{(obs)}} - \mu^{(s)}_c \right)^2} \\ + \eta^{(u)}_c &\sim \text{Normal}\Big(\mu^{(u)}_c, \ \sigma^{(u)}_c\Big) \\ + \eta^{(s)}_c &\sim \text{Normal}\Big(\mu^{(s)}_c, \ \sigma^{(s)}_c\Big) \\ + \hat{\mu}^{(u)}_c &= \sum_{g=1}^G \hat{u}_{cg}, \quad \hat{\mu}^{(s)}_c = \sum_{g=1}^G \hat{s}_{cg} \\ + \lambda^{(u)}_{cg} &= \log(\hat{u}_{cg}) + \log(\eta^{(u)}_{c}) - \log(\hat{\mu}^{(u)}_c) \\ + \lambda^{(s)}_{cg} &= \log(\hat{s}_{cg}) + \log(\eta^{(s)}_{c}) - \log(\hat{\mu}^{(s)}_c) \\ + \hat{u}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(u)}_{cg})\Big) \label{eq-u-hat-obs} \\ + \hat{s}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(s)}_{cg})\Big) \label{eq-s-hat-obs} \end{align} +Here, we use ${u}^{\text{(obs)}}_{cg}, {s}^{\text{(obs)}}_{cg}$ to denote the observed unspliced and spliced counts +for cell $c$ and gene $g$. We use $\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg}$ to +denote our generative model's prediction of these unspliced and spliced expression levels. 
+The generative process for modeling these observed read counts given denoised gene transcript expression level $\hat{u}_{cg}, \hat{s}_{cg}$ considers the expected number of observed reads for a given gene in a given cell as the number of transcripts times the ratio of the cell's total reads to total transcripts. + +The linear ODE in Eq. \ref{eq-rna} has an analytic solution under the assumption that the transcription rate $\alpha(t)$ is piecewise constant, as derived by the authors of scVelo and in a theoretical RNA velocity study +[@Bergen2020-pj;@Li2021-qa]. +For brevity, we present the solution under a constant transcription rate $\alpha(t) = \alpha$, and note that such solutions can be chained across intervals to assemble the solution in the piecewise-constant case. +First, let $C = u_0 - \frac{\alpha}{\beta}$. Then, the solution to the ODE in Eq. \ref{eq-rna} is: +\begin{equation} +\label{eq-rna-solution} +\begin{aligned} + u(t) &= \frac{\alpha}{\beta} + C e^{-\beta t} \\ + s(t) &= \begin{cases} + \frac{\alpha}{\gamma} + \frac{\beta}{\gamma - \beta} C e^{-\beta t} + \Big(s_0 - \frac{\alpha}{\gamma} - \frac{\beta}{\gamma - \beta} C\Big) e^{-\gamma t} & \text{if } \beta \neq \gamma \\ + \frac{\alpha}{\beta} + \Big(s_0 - \frac{\alpha}{\beta} + \beta C t\Big) e^{-\beta t} & \text{if } \beta = \gamma. + \end{cases} +\end{aligned} +\end{equation} +These solutions are the essential building blocks used to compute all cell-gene-specific ODE solutions for the unspliced and spliced expression levels in the generative model. 
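As a sanity check on the closed-form solution in Eq. \ref{eq-rna-solution}, it can be compared against direct numerical integration of Eq. \ref{eq-rna}. The following is a minimal sketch under stated assumptions, not the Pyro-Velocity implementation: the function names `rna_solution`, `rk4_reference`, and `two_phase` are hypothetical, and the second phase simply chains the constant-rate solution with $\alpha = 0$ after an activation interval of length $\Delta \text{switching}_g$, as the `ODESolve` composition in the generative model suggests.

```python
import math


def rna_solution(t, alpha, beta, gamma, u0, s0):
    """Closed-form (u(t), s(t)) for a constant transcription rate alpha."""
    c = u0 - alpha / beta
    u = alpha / beta + c * math.exp(-beta * t)
    if math.isclose(beta, gamma):
        # Degenerate case beta == gamma
        s = alpha / beta + (s0 - alpha / beta + beta * c * t) * math.exp(-beta * t)
    else:
        k = beta * c / (gamma - beta)
        s = (
            alpha / gamma
            + k * math.exp(-beta * t)
            + (s0 - alpha / gamma - k) * math.exp(-gamma * t)
        )
    return u, s


def rk4_reference(t_end, alpha, beta, gamma, u0, s0, steps=2000):
    """Fixed-step RK4 integration of du/dt = alpha - beta*u, ds/dt = beta*u - gamma*s."""

    def rhs(u, s):
        return alpha - beta * u, beta * u - gamma * s

    h = t_end / steps
    u, s = u0, s0
    for _ in range(steps):
        k1 = rhs(u, s)
        k2 = rhs(u + 0.5 * h * k1[0], s + 0.5 * h * k1[1])
        k3 = rhs(u + 0.5 * h * k2[0], s + 0.5 * h * k2[1])
        k4 = rhs(u + h * k3[0], s + h * k3[1])
        u += h / 6.0 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        s += h / 6.0 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return u, s


def two_phase(tau, dt_switch, alpha0, beta, gamma, u_init, s_init):
    """Hypothetical helper: activation for dt_switch, then repression (alpha = 0) for tau."""
    u_sw, s_sw = rna_solution(dt_switch, alpha0, beta, gamma, u_init, s_init)
    return rna_solution(tau, 0.0, beta, gamma, u_sw, s_sw)
```

Calling `rna_solution` and `rk4_reference` with the same arguments should agree to high precision for both $\beta \neq \gamma$ and $\beta = \gamma$. Note that when $\beta$ and $\gamma$ are close but not equal, the factor $\frac{\beta}{\gamma - \beta}$ can lose precision, so a practical implementation may need a tolerance-based switch between the two branches.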
## Variational inference {#sec-methods-inference} -Given observations $\tilde{X}_{cg} = \left( u_{cg}^{obs}, s_{cg}^{obs} \right)$, -we would like to compute the posterior distribution over the random variables +Given observations $\tilde{X}_{cg} = \left( u_{cg}^{\text{(obs)}}, s_{cg}^{\text{(obs)}} \right)$ over cells $1 \dots C$ and genes $1 \dots G$, +we would like to compute the posterior distribution over the random variables $\theta := \left( \theta_1, \ \dots, \theta_C \right)$ and $\psi := \left( \psi_1, \ \dots, \psi_G \right)$, where \begin{align} -\theta &= \left( t_{c}, \eta_{c}^{(u)}, \eta_{c}^{(s)} \right), \\ -\phi &= \left( {t_{0}^{(0)}}_g, \Delta \text{switching}_{g}, - {\alpha^{(0)}}_{g}, \beta_{g}, \gamma_{g} \right), +\theta_c &= \left( t_{c}, \eta_{c}^{(u)}, \eta_{c}^{(s)} \right), \\ +\psi_g &= \left( {t_{0,g}^{(0)}}, \Delta \text{switching}_{g}, + \alpha^{(0)}_{g}, \beta_{g}, \gamma_{g} \right), \end{align} but exact Bayesian inference is intractable in this model. We use Pyro to automatically integrate out the local discrete latent variables $k$, which is @@ -747,18 +687,20 @@ machine with an NVIDIA A100 GPU and the CentOS 7 operating system. ## Posterior prediction {#sec-methods-posterior-prediction} To benchmark Pyro -Velocity performance in predicting cell fate, we -generated the posterior samples measurement -$x_{cg}=\left(u\left(\tau^{(k_{cg})}\right), -s\left(\tau^{(k_{cg})}\right)\right)$ or $x_{cg}=\left(u_{cg}, s_{cg} \right)$, -$t_c$, $\beta_g$, $\gamma_g$ from the same single cell RNA-seq using $N=30$ -Monte Carlo samples from the posterior predictive distribution following -$p(x_{cg} \vert \theta, \psi) p(\theta, \psi) \approx p(x_{cg} \vert \theta, -\psi) q_{\phi}(\theta, \psi)$. $x_{cg}$ can be posterior samples of either -denoised gene expression (used for phase portraits and vector field-based -trajectory inference) or raw read counts (used for gene selection). 
Then we -calculated the posterior samples of RNA velocity as -$v_{cg}=\frac{ds(\tau^{(k_{cg})})}{d \tau^{(k{cg})}}$ based on posterior samples -measurement of denoised gene expression $x_{cg}$ and $\beta_{g}$, $\gamma_{g}$. +generated posterior samples of quantities of interest in the model. +In particular, we focus on posterior samples of $i)$ spliced and unspliced gene expression across all cells and genes, defined in Eq. \ref{eq-ODE-solve} and denoted by +$X_{cg}=\left(u_{cg}, s_{cg} \right)$ (used for phase portraits and vector-field-based +trajectory inference), +$ii)$ read counts, defined in Eqs. \ref{eq-u-hat-obs} and \ref{eq-s-hat-obs} and denoted by +$\hat{X}^{\text{(obs)}}_{cg}=\left(\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg} \right)$ +(used for gene selection), +and +$iii)$ RNA velocity, defined in Eq. \ref{eq-rna-cg} as the time derivative of the spliced expression, denoted by +$v_{cg}=\frac{ds_{cg}}{d t}(X_{cg})$. +We generate 30 Monte Carlo samples from the posterior predictive distribution +$X, \hat{X}^{(\text{obs})}, v \sim p(X, \hat{X}^{(\text{obs})}, v \vert \theta, \psi) p(\theta, \psi) \approx q_{\phi}(\theta, \psi)$ +by sampling the posterior distribution of $\theta$, $\psi$ using the trained model. +We use these samples to compute statistics on the posterior predictive distribution of our quantities of interest; these statistics are used to evaluate the model's performance and uncertainty in predicting cell fate. ## Prioritization of cell fate markers {#sec-methods-fate-markers} @@ -767,9 +709,7 @@ correlation between each gene's posterior mean of the denoised spliced expression and posterior mean of the shared time; Second, the negative mean absolute errors of each gene's observed spliced, and unspliced read counts with posterior predictive samples of spliced and unspliced raw read counts, i.e. 
-$\frac{1}{N n_c} \sum_i \sum_c (x_{icg}-\tilde{X}_{cg})$, where $N$ is the -posterior sample number that is set to $30$, $i$ is the posterior sample index, -and $n_c$ is the cell number. We first select the top $300$ genes with the +$\mathcal{E}_g = \frac{1}{NC} \sum_{i=1}^N \sum_{c=1}^{C} \vert \tilde{X}_{cg} - \hat{X}_{cg}^{(\text{obs}), (i)} \vert$, where we use $N=30$ Monte Carlo samples. We first select the top $300$ genes with the highest negative mean absolute errors and then rank the $300$ genes based on the most positively correlated genes and the least negatively correlated genes. We use the same strategy for scVelo to rank the markers by the model likelihood to @@ -908,8 +848,9 @@ is state uncertainty which evaluates the mean Euclidean distance between posterior mean and posterior samples of the raw read count prediction for each cell \begin{align*} -\frac{1}{N} \sum_{i} \sqrt{\sum_{g} \left(x_{icg} - \frac{1}{N}\sum_i x_{icg}\right)^2}. +\mathcal{U}_c = \frac{1}{N} \sum_{i=1}^N \sqrt{\sum_{g=1}^G \left(\hat{X}^{\text{(obs)},(i)}_{cg} - \frac{1}{N}\sum_{i=1}^N \hat{X}^{\text{(obs)},(i)}_{cg}\right)^2} \end{align*} +with $N=30$ Monte Carlo samples. ## Cospar {#sec-methods-cospar} @@ -956,11 +897,11 @@ to use for a given dataset. Namely, the mean absolute errors of posterior samples prediction of spliced and unspliced raw read counts $x_{icg}$ from its observation $\tilde{X}_{cg}$ per gene and cell combinations \begin{align*} -\frac{1}{N_i N_c N_g} \sum_{i=1}^{N_i} \sum_{c=1}^{N_c} \sum_{g=1}^{N_g} - \vert x_{icg} - \tilde{X}_{cg} \vert, +\frac{1}{N C G} \sum_{i=1}^{N} \sum_{c=1}^{C} \sum_{g=1}^{G} + \vert \hat{X}^{\text{(obs)},(i)}_{cg} - \tilde{X}_{cg} \vert, \end{align*} -where $i$ is the posterior sample index, $N_i$ is the posterior sample number, -$N_c$ is the cell number, $N_g$ is the gene number. Then it is possible to +where $i$ is the posterior sample index, $N$ is the posterior sample number, +$C$ is the cell number, $G$ is the gene number. 
Then it is possible to evaluate the cell fate inference performance of the models based on two metrics: 1. the consistency between the velocity vector field and the *clonal progression diff --git a/reproducibility/manuscript/v2.tex b/reproducibility/manuscript/v2.tex index b69d93566..45996fbdb 100644 --- a/reproducibility/manuscript/v2.tex +++ b/reproducibility/manuscript/v2.tex @@ -156,6 +156,8 @@ \raggedbottom \usepackage{nameref} \usepackage{placeins} +\usepackage[utf8]{inputenc} +\usepackage{fontspec} \usepackage[labelfont=bf]{caption} % \usepackage{showframe} % \usepackage{layout} @@ -661,34 +663,133 @@ \subsection{Model formulation}\label{sec-methods-model} We assume the dynamical gene expression is determined by the RNA splicing process, and infer the unspliced and spliced gene expression -level from the differential equations proposed in velocyto -\citep{La_Manno2018-lj} and scVelo \citep{Bergen2020-pj} \begin{align} -\frac{d u\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}} - &= \alpha^{\left(k_{cg}\right)}-\beta_g u\left(\tau^{\left(k_{cg}\right)}\right), - \label{eq-dudt}\\ -\frac{d s\left(\tau^{\left(k_{cg}\right)}\right)}{d \tau^{\left(k_{cg}\right)}} - &= \beta_g u\left(\tau^{\left(k_{cg}\right)}\right) - -\gamma_g s\left(\tau^{\left(k_{cg}\right)}\right). \label{eq-dsdt} +level from the ordinary differential equations (ODEs) proposed in +velocyto \citep{La_Manno2018-lj} and scVelo \citep{Bergen2020-pj} +\begin{align} +\frac{du}{dt} &= \alpha(t) - \beta u, \quad u(0) = u_0 \label{eq-dudt}\\ + \frac{ds}{dt} &= \beta u - \gamma s, \quad s(0) = s_0, \label{eq-dsdt} +\end{align} where \(u(t), s(t)\) are the unspliced and spliced +expression levels of a gene at time \(t\) under a transcription rate +\(\alpha(t)\) with possible temporal dependence, splicing rate +\(\beta\), and degradation rate \(\gamma\). 
We specify this model to a +setting that depends on cell \(c\) and gene \(g\) as follows: +\begin{align} +\frac{du_{cg}}{dt} &= \alpha_{cg}(t) - \beta_{g} u_{cg}, \quad u_{cg}(0) = u_{cg}^{(0)} \label{eq-dudt}\\ + \frac{ds_{cg}}{dt} &= \beta_{g} u_{cg} - \gamma_{g} s_{cg}, \quad s_{cg}(0) = s_{cg}^{(0)} \label{eq-dsdt}. \end{align} In the equation, the subscript \(c\) is the cell dimension, -\(g\) is the gene dimension, -\(\left( u\left( \tau^{(k_{cg})} \right), s\left( \tau^{(k_{cg})} \right) \right)\) -are the unspliced and spliced expression functions given the change of -time per cell and gene. \(\tau_{cg}\) represents the displacement of -time per cell and gene with \begin{align} - \tau^{(k_{cg})} &= \operatorname{softplus} \left( t_{c} - {t_{0}^{(k_{cg})}}_g \right) \\ - & = \log( 1 + \exp (t_c - {t_{0}^{(k_{cg})}}_g)), -\end{align} in which \(t_c\) is the shared time per cell, -\({t_{0}^{(kcg)}}_g\) is the gene-specific switching time. Each cell and -gene combination has its transcriptional state +\(g\) is the gene dimension, \(\left( u_{cg}(t), s_{cg}(t) \right)\) are +the unspliced and spliced expression functions given the change of time +per cell and gene. We restrict attention to piecewise-constant +\(\alpha_{cg}(t)\) to capture gene-specific activation and repression. +We take special care to model a gene- and cell-specific switching time +that marks a single transition from activation to repression by +introducing a Bernoulli variable \(k_{cg}\) to model unknown activation +state. We assume our cell-by-gene data-matrix arrive as observations of +Poisson-counts related to the solution of the above ODEs at unknown +times \(\tau_{cg}\), which is modeled as a relationship between an +unknown latent time shared across each cell, \(t_c\), and unknown +gene-specific time-offsets \(t_{0,g}\) where all read counts for a +single cell occurred at an unknown, but shared latent time \(t_c\). 
+These relative times are also used to parametrize the Bernoulli process +for \(k_{cg}\). Importantly, we recognize that the initial conditions +are in fact unknown. + +We propose and study two models: Model 1 assumes that spliced and +unspliced concentrations are both 0 at time 0; Model 2 considers these +initial conditions as unknowns with a log-Normal prior distribution. In +general, the solution space of ODEs becomes much richer when considered +over a domain of initial conditions (as opposed to a single point); +indeed, this affords Model 2 much greater expressivity. For clarity, we +first present the generative framework for both models, then provide +further interpretation and intuition. + +First, we introduce the generative model that describes the various +unobserved times: \begin{align} + % unit lognormal t_c + t_c &\sim \text{LogNormal}(0, 1) \\ + % gene-specific t_0 + t^{(0)}_{0,g} &\sim \text{LogNormal}(0, 1) \\ + % switching time + \Delta \textrm{switching}_g &\sim \text{LogNormal}(0, 1) \\ + % gene-specific t_1 + t^{(1)}_{0,g} &= t^{(0)}_{0,g} + \Delta \textrm{switching}_g \\ + %cell-gene-specific activation state + k_{cg} &\sim \text{Bernoulli}(\textrm{logits}=t_c - t^{(1)}_{0,g}) \\ + % cell-gene-specific latent time + \tau_{cg} &= \text{softplus}(t_c - t^{(k_{cg})}_{0,g}). +\end{align} Here, \(\tau_{cg}\) represents the displacement of time per +cell and gene with \begin{align} + \text{softplus}(t) := \log( 1 + e^t). +\end{align} Recall that \(t_c\) is the shared time per cell, +\(t^{(k_{cg})}_{0,g}\) is the gene-specific switching time. Each cell +and gene combination has its transcriptional state \(k_{cg} \in \{ 0, 1 \}\), where \(0\) indicates the activation state and \(1\) indicates the expression state. 
Each gene has two switching -times for representing activation and repression: \({t_{0}^{(0)}}_g\) is +times for representing activation and repression: \(t^{(0)}_{0,g}\) is the first switching time corresponding to when the gene expression -starts to be activated, \({t_0^{(1)}}_g\) is the second switching time -corresponding to when the gene expression starts to be repressed. We -note that \(\alpha^{(1)}\) is shared for all the genes, while -\({\alpha^{(0)}}_g\) is learned independently for each gene. +starts to be activated, \(t^{(1)}_{0,g}\) is the second switching time +corresponding to when the gene expression starts to be repressed, and is +determined by the first switching time and the gene-specific switching +time \(\Delta \text{switching}_g\). The cell-gene-specific activation +state \(k_{cg}\) is a Bernoulli random variable with logits equal to the +difference between the cell's shared time \(t_c\) and the time +\(t^{(1)}_{0,g}\) when the gene expression starts to be repressed. + +Next we introduce the priors for the splicing parameters (where the +activation rate \(\alpha\) depends on the activation state \(k_{cg}\) +from above): \begin{align} + \alpha^{(0)}_g &\sim \text{LogNormal}(0, 1) \\ + \beta_g &\sim \text{LogNormal}(0, 1) \\ + \gamma_g &\sim \text{LogNormal}(0, 1) \\ + \alpha_{cg} &= \begin{cases} + \alpha^{(0)}_g & \text{if } k_{cg} = 0 \\ + 0 & \text{if } k_{cg} = 1 + \end{cases} +\end{align} +\textbf{Note that $\alpha^{(1)}$ is shared for all the genes, while ${\alpha^{(0)}}_g$ is learned independently for +each gene. 
MATT: this was in the old text, but I think $\alpha^{(1)}$ is no longer used based on conversations with Alvin?} + +Now, we describe the priors for the initial conditions, noting that this +is the only difference between Model 1 and Model 2: \begin{align} + \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} &\sim \begin{cases} + (0, 0) & \text{Model 1} \\ + (\text{LogNormal}(0, 1), \text{LogNormal}(0, 1)) & \text{Model 2} + \end{cases} \\ + u^{(0)}_{cg}, s^{(0)}_{cg} &= \begin{cases} + \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg} & \text{if } k_{cg} = 0 \\ + \textrm{ODESolve}\Big( \hat{u}^{(0)}_{cg}, \hat{s}^{(0)}_{cg}, \alpha^{(0)}_g, \beta_g, \gamma_g; \ T_0=0, T_1=\Delta \textrm{switching}_g \Big) & \text{if } k_{cg} = 1 + \end{cases} +\end{align} +We define the ODE solution at time \(\tau_{cg}\) as: \begin{equation} + \hat{u}_{cg}, \hat{s}_{cg} = \text{ODESolve}\Big( u^{(0)}_{cg}, s^{(0)}_{cg}, \alpha_{cg}, \beta_g, \gamma_g; \ T_0=0, T_1=\tau_{cg} \Big). +\end{equation} + +Next, we define the observation model that gives rise to the observed +counts as: \begin{align} + \mu^{(u)}_c &= \sum_{g=1}^G {u}^{\text{(obs)}}_{cg}, \quad \mu^{(s)}_c = \sum_{g=1}^G {s}^{\text{(obs)}}_{cg} \\ + \sigma^{(u)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( u_{cg}^{\text{(obs)}} - \mu^{(u)}_c \right)^2} \\ + \sigma^{(s)}_c &= \sqrt{\frac{1}{G} \sum_{g=1}^G \left( s_{cg}^{\text{(obs)}} - \mu^{(s)}_c \right)^2} \\ + \eta^{(u)}_c &\sim \text{Normal}\Big(\mu^{(u)}_c, \ \sigma^{(u)}_c\Big) \\ + \eta^{(s)}_c &\sim \text{Normal}\Big(\mu^{(s)}_c, \ \sigma^{(s)}_c\Big) \\ + \hat{\mu}^{(u)}_c &= \sum_{g=1}^G \hat{u}_{cg}, \quad \hat{\mu}^{(s)}_c = \sum_{g=1}^G \hat{s}_{cg} \\ + \lambda^{(u)}_{cg} &= \log(\hat{u}_{cg}) + \log(\eta^{(u)}_{c}) - \log(\hat{\mu}^{(u)}_c) \\ + \lambda^{(s)}_{cg} &= \log(\hat{s}_{cg}) + \log(\eta^{(s)}_{c}) - \log(\hat{\mu}^{(s)}_c) \\ + \hat{u}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp (\lambda^{(u)}_{cg})\Big) \\ + \hat{s}^{\text{(obs)}}_{cg} &\sim \text{Poisson}\Big(\exp 
(\lambda^{(s)}_{cg})\Big) +\end{align} Here, we use +\({u}^{\text{(obs)}}_{cg}, {s}^{\text{(obs)}}_{cg}\) to denote the +observed unspliced and spliced counts for cell \(c\) and gene \(g\). We +use \(\hat{u}^{\text{(obs)}}_{cg}, \hat{s}^{\text{(obs)}}_{cg}\) to +denote our generative model's prediction of these unspliced and spliced +expression levels. The generative process for modeling these observed +read counts given denoised gene transcript expression level +\(\hat{u}_{cg}, \hat{s}_{cg}\) considers the expected number of observed +reads for a given gene in a given cell as the number of transcripts +times the ratio of the cell's total reads to total transcripts. +\textbf{Improve descriptions of how noise is modeled in the observation model.} + +\textbf{Need to update the analytic solutions, but first need to confirm the above is correct. Also, I recommend pushing all of the below analytic solutions to the appendix.} The analytic solution of the differential equations to predict spliced and unspliced gene expression given their parameters is derived by the authors of scVelo and a theoretical RNA velocity study @@ -753,89 +854,6 @@ \subsection{Model formulation}\label{sec-methods-model} +\beta_g u_0^{(1)}{ }_g \tau^{(1)} e^{-\beta_g \tau^{(1)}}. \end{align} -We use these solutions to formulate an end-to-end probabilistic -generative model that relates prior distributions on kinetic parameters -to a distribution on pairs of observed unspliced and spliced read count -matrices - -\begin{align} -\alpha^{(0)}{ }_g &\sim \operatorname{LogNormal}(0,1), \\ -\beta_g &\sim \operatorname{LogNormal}(0,1), \\ -\gamma_g &\sim \operatorname{LogNormal}(0,1), \\ -&\hskip -18pt \Delta \text { switching }_g \sim \operatorname{LogNormal}(0,1), \\ -t_0^{\left(k_{c g}\right)} &= \left\{ - \begin{array}{l} - t_0^{(0)}{ }_g \sim \operatorname{Normal}(0,1), k_{c g}=0 \\ - t_0^{(1)}{ }_g=t_0^{(0)}{ }_g+\Delta \text { switching }_g, \\ - \quad k_{c g}=1 - \end{array}\right. 
\\ -t_c &\sim \operatorname{LogNormal}(0,1), \\ -k_{c g} &\sim \text{Bernoulli} \left( \text{logits}= t_c-t_0^{(1)} \right), \\ -\tau^{\left(k_{c g}\right)} - &= \operatorname{softplus}\left(t_c-t_0^{\left(k_{c g}\right)}{ }_g\right), \\ -u_{c g} - &= \text { Measurement }_u \left( u\left(\tau^{\left(k_{c g}\right)}\right) ; - u_{c g}^{obs}\right), \\ -s_{c g} - &= \text { Measurement }_s \left( s\left(\tau^{(k_{c g})}\right) ; - s_{c g}^{obs}\right). -\end{align} \(u\left(\tau^{\left(k_{c g}\right)}\right)\) and -\(s\left(\tau^{(k_{c g})}\right)\) are are called the denoised gene -expression calculated from the velocity analytic solution input with the -kinetics random variables. \(u_{cg}\) and \(s_{cg}\) are the spliced and -unspliced read count sampled from the Poisson models. \(u_{cg}^{obs}\) -and \(s_{cg}^{obs}\) are the observed spliced and unspliced read count -tables. The generative process - -\(\text{Measurement}(\cdot)\) for observed unspliced read counts given -denoised unspliced gene transcript expression level -\(u\left(\tau^{(k_{cg})}\right)\) (and identical for observed spliced -read counts) models the expected number of observed reads for a given -gene in a given cell as the number of transcripts times the ratio of the -cell's total reads to total transcripts \begin{align} -u_c^{\hat{obs}} &= \sum_g u_{c g}^{obs}, \\ -\hat{u}_c &= \sum_g u\left( \tau^{(k_{c g})}\right), \\ -\eta_c^{(u)} &\sim \operatorname{Normal}\left( - u_c^{\hat{obs}_c}, - \operatorname{std} \left(u_c^{\hat{obs}}\right) - \right), \\ -\mu_{c g}^{(u)} &= \log \left(u\left(\tau^k{ }_{c g}\right)\right) - +\log \left(\eta_c^{(u)}\right)-\log \left(\hat{u}_c\right), \\ -u_{c g}^{obs} &\sim - \operatorname{Poisson}\left(\lambda=\exp \left(\mu_{c g}^{(u)}\right)\right). 
-\end{align} - -For the first Pyro-Velocity model (Model 1), we constrain the shared -time to be strictly larger than \(t_{0}^{(0)}\) by introducing auxiliary -random variables \[ -\text{t\_constraint}_{cg} - \sim \text{Bernoulli} \left( \text{logits} = t_c - {t_{0}^{(0)}}_g \right), -\] and setting their values to \(1\), and we set the initial condition -per gene to be \begin{align} -\left( {u_{0}^{(k_{cg})}}_g , {s_{0}^{(k_{cg})}}_g \right) &= \left\{ - \begin{array}{l} - (0,0), k_{c g}=0 \\ - \bigg( {u \left( \Delta \text { switching }_g \right)}_g,\\ - \quad {s \left( \Delta \text { switching }_g \right)}_g \bigg), \\ - \quad k_{c g}=1 - \end{array}\right. -\end{align} For the extended Pyro -Velocity model (Model 2), we remove -the shared time constraint \(\text{t\_constraint}_{cg}\), thus allowing -a time lag per gene that might be caused by delayed gene activation and -set the initial condition per gene as random variables that are strictly -positive \(\left({u_{0}^{(0)}}_g, -{s_{0}^{(0)}}_g\right)\), which allow genes having a basal expression -level before gene activation. Then, we compute the gene expression at -the second switching time as \begin{align} -({u_{0}^{(1)}}_g, {s_{0}^{(1)}}_g) &= - \bigg( {u \left( \Delta \text { switching }_g \right)}_g, \nonumber \\ -& \qquad {s \left( \Delta \text { switching }_g \right)}_g \bigg), -\end{align} which shares the same initial condition -\(\left({u_{0}^{(0)}}_g, {s_{0}^{(0)}}_g\right)\) where \begin{align} -{u_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1),\\ -{s_{0}^{(0)}}_g &\sim \operatorname{LogNormal}(0,1). -\end{align} - \subsection{Variational inference}\label{sec-methods-inference} Given observations