Covariate Likelihood

Igor Siveroni edited this page Jul 14, 2021 · 10 revisions

The CovariateLikelihood component

The CovariateLikelihood class calculates the log-likelihood of a set of data points (a time series) given a population model. It computes the log-likelihood of each data point under the parameterised distribution assigned to that point, and adds these values together.

A Covariate Likelihood element is made of the following components:

  • popmodel: A PopModelODE element, usually a reference to the population model used by the STreeLikelihood component used in the PhyDyn analysis.
  • data : A string made of rows and columns of data in table format, where the first column must correspond to time and the second column to the covariate value of interest.
  • covariate-expression: A mathematical expression used to calculate the covariate value of interest.
  • One of the following,
    • covariate-distribution : The parametric distribution used to calculate the log-likelihood of each data point, OR
    • distribution-expression : A mathematical expression describing the log-density function of the covariate distribution, for cases where the required distribution is not available as a covariate-distribution.

The Covariate Likelihood is usually used as a prior and in conjunction with PhyDyn's tree likelihood component.

Example: Prevalence

Let's say our PhyDyn analysis defines the following SEIR population model (extract shown):

<distribution id="seirmodel" spec="phydyn.distribution.STreeLikelihoodODE" equations="QL" minP="0.0001" stepSize="0.001" penaltyAgtY="0.0" useStateName="true">
   <popmodel id="pdseirmodel.t:algn" spec="phydyn.model.PopModelODE">
       <matrixeq spec="phydyn.model.MatrixEquation" destination="E" origin="Il" type="birth">beta*Il*S / N</matrixeq>
       <matrixeq spec="phydyn.model.MatrixEquation" destination="E" origin="Ih" type="birth">beta*Ih*tau * S / N</matrixeq>
       ...
       <matrixeq spec="phydyn.model.MatrixEquation" origin="infections" type="nondeme">beta*(Ih*tau+Il) * S / N</matrixeq>
       ...
</distribution>

The model above uses two compartments for infected demes, Il and Ih, migrating from compartment E. The model also defines a non-deme variable infections that keeps track of the cumulative number of infections over time, which can be used to calculate prevalence.

Let's assume that we have seroprevalence information (percentage) for two dates, 2020.412 and 2020.6, with confidence intervals that correspond to a standard deviation of 0.5, and that the total population size is 1500000. The covariate likelihood component that calculates the log-likelihood of the prevalence computed from the trajectories of the population model referenced by ID seirmodel, given the two data points, is:

<distribution id="seir.seroprevalencelh.t" spec="phydyn.covariate.CovariateLikelihood" 
    popmodel='@seirmodel' >
    <data>
        time,sp
        2020.412, 6.3
        2020.6, 6.5
    </data>
	<covariate-expression> (infections / 1500000)*100 </covariate-expression>
	<covariate-distribution spec="phydyn.covariate.distribution.Normal"  mean="sp" sigma="0.5"/>  
</distribution>

Note the following:

  • The value of popmodel is a reference to the population model definition introduced before with ID seirmodel.
  • The data element generates 2 data points and, for each row/data-point, binds the values of the first and second column to variables time and sp, respectively.
  • The covariate-expression formula is used to calculate the value of prevalence at any given time point in our trajectory. In our example, the covariate-expression is evaluated twice for each population trajectory, once per data point.
  • Each point is assigned the Normal distribution with sigma 0.5 and mean equal to the value of sp entered in the table e.g. the first point has N(6.3, 0.5).
  • The total log-likelihood is the sum, over all data points, of the log-density of the distribution assigned to each point (e.g. N(6.3, 0.5) for the first row), evaluated at the value obtained by evaluating covariate-expression at that point's time.
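
As a rough illustration of the summation described above, the following Python sketch reproduces the calculation outside PhyDyn. The function names and the two infections values are hypothetical, chosen only to make the arithmetic easy to follow, and the sketch assumes the total population size of 1500000 stated earlier. It is not PhyDyn's implementation.

```python
import math

def normal_logpdf(x, mean, sigma):
    """Log-density of a Normal(mean, sigma) distribution evaluated at x."""
    return -math.log(sigma * math.sqrt(2 * math.pi)) - 0.5 * ((x - mean) / sigma) ** 2

# Observed seroprevalence (%) per time point, as in the data table above.
data = [(2020.412, 6.3), (2020.6, 6.5)]

# HYPOTHETICAL cumulative infections read off one population trajectory
# at the two observation times (illustrative numbers only).
infections_at = {2020.412: 93000.0, 2020.6: 99000.0}

POP_SIZE = 1500000  # total population size assumed in the example
SIGMA = 0.5         # standard deviation shared by both data points

def covariate_loglik(data, infections_at):
    """Sum the point log-likelihoods over all rows of the data table."""
    total = 0.0
    for time, sp in data:
        # Mirrors the covariate-expression: (infections / population size) * 100
        prevalence = infections_at[time] / POP_SIZE * 100
        # Density is N(sp, SIGMA), evaluated at the model-derived prevalence.
        total += normal_logpdf(prevalence, mean=sp, sigma=SIGMA)
    return total

loglik = covariate_loglik(data, infections_at)
print(loglik)
```

Each observed value sp acts as the mean of its own distribution, and the value derived from the trajectory is the point at which the density is evaluated.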

The data component

The data component must be entered in table format. The table must have a header row and at least 2 columns: the first column holds the time values and the second column holds the real-valued covariate observations. The first column must have header t or time.

Header names are important. Header names are used as variable names available during the evaluation of covariate-expression, distribution-expression or available as references used by parameters of covariate-distribution. This feature can be used to add parameters to our model. For example, if we want to assign values of total population size to each data point under variable name popSize, we can define the following:

    <data>
        time,sp, popSize
        2020.412, 6.3, 1500000
        2020.6, 6.5, 1500000
    </data>

The data above will generate variables named time, sp and popSize, which are assigned the values of the current row for each data point. For example, during the evaluation of the likelihood of the second data point, the time, sp and popSize variables will have values 2020.6, 6.5 and 1500000, respectively. The covariate-expression component can then be rewritten as follows:

<covariate-expression> (infections / popSize)*100 </covariate-expression>
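
The binding of headers to per-row variables can be pictured with a small Python sketch. The parse_data_table function is a hypothetical helper written for this illustration, not part of PhyDyn, and it assumes comma-separated columns as in the tables shown on this page.

```python
def parse_data_table(text):
    """Parse a comma-separated table with a header row into one dict per row."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    headers = [h.strip() for h in lines[0].split(",")]
    if len(headers) < 2:
        raise ValueError("the table needs at least a time and a covariate column")
    if headers[0] not in ("t", "time"):
        raise ValueError("the first column header must be 't' or 'time'")
    rows = []
    for ln in lines[1:]:
        values = [float(v) for v in ln.split(",")]
        if len(values) != len(headers):
            raise ValueError("row length does not match the header: " + ln)
        # Each header becomes a variable name bound to this row's value.
        rows.append(dict(zip(headers, values)))
    return rows

TABLE = """
    time,sp, popSize
    2020.412, 6.3, 1500000
    2020.6, 6.5, 1500000
"""

rows = parse_data_table(TABLE)
for row in rows:
    print(row)
```

Each dictionary plays the role of the variable scope that is available while one data point is being evaluated.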

Covariate Expression

The covariate-expression is an arithmetic expression used to calculate the covariate value of interest. Its syntax is identical to the one used to write the matrix equations in our PopModelODE models (syntax).

The value generated by the evaluation of the covariate expression is used to calculate the point covariate likelihood.

The covariate expression can be written using the following variables:

  • The deme and non-deme names defined in our population model (definitions can't be accessed at the moment) e.g. S, I and R from a SIR model.
  • The headers declared in the data component.

For example, the covariate expression (infections / popSize)*100 uses variables infections, a non-deme defined by the seirmodel population model, and popSize, introduced by the table in the data component introduced in the previous section.
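
To make the two sources of variables concrete, here is a minimal Python sketch that evaluates the expression against a combined scope. Python's eval is used purely as a stand-in for PhyDyn's own expression parser, the model state values are hypothetical, and letting row values take precedence on a name clash is an arbitrary choice of this sketch rather than documented behaviour.

```python
def evaluate_covariate(expression, model_state, row):
    """Evaluate an arithmetic expression using model variables and row variables."""
    # Row bindings override model names on a clash (arbitrary choice of this sketch).
    scope = {**model_state, **row}
    # eval with empty builtins stands in for PhyDyn's expression parser.
    return eval(expression.strip(), {"__builtins__": {}}, scope)

# HYPOTHETICAL state of the population model at time 2020.412.
model_state = {"S": 1400000.0, "E": 2000.0, "Il": 3000.0, "Ih": 2000.0,
               "infections": 93000.0}

# Variables introduced by the headers of the data table for the first row.
row = {"time": 2020.412, "sp": 6.3, "popSize": 1500000.0}

value = evaluate_covariate(" (infections / popSize)*100 ", model_state, row)
print(value)
```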

The Covariate Distribution and Point Likelihood

The covariate distribution component declares the distribution to be used to calculate the point covariate likelihood i.e. the likelihood of the computed covariate value.

Currently, the covariate likelihood component only implements the Normal distribution. It has the following syntax:

<covariate-distribution spec="phydyn.covariate.distribution.Normal"  mean="<P>" sigma="<P>"/> 

where <P> is either a numerical constant or a valid variable name e.g. a header name. For example, we can write

<covariate-distribution spec="phydyn.covariate.distribution.Normal"  mean="sp" sigma="0.5"/> 

to assign the normal distribution N(sp,0.5) to each data point, where sp is the value entered in the data table for the row corresponding to the given data point.
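
The rule that a parameter is either a numerical constant or a variable name can be sketched in Python as follows. The resolve_param helper is hypothetical and only illustrates the idea; how PhyDyn resolves parameters internally is not described here.

```python
def resolve_param(p, row):
    """Return p as a number if it parses as one, else look it up as a variable name."""
    try:
        return float(p)      # numerical constant, e.g. "0.5"
    except ValueError:
        return row[p]        # variable name, e.g. a header such as "sp"

# Variables bound for the first row of the data table.
row = {"time": 2020.412, "sp": 6.3}

mean = resolve_param("sp", row)    # taken from the table
sigma = resolve_param("0.5", row)  # constant shared by all rows
print(mean, sigma)
```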

We put everything together in the next section.

Extended Example: Prevalence

Let's now consider the case where the standard deviation values of the normal distributions that describe each data point are provided as data, i.e. as a column in our table named sigma, and that the total population size values per data point (constant 1500000 in our case) are provided in column popSize. The data element looks like this:

    <data>
        time,sp, sigma, popSize
        2020.412, 6.3, 0.45, 1500000
        2020.6, 6.5, 0.6, 1500000
    </data>

The corresponding Covariate Likelihood component is written as follows:

<distribution id="seir.seroprevalencelh.t" spec="phydyn.covariate.CovariateLikelihood" 
    popmodel='@seirmodel' >
	<covariate-expression> (infections / popSize)*100 </covariate-expression>
	<covariate-distribution spec="phydyn.covariate.distribution.Normal"  mean="sp" sigma="sigma"/>  
    <data>
        time,sp, sigma, popSize
        2020.412, 6.3, 0.45, 1500000
        2020.6, 6.5, 0.6, 1500000
    </data>
</distribution>

The Covariate Distribution Expression

It's possible that none of the pre-defined distributions (currently only Normal is available) fits our requirements. If this is the case, we can use the distribution-expression component to explicitly define an ad-hoc distribution by writing the arithmetic expression that calculates (the logarithm of) the density function used to calculate the point covariate likelihoods. To do this we need one extra value: the value computed by covariate-expression. It is provided by an extra variable, <id>Val, where <id> is the header of the second column, i.e. the covariate column.

We can rewrite our example as follows:

<distribution id="seir.seroprevalencelh.t" spec="phydyn.covariate.CovariateLikelihood" 
    popmodel='@seirmodel' >
    <data>
        time, sp, sigma, popSize
        2020.412, 6.3, 0.45, 1500000
        2020.6, 6.5, 0.6, 1500000
    </data>
	<covariate-expression> (infections / popSize)*100 </covariate-expression>
	<distribution-expression> -(log(sigma*sqrt(2*PI))) - 0.5*(spVal - sp)*(spVal-sp)/(sigma*sigma) </distribution-expression>  
</distribution>

where, for each data point (row),

  • sp denotes the value of the second column (covariate data).
  • spVal denotes the value computed by (infections / popSize)*100.
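
As a sanity check, the distribution-expression above can be compared with the log-density of a normal distribution computed by Python's standard library. This sketch assumes that log in the expression is the natural logarithm and that PI is the constant π. The spVal value is hypothetical.

```python
import math
from statistics import NormalDist

def distribution_expression(spVal, sp, sigma):
    """Python transcription of the distribution-expression shown above."""
    PI = math.pi  # assumed meaning of the PI constant in the expression
    return -(math.log(sigma * math.sqrt(2 * PI))) - 0.5 * (spVal - sp) * (spVal - sp) / (sigma * sigma)

# First row of the data table; spVal is a HYPOTHETICAL model-derived prevalence.
sp, sigma = 6.3, 0.45
spVal = 6.2

expr_value = distribution_expression(spVal, sp, sigma)
reference = math.log(NormalDist(mu=sp, sigma=sigma).pdf(spVal))
print(expr_value, reference)
```

Under these assumptions the two values agree up to floating-point error, so the expression matches the Normal covariate-distribution with mean sp and standard deviation sigma.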