
\documentclass[a4paper]{article}
\usepackage[british]{babel}
%\usepackage{fancyhdr}
\usepackage{amsmath}
\usepackage{rotating}
%\pagestyle{fancy}
%\rfoot{DRAFT}
%\rhead{DRAFT}

%\VignetteIndexEntry{egp3}
%\VignetteEngine{knitr}

\begin{document}
\title{The Extended Generalized Pareto Distribution 3 (EGP3) in texmex}
\author{Harry Southworth}
\maketitle%\thispagestyle{empty}
\tableofcontents
\clearpage

<<setup, echo=FALSE>>=
seed <- 20140911
set.seed(seed)
knitr::opts_chunk$set(fig.path='AutoGeneratedFiles/egp3',
                      dev = 'png',dpi=200,warning=FALSE,message=FALSE)
@

\section{Introduction}
Version 2 of the texmex~\cite{texmex} package for R~\cite{R} introduced
the ability to add new families of extreme value distributions to the package
and itself added the Generalized Extreme Value (GEV) distribution, previous
releases having supported modelling only with the Generalized Pareto
Distribution (GPD). We here describe the implementation of a new extreme
value family, the Extended Generalized Pareto Distribution 3 (EGP3) described
by Papastathopoulos and Tawn~\cite{egp}.

The EGP3 family introduces a new parameter, $\kappa > 0$. For the purposes
of numerical stability and avoiding negative values, when modelling data we
work with $\lambda = \log\kappa$.

Papastathopoulos and Tawn~\cite{egp} work more with their EGP1 and EGP2 models
than with EGP3, but state that there are few differences in results between
the three versions. The EGP3 model has the advantage of closed form derivatives
for approximating the standard errors of return levels, which is
one reason why it is preferred here. Section~\ref{sec:egp3} contains some
technical details.

\subsection{Acknowledgements}
This work was partly funded by AstraZeneca. I'm also grateful to Yiannis
Papastathopoulos and Paul Metcalfe for various comments and corrections.
Thanks also to Emmanuel Flachaire for pointing out a bug in an earlier
version of {\tt egp3RangeFit}.

\subsection{Software}
\Sexpr{version$version.string}~\cite{R} will be used for all analyses.
A summary of the R session appears in the Appendix, in the interests of
reproducibility.

\section{Using the EGP3 distribution for extreme value modelling}
The additional parameter, $\kappa$, in the EGP models is allowed to vary
over the positive real line, and in each case a value of $\kappa = 1$ results in
a distribution identical to the GPD. This property suggests a new diagnostic
plot to aid threshold selection: plot the estimated value of $\kappa$ with a
confidence interval over a range of thresholds and select the lowest threshold
for which the confidence interval contains the value $\kappa = 1$. GPD modelling
can then be performed on values above this threshold.
This diagnostic plot provides a useful addition to the standard methods of
examining the values of $\hat\sigma_*$ and $\hat\xi$ over a range of thresholds,
and examining the mean residual life plot as described by Coles~\cite{coles}.
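
The selection rule itself is simple enough to sketch in base R. The helper
{\tt lowestThreshold} and the numbers passed to it are hypothetical and purely
illustrative: they are not output from {\tt egp3RangeFit}, and they do not
describe any real data set.

<<selectsketch, echo=TRUE>>=
# Hypothetical helper: given thresholds and the lower and upper confidence
# limits for kappa at each threshold, return the lowest threshold whose
# interval contains 1 (NA if no interval does).
lowestThreshold <- function(u, lower, upper) {
  covers <- lower <= 1 & upper >= 1
  if (!any(covers)) return(NA_real_)
  min(u[covers])
}

# Made-up thresholds and confidence limits, for illustration only
toy <- data.frame(u     = c(10, 20, 30, 40),
                  lower = c(1.5, 1.2, 0.8, 0.7),
                  upper = c(3.0, 2.5, 2.2, 2.4))
with(toy, lowestThreshold(u, lower, upper))
@

As noted above, the rule is intended to be read alongside the standard
threshold selection plots rather than applied in isolation.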

\subsection{River Nidd example}
Following Papastathopoulos and Tawn, we work with the River Nidd data, producing
the standard threshold selection plots and the new plot based on EGP3.

<<nidd, echo=TRUE, fig.cap="Threshold selection plots for the River Nidd data. The bottom right panel displays $\\hat\\kappa$ with an approximate 95\\% confidence interval. The lowest threshold for which the confidence interval for $\\kappa$ contains 1 is approximately 75, suggesting this as a threshold above which GPD modelling can be performed.",message=FALSE>>=
library(texmex)
library(gridExtra)
g1 <- ggplot(gpdRangeFit(nidd, cov="numeric", umax=90, umin=65, nint=20))
g2 <- ggplot(mrl(nidd))
g3 <- ggplot(egp3RangeFit(nidd, umax=90, umin=65, nint=20))

grid.arrange(g1[[1]],g1[[2]],g2,g3,ncol=2)
@

Figure~\ref{fig:nidd} displays the results. The lowest threshold for which
the confidence interval for $\kappa$ includes 1 is at about 75, suggesting that
GPD models can be used above this level.

\subsection{Pharmaceutical example}
The introduction of $\kappa$ into the distribution suggests that lower thresholds
might be usable for extreme value modelling. We now follow Southworth and
Heffernan~\cite{southHeffMultivariate} in analyzing some clinical trial safety data.
Southworth and Heffernan model all the values of various safety related laboratory
variables above the $70^{th}$ percentile
using GPD models allowing $\hat\xi$ to vary linearly with dose. In each case,
they find a linear relationship with dose, except for bilirubin. Following
Papastathopoulos and Tawn, we revisit the issue of threshold selection, hoping
to find a lower threshold above which EGP3 models can be fit, thus including
more of the available data in the model, increasing the chance of identifying
a dose effect if one exists.

<<liver, echo=TRUE, fig.cap="The EGP3 threshold selection plot for each dose group in the liver data.">>=
library(MASS)
rmod <- rlm(log(TBL.M) ~ log(TBL.B) + as.numeric(dose),
            data=liver, method="MM", c=3.44)
liver$r <- resid(rmod)

p <- lapply(LETTERS[1:4],function(dose){
    ggplot(egp3RangeFit(liver$r[liver$dose == dose],
                        umin=-.5, umax=.135)) +
        ggtitle(paste("\nDose",dose))})

grid.arrange(p[[1]],p[[2]],p[[3]],p[[4]],ncol=2)
@

The plots of $\hat\kappa$ over the range of thresholds in Figure~\ref{fig:liver}
suggest that GPD models ought to fit above a threshold of about 0, the $57^{th}$
percentile, or even a little lower for some doses.

We now pool the residual bilirubin data from all doses and present the standard
threshold selection plots as well as the EGP3 plot. The results appear in
Figure~\ref{fig:liverthresh} and also suggest a threshold of 0 to be appropriate,
gaining us an additional 79 observations compared to the $70^{th}$ percentile
used by Southworth and Heffernan.

<<liverthresh, echo=TRUE, fig.cap="Threshold selection plots for the liver data. The bottom right panel displays our new plot based on the EGP3 distribution.",message=FALSE>>=
p1 <- ggplot(gpdRangeFit(liver$r, nint=20))
p2 <- ggplot(mrl(liver$r))
p3 <- ggplot(egp3RangeFit(liver$r, nint=20))
grid.arrange(p1[[1]],p1[[2]],p2,p3,ncol=2)
@
\clearpage

We now fit the GPD model and produce diagnostic plots (Figure~\ref{fig:gmod}) and a summary of the model.
<<gmod, echo=TRUE, fig.cap="Diagnostic plots for the GPD model with threshold at 0.">>=
gmod <- evm(r, data=liver, th=0, xi=~as.numeric(dose))
ggplot(gmod,span=1)
summary(gmod)
@

The GPD model fit to all values above 0 seems to fit reasonably well, with the diagnostic plots
revealing no great cause for concern. A few points on the QQ-plot fall outside
the simulated envelope, but because these points are correlated, it makes no
sense to choose thresholds such as 5\% or 10\% as hard cut-offs, so the model
fit is not seriously called into question. The summary table reveals there to
be still no evidence of a dose
effect, at least according to the approximate test implied by the {\em t}-value.

\clearpage
We next attempt to model {\em all} the data using the EGP3 distribution.

<<liveregp3, echo=TRUE, fig.cap="Diagnostic plots from fitting the EGP3 distribution to all of the residual bilirubin data.">>=
emod <- evm(r, data=liver, family=egp3, th=min(liver$r - .0001),
            xi=~as.numeric(dose))
ggplot(emod,span=0.8)
summary(emod)
@

In Figure~\ref{fig:liveregp3} there is some noticeable structure in the diagnostic
plots. The simulation test reveals there to be about 65\% of the observations
outside of the simulated envelope, so that the model is a terrible fit to the
data.
\clearpage

We now try a somewhat higher threshold of -0.25.
<<liveregp3_2, echo=TRUE, fig.cap="Diagnostic plots for the EGP3 model when using a threshold of -0.25">>=
emod <- evm(r, data=liver, family=egp3, th=-.25,
            xi=~as.numeric(dose))
ggplot(emod)
summary(emod)
@

We see from the output that the model appears to fit the data, and still there
is no evidence of a dose effect.

\subsection{Discussion}
In the pharmaceutical example, we were able to claw back some additional data into
the model by using the EGP3 distribution, at the expense of an additional
parameter. However, when all of the data were used, the fit was awful. It appears
that the EGP3 distribution's most useful function might be to provide an extra
threshold selection plot, or even a test to decide on a lower threshold. For
modelling extreme values it will, at least in some examples, allow inclusion of
a larger proportion of observations, but some care will need to be taken in
selection of the threshold, and no obvious threshold selection methods for EGP3
are currently available.

\clearpage
\section{\label{sec:egp3}The EGP3 distribution: some technical details}
The EGP3 family introduces a new parameter, $\kappa$, to the familiar
parameters $\sigma$ and $\xi$ in the GPD family. The threshold $u$ remains.
The EGP3 distribution function is then obtained by raising the GPD distribution
function to the power $\kappa > 0$.

\subsection{Distribution function, probability density function and random number generation}

The cumulative distribution function for the EGP3 family is
\begin{equation} \label{eqn:cdf}
F(x) = \begin{cases}
  \left[ 1 - \left[ 1 + \xi \frac{x - u}{\sigma} \right]^{-1/\xi}\right]^\kappa & \xi \ne 0 \\
  \left[ 1 - \exp\left(-\frac{(x-u)}{\sigma}\right)\right]^\kappa & \xi = 0
\end{cases}
\end{equation}

which yields probability density function
\begin{equation}\label{eqn:pdf}
f(x) = \begin{cases}
  \frac{\kappa}{\sigma}\left\{1 - (1 + \xi (x-u)/\sigma)^{-1/\xi}\right\}^{\kappa - 1}(1 + \xi (x-u)/\sigma)^{-1/\xi -1} & \xi \ne 0 \\
  \frac{\kappa}{\sigma}e^{-(x - u)/\sigma}\left(1 - e^{-(x-u)/\sigma}\right)^{\kappa - 1} & \xi = 0.
\end{cases}
\end{equation}

Equations (\ref{eqn:cdf}) and (\ref{eqn:pdf}) are implemented in texmex in the
functions {\tt pegp3} and {\tt degp3}.
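
As a cross-check on (\ref{eqn:cdf}) and (\ref{eqn:pdf}), the two functions can
be sketched in base R. The names {\tt pegp3Sketch} and {\tt degp3Sketch}, their
argument order and the parameter values used below are assumptions made for
this illustration; they are not the texmex implementations, whose interfaces
may differ.

<<cdfsketch, echo=TRUE>>=
# Sketch of the EGP3 distribution function: the GPD distribution function
# raised to the power kappa. xi is treated as 0 when it is very close to 0.
pegp3Sketch <- function(x, kappa, sigma, xi, u = 0) {
  z <- pmax(x - u, 0) / sigma
  G <- if (abs(xi) < 1e-8) {
    1 - exp(-z)
  } else {
    1 - pmax(1 + xi * z, 0)^(-1 / xi)
  }
  G^kappa
}

# Sketch of the EGP3 density; zero outside the support
degp3Sketch <- function(x, kappa, sigma, xi, u = 0) {
  z <- (x - u) / sigma
  w <- 1 + xi * z
  inside <- z > 0 & w > 0
  d <- numeric(length(x))
  zi <- z[inside]
  wi <- w[inside]
  if (abs(xi) < 1e-8) {
    g <- exp(-zi)
    G <- 1 - g
  } else {
    g <- wi^(-1 / xi - 1)
    G <- 1 - wi^(-1 / xi)
  }
  d[inside] <- kappa / sigma * G^(kappa - 1) * g
  d
}

# Integrating the density from u should reproduce the distribution function
num <- integrate(function(x) degp3Sketch(x, kappa = 2, sigma = 1.5,
                                         xi = 0.2, u = 1),
                 lower = 1, upper = 4)$value
all.equal(num, pegp3Sketch(4, kappa = 2, sigma = 1.5, xi = 0.2, u = 1),
          tolerance = 1e-4)
@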

Inversion of (\ref{eqn:cdf}) at a probability $x \in (0, 1)$ yields the quantile

\begin{eqnarray}
z = \left\{ \arraycolsep=1.4pt\def\arraystretch{2.2} % increase vertical spacing
    \begin{array}{ll}
    u + \frac{\sigma}{\xi}\left[\left(1 - x^{1/\kappa}\right)^{-\xi} - 1\right] & \xi \ne 0 \label{eqn:rgn}\\
    u - \sigma \log\left(1 - x^{1/\kappa}\right) & \xi = 0
    \end{array}
    \right.
\end{eqnarray}

enabling random number generation as implemented in {\tt regp3}.
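
A minimal sketch of the inversion approach in base R follows. The function
names {\tt qegp3Sketch} and {\tt regp3Sketch} and the parameter values are
hypothetical, chosen only for illustration; this is not the source of
{\tt regp3}.

<<invsketch, echo=TRUE>>=
# Sketch of the EGP3 quantile function obtained by inverting the
# distribution function; p is a probability in (0, 1).
qegp3Sketch <- function(p, kappa, sigma, xi, u = 0) {
  a <- 1 - p^(1 / kappa)  # GPD survival probability at the quantile
  if (abs(xi) < 1e-8) {
    u - sigma * log(a)
  } else {
    u + sigma / xi * (a^(-xi) - 1)
  }
}

# Random numbers by inversion: push uniform variates through the quantiles
regp3Sketch <- function(n, kappa, sigma, xi, u = 0) {
  qegp3Sketch(runif(n), kappa, sigma, xi, u)
}

pr <- c(0.1, 0.5, 0.9)
qs <- qegp3Sketch(pr, kappa = 2, sigma = 1.5, xi = 0.2, u = 1)
# Plugging the quantiles back into the distribution function recovers pr
all.equal((1 - (1 + 0.2 * (qs - 1) / 1.5)^(-1 / 0.2))^2, pr)

summary(regp3Sketch(1000, kappa = 2, sigma = 1.5, xi = 0.2, u = 1))
@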

\subsection{Return levels}
Following Coles~\cite[Section 4.3.3]{coles}, computation of return levels proceeds
as follows. Writing $F$ for the distribution function (\ref{eqn:cdf}) of
exceedances of the threshold, we note that for $x > u$

\begin{eqnarray*}
P(X > x | X > u) = 1 - F(x)
\end{eqnarray*}

so that

\begin{eqnarray*}
P(X > x) = \theta_u\{1 - F(x)\}
\end{eqnarray*}

where $\theta_u = P(X > u)$ for threshold $u$. Therefore, the level $x_M$ that is
exceeded on average once every $M$ observations is the solution to

\begin{eqnarray}
\frac{1}{M} = \theta_u\{1 - F(x_M)\} \label{eqn:rl}.
\end{eqnarray}

For the EGP3 distribution, we solve (\ref{eqn:rl}) to yield

\begin{eqnarray}
x_M = \left\{ \arraycolsep=1.4pt\def\arraystretch{2.2} % increase vertical spacing
      \begin{array}{ll}
      u + \frac{\sigma}{\xi}\left[\left\{1 - \left( 1 - \frac{1}{M\theta_u}\right)^{1/\kappa}\right\}^{-\xi} - 1\right] & \xi \ne 0 \label{eqn:retlev}\\
      u - \sigma\log\left[1 - \left(1 - \frac{1}{M\theta_u}\right)^{1/\kappa}\right] & \xi = 0
      \end{array}
      \right.
\end{eqnarray}

correcting Papastathopoulos and Tawn.
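
The return level formula can be checked numerically with a short base R
sketch. The function name {\tt rlEgp3Sketch} and the parameter values are
arbitrary choices made for illustration, not estimates from any data set.

<<rlsketch, echo=TRUE>>=
# Sketch of the M-observation return level for the EGP3 distribution.
# theta is the probability of exceeding the threshold u; M * theta must
# exceed 1 for the return level to lie above the threshold.
rlEgp3Sketch <- function(M, kappa, sigma, xi, u, theta) {
  stopifnot(all(M * theta > 1))
  p <- 1 - 1 / (M * theta)
  a <- 1 - p^(1 / kappa)
  if (abs(xi) < 1e-8) {
    u - sigma * log(a)
  } else {
    u + sigma / xi * (a^(-xi) - 1)
  }
}

xM <- rlEgp3Sketch(M = 1000, kappa = 2, sigma = 1.5, xi = 0.2,
                   u = 1, theta = 0.3)
# The unconditional exceedance probability at xM should equal 1 / M
Fx <- (1 - (1 + 0.2 * (xM - 1) / 1.5)^(-1 / 0.2))^2
all.equal(0.3 * (1 - Fx), 1 / 1000)
@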

\subsubsection{Derivatives}
In order to compute approximate standard errors for return levels, we need derivatives
of (\ref{eqn:retlev}) with respect to each of $\kappa$, $\sigma$ and $\xi$. For
$\xi \ne 0$, these are found to be

\begin{eqnarray*}
\frac{dx_M}{d\kappa} &=& -\frac{\left( 1 - \frac{1}{M\theta_u}\right)^{1/\kappa} \sigma\left( 1 - \left( 1 - \frac{1}{M\theta_u}\right)^{1/\kappa}\right)^{-\xi -1}\log\left( 1 - \frac{1}{M\theta_u}\right)}{\kappa^2}\\
\frac{dx_M}{d\sigma} &=& \frac{\left(1 - \left( 1 - \frac{1}{M\theta_u}\right)^{1/\kappa}\right)^{-\xi} -1}{\xi}\\
\frac{dx_M}{d\xi} &=& -\frac{\sigma\left(1 - \left( 1 - \frac{1}{M\theta_u}\right)^{1/\kappa}\right)^{-\xi}\log\left(1 - \left( 1 - \frac{1}{M\theta_u}\right)^{1/\kappa}\right)}{\xi} - \frac{\sigma\left(\left(1 - \left( 1 - \frac{1}{M\theta_u}\right)^{1/\kappa}\right)^{-\xi} -1\right)}{\xi^2}
\end{eqnarray*}
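
These expressions can be checked against central finite differences, and
combined with a covariance matrix by the delta method. The sketch below is in
base R; the function names, the parameter values and in particular the
covariance matrix {\tt V} are hypothetical and chosen only for illustration.
It treats $u$ and $\theta_u$ as known, and works on the scale of
$(\kappa, \sigma, \xi)$ rather than any transformed scale used for fitting.

<<gradsketch, echo=TRUE>>=
# Return level for xi != 0, as a function of pars = (kappa, sigma, xi)
rlSketch <- function(pars, M, u, theta) {
  b <- 1 - (1 - 1 / (M * theta))^(1 / pars[1])
  u + pars[2] / pars[3] * (b^(-pars[3]) - 1)
}

# Closed form derivatives with respect to kappa, sigma and xi
rlGradSketch <- function(pars, M, theta) {
  kappa <- pars[1]; sigma <- pars[2]; xi <- pars[3]
  p <- 1 - 1 / (M * theta)
  a <- p^(1 / kappa)
  b <- 1 - a
  c(kappa = -a * sigma * b^(-xi - 1) * log(p) / kappa^2,
    sigma = (b^(-xi) - 1) / xi,
    xi    = -sigma * b^(-xi) * log(b) / xi -
             sigma * (b^(-xi) - 1) / xi^2)
}

par0 <- c(2, 1.5, 0.2)  # arbitrary illustrative values
g <- rlGradSketch(par0, M = 1000, theta = 0.3)

# Central finite differences, one parameter at a time
numG <- sapply(1:3, function(i) {
  h <- replace(numeric(3), i, 1e-6)
  (rlSketch(par0 + h, M = 1000, u = 1, theta = 0.3) -
     rlSketch(par0 - h, M = 1000, u = 1, theta = 0.3)) / 2e-6
})
all.equal(unname(g), numG, tolerance = 1e-6)

# Delta method with a hypothetical covariance matrix for (kappa, sigma, xi)
V <- diag(c(0.04, 0.01, 0.0025))
sqrt(drop(t(g) %*% V %*% g))
@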

\subsection{Upper endpoint}
When $\xi < 0$, the GPD has upper endpoint $u - \frac{\sigma}{\xi}$. This value
is obtained by setting the distribution function to 1 and solving. Working with
(\ref{eqn:cdf}), setting it to 1 and solving reveals the EGP3 distribution to have
the same upper endpoint as the GPD.
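
The calculation can be sketched as follows. Since $\kappa > 0$ and the quantity
in the outer brackets of (\ref{eqn:cdf}) lies in $[0, 1]$, for $\xi < 0$ we have

\begin{eqnarray*}
\left[ 1 - \left[ 1 + \xi \frac{x - u}{\sigma} \right]^{-1/\xi}\right]^\kappa = 1
&\Longleftrightarrow& \left[ 1 + \xi \frac{x - u}{\sigma} \right]^{-1/\xi} = 0 \\
&\Longleftrightarrow& 1 + \xi \frac{x - u}{\sigma} = 0 \\
&\Longleftrightarrow& x = u - \frac{\sigma}{\xi},
\end{eqnarray*}

where the second step uses $-1/\xi > 0$. The value of $\kappa$ plays no part in
the solution, which is why the endpoint coincides with that of the GPD.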


\clearpage
\section{Appendix}
\subsection{Information on the R session}
Information on the R session, in the interests of reproducibility.
<<sessionInfo, echo=FALSE, results='verbatim'>>=
sessionInfo()
@

\bibliographystyle{plain}
\bibliography{texmex}
\end{document}