Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PCA of Raster data [Feature Request] #1361

Closed
ailich opened this issue Nov 30, 2023 · 23 comments
Closed

PCA of Raster data [Feature Request] #1361

ailich opened this issue Nov 30, 2023 · 23 comments

Comments

@ailich
Copy link
Contributor

ailich commented Nov 30, 2023

I used to use RSToolbox for conducting PCA's on raster data, but it relies on the raster package and is no longer actively maintained (I actually can't get it to install anymore). Since PCA is a common procedure it might be nice to have directly in terra like how kmeans was recently added. I've taken the code from their function and converted it to terra code and it appears to work with the development version.

#' Principal Component Analysis for Rasters
#'
#' Calculates R-mode PCA for SpatRasters and returns a named list containing the PCA model and the SpatRaster with the principal component scores.
#'
#' Internally rasterPCA relies on the use of \link[stats]{princomp} (R-mode PCA). If nSamples is given the PCA will be calculated
#' based on a random sample of pixels and then predicted for the full raster. If nSamples is NULL then the covariance matrix will be calculated
#' first and will then be used to calculate princomp and predict the full raster. The latter is more precise, since it considers all pixels,
#' however, it may be slower than calculating the PCA only on a subset of pixels.
#'
#' Pixels with missing values in one or more bands will be set to NA. The built-in check for such pixels can lead to a slow-down of rasterPCA.
#' However, if you make sure or know beforehand that all pixels have either only valid values or only NAs throughout all layers you can disable this check
#' by setting maskCheck=FALSE which speeds up the computation.
#'
#' Standardised PCA (SPCA) can be useful if imagery or bands of different dynamic ranges are combined. SPC uses the correlation matrix instead of the covariance matrix, which
#' has the same effect as using normalised bands of unit variance.
#'
#' @param x SpatRaster
#' @param nSamples Integer or NULL. Number of pixels to sample for PCA fitting. If NULL, all pixels will be used.
#' @param nComp Integer. Number of PCA components to return.
#' @param spca Logical. If \code{TRUE}, perform standardized PCA. Corresponds to centered and scaled input image. This is usually beneficial for equal weighting of all layers. (\code{FALSE} by default)
#' @param maskCheck Logical. Masks all pixels which have at least one NA (default TRUE is reccomended but introduces a slow-down, see Details when it is wise to disable maskCheck).
#' Takes effect only if nSamples is NULL.
#' @param ... further arguments to be passed to \link[terra]{writeRaster}, e.g. filename.
#' @return Returns a named list containing the PCA model object ($model) and the SpatRaster with the principal component layers ($map).
#' @details
#' This code is largely from the rasterPCA function from the [RSToolbox package](https://bleutner.github.io/RStoolbox/), but has been adapted to work with the terra package.
#' @export
#' @examples
#'

rasterPCA <- function(x, nSamples = NULL, nComp = nlyr(x), spca = FALSE,  maskCheck = TRUE, ...){
  if(nlyr(x) <= 1) stop("Need at least two layers to calculate PCA.")
  ellip <- list(...)

  if(nComp > nlyr(x)) nComp <- nlyr(x)

  if(!is.null(nSamples)){
    trainData <- spatSample(x, size = nSamples, method="random", replace=FALSE, na.rm = TRUE)
    if(nrow(trainData) < nlyr(x)) stop("nSamples too small or x contains a layer with NAs only")
    model <- princomp(trainData, scores = FALSE, cor = spca)
  } else {
    if(maskCheck) {
      totalMask <- !sum(app(x, is.na))
      if(global(totalMask, sum)$sum == 0) stop("x contains either a layer with NAs only or no single pixel with valid values across all layers")
      x <- mask(x, totalMask , maskvalue = 0) ## NA areas must be masked from all layers, otherwise the covariance matrix is not non-negative definite
    }
    covMat <- layerCor(x, fun = "cov", na.rm = TRUE)
    model  <- princomp(covmat = covMat[[1]], cor=spca)
    model$center <- diag(covMat$mean)
    model$n.obs  <- global(!any(is.na(x)), sum)$sum
    if(spca) {
      ## Calculate scale as population sd like in in princomp
      S <- diag(covMat$covariance)
      model$scale <- sqrt(S * (model$n.obs-1)/model$n.obs)
    }
  }
  ## Predict
  out<- terra::predict(x, model = model, na.rm = TRUE, index = 1:nComp, wrArgs = ellip)
  names(out) <- paste0("PC", 1:nComp)
  return(structure(list(call = match.call(), model = model, map = out)))
}
@ailich
Copy link
Contributor Author

ailich commented Nov 30, 2023

Here's an example of how it could be used

library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.62

rasterPCA <- function(x, nSamples = NULL, nComp = nlyr(x), spca = FALSE,  maskCheck = TRUE, ...){
  if(nlyr(x) <= 1) stop("Need at least two layers to calculate PCA.")
  ellip <- list(...)
  
  if(nComp > nlyr(x)) nComp <- nlyr(x)
  
  if(!is.null(nSamples)){
    trainData <- spatSample(x, size = nSamples, method="random", replace=FALSE, na.rm = TRUE)
    if(nrow(trainData) < nlyr(x)) stop("nSamples too small or x contains a layer with NAs only")
    model <- princomp(trainData, scores = FALSE, cor = spca)
  } else {
    if(maskCheck) {
      totalMask <- !sum(app(x, is.na))
      if(global(totalMask, sum)$sum == 0) stop("x contains either a layer with NAs only or no single pixel with valid values across all layers")
      x <- mask(x, totalMask , maskvalue = 0) ## NA areas must be masked from all layers, otherwise the covariance matrix is not non-negative definite
    }
    covMat <- layerCor(x, fun = "cov", na.rm = TRUE)
    model  <- princomp(covmat = covMat[[1]], cor=spca)
    model$center <- diag(covMat$mean)
    model$n.obs  <- global(!any(is.na(x)), sum)$sum
    if(spca) {
      ## Calculate scale as population sd like in in princomp
      S <- diag(covMat$covariance)
      model$scale <- sqrt(S * (model$n.obs-1)/model$n.obs)
    }
  }
  ## Predict
  out<- terra::predict(x, model = model, na.rm = TRUE, index = 1:nComp, wrArgs = ellip)
  names(out) <- paste0("PC", 1:nComp)
  return(structure(list(call = match.call(), model = model, map = out)))
}


#Create Data
r<- rast(nrow=40, ncol=40, nlyrs=10)
set.seed(5)
values(r)<- rnorm(ncell(r)*nlyr(r))
values(r)[sample(1:(ncell(r)*nlyr(r)), size = 1000, replace = FALSE)]<- NA #Add NA's

plot(r)

PCA<- rasterPCA(r, spca = TRUE)
summary(PCA$model)
#> Importance of components:
#>                          Comp.1   Comp.2    Comp.3    Comp.4    Comp.5
#> Standard deviation     1.065035 1.060924 1.0353062 1.0232500 1.0026015
#> Proportion of Variance 0.113430 0.112556 0.1071859 0.1047041 0.1005210
#> Cumulative Proportion  0.113430 0.225986 0.3331719 0.4378759 0.5383969
#>                           Comp.6     Comp.7     Comp.8     Comp.9    Comp.10
#> Standard deviation     0.9866337 0.98439181 0.96219138 0.95073612 0.91861112
#> Proportion of Variance 0.0973446 0.09690272 0.09258123 0.09038992 0.08438464
#> Cumulative Proportion  0.6357415 0.73264422 0.82522545 0.91561536 1.00000000
plot(PCA$map)

Created on 2023-11-30 with reprex v2.0.2

@rhijmans
Copy link
Member

rhijmans commented Nov 30, 2023

I am not opposed but not convinced either given the simplicity of this example in ?terra::predict

library(terra)
logo <- rast(system.file("ex/logo.tif", package="terra"))   
### principal components of a SpatRaster
pca <- princomp(logo)

# or use sampling if you have a large raster 
# and cannot process all cell values
sr <- spatSample(logo, 100000, "regular")
pca <- prcomp(sr)

x <- predict(logo, pca)
plot(x)

@ailich
Copy link
Contributor Author

ailich commented Nov 30, 2023

@rhijmans, I think this method doesn't require sampling for large rasters because it calls layerCor and then works off the covariance matrix. That is unless layerCor needs to read in the whole dataset to memory.

rhijmans added a commit that referenced this issue Dec 1, 2023
@rhijmans
Copy link
Member

rhijmans commented Dec 1, 2023

I added a princomp<SpatRaster> method. Please have a look. I think it is best to have model building and prediction in separate steps as one might want to inspect the model and because this is standard R practice.

rhijmans added a commit that referenced this issue Dec 1, 2023
@ailich
Copy link
Contributor Author

ailich commented Dec 1, 2023

Thanks @rhijmans, I think splitting it up into model and prediction makes sense and like the idea of adding a dedicated princomp method in terra. I do have a few comments though.

  1. I think the call needs to stats::princomp needs to have cor=cor to link to what the user specified, and it should have ... for other arguments that can be passed to it. Since currently that line only says model <- princomp(covmat = xcov$covariance), cor and other user specified arguments will not be passed along.
  2. I believe NA values need to be masked from all layers prior to calculating the covariance matrix with LayerCor. In RSToolbox the comment said "NA areas must be masked from all layers, otherwise the covariance matrix is not non-negative definite"
  3. Assign $n.obs in model output (should be number of non-NA cells)
library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.62

rasterPCA <- function(x, nSamples = NULL, nComp = nlyr(x), spca = FALSE,  maskCheck = TRUE, ...){
  if(nlyr(x) <= 1) stop("Need at least two layers to calculate PCA.")
  ellip <- list(...)
  
  if(nComp > nlyr(x)) nComp <- nlyr(x)
  
  if(!is.null(nSamples)){
    trainData <- spatSample(x, size = nSamples, method="random", replace=FALSE, na.rm = TRUE)
    if(nrow(trainData) < nlyr(x)) stop("nSamples too small or x contains a layer with NAs only")
    model <- princomp(trainData, scores = FALSE, cor = spca)
  } else {
    if(maskCheck) {
      totalMask <- !sum(app(x, is.na))
      if(global(totalMask, sum)$sum == 0) stop("x contains either a layer with NAs only or no single pixel with valid values across all layers")
      x <- mask(x, totalMask , maskvalue = 0) ## NA areas must be masked from all layers, otherwise the covariance matrix is not non-negative definite
    }
    covMat <- layerCor(x, fun = "cov", na.rm = TRUE)
    model  <- princomp(covmat = covMat[[1]], cor=spca)
    model$center <- diag(covMat$mean)
    model$n.obs  <- global(!any(is.na(x)), sum)$sum
    if(spca) {
      ## Calculate scale as population sd like in in princomp
      S <- diag(covMat$covariance)
      model$scale <- sqrt(S * (model$n.obs-1)/model$n.obs)
    }
  }
  ## Predict
  out<- terra::predict(x, model = model, na.rm = TRUE, index = 1:nComp, wrArgs = ellip)
  names(out) <- paste0("PC", 1:nComp)
  return(structure(list(call = match.call(), model = model, map = out)))
}


princomp_edit<- function(x, cor=FALSE, ...) {
            stopifnot(nlyr(x) > 1)
            xcov <- layerCor(x, fun="cov", na.rm=TRUE)
            model <- princomp(covmat = xcov$covariance, cor=cor, ...)
            model$center <- diag(xcov$mean)
            if (cor) {
              ## Calculate scale as population sd like in in princomp
              S <- diag(xcov$covariance)
              n <- diag(xcov$n)
              model$scale <- sqrt(S * (n-1) / n)
            }
            model
          }

#Create Data
r<- rast(nrow=40, ncol=40, nlyrs=10)
set.seed(5)
values(r)<- rnorm(ncell(r)*nlyr(r))
values(r)[sample(1:(ncell(r)*nlyr(r)), size = 1000, replace = FALSE)]<- NA #Add NA's

r_df<- as.data.frame(r)
r_df<- na.omit(r_df) # If you do not omit NA's from the dataframe, you get Error in cov.wt(z) : 'x' must contain finite values only

# PCA cor=FALSE
m1<- princomp(r_df) # This is on a dataframe so it serves as a reference to the correct output
m1$n.obs
#> [1] 859
m1$sdev
#>    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7    Comp.8 
#> 1.0826671 1.0704703 1.0429269 1.0255518 1.0028366 0.9855091 0.9820035 0.9674348 
#>    Comp.9   Comp.10 
#> 0.9487587 0.9176803

m2<- rasterPCA(r)$model
m2$n.obs
#> [1] 859
m2$sdev
#>    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7    Comp.8 
#> 1.0832979 1.0710939 1.0435344 1.0261493 1.0034208 0.9860832 0.9825756 0.9679984 
#>    Comp.9   Comp.10 
#> 0.9493114 0.9182149

m3<- terra::princomp(r)
m3$n.obs
#> [1] NA
m3$sdev
#>    Comp.1    Comp.2    Comp.3    Comp.4    Comp.5    Comp.6    Comp.7    Comp.8 
#> 1.0745181 1.0494377 1.0409172 1.0199349 1.0037053 1.0022756 0.9830014 0.9800315 
#>    Comp.9   Comp.10 
#> 0.9710888 0.9527138

# PCA cor=TRUE
m4<- terra::princomp(r, cor=TRUE)
m4$sdev==m3$sdev #cor= TRUE did not change result
#>  Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6  Comp.7  Comp.8  Comp.9 Comp.10 
#>    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE    TRUE

m5<- princomp_edit(r, cor=TRUE)
m5$sdev==m3$sdev #Edited version properly passes along cor argument
#>  Comp.1  Comp.2  Comp.3  Comp.4  Comp.5  Comp.6  Comp.7  Comp.8  Comp.9 Comp.10 
#>   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE   FALSE

Created on 2023-12-01 with reprex v2.0.2

@rhijmans
Copy link
Member

rhijmans commented Dec 1, 2023

I think the [call needs to stats::princomp] needs to have cor=cor

Yes, thanks, fixed that.

it should have ... for other arguments

The only other meaningful argument is fix_sign, so I have added that explicitly.

I believe NA values need to be masked from all layers prior to calculating the covariance matrix with LayerCor.

I do not think that is strictly necessary. Our best estimate of the covariance matrix would use all pairs of data that are not NA. We can add a note in the manual that you can do mask(x, anyNA(x), maskvalue=TRUE) if you want the same number of observations for all layers.

It may affect the computation of the scale (if cov=TRUE) as that uses (n-1) / n. In most cases n is very large with raster data so I am not worried about that. But again that can be pointed out.

Assign $n.obs in model output (should be number of non-NA cells)

I put it back in. However, there is not one n as it can vary between layers. So you now get a "dist" object that shows that.

@ailich
Copy link
Contributor Author

ailich commented Dec 5, 2023

Thank you!

@ailich ailich closed this as completed Dec 5, 2023
@ailich
Copy link
Contributor Author

ailich commented Dec 5, 2023

@rhijmans, although RSToolbox used the population standard deviation to calculate scale, on further investigation that may have not been the correct choice (See example and discussion here). That being said though, as you mentioned rasters tend to have a lot of observations, so it likely doesn't make much difference.

@rhijmans
Copy link
Member

rhijmans commented Dec 6, 2023

I would prefer to use model$scale <- sqrt(S) over model$scale <- sqrt(S * (n-1) / n) because the value of n is ambiguous if the number of non-NA cells varies.

The population covariance can be computed with asSample=FALSE.

xcov <- layerCor(x, fun="cov", na.rm=TRUE, asSample=FALSE)

I think it generally makes more sense to to treat raster data as the population and not as a sample. It could also be an option that may lead to oh so small differences for small rasters.

Anyway, this is confusing stuff, and I would appreciate your guidance.

@ailich
Copy link
Contributor Author

ailich commented Dec 6, 2023

@rhijmans, since the "scale" should be the standard deviation and the S (the diagonal of the covariance matrix) is the variance, the square root of that would be appropriate. The pairwise n's must still be used to calculate the covariance matrix so it's not really circumventing that, but we it does avoid multiplying by (n-1)/n which was just there to convert the sample standard deviation to the population standard deviation since

Cov Population (x,y) = ( x i x ¯ ) ( y i y ¯ ) N

Cov Sample(x,y) = ( x i x ¯ ) ( y i y ¯ ) N 1

That being said, I think it is slightly better as you suggested to use asSample=FALSE and scale=sqrt(S) since the covariance matrix and the scale are both calculated with the population formula rather than having a covariance matrix based on the sample formula and a scale based on the population formula.

I am unsure however why results of layerCor(asSample=TRUE) multiplied by (n-1)/n doesn't more closely match the results of layerCor(asSample=FALSE)

library(terra)
#> terra 1.7.62

cov_pop <- function(data) {
  # Calculate the means of each variable
  means <- colMeans(data)
  # Number of observations
  n_obs <- nrow(data)
  # Number of variables
  n_vars <- ncol(data)
  # Initialize an empty covariance matrix
  covariance_matrix <- matrix(0, nrow = n_vars, ncol = n_vars)
  # Calculate the covariance for each pair of variables
  for (i in 1:n_vars) {
    for (j in 1:n_vars) {
      covariance_matrix[i, j] <- sum((data[, i] - means[i]) * (data[, j] - means[j])) / n_obs
    }
  }
  return(covariance_matrix)
}

r<- rast(nrow=40, ncol=40, nlyrs=10)
set.seed(5)
values(r)<- rnorm(ncell(r)*nlyr(r))
r_df<- as.data.frame(r)
n<- ncell(r) #No NA's in this example

cov_samp_terra<- layerCor(r, fun="cov", na.rm=TRUE, asSamle=TRUE)
cov_pop_terra<-  layerCor(r, fun="cov", na.rm=TRUE, asSamle=FALSE)

cov_samp_R<- cov(r_df)
cov_pop_R<- cov_pop(r_df)

# Terra has differences on order of e-05
max((cov_samp_terra$covariance*(n-1)/n)-cov_pop_terra$covariance)
#> [1] 3.426207e-05

# R has differences on order of e-16
max((cov_samp_R*(n-1)/n)-cov_pop_R)
#> [1] 2.220446e-16

Created on 2023-12-06 with reprex v2.0.2

@rhijmans
Copy link
Member

rhijmans commented Dec 6, 2023

b <- rast(system.file("ex/logo.tif", package="terra"))   
layerCor(b, "pearson")

x <- layerCor(b, "cov")
y <- layerCor(b, "cov", asSample=F)

x[["covariance"]] * (x[["n"]]-1)/x[["n"]] == y[["covariance"]]
#       red green blue
#red   TRUE  TRUE TRUE
#green TRUE  TRUE TRUE
#blue  TRUE  TRUE TRUE

or with your approach using asSample (and not asSamle ;)

cov_pop_terra<-  layerCor(r, fun="cov", na.rm=TRUE, asSample=FALSE)
cov_samp_terra<- layerCor(r, fun="cov", na.rm=TRUE, asSample=TRUE)
max((cov_samp_terra$covariance*(n-1)/n)-cov_pop_terra$covariance)
#[1] 2.220446e-16

rhijmans added a commit that referenced this issue Dec 6, 2023
@rhijmans
Copy link
Member

rhijmans commented Dec 6, 2023

but it does avoid multiplying by (n-1)/n

Yes, that is the n I was thinking about. In the covariance computation, n is computed for each pair.

@ailich
Copy link
Contributor Author

ailich commented Dec 6, 2023

Oh that was dumb of me 🤦‍♂️🤦‍♂️🤦‍♂️. Running it again, it all looks good! Since you created the n matrix they're both pairwise and the scale will come out the same. I think it's cleaner and more consistent though to use asSample=FALSE and $scale<- sqrt(S) as you mentioned.

library(terra)
#> terra 1.7.62

#Create Data
r<- rast(nrow=40, ncol=40, nlyrs=10)
set.seed(5)
values(r)<- rnorm(ncell(r)*nlyr(r))
values(r)[sample(1:(ncell(r)*nlyr(r)), size = 1000, replace = FALSE)]<- NA #Add NA's

cov_samp_terra<- layerCor(r, fun="cov", na.rm=TRUE, asSample=TRUE)
cov_pop_terra<-  layerCor(r, fun="cov", na.rm=TRUE, asSample=FALSE)

n_mat<- cov_samp_terra$n
n<- diag(n_mat)
max((cov_samp_terra$covariance*(n_mat-1)/n_mat) - cov_pop_terra$covariance)
#> [1] 6.938894e-18

n # n is a vector of pairwise sample sizes corresponding to the length of S
#>  [1] 1497 1497 1509 1496 1493 1489 1506 1505 1507 1501
S1<- diag(cov_samp_terra$covariance)
scale1<- sqrt(S1*(n-1)/n)
S2<- diag(cov_pop_terra$covariance)
scale2<- sqrt(S2)

scale1==scale2
#>  lyr.1  lyr.2  lyr.3  lyr.4  lyr.5  lyr.6  lyr.7  lyr.8  lyr.9 lyr.10 
#>   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE   TRUE

Created on 2023-12-06 with reprex v2.0.2

@ailich
Copy link
Contributor Author

ailich commented Dec 6, 2023

Thanks so much for adding this!

@rhijmans
Copy link
Member

rhijmans commented Dec 6, 2023

Thank you!

I have changed argument na.rm=TRUE/FALSE to use="everything"/"complete.obs"/ "pairwise.complete.obs"/"masked.obs"

@ailich
Copy link
Contributor Author

ailich commented Dec 7, 2023

@rhijmans, an added bonus is I believe this also fixed a bug in layerCor when calculating "cov" with maxcell specified. In the latest commit I get approximately the same results whereas in the CRAN and a few commits ago I get half the covariance if sampling half the cells so it seems like the number of cells rather than the sample size was used.

Older Version

library(terra)
#> Warning: package 'terra' was built under R version 4.3.2
#> terra 1.7.62
r<- rast(nrow=40, ncol=40, nlyrs=2)
set.seed(5)
values(r[[1]])<- rnorm(ncell(r))
values(r[[2]])<- values(r[[1]])*2+rnorm(ncell(r))

cov_full<- layerCor(r, fun="cov", maxcell=Inf)
cov_half<- layerCor(r, fun="cov", maxcell=ncell(r)/2)

cov_half$covariance
#>          lyr.1    lyr.2
#> lyr.1 0.541334 1.058303
#> lyr.2 1.058303 2.621351
cov_full$covariance
#>          lyr.1    lyr.2
#> lyr.1 1.019472 2.029023
#> lyr.2 2.029023 5.057638

cov_half$covariance/cov_full$covariance
#>           lyr.1     lyr.2
#> lyr.1 0.5309946 0.5215826
#> lyr.2 0.5215826 0.5182955

df_half<- spatSample(r, 800, method="regular")
cov(df_half)
#>          lyr.1    lyr.2
#> lyr.1 1.030468 2.014555
#> lyr.2 2.014555 4.989929

Created on 2023-12-07 with reprex v2.0.2

Latest Commit

library(terra)
#> terra 1.7.62
r<- rast(nrow=40, ncol=40, nlyrs=2)
set.seed(5)
values(r[[1]])<- rnorm(ncell(r))
values(r[[2]])<- values(r[[1]])*2+rnorm(ncell(r))

cov_full<- layerCor(r, fun="cov", maxcell=Inf)
cov_half<- layerCor(r, fun="cov", maxcell=ncell(r)/2)

cov_half$covariance
#>          lyr.1    lyr.2
#> lyr.1 1.030468 2.014555
#> lyr.2 2.014555 4.989929
cov_full$covariance
#>          lyr.1    lyr.2
#> lyr.1 1.019472 2.029023
#> lyr.2 2.029023 5.057638

cov_half$covariance/cov_full$covariance
#>           lyr.1     lyr.2
#> lyr.1 1.0107862 0.9928697
#> lyr.2 0.9928697 0.9866125

df_half<- spatSample(r, 800, method="regular")
cov(df_half)
#>          lyr.1    lyr.2
#> lyr.1 1.030468 2.014555
#> lyr.2 2.014555 4.989929

Created on 2023-12-07 with reprex v2.0.2

@ailich
Copy link
Contributor Author

ailich commented Dec 7, 2023

Since "cov" and maxcell seems to be working, the only extra thing I'd add to this function is allowing a specification of maxcell in princomp. By default that would be maxcell=Inf but could be changed. The only thing that would need to be done is adding a maxcell argument to princomp and passing that to layerCor via model <- princomp(covmat=xcov$covariance, cor=cor, fix_sign=fix_sign, maxcell=maxcell).

@rhijmans
Copy link
Member

rhijmans commented Dec 8, 2023

I did not include the maxcell argument because I figured that you might as well use princomp<data.frame> at that point; but I have added it now anyway.

It is interesting to see how few you cells you need with this (unrealistically simple) example

library(terra)
f <- system.file("ex/logo.tif", package = "terra")
r <- rast(f)
pca <- princomp(r)
pca100 <- princomp(r, maxcell=100)
plot(predict(r, pca), predict(r, pca100))

@Nowosad
Copy link
Contributor

Nowosad commented Dec 18, 2023

@rhijmans, that for adding this function -- I think it will be helpful for many.

Do you also plan to add some updates to prcomp? Currently, princomp works for data with NAs, but prcomp not:

library(terra)
#> terra 1.7.65
f <- system.file("ex/logo.tif", package = "terra")
r <- rast(f)
r[100] <- NA

# works
pca <- princomp(r)

# fails (due to NA's)
pca2 <- prcomp(r)
#> Error in svd(x, nu = 0, nv = k): infinite or missing values in 'x'

From the princomp help file (?princomp), it appears that the prcomp should be preferred, in general:

"The calculation is done using eigen on the correlation or covariance matrix, as determined by cor. This is done for compatibility with the S-PLUS result. A preferred method of calculation is to use svd on x, as is done in prcomp."

@ailich
Copy link
Contributor Author

ailich commented Dec 18, 2023

@Nowosad, the point about prcomp vs princomp was raised in bleutner/RStoolbox#49 where it was stated that "prcomp is not a suitable choice for large remote sensing data sets as it operates on X as opposed to princomp which operates on the correlation or covariance matrix, both of which can be computed in a piecewise fashion. I.e. for prcomp you'd have to load all data into memory at once, while for princomp you can first compute the covariance matrix and the run eigen on it, which is not only faster, but more importantly memory-safe. In the end the results of both methods are equal. While numerical differences can exist, with SVD being more accurate, this is generally negligible compared to the measurement uncertainty inherent in typical remote sensing data."

@rhijmans
Copy link
Member

rhijmans commented Dec 18, 2023

If there are no NAs in SpatRaster x you can do

prcomp(x) 

Otherwise you need something like

maxcell <- Inf
v <- na.omit(spatSample(x, maxcell, "regular"))
prcomp(v)

I could add a prcomp method that implements that to avoid that people trip over the NA issue.

@ailich
Copy link
Contributor Author

ailich commented Dec 19, 2023

@rhijmans @Nowosad, it could be useful to add a prcomp method to make it a bit smoother for those who want to use that version of PCA. Seems like SpatRasters work with prcomp by default since objects are automatically coerced to a matrix in prcomp.

rhijmans added a commit that referenced this issue Dec 20, 2023
@Nowosad
Copy link
Contributor

Nowosad commented Dec 20, 2023

@ailich thanks for the explanation -- I was suspecting something like this, but was unable to find any discussion on the topic. And yes -- my thinking was to make prcomp and princomp more consistent (e.g., both allowing for NA values in input rasters and both having the maxcell argument). I believe this was done a few hours ago by @rhijmans -- thank you a lot. I plan to test it in the coming days.

netbsd-srcmastr pushed a commit to NetBSD/pkgsrc that referenced this issue Dec 10, 2024
# version 1.7-83

## bug fixes

- `flip(direction="vertical")` failed in some cases
  [#1518](rspatial/terra#1518) by Ed Carnell

- `zonal(as.raster=TRUE)` failed when the zonal raster was categorical
  [1514](rspatial/terra#1514) by Jessi L
  Brown

- `distance<data.frame,data.frame>` and `<matrix,matrix>` ignored the
  unit
  argument. [#1545](rspatial/terra#1545) by
  Wencheng Lau-Medrano

- NetCDF files with month time-step encode from 0-11 made R crash
  [#1544](rspatial/terra#1544) by Martin
  Holdrege

- `split<SpatVector>` only worked well if the split field was of type
  character. [#1530](rspatial/terra#1530) by
  Igor Graczykowski

- `gridDist` (and probably some other methods) emitted a "cannot
  overwrite existing file" error when processing large datasets
  [#1522](rspatial/terra#1522) by Clare
  Pearson

- `terrain` did not accept multiple variables
  [#1561](rspatial/terra#1561) by Michael
  Mahoney

- `rotate` was vulnerable to an integer overflow
  [#1562](rspatial/terra#1562) by Sacha
  Ruzzante

- `getTileExtents` could return overlapping tiles or tiles with gaps
  due to floating point
  imprecision. [#1564](rspatial/terra#1564)
  by Michael Sumner


## enhancements

- `as.list<SpatRasterDataset>` sets the names of the list
  [#1513](rspatial/terra#1513)

- a SpatVectorCollection can now be subset with its names; and if made
  from a list it takes the names from the list.
  [1515](rspatial/terra#1515) by jedgroev

- argument `fill_range` to plot<SpatRaster> and `plot<SpatVector>` to
  use the color of the extreme values of the specified range
  [#1553](rspatial/terra#1553) by Mike
  Koontz

- plet<SpatRaster> can now handle rasters with a "local" (Cartesian)
  CRS. [#1570](rspatial/terra#1570) by
  Augustin Lobo.

## new

- `map-region` returns the coordinates of the axes position of a map
  created with `plot<Spat*>`
  [https://github.com/rspatial/terra/issues/1517](https://github.com/rspatial/terra/issues/1517)
  by Daniel Schuch

- `polys<leaflet>` method
  [#1543](rspatial/terra#1543) by Márcia
  Barbosa

- `plot<SpatVectorCollection>` method
  [#1532](rspatial/terra#1532) by jedgroev

- `add_mtext` to add text around the margins of a
  map. [#1567](rspatial/terra#1567) by
  Daniel Schuch

# version 1.7-78

Released 2023-05-22

## bug fixes

- `writeVector` and `readVector` better handle empty geopackage layers
  [#1426](rspatial/terra#1426) by Andrew
  Gene Brown.

- `writeCDF` only wrote global variables if there was more than one
  [#1443](rspatial/terra#1443) by Daniel
  Schlaepfer

- `rasterize` with "by" returned odd layernames
  [#1435](rspatial/terra#1435) by Philippe
  Massicotte

- `convHull`, `minCircle` and `minRect` with a zero-row SpatVector
  crashed R [#1445](rspatial/terra#1445) by
  Andrew Gene Brown

- `rangeFill` with argument `circular=TRUE` did not work properly
  [#1460](rspatial/terra#1460) by Alice

- `crs(describe = TRUE)` returned an mis-ordered extent
  [#1485](rspatial/terra#1485) by Dimitri
  Falk

- `tapp` with a custom function and an index like "yearmonths" could
  shift time for not considering the time
  zone. [#1483](rspatial/terra#1483) by Finn
  Roberts

- `plot<SpatRaster>` could fail when there were multiple values with
  very small differences
  [#1491](rspatial/terra#1491) by srfall

- `as.data.frame<SpatRaster>` with "xy=TRUE" and "wide=FALSE" could
  fail if coordinates were very similar
  [#1476](rspatial/terra#1476) by Pascal
  Oettli

- `rasterizeGeom` now returns the correct layer name
  [#1472](rspatial/terra#1472) by
  HRodenhizer

- `cellSize` with "mask=TRUE" failed if the output was to be written
  to a temp file
  [#1496](rspatial/terra#1496) by Pascal
  Sauer

- `ext<SpatVectorProxy>` did not return the full extent
  [#1501](rspatial/terra#1501) by
  erkent-carb


## enhancements

- `extract` has new argument "small=TRUE" to allow for strict use of
  "touches=FALSE"
  [#1419](rspatial/terra#1419) by Floris
  Vanderhaeghe.

- `as.list<SpatRaster>` has new argument "geom=NULL"

- `rast<list>` now recognizes (x, y, z) base R "image" structures
  [stackoverflow]
  (https://stackoverflow.com/questions/77949551/rspatial-convert-a-grid-list-to-a-raster-using-terra)
  by Ignacio Marzan.

- `inset` has new arguments "offset" and "add"
  [#1422](rspatial/terra#1422) by Armand-CT

- `expanse<SpatRaster>` has argument `usenames`
  [#1446](rspatial/terra#1446) by Bappa Das

- the default color palette is now `terra::map.pal("viridis")` instead
  of `terrain.colors`. The default can be changes with
  `options(terra.pal=...)`
  [#1474](rspatial/terra#1474) by Derek
  Friend

- `as.list<SpatRasterDataset>` now returns a named
  list. [#1513](rspatial/terra#1513) by Eric
  R. Scott


## new

- `bestMatch<SpatRaster>` method

- argument "pairs=TRUE" to `cells` [https://github.com/rspatial/terra/issues/1487](https://github.com/rspatial/terra/issues/1487) by Floris Vanderhaeghe

- `add_grid` to add a grid to a map


# version 1.7-71

Released 2023-01-31

## bug fixes

- k_means did not work if there were NAs
  [#1314](rspatial/terra#1314) by Jakub
  Nowosad

- `layerCor` with a custom function did not work anymore
  [#1387](rspatial/terra#1387) by Jakub
  Nowosad

- `plet` broke when using "panel=TRUE"
  [#1384](rspatial/terra#1384) by Elise
  Hellwig

- using /vis3/ to open a SpatRaster did not work
  [#1382](rspatial/terra#1382) by Mike
  Koontz

- `plot<SpatRaster>(add=TRUE)` sampled the raster data without
  considering the extent of the
  map. [#1394](rspatial/terra#1394) by
  Márcia Barbosa

- `plot<SpatRaster>(add=TRUE)` now only considers the first layer of a
  multi-layer SpatRaster
  [1395](rspatial/terra#1395) by Márcia
  Barbosa

- `set.cats` failed with a tibble was used instead of a data.frame
  [#1406](rspatial/terra#1406) by Mike
  Koontz

- `polys` argument "alpha" was ignored if a single color was
  used. [#1413](rspatial/terra#1413) by
  Derek Friend

- `query` ignore the "vars" argument if all rows were
  selected. [#1398](rspatial/terra#1398) by
  erkent-carb.

- `spatSample` ignored "replace=TRUE" with random sampling,
  na.rm=TRUE, and a sample size larger than the non NA
  cells. [#1411](rspatial/terra#1411) by
  Babak Naimi

- `spatSample` sometimes returned fewer values than requested and
  available for lonlat
  rasters. [#1396](rspatial/terra#1396) by
  Márcia Barbosa.


## enhancements

- `vect<character>` now has argument "opts" for GDAL open options,
  e.g. to declare a file
  encoding. [#1389](rspatial/terra#1389) by
  Mats Blomqvist

- `plot(plg=list(tic=""))` now allows choosing alternative continuous
  legend tic-mark styles ("in", "out", "through" or "none")

- `makeTiles` has new argument "buffer"
  [#1408](rspatial/terra#1408) by Joy
  Flowers.


## new

- `prcomp<SpatRaster>` method
  [#1361](rspatial/terra#1361 (comment))
  by Jakub Nowosad

- `add_box` to add a box around the map. The box is drawn where the
  axes are, not around the plotting region.

- `getTileExtents` provides the extents of tiles. These may be used in
parallelization. See [#1391](https://github.com/rspa
tial/terra/issues/1391) by Alex Ilich.


# version 1.7-65

Released 2023-12-15

## bug fixes

- `flip` with argument `direction="vertical"` filed in some cases with
   large rasters processed in chunks
   [0b714b0](rspatial/terra@0b714b0)
   by Dulci on [stackoveflow](
   https://stackoverflow.com/questions/77304534/rspatial-terraflip-error-when-flipping-a-multi-layer-spatrast-object)

- SpatRaster now correctly handles `NA & FALSE` and `NA | TRUE`
  [#1316](rspatial/terra#1316) by John Baums

- `set.names` wasn't working properly for SpatRasterDataset or
  SpatRasterCollection
  [#1333](rspatial/terra#1333) by Derek Friend

- `extract` with argument "layer" not NULL shifted the layers
  [#1332](rspatial/terra#1332) by Ewan
  Wakefield

- `terraOptions` did not capture "memmin" on
  -[stackoverflow](https://stackoverflow.com/questions/77552234/controlling-chunk
  -size-in-terra) by dww

- `rasterize` with points and a built-in function could crash if no
  field was used
  [#1369](rspatial/terra#1369) by
  anjelinejeline


## enhancements

- `mosaic` can now use `fun="modal"`

- `rast<matrix> and rast<data.frame>` now have option 'type="xylz"
  [#1318](rspatial/terra#1318) by Agustin
  Lobo

- `extract<SpatRaster,SpatVector>` can now use multiple summarizing
  functions [#1335](rspatial/terra#1335) by
  Derek Friend

- `disagg` and `focal` have more optimistic memory requirement
  estimation [#1334](rspatial/terra#1334) by
  Mikko Kuronen

## new

- `k_means<SpatRaster>` method
  [#1314](rspatial/terra#1314) by Agustin
  Lobo

- `princomp<SpatRaster>` method
  [#1361](rspatial/terra#1361) by Alex Ilich

- `has.time<SpatRaster>` method

- new argument "raw=FALSE" to `rast`, `sds`, and `sprc` to allow
  ignoring scale and offset
  [1354](rspatial/terra#1354) by Insang Song


# version 1.7-55

Released 2023-10-14

## bug fixes

- `mosaic` ignored the filename argument if the SpatRasterCollection
  only had a single SpatRaster
  [#1267](rspatial/terra#1267) by Michael
  Mahoney

- Attempting to use `extract` with a raster file that had been deleted
  crashed R. [#1268](rspatial/terra#1268) by
  Derek Friend

- `split<SpatVector,SpatVector>` did not work well in all
  cases. [#1256](rspatial/terra#1256) by
  Derek Corcoran Barrios

- `intersect` with two SpatVectors crashed R if there was a date/time
variable [#1273]( rspatial/terra#1273) by
Dave Dixon

- "values=FALSE" was ignored by
  `spatSample<SpatRaster>(method="weights")`
  [#1275](rspatial/terra#1275) by François
  Rousseu

- `coltab<-` again works with a list as value
[#1280](rspatial/terra#1280) by Diego
Hernangómez

- `stretch` with histogram equalization was not memory-safe
  [#1305](rspatial/terra#1305) by Evan Hersh

- `plot` now resets the "mar" parameter
  [#1297](rspatial/terra#1297) by Márcia
  Barbosa

- `plotRGB` ignored the "smooth" argument
  [#1307](rspatial/terra#1307) by Timothée
  Giraud


## enhancements

- argument "gdal" in `project` was renamed to "use_gdal"
  [#1269](rspatial/terra#1269) by Stuart
  Brown.

- SpatVector attributes can now be stored as an ordered factor
  [#1277](rspatial/terra#1277) by Ben Notkin

- `plot<SpatVector>` now uses an "interval" legend when breaks are
  supplied [#1303](rspatial/terra#1303) by
  Gonzalo Rizzo

- `crop<SpatRaster>` now keeps more metadata, including variable names
  [#1302](rspatial/terra#1302) by rhgof

- `extract(fun="table")` now returns an easier to use data.frame
[#1294](rspatial/terra#1294) by Fernando
Aramburu.


## new
- `metags<-` and `metags` to set arbitrary SpatRaster/file level
   metadata [#1304](https://github.com/rspatial/terra/issues/ 1304) by
   Francesco Chianucci

# version 1.7-46

Released 2023-09-06

## bug fixes

- `plot<SpatVector>` used the wrong main label in some cases
  [#1210](rspatial/terra#1210) by Márcia
  Barbosa

- `plotRGB` failed with an "ext=" argument
  [#1228](rspatial/terra#1228) by Dave Edge

- `rast<array>` failed badly when the array had less than three
  dimensions. [#1254](rspatial/terra#1254)
  by andreimirt.

- `all.equal` for a SpatRaster with multiple layers
[#1236](rspatial/terra#1236) by Sarah
Endicot t

- `zonal(wide=FALSE)` could give wrong results if the zonal SpatRaster
  had "layer" as
  layername. [#1251](rspatial/terra#1251) by
  Jeff Hanson

- `panel` now support argument "range"
  [#141](rspatial/terra#1241) by Jakub
  Nowosad

- `rasterize` with `by=` returned wrong layernames if the by field was
  not sorted [#1266](rspatial/terra#1266) by
  Sebastian Dunnett

- `mosaic` with multiple layers was not correct
  [#1262](rspatial/terra#1262) by
  Jean-Romain


## enhancements

- `wrap<SpatRaster>` now stores color tables
  [#1215](rspatial/terra#1215) by Patrick
  Brown

- `global` now has a "maxcell" argument
  [#1213](rspatial/terra#1213) by Alex Ilich

- `layerCor` with fun='pearson' now returns output with the layer
  names [#1206](rspatial/terra#1206)

- `vrt` now has argument "set_names"
  [#1244](rspatial/terra#1244) by sam-a-levy

- `vrt` now has argument "return_filename"
  [#1258](rspatial/terra#1258) by Krzysztof
  Dyba

- `project<SpatRaster>` has new argument "by_util" exposing the GDAL
  warp utility [#1222](rspatial/terra#1222) by
  Michael Sumner.


## new
- `compareGeom` for list and SpatRasterCollection
  [#1207](rspatial/terra#1207) by Sarah
  Endicott

- `is.rotated<SpatRaster>` method
  [#1229](rspatial/terra#1229) by Andy Lyons

- `forceCCW<SpatVector>` method to force counter-clockwise orientation
  of polygons [#1249](rspatial/terra#1249)
  by srfall.

- `vrt_tiles` returns the filenames of the tiles in a vrt file
  [#1261](rspatial/terra#1261) by Derek
  Friend

- `extractAlong` to extract raster cell values for a line that are
  ordered along the
  line. [#1257](rspatial/terra#1257) by
  adamkc.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants