caching terra objects #1150

Closed
pfuehrlich-pik opened this issue May 9, 2023 · 11 comments

Comments

@pfuehrlich-pik

We would like to cache SpatRaster and SpatVector objects. To do so we use terra::wrap to create a PackedSpatRaster/Vector which we can save as an rds file and in another R session we can use readRDS and unwrap to get the original SpatRaster. However this relies on the original source files (as reported by terra::sources) to still be available. To make our cache folder independent of external files we'd like to copy the source files into the cache folder. When loading a SpatRaster from cache we want it to reference the source file copy in our cache folder instead of the original source file. Now the question is, with the current interface, is this already possible and how? If it is not possible yet, could this be implemented?

One hacky way to do this using unexported API would be the following:

x <- terra::rast(system.file("ex/meuse.tif", package = "terra"))
file.copy(terra::sources(x), ".")
copiedSource <- normalizePath(base::basename(terra::sources(x)))

a <- terra::wrap(x, proxy = TRUE)
a@attributes$sources$source <- copiedSource

y <- terra::unwrap(a)
terra::sources(y)

packageVersion('terra'): '1.7.32'
OS: Ubuntu 22.04

@rhijmans
Member

rhijmans commented May 9, 2023

It is currently not implemented. I think I would copy the file and directly open that with rast. Is the issue that this complicates the workflow?

rhijmans added a commit that referenced this issue May 9, 2023
@rhijmans
Member

rhijmans commented May 9, 2023

You can now specify a path:

library(terra)
x <- rast(system.file("ex/meuse.tif", package = "terra"))
w <- wrap(x, TRUE, path=tempdir() )

@pfuehrlich-pik
Author

Thanks for the amazingly fast reply! The problem with just loading the copied source file with rast is that anything stored only in the SpatRaster object is lost. For example, I could load a raster with 7 layers from a netcdf file, then change the names and select only 4 of the 7 layers. If I want to cache this SpatRaster, I cannot just copy the source netcdf and then load that copied source with rast, because then my layer selection and renaming would be lost.
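A minimal sketch of that loss, using the three-layer ex/logo.tif file that ships with terra in place of a netcdf. The layer selection and the new names ("first", "third") are illustrative assumptions, not part of the original report:

```r
library(terra)

# Open the bundled example file, keep two of its three layers and rename them.
f <- system.file("ex/logo.tif", package = "terra")
x <- rast(f)[[c(1, 3)]]
names(x) <- c("first", "third")

# Copy the source file and open the copy directly with rast().
copied <- file.path(tempdir(), basename(f))
file.copy(f, copied, overwrite = TRUE)
y <- rast(copied)

# The copy knows nothing about the subsetting or renaming done on x.
nlyr(x)
nlyr(y)
names(x)
names(y)
```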

@pfuehrlich-pik
Author

What you implemented now almost solves our problem, thank you for that! However, we'd want to put a hash in the filename of the copied source files to be in line with our general caching workflow, and also to allow multiple different versions of the same source file to be in our cache at the same time. So instead of passing only a single path to a folder, could we pass a vector of absolute paths including the file names that we want the sources to be copied to?
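One way to build such a hashed target path can be sketched in base R. The helper name `hashed_cache_path` and the choice of an MD5 hash of the file contents are assumptions made for illustration; the caching workflow described above may hash something else entirely:

```r
# Hypothetical helper: build a cache file name that embeds the MD5 hash of
# the source file's contents, so different versions of a file with the same
# name can sit side by side in one cache folder.
hashed_cache_path <- function(src, cache_dir) {
  stopifnot(file.exists(src))
  hash <- unname(tools::md5sum(src))
  stem <- tools::file_path_sans_ext(basename(src))
  ext <- tools::file_ext(src)
  # Only append the extension when the source file has one.
  suffix <- if (nzchar(ext)) paste0(".", ext) else ""
  file.path(cache_dir, paste0(stem, "-", hash, suffix))
}

# Usage with a placeholder file standing in for a real raster source.
src <- file.path(tempdir(), "example.tif")
writeLines("placeholder content", src)
target <- hashed_cache_path(src, tempdir())
file.copy(src, target, overwrite = TRUE)
basename(target)
```

Because the hash depends only on the file contents, copying the same source twice yields the same target path.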

@pfuehrlich-pik
Author

We'd also want to cache SpatVector objects in the same way, could wrap for SpatVector also be extended to support this?

rhijmans added a commit that referenced this issue May 14, 2023
rhijmans added a commit that referenced this issue May 14, 2023
@rhijmans
Member

I have added a new method wrapCache to avoid making the standard wrap too complex. I have not tested it yet. I will, but perhaps you can have a look as well.

This does not apply to SpatVector at the moment, because the original file source (if any) is not stored in the wrapped object. It is assumed that all SpatVectors can be handled in memory. There is the SpatVectorProxy class for other cases, but this is not very well developed yet. If you really need this type of functionality, then please open a separate issue for that.

@pfuehrlich-pik
Author

pfuehrlich-pik commented Jun 13, 2023

Thanks for this, and sorry for not following up on it earlier. That is already very useful, also the note about PackedSpatVector not having a reference to source files really helped! I noticed terra::wrapCache has trouble with SpatRaster with NETCDF sources:

Browse[1]> terra::sources(x)
 [1] "NETCDF:\"/home/pascal/PIK/inputdata/sources/LUH2v2h/states.nc\":primf"

When calling a <- terra::wrapCache(x, path = ".") the source file is not copied and the sources of the resulting PackedSpatRaster are broken I believe, because the NETCDF:\" prefix is missing:

Browse[1]> a@attributes$sources$source
[1] "/home/pascal/PIK/inputdata/states.nc\":primf"

Could you maybe have a look if that can be fixed? Thank you so much!
PS: I got the netcdf from https://luh.umd.edu/LUH2/LUH2_v2h/states.nc but it is quite big (~6GB). I'm quite sure this can be reproduced with any multilayer netcdf.

@pfuehrlich-pik
Author

I just tried passing the filename arg:

> a <- terra::wrapCache(x, filename = "/home/pascal/PIK/inputdata/states.nc")
Warning message:
[writeRaster] consider writeCDF to write ncdf files
> a@attributes$sources$source
 [1] "/home/pascal/PIK/inputdata/states.nc"
> b <- terra::unwrap(a)
Error: [subset] no (valid) layer selected

So I guess if the source file is a netcdf the terra::sources must look like NETCDF:"<filepath>":<layername> and that's also what should be passed as filename to terra::wrapCache? Calling the argument filename is a bit misleading in that case, but I'd be fine with that. It would be great if you could make the following work:

> a <- terra::wrapCache(x, filename = 'NETCDF:"/home/pascal/PIK/inputdata/states.nc":primf')

That's actually the interface we'll be using, so for us the filename argument is important and path is not.
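To make the source string format discussed above concrete, here is a base R sketch that splits a string of the form DRIVER:"<filepath>":<variable> and rebuilds it with a different file path. The helper names `split_gdal_source` and `retarget_gdal_source` are hypothetical, and this is not how terra handles such strings internally:

```r
# Hypothetical helper: split a source string such as
# NETCDF:"/some/dir/states.nc":primf into driver, path and variable.
# Plain file paths are returned unchanged in `path`.
split_gdal_source <- function(src) {
  pattern <- '^([A-Za-z0-9_]+):"(.*)":([^:"]+)$'
  m <- regmatches(src, regexec(pattern, src))[[1]]
  if (length(m) == 0) {
    return(list(driver = NA_character_, path = src, variable = NA_character_))
  }
  list(driver = m[2], path = m[3], variable = m[4])
}

# Hypothetical helper: point a source string at a copied file while keeping
# the driver prefix and the variable name intact.
retarget_gdal_source <- function(src, new_path) {
  parts <- split_gdal_source(src)
  if (is.na(parts$driver)) {
    return(new_path)
  }
  sprintf('%s:"%s":%s', parts$driver, new_path, parts$variable)
}

src <- 'NETCDF:"/home/pascal/PIK/inputdata/sources/LUH2v2h/states.nc":primf'
parts <- split_gdal_source(src)
retargeted <- retarget_gdal_source(src, "/home/pascal/PIK/inputdata/states.nc")
```

Here `parts$path` is the file that would need to be copied, and `retargeted` is the string that keeps both the NETCDF prefix and the primf variable.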

@achubaty

achubaty commented Jun 21, 2023

Thank you for providing a mechanism to deal with this on wrap. My use case is moving e.g., a cache to another machine where the paths may not be known in advance.

library(terra)
b <- rast(system.file("ex/elev.tif", package = "terra"))

tf1 <- tempfile(fileext = ".tif")
b1 <- writeRaster(b, filename = tf1)

tf2 <- tempfile(fileext = ".tif")
b2 <- wrapCache(b1, tf2)
unlink(tf1)

identical(tf2, b2@attributes$sources$source) ## TRUE
b3 <- unwrap(b2)
plot(b3)

compareGeom(b, b3)
unlink(tf2)

If b2 object and the file tf2 are moved to another machine, then I still need to update b2 accordingly:

b2@attributes$sources$source <- "new/path/tf2.tif"

Is it possible to deal with this on unwrap too/instead? Or perhaps define a method for sources() and sources()<- for a PackedSpatRaster?
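Until something like that exists, the manual update can be wrapped in a small base R helper. The name `relocate_sources` is hypothetical, and the sketch only covers plain file paths; source strings with a driver prefix such as NETCDF:"...":var would need separate handling:

```r
# Hypothetical helper: keep each source's file name but point it at a new
# directory, e.g. after a cache folder was moved to another machine.
relocate_sources <- function(sources, new_dir) {
  file.path(new_dir, basename(sources))
}

old <- "/tmp/RtmpAbc/tf2.tif"
new <- relocate_sources(old, "new/path")

# With a PackedSpatRaster `b2` as in the example above, the result could be
# assigned to the same unexported slot that is modified by hand there:
# b2@attributes$sources$source <-
#   relocate_sources(b2@attributes$sources$source, "new/path")
```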

@eliotmcintire

Another thank you for this new mechanism. It makes some things easier.

I am noticing, however, what seems like a bug, but perhaps I am expecting incorrectly. If you wrapCache an existing SpatRaster that is backed by a 2-file format e.g., grd, it incorrectly wraps only one file (the grd, not the gri). It works as expected if the writing occurs within the wrapCache...

# reprex
# Setup
td <- tempdir()
td2 <- file.path(td, "innerDir")
unlink(dir(td, pattern = "test", full.names = TRUE)) # just in case
unlink(dir(td2, pattern = "test", full.names = TRUE))# just in case 
dir.create(td2, showWarnings = FALSE)
ras <- terra::rast(terra::ext(0,2,0,2), vals = 1:4, res = 1)

# testing 2 ways of running wrapCache -- with `filename` vs with a `SpatRaster` already on disk
ras1 <- terra::wrapCache(ras, filename = file.path(td, "test.grd")) # correctly makes 2 files (grd, and gri)
ras1@attributes$sources  # only shows one! 
dir(td, pattern = "test.gr") # 3 files -- grd, gri, grd.aux.xml
ras2 <- terra::writeRaster(ras, filename = file.path(td, "test2.grd"))     # manually use writeRaster
dir(td, pattern = "test2.gr") # 3 files -- grd, gri, grd.aux.xml

# now terra::wrapCache it to a new folder
ras3 <- terra::wrapCache(ras2, path = td2)           # wrapCache to a 2nd folder -- only makes 1 file
dir(td2, pattern = "test2.gr") # only has one file, but should have 3
ras3@attributes$sources        # only shows one also

@rhijmans
Member

I am sorry I dropped the ball on this one; and then the issue got buried under the pile of newer issues.

wrapCache now attempts to copy all files of a dataset:

# now terra::wrapCache it to a new folder
ras3 <- terra::wrapCache(ras2, path = td2)           # wrapCache to a 2nd folder
dir(td2, pattern = "test2.gr") # NOW HAS 3 FILES
# [1] "test2.grd"         "test2.grd.aux.xml" "test2.gri"        

This is OK

ras3@attributes$sources        # only shows one also
#  sid                                                                 source
#1   1 C:\\Users\\rhijm\\AppData\\Local\\Temp\\Rtmp0MlrTc\\innerDir/test2.grd

because that is the file that terra needs to open. In some cases the others may be optional. In the case of grd/gri this will fail if the .gri file is absent.
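A cache could guard against that failure with a check before unwrapping. The base R sketch below uses hypothetical helper names (`dataset_files`, `grd_is_complete`) and assumes that companion files share the main file's name stem, which matches the grd/gri/aux.xml listing above but is not guaranteed for every format:

```r
# Hypothetical helper: list all files in the same folder whose names start
# with the main file's stem, e.g. test2.grd, test2.gri and test2.grd.aux.xml.
dataset_files <- function(main_file) {
  stem <- tools::file_path_sans_ext(basename(main_file))
  candidates <- list.files(dirname(main_file), full.names = TRUE)
  candidates[startsWith(basename(candidates), paste0(stem, "."))]
}

# Hypothetical helper: a .grd header is only usable if its .gri data file
# sits next to it.
grd_is_complete <- function(grd_file) {
  gri_file <- paste0(tools::file_path_sans_ext(grd_file), ".gri")
  file.exists(grd_file) && file.exists(gri_file)
}

# Usage with empty placeholder files in a scratch folder.
demo_dir <- file.path(tempdir(), "grd_demo")
dir.create(demo_dir, showWarnings = FALSE)
demo_names <- c("test2.grd", "test2.gri", "test2.grd.aux.xml", "other.tif")
file.create(file.path(demo_dir, demo_names))

found <- dataset_files(file.path(demo_dir, "test2.grd"))
complete_before <- grd_is_complete(file.path(demo_dir, "test2.grd"))

# Removing the .gri file makes the dataset incomplete.
unlink(file.path(demo_dir, "test2.gri"))
complete_after <- grd_is_complete(file.path(demo_dir, "test2.grd"))
```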
