optimization for `var[:,:,[1,2]]`, wrong result with indexing with two vectors, and the type-stability issue #131
PRs always welcome
OK, DiskArrays has an implementation of "batch" getindex (`batchgetindex.jl`). Because of a too restrictive type declaration in CommonDataSet, I was not using it. This is now fixed. Now I am using `batchgetindex.jl`, but there are two issues in its implementation. With

```julia
a1 = _DiskArray(zeros(2, 6000));
@show getindex_count(a1)
a1[:,[1,6000]]
@show getindex_count(a1)
```

output: (omitted)
The other issue is that the output can be wrong in some circumstances:

```julia
a1 = _DiskArray(zeros(5,5,3));
size(a1[[1,2],[2,3],:])
# output   (2, 3, 2)
# expected (2, 2, 3)
```

The code in `batchgetindex` is really complicated (for me at least :-)). I am wondering if we should replace it with my implementation, which loads less data and works when two of the indices are vectors. It is also shorter, but the current `batchgetindex.jl` may also do something else that I don't know about.
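For context, here is a minimal sketch of what a counting wrapper like the `_DiskArray` used above might look like (the real helper lives in DiskArrays' test code; the type name, fields, and constructor here are assumptions):

```julia
using DiskArrays

# Hypothetical stand-in for the `_DiskArray` test helper: wraps an in-memory
# array and counts how often the disk-read path (readblock!) is hit.
mutable struct CountingDiskArray{T,N,A<:AbstractArray{T,N}} <: DiskArrays.AbstractDiskArray{T,N}
    parent::A
    getindex_count::Int
end
CountingDiskArray(a::AbstractArray) = CountingDiskArray(a, 0)

Base.size(a::CountingDiskArray) = size(a.parent)
DiskArrays.haschunks(::CountingDiskArray) = DiskArrays.Unchunked()

function DiskArrays.readblock!(a::CountingDiskArray, aout, r::AbstractUnitRange...)
    a.getindex_count += 1               # one underlying "disk" access per call
    aout .= view(a.parent, r...)
end

getindex_count(a::CountingDiskArray) = a.getindex_count
```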
Another issue I just came across is that with DiskArrays the indexing is no longer type stable (in this case):

Currently, NCDatasets + DiskArrays: (output omitted)

NCDatasets 0.12.17: (output omitted)

Should I try to make a PR that replaces the current batchgetindex implementation with the one from NCDatasets 0.12?
You can certainly try!

(By the way, sharing all of this code across all backends is a very nice outcome of using DiskArrays.jl everywhere - there will likely be other things that NCDatasets.jl previously did better than DiskArrays.jl too; we certainly haven't got everything right yet.)
Thanks a lot for sharing the example. Here it is indeed intended behavior to read the whole array, because in your case the chunk size equals the size of the array. So for tests, please experiment a bit with different chunkings of the array. However, I agree that many users are also dealing with arrays that are not explicitly chunked or compressed; for these we currently usually generate some fake chunks (through the `eachchunk` fallback). So, having special implementations for arrays where `haschunks` returns `Unchunked` might be worth considering.
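To make the chunking discussion concrete, the (real) DiskArrays chunk API can be inspected like this, reusing the hypothetical `CountingDiskArray` wrapper sketched above:

```julia
using DiskArrays

a1 = CountingDiskArray(zeros(2, 6000))

# Without native chunks, DiskArrays generates a fallback chunk pattern;
# eachchunk reports the pattern the batch indexing code will actually use.
DiskArrays.eachchunk(a1)
DiskArrays.haschunks(a1)   # Unchunked() here; Chunked() for truly chunked storage
```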
I would guess that for uncompressed chunks this case still makes a difference. Also, we could make some more intelligent choices about how much memory should really be allocated. Some compression algorithms can also work on a stream and do not need to decompress the whole object to get the first bytes, for example. From my understanding of this post, the widely used gzip algorithm has this capability.
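As a small illustration of that streaming property, a sketch using CodecZlib.jl (the file name is made up):

```julia
using CodecZlib   # provides GzipDecompressorStream

# gzip decompresses incrementally: reading the first bytes only inflates
# as much of the stream as needed, not the whole object.
open("data.gz") do io
    stream = GzipDecompressorStream(io)
    first_bytes = read(stream, 64)   # decompress just enough for 64 bytes
    @show length(first_bytes)
end
```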
Do netcdf or gdal take advantage of that when reading partial chunks?

Benchmarks on a PR would be interesting.
Yes, it makes a difference. Test with NCDatasets 0.12.17:

```julia
using BenchmarkTools
using NCDatasets

sz = (1024,1024,64)
ds = NCDataset("/tmp/tmp.nc","c")
defDim(ds,"lon",sz[1])
defDim(ds,"lat",sz[2])
defDim(ds,"time",sz[3])
defVar(ds,"data",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1])
defVar(ds,"datac",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1],deflatelevel = 9)
ds["data"][:] = randn(sz);
ds["datac"][:] = randn(sz);

@btime ds["data"][1,1,1];
@btime ds["data"][:,:,1];
@btime ds["datac"][1,1,1];
@btime ds["datac"][:,:,1];
```

output: (timings omitted)
So reading an entire chunk (if you do not need all of it) is 67 times slower for uncompressed data and 36 times slower for compressed data. But `@btime` reads the same data multiple times, so here is a variant where each timed read touches data that has not been read before:

```julia
sz = (1024,1024,64)
isfile("/tmp/tmp3.nc") && rm("/tmp/tmp3.nc")
ds = NCDataset("/tmp/tmp3.nc","c")
defDim(ds,"lon",sz[1])
defDim(ds,"lat",sz[2])
defDim(ds,"time",sz[3])
defVar(ds,"data",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1])
defVar(ds,"datac",Float64,("lon","lat","time"),chunksizes=[sz[1:2]...,1],deflatelevel = 9)
ds["data"][:] = randn(sz);
ds["datac"][:] = randn(sz);
close(ds)

ds = NCDataset("/tmp/tmp3.nc")
# warm-up
ds["data"][1,1,1];
ds["data"][:,:,2];

@time ds["data"][1,1,3];
@time ds["data"][:,:,4];
@time ds["datac"][1,1,3];
@time ds["datac"][:,:,4];
```

output: (timings omitted)
So a difference of about 20% in this case.
Interesting. It may be good to compare an index in the middle of the chunk as well, as the first one is the extreme (fastest possible) case.
For: (code omitted)

I get (loading the first time): (output omitted)

If the data is in the libnetcdf chunk cache (using …): (output omitted)

Feel free to make other benchmarks with other indices.
How is this one faster? 😄 That does look worth having.
We are again running into the situation where benchmarking is extremely difficult because, as you say: (…)

I am also running the benchmarks right now and still trying to get a better picture, but I guess it will be very complicated to come up with good heuristics.

You are right, but for e.g. remote datasets stored in Zarr it will be the complete opposite. When doing multiple reads into the same chunk, you would have to transfer the whole chunk multiple times to the local machine, while the current DiskArrays.jl implementation makes sure that every chunk is touched exactly once for any data request. As I said, DiskArrays.jl is really tailored to the Zarr model, where there simply is no library-in-the-middle that does some efficient pre-filtering for you. It would be very important to discuss the different use cases, and probably we will have to define additional backend traits in DiskArrays.jl to be able to find the best heuristics that would fit most/all of the real-world use cases.
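To illustrate what such backend traits might look like, here is a purely hypothetical sketch (none of these names exist in DiskArrays.jl):

```julia
# Hypothetical trait: a backend advertises how costly partial chunk reads are,
# so indexing code can choose between sub-chunk reads (cheap for local,
# uncompressed NetCDF) and whole-chunk reads (the only option for remote Zarr).
abstract type SubChunkAccess end
struct CheapSubChunkReads <: SubChunkAccess end   # a library pre-filters for us
struct WholeChunksOnly <: SubChunkAccess end      # the chunk is the transfer unit

# Conservative default: assume a chunk must be fetched as a whole.
subchunkaccess(a) = WholeChunksOnly()
```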
This starts to sound like a big change. If the approach in my PR is not suitable, can somebody else have a look at how to fix the wrong result with indexing with two vectors and the type-stability issue?
Following our meeting, I tried to implement (…). But when I define a more specific method (…), the example now returns the expected size:

```julia
using NCDatasets
filename_src = tempname()
ds = NCDataset(filename_src, "c")
data = zeros(Int,5,5,3)
ncv = defVar(ds,"data",data,("lon","lat","time"))
# typeof(ncv.var) is NCDatasets.Variable, which implements DiskArrays
@show size(ncv.var[[1,2],[2,3],:])
# previously output (2, 3, 2)
# now (2, 2, 3)
```

(Let me know if you see a better way :-))
Unfortunately, the type instability persists: (output omitted)

even if this is type-stable: (output omitted)
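One standard way to locate this kind of instability, using the variable names from the example above:

```julia
using InteractiveUtils   # provides @code_warntype outside the REPL

# Red Any/Union entries in the output mark where inference fails.
@code_warntype ncv.var[[1,2],[2,3],:]
```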
Is this a PR somewhere? It would be good to read it/run it. To me it looks like the instability is in the other half of the (…).
Here is the PR that brings "my" batchgetindex to DiskArrays: (link omitted)
OK. If the instability is not in your code or batchgetindex (no time to check yet, but it really looks like it's not), we can merge the PR and address the instability separately?
OK, can you or @meggart look at the comment (…)?
Given the comment here, maybe this example helps to clarify:

NetCDF.jl + DiskArrays: (code and timing omitted)

NCDatasets 0.12 without DiskArrays: (code and timing omitted)

So 12 minutes versus 1 second for this example. Yes, I know that I can bypass DiskArrays by providing my own batchgetindex. But I don't think that DiskArrays should make the current assumptions if it aims to be generic (if that is the goal). Can we have the current assumptions (i.e. that reading a chunk is about as fast as reading a subset of a chunk) documented?

Concerning the API, another thing I am wondering about is whether we need a function named batchgetindex at all. Why not have getindex simply pass the indices to (…)?
Thanks for writing this up, but probably this needs a specific issue; how does it relate to this one? (This issue already mixes two separate issues into one... we need to organise this process a little; I find myself not understanding what the purpose is.) I think we need to make three separate issues and discuss the details of each to contain this a little. Does this make sense: (list omitted)

With separate MWEs and a clear path to fixing them. Currently I don't know what to work on to help move this forward.
OK, I've made separate issues for these problems. Let's work through them and get them fixed.
In NCDatasets 0.12.17 I had an optimization for `var[:,:,[1,2]]`, converting it to `var[:,:,1:2]`, and turning `var[:,:,[1,2,10,11]]` into two load operations, `v[:,:,1:2]` and `v[:,:,10:11]`. This was implemented here:

https://github.com/Alexander-Barth/NCDatasets.jl/blob/v0.12.17/src/cfvariable.jl#L359-L436

Now a call to `v[:,:,[1,2]]` loads every element individually (which is quite slow). In fact, the relevant code is still in NCDatasets 0.13 but is no longer called. Maybe this functionality should go into DiskArrays?
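A minimal sketch of that optimization, assuming nothing about the original NCDatasets code beyond the behavior described above (the name `to_ranges` is made up):

```julia
# Group a sorted vector of indices into contiguous ranges, so that e.g.
# v[:,:,[1,2,10,11]] can be served by two reads, v[:,:,1:2] and v[:,:,10:11].
function to_ranges(indices::AbstractVector{<:Integer})
    ranges = UnitRange{Int}[]
    isempty(indices) && return ranges
    start = prev = indices[1]
    for i in indices[2:end]
        if i == prev + 1
            prev = i          # extend the current run
        else
            push!(ranges, start:prev)
            start = prev = i  # begin a new run
        end
    end
    push!(ranges, start:prev)
    return ranges
end

to_ranges([1, 2, 10, 11])   # returns [1:2, 10:11]
```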