Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DiskArrays for Variable's #205

Merged
merged 42 commits into from
Sep 29, 2023
Merged

Conversation

tcarion
Copy link
Member

@tcarion tcarion commented Apr 14, 2023

This is an attempt to make the subtypes of AbstractVariable behave like DiskArrays. See JuliaGeo/CommonDataModel.jl#8.
I'm not perfectly sure of what I'm doing, but simple cases already work:

using NCDatasets
ds = NCDataset("tos_O1_2001-2002.nc")
var[:,:,1]
julia> var[:,:,1]
180×170 reshape(::Array{Union{Missing, Float32}, 3}, 180, 170) with eltype Union{Missing, Float32}:
 missing  missing  missing  missing  missing  missing  missing  missing  missing    271.419  271.42   271.422  271.426  271.43   271.437  271.445  271.459
 missing  missing  missing  missing  missing  missing  missing  missing  missing     271.419  271.42   271.422  271.426  271.431  271.438  271.445  271.459
 missing  missing  missing  missing  missing  missing  missing  missing  missing     271.418  271.42   271.422  271.425  271.431  271.438  271.445  271.459
                                                                                                                                                
 missing  missing  missing  missing  missing  missing  missing  missing  missing     271.42   271.421  271.423  271.426  271.43   271.437  271.445  271.459
 missing  missing  missing  missing  missing  missing  missing  missing  missing     271.42   271.42   271.423  271.426  271.43   271.437  271.445  271.459

But of course it doesn't always work:

julia> var[2:4,1:3,1]
ERROR: BoundsError: attempt to access 3×3×1 Array{Union{Missing, Float32}, 3} at index [2:4, 1:3, 1:1]
Stacktrace:
  [1] throw_boundserror(A::Array{Union{Missing, Float32}, 3}, I::Tuple{UnitRange{Int64}, UnitRange{Int64}, UnitRange{Int64}})
    @ Base ./abstractarray.jl:691
  [2] checkbounds
    @ ./abstractarray.jl:656 [inlined]
  [3] view(::Array{Union{Missing, Float32}, 3}, ::UnitRange{Int64}, ::UnitRange{Int64}, ::UnitRange{Int64})
    @ Base ./subarray.jl:177
  [4] maybeview(::Array{Union{Missing, Float32}, 3}, ::UnitRange{Int64}, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64}})
    @ Base ./views.jl:146
  [5] dotview(::Array{Union{Missing, Float32}, 3}, ::UnitRange{Int64}, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64}})
    @ Base.Broadcast ./broadcast.jl:1200
  [6] readblock!(::NCDatasets.Variable{Float32, 3, NCDataset{Nothing}}, ::Array{Union{Missing, Float32}, 3}, ::UnitRange{Int64}, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64}})
    @ NCDatasets ~/.julia/dev/NCDatasets/src/variable.jl:449
  [7] readblock!(::CommonDataModel.CFVariable{Union{Missing, Float32}, 3, NCDatasets.Variable{Float32, 3, NCDataset{Nothing}}, NCDatasets.Attributes{NCDataset{Nothing}}, NamedTuple{(:fillvalue, :missing_values, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor), Tuple{Float32, Tuple{Float32}, Nothing, Nothing, Nothing, Nothing, Nothing}}}, ::Array{Union{Missing, Float32}, 3}, ::UnitRange{Int64}, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64}})
    @ CommonDataModel ~/.julia/dev/CommonDataModel/src/cfvariable.jl:389
  [8] getindex_disk(::CommonDataModel.CFVariable{Union{Missing, Float32}, 3, NCDatasets.Variable{Float32, 3, NCDataset{Nothing}}, NCDatasets.Attributes{NCDataset{Nothing}}, NamedTuple{(:fillvalue, :missing_values, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor), Tuple{Float32, Tuple{Float32}, Nothing, Nothing, Nothing, Nothing, Nothing}}}, ::UnitRange{Int64}, ::Vararg{Any})
    @ DiskArrays ~/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:31
  [9] getindex(::CommonDataModel.CFVariable{Union{Missing, Float32}, 3, NCDatasets.Variable{Float32, 3, NCDataset{Nothing}}, NCDatasets.Attributes{NCDataset{Nothing}}, NamedTuple{(:fillvalue, :missing_values, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor), Tuple{Float32, Tuple{Float32}, Nothing, Nothing, Nothing, Nothing, Nothing}}}, ::UnitRange{Int64}, ::UnitRange{Int64}, ::Int64)
    @ DiskArrays ~/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:177
 [10] top-level scope
    @ REPL[17]:1

I will try to solve this in the coming days, any help is of course welcomed, and as I said above, I'm unsure about the approach I'm following :)

cc @rafaqz @Alexander-Barth

@rafaqz
Copy link
Member

rafaqz commented Apr 15, 2023

Hmm I dont get how your second example could ever work?

Oh the 3x3 is a chunk?

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Apr 15, 2023

Thanks a lot for taking a look at this issue! :-)

I have an old branch https://github.com/Alexander-Barth/NCDatasets.jl/tree/DiskArrays that might be useful.

What do you think about these changes?

diff --git a/src/NCDatasets.jl b/src/NCDatasets.jl
index 2140f5e..bd1da5a 100644
--- a/src/NCDatasets.jl
+++ b/src/NCDatasets.jl
@@ -38,7 +38,7 @@ import CommonDataModel: AbstractDataset, AbstractVariable,
     groupnames, group,
     dimnames, dim,
     attribnames, attrib
-import DiskArrays: readblock!, writeblock!
+import DiskArrays: readblock!, writeblock!, AbstractDiskArray
 
 function __init__()
     NetCDF_jll.is_available() && init_certificate_authority()
diff --git a/src/types.jl b/src/types.jl
index c89fb29..a637ce2 100644
--- a/src/types.jl
+++ b/src/types.jl
@@ -61,7 +61,7 @@ end
 
 # Variable (as stored in NetCDF file, without using
 # add_offset, scale_factor and _FillValue)
-mutable struct Variable{NetCDFType,N,TDS<:AbstractNCDataset} <: AbstractNCVariable{NetCDFType, N}
+mutable struct Variable{NetCDFType,N,TDS<:AbstractNCDataset} <: AbstractDiskArray{NetCDFType, N}
     ds::TDS
     varid::Cint
     dimids::NTuple{N,Cint}
diff --git a/src/variable.jl b/src/variable.jl
index 425df70..ba02c72 100644
--- a/src/variable.jl
+++ b/src/variable.jl
@@ -446,8 +446,8 @@ function readblock!(v::Variable{T,N}, aout, indexes::TR...) where {T,N} where TR
     data = Array{T,N}(undef,jlshape)
 
     datamode(v.ds)
-    aout[indexes...] .= nc_get_vars!(v.ds.ncid,v.varid,start,count,stride,data)
-    return data
+    nc_get_vars!(v.ds.ncid,v.varid,start,count,stride,aout)
+    return aout
 end
 
 function writeblock!(v::Variable{T,N},data::T,indexes::StepRange{Int,Int}...) where {T,N}

(I just commit them to see).

I am wondering if the changes are not rather limited to Variable (as other variable type just wrap the base NetCDF Variable type).

@Alexander-Barth
Copy link
Member

Unfortunately, we still have an infinite recursion when loading a scalar:

https://github.com/Alexander-Barth/NCDatasets.jl/blob/pr-DiskArray/test/test_scalar.jl#L16

@rafaqz
Copy link
Member

rafaqz commented Apr 15, 2023

Indexing a multidimensional array with a Colon like that is again breaking the base julia the array interface ;)

Think I slightly misunderstood the example but indexing with a colon should work but should always return a vector? Not a scalar?

If we dont follow base array indexing sharing code is difficult.

@Alexander-Barth
Copy link
Member

the data in question is a zero-dimensional array. This does work in base julia:

zero_dim_array =  dropdims([1],dims=1)
zero_dim_array[:]

[:] just flattens the array:

julia> ones(2,2)[:]
4-element Vector{Float64}:
 1.0
 1.0
 1.0
 1.0

For some reasons my commits do not show in this PR, strange.

@rafaqz
Copy link
Member

rafaqz commented Apr 15, 2023

Yes it should work, returning a vector.

But youre testing that the result is a scalar?

(Its also possible there are bugs for zero dimensions in DiskArrays.jl)

@tcarion
Copy link
Member Author

tcarion commented Apr 17, 2023

I am wondering if the changes are not rather limited to Variable (as other variable type just wrap the base NetCDF Variable type).

You're probably right. At the moment, Variable <: AbstractNCVariable <: CommonDataModel.AbstractVariable <: AbstractDiskArrays (using this branch of CommonDataModel). But it might not be necessary.

What do you think about these changes?

-    aout[indexes...] .= nc_get_vars!(v.ds.ncid,v.varid,start,count,stride,data)
-    return data
+    nc_get_vars!(v.ds.ncid,v.varid,start,count,stride,aout)
+    return aout

This would be better since we don't allocate the data array anymore, but I've got an issue:

julia> var[2:4,1:5,1]
ERROR: TypeError: non-boolean (Missing) used in boolean context
Stacktrace:
  [1] CFtransform_missing
    @ ~/.julia/dev/CommonDataModel/src/cfvariable.jl:235 [inlined]
  [2] CFtransform
    @ ~/.julia/dev/CommonDataModel/src/cfvariable.jl:277 [inlined]
  [3] macro expansion
    @ ~/.julia/dev/CommonDataModel/src/cfvariable.jl:318 [inlined]
  [4] macro expansion
    @ ./simdloop.jl:77 [inlined]
  [5] CFtransformdata!
    @ ~/.julia/dev/CommonDataModel/src/cfvariable.jl:317 [inlined]
  [6] CFtransformdata
    @ ~/.julia/dev/CommonDataModel/src/cfvariable.jl:326 [inlined]
  [7] readblock!(::CommonDataModel.CFVariable{Union{Missing, Float32}, 3, NCDatasets.Variable{Float32, 3, NCDataset{Nothing}}, NCDatasets.Attributes{NCDataset{Nothing}}, NamedTuple{(:fillvalue, :missing_values, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor), Tuple{Float32, Tuple{Float32}, Nothing, Nothing, Nothing, Nothing, Nothing}}}, ::Array{Union{Missing, Float32}, 3}, ::UnitRange{Int64}, ::UnitRange{Int64}, ::Vararg{UnitRange{Int64}})
    @ CommonDataModel ~/.julia/dev/CommonDataModel/src/cfvariable.jl:390
  [8] getindex_disk(::CommonDataModel.CFVariable{Union{Missing, Float32}, 3, NCDatasets.Variable{Float32, 3, NCDataset{Nothing}}, NCDatasets.Attributes{NCDataset{Nothing}}, NamedTuple{(:fillvalue, :missing_values, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor), Tuple{Float32, Tuple{Float32}, Nothing, Nothing, Nothing, Nothing, Nothing}}}, ::UnitRange{Int64}, ::Vararg{Any})
    @ DiskArrays ~/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:31
  [9] getindex(::CommonDataModel.CFVariable{Union{Missing, Float32}, 3, NCDatasets.Variable{Float32, 3, NCDataset{Nothing}}, NCDatasets.Attributes{NCDataset{Nothing}}, NamedTuple{(:fillvalue, :missing_values, :scale_factor, :add_offset, :calendar, :time_origin, :time_factor), Tuple{Float32, Tuple{Float32}, Nothing, Nothing, Nothing, Nothing, Nothing}}}, ::UnitRange{Int64}, ::UnitRange{Int64}, ::Int64)
    @ DiskArrays ~/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:177
 [10] top-level scope
    @ REPL[25]:1

Note that indexing directly on the variable works though:

julia> var.var[2:4,1:5,1]
3×5 Matrix{Float32}:
 1.0f20  1.0f20  1.0f20  1.0f20  1.0f20
 1.0f20  1.0f20  1.0f20  1.0f20  1.0f20
 1.0f20  1.0f20  1.0f20  1.0f20  1.0f20

I think it comes from the fact that when this method is called, aout is an array of Union{Missing, T}, which makes it doesn't get properly populated with nc_get_vars! (aout keeps containing only missing after the call to the method). Would you know why (I don't really understand how nc_get_vars! works)?

Except from this, reading seems to work for this particular use case. I will try to get writing work, I hope it won't be too much trickier!

For some reasons my commits do not show in this PR, strange.

Strange indeed! Do you commit to my forked diskarrays branch? Maybe I have to setup something in my forked repo?

@rafaqz
Copy link
Member

rafaqz commented Apr 17, 2023

If you want to add diskarrays behavioir just to Variable, you can do that using the macros provided in DiskArrays.jl, rather than subtyping AbstractDiskArray.

But Im not sure you would want any disk based array types to not have disk arrays behaviour, lazy chunked broadcasting is a pretty nice thing to have, instead if broadcasts just not working at all.

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Apr 17, 2023

I am wondering if (at least as a start) we just convert the base type NCDatasets.Variable to derive from DiskArray.
In this case, the conversion from Array{Union{Missing,T},N} to Array{T,N} is already taken care of.
As far as I understand, we want the data on disk be cashed but maybe not the transformed/unpacked data?

Maybe the macros that @rafaqz mentioned can then be used for wrapped types like CFVariable without requiring that they derive from DiskArray?

You guys have a better understand of DiskArrays works... Is there any documentation about the macros?

Strange indeed! Do you commit to my forked diskarrays branch? Maybe I have to setup something in my forked repo?

I used the commands from https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally to test the PR:

git fetch origin pull/205/head:pr-DiskArray
git checkout pr-DiskArray 

and I assumed that any commits with show up here, but maybe that work only for PR that I initiate. In any case, if things become simplier for you, if you are a dev on this repo just let me know.

@rafaqz In general, yes some the test values should change. The issue of the recursion happened before but maybe it is related to this:

https://github.com/Alexander-Barth/NCDatasets.jl/pull/205/files#diff-fe41cd4463f1540b047828f14c99e312c078e817112d8f201a49eace1f5d58b9R495

I will already change the test in the master version.
Update: this is the commit:
690ba22

@tcarion
Copy link
Member Author

tcarion commented Apr 17, 2023

I tried to run the test and I spotted a few general issues:

  1. When scalar variables are implied
  2. When infinite dimensions are implied
  3. When writing to the Variable with a different eltype (e.g. here)
  4. When dealing with other subtypes of CommonDataModel.AbstractVariable (SubVariable, MFVariable). That would imply to change all the getindex / setindex! methods with their readblock!/writeblock equivalent. That's why it might be more simple to add diskarrays behaviour only to Variable as you suggested above, but I don't know how to do this. @rafaqz would you have some examples I could base upon?

@rafaqz
Copy link
Member

rafaqz commented Apr 17, 2023

See here for disk arrays macros. Above that application you can see all the sub macros. But you probably want the main one:

https://github.com/meggart/DiskArrays.jl/blob/main/src/DiskArrays.jl#L37

So @implement_diskarray Variable should do it.

We could run your tests in CI so we can all see the failures and discuss them? I think thats the best way for me to contribute to the discussion.

@tcarion I will make you a JuliaGeo member so you dont need run permissions. You already have a package here anyway.

Edit sorry I just realised this isn't the CommonDataModel thread so not JuliaGeo.

@Alexander-Barth can you change the package settings so that @tcarion can run CI with these pushes without you clicking approve every time? I would like to help with resoloving the bugs.

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Apr 18, 2023

Concerning the error type "2. When infinite dimensions are implied" there are some changes API, for example when previously we could do

# ncvar is a NetCDF vector variable with an ulimited dimension
ncvar[:] = data

we now need to do:

# ncvar is a NetCDF vector variable with an ulimited dimension
ncvar[1:length(data)] = data

which mimics the behaviour of NetCDF.jl.

For the other cases, however, I don't think (or hope :-) ) that we need to break current API.

Another point is that ncvar[:] currently loads all the data and preserving the shape (without flatting the array as in julia). This behavior can go away too. I tried to remove tests that rely on this behaviour, but I might have forgotten some.

Thanks @rafaqz for the info!

So @implement_diskarray Variable should do it.

To be certain that I understand: we can either derive Variable from AbstractDiskArrays or we can use the macro to inherit the behavior (chunking,...) . Is this correct? Or do we need to do both?

What do you think about the approach to make just NCDatasets.Variable as a DiskArrays and all types wrapping NCDatasets.Variable would use method forwarding (for eachchunk...) as explained here?

I changed the setting in github action; I hope it works now. (If not maybe @tcarion can make a trivial change the to README.md so that he is not considered as a first time contributor to NCDatasets for github.)

@tcarion
Copy link
Member Author

tcarion commented Apr 18, 2023

So @implement_diskarray Variable should do it.

I tried and I get a UndefVarError: Variable not defined. I think the problem is that DiskArrays looks for a Variable defined in its own Module:

julia> @macroexpand(@implement_getindex(Variable))
quote
    #= /home/tcarion/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:177 =#
    (DiskArrays.Base).getindex(var"#574#a"::DiskArrays.Variable, var"#575#i"...) = begin
            #= /home/tcarion/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:177 =#
            DiskArrays.getindex_disk(var"#574#a", var"#575#i"...)
        end
    #= /home/tcarion/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:179 =#
    function (DiskArrays.Base).getindex(var"#578#a"::DiskArrays.Variable, var"#579#i"::DiskArrays.ChunkIndex)
        #= /home/tcarion/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:179 =#
        #= /home/tcarion/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:180 =#
        var"#576#cs" = DiskArrays.eachchunk(var"#578#a")
        #= /home/tcarion/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:181 =#
        var"#577#inds" = var"#576#cs"[(var"#579#i").I]
        #= /home/tcarion/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:182 =#
        return DiskArrays.wrapchunk((var"#579#i").chunktype, var"#578#a"[var"#577#inds"...], var"#577#inds")
    end
    #= /home/tcarion/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:184 =#
    function (DiskArrays.DiskArrays).ChunkIndices(var"#580#a"::DiskArrays.Variable; offset = false)
        #= /home/tcarion/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:184 =#
        #= /home/tcarion/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:185 =#
        return DiskArrays.ChunkIndices((DiskArrays.Base).OneTo.(DiskArrays.size(DiskArrays.eachchunk(var"#580#a"))), if offset
                    DiskArrays.OffsetChunks()
                else
                    DiskArrays.OneBasedChunks()
                end)
    end
end

I'm pretty unexperienced with macros, but I'd say it's something that should be fixed in DiskArrays, right?

we now need to do:

# ncvar is a NetCDF vector variable with an ulimited dimension
ncvar[1:length(data)] = data

which mimics the behaviour of NetCDF.jl.

Ok I'll keep that in mind when trying to make the tests work, thanks!

What do you think about the approach to make just NCDatasets.Variable as a DiskArrays and all types wrapping NCDatasets.Variable would use method forwarding (for eachchunk...) as explained here?

Since @implement_diskarray is currently not working, I will go with that approach for now.
After a second thought, I don't see how to make only NCDatasets.Variable a subtype of AbstractDiskArrays while keeping it also a CommonDataModel.AbstractVariable. So I think the macro is essential here.

I changed the setting in github action; I hope it works now. (If not maybe @tcarion can make a trivial change the to README.md so that he is not considered as a first time contributor to NCDatasets for github.)

Thank you, I'll try after my next commit, I'll let you know if it didn't work :)

@rafaqz
Copy link
Member

rafaqz commented Apr 18, 2023

@Alexander-Barth collect(ncvar) or Array(ncvar) would be more Julian ways to do ncvar[:] in a generic way.

we can either derive Variable from AbstractDiskArrays or we can use the macro to inherit the behavior (chunking,...) . Is this correct? Or do we need to do both?

Yes the macro is what we apply to AbstractDiskArray anyway, so it will give identical behavior to whatever you apply it to without inheriting from anything. You can also apply subsets of it with the other macros (see the lines above here)

What do you think about the approach to make just NCDatasets.Variable as a DiskArrays and all types wrapping NCDatasets.Variable would use method forwarding (for eachchunk...) as explained here?

Honestly I think eventually every chunked disk-based array object in the julia ecosystem should use DiskArrays.jl so we have a consistent standard.

But we should of course compromise on that to just get things working! that's why we have the macros.

@rafaqz
Copy link
Member

rafaqz commented Apr 18, 2023

@tcarion yeah this macro hasn't been very widely used, it seems to not be escaped properly.

Try using CommonDataModel.Variable

Otherwise I will fix it later on.

@tcarion
Copy link
Member Author

tcarion commented Apr 18, 2023

I tried and the issue stays:

    (DiskArrays.Base).getindex(var"#1184#a"::(DiskArrays.NCDatasets).Variable, var"#1185#i"...) = begin
            #= /home/tcarion/.julia/packages/DiskArrays/zaCXL/src/diskarray.jl:177 =#
            DiskArrays.getindex_disk(var"#1184#a", var"#1185#i"...)
        end

Thanks for looking at it!

@rafaqz
Copy link
Member

rafaqz commented Apr 18, 2023

Im working on a fix now

@rafaqz
Copy link
Member

rafaqz commented Apr 18, 2023

Turns oit there a few other fixes needed for this to work.

@Alexander-Barth
Copy link
Member

@Alexander-Barth collect(ncvar) or Array(ncvar) would be more Julian ways to do ncvar[:] in a generic way.

I agree, we have already Array(ncvar) in NCDatasets. collect currently works too (but slow).

@rafaqz
Copy link
Member

rafaqz commented Apr 23, 2023

I've registered a patch version of DiskArrays.jl with fixed macros, hopefully that gets things working.

JuliaRegistries/General#82137

@rafaqz
Copy link
Member

rafaqz commented Jul 30, 2023

@Alexander-Barth can we get this merged?

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Aug 2, 2023

Sorry I did not see your message (I got a lot of emails from github recently that I missed that you are ready with the PR).
I think this PR depends on JuliaGeo/CommonDataModel.jl#9 (or maybe not?). Is this ready too?
For NCDatasets, I will draft some release notes to document the changes. Maybe you will have a time to review it.

Thanks a lot for @tcarion for the hard work and @rafaqz and @meggart for their help!

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Aug 3, 2023

For the release notes:
The array API of NCDatasets is now more similar to base Julia in particular:

  • ncvar[range_indices] = scalar should now be ncvar[range_indices] .= scalar
  • ncvar2D[:] flattens the data in the 2D NetCDF variable ncvar2D. To read the full array one need to use ncvar2D[:,:] or Array(ncvar2D) (similarly for 3D, 4D... arrays).
  • Accessing an array out of bounds, new returns a DimensionMismatch exception (previously a NCDatasets.NetCDFError exception was returned)
  • To grow a NetCDF variable with unlimited dimension, the corresponding index of left-hand side of the assignment cannot be a colon, but should be a range. For example if ncvar is a NetCDF variable where the 2nd dimension is unlimited, ncvar[:,:] = zeros(2,3) should now be replaced by ncvar[:,1:3] = zeros(2,3)

@rafaqz
Copy link
Member

rafaqz commented Aug 3, 2023

This doesn't depend on CommonDataModel changes, as we cant do this for AbstractVariable currently (unless you have changed your mind).

These are kind of demonstration PRs... ultimately we need to make AbstractVariable <: AbstractDiskArray in CommonDataModel.jl for much of this to be usable, and avoid duplicating all of this code everywhere.

It would fix issues like this and allow deleting SubVariable and CatArray, and probably a lot more code here and in CommonDataModel. #219

But it will hit your setindex! problem somewhere or other.

I'm happy to do this work if its something you want. Otherwise merging this is useful as a stopgap so we can hack together temporary DiskArrays support elsewhere on top of Variable. This is all pretty horrible though, we realy just need AbstractVariable <: AbstractDiskArray

@tcarion
Copy link
Member Author

tcarion commented Aug 4, 2023

I confirm that JuliaGeo/CommonDataModel.jl#9 is not needed for this PR to work. JuliaGeo/CommonDataModel.jl#9 was intended to try to make all AbstractVariable <: AbstractDiskArray, which I think would be the best long-term solution, but, as @rafaqz mentioned, is failing because of the unlimited dimension issue.

The release notes seem nice and I think they cover everything that has changed, thank you for writing them!

Conflicts:
	src/variable.jl
@Alexander-Barth
Copy link
Member

Unfortunately, I see some slow down for the benchmark on my machine. Current master runs in 181.193 ms while this PR 307.836 ms. Do you see this as well? I am using DiskArrays v0.3.14 (current version).

I tried this PR with another package of our package (DIVAnd) and upgrade was quite smooth.

For info: this commit 7edf68f disable the opendap tests which causes current test failures (opendap server seems to be down).

Master

julia> bm
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  162.789 ms … 231.338 ms  ┊ GC (min … max): 1.82% … 15.08%
 Time  (median):     181.193 ms               ┊ GC (median):    2.68%
 Time  (mean ± σ):   183.786 ms ±  16.540 ms  ┊ GC (mean ± σ):  3.09% ±  1.54%

  █ ▄▂▂            ▂▆  ▂ ▂ ▂                                     
  █▆███▆▆▁▄▆▆▆▆█▄▄███▄▄█▄█▁███▆▆▆▆▁▆▄▁▄▁▁▁▄▄▁▁▄▄▁▄▁▁▁▁▁▁▆▄▄▄▆▁▄ ▄
  163 ms           Histogram: frequency by time          224 ms <

 Memory estimate: 476.90 MiB, allocs estimate: 1610.

PR

julia> bm
BenchmarkTools.Trial: 100 samples with 1 evaluation.
 Range (min … max):  291.086 ms … 326.860 ms  ┊ GC (min … max): 1.15% … 1.50%
 Time  (median):     307.836 ms               ┊ GC (median):    1.74%
 Time  (mean ± σ):   307.185 ms ±   4.399 ms  ┊ GC (mean ± σ):  1.74% ± 0.22%

                            ▃   ▃ ▂     ▅▂▅ █▃                   
  ▄▁▁▁▁▁▁▁▁▁▄▁▁▁▁▁▁▄▁▁▄▁▅▄▅▄█▁█▇█▄█▅▄▄▇▇███▇████▅▄▇▄▄▁▁▁▁▁▁▁▁▁▄ ▄
  291 ms           Histogram: frequency by time          318 ms <

 Memory estimate: 527.15 MiB, allocs estimate: 8010.

function Base.getindex(v::Variable,indexes::Int...)
# This method needs to be duplicated instead of using an Union. Otherwise a DiskArrays fallback is called instead which impacts performances
# (see https://github.com/Alexander-Barth/NCDatasets.jl/pull/205#issuecomment-1589575041)
function readblock!(v::Variable, aout, indexes::TI...) where TI <: Union{AbstractUnitRange,StepRange}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
function readblock!(v::Variable, aout, indexes::TI...) where TI <: Union{AbstractUnitRange,StepRange}
function readblock!(v::Variable, aout, indexes::Union{<:AbstractUnitRange,<:StepRange}...)

This dispatch was not correct so we were using the fallback for AbstractVector.

I get roughly the same performance as master after this change.

@rafaqz
Copy link
Member

rafaqz commented Sep 16, 2023

@Alexander-Barth see the fix above for your performance question.

Took a few hours to track down 😅 - that's a very subtle bug in dispatch on readblock! where using a type parameter for mixed type varargs its not the same as using the types directly in the signature.

Hopefully this is good to merge now after that change is accepted.

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Sep 18, 2023

Great! I confirm that the timing are now good:

Module median minimum mean std. dev.
R-ncdf4 0.373 0.351 0.373 0.012
python-netCDF4 0.569 0.549 0.571 0.010
julia-NCDatasets 0.192 0.174 0.192 0.010

It seems that I have now a new warning that I did not see before:

┌ NCDatasets [85f8d34a-cbdd-5861-8df4-14fed0d494ab]
│  WARNING: Method definition _reshape(DiskArrays.AbstractDiskArray{var"#s1303", 1} where var"#s1303", Tuple{Int64}) in module DiskArrays at /home/abarth/.julia/packages/DiskArrays/qHbZi/src/reshape.jl:83 overwritten in module NCDatasets on the same line (check for duplicate calls to `include`).
│    ** incremental compilation may be fatally broken for this module **
│  
│  WARNING: Method definition zip(Any, NCDatasets.Variable{NetCDFType, N, TDS} where TDS<:NCDatasets.AbstractNCDataset where N where NetCDFType, Any...) in module NCDatasets at /home/abarth/.julia/packages/DiskArrays/qHbZi/src/zip.jl:76 overwritten at /home/abarth/.julia/packages/DiskArrays/qHbZi/src/zip.jl:77.
│    ** incremental compilation may be fatally broken for this module **
└  

Any ideas where this can come from? I am getting them with DiskArrays v0.3.16 and v0.3.17.

@rafaqz
Copy link
Member

rafaqz commented Sep 18, 2023

Great! I think those warnings are my fault form recent addition of methods, will fix.

@rafaqz
Copy link
Member

rafaqz commented Sep 18, 2023

zip should be fixed here JuliaIO/DiskArrays.jl#115

I'm not getting the other error

@rafaqz
Copy link
Member

rafaqz commented Sep 19, 2023

@Alexander-Barth this shouldnt have a warning now

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Sep 21, 2023

With DiskArrays, v0.3.18, there seam to be macro issue:

Line 71 on NCDatasets.jl is @implement_diskarray NCDatasets.Variable.

Any idea which is could be the case?

julia> using NCDatasets
[ Info: Precompiling NCDatasets [85f8d34a-cbdd-5861-8df4-14fed0d494ab]
ERROR: LoadError: syntax: "NCDatasets.Variable" is not a valid function argument name around /home/abarth/.julia/packages/DiskArrays/Wl1VA/src/zip.jl:77
Stacktrace:
 [1] top-level scope
   @ ~/.julia/dev/NCDatasets/src/NCDatasets.jl:71
 [2] include
   @ ./Base.jl:457 [inlined]
 [3] include_package_for_output(pkg::Base.PkgId, input::String, depot_path::Vector{String}, dl_load_path::Vector{String}, load_path::Vector{String}, concrete_deps::Vector{Pair{Base.PkgId, UInt128}}, source::Nothing)
   @ Base ./loading.jl:2010
 [4] top-level scope
   @ stdin:2
in expression starting at /home/abarth/.julia/dev/NCDatasets/src/NCDatasets.jl:1
in expression starting at stdin:2

Wild guess: maybe on Base.zip($t, ::$t, xs...) should be Base.zip(::$t, ::$t, xs...) in DiskArrays.jl zip macro?

@rafaqz
Copy link
Member

rafaqz commented Sep 21, 2023

Feel free to PR your fix ;)

@rafaqz
Copy link
Member

rafaqz commented Sep 21, 2023

Ok fixed, will be registered soon. That method is only for preventing ambiguity before throwing an error, so isn't caught in the DiskArrays tests because we don't use the macro on Package.ArrayType but just ArrayType.

Anyway, there are some overheads to using these macros rather than inheritance, we were actually close to deleting the macros in DiskArrays.jl before this PR.

@rafaqz
Copy link
Member

rafaqz commented Sep 24, 2023

@Alexander-Barth this should be good to merge

@Alexander-Barth
Copy link
Member

Alexander-Barth commented Sep 25, 2023

I still see a method warning (DiskArrays v0.3.19):

WARNING: Method definition _reshape(DiskArrays.AbstractDiskArray{var"#s69", 1} where var"#s69", Tuple{Int64}) in module DiskArrays at /home/abarth/.julia/packages/DiskArrays/fshK3/src/reshape.jl:83 overwritten in module NCDatasets on the same line (check for duplicate calls to `include`).
  ** incremental compilation may be fatally broken for this module **

But with these two commits 08a4888 and 7edf68f (unrelated to DiskArrays) all NCDatasets test pass on julia 1.9 and 1.6.

Can this warning be fixed before I merge? Thanks a lot!

@tcarion
Copy link
Member Author

tcarion commented Sep 25, 2023

Thank you both for your help! Sorry I'm not very available lastly...
I also get this warning coming from DiskArrays.jl on 1.9.

On 1.10, this becomes an error:

               _
   _       _ _(_)_     |  Documentation: https://docs.julialang.org
  (_)     | (_) (_)    |
   _ _   _| |_  __ _   |  Type "?" for help, "]?" for Pkg help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 1.10.0-beta2 (2023-08-17)
 _/ |\__'_|_|_|\__'_|  |  Official https://julialang.org/ release
|__/                   |

julia> using NCDatasets
Precompiling NCDatasets
  ✗ NCDatasets
  0 dependencies successfully precompiled in 7 seconds. 29 already precompiled.

ERROR: The following 1 direct dependency failed to precompile:

NCDatasets [85f8d34a-cbdd-5861-8df4-14fed0d494ab]

Failed to precompile NCDatasets [85f8d34a-cbdd-5861-8df4-14fed0d494ab] to "/home/tcarion/.julia/compiled/v1.10/NCDatasets/jl_c4pSNN".
WARNING: Method definition _reshape(DiskArrays.AbstractDiskArray{var"#s69", 1} where var"#s69", Tuple{Int64}) in module DiskArrays at /home/tcarion/.julia/packages/DiskArrays/fshK3/src/reshape.jl:83 overwritten in module NCDatasets on the same line (check for duplicate calls to `include`).
ERROR: LoadError: Method overwriting is not permitted during Module precompile.

@rafaqz
Copy link
Member

rafaqz commented Sep 25, 2023

Weird, Im not getting that locally or in CI. Will try to reproduce

@rafaqz
Copy link
Member

rafaqz commented Sep 26, 2023

Ok, this was there for an ambiguity method. Will be fixed soon.

Unfortunately we cant fix the ambiguity for NCDatasets.Variable in the macro because we cant know the type parameters of arbitrary types. Using the macro makes these things difficult to handle, really we shouldn't use them.

@rafaqz
Copy link
Member

rafaqz commented Sep 26, 2023

The last fix is registered, hopefully that does it

@Alexander-Barth
Copy link
Member

Thanks a lot! Everything works now fine too on my side.

@Alexander-Barth Alexander-Barth merged commit 7576323 into JuliaGeo:master Sep 29, 2023
@rafaqz
Copy link
Member

rafaqz commented Sep 29, 2023

Amazing so excited to have the basics of this working now.

@rafaqz
Copy link
Member

rafaqz commented Sep 29, 2023

Can we get a new version tagged so we can work on rafaqz/Rasters.jl#416?

@tcarion tcarion deleted the diskarrays branch October 9, 2023 11:28
@Alexander-Barth
Copy link
Member

Alexander-Barth commented Dec 13, 2023

In my tests, unfortunately the slow batchgetindex is still called when using v[:,:,1] with DiskArray 0.3.22.

I traced it down how julia handles function specificity. Surprisingly we get:

julia> foo(i::AbstractVector...) = "vectors"
foo (generic function with 1 method)

julia> foo(i::TI...) where TI <: Union{AbstractUnitRange,StepRange} = "range"
foo (generic function with 2 methods)

julia> foo(Base.OneTo(1),Base.OneTo(1),1:1)
"vectors"

The fix is to define the following function:

julia> foo(i::AbstractRange...) = "range 2"
foo (generic function with 3 methods)

julia> foo(Base.OneTo(1),Base.OneTo(1),1:1)
"range 2"

Unfortunately, even avoiding batchgetindex v[:,:,1] is still not as fast as in NCDataset 0.12 without DiskArray:

NCDatasets 0.13.1 with DiskArrays

julia> v1 = @btime v[:,:,1];
  35.120 μs (36 allocations: 254.42 KiB)

julia> @which v[:,:,1]
getindex(a::NCDatasets.Variable, i...)
     @ NCDatasets ~/.julia/dev/DiskArrays/src/diskarray.jl:215

NCDatasets 0.12.17 without DiskArrays


julia> v1 = @btime v[:,:,1];
  23.730 μs (8 allocations: 253.45 KiB)

julia> @which v[:,:,1]
getindex(v::NCDatasets.Variable{T, N}, indexes::Union{Colon, Int64, AbstractRange{<:Integer}}...) where {T, N}                               
     @ NCDatasets ~/.julia/packages/NCDatasets/st9Jz/src/variable.jl:482

Any ideas how we can improve would be appreciated :-)
@tcarion @meggart @rafaqz

Commit: b52a71a

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants