Return Tables instead of NamedTuples. #59

Merged
merged 10 commits into from
Sep 6, 2023

Conversation

evetion
Owner

@evetion evetion commented Aug 19, 2023

Also fixes #50

This changes the return values of points and lines to Table or PartitionedTable, both of which implement the Tables.jl interface. Technically this is breaking, but the existing tests still pass, so I think the impact in practice is negligible. Old patterns like reduce(vcat, DataFrame.(points(g))) still work, but DataFrame(points(g)) is now possible, shorter and faster.
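
As a rough illustration of what "implementing the Tables.jl interface" means for a wrapper type, here is a minimal sketch. `PointTable` and its field `cols` are hypothetical names for this example and are not SpaceLiDAR's actual types. It assumes the wrapper holds a NamedTuple of equal-length vectors, which Tables.jl already treats as a column table.

```julia
using Tables

# Hypothetical wrapper type for illustration only (not SpaceLiDAR's Table).
struct PointTable{NT<:NamedTuple}
    cols::NT
end

# Declare the type a table with column access.
Tables.istable(::Type{<:PointTable}) = true
Tables.columnaccess(::Type{<:PointTable}) = true

# A NamedTuple of vectors already satisfies the column interface,
# so the wrapped data can be returned directly.
Tables.columns(t::PointTable) = getfield(t, :cols)

t = PointTable((longitude = [4.3, 4.4], latitude = [52.0, 52.1], height = Float32[1.5, 2.5]))

# Any Tables.jl sink can now consume `t`; `Tables.columntable` is one such sink.
ct = Tables.columntable(t)
```

With these definitions in place, a sink such as `DataFrame(t)` should accept the wrapper directly, which is the behaviour the description above relies on.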

@evetion evetion requested a review from alex-s-gardner August 19, 2023 14:36
@evetion
Owner Author

evetion commented Aug 19, 2023

I also refactored the canopy/ground switches for ICESat-2 and GEDI, by just calling the underlying methods multiple times (one time for ground, one time for canopy, if both are enabled). This significantly reduces boilerplate and possible bugs.

@evetion
Owner Author

evetion commented Aug 19, 2023

Failing tests are unrelated (one slow download, nightly fails on HDF5, for which an issue has been made).

@alex-s-gardner
Collaborator

Working my way through breaking changes. I had an internal function:

"""
    points_plus(granule::ICESat_Granule{}; extent::Extent = world)

Return the ICESat granule points *WITH* granule information for each track.
"""
function points_plus(
    granule::ICESat_Granule{};
    extent::Extent = world,
    )

    # Attach the granule itself as an extra field next to the point columns.
    ptsplus = merge(points(granule, bbox = extent), (; granule_info = granule))
    return ptsplus
end

Since points now returns a table, I need to append granule as a column. Looking through the Tables.jl documentation, it's not readily clear how to do this.

Do you know of an easy way to append a new column to a table?

Thanks.

@evetion
Owner Author

evetion commented Aug 21, 2023

I'll add a merge method to Table so this can keep working. Note that it's a SpaceLiDAR specific table, it's not a Table from Tables.jl, but it does implement the Tables.jl interface.

I'm not sure if your code ever worked for ICESat-2 and GEDI, as those returned a vector of NamedTuples (now they return a PartitionedTable).

Also, wouldn't it make more sense to store the granule info as metadata (depending on your output format)? I wrote points with the intent to have a table of equal-length columns and simple types. Storing a non-vector custom type goes against that, and can cause some headaches.

@alex-s-gardner
Collaborator

> I'm not sure if your code ever worked for ICESat-2 and GEDI, as those returned a vector of NamedTuples (now they return a PartitionedTable).

I also had a method for ICESat2 and GEDI:

function points_plus(
    granule::Union{GEDI_Granule{}, ICESat2_Granule{}};
    extent::Extent = world,
    )
    ptsplus = merge.(points(granule, bbox = extent), Ref((; granule_info = granule)))
    return ptsplus
end

@evetion
Owner Author

evetion commented Aug 21, 2023

Can you comment on how you use the granule info? For example, you could further explode it, or store it as metadata in arrow. We could make it default?

I now recover similar info from the filename (my files only have the extensions renamed), which also isn't ideal.

@alex-s-gardner
Collaborator

alex-s-gardner commented Aug 21, 2023

> Also, wouldn't it make more sense to store the granule info as metadata (depending on your output format)? I wrote points with the intent to have a table of equal-length columns and simple types. Storing a non-vector custom type goes against that, and can cause some headaches.

The way I've set up my pipeline is to segment the data by "geotiles", that is, X by X degree geographic extents that contain all mission data within the X degree bounding box. Data is extracted from the raw files using points and inserted as rows within a DataFrame. Each row represents a single dataset (or beam in our case); a single row has many observations and one "granule_info". DataFrames are then saved as Arrow files. This gives me an easy way to find what has already been downloaded and extracted and what still needs to be added by appending to a DataFrame, without needing to update an external file list that can become out of sync with the data files.

This approach is working well, but I will eventually overhaul the whole pipeline so that each row of a DataFrame contains a single point. When I do this I will make heavy use of FillArrays.

@alex-s-gardner
Collaborator

The biggest bottleneck for my global processing pipeline is file I/O. This is why I've moved to implementing geotiles that segment the data by location.

@evetion
Owner Author

evetion commented Aug 22, 2023

I've added merge methods. Your points_plus should work again, and there is no need to have two separate methods for ICESat and ICESat-2 anymore. merge(points(g::ICESat2_Granule), (; g)) should just work.

@evetion
Owner Author

evetion commented Aug 23, 2023

Did you find other instances where this PR broke your code? I will hold off on releasing a new version until I've got this compatible with EarthData and have some version of HDF5Tables in.

@alex-s-gardner
Collaborator

alex-s-gardner commented Aug 23, 2023

using SpaceLiDAR
using DataFrames
g = ICESat_Granule{:GLAH06}("GLAH06_634_1102_001_0073_1_01_0001.H5", "/Users/gardnera/data/icesat/GLAH06/034/raw/GLAH06_634_1102_001_0073_1_01_0001.H5", (type=:GLAH06, phase=1, rgt=0, instance=73, cycle=1, segment=1, version=1, revision=634), Vector{Vector{Vector{Float64}}}[])
tbl = merge(points(g), (; g))
DataFrame(tbl)

results in a MethodError: no method matching iterate(::ICESat_Granule{:GLAH06})

I think in this case we want the length of ICESat_Granule to == 1 so that the named tuple occupies its own table cell.

@evetion
Owner Author

evetion commented Aug 24, 2023

Yeah, that won't work. Defining the granule as something to iterate over (which also requires a getindex on it) will lead to ERROR: DimensionMismatch: column :longitude has length 729117 and column :g has length 1.

But the error occurs because DataFrame wants matching-length columns; your points_plus function never did that, right? So DataFrame(points_plus(g)) has never worked before? (I went a few commits back to test it.)

@alex-s-gardner
Collaborator

You're correct... apologies... the exact code that I'm trying to get working again is:

if mission == :ICESat
    df = DataFrame(points_plus.(row.granules, extent = row.extent))
elseif mission == :ICESat2 || mission == :GEDI
    df = reduce(vcat, DataFrame.(points_plus.(row.granules, extent = row.extent)))
end

With this PR I can create the tables:

using SpaceLiDAR
using DataFrames
g = [ICESat_Granule{:GLAH06}("GLAH06_634_1102_001_0073_1_01_0001.H5", "/Users/gardnera/data/icesat/GLAH06/034/raw/GLAH06_634_1102_001_0073_1_01_0001.H5", (type = :GLAH06, phase = 1, rgt = 0, instance = 73, cycle = 1, segment = 1, version = 1, revision = 634), Vector{Vector{Vector{Float64}}}[]),
 ICESat_Granule{:GLAH06}("GLAH06_634_1102_001_0073_2_01_0001.H5", "/Users/gardnera/data/icesat/GLAH06/034/raw/GLAH06_634_1102_001_0073_2_01_0001.H5", (type = :GLAH06, phase = 1, rgt = 0, instance = 73, cycle = 1, segment = 2, version = 1, revision = 634), Vector{Vector{Vector{Float64}}}[])]

tbl = merge.(points.(g), [(; granule_info = f) for f in g])

but I'm unable to make a DataFrame where each row makes up a single granule (ICESat) or beam (GEDI & ICESat2).

This makes sense, as points now returns a table where each observation occupies a single row. Your new approach is absolutely the way to go, but I need some way to provide backwards compatibility. To make my code work I need a way to reverse the Fill arrays and table properties so that I can treat a SpaceLiDAR Table as the original NamedTuple.

@alex-s-gardner
Collaborator

As I mentioned before, I have been meaning to refactor my code to move away from storing each granule as its own row. Maybe this PR will force me to finally make the change. It's a fairly major change on my end as everything is built around rows = granules.

The one thing that is really lacking when moving from rows = granules to rows = single observation is the ability to trace the observation back to its source file. We can save this in the metadata, but as soon as we start concatenating tables it gets hard to maintain traceability.

What do you think is the best path forward? As of now merge will not work for this as it needs to know the number of rows returned from points to create the FillArray. Maybe merge could be modified to create a FillArray if the input is non-iterable.
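
The suggestion of repeating a non-iterable value to the table length can be sketched in plain Julia. This is only an illustration: `add_constant` and `Granule` are hypothetical names, the columns are assumed to be a NamedTuple of equal-length vectors, and Base `fill` is used where a lazy `FillArrays.Fill(v, n)` could be swapped in to avoid materialising the repeats.

```julia
# Hypothetical stand-in for a granule type; not SpaceLiDAR's definition.
struct Granule
    id::String
end

# Repeat every value in `extra` to the number of rows in `cols`, then merge.
# Assumes `cols` is a non-empty NamedTuple of equal-length vectors.
function add_constant(cols::NamedTuple, extra::NamedTuple)
    n = length(first(cols))                   # row count taken from the first column
    repeated = map(v -> fill(v, n), extra)    # NamedTuple with the same keys, vector values
    return merge(cols, repeated)
end

cols = (longitude = [1.0, 2.0, 3.0], latitude = [4.0, 5.0, 6.0])
g = Granule("example.h5")
out = add_constant(cols, (; granule_info = g))
```

Every column in `out` now has the same length, so a sink that requires matching-length columns no longer sees a length-1 entry.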

@evetion
Owner Author

evetion commented Aug 25, 2023

> As I mentioned before, I have been meaning to refactor my code to move away from storing each granule as its own row. Maybe this PR will force me to finally make the change. It's a fairly major change on my end as everything is built around rows = granules.

Maybe something changed in DataFrames over time in terms of automatic repeating non-iterable column values? I've added Base.parent to the Tables in SpaceLiDAR, so you get the original (vector of) NamedTuples back (the Tables are a simple wrapper around them, so we can dispatch on a type we own):

julia> SL.points(g)
SpaceLiDAR Table with 6 partitions

julia> parent(SL.points(g))
6-element Vector{NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :phr, :sensitivity, :scattered, :saturated, :clouds, :track, :strong_beam, :classification, :height_reference, :detector_id, :reflectance, :nphotons), Tuple{Vector{Float32}, Vector{Float32}, Vector{Float32}, Vector{Float32}, ...

If that doesn't work, code along the following lines would do the trick:

reduce(vcat, DataFrame.(Pair.(:track, SL.points(g)), :granule=>g))
  0.019452 seconds (2.51 k allocations: 3.784 MiB)
6×2 DataFrame
 Row │ track                              granule
     │ NamedTup…                          ICESat2_…
─────┼──────────────────────────────────────────────────────────────────────
   1 │ (longitude = Float32[117.077, 11…  ICESat2_Granule{:ATL08}("ATL08_2…
   2 │ (longitude = Float32[117.078, 11…  ICESat2_Granule{:ATL08}("ATL08_2…
   3 │ (longitude = Float32[117.112, 11…  ICESat2_Granule{:ATL08}("ATL08_2…
   4 │ (longitude = Float32[117.096, 11…  ICESat2_Granule{:ATL08}("ATL08_2…
   5 │ (longitude = Float32[117.137, 11…  ICESat2_Granule{:ATL08}("ATL08_2…
   6 │ (longitude = Float32[117.134, 11…  ICESat2_Granule{:ATL08}("ATL08_2…

> The one thing that is really lacking when moving from rows = granules to rows = single observation is the ability to trace the observation back to its source file. We can save this in the metadata, but as soon as we start concatenating tables it gets hard to maintain traceability.
>
> What do you think is the best path forward? As of now merge will not work for this as it needs to know the number of rows returned from points to create the FillArray. Maybe merge could be modified to create a FillArray if the input is non-iterable.

I think metadata is something we should support, as it could be passed if you save granules individually, and it will work with most IO (Arrow, GeoDataFrames), so you would get it back after you open the file again.

But indeed, if you concatenate Tables further, the metadata will be lost. I save each granule with the same filename as the HDF5 (just with a different extension), so the unique id of the granule is preserved. From the filename, all granule information can be restored (except for the polygon, which we get from search). But logic in filenames is a bit frowned upon, so another solution is to make a String column granule=Fill(filename, length)? Yes, it's only a string, but you can get the granule back with SL.granule_from_file(filename) (note I will probably rename it to just SL.granule(filename)). If you don't want a granule, you can also use icesat2_info:

SL.icesat2_info(fn)
(type = :ATL08, date = Dates.DateTime("2020-08-12T23:54:29"), rgt = 742, cycle = 8, segment = 14, version = 6, revision = 1, ascending = true, descending = false)

If you want, you could even store these attributes as their own (Fill) columns. We just need a combine(type, date, rgt, cycle, segment, version, revision) function that could give you back the id (ATL08_20200812235429_07420814_006_01.h5).
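
A function along those lines could look like the sketch below. The name `granule_id`, its keyword interface, and the field widths (4-digit rgt, 2-digit cycle and segment, 3-digit version, 2-digit revision) are assumptions inferred from the single example id above; this is not an existing SpaceLiDAR function.

```julia
using Dates

# Hypothetical helper: rebuild an ICESat-2 style id from its info fields.
# Field widths are inferred from the example id and are an assumption.
function granule_id(; type, date, rgt, cycle, segment, version, revision, ext = "h5")
    stamp = Dates.format(date, dateformat"yyyymmddHHMMSS")
    track = string(lpad(rgt, 4, '0'), lpad(cycle, 2, '0'), lpad(segment, 2, '0'))
    return string(type, "_", stamp, "_", track, "_",
                  lpad(version, 3, '0'), "_", lpad(revision, 2, '0'), ".", ext)
end

id = granule_id(type = :ATL08, date = DateTime("2020-08-12T23:54:29"),
                rgt = 742, cycle = 8, segment = 14, version = 6, revision = 1)
# id == "ATL08_20200812235429_07420814_006_01.h5"
```

The `ascending` and `descending` fields are left out because they do not appear in the id itself.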

@alex-s-gardner
Collaborator

> If you want, you could even store these attributes as their own (Fill) columns.

If we went down the path of storing type, date, rgt, cycle, segment, version, revision as Fill arrays, it might take up less memory when converted from a Fill to a vector, as would be needed when saving to any format other than JLD (file names take up considerable space). Should we implement this as the default or is that breaking?

@alex-s-gardner
Collaborator

Empty granules return a NamedTuple when they should return a SpaceLiDAR Table

@alex-s-gardner
Collaborator

@evetion once the empty granules issue is fixed I should be able to use parent to make my code backwards compatible

@evetion
Owner Author

evetion commented Sep 3, 2023

> If we went down the path of storing type, date, rgt, cycle, segment, version, revision as Fill arrays, it might take up less memory when converted from a Fill to a vector, as would be needed when saving to any format other than JLD (file names take up considerable space). Should we implement this as the default or is that breaking?

Adding the metadata to the table is not breaking on top of this. Adding extra columns might be a grey area, but for performance I would rather not have extra columns that I don't need (yet), and we can make it easy to add them?

Besides, I think doing a Fill(basename(granule.id)) probably requires less column overhead/memory than all the exploded attributes? If you also need it quick as a vector, you could use InlineStrings?

@evetion
Owner Author

evetion commented Sep 3, 2023

So purely the id of the granule takes less space than the exploded info NamedTuple. The InlineString takes just as much space as the NamedTuple.

julia> info(g)
(type = :ATL08, date = Dates.DateTime("2020-08-12T23:54:29"), rgt = 742, cycle = 8, segment = 14, version = 5, revision = 1, ascending = true, descending = false)

julia> sizeof(info(g))
64

julia> sizeof(g.id)
39

julia> typeof(InlineString(g.id))
String63

@evetion
Owner Author

evetion commented Sep 3, 2023

> @evetion once the empty granules issue is fixed I should be able to use parent to make my code backwards compatible

You didn't specify which product(s) has this problem, but I think I fixed this for GLAH06. Let me know if I missed one.

@evetion
Owner Author

evetion commented Sep 3, 2023

Ok, last big change. I've added support for metadata, and included functions to add either id or the granule info to the tables.

DataAPI metadata support. Arrow.jl will get support for it soon: apache/arrow-julia#481, so Tables/DataFrames saved with Arrow will retain their metadata.

julia> g = SL.granule(fn)
ICESat2_Granule{:ATL08}("ATL08_20200812235429_07420814_005_01.h5", "/Users/evetion/Downloads/ATL08_20200812235429_07420814_005_01.h5", (type = :ATL08, date = Dates.DateTime("2020-08-12T23:54:29"), rgt = 742, cycle = 8, segment = 14, version = 5, revision = 1, ascending = true, descending = false), Vector{Vector{Vector{Float64}}}[])

julia> t = points(g)
SpaceLiDAR Table with 6 partitions
julia> DataAPI.metadata(t)
Dict{String, Any} with 10 entries:
  "cycle"      => 8
  "descending" => false
  "revision"   => 1
  "segment"    => 14
  "id"         => "ATL08_20200812235429_07420814_005_01.h5"
  "rgt"        => 742
  "date"       => DateTime("2020-08-12T23:54:29")
  "ascending"  => true
  "type"       => :ATL08
  "version"    => 5
julia> df = DataFrame(t)
julia> DataAPI.metadata(df) == DataAPI.metadata(t)  # metadata is propagated.

Furthermore, I've included the following functions, which should help your workflow.

julia> t = SL.add_info(t)  # adds multiple columns from `info(g)`
julia> t = SL.add_id(t)  # adds :id column
julia> t[1].id
10509-element Fill{String}, with entries equal to ATL08_20200812235429_07420814_005_01.h5
julia> t[end].revision  # info fields are added to all tracks
5180-element Fill{Int64}, with entries equal to 1

@alex-s-gardner
Collaborator

Looks like there is an issue with subsetting:

using Extents, DataFrames, SpaceLiDAR, Dates
g = ICESat2_Granule{:ATL06}("ATL06_20181201095523_09740106_005_01.h5", "/Users/gardnera/data/icesat2/ATL06/005/raw/ATL06_20181201095523_09740106_005_01.h5", (type=:ATL06, date=Dates.DateTime("2018-12-01T09:55:23"), rgt=974, cycle=1, segment=6, version=5, revision=1, ascending=false, descending=true), Vector{Vector{Vector{Float64}}}[]);
points(g)

works great but

ext = Extent(X = (-128.0, -126.0), Y = (50.0, 52.0))
points(g, bbox=ext)

returns this error:

ERROR: MethodError: no method matching SpaceLiDAR.PartitionedTable(::Tuple{NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{Dates.DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{Dates.DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{Dates.DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{Dates.DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float64}, Vector{Dates.DateTime}, 
Vector{Bool}, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float64}, Vector{Dates.DateTime}, Vector{Bool}, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}})

Closest candidates are:
  SpaceLiDAR.PartitionedTable(::NamedTuple)
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/ra53x/src/granule.jl:169
  SpaceLiDAR.PartitionedTable(::Tuple{Vararg{NamedTuple{K, V}, N}}) where {N, K, V}
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/ra53x/src/granule.jl:167

Stacktrace:
  [1] points(granule::ICESat2_Granule{:ATL06}; tracks::NTuple{6, String}, step::Int64, bbox::Extent{(:X, :Y), Tuple{Tuple{Float64, Float64}, Tuple{Float64, Float64}}})
    @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/ra53x/src/ICESat-2/ATL06.jl:49
  [2] points
    @ ~/.julia/packages/SpaceLiDAR/ra53x/src/ICESat-2/ATL06.jl:24 [inlined]
  [3] points_plus(granule::ICESat2_Granule{:ATL06}; extent::Extent{(:X, :Y), Tuple{Tuple{Float64, Float64}, Tuple{Float64, Float64}}})
    @ Altim ~/Documents/GitHub/Altim.jl/src/utilities.jl:103
  [4] points_plus
    @ ~/Documents/GitHub/Altim.jl/src/utilities.jl:97 [inlined]
  [5] #43
    @ ./broadcast.jl:1297 [inlined]
  [6] _broadcast_getindex_evalf
    @ ./broadcast.jl:683 [inlined]
  [7] _broadcast_getindex
    @ ./broadcast.jl:656 [inlined]
  [8] _getindex
    @ ./broadcast.jl:680 [inlined]
  [9] _broadcast_getindex
    @ ./broadcast.jl:655 [inlined]
 [10] getindex
    @ ./broadcast.jl:610 [inlined]
 [11] copyto_nonleaf!(dest::Vector{DataFrame}, bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Tuple{Base.OneTo{Int64}}, Type{DataFrame}, Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, Base.Broadcast.var"#43#44"{Base.Pairs{Symbol, Extent{(:X, :Y), Tuple{Tuple{Float64, Float64}, Tuple{Float64, Float64}}}, Tuple{Symbol}, NamedTuple{(:extent,), Tuple{Extent{(:X, :Y), Tuple{Tuple{Float64, Float64}, Tuple{Float64, Float64}}}}}}, typeof(points_plus)}, Tuple{Base.Broadcast.Extruded{Vector{ICESat2_Granule{:ATL06}}, Tuple{Bool}, Tuple{Int64}}}}}}, iter::Base.OneTo{Int64}, state::Int64, count::Int64)
    @ Base.Broadcast ./broadcast.jl:1068
 [12] copy
    @ ./broadcast.jl:920 [inlined]
 [13] materialize(bc::Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, Type{DataFrame}, Tuple{Base.Broadcast.Broadcasted{Base.Broadcast.DefaultArrayStyle{1}, Nothing, Base.Broadcast.var"#43#44"{Base.Pairs{Symbol, Extent{(:X, :Y), Tuple{Tuple{Float64, Float64}, Tuple{Float64, Float64}}}, Tuple{Symbol}, NamedTuple{(:extent,), Tuple{Extent{(:X, :Y), Tuple{Tuple{Float64, Float64}, Tuple{Float64, Float64}}}}}}, typeof(points_plus)}, Tuple{Vector{ICESat2_Granule{:ATL06}}}}}})
    @ Base.Broadcast ./broadcast.jl:873
 [14] geotile_build(geotile_granules::DataFrame, geotile_dir::String, mission::Symbol; warnings::Bool)
    @ Altim ~/Documents/GitHub/Altim.jl/src/utilities.jl:531
 [15] top-level scope
    @ ~/Documents/GitHub/Altim.jl/src/geotile_build_archive.jl:71

ERROR: MethodError: no method matching points(::String)

Closest candidates are:
  points(::ICESat2_Granule{:ATL03}; tracks, step, bbox)
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/ra53x/src/ICESat-2/ATL03.jl:26
  points(::ICESat2_Granule{:ATL03}, ::HDF5.H5DataStore, ::AbstractString, ::Float64)
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/ra53x/src/ICESat-2/ATL03.jl:89
  points(::ICESat2_Granule{:ATL03}, ::HDF5.H5DataStore, ::AbstractString, ::Float64, ::Any)
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/ra53x/src/ICESat-2/ATL03.jl:89
  ...

Stacktrace:
 [1] top-level scope
   @ ~/Documents/GitHub/Altim.jl/src/geotile_build_archive.jl:87

ERROR: UndefVarError: `Dates` not defined
Stacktrace:
 [1] top-level scope
   @ ~/Documents/GitHub/Altim.jl/src/geotile_build_archive.jl:86



SpaceLiDAR Table with 6 partitions

ERROR: MethodError: no method matching points(::ICESat2_Granule{:ATL06}; extent::Extent{(:X, :Y), Tuple{Tuple{Float64, Float64}, Tuple{Float64, Float64}}})

Closest candidates are:
  points(::ICESat2_Granule{:ATL06}; tracks, step, bbox) got unsupported keyword argument "extent"
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/ra53x/src/ICESat-2/ATL06.jl:24
  points(::ICESat2_Granule{:ATL06}, ::HDF5.H5DataStore, ::AbstractString, ::Float64) got unsupported keyword argument "extent"
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/ra53x/src/ICESat-2/ATL06.jl:52
  points(::ICESat2_Granule{:ATL06}, ::HDF5.H5DataStore, ::AbstractString, ::Float64, ::Any) got unsupported keyword argument "extent"
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/ra53x/src/ICESat-2/ATL06.jl:52
  ...

Stacktrace:
 [1] kwerr(::NamedTuple{(:extent,), Tuple{Extent{(:X, :Y), Tuple{Tuple{Float64, Float64}, Tuple{Float64, Float64}}}}}, ::Function, ::ICESat2_Granule{:ATL06})
   @ Base ./error.jl:165
 [2] top-level scope
   @ ~/Documents/GitHub/Altim.jl/src/geotile_build_archive.jl:87

ERROR: MethodError: no method matching SpaceLiDAR.PartitionedTable(::Tuple{NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float64}, Vector{DateTime}, Vector{Bool}, 
FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float64}, Vector{DateTime}, Vector{Bool}, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}})

Closest candidates are:
  SpaceLiDAR.PartitionedTable(::NamedTuple)
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/ra53x/src/granule.jl:169
  SpaceLiDAR.PartitionedTable(::Tuple{Vararg{NamedTuple{K, V}, N}}) where {N, K, V}
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/ra53x/src/granule.jl:167

Stacktrace:
 [1] points(granule::ICESat2_Granule{:ATL06}; tracks::NTuple{6, String}, step::Int64, bbox::Extent{(:X, :Y), Tuple{Tuple{Float64, Float64}, Tuple{Float64, Float64}}})
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/ra53x/src/ICESat-2/ATL06.jl:49
 [2] top-level scope
   @ ~/Documents/GitHub/Altim.jl/src/geotile_build_archive.jl:87

@evetion
Owner Author

evetion commented Sep 4, 2023

Are you sure you checked out this branch, including the latest commits? The type signature of your methods is old and I can't replicate over here.

@alex-s-gardner
Collaborator

alex-s-gardner commented Sep 4, 2023

I rm SpaceLiDAR, then add https://github.com/evetion/SpaceLiDAR.jl/tree/feat/tables, then restart Julia to ensure I have the latest version.

ext = Extent{(:X, :Y),Tuple{Tuple{Float64,Float64},Tuple{Float64,Float64}}}((X=(-128.0, -126.0), Y=(50.0, 52.0)))
g = ICESat2_Granule{:ATL06}("ATL06_20181201095523_09740106_005_01.h5", "/Users/gardnera/data/icesat2/ATL06/005/raw/ATL06_20181201095523_09740106_005_01.h5", (type=:ATL06, date=DateTime("2018-12-01T09:55:23"), rgt=974, cycle=1, segment=6, version=5, revision=1, ascending=false, descending=true), Vector{Vector{Vector{Float64}}}[])

points(g, bbox=ext)

results in:

ERROR: MethodError: no method matching SpaceLiDAR.PartitionedTable(::Tuple{NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float64}, Vector{DateTime}, Vector{Bool}, 
FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float64}, Vector{DateTime}, Vector{Bool}, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}}, ::ICESat2_Granule{:ATL06})

Closest candidates are:
  SpaceLiDAR.PartitionedTable(::Tuple{Vararg{NamedTuple{K, V}, N}}, ::G) where {N, K, V, G}
   @ SpaceLiDAR ~/.julia/packages/SpaceLiDAR/tJtlT/src/granule.jl:170

@evetion
Owner Author

evetion commented Sep 4, 2023

Thanks, that error message does correspond with the latest changes. I think I fixed it; the problem was in the empty defaults, where we had a Float64[] instead of a Float32[] matching the non-empty data.

ERROR: MethodError: no method matching  # --> scroll to the right a bit
SpaceLiDAR.PartitionedTable(::Tuple{
NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, 
NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, 
NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, 
NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float32}, Vector{DateTime}, BitVector, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, 
NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float64}, Vector{DateTime}, Vector{Bool}, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}, 
NamedTuple{(:longitude, :latitude, :height, :height_error, :datetime, :quality, :track, :strong_beam, :detector_id, :height_reference), Tuple{Vector{Float64}, Vector{Float64}, Vector{Float32}, Vector{Float64}, Vector{DateTime}, Vector{Bool}, FillArrays.Fill{String, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Bool, 1, Tuple{Base.OneTo{Int64}}}, FillArrays.Fill{Int8, 1, Tuple{Base.OneTo{Int64}}}, Vector{Float32}}}}, ::ICESat2_Granule{:ATL06})
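
The mismatch described above can be reproduced in isolation. This is a minimal sketch, not SpaceLiDAR code: `same_schema` is a hypothetical helper whose first method copies the tuple constraint from the "Closest candidates" signature, and the two-column tables are illustrative stand-ins for the real partitions.

```julia
# Hypothetical helper mirroring the constraint in the candidate signature
# `PartitionedTable(::Tuple{Vararg{NamedTuple{K, V}, N}}, ::G)`:
# every NamedTuple in the tuple must share the same names K and column types V.
same_schema(::Tuple{Vararg{NamedTuple{K,V},N}}) where {N,K,V} = true
same_schema(::Tuple) = false  # fallback: at least one partition differs

# Illustrative columns only, not the full set returned by `points`.
filled    = (height = Float32[1.0, 2.0], height_error = Float32[0.1, 0.2])
empty_bad = (height = Float32[], height_error = Float64[])  # eltype differs from `filled`
empty_ok  = (height = Float32[], height_error = Float32[])  # eltype matches `filled`

println(same_schema((filled, empty_bad)))  # mixed column types
println(same_schema((filled, empty_ok)))   # identical column types
```

With mixed column types the call only matches the fallback method, which is the situation where the real constructor, having no fallback, raises the `MethodError`.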

@alex-s-gardner
Collaborator

> I think I fixed it, the problem was in the empty defaults, where we had a Float64[] instead of a Float32[] of the non-empty data.

That seems to have fixed it.

@alex-s-gardner
Collaborator

alex-s-gardner commented Sep 5, 2023

> I've added support for metadata, and included functions to add either id or the granule info to the tables.

I'll test this today

@alex-s-gardner
Collaborator

alex-s-gardner commented Sep 5, 2023

julia> t = SL.add_info(t)  # adds multiple columns from `info(g)`
julia> t = SL.add_id(t)  # adds :id column

These are fantastic. One issue is that if t is an empty table, then no file id or info is added. My original points_plus returns a row full of empties with info, e.g.:

julia> DataFrame(t)
6×11 DataFrame
 Row │ longitude  latitude   height     height_error  datetime    quality    track            strong_beam     detector_id  height_reference  granule_info
     │ Array…     Array…     Array…     Array…        Array…      BitVector  Fill…            Fill…           Fill…        Array…            ICESat2_Gra…
─────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
   1 │ Float64[]  Float64[]  Float32[]  Float32[]     DateTime[]  Bool[]     Fill("gt1l", 0)  Fill(false, 0)  Fill(6, 0)   Float32[]         ICESat2_Granule{:ATL06}("ATL06_2…
   2 │ Float64[]  Float64[]  Float32[]  Float32[]     DateTime[]  Bool[]     Fill("gt1r", 0)  Fill(true, 0)   Fill(5, 0)   Float32[]         ICESat2_Granule{:ATL06}("ATL06_2…
   3 │ Float64[]  Float64[]  Float32[]  Float32[]     DateTime[]  Bool[]     Fill("gt2l", 0)  Fill(false, 0)  Fill(4, 0)   Float32[]         ICESat2_Granule{:ATL06}("ATL06_2…
   4 │ Float64[]  Float64[]  Float32[]  Float32[]     DateTime[]  Bool[]     Fill("gt2r", 0)  Fill(true, 0)   Fill(3, 0)   Float32[]         ICESat2_Granule{:ATL06}("ATL06_2…
   5 │ Float64[]  Float64[]  Float32[]  Float32[]     DateTime[]  Bool[]     Fill("gt3l", 0)  Fill(false, 0)  Fill(2, 0)   Float32[]         ICESat2_Granule{:ATL06}("ATL06_2…
   6 │ Float64[]  Float64[]  Float32[]  Float32[]     DateTime[]  Bool[]     Fill("gt3r", 0)  Fill(true, 0)   Fill(1, 0)   Float32[]         ICESat2_Granule{:ATL06}("ATL06_2…

This behavior made it easy to keep track of which granules had been searched. Not sure if you have any clever ideas on how to handle this. I suspect that with the updated implementation (a single observation per row) this becomes difficult.

@evetion
Owner Author

evetion commented Sep 5, 2023

Cool, in that case I will merge this!

Regarding per-row saving: I save to a single file per granule, so I can just check the granule.id from the filename (or, with this PR, open the file and read it from the metadata). I can't comment much more without knowing the rest of the data workflow (and my own is a bit ugly). Let's split this into a separate issue.
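
As a rough illustration of the one-file-per-granule idea, here is a minimal sketch using only Julia's standard file functions. The `.arrow` extension, the directory layout, and the helper names `output_path` and `is_processed` are all assumptions made for the example, not part of SpaceLiDAR.

```julia
# Hypothetical layout: one output file per granule, named "<granule id>.arrow".
output_path(dir, id) = joinpath(dir, string(id, ".arrow"))

# A granule counts as searched once its file exists, even if the table was empty.
is_processed(dir, id) = isfile(output_path(dir, id))

dir = mktempdir()
touch(output_path(dir, "granule_A"))  # pretend this granule was already saved

ids = ["granule_A", "granule_B"]      # placeholder ids for illustration
todo = filter(id -> !is_processed(dir, id), ids)
println(todo)
```

Because an empty result still produces a file, the set of searched granules can be recovered from the directory listing alone.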

Note that I will probably not release this as a version immediately (unless you want me to), as I would like to refactor the search (to use EarthData.jl) and have a first version of HDF5Tables.jl in here.

@alex-s-gardner
Collaborator

> will merge this

Great work! This is shaping up nicely

@evetion
Owner Author

evetion commented Sep 5, 2023

See #61 for the workflow discussion.

@evetion evetion merged commit 730f655 into master Sep 6, 2023
@evetion evetion deleted the feat/tables branch September 6, 2023 05:57
Development

Successfully merging this pull request may close these issues.

Can't reduce dataframes from `points(canopy=true) due to extra parameters
2 participants