(0.5.0) Metadata for JRA55 #286

Open
wants to merge 93 commits into
base: main
from

Conversation

simone-silvestri
Collaborator

@simone-silvestri simone-silvestri commented Dec 5, 2024

This PR is an initial proposal to generalize ECCOMetadata to Metadata and rework the JRA55 module to use Metadata. This way, we can support different JRA55 versions (repeat year and multiple years) and define a download_dataset function that downloads the dataset independently of using JRA55, as we can already do for ECCO.

This PR also removes the ability to generate a JRA55FieldTimeSeries directly interpolated on the ocean grid, since we need to interpolate anyway when we compute fluxes.

closes #182

@glwagner
Member

glwagner commented Dec 5, 2024

I think we should get #251 and #284 merged first, do you want to help with those? Otherwise we will have conflicts.

@simone-silvestri
Collaborator Author

sounds good

@navidcy
Collaborator

navidcy commented Mar 9, 2025

Gotcha! I tried converting and then thought that changing the default was the way to go. I'm actually a bit confused about when to change the default. But I'll drop it from the example and infer the conversion from the exchanger grid!

@navidcy
Collaborator

navidcy commented Mar 9, 2025

Seems like the JRA55NetCDFBackend has no effect? Look below. Whether I call with JRA55NetCDFBackend(2) or JRA55NetCDFBackend(248) I get an atmosphere with 2920 time slices (that is a full year).

julia> simulation_days = 31
31

julia> snapshots_per_day = 8 # corresponding to JRA55's 3-hour frequency
8

julia> time_indices_in_memory = simulation_days * snapshots_per_day
248

julia> atmosphere = JRA55PrescribedAtmosphere(longitude = λ★,
                                              latitude = φ★,
                                              backend = JRA55NetCDFBackend(time_indices_in_memory))

2×2×1×2920 PrescribedAtmosphere{Float32} on LatitudeLongitudeGrid:
├── times: 2920-element StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}
├── surface_layer_height: 10.0
└── boundary_layer_height: 600.0

julia> atmosphere = JRA55PrescribedAtmosphere(longitude = λ★,
                                              latitude = φ★,
                                              backend = JRA55NetCDFBackend(2))
2×2×1×2920 PrescribedAtmosphere{Float32} on LatitudeLongitudeGrid:
├── times: 2920-element StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}
├── surface_layer_height: 10.0
└── boundary_layer_height: 600.0

This comes from

The native_times() method replaced jra55_times(), but native_times() ignores the backend, so it always returns the times for the whole JRA dataset (in the case above, a whole year). @simone-silvestri is there something missing or is this intentional?

@simone-silvestri
Collaborator Author

> Seems like the JRA55NetCDFBackend has no effect? Look below. Whether I call with JRA55NetCDFBackend(2) or JRA55NetCDFBackend(248) I get an atmosphere with 2920 time slices (that is a full year).

right, the backend has no effect on the data in the timeseries, but it indicates how this is organized in memory.

In main, a way to limit the time series is to pass time_indices as positional arguments like so

time_indices = 1:10
JRA55PrescribedAtmosphere(time_indices; kw...)

for 10 elements, while in this PR we switch to dates as a keyword argument, to anticipate adding different JRA55 metadata versions that have different dates:

start_date = DateTime(1990, 1, 1)
end_date  = DateTime(1990, 2, 1)
dates = range(start_date, end_date, step = Hour(3))
JRA55PrescribedAtmosphere(; dates, kw...)
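
As a quick sanity check on the dates-based selection, here is a minimal sketch that uses only the Dates standard library (no ClimaOcean calls) to count how many 3-hourly snapshots such a range contains. Note that the range includes end_date itself, so it holds one more snapshot than 31 days × 8 per day.

```julia
using Dates

start_date = DateTime(1990, 1, 1)
end_date   = DateTime(1990, 2, 1)

# A range of DateTime; both endpoints are included because
# the 3-hour step divides the span exactly
dates = range(start_date, end_date, step = Hour(3))

# 31 days × 8 snapshots per day, plus the final endpoint
n_snapshots = length(dates)
println(n_snapshots)
```

If the endpoint should be excluded, one option is to end the range at end_date - Hour(3).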

The source of truth for the dates associated with a particular dataset version can be retrieved with

all_dates(version, name)

for example:

julia> all_dates(JRA55RepeatYear(), :temperature)
Dates.DateTime("1990-01-01T00:00:00"):Dates.Hour(3):Dates.DateTime("1990-12-31T21:00:00")

julia> all_dates(JRA55RepeatYear(), :river_freshwater_flux)
Dates.DateTime("1990-01-01T00:00:00"):Dates.Day(1):Dates.DateTime("1990-12-31T00:00:00")

julia> all_dates(JRA55MultipleYears(), :pressure)
Dates.DateTime("1958-01-01T00:00:00"):Dates.Hour(3):Dates.DateTime("2021-01-01T00:00:00")
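
The 2920 time slices reported earlier in this thread can be recovered from the repeat-year range alone. The sketch below copies the ranges from the output above and only uses the Dates standard library; it does not call all_dates.

```julia
using Dates

# Ranges copied from the all_dates(JRA55RepeatYear(), ...) output above
three_hourly_dates = DateTime(1990, 1, 1):Hour(3):DateTime(1990, 12, 31, 21)
daily_dates        = DateTime(1990, 1, 1):Day(1):DateTime(1990, 12, 31)

# 1990 is not a leap year, so there are 365 days × 8 snapshots per day
n_three_hourly = length(three_hourly_dates)
n_daily        = length(daily_dates)
println((n_three_hourly, n_daily))
```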

This applies also to ECCO datasets:

julia> all_dates(ECCO4Monthly(), :salinity)
Dates.DateTime("1992-01-01T00:00:00"):Dates.Month(1):Dates.DateTime("2023-12-01T00:00:00")

that are also constructed with a dates keyword argument.

julia> ECCOFieldTimeSeries(:temperature; dates = DateTime(1993, 1, 1):Month(1):DateTime(1993, 2, 1))
[ Info: Note: ECCO temperature data is in /Users/simonesilvestri/.julia/scratchspaces/0376089a-ecfe-4b0e-a64f-9c555d74d754/ECCO.
[ Info: Note: ECCO temperature data is in /Users/simonesilvestri/.julia/scratchspaces/0376089a-ecfe-4b0e-a64f-9c555d74d754/ECCO.
[ Info: Note: ECCO temperature data is in /Users/simonesilvestri/.julia/scratchspaces/0376089a-ecfe-4b0e-a64f-9c555d74d754/ECCO.
[ Info: Note: ECCO temperature data is in /Users/simonesilvestri/.julia/scratchspaces/0376089a-ecfe-4b0e-a64f-9c555d74d754/ECCO.
[ Info: Note: ECCO temperature data is in /Users/simonesilvestri/.julia/scratchspaces/0376089a-ecfe-4b0e-a64f-9c555d74d754/ECCO.
[ Info: Note: ECCO temperature data is in /Users/simonesilvestri/.julia/scratchspaces/0376089a-ecfe-4b0e-a64f-9c555d74d754/ECCO.
720×360×50×2 FieldTimeSeries{ClimaOcean.DataWrangling.ECCO.ECCONetCDFBackend} located at (Center, Center, Center) on Oceananigans.Architectures.CPU
├── grid: 720×360×50 LatitudeLongitudeGrid{Float32, Oceananigans.Grids.Periodic, Oceananigans.Grids.Bounded, Oceananigans.Grids.Bounded} on Oceananigans.Architectures.CPU with 7×7×3 halo and with precomputed metrics
├── indices: (:, :, :)
├── time_indexing: Cyclical(period=5.3568e6)
├── backend: ECCONetCDFBackend(1, 2)
└── data: 734×374×56×2 OffsetArray(::Array{Float32, 4}, -6:727, -6:367, -2:53, 1:2) with eltype Float32 with indices -6:727×-6:367×-2:53×1:2
    └── max=31.2508, min=-1.98588, mean=3.33469

Before merging I can write more details in the description of the PR, and I am open to changes if suggested.

@navidcy
Collaborator

navidcy commented Mar 9, 2025

Gotcha.

So at the papa example, I did this:

t_days = atmosphere.times[1:length(ua)] / days

What would be a better way to do it in

atmosphere = JRA55PrescribedAtmosphere(longitude = λ★,

?

@simone-silvestri
Collaborator Author

simone-silvestri commented Mar 9, 2025

> Gotcha.
>
> So at the papa example, I did this:
>
> t_days = atmosphere.times[1:length(ua)] / days
>
> What would be a better way to do it in
>
> atmosphere = JRA55PrescribedAtmosphere(longitude = λ★,
>
> ?

Ah, I didn't realize this was looping over the whole timeseries. I guess we could do something like

start_date = DateTime(1990, 1, 1)
end_date  = DateTime(1990, 1, 31)
dates = range(start_date, end_date, step = Hour(3)) # 3 hours is the frequency of JRA55 data
atmosphere = JRA55PrescribedAtmosphere(longitude = λ★,
                                    latitude = φ★,
                                    dates = dates) 

another option would be

version = JRA55RepeatYear()
native_dates = all_dates(version)
simulation_days = 31
snapshots_per_day = Hour(24) ÷ step(native_dates) # integer division: 8 for JRA55's 3-hour frequency
time_indices = 1 : simulation_days * snapshots_per_day
dates = native_dates[time_indices]
atmosphere = JRA55PrescribedAtmosphere(longitude = λ★,
                                       latitude = φ★,
                                       version = version,
                                       dates = dates) 

or maybe this is a bit better

version = JRA55RepeatYear()
native_dates = all_dates(version)
end_date_index = findfirst(x -> x == DateTime(1990, 1, 31), native_dates) # We end after 31 days
atmosphere = JRA55PrescribedAtmosphere(longitude = λ★,
                                       latitude = φ★,
                                       version = version,
                                       dates = native_dates[1:end_date_index])

I am also open to changing the name of the function all_dates to something maybe more intuitive?

@simone-silvestri
Collaborator Author

simone-silvestri commented Mar 9, 2025

We could also think about extending the interface to pass start_date, end_date, and frequency (which should be a multiple of the native frequency)
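
To make the idea concrete, here is a minimal sketch of what such a selection could look like. The function select_dates is hypothetical (it is not part of ClimaOcean), and the native range is copied from the repeat-year output earlier in the thread. It assumes fixed-length periods such as Hour or Day, since Month cannot be converted to Millisecond.

```julia
using Dates

# Hypothetical helper, not a ClimaOcean function: subsample `native_dates`
# between `start_date` and `end_date` at `frequency`.
function select_dates(native_dates, start_date, end_date; frequency = step(native_dates))
    native_ms = Dates.value(convert(Millisecond, step(native_dates)))
    wanted_ms = Dates.value(convert(Millisecond, frequency))

    # The requested frequency must be a whole multiple of the native one
    wanted_ms % native_ms == 0 ||
        throw(ArgumentError("frequency $frequency is not a multiple of $(step(native_dates))"))

    stride      = wanted_ms ÷ native_ms
    first_index = findfirst(>=(start_date), native_dates)
    last_index  = findlast(<=(end_date), native_dates)

    return [native_dates[i] for i in first_index:stride:last_index]
end

native_dates = DateTime(1990, 1, 1):Hour(3):DateTime(1990, 12, 31, 21)
dates = select_dates(native_dates, DateTime(1990, 1, 1), DateTime(1990, 1, 31); frequency = Hour(6))
println(length(dates))
```

With this shape, the user only states the window and, optionally, a coarser frequency; the native frequency comes from the dataset's own dates.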

@glwagner
Member

glwagner commented Mar 9, 2025

Why aren't we using backend anymore?

@navidcy the size of the data in memory is displayed at the top of the show, eg 2×2×1×2920 here:

julia> atmosphere = JRA55PrescribedAtmosphere(longitude = λ★,
                                              latitude = φ★,
                                              backend = JRA55NetCDFBackend(2))
2×2×1×2920 PrescribedAtmosphere{Float32} on LatitudeLongitudeGrid:
├── times: 2920-element StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}
├── surface_layer_height: 10.0
└── boundary_layer_height: 600.0

Note, the time-dimension of the in-memory data (given by the 4th element in the size) can differ from the length of times. That's because the "backend" and "times" give different information. "backend" refers to how the data is stored, not to the number of time points. The objective is to have an object that can represent a long time-series, but which may have some data stored on disk rather than in memory ("partially in memory").
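
The distinction can be illustrated with a toy type. This is only a sketch of the concept; the struct and function names below are made up for illustration and do not reflect how Oceananigans implements backends.

```julia
# Toy illustration only; not the Oceananigans implementation.
struct PartlyInMemorySeries
    times :: Vector{Float64}   # every time point the series represents
    start :: Int               # index of the first snapshot held in memory
    data  :: Matrix{Float64}   # one column per in-memory snapshot
end

# Number of time points represented vs. number of snapshots held in memory
total_length(s::PartlyInMemorySeries)     = length(s.times)
in_memory_length(s::PartlyInMemorySeries) = size(s.data, 2)

# 2920 three-hourly time points (in seconds), but only 2 snapshots in memory
times  = [10800.0 * n for n in 0:2919]
series = PartlyInMemorySeries(times, 1, zeros(4, 2))

println((total_length(series), in_memory_length(series)))
```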

@glwagner
Member

glwagner commented Mar 9, 2025

@simone-silvestri I think that using start_date and end_date in the constructor would be better than having to provide a vector of dates. This relieves the user of needing to understand the frequency of the data, and of having to learn the name of an additional function like all_dates.

By the way, I am still worried that the design of Metadata is confusing, whereby the difference between "many dates" and a "single date" is expressed by the type of dates. In addition to being more intuitive and easier to understand, Vector{Metadata} is also more general. For example you can string two different filenames / types together with this method, by putting two different Metadata into a single vector.

@simone-silvestri
Collaborator Author

simone-silvestri commented Mar 9, 2025

OK, I'll add start_date and end_date here; the frequency will then be inferred from the frequency of the dataset.

Then when we merge it I can open a new PR that changes metadata to only one date and uses Vector{Metadata} instead of a metadata of dates. However, as a con, I think we lose the option to represent a full dataset with one metadata object.
Another option to resolve the confusion of dates vs. date is to have dates always be an iterable; we can then create a new object, Metadatum, where one instance of a Metadata in time is a Metadatum.

@glwagner
Member

glwagner commented Mar 9, 2025

> However, as a con, I think we lose the option to represent a full dataset with one metadata object.

I see how this is a consequence. But what specifically is the downside? The crux of figuring out the best way to develop this abstraction is to understand this specific trade-off, so we have to articulate the pros and cons clearly.

@simone-silvestri
Collaborator Author

I think the difference is whether we want Metadata to be part of the user interface or just an internal convenience type that helps us wrangle data.

In the first case, it is nice to be able to represent a dataset composed of a version, a name, and a set of dates in a type (here we can probably go the Metadata - Metadatum route), while a Vector{Metadata} might not be a consistent dataset because there can possibly be differences in versions or variable names.

In the latter case, there is no problem with the user interface, Metadata can be hidden in the internals with the user interface exposing only the methods that explicitly accept variable_name, dates, and version for things like ECCOFieldTimeSeries and JRA55FieldTimeSeries. We would have to think about how to change the set! function to remove the Metadata option. In the internals, when we pass a Vector{Metadata} to functions like FieldTimeSeries we need to ensure consistency of variable_name and version of all the elements of the vector.
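
For illustration, such a consistency check could look like the sketch below. The Metadatum struct and the is_consistent function are hypothetical stand-ins with assumed field types, not existing ClimaOcean code.

```julia
using Dates

# Hypothetical stand-in for a single-date metadata object
struct Metadatum
    name    :: Symbol
    date    :: DateTime
    version :: Symbol
end

# True when every element shares the name and version of the first one
function is_consistent(series)
    reference = first(series)
    return all(m -> m.name == reference.name && m.version == reference.version, series)
end

consistent = [Metadatum(:temperature, DateTime(1993, 1, 1), :ECCO4Monthly),
              Metadatum(:temperature, DateTime(1993, 2, 1), :ECCO4Monthly)]

mixed = [Metadatum(:temperature, DateTime(1993, 1, 1), :ECCO4Monthly),
         Metadatum(:salinity,    DateTime(1993, 2, 1), :ECCO4Monthly)]

println((is_consistent(consistent), is_consistent(mixed)))
```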

@glwagner
Member

glwagner commented Mar 9, 2025

> I think the difference is whether we want Metadata to be part of the user interface or just an internal convenience type that helps us wrangle data.
>
> In the first case, it is nice to be able to represent a dataset composed of a version, a name, and a set of dates in a type (here we can probably go the Metadata - Metadatum route), while a Vector{Metadata} might not be a consistent dataset because there can possibly be differences in versions or variable names.
>
> In the latter case, there is no problem with the user interface, Metadata can be hidden in the internals with the user interface exposing only the methods that explicitly accept variable_name, dates, and version for things like ECCOFieldTimeSeries and JRA55FieldTimeSeries. We would have to think about how to change the set! function to remove the Metadata option. In the internals, when we pass a Vector{Metadata} to functions like FieldTimeSeries we need to ensure consistency of variable_name and version of all the elements of the vector.

Ok, please clarify what you see as the trade-offs for the user interface. I think you are assuming a design, but it is not being explicitly stated. I can't respond to or judge the user interface you are referring to unless you state it explicitly.

Part of the problem is that the proposal to define something like

struct Metadatum # represents a single file
    # properties
end

const Metadata = Vector{Metadatum}

has no specific implications for how Metadata is constructed. We can keep the same constructor.

The difference is mainly that we could define a new constructor for a single Metadatum. This avoids the confusion that we pass dates = a_single_date to Metadata.

So in summary, I don't see what changes would be required of the user interface. The difference is that we can expand the user interface to make more sense while leaving existing components unchanged.

@glwagner
Member

glwagner commented Mar 9, 2025

Here's a simple example, just one of many possibilities

# source
Metadata(name; version, dates) = [Metadatum(name; version, date) for date in dates]
datum = Metadatum(name; version, date)
data = Metadata(name; version, dates)

The idea of a "version" for general Metadata is weird to me, by the way. What concept are we expressing here? data_source, or just source? "Version" implies that the data is one specific thing with multiple "versions" or "updates", which doesn't seem right.

@simone-silvestri
Collaborator Author

I was referring to achieving something similar avoiding a Vector that might mix different versions, I guess the end goal is the same.
An example is

struct Metadatum # fields ordered to match the positional calls below
   name
   date
   version
   dir
end

@propagate_inbounds Base.getindex(m::Metadata, i::Int) = Metadatum(m.name, m.dates[i],   m.version, m.dir)
@propagate_inbounds Base.first(m::Metadata) = Metadatum(m.name, m.dates[1],   m.version, m.dir)
@propagate_inbounds Base.last(m::Metadata) = Metadatum(m.name, m.dates[end], m.version, m.dir)

@inline function Base.iterate(m::Metadata, i=1)
    if (i % UInt) - 1 < length(m)
        return Metadatum(m.name, m.dates[i], m.version, m.dir), i + 1
    else
        return nothing
    end
end

also in this way it would be possible to do

datum = Metadatum(name; version, date)
data = Metadata(name; version, dates)
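
A self-contained version of this idea might look like the sketch below. The field names follow the snippet above, but the concrete field types and the example values are assumptions made so that the code runs on its own with just the Dates standard library.

```julia
using Dates

# Field names follow the sketch above; the concrete types are assumptions
struct Metadatum
    name    :: Symbol
    date    :: DateTime
    version :: Symbol
    dir     :: String
end

struct Metadata{D}
    name    :: Symbol
    dates   :: D
    version :: Symbol
    dir     :: String
end

# Minimal iteration and indexing interface: each element is one Metadatum
Base.length(m::Metadata) = length(m.dates)
Base.eltype(::Type{<:Metadata}) = Metadatum
Base.getindex(m::Metadata, i::Int) = Metadatum(m.name, m.dates[i], m.version, m.dir)

function Base.iterate(m::Metadata, i = 1)
    i > length(m) && return nothing
    return m[i], i + 1
end

data = Metadata(:temperature, DateTime(1993, 1, 1):Month(1):DateTime(1993, 3, 1), :ECCO4Monthly, "ECCO")
data_dates = [datum.date for datum in data]
println(data_dates)
```

Because every Metadatum is generated from the same parent, all elements share one name, version, and dir by construction.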

@glwagner
Member

glwagner commented Mar 9, 2025

I guess I was seeing the ability to mix "versions" as a perk. For example JRA55 was originally generated up to 2018, but there are updates which continue the dataset past that. You may not be able to form a single consistent dataset (eg a single "version") that encompasses all dates, but it still could be valid to write something like

up_to_2018 = Metadata(name; dataset=original_dataset, dates=dates_till_2018)
past_2018 = Metadata(name; dataset=continuation, dates=dates_past_2018)
data = vcat(up_to_2018, past_2018)
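
A runnable sketch of that concatenation is given below. Everything here is illustrative: the Metadatum struct, the metadata_series helper, the dataset labels, and the date ranges are assumptions, not ClimaOcean API or actual JRA55 coverage.

```julia
using Dates

# Hypothetical single-date metadata with a `dataset` label
struct Metadatum
    name    :: Symbol
    dataset :: Symbol
    date    :: DateTime
end

# Build one Metadatum per date (illustrative helper, not ClimaOcean API)
metadata_series(name; dataset, dates) = [Metadatum(name, dataset, date) for date in dates]

# Illustrative date ranges only
dates_till_2018 = DateTime(2018, 1, 1):Hour(3):DateTime(2018, 12, 31, 21)
dates_past_2018 = DateTime(2019, 1, 1):Hour(3):DateTime(2019, 1, 31, 21)

up_to_2018 = metadata_series(:temperature; dataset = :original,     dates = dates_till_2018)
past_2018  = metadata_series(:temperature; dataset = :continuation, dates = dates_past_2018)

# A plain vector can string the two sources together
data = vcat(up_to_2018, past_2018)
println(length(data))
```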

By the way, what does "version" mean in the context of a general Metadatum? It seems we need a different word like "dataset" or "label", which is more general than just "version".

@simone-silvestri
Collaborator Author

simone-silvestri commented Mar 9, 2025

I guess version is more suited in main where Metadata is ECCOMetadata and the dataset can only be part of the ECCO dataset with a specific version (2, 4, etc).

You are right that the field name has to change in this PR. I like dataset because it names the dataset the metadata points to.

@glwagner
Member

glwagner commented Mar 9, 2025

> I guess version is more suited in main where Metadata is ECCOMetadata and the dataset can only be part of the ECCO dataset with a specific version (2, 4, etc).
>
> You are right that the field name has to change in this PR. I like dataset because it names the dataset the metadata points to.

eg

struct ECCODataset
    version :: Symbol
end

Am I right that the path to a particular file is determined by a combination of date, dir, and dataset / version?

@simone-silvestri
Collaborator Author

yep, in general the full file path is determined by the whole metadatum (name, date, dataset / version, and dir)
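
As a rough sketch of that mapping, the example below builds a path from the four fields. The naming scheme and the functions metadata_filename and metadata_path are invented for illustration; the actual ECCO and JRA55 file names follow their own conventions.

```julia
using Dates

# Hypothetical single-file metadata; field types are assumptions
struct Metadatum
    name    :: Symbol
    dataset :: Symbol
    date    :: DateTime
    dir     :: String
end

# Hypothetical naming scheme, chosen only for illustration
function metadata_filename(m::Metadatum)
    datestr = Dates.format(m.date, "yyyymmdd")
    return string(m.dataset, "_", m.name, "_", datestr, ".nc")
end

# The full path combines the directory with the per-file name
metadata_path(m::Metadatum) = joinpath(m.dir, metadata_filename(m))

datum = Metadatum(:temperature, :ECCO4Monthly, DateTime(1993, 1, 1), "data")
println(metadata_path(datum))
```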

@glwagner
Member

glwagner commented Mar 9, 2025

Ruminating on "dataset" --- I think it could be a good term, because it expresses the concept of a "category of data". Which is what we mean here, a single metadatum refers to one file; a whole dataset will have many files, each of which has a metadatum.

Also just to offer an alternative --- rather than metadatum / metadata, we could have

struct Metadata
    name
    dataset
end

const MetadataSeries = Vector{Metadata}

semantically, MetadataSeries is easier to distinguish from Metadata than Metadata is to distinguish from Metadatum. It might help too, to emphasize that a vector of Metadata is a series of snapshots.

Labels
build docs Add this label to build the docs in a PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactoring JRA55
3 participants