Non-uniform distribution of S100 dataset #11

PlekhanovaElena · 2024-04-29T12:18:50Z

Hi there,

While exploring the pre-training data, I noticed an issue with the S100 dataset that I think can be fixed easily.
I visualized it here

So, basically the problems are:

oversampling in Greenland (probably due to the Sentinel-2 orbit path, which gives a much higher visit frequency near the poles)
undersampling in the tropics and at 50-70°N latitude

The problem arises from 1) the Sentinel-2 path and 2) filtering out dates with high cloud cover, which affects the tropics a lot.

I was thinking about how to get uniform sampling and realized that the first step of creating S100 is to pick a Sentinel-2 tile, and tiles are distributed approximately uniformly. So forcing the algorithm to pick approximately the same number of images per Sentinel-2 tile should fix it.
My suggested easy fix is to sample uniformly by tile name (the tiles have the attribute 's2:mgrs_tile') like this:

df['weight'] = 1./df.groupby('s2:mgrs_tile')['s2:mgrs_tile'].transform('count')
sampledf = df.sample(100000, weights = df.weight)
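
To make the effect easier to check, the two lines above can be wrapped in a small helper and tried on synthetic data. This is only a sketch: the helper name `sample_balanced_by_tile`, the tile labels, and the `scene_id` column are made up for illustration and are not part of the S100 pipeline; only the 's2:mgrs_tile' attribute comes from the data.

```python
import numpy as np
import pandas as pd


def sample_balanced_by_tile(df, n, tile_col="s2:mgrs_tile", seed=None):
    """Draw n rows, weighting each row by 1 / (number of rows in its tile)."""
    # Every tile ends up with the same total weight, so over-represented
    # tiles no longer dominate the draw.
    tile_counts = df.groupby(tile_col)[tile_col].transform("count")
    weights = 1.0 / tile_counts
    # DataFrame.sample normalises the weights and samples without replacement.
    return df.sample(n=n, weights=weights, random_state=seed)


# Synthetic stand-in for the scene table: one heavily oversampled tile,
# one medium tile, one undersampled tile (labels are hypothetical).
tiles = np.array(
    ["tile_polar"] * 9000 + ["tile_midlat"] * 900 + ["tile_tropic"] * 100
)
demo = pd.DataFrame({"s2:mgrs_tile": tiles, "scene_id": np.arange(tiles.size)})

naive = demo.sample(n=150, random_state=0)
balanced = sample_balanced_by_tile(demo, n=150, seed=0)

naive_counts = naive["s2:mgrs_tile"].value_counts()
balanced_counts = balanced["s2:mgrs_tile"].value_counts()
print("naive:\n", naive_counts)
print("balanced:\n", balanced_counts)
```

One caveat: because the draw is without replacement, a tile with very few scenes can run out, so per-tile counts are only approximately equal. If strictly equal counts matter, a fixed number of rows per tile (capped at the tile size) would be an alternative.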

I know that the SatCLIP model trained on S100 is only a prototype and a proof of concept, but in case you want to run experiments with more uniformly distributed pre-training data, this seems quite easy to fix :)

Kind regards,
Elena

konstantinklemmer · 2024-04-29T14:17:11Z

Fantastic, thanks for this analysis @PlekhanovaElena! I will link it in the main repository.
