Hi there,
While exploring the pre-training data, I noticed an issue with the S100 dataset that I think can be fixed easily.
I visualized it here
So, basically the problems are:
- oversampling in Greenland (probably due to the Sentinel-2 orbit path, which has a much higher revisit frequency near the poles)
- undersampling in the tropics and at 50-70°N latitude
The problem arises from 1) the Sentinel-2 orbit path and 2) filtering out dates with high cloud cover, which affects the tropics a lot.
I was thinking about how to get uniform sampling and realized that the first step of creating S100 is to pick a Sentinel-2 tile, and tiles are distributed approximately uniformly. So forcing the algorithm to pick approximately the same number of images per Sentinel-2 tile should fix it.
My suggested easy fix is to sample uniformly by tile name (the tiles have the attribute 's2:mgrs_tile'), like this:
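```python
# Weight each image by 1 / (number of images in its tile), so every tile is equally likely.
df['weight'] = 1. / df.groupby('s2:mgrs_tile')['s2:mgrs_tile'].transform('count')
sampledf = df.sample(100000, weights=df['weight'])
```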
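To illustrate the effect, here is a minimal sketch on a toy table. The tile names and counts are made up, and I sample with replacement only because the toy table is small. It assumes the real metadata table has one row per candidate image with an 's2:mgrs_tile' column.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the candidate image table: three hypothetical tiles
# with very different numbers of available images.
counts = {"tile_polar": 5000, "tile_mid": 500, "tile_tropic": 50}
df = pd.DataFrame({
    "s2:mgrs_tile": np.repeat(list(counts), list(counts.values())),
})

# Each row gets weight 1 / (rows in its tile), so each tile's weights sum to 1.
df["weight"] = 1.0 / df.groupby("s2:mgrs_tile")["s2:mgrs_tile"].transform("count")

# replace=True is an assumption for this toy example only.
sampledf = df.sample(3000, weights=df["weight"], replace=True, random_state=0)

print(df["s2:mgrs_tile"].value_counts())        # heavily skewed input
print(sampledf["s2:mgrs_tile"].value_counts())  # roughly balanced across tiles
```

Note that `df.sample` draws without replacement by default, so on the real table tiles with very few images can run out, and the result is only approximately uniform per tile.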
I know that the SatCLIP model trained on S100 is only a prototype and a proof of concept, but in case you want to run experiments with more uniformly distributed pre-training data, this seems quite easy to fix :)
Kind regards,
Elena