Non-uniform distribution of S100 dataset #11

Open
PlekhanovaElena opened this issue Apr 29, 2024 · 1 comment

Contributor

PlekhanovaElena commented Apr 29, 2024

Hi there,

While exploring the pre-training data, I noticed an issue with the S100 dataset that I think can be fixed easily.
I visualized it here: [image: s100_points_distribution]

So, basically the problems are:

  1. oversampling in Greenland (probably due to the Sentinel-2 orbit path, which gives a much higher revisit frequency near the poles)
  2. undersampling in the tropics and at 50-70°N latitude

The problem arises from 1) the Sentinel-2 orbit path and 2) filtering out dates with high cloud cover, which affects the tropics a lot.

I was thinking about how to get uniform sampling and realized that the first step in creating S100 is to pick a Sentinel-2 tile, and the tiles are distributed approximately uniformly. So forcing the algorithm to pick approximately the same number of images per Sentinel-2 tile should fix it.
My suggested easy fix is to sample uniformly by tile name (the tiles have the attribute 's2:mgrs_tile'), like this:

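# Weight each row by 1 / (number of rows in its tile) so every tile is equally likely to be drawn
df['weight'] = 1. / df.groupby('s2:mgrs_tile')['s2:mgrs_tile'].transform('count')
sampledf = df.sample(100000, weights=df['weight'])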
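
To make the effect easier to check, here is a minimal self-contained sketch of the same weighting idea on a toy table. The tile names, row counts, the `cloud_cover` column, and the helper `sample_uniform_by_tile` are all made up for illustration; only the 's2:mgrs_tile' attribute comes from the real data.

```python
import numpy as np
import pandas as pd


def sample_uniform_by_tile(df, n, tile_col="s2:mgrs_tile", seed=0):
    """Draw n rows so that each tile has roughly the same total chance of being picked."""
    # Each row gets weight 1 / (rows in its tile), so the weights of every tile sum to 1.
    weights = 1.0 / df.groupby(tile_col)[tile_col].transform("count")
    # pandas normalizes the weights internally; sampling is without replacement by default.
    return df.sample(n=n, weights=weights, random_state=seed)


# Toy stand-in for the candidate table: one heavily over-represented tile
# and two sparse ones (names and counts are invented, not taken from S100).
rng = np.random.default_rng(0)
tiles = np.array(["24WWU"] * 9000 + ["18NXL"] * 500 + ["35VLJ"] * 500)
df = pd.DataFrame({"s2:mgrs_tile": tiles, "cloud_cover": rng.uniform(0, 20, tiles.size)})

naive = df.sample(n=600, random_state=0)        # plain random sample
balanced = sample_uniform_by_tile(df, n=600)    # tile-weighted sample

print(naive["s2:mgrs_tile"].value_counts())
print(balanced["s2:mgrs_tile"].value_counts())
```

In the plain sample the dense tile dominates, while the weighted sample is much closer to even. It is only approximately uniform: because rows are drawn without replacement, sparse tiles get depleted as sampling goes on, so dense tiles still end up slightly over-represented.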

I know that the SatCLIP model trained on S100 is only a prototype and a proof of concept, but in case you want to run experiments with more uniformly distributed pre-training data, this seems quite easy to fix :)

Kind regards,
Elena

@konstantinklemmer
Collaborator

Fantastic, thanks for this analysis @PlekhanovaElena! I will link it in the main repository.
