
Adds the LandCoverAI100 dataset and datamodule for use in semantic segmentation notebooks #2262

Merged (11 commits into main, Sep 11, 2024)

Conversation

@calebrob6 (Member) commented Aug 29, 2024

  • Make it actually work
  • Make the tests
  • Green checks all around

@github-actions bot added the documentation, datasets, testing, and datamodules labels Aug 29, 2024
@adamjstewart adamjstewart added this to the 0.7.0 milestone Aug 30, 2024
@adamjstewart (Collaborator)

Before we get too deep into this, first things first is to decide if LandCover.ai is the correct dataset to use for our notebooks. We reuse EuroSAT100 for all of our classification-based tutorials, and I would like to do the same for all future semantic segmentation-based tutorials as well.

EuroSAT was chosen because it was well known within the community, had non-RGB bands so we could use it to show off our non-RGB weights, had a permissive license, and had geolocation available (see the "Why EuroSAT" section of #1130).

LandCover.ai is RGB-only and has a somewhat restrictive license. As far as I can tell from my non-legal background, here we're just using it for educational purposes, so we don't run into any issues with the license. I do like how the imagery is high resolution, so the plots are much prettier than EuroSAT. Are there any other candidates we should consider?

If we do decide on LandCover.ai, I have a bunch of follow-up review comments (mostly on Hugging Face, not on your PR) to address for both license compliance and redistributing the full dataset.

@calebrob6 (Member, Author)

RGB vs. non-RGB doesn't really matter to me as the point of the tutorials is to show how to do things -- extrapolating to non-RGB can be indicated where appropriate. We already show the existence of pre-trained weights, and the value of those would only be demonstrated by full benchmarks.

No idea about the license.

It is relatively popular -- 150 citations

Other options that are commonly used in papers:

  • Vaihingen -- no auto download, 3-channel (but not RGB, bands are near infrared, red and green)
  • DFC2022 -- 3 citations, I think the labels are quite noisy
  • Potsdam -- no auto download, 555 citations
  • Inria -- 5k x 5k patches, no auto download
  • Chesapeake CVPR -- very good dataset, auto downloading, highly cited, pretty

@adamjstewart (Collaborator)

  • Vaihingen, Potsdam, Inria: license unknown, meaning it's technically illegal for anyone to use these datasets for any purpose
  • DFC2022: sounds like it's pretty unheard of with only 3 citations
  • Chesapeake CVPR: license looks like a nightmare (a different license for every single layer) although it is pretty permissive
  • LandCover.ai: license is much simpler (single license), but less permissive

I also still have nightmares about Chesapeake CVPR given how many bugs our data loader has: https://github.com/microsoft/torchgeo/issues/assigned/calebrob6

Let me poll Slack real quick and see what else people come up with, but so far I'm leaning towards LandCover.ai.

@robmarkcole (Contributor) commented Sep 4, 2024

A small dataset that can be trained on easily is the Amazon forest dataset (CC BY 4.0)

@adamjstewart (Collaborator) commented Sep 4, 2024

We removed support for rar files, but we could redistribute it as a zip file. Then someone would have to write a data loader for it. I guess it depends on how pretty the visualizations are; it doesn't have to be well known.

@calebrob6 (Member, Author)

Looks like a cool dataset!! Do you know if this is used in any papers @robmarkcole?

@robmarkcole (Contributor)

@calebrob6 a couple of papers use it. I actually used it in the tech eval for my current role! I believe the license allows redistribution, so we could just host it on huggingface with attribution.

@calebrob6 (Member, Author)

Cool, seems like a nice little dataset (as long as a smp unet + resnet18 doesn't get 99.99% acc)!

I'm somewhat OK with implementing it in torchgeo, as long as that means I can actually write the tutorials. @adamjstewart what are your concerns about landcoverai100 on huggingface?

@adamjstewart (Collaborator)

Comments on HF:

  • We should rename the repo from landcoverai100 to landcoverai. I'm thinking of redistributing all of landcoverai there in the future. Specifically, if we pre-chip the dataset and upload the chipped version, we can completely remove our OpenCV dependency
  • The license requires attribution, including what changes were made to the dataset. I usually add a blurb to the README pointing to the original source and describing what I changed

Comments on PR:

  • You should replace main with the git commit hash for the latest commit to ensure reproducibility in case we change the dataset in the future. Not that this matters for our tutorials, but want to stay consistent
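To make the pinning suggestion concrete, a sketch of the URL change (filename and hash are placeholders; Hugging Face's `resolve/` endpoint accepts a commit hash in place of a branch name):

```
# before: tracks the moving main branch
https://huggingface.co/datasets/torchgeo/landcoverai/resolve/main/<file>
# after: pinned to an immutable commit for reproducibility
https://huggingface.co/datasets/torchgeo/landcoverai/resolve/<commit-hash>/<file>
```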

@calebrob6 (Member, Author)

Got it, let me know which route you'd prefer

@adamjstewart (Collaborator)

I guess it depends on how nice the predictions and visualizations look. The license for the Amazon dataset looks perfect. What spectral bands does the Amazon dataset contain? Would be nice if we could start with a pre-trained model to make the predictions better.

@calebrob6 (Member, Author)

The imagery is 8-bit R,G,B (looks like S2 to me). Predictions from the papers that use the dataset seem fine (see below), so I'm assuming a standard training setup will be able to match them. Note that a standard training setup is not what will be in the tutorials by default, since we have the restriction that they run in CI.

[image: example predictions from papers that use the Amazon forest dataset]

@adamjstewart (Collaborator)

Looks good to me. In that case, I have a slight preference for this one (given its simplicity, its permissive license, and the fact that it's a real dataset), but would also be okay with LandCover.ai 100. Either way, we need to redistribute it on HF to convert from .rar to .zip, and most of the same HF/PR comments still apply.

@calebrob6 (Member, Author) commented Sep 9, 2024

In that case, going to go with LandCoverAI100 because I've already done all the work!

Integrated your comments (see https://huggingface.co/datasets/torchgeo/landcoverai).

Specifically, if we pre-chip the dataset and upload the chipped version, we can completely remove our OpenCV dependency

On one hand I like this, because we get rid of the dependency and don't have to maintain the custom eval/checksum code. On the other hand, I like the idea of training on LandCoverAI as a tiled dataset instead of a pre-chipped dataset; my hypothesis is that this improves performance. Back on the first hand, since this already has random train/val/test chips, the tiled training experiment would be hard to set up and would be an apples-to-oranges comparison.
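For readers unfamiliar with what pre-chipping means here: the idea is to cut each large tile into fixed-size chips once, offline, so the loader only ever reads small arrays. A minimal sketch (a hypothetical helper, not TorchGeo code; LandCover.ai conventionally uses 512 x 512 chips):

```python
import numpy as np


def chip_image(image: np.ndarray, size: int = 512) -> list[np.ndarray]:
    """Split an (H, W, C) tile into non-overlapping size x size chips,
    discarding any partial chips at the right/bottom edges."""
    h, w = image.shape[:2]
    chips = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            chips.append(image[y:y + size, x:x + size])
    return chips


# e.g. a 1024 x 1024 RGB tile yields four 512 x 512 chips
tile = np.zeros((1024, 1024, 3), dtype=np.uint8)
print(len(chip_image(tile)))  # 4
```

In the pre-chipped distribution, the outputs of such a helper (plus the corresponding mask chips) would be what gets uploaded, removing the need for any image-cropping dependency at load time.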

@adamjstewart (Collaborator)

On the other hand, I like the idea of training on LandCoverAI as a tiled dataset instead of a pre-chipped dataset.

Well you're in luck, see #1126

Speaking of which, @adrianboguszewski may be interested in this discussion. For reference, we are choosing a simpler and popular semantic segmentation dataset for use in our tutorial examples.

@calebrob6 (Member, Author)

Yes, I'm aware of the geodataset version, but not aware of any experiments that train on that and try to test on the original test split. Because the original test split is randomly tiled out this comparison would be difficult to do correctly (you'd have to mask out the val and test areas in training).

@adrianboguszewski (Contributor)

Hi guys. It sounds super cool and I'm fully ok with using the dataset (100) for tutorials :) Let me know if you need anything from my side.

@calebrob6 (Member, Author)

Thanks for such a cool dataset :)

@adrianboguszewski (Contributor)

And I can see you added the dataset to HuggingFace. Could you change the name to landcoverai100, as it doesn't contain all files?

@adamjstewart (Collaborator)

@adrianboguszewski we're planning on adding all files in the future, but we don't want to rename the repo later since that will break downloads for older versions.

@calebrob6 (Member, Author)

Yeah, we should decide now. I think @adamjstewart is thinking we have a single HF repo named landcoverai that contains both landcoverai (full) and landcoverai100, while @adrianboguszewski is thinking we have a landcoverai100 repo and a separate landcoverai repo. I'm fine either way (preference to whatever @adrianboguszewski wants, as it is his dataset 😄), since I don't know how much people actually browse the HF dataset repos (i.e., I'd expect users to just consume both through the torchgeo dataset object with download=True).

@adrianboguszewski (Contributor)

With these 2 options in place, I prefer to go with one repo landcoverai and 2 datasets inside :)

@calebrob6 calebrob6 merged commit 94960bb into main Sep 11, 2024
19 checks passed
@calebrob6 calebrob6 deleted the landcoverai100 branch September 11, 2024 20:48
Labels: datamodules (PyTorch Lightning datamodules), datasets (Geospatial or benchmark datasets), documentation (Improvements or additions to documentation), testing (Continuous integration testing)
Projects: none
4 participants