
Adds the LandCoverAI100 dataset and datamodule for use in semantic segmentation notebooks #2262

Merged (11 commits into main, Sep 11, 2024)

Conversation

@calebrob6 (Member) commented Aug 29, 2024

  • Make it actually work
  • Make the tests
  • Green checks all around

@github-actions bot added the documentation, datasets, testing, and datamodules labels Aug 29, 2024
@adamjstewart adamjstewart added this to the 0.7.0 milestone Aug 30, 2024
@adamjstewart (Collaborator)

Before we get too deep into this, first things first is to decide if LandCover.ai is the correct dataset to use for our notebooks. We reuse EuroSAT100 for all of our classification-based tutorials, and I would like to do the same for all future semantic segmentation-based tutorials as well.

EuroSAT was chosen because it was well known within the community, had non-RGB bands so we could use it to show off our non-RGB weights, had a permissive license, and had geolocation available (see the "Why EuroSAT" section of #1130).

LandCover.ai is RGB-only and has a somewhat restrictive license. As far as I can tell from my non-legal background, here we're just using it for educational purposes, so we don't run into any issues with the license. I do like how the imagery is high resolution, so the plots are much prettier than EuroSAT. Are there any other candidates we should consider?

If we do decide on LandCover.ai, I have a bunch of follow-up review comments (mostly on Hugging Face, not on your PR) to address for both license compliance and redistributing the full dataset.

@calebrob6 (Member, Author)

RGB vs. non-RGB doesn't really matter to me as the point of the tutorials is to show how to do things -- extrapolating to non-RGB can be indicated where appropriate. We already show the existence of pre-trained weights, and the value of those would only be demonstrated by full benchmarks.

No idea about the license.

It is relatively popular -- 150 citations

Other options that are commonly used in papers:

  • Vaihingen -- no auto download, 3-channel (but not RGB, bands are near infrared, red and green)
  • DFC2022 -- 3 citations, I think the labels are quite noisy
  • Potsdam -- no auto download, 555 citations
  • Inria -- 5k x 5k patches, no auto download
  • Chesapeake CVPR -- very good dataset, auto downloading, highly cited, pretty

@adamjstewart (Collaborator)

  • Vaihingen, Potsdam, Inria: license unknown, meaning it's technically illegal for anyone to use these datasets for any purpose
  • DFC2022: sounds like it's pretty unheard of with only 3 citations
  • Chesapeake CVPR: license looks like a nightmare (a different license for every single layer) although it is pretty permissive
  • LandCover.ai: license is much simpler (single license), but less permissive

I also still have nightmares about Chesapeake CVPR given how many bugs our data loader has: https://github.com/microsoft/torchgeo/issues/assigned/calebrob6

Let me poll Slack real quick and see what else people come up with, but so far I'm leaning towards LandCover.ai.

@robmarkcole (Contributor) commented Sep 4, 2024

A small dataset that can be trained on easily is the Amazon forest dataset (CC BY 4.0)

@adamjstewart (Collaborator) commented Sep 4, 2024

We removed support for rar files, but we could redistribute it as a zip file. Then someone would have to write a data loader for it. I guess it depends on how pretty the visualizations are; it doesn't have to be well known.

@calebrob6 (Member, Author)

Looks like a cool dataset!! Do you know if this is used in any papers @robmarkcole?

@robmarkcole (Contributor)

@calebrob6 a couple of papers use it. I actually used it in the tech eval for my current role! I believe the license allows redistribution, so we could just host it on huggingface with attribution.

@calebrob6 (Member, Author)

Cool, seems like a nice little dataset (as long as a smp unet + resnet18 doesn't get 99.99% acc)!

I'm somewhat OK with implementing it in torchgeo, as long as that means I can actually write the tutorials. @adamjstewart what are your concerns about landcoverai100 on huggingface?

@adamjstewart (Collaborator)

Comments on HF:

  • We should rename the repo from landcoverai100 to landcoverai. I'm thinking of redistributing all of landcoverai there in the future. Specifically, if we pre-chip the dataset and upload the chipped version, we can completely remove our OpenCV dependency
  • The license requires attribution, including what changes were made to the dataset. I usually add a blurb to the README pointing to the original source and describing what I changed

Comments on PR:

  • You should replace main with the git commit hash for the latest commit to ensure reproducibility in case we change the dataset in the future. Not that this matters for our tutorials, but want to stay consistent
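To make the pinning suggestion concrete, a sketch of the URL change (filename and hash are placeholders; Hugging Face's `resolve/` endpoint accepts a commit hash in place of a branch name):

```
# before: tracks the moving main branch
https://huggingface.co/datasets/torchgeo/landcoverai/resolve/main/<file>
# after: pinned to an immutable commit for reproducibility
https://huggingface.co/datasets/torchgeo/landcoverai/resolve/<commit-hash>/<file>
```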

@calebrob6 (Member, Author)

Got it, let me know which route you'd prefer

@adamjstewart (Collaborator)

I guess it depends on how nice the predictions and visualizations look. The license for the Amazon dataset looks perfect. What spectral bands does the Amazon dataset contain? Would be nice if we could start with a pre-trained model to make the predictions better.

@calebrob6 (Member, Author)

The imagery is 8-bit R,G,B (looks like S2 to me). Predictions from the papers that use the dataset seem fine (see below), so I'm assuming a standard training setup will be able to match them. Note that a standard training setup is not what will be in the tutorials by default, since we have the restriction that they run in CI.

[image: example predictions from papers that use the Amazon forest dataset]

@adamjstewart (Collaborator)

Looks good to me. In that case, I have a slight preference for this one (given its simplicity, its permissive license, and the fact that it's a real dataset), but would also be okay with LandCover.ai 100. Either way, we need to redistribute it on HF to convert from .rar to .zip, and most of the same HF/PR comments still apply.

@calebrob6 (Member, Author) commented Sep 9, 2024

In that case, going to go with LandCoverAI100 because I've already done all the work!

Integrated your comments (see https://huggingface.co/datasets/torchgeo/landcoverai).

Specifically, if we pre-chip the dataset and upload the chipped version, we can completely remove our OpenCV dependency

On one hand I like this, because we get rid of the dependency and don't have to maintain the custom eval/checksum code. On the other hand, I like the idea of training on LandCoverAI as a tiled dataset instead of a pre-chipped dataset; my hypothesis is that this improves performance. Back on the first hand, since this already has random train/val/test chips, the tiled training experiment would be hard to set up and would be an apples-to-oranges comparison.
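For readers unfamiliar with what pre-chipping means here: the idea is to cut each large tile into fixed-size chips once, offline, so the loader only ever reads small arrays. A minimal sketch (a hypothetical helper, not TorchGeo code; LandCover.ai conventionally uses 512 x 512 chips):

```python
import numpy as np


def chip_image(image: np.ndarray, size: int = 512) -> list[np.ndarray]:
    """Split an (H, W, C) tile into non-overlapping size x size chips,
    discarding any partial chips at the right/bottom edges."""
    h, w = image.shape[:2]
    chips = []
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            chips.append(image[y:y + size, x:x + size])
    return chips


# e.g. a 1024 x 1024 RGB tile yields four 512 x 512 chips
tile = np.zeros((1024, 1024, 3), dtype=np.uint8)
print(len(chip_image(tile)))  # 4
```

In the pre-chipped distribution, the outputs of such a helper (plus the corresponding mask chips) would be what gets uploaded, removing the need for any image-cropping dependency at load time.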

@adamjstewart (Collaborator)

On the other hand, I like the idea of training on LandCoverAI as a tiled dataset instead of a pre-chipped dataset.

Well you're in luck, see #1126

Speaking of which, @adrianboguszewski may be interested in this discussion. For reference, we are choosing a simpler and popular semantic segmentation dataset for use in our tutorial examples.

@calebrob6 (Member, Author)

Yes, I'm aware of the geodataset version, but not aware of any experiments that train on that and try to test on the original test split. Because the original test split is randomly tiled out this comparison would be difficult to do correctly (you'd have to mask out the val and test areas in training).

@adrianboguszewski (Contributor)

Hi guys. It sounds super cool and I'm fully ok with using the dataset (100) for tutorials :) Let me know if you need anything from my side.

@calebrob6 (Member, Author)

Thanks for such a cool dataset :)

@adrianboguszewski (Contributor)

And I can see you added the dataset to HuggingFace. Could you change the name to landcoverai100, as it doesn't contain all files?

@adamjstewart (Collaborator)

@adrianboguszewski we're planning on adding all files in the future, but we don't want to rename the repo later since that will break downloads for older versions.

@calebrob6 (Member, Author)

Yeah, we should decide now. I think @adamjstewart is thinking we have a single HF repo named landcoverai that contains both landcoverai (full) and landcoverai100, while @adrianboguszewski is thinking we have a landcoverai100 repo and a separate landcoverai repo. I'm fine either way (preference to whatever @adrianboguszewski wants, as it is his dataset 😄), since I don't know how much people actually browse the HF dataset repos (i.e., I'd expect users to just consume both through the torchgeo dataset object with download=True).

@adrianboguszewski (Contributor)

With these 2 options in place, I prefer to go with one repo landcoverai and 2 datasets inside :)

@calebrob6 calebrob6 merged commit 94960bb into main Sep 11, 2024
19 checks passed
@calebrob6 calebrob6 deleted the landcoverai100 branch September 11, 2024 20:48
Labels: datamodules (PyTorch Lightning datamodules), datasets (Geospatial or benchmark datasets), documentation (Improvements or additions to documentation), testing (Continuous integration testing)
Projects: none
4 participants