Adds the LandCoverAI100 dataset and datamodule for use in semantic segmentation notebooks #2262
calebrob6 commented Aug 29, 2024 (edited):
- Make it actually work
- Make the tests
- Green checks all around
Before we get too deep into this, first things first is to decide whether LandCover.ai is the right dataset to use for our notebooks. We reuse EuroSAT100 for all of our classification-based tutorials, and I would like to do the same for all future semantic segmentation-based tutorials as well. EuroSAT was chosen because it was well known within the community, had non-RGB bands so we could use it to show off our non-RGB weights, had a permissive license, and had geolocation available (see the "Why EuroSAT" section of #1130). LandCover.ai is RGB-only and has a somewhat restrictive license. As far as I can tell from my non-legal background, we're just using it for educational purposes here, so we don't run into any issues with the license. I do like how the imagery is high resolution, so the plots are much prettier than EuroSAT. Are there any other candidates we should consider? If we do decide on LandCover.ai, I have a bunch of follow-up review comments (mostly on Hugging Face, not on your PR) to address for both license compliance and redistributing the full dataset.
RGB vs. non-RGB doesn't really matter to me, as the point of the tutorials is to show how to do things -- extrapolating to non-RGB can be indicated where appropriate. We already show the existence of pre-trained weights, and the value of those would only be demonstrated by full benchmarks. No idea about the license. It is relatively popular -- 150 citations. Other options that are commonly used in papers:
I also still have nightmares about Chesapeake CVPR given how many bugs our data loader has: https://github.com/microsoft/torchgeo/issues/assigned/calebrob6. Let me poll Slack real quick and see what else people come up with, but so far I'm leaning towards LandCover.ai.
A small dataset that can be trained on easily is the Amazon forest dataset (CC BY 4.0).
We removed support for rar files, but we could redistribute that as a zip file. But then someone will have to write a data loader for it. I guess it depends on how pretty the visualizations are; it doesn't have to be well known.
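For a sense of what writing that data loader would involve, here is a minimal sketch of a NonGeoDataset-style loader for a pre-chipped image/mask dataset. The directory layout, file naming, and `AmazonForest` class name are assumptions for illustration, not the Amazon dataset's actual structure:

```python
# Sketch of a torchgeo NonGeoDataset for pre-chipped image/mask pairs.
# The images/ and masks/ per-split layout is an assumption, not the
# Amazon forest dataset's real structure.
import os
from collections.abc import Callable
from typing import Optional

import numpy as np
import torch
from PIL import Image
from torchgeo.datasets import NonGeoDataset


class AmazonForest(NonGeoDataset):
    """Hypothetical loader for RGB chips and single-channel masks."""

    def __init__(
        self,
        root: str = "data",
        split: str = "train",
        transforms: Optional[Callable] = None,
    ) -> None:
        self.root = root
        self.split = split
        self.transforms = transforms
        self.filenames = sorted(os.listdir(os.path.join(root, split, "images")))

    def __len__(self) -> int:
        return len(self.filenames)

    def __getitem__(self, index: int) -> dict[str, torch.Tensor]:
        name = self.filenames[index]
        image = np.array(Image.open(os.path.join(self.root, self.split, "images", name)))
        mask = np.array(Image.open(os.path.join(self.root, self.split, "masks", name)))
        sample = {
            "image": torch.from_numpy(image).permute(2, 0, 1).float(),  # HWC -> CHW
            "mask": torch.from_numpy(mask).long(),
        }
        if self.transforms is not None:
            sample = self.transforms(sample)
        return sample
```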
Looks like a cool dataset!! Do you know if this is used in any papers, @robmarkcole?
@calebrob6 a couple of papers. I actually used this in my tech eval for my current role! I believe the license allows redistribution, so we could just host it on Hugging Face with attribution.
Cool, seems like a nice little dataset (as long as an SMP U-Net + ResNet-18 doesn't get 99.99% accuracy)! I'm somewhat OK with implementing it in torchgeo as long as that means I can actually write tutorials. @adamjstewart what are your concerns about landcoverai100 on huggingface?
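As a rough illustration of the sanity check being described, a minimal sketch of that SMP U-Net + ResNet-18 setup (the class count is an assumption and would need to match whatever the dataset's masks actually contain):

```python
# Sanity-check model: segmentation_models_pytorch U-Net with a ResNet-18 encoder.
# classes=5 is an assumption (background + 4 foreground classes for LandCover.ai);
# adjust to the dataset actually chosen.
import segmentation_models_pytorch as smp
import torch

model = smp.Unet(
    encoder_name="resnet18",
    encoder_weights="imagenet",
    in_channels=3,  # RGB imagery
    classes=5,
)

x = torch.rand(2, 3, 512, 512)  # batch of RGB chips (B, C, H, W)
logits = model(x)               # -> (2, 5, 512, 512) per-pixel logits
print(logits.shape)
```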
Comments on HF:
Comments on PR:
Got it, let me know which route you'd prefer.
I guess it depends on how nice the predictions and visualizations look. The license for the Amazon dataset looks perfect. What spectral bands does the Amazon dataset contain? Would be nice if we could start with a pre-trained model to make the predictions better.
The imagery is 8-bit R, G, B (looks like S2 to me). Predictions from the papers that use the dataset seem fine (see below), so I'm assuming a standard training setup will be able to get the same. Note that a standard training setup is not what will be in the tutorials by default, as we have the restriction that they must run in CI.
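For context, the CI-constrained tutorial setup would look roughly like the sketch below: a tiny Lightning run rather than a real training schedule. The `LandCoverAI100DataModule` name follows this PR's title, and the exact argument names for it and for `SemanticSegmentationTask` are assumptions that may differ from the merged API:

```python
# Sketch of a CI-friendly tutorial run: a single tiny pass so the notebook
# finishes quickly in CI. Names and arguments are assumptions based on this
# PR and torchgeo's existing trainers; check the merged API before copying.
from lightning.pytorch import Trainer
from torchgeo.datamodules import LandCoverAI100DataModule
from torchgeo.trainers import SemanticSegmentationTask

datamodule = LandCoverAI100DataModule(batch_size=8, num_workers=2, download=True)
task = SemanticSegmentationTask(
    model="unet",
    backbone="resnet18",
    in_channels=3,
    num_classes=5,
    lr=1e-3,
)
trainer = Trainer(fast_dev_run=True)  # runs one batch of train/val and stops
trainer.fit(model=task, datamodule=datamodule)
```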
Looks good to me. In that case, I have a slight preference for this one (given its simplicity, permissive license, and the fact that it's a real dataset) but would also be okay with LandCover.ai 100. Either way, we need to redistribute it on HF to convert from .rar to .zip, and most of the same HF/PR comments still apply.
(branch updated from 235bb4f to a9c4239)
In that case, I'm going to go with LandCoverAI100 because I've already done all the work! I've integrated your comments (see https://huggingface.co/datasets/torchgeo/landcoverai).
On one hand I like this because we get rid of the dependency and don't have to have the custom eval / checksum code. On the other hand, I like the idea of training on LandCoverAI as a tiled dataset instead of a pre-chipped dataset. My hypothesis is that this improves performance. Back on the first hand, as this already has random train/val/test chips, the tiled training experiment would be hard to set up and would be apples to oranges.
Well, you're in luck, see #1126. Speaking of which, @adrianboguszewski may be interested in this discussion. For reference, we are choosing a simple and popular semantic segmentation dataset for use in our tutorial examples.
Yes, I'm aware of the geodataset version, but not aware of any experiments that train on that and try to test on the original test split. Because the original test split is randomly tiled out, this comparison would be difficult to do correctly (you'd have to mask out the val and test areas in training).
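For reference, a minimal sketch of what the tiled (GeoDataset) training setup from #1126 looks like, assuming the `LandCoverAIGeo` class; as noted above, masking out the original val/test areas during sampling would still have to be handled separately for a fair comparison:

```python
# Sketch of tiled sampling from the GeoDataset version of LandCover.ai.
# Assumes the LandCoverAIGeo class referenced above; exact constructor
# arguments and sampler units may differ between torchgeo versions.
from torch.utils.data import DataLoader
from torchgeo.datasets import LandCoverAIGeo, stack_samples
from torchgeo.samplers import RandomGeoSampler

ds = LandCoverAIGeo("data/landcoverai", download=True)
sampler = RandomGeoSampler(ds, size=512, length=1000)  # patch size in CRS units by default
loader = DataLoader(ds, sampler=sampler, collate_fn=stack_samples, batch_size=8)

for batch in loader:
    images, masks = batch["image"], batch["mask"]
    break
```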
Hi guys. It sounds super cool and I'm fully ok with using the dataset (100) for tutorials :) Let me know if you need anything from my side.
Thanks for such a cool dataset :)
And I can see you added the dataset to HuggingFace. Could you change the name to landcoverai100, as it doesn't contain all files?
@adrianboguszewski we're planning on adding all files in the future, but we don't want to rename the repo later since that will break downloads for older versions.
Yeah we should decide now. I think @adamjstewart is thinking we have a single HF repo named landcoverai that contains both landcoverai (full) and landcoverai100, while @adrianboguszewski is thinking we have a landcoverai100 repo and a separate landcoverai repo. I'm fine either way (preference to whatever @adrianboguszewski wants as it is his dataset 😄) as I don't know how much people actually browse through the HF dataset repos (i.e. I'd expect users to just consume both through the torchgeo dataset object with
With these 2 options in place, I prefer to go with one repo landcoverai and 2 datasets inside :) |