Extended semantic segmentation to image segmentation #27039
# Image Segmentation

[[open-in-colab]]

<Youtube id="dKE8SIt9C-w"/>

Image segmentation models separate the areas of an image that correspond to different objects or regions of interest. They work by assigning a label to each pixel. There are several types of segmentation: semantic segmentation, instance segmentation, and panoptic segmentation.

In this guide, we will:

1. [Take a look at different types of segmentation](#types-of-segmentation).
2. [Walk through an end-to-end fine-tuning example for semantic segmentation](#fine-tuning-a-model-for-segmentation).

Before you begin, make sure you have all the necessary libraries installed.

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

```py
>>> from huggingface_hub import notebook_login

>>> notebook_login()
```

## Types of Segmentation

Semantic segmentation assigns a label or class to every single pixel in an image. No distinction is made between instances of the same object: the model assigns the same class to every instance of an object it comes across, so, for example, all cats are labeled "cat" instead of "cat-1" and "cat-2".

We can use Transformers' image segmentation pipeline to quickly run inference with a semantic segmentation model. Let's take a look at the example image.

```python
from transformers import pipeline
from PIL import Image
import requests

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/segmentation_input.jpg"
image = Image.open(requests.get(url, stream=True).raw)
image
```

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/segmentation_input.jpg" alt="Segmentation Input"/>
</div>

We will use [nvidia/segformer-b1-finetuned-cityscapes-1024-1024](https://huggingface.co/nvidia/segformer-b1-finetuned-cityscapes-1024-1024).

```python
semantic_segmentation = pipeline("image-segmentation", "nvidia/segformer-b1-finetuned-cityscapes-1024-1024")
results = semantic_segmentation(image)
results
```

The segmentation pipeline output includes a mask for every predicted class.
```bash
[{'score': None,
  'label': 'road',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'sidewalk',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'building',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'wall',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'pole',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'traffic sign',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'vegetation',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'terrain',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'sky',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': None,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>}]
```

Taking a look at the mask for the car class, we can see every car is classified with the same mask.

```python
results[-1]["mask"]
```

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/semantic_segmentation_output.png" alt="Semantic Segmentation Output"/>
</div>
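
If you'd like to see the mask on top of the original image, the snippet below is one way to overlay it (a minimal sketch for illustration; the blending factor and color are arbitrary choices, not part of the guide):

```python
import numpy as np
from PIL import Image

# Blend the binary "car" mask (an L-mode image) into the input image in red.
car_mask = np.array(results[-1]["mask"]) > 0
overlay = np.array(image.convert("RGB")).astype(float)
overlay[car_mask] = 0.5 * overlay[car_mask] + 0.5 * np.array([255.0, 0.0, 0.0])
Image.fromarray(overlay.astype(np.uint8))
```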

In instance segmentation, the goal is not to classify every pixel, but to predict a mask for **every instance of an object** in a given image. It works very similarly to object detection, except that instead of a bounding box for every instance, there is a segmentation mask. We will use [facebook/mask2former-swin-large-cityscapes-instance](https://huggingface.co/facebook/mask2former-swin-large-cityscapes-instance) for this.

```python
instance_segmentation = pipeline("image-segmentation", "facebook/mask2former-swin-large-cityscapes-instance")
results = instance_segmentation(image)
results
```

As you can see below, multiple cars are classified, and no classification is made for pixels other than those belonging to car and person instances.
```bash
[{'score': 0.999944,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999945,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999652,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.903529,
  'label': 'person',
  'mask': <PIL.Image.Image image mode=L size=612x415>}]
```

Let's check out one of the car masks below.

```python
results[2]["mask"]
```

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/instance_segmentation_output.png" alt="Instance Segmentation Output"/>
</div>
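
Since each instance comes with its own mask and score, you can, for example, count how many instances of each class were detected. This is a small illustrative sketch, not part of the original guide:

```python
from collections import Counter

# Count detected instances per label, e.g. Counter({'car': 3, 'person': 1})
instance_counts = Counter(result["label"] for result in results)
print(instance_counts)
```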

Panoptic segmentation combines semantic segmentation and instance segmentation: every pixel is classified into a class and an instance of that class, and there is a separate mask for each instance of a class. We can use [facebook/mask2former-swin-large-cityscapes-panoptic](https://huggingface.co/facebook/mask2former-swin-large-cityscapes-panoptic) for this.

```python
panoptic_segmentation = pipeline("image-segmentation", "facebook/mask2former-swin-large-cityscapes-panoptic")
results = panoptic_segmentation(image)
results
```

As you can see below, we have more classes. We will later illustrate that every pixel is classified into one of the classes.

```bash
[{'score': 0.999981,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999958,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.99997,
  'label': 'vegetation',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999575,
  'label': 'pole',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999958,
  'label': 'building',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999634,
  'label': 'road',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.996092,
  'label': 'sidewalk',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.999221,
  'label': 'car',
  'mask': <PIL.Image.Image image mode=L size=612x415>},
 {'score': 0.99987,
  'label': 'sky',
  'mask': <PIL.Image.Image image mode=L size=612x415>}]
```
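
One quick way to verify that the panoptic masks cover every pixel is to take the union of all the returned masks and check how much of the image it covers. This is a sketch for illustration and not part of the original guide:

```python
import numpy as np

# Union of all panoptic masks; the covered fraction should be close to 100%
# if every pixel is assigned to some class.
combined = np.zeros(np.array(results[0]["mask"]).shape, dtype=bool)
for result in results:
    combined |= np.array(result["mask"]) > 0
print(f"Covered pixels: {combined.mean():.2%}")
```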

Let's have a side-by-side comparison of all types of segmentation.

<div class="flex justify-center">
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/segmentation-comparison.png" alt="Segmentation Maps Compared"/>
</div>

Review comment (on the comparison image): I'd maybe use the same order you used in the exposition: Reference, Semantic Segmentation, Instance Segmentation, Panoptic Segmentation. The Instance Segmentation output appears to contain more classes than "car" and "person", but the model output above didn't. Perhaps we could make it consistent?

Reply: Surprisingly, that building is classified as a car, and this is one of the best (maybe the best) instance segmentation models on the Hub (Mask2Former). I'd rather not modify?

Now that we have seen all types of segmentation, let's dive into fine-tuning a model for semantic segmentation.

Common real-world applications of semantic segmentation include training self-driving cars to identify pedestrians and important traffic information, identifying cells and abnormalities in medical imagery, and monitoring environmental changes from satellite imagery.

## Fine-tuning a Model for Segmentation

We will now:

1. Finetune [SegFormer](https://huggingface.co/docs/transformers/main/en/model_doc/segformer#segformer) on the [SceneParse150](https://huggingface.co/datasets/scene_parse_150) dataset.
2. Use your fine-tuned model for inference.

<Tip>
The task illustrated in this tutorial is supported by the following model architectures:

<!--This tip is automatically generated by `make fix-copies`, do not fill manually!-->

[BEiT](../model_doc/beit), [Data2VecVision](../model_doc/data2vec-vision), [DPT](../model_doc/dpt), [MobileNetV2](../model_doc/mobilenet_v2), [MobileViT](../model_doc/mobilevit), [MobileViTV2](../model_doc/mobilevitv2), [SegFormer](../model_doc/segformer), [UPerNet](../model_doc/upernet)

<!--End of the generated tip-->

</Tip>

### Load SceneParse150 dataset

Start by loading a smaller subset of the SceneParse150 dataset from the 🤗 Datasets library. This'll give you a chance to experiment and make sure everything works before spending more time training on the full dataset.
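
A minimal sketch of what loading such a subset can look like is shown below (the slice size and train/test split are illustrative, not the guide's exact values):

```py
>>> from datasets import load_dataset

>>> # load a small slice of the training split to experiment with quickly
>>> ds = load_dataset("scene_parse_150", split="train[:50]")

>>> # reserve part of it for evaluation
>>> ds = ds.train_test_split(test_size=0.2)
>>> train_ds = ds["train"]
>>> test_ds = ds["test"]
```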

You'll also want to create a dictionary that maps a label id to a label class, which will be useful when you set up the model later:

```py
>>> # id2label: dictionary mapping label ids to class names, built from the dataset's label mapping
>>> num_labels = len(id2label)
```

### Preprocess

The next step is to load a SegFormer image processor to prepare the images and annotations for the model. Some datasets, like this one, use the zero-index as the background class. However, the background class isn't actually included in the 150 classes, so you'll need to set `reduce_labels=True` to subtract one from all the labels. The zero-index is replaced by `255` so it's ignored by SegFormer's loss function:
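
Loading the processor looks roughly like the sketch below (the `nvidia/mit-b0` checkpoint is an assumption for this sketch; any SegFormer checkpoint works):

```py
>>> from transformers import AutoImageProcessor

>>> checkpoint = "nvidia/mit-b0"  # assumed checkpoint
>>> image_processor = AutoImageProcessor.from_pretrained(checkpoint, reduce_labels=True)
```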

The transform is applied on the fly, which is faster and consumes less disk space.

</tf>
</frameworkcontent>

### Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 [Evaluate](https://huggingface.co/docs/evaluate/index) library. For this task, load the [mean Intersection over Union](https://huggingface.co/spaces/evaluate-metric/mean_iou) (IoU) metric (see the 🤗 Evaluate [quick tour](https://huggingface.co/docs/evaluate/a_quick_tour) to learn more about how to load and compute a metric):
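
Loading the metric is a one-liner, shown here as a sketch for completeness:

```py
>>> import evaluate

>>> metric = evaluate.load("mean_iou")
```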

Your predictions need to be converted to logits first, and then reshaped to match the size of the labels before you can call `compute`:
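
A `compute_metrics` function for this looks roughly like the sketch below (it assumes the `metric`, `num_labels`, and the `255` ignore index from the steps above; treat the details as illustrative):

```py
>>> import numpy as np
>>> import torch
>>> from torch import nn

>>> def compute_metrics(eval_pred):
...     with torch.no_grad():
...         logits, labels = eval_pred
...         logits_tensor = torch.from_numpy(logits)
...         # upsample the logits to the label size, then take the most likely class per pixel
...         logits_tensor = nn.functional.interpolate(
...             logits_tensor,
...             size=labels.shape[-2:],
...             mode="bilinear",
...             align_corners=False,
...         ).argmax(dim=1)
...         pred_labels = logits_tensor.detach().cpu().numpy()
...         metrics = metric.compute(
...             predictions=pred_labels,
...             references=labels,
...             num_labels=num_labels,
...             ignore_index=255,
...             reduce_labels=False,
...         )
...     return {k: v.tolist() if isinstance(v, np.ndarray) else v for k, v in metrics.items()}
```
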
Your `compute_metrics` function is ready to go now, and you'll return to it when you set up your training.

### Train
<frameworkcontent>
<pt>

Congratulations! You have fine-tuned your model and shared it on the 🤗 Hub. You can now use it for inference!

</frameworkcontent>

### Inference

Great, now that you've finetuned a model, you can use it for inference!

Load an image for inference:
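
For instance, you could take an image from the test split loaded earlier (a sketch; the variable names are assumed from the dataset-loading sketch above, not the guide's exact code):

```py
>>> image = test_ds[0]["image"]
>>> image
```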

<frameworkcontent>
<pt>
The simplest way to try out your finetuned model for inference is to use it in a [`pipeline`]. Instantiate a `pipeline` for image segmentation with your model, and pass your image to it:

```py
>>> from transformers import pipeline

>>> segmenter = pipeline("image-segmentation", model="my_awesome_seg_model")
>>> segmenter(image)
[{'score': None,
  'label': 'wall',
  'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062690>},
 {'score': None,
  'label': 'sky',
  'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062A50>},
 {'score': None,
  'label': 'floor',
  'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062B50>},
 {'score': None,
  'label': 'ceiling',
  'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062A10>},
 {'score': None,
  'label': 'bed ',
  'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062E90>},
 {'score': None,
  'label': 'windowpane',
  'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062390>},
 {'score': None,
  'label': 'cabinet',
  'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062550>},
 {'score': None,
  'label': 'chair',
  'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062D90>},
 {'score': None,
  'label': 'armchair',
  'mask': <PIL.Image.Image image mode=L size=640x427 at 0x7FD5B2062E10>}]
```

You can also manually replicate the results of the `pipeline` if you'd like. Process the image with an image processor and place the `pixel_values` on a GPU:

```py
>>> device = torch.device("cuda" if torch.cuda.is_available() else "cpu")  # use GPU if available, otherwise use a CPU
```
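
A typical next step is to encode the image and move the tensors to the device, for example (a sketch; `image_processor` is assumed from the preprocessing step above):

```py
>>> encoding = image_processor(image, return_tensors="pt")
>>> pixel_values = encoding.pixel_values.to(device)
```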

Review comment: Maybe rename the file/URL to `image_segmentation.md`, for consistency with the contents. (Also in the yaml, of course.)

Reply: Done!