
Implementation of SuperPoint and AutoModelForInterestPointDescription #25786

Closed · wants to merge 208 commits

Conversation


@sbucaille sbucaille commented Aug 27, 2023

What does this PR do?

This PR implements SuperPoint, one of the few models that generate keypoints and descriptors given an image, as discussed in this previous pull request.
The goal is to implement this model and a new type of AutoModel: AutoModelForInterestPointDescription (name to be discussed).
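To make the intended usage concrete, here is a minimal sketch; the checkpoint name and the exact output field names are assumptions for illustration, not a settled API:

```python
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForInterestPointDescription

# Hypothetical checkpoint name, for illustration only.
checkpoint = "sbucaille/superpoint"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModelForInterestPointDescription.from_pretrained(checkpoint)

image = Image.open("scene.jpg")
inputs = processor(image, return_tensors="pt")
outputs = model(**inputs)

# Assumed fields of the proposed ImagePointDescriptionOutput dataclass:
# keypoint coordinates, their confidence scores, and descriptor vectors.
print(outputs.keypoints.shape, outputs.scores.shape, outputs.descriptors.shape)
```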

Who can review?

@amyeroberts @ArthurZucker

TODOs

  • Implement SuperPointConfig and SuperPointModel as PretrainedConfig and PreTrainedModel
  • Generate a conversion script for the original weights
  • Implement the new AutoModelForInterestPointDescription mapping
  • Test the model
  • Write documentation

- Added the SuperPointConfig
- Added the SuperPointModel and its implementation
- Added new ImagePointDescriptionOutput dataclass
@sbucaille (Contributor Author)

@amyeroberts I have some questions about the implementation and AutoModel classes.

First of all, I tried to follow the patterns I see in other model implementations (resnet or convnextv2, for example) as much as possible, but unlike those models, SuperPoint really only has one function or "mode": outputting the keypoints, their scores, and descriptors. This is why I only implemented SuperPoint as SuperPointModelForInterestPointDescription, so there is no SuperPointModel anymore. Does that seem OK?

Then I added this SuperPointModelForInterestPointDescription class to a new mapping dictionary in the modeling_auto file and added the appropriate AutoModel class for it. But are these kinds of changes usually generated by an automated script for model registration, or is adding them by hand appropriate?

Finally,

In that PR, we can also add a mapping AutoModelForInterestPointDescription, which we define as taking two images and returning interest keypoints and their descriptions.

Apart from adding the AutoModelForInterestPointDescription, I couldn't find how to define such inputs and outputs. Is it a new pipeline I should define, or something else?

 - Divided SuperPointModel into multiple submodules to support hidden states in the ModelOutput (see ImagePointDescriptionOutput).
 - Added mandatory information to the SuperPointPreTrainedModel class, such as the main input name and the supports_gradient_checkpointing boolean.
 - Added weight initialization
 - Added imports to transformers.__init__.py
@amyeroberts (Collaborator)

Hi @sbucaille,

This is a bit of a special case. For other models which only perform a single task, what we normally do is just have XxxModel. I'd suggest doing this. We can still add AutoModelForInterestPointDescription and have SuperPointModel loaded by it.

@amyeroberts (Collaborator)

@sbucaille From next week, I'll be off for a few weeks. If you have vision-specific questions, please ping @rafaelpadilla; for implementation questions, @ArthurZucker.

@@ -298,3 +298,41 @@ def forward(
last_hidden_state=last_hidden_state,
hidden_states=encoder_outputs.hidden_states,
)


@sbucaille (Contributor Author):

SuperPointModelForInterestPointDescription is more or less the exact same as SuperPointModel.
I just use it for my own understanding of the transformers library, since I'm taking inspiration from other implementations (like ConvNextV2). So either SuperPointModel or SuperPointModelForInterestPointDescription might be deleted later.

(Contributor):

In transformers, the class XXXModel does not contain the head on top, and the XXXForYYY classes add different heads. XXX represents the model name and YYY represents a task.
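To illustrate that convention with placeholder names (a generic sketch, not actual transformers code; real classes subclass PreTrainedModel and take a config object):

```python
import torch.nn as nn

class XXXModel(nn.Module):
    """Bare backbone: returns hidden states, with no task head on top."""
    def __init__(self, hidden_size: int = 64):
        super().__init__()
        self.encoder = nn.Conv2d(3, hidden_size, kernel_size=3, padding=1)

    def forward(self, pixel_values):
        return self.encoder(pixel_values)

class XXXForYYY(nn.Module):
    """Backbone plus a head for task YYY."""
    def __init__(self, hidden_size: int = 64, num_labels: int = 10):
        super().__init__()
        self.model = XXXModel(hidden_size)
        self.head = nn.Conv2d(hidden_size, num_labels, kernel_size=1)  # the task-specific head

    def forward(self, pixel_values):
        return self.head(self.model(pixel_values))
```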

@sbucaille (Contributor Author) commented Oct 14, 2023:

So the proposed implementation is correct regarding this convention? Because I can't really tell what would be considered the head in SuperPoint. Or should SuperPointModel only contain the encoder, and SuperPointForInterestPointDescription contain the keypoint_decoder and descriptor_decoder? And in that case, should SuperPointModel.forward() output a BaseModelOutputWithPoolingAndNoAttention similar to ConvNextV2?

(Contributor):

Looking at the paper and code structure, SuperPoint is a traditional CNN-based model, and I couldn't identify what would count as the head.

The encoder is simply a CNN which returns features (last_hidden). The two decoders, keypoint_decoder and descriptor_decoder, return keypoints+scores and descriptors, respectively. So I think the structure that best fits this case is:

  • The SuperPointModel should only contain the encoder, a SuperPointEncoder object. Then, SuperPointModel.forward() should output a BaseModelOutputWithNoAttention object.
  • The SuperPointForInterestPointDescription should contain the superpoint, a SuperPointModel object, and both keypoint_decoder and descriptor_decoder. Then, SuperPointForInterestPointDescription.forward() should output an ImagePointDescriptionOutput object.

@amyeroberts, what do you think of this structure? Does it make sense?
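A minimal sketch of that proposal, with stub modules standing in for the real CNN blocks (channel sizes here are assumptions; real classes would subclass SuperPointPreTrainedModel and return the proper ModelOutput objects):

```python
import torch.nn as nn

class SuperPointEncoder(nn.Module):
    """Stub for the convolutional backbone that returns features."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(1, 128, kernel_size=3, padding=1)

    def forward(self, pixel_values):
        return self.conv(pixel_values)

class SuperPointModel(nn.Module):
    """Encoder only; forward() would return a BaseModelOutputWithNoAttention."""
    def __init__(self):
        super().__init__()
        self.encoder = SuperPointEncoder()

    def forward(self, pixel_values):
        return self.encoder(pixel_values)

class SuperPointForInterestPointDescription(nn.Module):
    """Backbone plus decoders; forward() would return an ImagePointDescriptionOutput."""
    def __init__(self):
        super().__init__()
        self.superpoint = SuperPointModel()
        self.keypoint_decoder = nn.Conv2d(128, 65, kernel_size=1)     # keypoints + scores
        self.descriptor_decoder = nn.Conv2d(128, 256, kernel_size=1)  # descriptors

    def forward(self, pixel_values):
        features = self.superpoint(pixel_values)
        return self.keypoint_decoder(features), self.descriptor_decoder(features)
```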

(Collaborator):

In this case, I would just have SuperPointModel, which contains the encoder and decoders. In modeling_auto.py, AutoModelForInterestPointDescription should then map to loading this model.
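A hedged sketch of what that registration could look like inside models/auto/modeling_auto.py, following the pattern of the existing task mappings (the mapping name is the one proposed in this PR; _LazyAutoMapping, _BaseAutoModelClass, and CONFIG_MAPPING_NAMES are helpers that file already defines and imports):

```python
from collections import OrderedDict

# Inside modeling_auto.py, alongside the other task mappings:
MODEL_FOR_INTEREST_POINT_DESCRIPTION_MAPPING_NAMES = OrderedDict(
    [
        ("superpoint", "SuperPointModel"),
    ]
)

MODEL_FOR_INTEREST_POINT_DESCRIPTION_MAPPING = _LazyAutoMapping(
    CONFIG_MAPPING_NAMES, MODEL_FOR_INTEREST_POINT_DESCRIPTION_MAPPING_NAMES
)

class AutoModelForInterestPointDescription(_BaseAutoModelClass):
    _model_mapping = MODEL_FOR_INTEREST_POINT_DESCRIPTION_MAPPING
```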

@sbucaille (Contributor Author):

If I understood correctly, I need to remove every mention of SuperPointForInterestPointDescription and only keep SuperPointModel, which can be instantiated by AutoModelForInterestPointDescription? This commit reflects these changes; let me know if I misunderstood something.

@sbucaille (Contributor Author)

Hi @ArthurZucker, I added the SuperPointImageProcessor as part of the code because SuperPoint requires a grayscale image as input. But when I added the tests, test_call_pil fails and gives me a very weird error when it reaches these lines

tests/models/superpoint/test_image_processing_superpoint.py:43: in prepare_image_inputs
    return prepare_image_inputs(
tests/test_image_processing_common.py:64: in prepare_image_inputs
    image_inputs = [Image.fromarray(np.moveaxis(image, 0, -1)) for image in image_inputs]
tests/test_image_processing_common.py:64: in <listcomp>
    image_inputs = [Image.fromarray(np.moveaxis(image, 0, -1)) for image in image_inputs]

with the following error:

            except KeyError as e:
                msg = "Cannot handle this data type: %s, %s" % typekey
>               raise TypeError(msg) from e
E               TypeError: Cannot handle this data type: (1, 1, 1), |u1

I'm not sure what causes the problem, as I tried to compare with the tests for the ConvNextImageProcessor, which does not raise any error.

Anyway, I'll continue on the implementation; let me know if I'm missing anything. I'll write the documentation for all the code I've previously pushed.

@rafaelpadilla (Contributor)

> [quoting @sbucaille's comment above about the failing test_call_pil test]

Hi @sbucaille, 🙂

A quick help with that issue:

I see that your processor converts all images to grayscale (here), and the tests are failing here.

The root cause is that the convert_to_grayscale function returns a 1-channel image (luminance), as here. So, when it is later converted to a numpy array, it ends up as a 1-channel array, making the test fail.

This has been discussed in PR #25767 and is not fully solved yet.

A quick solution for this issue may be possible in your code. A single-channel image is definitely grayscale, but an RGB image whose 3 channels are equal (R==G and G==B) is also considered grayscale. So, if you replicate the channels of your 1-channel grayscale image as in here, this issue can be solved (see the sketch below). However, SuperPoint would need to be adapted for that: you would only need to consider one of the RGB channels, as they are all equal.
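A minimal sketch reproducing the failure and the suggested channel replication (assuming a uint8 array, as in the common test helper):

```python
import numpy as np
from PIL import Image

# Reproduces the failure: PIL cannot build an image from an (H, W, 1) uint8 array.
gray = np.zeros((16, 16, 1), dtype=np.uint8)
# Image.fromarray(gray)  # TypeError: Cannot handle this data type: (1, 1, 1), |u1

# Suggested workaround: replicate the single channel into 3 equal RGB channels,
# so the image stays grayscale in content (R == G == B) but PIL-compatible in shape.
rgb = np.repeat(gray, 3, axis=-1)  # shape (16, 16, 3)
img = Image.fromarray(rgb)         # works
```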

@sbucaille (Contributor Author) commented Sep 23, 2023

Hi @rafaelpadilla and @ArthurZucker,
Thanks @rafaelpadilla for the heads-up; I adapted SuperPointModel and SuperPointImageProcessor to cover this issue. SuperPointImageProcessor now generates a 3-channel grayscale image from a given input image, and SuperPointModel extracts one of the channels to perform the forward method (see the sketch below). It may be necessary to change that in the future if 1-channel images become supported (if that is planned).
I added docs, as well as remaining integration tests for the AutoModelForInterestPointDescription.
I think the implementation is complete.
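For reference, the channel extraction mentioned above can be a one-liner in the forward pass; a sketch (the helper name is mine, not from the PR):

```python
import torch

def extract_grayscale_channel(pixel_values: torch.Tensor) -> torch.Tensor:
    """The processor outputs R == G == B, so any single channel carries the image."""
    return pixel_values[:, 0:1, :, :]  # (batch, 3, H, W) -> (batch, 1, H, W)
```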

@ArthurZucker, please let me know what I'm missing in the implementation! 🙂
I do have some questions:

  • What should I do with SuperPointModel and SuperPointModelForInterestPointDescription, as both are basically the same? Should I only keep the latter?
  • Regarding docs, there is a mention of expected_output in @add_code_sample_docstrings. I decided not to provide this information, since the output is dynamic and depends on the number of keypoints found. Should I keep it like that, or is there a way to provide a "dynamic shape" to this function?
  • Regarding tests, test_model_is_small is failing; what should I do about it? And is test_retain_grad_hidden_states_attentions related to models that can be trained? If so, we should probably skip it (see the sketch below), since SuperPoint can't be trained and also does not have attentions.
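For the last point, the usual pattern in the transformers test suite is to override the common test and skip it; a sketch (assuming the standard ModelTesterMixin setup from tests/test_modeling_common.py):

```python
import unittest

class SuperPointModelTest(ModelTesterMixin, unittest.TestCase):
    # SuperPoint has no attention layers and is not meant to be trained,
    # so this common test does not apply.
    @unittest.skip(reason="SuperPoint does not output attentions")
    def test_retain_grad_hidden_states_attentions(self):
        pass
```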

@sbucaille (Contributor Author) commented Sep 25, 2023

Hi,
When adding the docs to the code this Saturday, I started thinking, maybe late, about the license of SuperPoint, and got an answer from an original contributor, Paul-Edouard Sarlin.
It turns out it can't be used commercially. I am not very familiar with legal matters like this, but does it compromise this PR? Or, from the Hugging Face perspective, is adding the license as in the original repo sufficient? I added it to the model card anyway.

@rafaelpadilla (Contributor)

> [quoting @sbucaille's license question above]

Hi @sbucaille,

It seems that the original code is under the MIT license. If that is the case, you just need to add the MIT license at the top of the files, as done in graphormer and IDEFICS.

The checkpoints seem to be under a customized non-commercial license. So, since you have already added the license here, you just need to set inference: false in the card, as done in owlvit-large-patch14 and musicgen-large (see the sketch below).
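Concretely, the model card front matter would then look something like this (a sketch; only the inference flag and a non-standard license marker are implied by the comment above):

```yaml
---
license: other
inference: false
---
```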

sbucaille and others added 24 commits February 4, 2024 17:53
…rPointEncoder convolution layers are abstracted into SuperPointConvBlock. SuperPointConfig now has different attributes to separate encoder and decoders parameters. convert_superpoint_to_pytorch.py has been changed accordingly.
…uperPoint code to comply with library standard
@sbucaille (Contributor Author)

Hi,
Alright, I can't wrap my head around this git problem; I feel like I've ruined my branch with the manipulations I tried.
What do you think I should do? I'm about to erase my branch completely and create a fresh one from the main branch, onto which I could put back all the work I've done over these past months...

@ydshieh (Collaborator) commented Feb 5, 2024

@sbucaille

Sorry about this, but I'm afraid there is nothing we can do to help in this situation (at least not from my side).
One tip for future PRs: don't use git merge; use git rebase to keep the commits linear (see the sketch below). It will avoid such undesired situations.
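For future reference, a sketch of that rebase workflow (the remote names are assumptions; add_superpoint is this PR's branch):

```bash
git fetch upstream                         # upstream = the huggingface/transformers repo
git rebase upstream/main                   # replay your commits on top of the latest main
# resolve any conflicts, `git add` the files, then:
git rebase --continue
git push --force-with-lease origin add_superpoint
```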

github-actions bot commented Mar 1, 2024

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions github-actions bot closed this Mar 10, 2024
@sbucaille sbucaille deleted the add_superpoint branch March 19, 2024 21:33