Add DAB-DETR Object detection/segmentation model #30803

Merged: 116 commits, Feb 4, 2025
Changes from 7 commits

Commits
8adf1bb
initial commit
conditionedstimulus May 14, 2024
8291122
encoder+decoder layer changes WIP
conditionedstimulus May 16, 2024
09e2516
architecture checks
conditionedstimulus May 21, 2024
8a004cf
working version of detection + segmentation
conditionedstimulus May 24, 2024
defbc43
fix modeling outputs
conditionedstimulus May 25, 2024
5cfbcfc
fix return dict + output att/hs
conditionedstimulus May 26, 2024
6c7564a
found the position embedding masking bug
conditionedstimulus May 27, 2024
35e056f
pre-training version
conditionedstimulus May 28, 2024
24a9d7a
added iamge processors
conditionedstimulus May 29, 2024
d9b7af4
typo in init.py
conditionedstimulus May 29, 2024
a171339
iterupdate set to false
conditionedstimulus May 29, 2024
b8b2201
fixed num_labels in class_output linear layer bias init
conditionedstimulus May 29, 2024
abe0698
multihead attention shape fixes
conditionedstimulus Jun 2, 2024
e60b555
test improvements
conditionedstimulus Jun 10, 2024
6dafb79
test update
conditionedstimulus Jun 11, 2024
5bbdca1
dab-detr model_doc update
conditionedstimulus Jun 12, 2024
4a5ac4f
dab-detr model_doc update2
conditionedstimulus Jun 12, 2024
592796b
test fix:test_retain_grad_hidden_states_attentions
conditionedstimulus Jun 12, 2024
d76fda2
config file clean and renaming variables
conditionedstimulus Jun 17, 2024
ade9720
config file clean and renaming variables fix
conditionedstimulus Jun 17, 2024
6b58e5f
updated convert_to_hf file
conditionedstimulus Jun 17, 2024
eac19f5
small fixes
conditionedstimulus Jun 17, 2024
460e9d6
style and qulity checks
conditionedstimulus Jun 17, 2024
0151f65
Merge branch 'main' into add_dab_detr
conditionedstimulus Jun 17, 2024
97194c7
return_dict fix
conditionedstimulus Jun 20, 2024
3fc56b4
Merge branch main into add_dab_detr
conditionedstimulus Jun 20, 2024
ffbb1dc
Merge branch main into add_dab_detr
conditionedstimulus Jun 20, 2024
a23b173
small comment fix
conditionedstimulus Jun 20, 2024
886087f
skip test_inputs_embeds test
conditionedstimulus Jun 20, 2024
42f469e
image processor updates + image processor test updates
conditionedstimulus Jun 21, 2024
52d1aea
check copies test fix update
conditionedstimulus Jun 21, 2024
7f0ada9
updates for check_copies.py test
conditionedstimulus Jun 21, 2024
28f30aa
updates for check_copies.py test2
conditionedstimulus Jun 21, 2024
b3713d1
tied weights fix
conditionedstimulus Jun 24, 2024
731d0ae
fixed image processing tests and fixed shared weights issues
conditionedstimulus Jun 25, 2024
ae43a4a
Merge branch 'main' into add_dab_detr
conditionedstimulus Jun 25, 2024
f952fd6
added numpy nd array option to get_Expected_values method in test_ima…
conditionedstimulus Jun 25, 2024
6e3af24
delete prints from test file
conditionedstimulus Jun 25, 2024
baa9af7
SafeTensor modification to solve HF Trainer issue
conditionedstimulus Jun 25, 2024
7de850d
removing the safetensor modifications
conditionedstimulus Jun 25, 2024
17ae1c4
make fix copies and hf uplaod has been added.
conditionedstimulus Jul 10, 2024
56f0846
Merge branch 'main' into add_dab_detr
conditionedstimulus Jul 10, 2024
c13a096
fixed index.md
conditionedstimulus Jul 10, 2024
d7e9e22
fixed repo consistency
conditionedstimulus Jul 10, 2024
8bf75c8
styel fix and dabdetrimageprocessor docstring update
conditionedstimulus Jul 10, 2024
b09f996
requested modifications after the first review
conditionedstimulus Jul 29, 2024
8ae2e1b
Update src/transformers/models/dab_detr/image_processing_dab_detr.py
conditionedstimulus Jul 29, 2024
7ba65b1
repo consistency has been fixed
conditionedstimulus Jul 29, 2024
78cedb4
Merge branch 'main' into add_dab_detr
conditionedstimulus Jul 30, 2024
2b37103
update copied NestedTensor function after main merge
conditionedstimulus Jul 30, 2024
8870773
Update src/transformers/models/dab_detr/modeling_dab_detr.py
conditionedstimulus Aug 2, 2024
a402d0d
temp commit
conditionedstimulus Aug 3, 2024
c4bd33d
temp commit2
conditionedstimulus Aug 5, 2024
973db0c
temp commit 3
conditionedstimulus Aug 7, 2024
adebdc1
Merge branch 'main' into add_dab_detr
conditionedstimulus Aug 7, 2024
75a780c
unit tests are fixed
conditionedstimulus Aug 7, 2024
ee7e11b
fixed repo consistency
conditionedstimulus Aug 7, 2024
738a693
updated expected_boxes varible values based on related notebook resul…
conditionedstimulus Aug 8, 2024
01c7702
Merge branch 'main' into add_dab_detr
conditionedstimulus Aug 26, 2024
ce549c5
temporarialy config modifications and repo consistency fixes
conditionedstimulus Aug 26, 2024
38f91f1
Put dilation parameter back to config
conditionedstimulus Sep 10, 2024
b28b2a6
pattern embeddings have been added to the rename_keys method
conditionedstimulus Sep 10, 2024
1dcd978
add dilation comment to config + add as an exception in check_config_…
conditionedstimulus Sep 29, 2024
46eb24c
Merge branch 'main' into add_dab_detr
conditionedstimulus Sep 29, 2024
13af19b
delete FeatureExtractor part from docs.md
conditionedstimulus Sep 29, 2024
b3bf25e
requested modifications in modeling_dab_detr.py
conditionedstimulus Oct 3, 2024
b76a73a
[run_slow] dab_detr
conditionedstimulus Oct 3, 2024
638f8f5
deleted last segmentation code part, updated conversion script and ch…
conditionedstimulus Oct 5, 2024
9d5dafd
Merge branch 'main' into add_dab_detr
conditionedstimulus Oct 5, 2024
049b625
temp commit of requested modifications
conditionedstimulus Oct 12, 2024
6b0fc91
temp commit of requested modifications 2
conditionedstimulus Oct 12, 2024
7f2e2e2
updated config file, resolved codepaths and refactored conversion script
conditionedstimulus Oct 13, 2024
fac9ee9
updated decodelayer block types and refactored conversion script
conditionedstimulus Oct 14, 2024
78004d0
style and quality update
conditionedstimulus Oct 14, 2024
0bf9e3b
Merge branch 'main' into add_dab_detr
conditionedstimulus Oct 14, 2024
95d7a71
small modifications based on the request
conditionedstimulus Oct 28, 2024
2663c26
attentions are refactored
conditionedstimulus Oct 31, 2024
724e767
Merge branch 'main' into add_dab_detr
conditionedstimulus Nov 1, 2024
04d3e31
removed loss functions from modeling file, added loss function to los…
conditionedstimulus Nov 1, 2024
0122e62
deleted imageprocessor
conditionedstimulus Nov 3, 2024
53e2bd2
fixed conversion script + quality and style
conditionedstimulus Nov 3, 2024
4fd9bfc
fixed config_att
conditionedstimulus Nov 3, 2024
e32cf92
Merge branch 'main' into add_dab_detr
conditionedstimulus Nov 3, 2024
9345341
[run_slow] dab_detr
conditionedstimulus Nov 3, 2024
3ef47cf
changing model path in conversion file and in test file
conditionedstimulus Nov 3, 2024
dc9f359
fix Decoder variable naming
conditionedstimulus Nov 5, 2024
93ec65e
testing the old loss function
conditionedstimulus Nov 6, 2024
c73c0fa
switched back to the new loss function and testing with the odl atten…
conditionedstimulus Nov 6, 2024
e69545d
switched back to the new last good result modeling file
conditionedstimulus Nov 6, 2024
61c5189
moved back to the version when I asked the review
conditionedstimulus Nov 6, 2024
a310f6a
missing new line at the end of the file
conditionedstimulus Nov 6, 2024
464ac93
Merge branch 'main' into add_dab_detr
conditionedstimulus Dec 21, 2024
fc0ced6
old version test
conditionedstimulus Dec 21, 2024
7bf5267
turn back to newest mdoel versino but change image processor
conditionedstimulus Dec 21, 2024
d94baf4
style fix
conditionedstimulus Dec 25, 2024
e9f7772
Merge branch 'main' into add_dab_detr
conditionedstimulus Dec 25, 2024
1a08500
style fix after merge main
conditionedstimulus Dec 25, 2024
c2f45a4
[run_slow] dab_detr
conditionedstimulus Dec 25, 2024
0eea3b5
Merge branch 'main' into add_dab_detr
conditionedstimulus Jan 22, 2025
194f62d
[run_slow] dab_detr
conditionedstimulus Jan 22, 2025
7b58b25
added device and type for head bias data part
conditionedstimulus Jan 23, 2025
675f3fd
Merge branch 'main' into add_dab_detr
conditionedstimulus Jan 23, 2025
7c1161b
[run_slow] dab_detr
conditionedstimulus Jan 23, 2025
0967981
fixed model head bias data fill
conditionedstimulus Jan 23, 2025
ac8f4cb
changed test_inference_object_detection_head assertTrues to torch tes…
conditionedstimulus Jan 28, 2025
3cf9b99
Merge branch 'main' into add_dab_detr
conditionedstimulus Jan 28, 2025
3931e5c
Merge branch 'main' into add_dab_detr
conditionedstimulus Jan 31, 2025
ed7f8f5
fixes part 1
conditionedstimulus Jan 31, 2025
c962ef1
Merge branch 'add_dab_detr' of https://github.com/conditionedstimulus…
conditionedstimulus Jan 31, 2025
e08e6f8
quality update
conditionedstimulus Jan 31, 2025
3f8981b
self.bbox_embed in decoder has been restored
conditionedstimulus Jan 31, 2025
52e5131
Merge branch 'main' into add_dab_detr
conditionedstimulus Feb 1, 2025
757f413
changed Assert true torch closeall methods to torch testing assertclose
conditionedstimulus Feb 1, 2025
f1ba30e
modelcard markdown file has been updated
conditionedstimulus Feb 1, 2025
46710c3
deleted intemediate list from decoder module
conditionedstimulus Feb 3, 2025
350e6af
Merge branch 'main' into add_dab_detr
conditionedstimulus Feb 3, 2025
45 changes: 42 additions & 3 deletions docs/source/en/model_doc/dab-detr.md
@@ -21,7 +21,7 @@ rendered properly in your Markdown viewer.
The DAB-DETR model was proposed in [DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR](https://arxiv.org/abs/2201.12329) by Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, Lei Zhang.
DAB-DETR is an enhanced variant of Conditional DETR. It utilizes dynamically updated anchor boxes to provide both a reference query point (x, y) and a reference anchor size (w, h), improving cross-attention computation. This new approach achieves 45.7% AP when trained for 50 epochs with a single ResNet-50 model as the backbone.

<img src="https://github.com/conditionedstimulus/hf_media/blob/main/dab_detr_convergence_plot.png"
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/dab_detr_convergence_plot.png"
alt="drawing" width="600"/>

The abstract from the paper is the following:
@@ -42,13 +42,52 @@ experiments to confirm our analysis and verify the effectiveness of our methods.
This model was contributed by [davidhajdu](https://huggingface.co/davidhajdu).
The original code can be found [here](https://github.com/IDEA-Research/DAB-DETR).

There are three ways to instantiate a DAB-DETR model (depending on what you prefer):
## How to Get Started with the Model

Use the code below to get started with the model.

```python
import torch
import requests

from PIL import Image
from transformers import AutoModelForObjectDetection, AutoImageProcessor

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)

image_processor = AutoImageProcessor.from_pretrained("IDEA-Research/dab-detr-resnet-50")
model = AutoModelForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50")

inputs = image_processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

results = image_processor.post_process_object_detection(outputs, target_sizes=torch.tensor([image.size[::-1]]), threshold=0.3)

for result in results:
    for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
        score, label = score.item(), label_id.item()
        box = [round(i, 2) for i in box.tolist()]
        print(f"{model.config.id2label[label]}: {score:.2f} {box}")
```
This should output
```
cat: 0.87 [14.7, 49.39, 320.52, 469.28]
remote: 0.86 [41.08, 72.37, 173.39, 117.2]
cat: 0.86 [344.45, 19.43, 639.85, 367.86]
remote: 0.61 [334.27, 75.93, 367.92, 188.81]
couch: 0.59 [-0.04, 1.34, 639.9, 477.09]
```

There are three other ways to instantiate a DAB-DETR model (depending on what you prefer):

Option 1: Instantiate DAB-DETR with pre-trained weights for entire model
```py
>>> from transformers import DabDetrForObjectDetection

>>> model = DabDetrForObjectDetection.from_pretrained("IDEA-Research/dab_detr_resnet50")
>>> model = DabDetrForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50")
```

Option 2: Instantiate DAB-DETR with randomly initialized weights for Transformer, but pre-trained weights for backbone
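
The code for Option 2 (and the Option 3 that follows it) is collapsed in this diff view. As a rough sketch of what these options usually look like, modeled on the analogous DETR documentation rather than this PR's exact wording (treat argument names such as `use_pretrained_backbone` as assumptions):

```py
>>> from transformers import DabDetrConfig, DabDetrForObjectDetection

>>> # Option 2 (sketch): pre-trained backbone weights, randomly initialized Transformer
>>> config = DabDetrConfig()
>>> model = DabDetrForObjectDetection(config)

>>> # Option 3 (sketch): randomly initialized weights for both backbone and Transformer
>>> config = DabDetrConfig(use_pretrained_backbone=False)
>>> model = DabDetrForObjectDetection(config)
```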
Collaborator

❤️ a lot better thanks!

@@ -44,7 +44,7 @@
r"transformer\.decoder\.ref_point_head\.layers\.(\d+)\.(bias|weight)": r"decoder.ref_point_head.layers.\1.\2",
r"transformer\.decoder\.ref_anchor_head\.layers\.(\d+)\.(bias|weight)": r"decoder.ref_anchor_head.layers.\1.\2",
r"transformer\.decoder\.query_scale\.layers\.(\d+)\.(bias|weight)": r"decoder.query_scale.layers.\1.\2",
r"transformer\.decoder\.layers\.0\.ca_qpos_proj\.(bias|weight)": r"decoder.layers.0.layer.1.cross_attn_query_pos_proj.\1",
r"transformer\.decoder\.layers\.0\.ca_qpos_proj\.(bias|weight)": r"decoder.layers.0.cross_attn.cross_attn_query_pos_proj.\1",
# encoder layers: output projection, 2 feedforward neural networks and 2 layernorms + activation function
# output projection
r"transformer\.encoder\.layers\.(\d+)\.self_attn\.out_proj\.(bias|weight)": r"encoder.layers.\1.self_attn.out_proj.\2",
@@ -59,30 +59,30 @@
r"transformer\.encoder\.layers\.(\d+)\.activation\.weight": r"encoder.layers.\1.activation_fn.weight",
#########################################################################################################################################
# decoder layers: 2 times output projection, 2 feedforward neural networks and 3 layernorms + activiation function weight
r"transformer\.decoder\.layers\.(\d+)\.self_attn\.out_proj\.(bias|weight)": r"decoder.layers.\1.layer.0.self_attn.output_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.cross_attn\.out_proj\.(bias|weight)": r"decoder.layers.\1.layer.1.cross_attn.output_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.self_attn\.out_proj\.(bias|weight)": r"decoder.layers.\1.self_attn.self_attn.output_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.cross_attn\.out_proj\.(bias|weight)": r"decoder.layers.\1.cross_attn.cross_attn.output_proj.\2",
# FFNs
r"transformer\.decoder\.layers\.(\d+)\.linear(\d)\.(bias|weight)": r"decoder.layers.\1.layer.2.fc\2.\3",
r"transformer\.decoder\.layers\.(\d+)\.linear(\d)\.(bias|weight)": r"decoder.layers.\1.mlp.fc\2.\3",
# nm1
r"transformer\.decoder\.layers\.(\d+)\.norm1\.(bias|weight)": r"decoder.layers.\1.layer.0.self_attn_layer_norm.\2",
r"transformer\.decoder\.layers\.(\d+)\.norm1\.(bias|weight)": r"decoder.layers.\1.self_attn.self_attn_layer_norm.\2",
# nm2
r"transformer\.decoder\.layers\.(\d+)\.norm2\.(bias|weight)": r"decoder.layers.\1.layer.1.cross_attn_layer_norm.\2",
r"transformer\.decoder\.layers\.(\d+)\.norm2\.(bias|weight)": r"decoder.layers.\1.cross_attn.cross_attn_layer_norm.\2",
# nm3
r"transformer\.decoder\.layers\.(\d+)\.norm3\.(bias|weight)": r"decoder.layers.\1.layer.2.final_layer_norm.\2",
r"transformer\.decoder\.layers\.(\d+)\.norm3\.(bias|weight)": r"decoder.layers.\1.mlp.final_layer_norm.\2",
# activation function weight
r"transformer\.decoder\.layers\.(\d+)\.activation\.weight": r"decoder.layers.\1.layer.2.activation_fn.weight",
r"transformer\.decoder\.layers\.(\d+)\.activation\.weight": r"decoder.layers.\1.mlp.activation_fn.weight",
# q, k, v projections and biases in self-attention in decoder
r"transformer\.decoder\.layers\.(\d+)\.sa_qcontent_proj\.(bias|weight)": r"decoder.layers.\1.layer.0.self_attn_query_content_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.sa_kcontent_proj\.(bias|weight)": r"decoder.layers.\1.layer.0.self_attn_key_content_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.sa_qpos_proj\.(bias|weight)": r"decoder.layers.\1.layer.0.self_attn_query_pos_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.sa_kpos_proj\.(bias|weight)": r"decoder.layers.\1.layer.0.self_attn_key_pos_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.sa_v_proj\.(bias|weight)": r"decoder.layers.\1.layer.0.self_attn_value_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.sa_qcontent_proj\.(bias|weight)": r"decoder.layers.\1.self_attn.self_attn_query_content_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.sa_kcontent_proj\.(bias|weight)": r"decoder.layers.\1.self_attn.self_attn_key_content_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.sa_qpos_proj\.(bias|weight)": r"decoder.layers.\1.self_attn.self_attn_query_pos_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.sa_kpos_proj\.(bias|weight)": r"decoder.layers.\1.self_attn.self_attn_key_pos_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.sa_v_proj\.(bias|weight)": r"decoder.layers.\1.self_attn.self_attn_value_proj.\2",
# q, k, v projections in cross-attention in decoder
r"transformer\.decoder\.layers\.(\d+)\.ca_qcontent_proj\.(bias|weight)": r"decoder.layers.\1.layer.1.cross_attn_query_content_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.ca_kcontent_proj\.(bias|weight)": r"decoder.layers.\1.layer.1.cross_attn_key_content_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.ca_kpos_proj\.(bias|weight)": r"decoder.layers.\1.layer.1.cross_attn_key_pos_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.ca_v_proj\.(bias|weight)": r"decoder.layers.\1.layer.1.cross_attn_value_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.ca_qpos_sine_proj\.(bias|weight)": r"decoder.layers.\1.layer.1.cross_attn_query_pos_sine_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.ca_qcontent_proj\.(bias|weight)": r"decoder.layers.\1.cross_attn.cross_attn_query_content_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.ca_kcontent_proj\.(bias|weight)": r"decoder.layers.\1.cross_attn.cross_attn_key_content_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.ca_kpos_proj\.(bias|weight)": r"decoder.layers.\1.cross_attn.cross_attn_key_pos_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.ca_v_proj\.(bias|weight)": r"decoder.layers.\1.cross_attn.cross_attn_value_proj.\2",
r"transformer\.decoder\.layers\.(\d+)\.ca_qpos_sine_proj\.(bias|weight)": r"decoder.layers.\1.cross_attn.cross_attn_query_pos_sine_proj.\2",
}
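
For orientation, a rename map like this is normally applied to the original checkpoint's state dict with a small regex loop. The helper below is a minimal sketch under that assumption (`rename_state_dict_keys` is a hypothetical name, not the PR's actual conversion code):

```python
import re


def rename_state_dict_keys(state_dict: dict, key_mapping: dict) -> dict:
    """Rename checkpoint keys using the first regex pattern that fully matches."""
    renamed = {}
    for old_key, value in state_dict.items():
        new_key = old_key
        for pattern, replacement in key_mapping.items():
            if re.fullmatch(pattern, old_key):
                new_key = re.sub(pattern, replacement, old_key)
                break
        renamed[new_key] = value
    return renamed
```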


70 changes: 42 additions & 28 deletions src/transformers/models/dab_detr/modeling_dab_detr.py
Collaborator

A lot of utils don't really make sense to have here IMO. @qubvel, let's guide the contributor toward giving standard losses a place to live, WDYT? (it's not model specific, more like training specific)

@@ -772,10 +772,9 @@ def forward(
class DabDetrDecoderLayer(nn.Module):
def __init__(self, config: DabDetrConfig, is_first: bool = False):
super().__init__()
self.layer = nn.ModuleList()
self.layer.append(DabDetrDecoderLayerSelfAttention(config))
self.layer.append(DabDetrDecoderLayerCrossAttention(config, is_first))
self.layer.append(DabDetrDecoderLayerFFN(config))
self.self_attn = DabDetrDecoderLayerSelfAttention(config)
self.cross_attn = DabDetrDecoderLayerCrossAttention(config, is_first)
self.mlp = DabDetrDecoderLayerFFN(config)

def forward(
self,
@@ -810,14 +809,14 @@ def forward(
returned tensors for more detail.

"""
hidden_states, self_attn_weights = self.layer[0](
hidden_states, self_attn_weights = self.self_attn(
hidden_states=hidden_states,
query_position_embeddings=query_position_embeddings,
attention_mask=attention_mask,
output_attentions=output_attentions,
)

hidden_states, cross_attn_weights = self.layer[1](
hidden_states, cross_attn_weights = self.cross_attn(
hidden_states=hidden_states,
encoder_hidden_states=encoder_hidden_states,
query_position_embeddings=query_position_embeddings,
@@ -827,7 +826,7 @@
output_attentions=output_attentions,
)

hidden_states = self.layer[2](hidden_states=hidden_states)
hidden_states = self.mlp(hidden_states=hidden_states)

outputs = (hidden_states,)

@@ -973,6 +972,7 @@ def __init__(self, config: DabDetrConfig):
self.query_scale = DabDetrMLP(config.hidden_size, config.hidden_size, config.hidden_size, 2)
self.layers = nn.ModuleList([DabDetrEncoderLayer(config) for _ in range(config.encoder_layers)])
self.norm = nn.LayerNorm(config.hidden_size) if config.normalize_before else None
self.gradient_checkpointing = False

# Initialize weights and apply final processing
self.post_init()
@@ -1032,14 +1032,24 @@ def forward(
encoder_states = encoder_states + (hidden_states,)
# pos scaler
pos_scales = self.query_scale(hidden_states)
scaled_object_queries = object_queries * pos_scales
# we add object_queries * pos_scaler as extra input to the encoder_layer
layer_outputs = encoder_layer(
hidden_states,
attention_mask=attention_mask,
object_queries=scaled_object_queries,
output_attentions=output_attentions,
)
scaled_object_queries = object_queries * pos_scales

if self.gradient_checkpointing and self.training:
layer_outputs = self._gradient_checkpointing_func(
encoder_layer.__call__,
hidden_states,
attention_mask,
scaled_object_queries,
output_attentions,
)
else:
layer_outputs = encoder_layer(
hidden_states,
attention_mask=attention_mask,
object_queries=scaled_object_queries,
output_attentions=output_attentions,
)

hidden_states = layer_outputs[0]

@@ -1178,10 +1188,14 @@ def forward(
# apply transformation
query_sine_embed = query_sine_embed[..., : self.hidden_size] * pos_transformation

# modulated HW attentions
refHW_cond = self.ref_anchor_head(hidden_states).sigmoid() # nq, bs, 2
query_sine_embed[..., self.hidden_size // 2 :] *= (refHW_cond[..., 0] / obj_center[..., 2]).unsqueeze(-1)
query_sine_embed[..., : self.hidden_size // 2] *= (refHW_cond[..., 1] / obj_center[..., 3]).unsqueeze(-1)
# modulated Height Width attentions
reference_anchor_size = self.ref_anchor_head(hidden_states).sigmoid() # nq, bs, 2
query_sine_embed[..., self.hidden_size // 2 :] *= (
reference_anchor_size[..., 0] / obj_center[..., 2]
).unsqueeze(-1)
query_sine_embed[..., : self.hidden_size // 2] *= (
reference_anchor_size[..., 1] / obj_center[..., 3]
).unsqueeze(-1)

if self.gradient_checkpointing and self.training:
layer_outputs = self._gradient_checkpointing_func(
@@ -1227,10 +1241,10 @@ def forward(
if encoder_hidden_states is not None:
all_cross_attentions += (layer_outputs[2],)

if self.layernorm is not None:
hidden_states = self.layernorm(hidden_states)
intermediate.pop()
intermediate.append(hidden_states)
# Layer normalization on hidden states and add it to the intermediate list
hidden_states = self.layernorm(hidden_states)
intermediate.pop()
intermediate.append(hidden_states)
Collaborator

`intermediate_state = self.layernorm(hidden_states)`
`intermediate.append(intermediate_state)`
vs
`intermediate.append(self.layernorm(hidden_states))`

Collaborator

will avoid this ugly pop append

Contributor Author

I removed the list manipulation entirely. I didn’t revisit the original code, but as I recall, this was part of a conditional section. Since we removed many configurations, the list manipulation remained unchanged—popping the last element and appending the same value back. So, I only kept the hidden states layer normalization.
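
In code, the change described above amounts to the following (a minimal, self-contained sketch for illustration only, with toy shapes and stand-in names, not the PR's exact lines): popping the last entry and appending the freshly normalized tensor is just "replace the last entry", and once the conditional code paths were dropped, only the layer normalization itself needed to stay.

```python
import torch
from torch import nn

# Toy stand-ins for the decoder's final layernorm and the running list of
# intermediate hidden states (shapes are illustrative only).
layernorm = nn.LayerNorm(4)
hidden_states = torch.randn(2, 4)
intermediate = [hidden_states]

# Before (sketch): normalize, then pop the last entry and append the result back.
normalized = layernorm(hidden_states)
intermediate.pop()
intermediate.append(normalized)

# After (sketch): the pop/append pair collapses away; keep only the normalization.
hidden_states = layernorm(hidden_states)
```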


if output_hidden_states:
all_hidden_states += (hidden_states,)
@@ -1302,7 +1316,7 @@ def __init__(self, config: DabDetrConfig):

self.num_patterns = config.num_patterns
if not isinstance(self.num_patterns, int):
Warning("num_patterns should be int but {}".format(type(self.num_patterns)))
logger.warning("num_patterns should be int but {}".format(type(self.num_patterns)))
self.num_patterns = 0
if self.num_patterns > 0:
self.patterns = nn.Embedding(self.num_patterns, self.hidden_size)
@@ -1609,8 +1623,8 @@ def forward(
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> image_processor = AutoImageProcessor.from_pretrained("IDEA-Research/dab_detr-base")
>>> model = AutoModelForObjectDetection.from_pretrained("IDEA-Research/dab_detr-base")
>>> image_processor = AutoImageProcessor.from_pretrained("IDEA-Research/dab-detr-resnet-50")
>>> model = AutoModelForObjectDetection.from_pretrained("IDEA-Research/dab-detr-resnet-50")

>>> inputs = image_processor(images=image, return_tensors="pt")

@@ -1658,9 +1672,9 @@ def forward(
logits = self.class_embed(intermediate_hidden_states[-1])

reference_before_sigmoid = inverse_sigmoid(reference_points)
tmp = self.bbox_predictor(intermediate_hidden_states)
tmp[..., : self.query_dim] += reference_before_sigmoid
outputs_coord = tmp.sigmoid()
bbox_with_refinement = self.bbox_predictor(intermediate_hidden_states)
bbox_with_refinement[..., : self.query_dim] += reference_before_sigmoid
outputs_coord = bbox_with_refinement.sigmoid()

pred_boxes = outputs_coord[-1]

22 changes: 13 additions & 9 deletions tests/models/dab_detr/test_modeling_dab_detr.py
Collaborator

Usually more readable when EXPECTED values are in full caps, but not blocking

@@ -295,10 +295,11 @@ def recursive_check(tuple_object, dict_object):
elif tuple_object is None:
return
else:
self.assertTrue(
torch.allclose(
set_nan_tensor_to_zero(tuple_object), set_nan_tensor_to_zero(dict_object), atol=1e-5
),
torch.testing.assert_close(
set_nan_tensor_to_zero(tuple_object),
set_nan_tensor_to_zero(dict_object),
atol=1e-5,
rtol=1e-5,
msg=(
"Tuple and dict output are not equal. Difference:"
f" {torch.max(torch.abs(tuple_object - dict_object))}. Tuple has `nan`:"
@@ -735,8 +736,11 @@ def test_initialization(self):
# Modifed from RT-DETR
elif "class_embed" in name and "bias" in name:
bias_tensor = torch.full_like(param.data, bias_value)
self.assertTrue(
torch.allclose(param.data, bias_tensor, atol=1e-4),
torch.testing.assert_close(
param.data,
bias_tensor,
atol=1e-4,
rtol=1e-4,
msg=f"Parameter {name} of model {model_class} seems not properly initialized",
)
elif "activation_fn" in name and config.activation_function == "prelu":
@@ -793,7 +797,7 @@ def test_inference_no_head(self):
expected_slice = torch.tensor(
[[-0.4879, -0.2594, 0.4524], [-0.4997, -0.4258, 0.4329], [-0.8220, -0.4996, 0.0577]]
).to(torch_device)
self.assertTrue(torch.allclose(outputs.last_hidden_state[0, :3, :3], expected_slice, atol=2e-4))
torch.testing.assert_close(outputs.last_hidden_state[0, :3, :3], expected_slice, atol=2e-4, rtol=2e-4)

def test_inference_object_detection_head(self):
model = DabDetrForObjectDetection.from_pretrained(CHECKPOINT).to(torch_device)
@@ -812,14 +816,14 @@ def test_inference_object_detection_head(self):
expected_slice_logits = torch.tensor(
[[-10.1765, -5.5243, -8.9324], [-9.8138, -5.6721, -7.5161], [-10.3054, -5.6081, -8.5931]]
).to(torch_device)
self.assertTrue(torch.allclose(outputs.logits[0, :3, :3], expected_slice_logits, atol=3e-4))
torch.testing.assert_close(outputs.logits[0, :3, :3], expected_slice_logits, atol=3e-4, rtol=3e-4)

expected_shape_boxes = torch.Size((1, model.config.num_queries, 4))
self.assertEqual(outputs.pred_boxes.shape, expected_shape_boxes)
expected_slice_boxes = torch.tensor(
[[0.3708, 0.3000, 0.2753], [0.5211, 0.6125, 0.9495], [0.2897, 0.6730, 0.5459]]
).to(torch_device)
self.assertTrue(torch.allclose(outputs.pred_boxes[0, :3, :3], expected_slice_boxes, atol=1e-4))
torch.testing.assert_close(outputs.pred_boxes[0, :3, :3], expected_slice_boxes, atol=1e-4, rtol=1e-4)

# verify postprocessing
results = image_processor.post_process_object_detection(