This pertains to the cross-attention between queries and encoder feature maps in deformable_transformer.py. The reference_points for the deformable attention operation appear to be interpreted as (x,y) coordinates, since they are multiplied by src_valid_ratios in line 339, which has a (width, height) format (as evident from the get_valid_ratio method, line 125).
However, the src_spatial_shapes tensor contains dimensions in (height, width) format (see lines 138 and 139). Then, in ms_deform_attn.py, the reference points and offsets are combined as follows:
So it appears that the sampling_offsets / input_spatial_shapes[None, None, None, :, None, :] part assumes coordinates to be in (y,x) format whereas reference_points[:, :, None, :, None, :] appears to be in (x,y) format.
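To make the suspected mismatch concrete, here is a minimal sketch in plain Python. It is not the repository code: the feature map size, the helper `sampling_location`, and the assumption that the offsets are meant in (x,y) order like the reference points are all hypothetical and only serve as illustration.

```python
# Hypothetical feature map size for a single level (not taken from the repo).
H, W = 100, 200

# Spatial shape stored as (height, width), as reported for src_spatial_shapes.
spatial_shape_hw = (H, W)

# Reference point in normalized (x, y) format: the centre of the feature map.
ref_xy = (0.5, 0.5)

# Assumed offset in (x, y) order: 10 pixels along x, 0 pixels along y.
offset_xy = (10.0, 0.0)


def sampling_location(ref, offset, normalizer):
    """Elementwise ref + offset / normalizer (hypothetical helper)."""
    return tuple(r + o / n for r, o, n in zip(ref, offset, normalizer))


# Normalizing by (H, W): the x offset is divided by the height.
mismatched = sampling_location(ref_xy, offset_xy, spatial_shape_hw)

# Normalizing by (W, H): the x offset is divided by the width.
consistent = sampling_location(ref_xy, offset_xy, (W, H))

print("normalized by (H, W):", mismatched)
print("normalized by (W, H):", consistent)
```

Under these assumptions, a 10-pixel shift along x moves the sampling location by 10/100 of the image width when divided by (height, width), but by 10/200 when divided by (width, height). On square feature maps the two agree, which could be why such a mismatch would be easy to miss.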
Of course I'm not sure if I followed the code correctly here (I could have missed something), but just wanted to bring this to the authors' attention in case they did not know already.
@Ali2500 Thank you for your kind reminder. This is indeed a mistake made in this released version, but not in the version used in our paper. I will update the code and pre-trained models soon. Thanks again. I will close this issue after the code is updated.