The out_channels attribute of the backbones is used to define how to connect the model to necks and decoders. The usual way to connect Prithvi to a 2D segmentation decoder is through the neck ReshapeTokensToImage, which rearranges the output of the transformer with shape (batch, num_tokens, embed_dim) to be compatible with the decoder convolutional layers, which expect something like (batch, channels, H, W).
Since Prithvi is a temporal model (it can consume a sequence of images), the output of the model contains tokens for the entire input sequence. This means that when using the neck ReshapeTokensToImage with the correct effective_time_dim, the model output is rearranged to (batch, T*embed_dim, H, W), effectively increasing the number of channels that are connected to the decoder.
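A minimal sketch of that rearrangement, assuming time-major token ordering and a cls_token that was already removed (the tensor names and values are illustrative, not the exact terratorch implementation):

```python
import torch
from einops import rearrange

batch, embed_dim = 2, 768
T, H, W = 3, 14, 14  # effective_time_dim and the spatial patch grid (illustrative values)

# Backbone output: one token per (time step, patch).
tokens = torch.randn(batch, T * H * W, embed_dim)  # (batch, num_tokens, embed_dim)

# What ReshapeTokensToImage effectively does: fold the time dimension into the channels.
feature_map = rearrange(tokens, "b (t h w) e -> b (t e) h w", t=T, h=H, w=W)
print(feature_map.shape)  # torch.Size([2, 2304, 14, 14]) -> T * embed_dim channels
```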
The current solution is to make the out_channels attribute proportional to effective_time_dim. However, the actual output of the model is (batch, num_tokens, embed_dim), not (batch, num_tokens, embed_dim * effective_time_dim). The number of channels only changes after the neck ReshapeTokensToImage rearranges the model output, which I believe makes the neck, not the backbone, responsible for modifying the number of channels.
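Concretely, the mismatch looks something like this (illustrative values, not the exact terratorch code):

```python
embed_dim, effective_time_dim = 768, 3

# What the backbone currently advertises to downstream decoders:
out_channels = embed_dim * effective_time_dim  # 2304

# What a backbone forward pass actually returns per feature:
#   (batch, num_tokens, embed_dim)  ->  only 768 channels per token.
# The 2304 channels only exist after ReshapeTokensToImage folds the time
# dimension into the channel dimension, i.e. the channel change is a neck effect.
```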
The neck should also allow for a different effective_time_dim, as, in theory, the sequence of images could have variable length during inference. Also, it currently assumes the cls_token is the first token to be removed, which I'm not sure is always the case.
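That assumption presumably boils down to a hard-coded slice of the first token. A hedged sketch of a more defensive variant (the function and argument names are hypothetical, not an existing terratorch API):

```python
import torch

def drop_prefix_tokens(tokens: torch.Tensor, num_prefix_tokens: int = 1) -> torch.Tensor:
    """Remove prefix tokens (e.g. a cls_token) from (batch, num_tokens, embed_dim).

    Today the neck effectively does tokens[:, 1:, :], i.e. it assumes exactly one
    prefix token sitting in the first position. Making the count explicit (and
    allowing 0) would let it handle backbones without a cls_token.
    """
    return tokens[:, num_prefix_tokens:, :]
```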
I think these are very important points for multi-temporal data. To put this into a few requirements, I think we should:
1. enable the prediction of multiple masks per time step, as currently only one mask is provided (not sure if this case is very relevant),
2. enable variable time series lengths, as this might be relevant for inference (e.g. via a neck that applies a mean across the time dimension, or via 1. and aggregation over all predicted masks; see the sketch after this list),
3. automatically infer whether a cls_token is provided or not (e.g. check if an error occurs in the forward pass and set cls_token=False if it then works),
4. somehow handle how out_channels are changed by necks.
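For point 2, one option is a neck that averages tokens over the time dimension before the reshape, so the channel count stays embed_dim regardless of the sequence length. A minimal sketch, assuming time-major token ordering and no remaining cls_token (the class and argument names are hypothetical, not an existing terratorch API):

```python
import torch
from einops import reduce


class TemporalMeanNeck(torch.nn.Module):
    """Average tokens over the time dimension, then reshape to a 2D feature map.

    Input : (batch, T * H * W, embed_dim), with T inferred at runtime from H and W.
    Output: (batch, embed_dim, H, W), independent of the sequence length T.
    """

    def __init__(self, h: int, w: int):
        super().__init__()
        self.h, self.w = h, w

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        t = tokens.shape[1] // (self.h * self.w)  # infer T from the token count
        return reduce(tokens, "b (t h w) e -> b e h w", "mean", t=t, h=self.h, w=self.w)


# Usage: the decoder always sees embed_dim channels, whatever T was at inference time.
neck = TemporalMeanNeck(h=14, w=14)
print(neck(torch.randn(2, 3 * 14 * 14, 768)).shape)  # torch.Size([2, 768, 14, 14])
print(neck(torch.randn(2, 5 * 14 * 14, 768)).shape)  # torch.Size([2, 768, 14, 14])
```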