
out_channels and ReshapeTokensToImage for temporal tasks #402

Open
daniszw opened this issue Feb 4, 2025 · 1 comment
daniszw commented Feb 4, 2025

The out_channels attribute of the backbones is used to define how to connect the model to necks and decoders. The usual way to connect Prithvi to a 2D segmentation decoder is through the ReshapeTokensToImage neck, which rearranges the transformer output of shape (batch, num_tokens, embed_dim) into the (batch, channels, H, W) layout that the decoder's convolutional layers expect.
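As an illustrative sketch (not the actual TerraTorch implementation), the single-frame rearrangement could look like this; the function name, the square-grid assumption, and the cls-token-first assumption are all hypothetical:

```python
import torch

def reshape_tokens_to_image(x, remove_cls_token=True):
    # Sketch of a ReshapeTokensToImage-style neck.
    # Input:  (batch, num_tokens, embed_dim) from a ViT backbone.
    # Output: (batch, embed_dim, H, W) for convolutional decoders.
    if remove_cls_token:
        x = x[:, 1:, :]  # assumes the cls_token is the first token
    b, n, c = x.shape
    h = w = int(n ** 0.5)  # assumes a square patch grid
    return x.transpose(1, 2).reshape(b, c, h, w)

tokens = torch.randn(2, 1 + 14 * 14, 768)  # e.g. ViT-B with a 14x14 grid
img = reshape_tokens_to_image(tokens)
print(img.shape)  # torch.Size([2, 768, 14, 14])
```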

Since Prithvi is a temporal model (it can consume a sequence of images), the model output contains tokens for the entire input sequence. This means that when the ReshapeTokensToImage neck is used with the correct effective_time_dim, the output is rearranged to (batch, T*embed_dim, H, W), effectively increasing the number of channels connected to the decoder.
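A minimal sketch of that temporal folding (an assumption about the layout, not TerraTorch's exact code): tokens for T frames are moved into the channel dimension, so the decoder sees T*embed_dim channels.

```python
import torch

def reshape_temporal_tokens(x, effective_time_dim):
    # Input:  (batch, T * tokens_per_frame, embed_dim), cls_token removed.
    # Output: (batch, T * embed_dim, H, W) -- time folded into channels.
    b, n, c = x.shape
    t = effective_time_dim
    h = w = int((n // t) ** 0.5)     # assumes a square patch grid per frame
    x = x.reshape(b, t, h * w, c)    # (B, T, H*W, C)
    x = x.permute(0, 1, 3, 2)        # (B, T, C, H*W)
    return x.reshape(b, t * c, h, w)

tokens = torch.randn(2, 3 * 14 * 14, 768)  # T=3 frames, 14x14 grid
out = reshape_temporal_tokens(tokens, effective_time_dim=3)
print(out.shape)  # torch.Size([2, 2304, 14, 14])
```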

The current solution is to make the out_channels attribute proportional to effective_time_dim. However, the actual output of the model is (batch, num_tokens, embed_dim), not (batch, num_tokens, embed_dim * effective_time_dim). The number of channels only changes after the ReshapeTokensToImage neck rearranges the model output, which I believe makes the neck, not the backbone, responsible for the change in the number of channels.
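One way to picture this argument (a hypothetical sketch, not an existing TerraTorch API): the neck itself could report the channel change it introduces, so decoders would be built from the neck's out_channels rather than a backbone attribute scaled ahead of time.

```python
class ReshapeNeckSketch:
    """Hypothetical neck that owns the channel change it performs."""

    def __init__(self, embed_dim, effective_time_dim):
        self.effective_time_dim = effective_time_dim
        # The backbone still emits embed_dim per token; only the neck's
        # rearrangement multiplies the channel count by T.
        self.out_channels = embed_dim * effective_time_dim

neck = ReshapeNeckSketch(embed_dim=768, effective_time_dim=3)
print(neck.out_channels)  # 2304
```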

The neck should also allow for different effective_time_dim values, since, in theory, the sequence of images could have variable length at inference time. It also currently assumes the cls_token is the first token to remove, which I'm not sure is always the case.
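The cls_token question could in principle be decided from the token count alone. The following heuristic is purely an assumption for illustration (not TerraTorch logic): if the tokens divide evenly into T square grids, no cls_token is assumed; if they only do so after dropping one token, a cls_token is assumed.

```python
def infer_cls_token(num_tokens, effective_time_dim):
    # Heuristic sketch: decide whether a cls_token is prepended by checking
    # if the token count matches T square patch grids, with or without one
    # extra token.
    def fits_square_grids(n):
        if n <= 0 or n % effective_time_dim:
            return False
        per_frame = n // effective_time_dim
        root = int(per_frame ** 0.5)
        return root * root == per_frame

    if fits_square_grids(num_tokens):
        return False
    if fits_square_grids(num_tokens - 1):
        return True
    raise ValueError("token count fits neither layout")

print(infer_cls_token(3 * 196 + 1, 3))  # True  (one extra token)
print(infer_cls_token(3 * 196, 3))      # False (exactly T grids)
```

Note the ambiguity this cannot resolve: some counts fit both layouts, which is one more reason an explicit flag or a trial forward pass may be needed.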


blumenstiel commented Feb 4, 2025

I think these are very important points for multi-temporal data. To phrase them as requirements, I think we should:

  1. enable the prediction of multiple masks per time step, as currently only one mask is provided (not sure how relevant this case is),
  2. enable variable time series lengths, as this might be relevant for inference (e.g. via a neck that applies a mean across the time dimension, or via 1. plus aggregation over all predicted masks),
  3. automatically infer whether a cls_token is provided (e.g. try the forward pass and set cls_token=False if it succeeds without one),
  4. somehow handle how out_channels are changed by necks.
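Requirement 2 could be sketched as a neck that averages over the time dimension, so the channel count stays embed_dim regardless of sequence length (an illustrative sketch, not an existing TerraTorch component):

```python
import torch

def temporal_mean_neck(x, effective_time_dim):
    # Average tokens over time so downstream channels do not depend on T.
    # Input:  (batch, T * tokens_per_frame, embed_dim), cls_token removed.
    # Output: (batch, tokens_per_frame, embed_dim).
    b, n, c = x.shape
    t = effective_time_dim
    x = x.reshape(b, t, n // t, c)  # (B, T, tokens_per_frame, C)
    return x.mean(dim=1)

for t in (1, 3, 6):
    tokens = torch.randn(2, t * 196, 768)
    print(temporal_mean_neck(tokens, t).shape)  # always (2, 196, 768)
```

The design point: because the mean collapses T before the tokens are reshaped to an image, out_channels stays embed_dim and the decoder no longer needs to know the sequence length.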
