Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery

Long Bai*, Mobarakol Islam*, Lalithkumar Seenivasan, Hongliang Ren

IEEE International Conference on Robotics and Automation (ICRA) 2023

[arXiv] [Paper]

If you find our code, paper, or dataset useful, please cite the paper as

@inproceedings{bai2023surgical,
  title={Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery},
  author={Bai, Long and Islam, Mobarakol and Seenivasan, Lalithkumar and Ren, Hongliang},
  booktitle={2023 IEEE International Conference on Robotics and Automation (ICRA)},
  pages={6859--6865},
  year={2023},
  organization={IEEE}
}

Abstract

Despite the availability of computer-aided simulators and recorded videos of surgical procedures, junior residents still heavily rely on experts to answer their queries. However, expert surgeons are often overloaded with clinical and academic workloads and have limited time for answering. To address this, we develop a surgical question-answering system to facilitate robot-assisted surgical scene and activity understanding from recorded videos. Most existing visual question answering (VQA) methods require an object detector and a region-based feature extractor to extract visual features and fuse them with the embedded text of the question for answer generation. However, (i) surgical object detection models are scarce due to small datasets and a lack of bounding box annotations; (ii) current fusion strategies for heterogeneous modalities like text and image are naive; (iii) localized answering is missing, which is crucial in complex surgical scenarios. In this paper, we propose Visual Question Localized-Answering in Robotic Surgery (Surgical-VQLA) to localize the specific surgical area during answer prediction. To deal with the fusion of the heterogeneous modalities, we design a gated vision-language embedding (GVLE) to build input patches for the Language Vision Transformer (LViT) to predict the answer. To obtain localization, we add a detection head in parallel with the prediction head of the LViT. We also integrate the generalized intersection over union (GIoU) loss to boost localization performance while preserving the accuracy of the question-answering model. We annotate two VQLA datasets using publicly available surgical videos from the EndoVis-17 and EndoVis-18 MICCAI challenges. Our validation results suggest that Surgical-VQLA can better understand the surgical scene and localize the specific area relevant to the question. GVLE presents an efficient vision-language embedding technique, showing superior performance over the existing benchmarks.
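
For intuition, here is a minimal PyTorch sketch of the gated fusion idea behind GVLE. The single-gate design, layer names, and embedding dimension are assumptions made for illustration, not the paper's exact module; see models/GatedLanguageVisualEmbedding.py for the actual implementation.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    # Per-feature gate that mixes visual and word embeddings before the
    # transformer encoder; a simplified stand-in for GVLE, not the repo's code.
    def __init__(self, dim=768):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, vis, txt):
        # vis, txt: (batch, seq_len, dim) aligned visual and word embeddings
        g = torch.sigmoid(self.gate(torch.cat([vis, txt], dim=-1)))
        return g * vis + (1.0 - g) * txt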

(Figure: Surgical-VQLA)


Environment

  • PyTorch
  • numpy
  • pandas
  • scipy
  • scikit-learn
  • timm
  • transformers
  • h5py
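
All of these are available from PyPI. Versions are not pinned in this repository, so recent releases are assumed to work, e.g.:

pip install torch numpy pandas scipy scikit-learn timm transformers h5py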

Directory Setup

In this project, we implement our method using the PyTorch library; the directory structure is as follows:

  • checkpoints/: Contains trained weights.
  • dataset/
    • bertvocab/
      • v2 : BERT tokenizer
    • EndoVis-18-VQLA/ : seq_{1,2,3,4,5,6,7,9,10,11,12,14,15,16}. Each sequence folder follows the same structure.
      • seq_1:
        • left_frames: Image frames (left_frames) for each sequence can be downloaded from the EndoVis-18 challenge.
        • vqla
          • label: Q&A pairs and bounding box label.
          • img_features: Contains features extracted from each frame at different patch sizes.
            • 5x5: features extracted on a 5x5 patch grid by ResNet18.
            • frcnn: features extracted by Fast-RCNN and ResNet101.
      • ....
      • seq_16
    • EndoVis-17-VQLA/ : 97 frames selected from the EndoVis-17 challenge for external validation.
      • left_frames
      • vqla
        • label: Q&A pairs and bounding box label.
        • img_features: Contains features extracted from each frame at different patch sizes.
          • 5x5: features extracted on a 5x5 patch grid by ResNet18.
          • frcnn: features extracted by Fast-RCNN and ResNet101.
    • feature_extraction/:
      • feature_extraction_EndoVis18-VQLA-frcnn.py: Used to extract features with Fast-RCNN and ResNet101.
      • feature_extraction_EndoVis18-VQLA-resnet.py: Used to extract patch-grid features with ResNet18 (see the sketch after this list).
  • models/:
    • GatedLanguageVisualEmbedding.py : GVLE module for visual and word embeddings and their fusion.
    • LViTPrediction.py : our proposed LViT model for the VQLA task.
    • VisualBertResMLP.py : VisualBERT ResMLP encoder from Surgical-VQA.
    • visualBertPrediction.py : VisualBERT encoder-based model for the VQLA task.
    • VisualBertResMLPPrediction.py : VisualBERT ResMLP encoder-based model for the VQLA task.
  • dataloader.py
  • train.py
  • utils.py
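
As a rough illustration of what feature_extraction_EndoVis18-VQLA-resnet.py does, the sketch below splits a frame into a 5x5 grid and runs each patch through ResNet18. The input resolution, file path, and preprocessing are assumptions for illustration, not the script's exact settings:

import torch
from PIL import Image
from torchvision import models, transforms

# Simplified 5x5 grid-patch feature extraction with ResNet18; the repo's
# feature_extraction scripts are the authoritative version.
resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()  # drop the FC head

img = Image.open("left_frames/frame000.png").convert("RGB")           # hypothetical path
x = transforms.Compose([transforms.Resize((300, 300)), transforms.ToTensor()])(img)
patches = x.unfold(1, 60, 60).unfold(2, 60, 60)                       # (3, 5, 5, 60, 60)
patches = patches.permute(1, 2, 0, 3, 4).reshape(25, 3, 60, 60)       # 25 patches of 60x60
with torch.no_grad():
    feats = backbone(patches).flatten(1)                              # (25, 512), one vector per patch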

Dataset

  1. EndoVis-18-VQLA (Image frames can be downloaded directly from the EndoVis Challenge Website)
  2. EndoVis-17-VQLA (External Validation Set)
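
Since h5py is among the dependencies, the pre-extracted feature files described under Directory Setup are presumably stored in HDF5 format. A quick way to inspect one (the file path and dataset keys below are hypothetical) is:

import h5py

# List the datasets stored in one feature file; actual file names and
# keys in this repository may differ.
path = "dataset/EndoVis-18-VQLA/seq_1/vqla/img_features/frcnn/frame000.hdf5"
with h5py.File(path, "r") as f:
    f.visititems(lambda name, node: print(name, getattr(node, "shape", "")))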

Run training

  • Train on EndoVis-18-VQLA
    python train.py --checkpoint_dir /CHECKPOINT_PATH/ --transformer_ver lvit --batch_size 64 --epochs 80

Evaluation

  • Evaluate on both EndoVis-18-VQLA & EndoVis-17-VQLA
    python train.py --validate True --checkpoint_dir /CHECKPOINT_PATH/ --transformer_ver lvit --batch_size 64

References

Code adapted and modified from:

  1. VisualBERT model

  2. VisualBERT ResMLP model

  3. DETR


Contact

For any queries, please raise an issue or contact Long Bai.

