(ACL 2024) Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space

Paper: https://arxiv.org/abs/2402.16832
Webpage: https://claws-lab.github.io/projection-in-MLLMs/
GitHub: https://github.com/claws-lab/projection-in-MLLMs

Authors: Gaurav Verma [1], Minje Choi [1], Kartik Sharma [1], Jamelle Watson-Daniels [2], Sejoon Oh [1], and Srijan Kumar [1]
Affiliations: [1] Georgia Institute of Technology, [2] Harvard University

Code and Resources

Setup

The codebase is built on top of LLaVA's codebase. Clone the LLaVA repository (https://github.com/haotian-liu/LLaVA) inside ./experiments/ and name the directory llava-ft. Then, follow the instructions provided in the original repository to set up the environment. Once the setup is complete, verify the installation by running the llava-v1.5-7b model with the following command inside the ./experiments/llava-ft directory:

python -m llava.serve.cli \
    --model-path liuhaotian/llava-v1.5-7b \
    --image-file "https://llava-vl.github.io/static/images/view.jpg" \
    --load-4bit

Additionally, make sure that the mm_projector.bin corresponding to the llava-v1.5-7b model is downloaded from the following link: https://huggingface.co/liuhaotian/llava-v1.5-7b/tree/main. To use any other LLaVA-1.5 variant, explore the Model Zoo in the original repository.
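Optionally, the download can be scripted. Below is a minimal sketch using huggingface_hub, assuming the projector weights are available under the filename mm_projector.bin in the repository linked above; the local destination directory is illustrative.

# Minimal sketch: fetch the projector weights with huggingface_hub.
# Assumes `pip install huggingface_hub`; the local_dir below is illustrative.
from huggingface_hub import hf_hub_download

projector_path = hf_hub_download(
    repo_id="liuhaotian/llava-v1.5-7b",
    filename="mm_projector.bin",
    local_dir="./experiments/llava-ft/checkpoints",  # hypothetical destination
)
print(f"Projector weights saved to: {projector_path}")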

Datasets

We use four different datasets in this work:

  1. Agriculture: Download the PlantDoc dataset from here and use the standard train-test split.
  2. Textures: Download the DTD dataset from here and use the standard train-test split (train1.txt and test1.txt).
  3. Dermatology: Download the DermaNet dataset from here and use the standard train-test split.
  4. Humanitarian: Download the CrisisMMD dataset from here (version v2.0) and use the standard train-test split.

Prepare the dataset for fine-tuning the LLaVA models using the script ./prepare_data/format_data_for_finetuning.py. This will output a CSV file containing the image paths and labels for the images within the specified directory; this CSV is used for zero-shot inference with the CLIP model. Additionally, the script outputs a JSON file that is used for fine-tuning the LLaVA models.
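For reference, the sketch below illustrates the two outputs. The column names, file names, prompt, and labels are illustrative assumptions; the JSON layout follows LLaVA's instruction-tuning format.

# Minimal sketch of the two outputs of format_data_for_finetuning.py.
# Column names, file names, prompts, and labels are illustrative.
import csv
import json

rows = [{"image_path": "images/leaf_001.jpg", "label": "apple rust leaf"}]  # hypothetical example

# CSV used for zero-shot inference with CLIP
with open("test_set.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["image_path", "label"])
    writer.writeheader()
    writer.writerows(rows)

# JSON used for fine-tuning the LLaVA models: one conversation per image
records = [
    {
        "id": "leaf_001",
        "image": "images/leaf_001.jpg",
        "conversations": [
            {"from": "human", "value": "<image>\nWhat plant disease is shown in this image?"},
            {"from": "gpt", "value": "apple rust leaf"},
        ],
    }
]
with open("train_set.json", "w") as f:
    json.dump(records, f, indent=2)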

Experiments

The code for the experiments is available in the ./experiments/ directory, which contains the following subdirectories:

  1. ./experiments/clip-zs: Contains the code for the zero-shot experiments using CLIP. Run the zero-shot experiment with python zero_shot_inference.py after specifying the .csv file containing the image paths and labels from the test set; this file is produced by the format_data_for_finetuning.py script (a minimal sketch of this step appears after this list).

  2. ./experiments/llava-ft: Contains the experiments for the llava-v1.5-7b model. There are two fine-tuning strategies:

    • Fine-tuning the projection layer while keeping the LLM frozen. This corresponds to running the following command:
    bash experiments/llava-ft/scripts/v1_5/pretrain.sh
    
    • Modify the relevant paths in the pretrain.sh script to point to the correct base model (llava-v1.5-7b), the correct data_path (i.e., the JSON file obtained above), the image directory, and the output directory (which will store the updated projector). The default hyper-parameter values will work seamlessly with 2 A100 (80 GB) GPUs.
    • The updated mm_projector.bin will be stored in the specified output directory.
    • The updated mm_projector.bin can then be merged with your base model (i.e., llava-v1.5-7b) using the bash script inside ./experiments/llava-ft/merge_proj/:
    bash ./merge_proj/update_model.sh <source_model_path> <updated_projector_path> <save_merged_model_path>
    
    • Following these operations, run zero-shot inference with the updated model (stored in <save_merged_model_path>) using the cli.py script inside ./experiments/llava-ft/llava/serve/.

    • Fine-tuning the entire model. This corresponds to running the following command:

    bash experiments/llava-ft/scripts/v1_5/finetune_task.sh
    
    • Modify the relevant paths in the finetune_task.sh script to point to the correct base model (llava-v1.5-7b), the correct data_path (i.e., the JSON file obtained above), the image directory, and the output directory. The default hyper-parameter values will work seamlessly with 2 A100 (80 GB) GPUs.
    • Once the model is fine-tuned, run zero-shot inference with the updated model using the cli.py script inside ./experiments/llava-ft/llava/serve/. There is no need to merge the updated model with the base model in this case.
  3. ./experiments/estimte_richness/: Contains the code for training MLPs on the pre- and post-projection representations of the images. Adjust the hyper-parameters in train_mlp.py and run the script to train the MLPs (a minimal sketch appears after this list).
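For item 1 above, the sketch below shows what CLIP zero-shot classification over the test CSV can look like. The model name, CSV column names, and prompt template are assumptions; the actual logic lives in zero_shot_inference.py.

# Minimal sketch of CLIP zero-shot classification over the test-set CSV.
# Model name, CSV column names, and the prompt template are illustrative.
import csv

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

with open("test_set.csv", newline="") as f:
    rows = list(csv.DictReader(f))

labels = sorted({row["label"] for row in rows})
prompts = [f"a photo of a {label}" for label in labels]

correct = 0
for row in rows:
    image = Image.open(row["image_path"]).convert("RGB")
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # shape: (1, num_labels)
    prediction = labels[logits.argmax(dim=-1).item()]
    correct += int(prediction == row["label"])

print(f"Zero-shot accuracy: {correct / len(rows):.3f}")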
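For item 3 above, the sketch below shows the general shape of such an MLP probe. The feature/label files, hidden size, and training loop are assumptions; the actual logic lives in train_mlp.py.

# Minimal sketch of an MLP probe trained on (pre- or post-projection) image
# representations. Feature/label files, hidden size, and hyper-parameters are
# illustrative.
import torch
import torch.nn as nn

features = torch.load("pre_projection_features.pt").float()  # hypothetical (N, d) tensor
labels = torch.load("labels.pt").long()                      # hypothetical (N,) class indices

probe = nn.Sequential(
    nn.Linear(features.shape[1], 512),
    nn.ReLU(),
    nn.Linear(512, int(labels.max()) + 1),
)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

for epoch in range(20):
    optimizer.zero_grad()
    loss = criterion(probe(features), labels)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    accuracy = (probe(features).argmax(dim=-1) == labels).float().mean().item()
print(f"Probe accuracy on the training split: {accuracy:.3f}")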

Citation

If you use this codebase, please cite our paper:

@article{verma2024crossmodalprojection,
  title={Cross-Modal Projection in Multimodal LLMs Doesn't Really Project Visual Attributes to Textual Space},
  author={Verma, Gaurav and Choi, Minje and Sharma, Kartik and Watson-Daniels, Jamelle and Oh, Sejoon and Kumar, Srijan},
  publisher={62nd Annual Meeting of the Association for Computational Linguistics (ACL)},
  year={2024}
}

Acknowledgements

Our codebase is built on top of LLaVA's codebase. We thank the authors for making it publicly available. Relevant citations:

@misc{liu2023improvedllava,
      title={Improved Baselines with Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Li, Yuheng and Lee, Yong Jae},
      publisher={arXiv:2310.03744},
      year={2023},
}
@misc{liu2023llava,
      title={Visual Instruction Tuning}, 
      author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
      publisher={NeurIPS},
      year={2023},
}
