This repository implements a Prismatic VLM that can be deployed on low-powered devices such as the Jetson Nano using optimized models.
The model comprises DINOv2 Base (224px) and SigLIP Base (224px) as image encoders and Llama 3.2:1B as the language model.
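For orientation, here is a minimal sketch of how such a fused dual-encoder VLM is typically wired. The class and attribute names are illustrative assumptions (HF-style encoders with a `config.hidden_size`), not this repository's actual API:

```python
import torch
import torch.nn as nn

class FusedVisionBackbone(nn.Module):
    """Illustrative sketch (not the repo's actual API): fuse DINOv2 and
    SigLIP patch features channel-wise and project them into the language
    model's embedding space, Prismatic-style."""

    def __init__(self, dino_encoder, siglip_encoder, llm_embed_dim):
        super().__init__()
        self.dino = dino_encoder      # assumed HF-style DINOv2 Base, 224px
        self.siglip = siglip_encoder  # assumed HF-style SigLIP Base, 224px
        fused_dim = dino_encoder.config.hidden_size + siglip_encoder.config.hidden_size
        # MLP projector mapping fused patch features to LLM input embeddings
        self.projector = nn.Sequential(
            nn.Linear(fused_dim, llm_embed_dim),
            nn.GELU(),
            nn.Linear(llm_embed_dim, llm_embed_dim),
        )

    def forward(self, pixel_values):
        # Channel-wise fusion assumes both encoders emit the same number of
        # patch tokens; otherwise one stream must be interpolated or pooled.
        dino_feats = self.dino(pixel_values).last_hidden_state      # (B, N, D1)
        siglip_feats = self.siglip(pixel_values).last_hidden_state  # (B, N, D2)
        fused = torch.cat([dino_feats, siglip_feats], dim=-1)       # (B, N, D1+D2)
        return self.projector(fused)  # visual tokens for the Llama 3.2:1B input
```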
```bash
git clone https://github.com/BhavikShangari/Jetson-VLM.git
cd Jetson-VLM
conda env create --name jetson_vlm --file environment.yml
```
To prepare training data for your model, we modified the LLaVA v1.5 595K mixture dataset, applied text formatting over it, and exported the result as a CSV file for easy loading (a loading sketch follows the download steps below).
Either download the CSV manually from this Link, or use:

```bash
pip install gdown
gdown 1yZagkp2xFmPd53Zo0FDPU-CNy8GmAyII
```
Also download Images.zip manually here, or:

```bash
gdown 1MsjR_tfk2YHRwLTX1tLOzGc7r8JQdOfi
unzip Images.zip
```
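As a rough illustration of how the CSV plus unzipped images could be consumed, here is a minimal PyTorch `Dataset` sketch. The column names (`image`, `text`) and the `Images/` directory are assumptions about the file layout, not a guaranteed schema:

```python
import os
import pandas as pd
from PIL import Image
from torch.utils.data import Dataset

class LlavaCSVDataset(Dataset):
    """Sketch of a loader for the formatted LLaVA 595K CSV.
    Column names 'image' and 'text' are assumed; adjust to the real schema."""

    def __init__(self, csv_path, image_dir, transform=None):
        self.df = pd.read_csv(csv_path)
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(os.path.join(self.image_dir, row["image"])).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, row["text"]

# Example: dataset = LlavaCSVDataset("data.csv", "Images")
```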
If starting from a checkpoint:

```bash
python3 train.py --model_path path/to/checkpoint.pt --per_device_batch_size 32 --learning_rate 2e-5 --output_dir ./results --epochs 10 --torch_compile True --save_strategy no --report_to wandb --lr_scheduler cosine --warmup_ratio 0.10 --logging_steps 100 --dataset_path data.csv --save_file_name path/to/model.pt
```

Otherwise:

```bash
python3 train.py --per_device_batch_size 32 --learning_rate 2e-5 --output_dir ./results --epochs 10 --torch_compile True --save_strategy no --report_to wandb --lr_scheduler cosine --warmup_ratio 0.10 --logging_steps 100 --dataset_path data.csv --save_file_name path/to/model.pt
```
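The flag names above mirror Hugging Face `TrainingArguments`, which suggests `train.py` wraps the HF `Trainer`; purely as an assumption about that wiring, here is how the same configuration would look expressed directly, in case you want to tweak defaults:

```python
from transformers import TrainingArguments

# Hypothetical mapping of the CLI flags onto Hugging Face TrainingArguments;
# train.py's actual argument handling may differ.
args = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=32,
    learning_rate=2e-5,
    num_train_epochs=10,
    torch_compile=True,
    save_strategy="no",
    report_to="wandb",
    lr_scheduler_type="cosine",
    warmup_ratio=0.10,
    logging_steps=100,
)
```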
Pre-trained checkpoints are available:
| Checkpoint Name | Model Checkpoint |
|---|---|
| Pretrained Llama 3.2:1B + DINOv2 Base (224px) + SigLIP Base (224px) (2 epochs) | Link |
| Instruct Llama 3.2:1B + DINOv2 Base (224px) + SigLIP Base (224px) (2 epochs) | Link |
| Instruct Llama 3.2:1B + DINOv2 Base (224px) + SigLIP Base (224px) (6 epochs) | Link |
Coming soon
For generation, download a checkpoint and place it in the Checkpoints folder:

```bash
cd Checkpoints
gdown {Checkpoint}
cd ..
```

Then run:

```bash
python3 generate.py --model_path Checkpoints/{MODEL}.pt --image_path Path/to/image.png --prompt 'Explain what this image depicts' --device cuda:0
```
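For scripted use, the CLI above presumably boils down to loading the checkpoint and calling a generate method. The sketch below illustrates that flow with hypothetical helpers (`load_model`, `build_processor`); consult `generate.py` for the real entry points:

```python
import torch
from PIL import Image

# Hypothetical entry points; the real loading/generation code lives in
# generate.py, so treat these imports and signatures as placeholders.
from generate import load_model, build_processor  # assumed helpers

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = load_model("Checkpoints/model.pt").to(device).eval()
processor = build_processor()  # image transforms + Llama tokenizer

image = Image.open("image.png").convert("RGB")
inputs = processor(image=image, prompt="Explain what this image depicts")

with torch.no_grad():
    print(model.generate(**{k: v.to(device) for k, v in inputs.items()}))
```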
Coming Soon...