Official implementation of the AAAI 2025 paper "PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures".
- Clone the repo:
  ```bash
  git clone https://github.com/vl2g/PatentLMM/
  cd PatentLMM
  ```
- Install the environment:
  ```bash
  conda env create -f patentlmm.yml
  conda activate patentlmm
  pip3 install -e .

  # install flash-attention
  wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
  pip3 install flash_attn-2.6.3+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

  # remove the wheel
  rm flash_attn-2.6.3+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
  ```
The PatentDesc-355k dataset is provided here as a JSON file with image_ids as keys; each entry contains the image's internet URL and the corresponding brief and detailed descriptions. Below is an example showing the data format.
```json
{
    "US11036663B2__2": {
        "image_url": "...",
        "brief_description": "...",
        "detailed_description": "..."
    },
    "US11336511B2__54": {
        "image_url": "...",
        "brief_description": "...",
        "detailed_description": "..."
    },
    ...
}
```
As mentioned in the paper, the detailed descriptions in this file are clipped at 500 tokens.
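For reference, the snippet below is a minimal sketch (not part of the released code) of how to load and inspect this file in Python; it assumes only the JSON layout shown above and that `PatentDesc-355k.json` has been downloaded to the current directory.

```python
import json

# Load the PatentDesc-355k annotation file (layout shown above).
with open("PatentDesc-355k.json", "r") as f:
    data = json.load(f)

print(f"Total figures: {len(data)}")

# Peek at one entry: keys are image_ids, values hold the URL and both descriptions.
image_id, entry = next(iter(data.items()))
print(image_id)
print(entry["image_url"])
print(entry["brief_description"][:200])
print(entry["detailed_description"][:200])
```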
Follow the steps below to download the dataset in the appropriate format:
- Create the dataset directory:
  ```bash
  mkdir DATASET
  cd DATASET
  ```
- Download `PatentDesc-355k.json`.
- Create an `images` directory and download the images using the `image_url` field from the JSON file. Please follow the naming convention `image_id.png` when saving the images (a minimal downloader sketch is given after these steps):
  ```bash
  mkdir images
  cd images
  # download the images here, saving each one as <image_id>.png
  cd ..
  ```
- Download the text files listing the image_ids corresponding to the train, val and test splits from here.
- We utilize the LayoutLMv3 processor, which uses the off-the-shelf Tesseract OCR engine, to extract OCR text from patent images. For convenience, we provide the JSON file with the extracted OCR here. (A sketch of how such OCR could be re-extracted is shown after these steps.)
- Run the following command to create data in LLaVA format for training/validation:
  ```bash
  mkdir llava_json
  cd ..
  python prep_patent_data.py --desc_type [brief/detailed] --split [train/val] --data_dir [path to DATASET directory]
  ```
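As referenced in the image-download step above, here is a minimal downloader sketch. It is not part of the released code: it assumes only the `PatentDesc-355k.json` layout shown earlier and the `image_id.png` naming convention, and it uses the third-party `requests` library (install it separately if needed).

```python
import json
import os

import requests

DATA_DIR = "DATASET"  # assumption: run from the directory containing DATASET

with open(os.path.join(DATA_DIR, "PatentDesc-355k.json"), "r") as f:
    data = json.load(f)

os.makedirs(os.path.join(DATA_DIR, "images"), exist_ok=True)

for image_id, entry in data.items():
    # Follow the image_id.png naming convention.
    out_path = os.path.join(DATA_DIR, "images", f"{image_id}.png")
    if os.path.exists(out_path):
        continue  # skip images that were already downloaded
    try:
        resp = requests.get(entry["image_url"], timeout=30)
        resp.raise_for_status()
        with open(out_path, "wb") as img_file:
            img_file.write(resp.content)
    except requests.RequestException as e:
        print(f"Failed to download {image_id}: {e}")
```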
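Likewise, the provided `ocr.json` already contains the OCR we used, so re-extraction is optional. If you do want to run it yourself, the following is a rough sketch of running the LayoutLMv3 image processor (which calls Tesseract under the hood, so Tesseract and `pytesseract` must be installed) on a single image; the exact options and the output format of our `ocr.json` may differ.

```python
from PIL import Image
from transformers import LayoutLMv3ImageProcessor

# apply_ocr=True makes the processor run the off-the-shelf Tesseract OCR engine.
image_processor = LayoutLMv3ImageProcessor.from_pretrained(
    "microsoft/layoutlmv3-base", apply_ocr=True
)

image = Image.open("DATASET/images/US11036663B2__2.png").convert("RGB")
features = image_processor(image)

# Recognized words and their normalized (0-1000) bounding boxes.
words = features["words"][0]
boxes = features["boxes"][0]
print(list(zip(words, boxes))[:10])
```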
Finally, the `DATASET` directory should have the following structure:
```
DATASET
├── PatentDesc-355k.json
├── ocr.json
├── splits
│   ├── all.txt
│   ├── train.txt
│   ├── val.txt
│   └── test.txt
├── llava_json
│   ├── brief_train
│   ├── brief_val
│   ├── detailed_train
│   └── detailed_val
└── images
    ├── US11036663B2__2.png
    ├── US11336511B2__54.png
    └── ...
```
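As an optional sanity check (not part of the released code), the sketch below assumes the file and directory names shown in the tree above and verifies that every image_id listed in the split files has a corresponding PNG in `images/`.

```python
import os

DATA_DIR = "DATASET"  # path to the DATASET directory

for split in ["train", "val", "test"]:
    # Each split file is assumed to list one image_id per line.
    with open(os.path.join(DATA_DIR, "splits", f"{split}.txt"), "r") as f:
        image_ids = [line.strip() for line in f if line.strip()]
    missing = [
        i for i in image_ids
        if not os.path.exists(os.path.join(DATA_DIR, "images", f"{i}.png"))
    ]
    print(f"{split}: {len(image_ids)} image_ids, {len(missing)} missing images")
```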
The pre-trained checkpoints for PatentMME, PatentLMM and PatentLLaMA are provided below:
| PatentMME | PatentLMM-brief | PatentLMM-detailed | PatentLLaMA |
|---|---|---|---|
| Download | Download | Download | Download |
Download and unzip the respective checkpoints in a `checkpoints` directory.
We follow a two-stage strategy to train PatentLMM. To train the projection layer in stage-1, run:
```bash
bash scripts/v1_5/train_patentlmm_stage1.sh
```
To train for stage-2:
```bash
bash scripts/v1_5/train_patentlmm_stage2.sh
```
To see the inference results with our trained model, run the command below:
```bash
python finetuned_inference.py --path_to_ckp [path to global_step/mp_rank_00_model_states.pt from checkpoint] --data_dir [path to DATASET directory] --desc_type [brief/detailed] --output_file [json file name where to save results]
```
- If you find this work useful for your research, please consider citing:
```bibtex
@inproceedings{shukla2025patentlmm,
  author    = "Shukla, Shreya and
               Sharma, Nakul and
               Gupta, Manish and
               Mishra, Anand",
  title     = "PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures",
  booktitle = "AAAI",
  year      = "2025",
}
```
- This work was supported by the Microsoft Academic Partnership Grant (MAPG) 2023.
- We would like to thank the authors of LLaVA, LayoutLMv3 and OCR-VQGAN for open-sourcing their code and checkpoints!