
PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures

Official implementation of the AAAI 2025 paper "PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures".

Paper | Project Page

Setting up the repo

  1. Clone the repo:

     git clone https://github.com/vl2g/PatentLMM/
     cd PatentLMM

  2. Install the environment:

     conda env create -f patentlmm.yml
     conda activate patentlmm
     pip3 install -e .

# install flash-attention
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.6.3/flash_attn-2.6.3+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip3 install flash_attn-2.6.3+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
# remove the wheel
rm flash_attn-2.6.3+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl

Downloading and preparing data

The PatentDesc-355k dataset is provided here as a JSON file with image_ids as keys; each entry contains the image's internet URL and its corresponding brief and detailed descriptions. Below is an example showing the data format.

{
    "US11036663B2__2": {
        "image_url": "...",
        "brief_description": "...",
        "detailed_description": "..."
    },
    "US11336511B2__54": {
        "image_url": "...",
        "brief_description": "...",
        "detailed_description": "..."
    },
    .
    .
    .
}

As mentioned in the paper, the detailed descriptions in this file are clipped at 500 tokens.
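
For reference, here is a minimal sketch of loading the dataset file and inspecting one entry (the path is an assumption based on the DATASET directory created in the steps below; adjust it to wherever you save the file):

import json

# Load the PatentDesc-355k annotations (path assumed, not part of the released code).
with open("DATASET/PatentDesc-355k.json") as f:
    data = json.load(f)

image_id, entry = next(iter(data.items()))
print(image_id)                         # e.g. "US11036663B2__2"
print(entry["image_url"])
print(entry["brief_description"])
print(entry["detailed_description"])    # already clipped at 500 tokens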

Follow the steps below to download the dataset in the appropriate format:

  1. Create the dataset directory:

     mkdir DATASET
     cd DATASET

  2. Download PatentDesc-355k.json into the DATASET directory.

  3. Create the images directory and download the images:

     mkdir images
     cd images

     Download the images using the image_url given for each entry in the JSON file. Please follow the naming convention of image_id.png when saving the images (a download sketch is given after this list).

     cd ..

  4. Download the text files listing the image_ids corresponding to the train, val and test splits from here.

  5. We utilize the LayoutLMv3 processor, which uses the off-the-shelf Tesseract OCR engine, to extract OCR text from patent images. For convenience, we provide the JSON file with the extracted OCR here (an illustrative extraction sketch also follows this list).

  6. Run the following commands to create the data in LLaVA format for training/validation:

     mkdir llava_json
     cd ..
     python prep_patent_data.py --desc_type [brief/detailed] --split [train/val] --data_dir [path to DATASET directory]
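The exact download procedure for step 3 is up to you; below is a minimal, untested sketch using the requests library. DATA_DIR, the timeout, and the error handling are our assumptions, not part of the released code.

import json
import os

import requests

DATA_DIR = "DATASET"  # assumption: the directory created in step 1

with open(os.path.join(DATA_DIR, "PatentDesc-355k.json")) as f:
    data = json.load(f)

images_dir = os.path.join(DATA_DIR, "images")
os.makedirs(images_dir, exist_ok=True)

for image_id, entry in data.items():
    # Naming convention from step 3: save each image as image_id.png
    out_path = os.path.join(images_dir, f"{image_id}.png")
    if os.path.exists(out_path):
        continue  # skip images that are already downloaded
    try:
        resp = requests.get(entry["image_url"], timeout=30)
        resp.raise_for_status()
    except requests.RequestException as err:
        print(f"Failed to fetch {image_id}: {err}")
        continue
    with open(out_path, "wb") as out:
        out.write(resp.content)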
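The OCR for step 5 is provided ready-made in ocr.json, so the sketch below is only illustrative of how such OCR could be reproduced with the Hugging Face LayoutLMv3 image processor, which runs Tesseract via pytesseract when apply_ocr=True (both must be installed); the schema of the released ocr.json may differ.

from PIL import Image
from transformers import LayoutLMv3ImageProcessor

# apply_ocr=True makes the processor run Tesseract (via pytesseract)
# and return the recognized words with their bounding boxes.
processor = LayoutLMv3ImageProcessor(apply_ocr=True)

image = Image.open("DATASET/images/US11036663B2__2.png").convert("RGB")
features = processor(image)

words = features["words"][0]   # list of OCR'd words
boxes = features["boxes"][0]   # matching boxes, normalized to the 0-1000 range
print(list(zip(words, boxes))[:5])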

Finally, the DATASET directory should have the following structure:

DATASET/
├── PatentDesc-355k.json
├── ocr.json
├── splits
│   ├── all.txt
│   ├── train.txt
│   ├── val.txt
│   └── test.txt
├── llava_json
│   ├── brief_train
│   ├── brief_val
│   ├── detailed_train
│   └── detailed_val
└── images
    ├── US11036663B2__2.png
    ├── US11336511B2__54.png
    .
    .
    . 

Downloading Checkpoints

The pre-trained checkpoints for PatentMME, PatentLMM and PatentLLaMA are provided below:

  • PatentMME: Download
  • PatentLMM-brief: Download
  • PatentLMM-detailed: Download
  • PatentLLaMA: Download

Download and unzip the respective checkpoints into a checkpoints directory.

Training PatentLMM

We follow a two-stage strategy to train PatentLMM. To train the projection layer in stage-1, run:

bash scripts/v1_5/train_patentlmm_stage1.sh

To run stage-2 training:

bash scripts/v1_5/train_patentlmm_stage2.sh

Inference

To run inference with our trained model, use the command below:

python finetuned_inference.py --path_to_ckp [path to global_step/mp_rank_00_model_states.pt from checkpoint] --data_dir [path to DATASET directory] --desc_type [brief/detailed] --output_file [json file name where to save results]

Cite us

  • If you find this work useful for your research, please consider citing:
@inproceedings{shukla2025patentlmm,
  author    = "Shukla, Shreya and 
              Sharma, Nakul and 
              Gupta, Manish and
              Mishra, Anand",
  title     = "PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures",
  booktitle = "AAAI",
  year      = "2025",
}

Acknowledgements

  • This work was supported by the Microsoft Academic Partnership Grant (MAPG) 2023.
  • We would like to thank the authors of LLaVA, LayoutLMv3 and OCR-VQGAN for open-sourcing their code and checkpoints!
