A Large Dataset of Historical Japanese Documents with Complex Layouts
HJDataset is a Large Dataset of Historical Japanese Documents with Complex Layouts. It contains over 250,000 layout element annotations of seven types. In addition to bounding boxes and masks of the content regions, it also includes the hierarchical structures and reading orders for layout elements for advanced analysis.
All the annotations are available through this link. However, due to some copyright issues, we could not directly release the images in this dataset. Please fill out this form to send us a request for downloading, and we will send back the links.
After downloading, we suggest organize the annotation and images in this fashion:
data/
├── train/
├── test/
├── val/
└── annotations/
├── instances_train.json
└── ....
You can also use the provided conda environment file to configure your own environment.
conda install -f environment.yml
However, when installing Detectron2, you may encounter some problems. Please check their official install instructions and Common Installation Issues for better reference.
We provide some starter code in notebooks/
.
1-Dataloader and visualization.ipynb
illustrates how to use the dataloder class to load and visualize layout elements in HJDataset.2-Training Using Detectron2.ipynb
shows how to train segmentation models on the dataset using Detectron2.
If you find the dataset is helpful for your research, please cite our work:
@article{shen2020large,
title={A Large Dataset of Historical Japanese Documents with Complex Layouts},
author={Shen, Zejiang and Zhang, Kaixuan and Dell, Melissa},
journal={arXiv preprint arXiv:2004.08686},
year={2020}
}