DaTikZ is a dataset containing a wide variety of TikZ drawings. It is intended to support research and development of machine learning models that can generate or manipulate vector graphics in LaTeX.
There are two main distributions publicly available: DaTikZv1 (introduced in AutomaTikZ) and DaTikZv2 (introduced in DeTikZify). In compliance with licensing agreements, certain TikZ drawings are excluded from these public versions of the dataset. This repository provides tools and methods to help with recreating the complete dataset from scratch.
Note
The datasets you produce might vary slightly from the originally created ones, as the sources used for crawling are subject to continuous updates.
DaTikZ relies on a full TeX Live installation and also requires ghostscript and poppler. Python dependencies can be installed as follows:
pip install -r requirements.txt
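Before running the pipeline, it can help to confirm that the system-level tools are reachable. The following is a minimal sketch, not part of DaTikZ: it assumes that `pdflatex` (TeX Live), `gs` (ghostscript), and `pdftoppm` (poppler) are representative executables. Which binaries DaTikZ actually invokes is not stated here, so treat the list as an assumption and adjust it as needed.

```python
import shutil

# Assumed representative executables; the binaries DaTikZ actually calls may differ.
REQUIRED_TOOLS = {
    "pdflatex": "TeX Live",
    "gs": "ghostscript",
    "pdftoppm": "poppler",
}


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the packages whose executable is not found on PATH."""
    return sorted(pkg for exe, pkg in tools.items() if which(exe) is None)


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system dependencies:", ", ".join(missing))
    else:
        print("All checked system dependencies were found.")
```

Finding `pdflatex` does not prove that the TeX Live installation is a full one; it only catches the most obvious setup problems.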
For processing arXiv source files (optional), you additionally need to preprocess arXiv bulk data using arxiv-latex-extract.
To generate the dataset, run the main.py script. Use the --help flag to view the available options. DaTikZv2, for example, was created as follows:
main.py --arxiv_files "${DATIKZ_ARXIV_FILES[@]}" --size 384
In this example, the DATIKZ_ARXIV_FILES variable (expanded here as a shell array) should contain paths to either directories holding the jsonl files obtained with arxiv-latex-extract or archives that contain these files.
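The list of paths can also be assembled programmatically. The following is a minimal Python sketch, assuming the extracted data lives under a single root directory. The helper names `collect_arxiv_inputs` and `build_command`, the archive suffixes, and the `arxiv_data` location are hypothetical and not part of DaTikZ; only the `--arxiv_files` and `--size` options are taken from the command above.

```python
import subprocess
import sys
from pathlib import Path

# Assumed archive suffixes; adjust to whatever your preprocessing produced.
ARCHIVE_SUFFIXES = (".tar", ".tar.gz", ".tgz", ".zip")


def collect_arxiv_inputs(root):
    """Return directories holding .jsonl files plus archives found under root."""
    root = Path(root)
    if not root.is_dir():
        return []
    jsonl_dirs = {p.parent for p in root.rglob("*.jsonl")}
    archives = {
        p for p in root.rglob("*")
        if p.is_file() and p.name.endswith(ARCHIVE_SUFFIXES)
    }
    return sorted(str(p) for p in jsonl_dirs | archives)


def build_command(arxiv_files, size=384):
    """Build the main.py invocation as an argument list (no shell quoting needed)."""
    return [sys.executable, "main.py", "--arxiv_files", *arxiv_files, "--size", str(size)]


if __name__ == "__main__":
    inputs = collect_arxiv_inputs("arxiv_data")  # hypothetical location
    if inputs:
        subprocess.run(build_command(inputs), check=True)
    else:
        print("No jsonl directories or archives found.")
```

Passing the arguments as a list avoids quoting problems with paths that contain spaces, which is the same reason the shell command quotes the array expansion.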
When executed successfully, the script generates the following output files:
datikz-train.parquet: The training split of the DaTikZ dataset.
datikz-test.parquet: The test split consisting of 1k items.
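To sanity-check a run, one might verify that both files exist and take a quick look at them. This is a sketch only: it assumes the files are written to the current working directory and that pandas with a parquet engine such as pyarrow is installed. The helper names are hypothetical, and no column names are assumed because the schema is not described here.

```python
from pathlib import Path

EXPECTED_FILES = ("datikz-train.parquet", "datikz-test.parquet")


def missing_outputs(out_dir="."):
    """Return the expected output files that are absent from out_dir."""
    return [name for name in EXPECTED_FILES if not (Path(out_dir) / name).is_file()]


def summarize(path):
    """Print the row count and column names of one split."""
    import pandas as pd  # imported lazily; needs a parquet engine such as pyarrow

    frame = pd.read_parquet(path)
    print(f"{path}: {len(frame)} rows, columns: {list(frame.columns)}")


if __name__ == "__main__":
    missing = missing_outputs()  # assumes output in the current directory
    if missing:
        print("Missing output files:", ", ".join(missing))
    else:
        for name in EXPECTED_FILES:
            summarize(name)
```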
Important
The --captionize flag, formerly used to automatically augment captions to better align with their figures, is no longer supported. To augment extracted captions, we recommend implementing your own solution using the latest multimodal large language models.
If DaTikZ has been useful for your research or applications, please acknowledge it by citing the following papers:
@inproceedings{belouadi2024detikzify,
title={{DeTikZify}: Synthesizing Graphics Programs for Scientific Figures and Sketches with {TikZ}},
author={Jonas Belouadi and Simone Paolo Ponzetto and Steffen Eger},
booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
year={2024},
url={https://openreview.net/forum?id=bcVLFQCOjc}
}
@inproceedings{belouadi2024automatikz,
title={{AutomaTikZ}: Text-Guided Synthesis of Scientific Vector Graphics with {TikZ}},
author={Jonas Belouadi and Anne Lauscher and Steffen Eger},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=v3K5TVP8kZ}
}