DaTikZ Dataset

DaTikZ is a dataset containing a wide variety of TikZ drawings. It is intended to support research and development of machine learning models that can generate or manipulate vector graphics in LaTeX.

There are three main distributions publicly available: DaTikZ_v1 (introduced in AutomaTikZ), DaTikZ_v2 (introduced in DeTikZify), and DaTikZ_v3 (introduced in TikZero). In compliance with licensing agreements, certain TikZ drawings are excluded from these public versions of the dataset. This repository provides tools and methods to help with recreating the complete dataset from scratch.

Note

The datasets you produce might vary slightly from the originally created ones, as the sources used for crawling are subject to continuous updates.

Installation

DaTikZ relies on a full TeX Live installation and also requires ghostscript and poppler. Python dependencies can be installed as follows:

pip install -r requirements.txt

For processing arXiv source files (optional), you additionally need to preprocess arXiv bulk data using arxiv-latex-extract.

Usage

To generate the dataset, run the main.py script. Use the --help flag to view the available options. DaTikZ_v2, for example, was created as follows:

main.py --arxiv_files "${DATIKZ_ARXIV_FILES[@]}" --size 384

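In this example, the DATIKZ_ARXIV_FILES variable should contain the paths to either directories holding the jsonl files produced by arxiv-latex-extract or archives that contain these files. The "${DATIKZ_ARXIV_FILES[@]}" syntax expands it as a Bash array, so each path is passed to --arxiv_files as a separate argument.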
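If you prefer to assemble the call from Python rather than from a shell array, the same command line can be built as a list of arguments. The following is only a minimal sketch: the helper name build_command and the example paths are hypothetical, and the only flags used are --arxiv_files and --size, exactly as in the command above.

```python
import sys
from pathlib import Path


def build_command(arxiv_files, size=384, script="main.py"):
    """Build the argument list for main.py, mirroring the example command.

    `arxiv_files` is a list of directories or archives with jsonl files
    (an assumption based on the description above).
    """
    paths = [str(Path(p)) for p in arxiv_files]
    if not paths:
        raise ValueError("expected at least one directory or archive path")
    # One list element per path, like "${DATIKZ_ARXIV_FILES[@]}" in Bash.
    return [sys.executable, script, "--arxiv_files", *paths, "--size", str(size)]


if __name__ == "__main__":
    # Hypothetical locations; replace with your arxiv-latex-extract output.
    cmd = build_command(["arxiv/jsonl", "arxiv/extra.tar"])
    print(" ".join(cmd))
```

The resulting list can be handed to subprocess.run(cmd, check=True) to start the run. Passing a list instead of a single string avoids quoting problems when paths contain spaces.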

When executed successfully, the script generates the following output files:

datikz-train.parquet: The training split of the DaTikZ dataset.
datikz-test.parquet: The test split consisting of 1k items.

Important

The --captionize flag, formerly used to automatically augment captions so that they better align with their figures, is no longer supported. To augment extracted captions, we recommend implementing your own solution using a recent multimodal large language model.

Citation

If DaTikZ has been useful for your research or applications, please cite the following papers:

@inproceedings{belouadi2025tikzero,
    title={{TikZero}: Zero-Shot Text-Guided Graphics Program Synthesis},
    author={Belouadi, Jonas and Ilg, Eddy and Keuper, Margret and Tanaka, Hideki and Utiyama, Masao and Dabre, Raj and Eger, Steffen and Ponzetto, Simone},
    booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month={October},
    year={2025},
    pages={17793--17806},
    url={https://openaccess.thecvf.com/content/ICCV2025/html/Belouadi_TikZero_Zero-Shot_Text-Guided_Graphics_Program_Synthesis_ICCV_2025_paper.html}
}

@inproceedings{belouadi2024detikzify,
    title={{DeTikZify}: Synthesizing Graphics Programs for Scientific Figures and Sketches with {TikZ}},
    author={Jonas Belouadi and Simone Paolo Ponzetto and Steffen Eger},
    booktitle={The Thirty-eighth Annual Conference on Neural Information Processing Systems},
    year={2024},
    url={https://openreview.net/forum?id=bcVLFQCOjc}
}

@inproceedings{belouadi2024automatikz,
    title={{AutomaTikZ}: Text-Guided Synthesis of Scientific Vector Graphics with {TikZ}},
    author={Jonas Belouadi and Anne Lauscher and Steffen Eger},
    booktitle={The Twelfth International Conference on Learning Representations},
    year={2024},
    url={https://openreview.net/forum?id=v3K5TVP8kZ}
}
