diff --git a/experiments/2024-08-05-speed/speed-tests.ipynb b/experiments/2024-08-05-speed/01-measure-recent-versions.ipynb similarity index 100% rename from experiments/2024-08-05-speed/speed-tests.ipynb rename to experiments/2024-08-05-speed/01-measure-recent-versions.ipynb diff --git a/experiments/2024-08-05-speed/02-measure-version-3.4.3.ipynb b/experiments/2024-08-05-speed/02-measure-version-3.4.3.ipynb new file mode 100644 index 00000000..458cbe45 --- /dev/null +++ b/experiments/2024-08-05-speed/02-measure-version-3.4.3.ipynb @@ -0,0 +1,476 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "2b21d45f-cf7d-4a09-964c-bc6a81a1893d", + "metadata": {}, + "source": [ + "# Measure the speed of the Markdown package on version 3.4.3\n", + "\n", + "The current version of the Markdown package for TeX takes multiple seconds to initialize and process a markdown text:" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "11e99041-b6c2-4947-b108-9ed4d7c4a46f", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\\markdownRendererDocumentBegin\n", + "foo\\markdownRendererDocumentEnd\n", + "\n", + "real\t0m1.648s\n", + "user\t0m1.410s\n", + "sys\t0m0.237s\n" + ] + } + ], + "source": [ + "! 
docker run --rm -i witiko/markdown bash -c 'time markdown-cli <<< foo'" + ] + }, + { + "cell_type": "markdown", + "id": "482b95d6-9d03-4e10-acdf-0977a4a0c1dd", + "metadata": {}, + "source": [ + "As shown in the previous Jupyter notebook `01-measure-recent-versions.ipynb` titled \"Measure the speed of the Markdown package across recent versions\", a slow-down of more than 5× was introduced in [version 3.4.3][1] of the Markdown package for TeX.\n", + "\n", + " [1]: https://github.com/Witiko/markdown/releases/tag/3.4.3" + ] + }, + { + "cell_type": "markdown", + "id": "b0eb626b-3eed-47a4-b5ef-6fbd39d38d54", + "metadata": {}, + "source": [ + "This Jupyter notebook measures the speed of the Markdown package for TeX at all merge commits from version 3.4.3 to determine which of the [eight PRs merged in version 3.4.3][1] caused the slow-down. \n", + "As discussed in [#474 (comment)][2], the slow-down is likely related to PRs [#416][3] and [#432][4], which started loading `UnicodeData.txt` and constructing a parser that recognizes all Unicode punctuation.\n", + "\n", + " [1]: https://github.com/Witiko/markdown/pulls?q=is%3Amerged+is%3Apr+milestone%3A3.4.3+\n", + " [2]: https://github.com/Witiko/markdown/issues/474#issuecomment-2286251419\n", + " [3]: https://github.com/Witiko/markdown/pull/416\n", + " [4]: https://github.com/Witiko/markdown/pull/432" + ] + }, + { + "cell_type": "markdown", + "id": "e6ad6175-68c4-4e27-8e1f-b73d10856a74", + "metadata": {}, + "source": [ + "## Experiment\n", + "\n", + "In my experiment, I time the command `markdown-cli <<< foo` with the Docker images for all merge commits from version 3.4.3 on my Dell G5 15 notebook. 
\n", + "Furthermore, I include commit [`a45cf0ed`][2] with tag `3.4.2` and commit [`32b52ba3`][3] for PR [#428][1], which lacks a merge commit.\n", + "\n", + " [1]: https://github.com/Witiko/markdown/pull/428\n", + " [2]: https://github.com/Witiko/markdown/commit/a45cf0ed8a26270c9c13dfc13d135c8071ad3ae5\n", + " [3]: https://github.com/Witiko/markdown/commit/32b52ba3a41c8c1b5fd9cbb814a86fab215204c4" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "d124af9e-3b7e-4661-89ee-50d121e62566", + "metadata": {}, + "outputs": [], + "source": [ + "from packaging.version import Version" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "4d863f76-7dd9-447a-a427-b0e95542dbaf", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "a45cf0ed8a26270c9c13dfc13d135c8071ad3ae5 Merge branch 'fix/allow-line-break-before-options'\n", + "4a770b18992e3aef3e13516da9f30cead40120a4 Merge pull request #419 from Witiko/feat/remove-trailing-separators\n", + "5d3dfdb1bf44c04b84c46b6fe8cd9870f4a37504 Merge pull request #422 from Witiko/fix/mathml\n", + "ef6c9d0884c7b6d6398af22b3af90c9e96f48e0e Merge pull request #425 from Witiko/feat/tex-live-2024\n", + "415379f9869c20b64f6e1472d6664fe5ca38ccb7 Merge pull request #426 from Witiko/feat/emails-and-citations\n", + "32b52ba3a41c8c1b5fd9cbb814a86fab215204c4 Require that closing div fence has the same indent as the opening fence\n", + "d8a1d2f9d61258a1ce4b06bda13cd939ecd28e15 Merge pull request #416 from lostenderman/update-commonmark\n", + "828e25a5009e0a7dc6ab83b39afa4b539e88de1a Merge pull request #431 from lostenderman/fix/fenced-divs-indent-table\n", + "e2c6be1a77653281101f068ab4bcf8ee0ef3ebbf Merge pull request #432 from Witiko/fix/parsers-punctuation-memory-issues\n" + ] + } + ], + "source": [ + "refs = ! 
git log --pretty=oneline 3.4.2^..3.4.3 | tac | sed -n -r \"/Merge pull request|Merge branch 'fix\\\\/allow-line-break-before-options'|Require that closing div fence has the same indent as the opening fence/p\"\n", + "print('\\n'.join(refs))\n", + "refs = [ref.split()[0] for ref in refs]" + ] + }, + { + "cell_type": "markdown", + "id": "fe11371c-102e-41fb-a7f8-fe0dbd296403", + "metadata": {}, + "source": [ + "First, I build the docker images for the individual commits." + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "605b5260-da75-4bcb-828c-b395ea2afc43", + "metadata": {}, + "outputs": [], + "source": [ + "import json" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "914a5154-72bd-43a5-914f-383b20ad3ad6", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{\n", + " \"a45cf0ed8a26270c9c13dfc13d135c8071ad3ae5\": \"witiko/markdown:3.4.2-0-ga45cf0ed-TL2022-historic\",\n", + " \"4a770b18992e3aef3e13516da9f30cead40120a4\": \"witiko/markdown:3.4.2-6-g4a770b18-TL2022-historic\",\n", + " \"5d3dfdb1bf44c04b84c46b6fe8cd9870f4a37504\": \"witiko/markdown:3.4.2-14-g5d3dfdb1-TL2022-historic\",\n", + " \"ef6c9d0884c7b6d6398af22b3af90c9e96f48e0e\": \"witiko/markdown:3.4.2-22-gef6c9d08-TL2022-historic\",\n", + " \"415379f9869c20b64f6e1472d6664fe5ca38ccb7\": \"witiko/markdown:3.4.2-39-g415379f9-TL2022-historic\",\n", + " \"32b52ba3a41c8c1b5fd9cbb814a86fab215204c4\": \"witiko/markdown:3.4.2-41-g32b52ba3-TL2022-historic\",\n", + " \"d8a1d2f9d61258a1ce4b06bda13cd939ecd28e15\": \"witiko/markdown:3.4.2-57-gd8a1d2f9-TL2022-historic\",\n", + " \"828e25a5009e0a7dc6ab83b39afa4b539e88de1a\": \"witiko/markdown:3.4.2-77-g828e25a5-TL2022-historic\",\n", + " \"e2c6be1a77653281101f068ab4bcf8ee0ef3ebbf\": \"witiko/markdown:3.4.3-0-ge2c6be1a-TL2022-historic\"\n", + "}\n" + ] + } + ], + "source": [ + "tags = {}\n", + "for ref in refs:\n", + " tag, = ! 
git describe --abbrev=8 --tags --always --long --exclude latest $ref\n", + " tags[ref] = f'witiko/markdown:{tag}-TL2022-historic'\n", + "print(json.dumps(tags, indent=4))" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "6b8367fe-f2da-45a2-9582-d2d3c970cac5", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "! pip install tqdm" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "681532e1-bd7d-46f5-8851-51c696693638", + "metadata": {}, + "outputs": [], + "source": [ + "from tqdm import tqdm" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "01a2ddd9-3a5c-4cd7-ba97-366fcae424ae", + "metadata": {}, + "outputs": [], + "source": [ + "available_docker_images = set()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "93cc99f3-baa2-4611-b501-347735045e46", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Determining available Docker images: 100%|█| 9/9 [\n" + ] + } + ], + "source": [ + "for ref in tqdm(refs, desc='Determining available Docker images', ncols=50):\n", + " tag = tags[ref]\n", + " images = ! docker images -q $tag\n", + " if images:\n", + " available_docker_images.add(ref)" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "c8e0cbac-dbc2-45f4-8e4c-a4deca76d20b", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Pulling Docker images: 0it [00:00, ?it/s]\n" + ] + } + ], + "source": [ + "for ref in tqdm([ref for ref in refs if ref not in available_docker_images], desc='Pulling Docker images', ncols=50):\n", + " tag = tags[ref]\n", + " _ = ! docker pull $tag\n", + " images = ! 
docker images -q $tag\n", + " if images:\n", + " available_docker_images.add(ref)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "025cac55-3d5c-4bac-9e9b-6ddc1ba9818f", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Building Docker images: 0it [00:00, ?it/s]\n" + ] + } + ], + "source": [ + "for ref in tqdm([ref for ref in refs if ref not in available_docker_images], desc='Building Docker images', ncols=50):\n", + " tag = tags[ref]\n", + " ! rm -rf markdown\n", + " _ = ! git clone https://github.com/witiko/markdown && cd markdown && git checkout $ref && docker build --pull --build-arg TEXLIVE_TAG=TL2022-historic -t $tag .\n", + " images = ! docker images -q $tag\n", + " if images:\n", + " available_docker_images.add(ref)\n", + "! rm -rf markdown" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "6fd48957-0d2f-4a7d-af8c-78b3a854112f", + "metadata": {}, + "outputs": [], + "source": [ + "assert available_docker_images == set(refs)" + ] + }, + { + "cell_type": "markdown", + "id": "9d76ba61-e07f-466c-8f89-737e5a02e099", + "metadata": {}, + "source": [ + "To determine the median times, I repeat the test five times for every version. To control for the effect of using different versions of the TeX Live distributions, I use the historic TeX Live 2022 distribution." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "95d6a816-9b92-4a75-af64-d66b1d8cceb0", + "metadata": {}, + "outputs": [], + "source": [ + "from collections import defaultdict\n", + "from itertools import product" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "30cf12b9-89c6-4de6-a228-52d50f09a96b", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "100%|█████████████| 45/45 [00:46<00:00, 1.04s/it]\n" + ] + } + ], + "source": [ + "durations_all = defaultdict(lambda: list())\n", + "parameters = list(product(range(5), refs))\n", + "for repetition, ref in tqdm(parameters, ncols=50):\n", + " tag = tags[ref]\n", + " lines = ! docker run --rm -i $tag bash -c 'time markdown-cli <<< foo'\n", + " for line in lines:\n", + " if line.startswith('real'):\n", + " _, duration = line.split()\n", + " assert len(durations_all[ref]) == repetition\n", + " durations_all[ref].append(duration)\n", + " break\n", + " else:\n", + " raise ValueError(f'Unexpected output for tag {tag}: {lines}')" + ] + }, + { + "cell_type": "markdown", + "id": "454c335b-28fa-42cf-81c2-97334d7da9f7", + "metadata": {}, + "source": [ + "## Results\n", + "In this section, I discuss the results of the experiment." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "27af9f1a-ff88-45bf-b51e-57609ae5f662", + "metadata": {}, + "outputs": [], + "source": [ + "import re" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "26cc8b15-2bfe-4832-baf5-b1f46386e561", + "metadata": {}, + "outputs": [], + "source": [ + "durations_seconds_all = dict()\n", + "for ref, durations in durations_all.items():\n", + " durations_seconds = list()\n", + " for duration in durations:\n", + " match = re.match(r'(?P<minutes>[0-9]+)m(?P<seconds>[0-9.]+)s', duration)\n", + " assert match\n", + " duration_seconds = 60 * int(match.group('minutes')) + float(match.group('seconds'))\n", + " durations_seconds.append(duration_seconds)\n", + " durations_seconds_all[ref] = durations_seconds\n", + " assert len(durations_seconds_all[ref]) == len(durations_all[ref])\n", + "assert len(durations_seconds_all) == len(durations_all)" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "80b9d16d-7e22-4565-81d6-e75652ac782a", + "metadata": {}, + "outputs": [], + "source": [ + "from statistics import median" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "2d74390f-f557-44e9-9bdc-bdfabe198cf6", + "metadata": {}, + "outputs": [], + "source": [ + "durations_seconds_median = {\n", + " ref[:8]: median(durations)\n", + " for ref, durations\n", + " in durations_seconds_all.items()\n", + "}" + ] + }, + { + "cell_type": "markdown", + "id": "572d1c39-e379-4f68-b582-ef227cd3762a", + "metadata": {}, + "source": [ + "Below, I show the median processing times for all considered versions of the Markdown package." + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "id": "92e12960-4533-4aa0-bb03-3824a8fda863", + "metadata": {}, + "outputs": [], + "source": [ + "%%capture\n", + "! 
pip install matplotlib" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "21d88566-a463-418f-9a89-547da4094773", + "metadata": {}, + "outputs": [], + "source": [ + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "d842d771-d96f-41da-9592-aecc1f8defbd", + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "plt.figure(figsize=(10, 5))\n", + "plt.plot(*zip(*durations_seconds_median.items()), marker='o', linestyle='-')\n", + "plt.title('Median time of processing a short text with different commits of the Markdown package for TeX 3.4.3')\n", + "plt.xlabel('Commit')\n", + "plt.ylabel('Median processing time (seconds)')\n", + "plt.xticks(rotation=45)\n", + "plt.tight_layout()\n", + "plt.grid()\n", + "plt.show()" + ] + }, + { + "cell_type": "markdown", + "id": "8d67d554-3ed1-480a-b464-7682776b8301", + "metadata": {}, + "source": [ + "As expected, the more than 5× slow-down is caused by PR [#416][1], which started loading `UnicodeData.txt` and constructing a parser that recognizes all Unicode punctuation.\n", + "\n", + " [1]: https://github.com/Witiko/markdown/pull/416" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/experiments/2024-08-05-speed/README.md b/experiments/2024-08-05-speed/README.md index 2c9ae03f..9dc8a4dc 100644 --- a/experiments/2024-08-05-speed/README.md +++ b/experiments/2024-08-05-speed/README.md @@ -1,6 +1,10 @@ This directory contains experimental code that measures the speed of historic versions of the Markdown package for TeX and compares them with the current version of the Markdown package. The results of the experiment are available in -the Jupyter notebook document [`speed-tests.ipynb`][1]. 
+the following Jupyter notebook documents: - [1]: speed-tests.ipynb "Measure the speed of the Markdown package across recent versions" +- [Measure the speed of the Markdown package across recent versions][1] +- [Measure the speed of the Markdown package on version 3.4.3][2] + + [1]: 01-measure-recent-versions.ipynb "Measure the speed of the Markdown package across recent versions" + [2]: 02-measure-version-3.4.3.ipynb "Measure the speed of the Markdown package on version 3.4.3"
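
The timing logic in the added notebook (find the `real` line in the output of bash `time`, convert it to seconds, take the median over repetitions) can be sketched on its own, without Docker. This is a minimal sketch rather than the notebook's code: the helper names `extract_real_duration` and `parse_duration` are hypothetical, minutes are converted to seconds before the two parts are added, and only the first sample run reuses the timings shown in the notebook output, while the other two are made-up values.

```python
import re
from statistics import median

# Duration format printed by the bash `time` builtin, e.g. "0m1.648s".
DURATION = re.compile(r'(?P<minutes>[0-9]+)m(?P<seconds>[0-9.]+)s')


def extract_real_duration(lines):
    """Return the duration string from the first line starting with 'real'."""
    for line in lines:
        if line.startswith('real'):
            _, duration = line.split()
            return duration
    raise ValueError(f'No "real" line in output: {lines}')


def parse_duration(duration):
    """Convert a string such as '1m0.500s' to a number of seconds."""
    match = DURATION.fullmatch(duration)
    if match is None:
        raise ValueError(f'Unexpected duration: {duration}')
    # Minutes are scaled to seconds before the fractional seconds are added.
    return 60 * int(match.group('minutes')) + float(match.group('seconds'))


# Captured output lines of three repetitions for one commit.
# The first run mirrors the notebook output; the other two are hypothetical.
runs = [
    ['foo', 'real\t0m1.648s', 'user\t0m1.410s', 'sys\t0m0.237s'],
    ['foo', 'real\t0m1.500s', 'user\t0m1.300s', 'sys\t0m0.200s'],
    ['foo', 'real\t0m2.000s', 'user\t0m1.700s', 'sys\t0m0.300s'],
]

seconds = [parse_duration(extract_real_duration(lines)) for lines in runs]
median_seconds = median(seconds)
print(f'median real time: {median_seconds:.3f} s')
```

Using `fullmatch` rather than `match` is a design choice of this sketch: it rejects trailing garbage in the duration string instead of silently ignoring it.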