
Block Parser: Explore a streaming lazy interface #5705

Draft

dmsnell wants to merge 2 commits into trunk
Conversation

dmsnell (Member) commented Nov 26, 2023

Augmented but not replaced entirely by the Unified Block Parser in #6381
Alternatively provided by the block-delimiter-finder in #6760

For a 3 MB document which took 5 seconds and 14 GB to parse, this version of the parser parsed it in 27 ms and 20 MB.

Initial testing

This version is slower for the home page render of `twentytwentyfour`. While unfortunate, this is not entirely surprising, as it was designed to fix the catastrophically bad cases.

[Chart: lazy block parser slower for TT4 home page render]

However, in catastrophic cases it's wildly better than trunk. The following was tested with a 15 KB / 400-line chunk of the 3 MB post mentioned above.

[Chart: lazy block parser faster for large page]

The algorithm has wild complexity, too. For the same post, including only the first 599 lines (just 23 KB of HTML), trunk consumes 520 MB of memory while this branch consumes only 10 MB. Even with no more than 15 samples, the difference is extremely significant.

[Chart: lazy block parser faster for very large page]

Testing Results

This may be slightly slower for a number of normal posts. For the home page render of `twentytwentyfour` it rendered 3.7 ms slower than trunk. However, for my catastrophically broken test post, the impact of the lazy parsing is dramatic and significant after only a single request.

The lazy parser is still slow for really pathological cases, but unlike trunk it runs within a mostly bounded memory footprint. The more pathological the post, the more dramatic the improvement in both runtime and memory use becomes. The table below compares slices of my test file against both parsers; between each test run the database was reset. The number of lines reported is how many lines of the original 3 MB document were extracted as the test post.
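To make the "streaming lazy interface" idea concrete, here is a minimal sketch in PHP. None of these names come from this PR's actual code; it only illustrates the shape of the approach: a generator yields one block-comment delimiter at a time, so the caller never has to hold a fully materialized block tree.

```php
<?php
/*
 * Illustrative sketch only (not this PR's API): scan the document with
 * strpos() and yield each block-comment delimiter as it is found. The
 * caller pulls delimiters on demand, so memory tracks the scan position
 * rather than the size or nesting depth of the parsed tree.
 */
function stream_block_delimiters( string $html ): Generator {
	$at = 0;

	while ( false !== ( $open = strpos( $html, '<!--', $at ) ) ) {
		$close = strpos( $html, '-->', $open + 4 );
		if ( false === $close ) {
			break; // Unterminated comment: the rest is freeform HTML.
		}

		$text = trim( substr( $html, $open + 4, $close - $open - 4 ) );

		// Block delimiters look like "wp:paragraph {...}" or "/wp:paragraph".
		if ( 0 === strpos( $text, 'wp:' ) || 0 === strpos( $text, '/wp:' ) ) {
			yield array(
				'offset'    => $open,
				'length'    => $close + 3 - $open,
				'delimiter' => $text,
			);
		}

		$at = $close + 3;
	}
}

foreach ( stream_block_delimiters( '<!-- wp:paragraph --><p>Hi</p><!-- /wp:paragraph -->' ) as $d ) {
	echo $d['delimiter'], "\n"; // "wp:paragraph", then "/wp:paragraph".
}
```

In a shape like this, a consumer that only needs delimiter positions never pays for work it doesn't use, which is where the bounded memory footprint comes from.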

Of particular note is that this lazy parser allows even more control over performance thresholds. Further expansion would allow setting a time limit, an upper bound on memory usage, and a content-length threshold after which the parser could pause and/or collapse the remainder of the post into a single unparsed block, essentially turning everything after the limit into a chunk of raw HTML (the static fallback render).
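As a hypothetical illustration of that knob (reusing the stream_block_delimiters() sketch above; again, not this PR's actual API), a budget check could look roughly like this:

```php
<?php
/*
 * Hypothetical sketch: stop fine-grained parsing once a time or memory
 * budget is exhausted and return the remainder as one freeform chunk of
 * raw HTML, which is what the static fallback render would emit anyway.
 */
function parse_with_budget( string $html, float $max_seconds = 0.05, int $max_bytes = 20 * 1024 * 1024 ): array {
	$started = microtime( true );
	$blocks  = array();

	foreach ( stream_block_delimiters( $html ) as $delimiter ) {
		$over_time   = ( microtime( true ) - $started ) > $max_seconds;
		$over_memory = memory_get_usage() > $max_bytes;

		if ( $over_time || $over_memory ) {
			// Collapse everything from this delimiter onward into a single unparsed block.
			$blocks[] = array(
				'blockName' => null,
				'innerHTML' => substr( $html, $delimiter['offset'] ),
			);

			return $blocks;
		}

		// Normal, fine-grained block construction would happen here.
		$blocks[] = array( 'delimiter' => $delimiter['delimiter'] );
	}

	return $blocks;
}
```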

| Lines | Max depth | Size | Trunk time | Branch time | Δ | Speedup | Trunk memory | Branch memory |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| (tt4) | 25 | | 92.6 ms | 96.1 ms | +3.78% | x0.96 | | |
| 400 | 124 | 15 KB | 1,170 ms | 624 ms | -47% | x1.88 | | |
| 600 | 187 | 23 KB | 4.82 s | 968 ms | -80% | x5.00 | 520 MB | 10 MB |
| 792 | 248 | 30 KB | 16.7 s | 3.53 s | -79% | x4.73 | 1.8 GB | 14 MB |
| 1000 | 316 | 38 KB | 118 s | 10.1 s | -91% | x11.7 | 6.5 GB | 23 MB |
| 1200 | 380 | 46 KB | 8.01 min | 22 s | -95% | x22 | 13.9 GB | 32 MB |
| 79k (all) | 25683 | 3 MB | | | | | | 30 MB |

With memory_limit=55G on a 60 GB system, I was unable to create the post on trunk via wp_insert_post(); it failed after some tens of minutes.

On this branch the post inserted after a few seconds and used a peak of 64 MB of memory.
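One way such a measurement might be taken (not necessarily how these numbers were gathered; $pathological_html is a placeholder for the 3 MB test document):

```php
<?php
// Insert the pathological document and report PHP's peak memory usage.
$post_id = wp_insert_post( array(
	'post_title'   => 'Lazy parser stress test',
	'post_content' => $pathological_html,
	'post_status'  => 'draft',
) );

printf( "inserted post %d, peak memory %.1f MB\n", $post_id, memory_get_peak_usage( true ) / 1048576 );
```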
