Introduce support for styling documents with ANSI codes #25

Merged
merged 2 commits into from
Mar 6, 2024


silby
Contributor

@silby silby commented Mar 5, 2024

This PR introduces support for styling and rendering Docs with ANSI terminal control codes. The public interface uses smart constructors like “bold”, “italic”, and “underlined” to apply font styling to inner Docs. The implementation grows multiple new concepts to deal with this while still supporting the prerendering of Block elements.

NB: This is not quite complete: additional smart constructors are needed for the UI to support the full range of text styles (I haven't done anything with color yet), and there are no tests of the actual ANSI output. Some additional work and thinking may be required to support the other two main features I want the pandoc writer to support: links (probably easy-ish, they just need to be supported by Attributed a) and images (maybe harder, but I think orthogonal to the features implemented here).

ANSIFont

The ANSIFont module introduces ADTs for various text properties that are supported by terminals, and for Fonts that indicate all the properties that should apply to a particular span of text. New fonts can be constructed by applying a StyleReq to an existing font, which replaces the requested property in the original font with the requested value.
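
As a rough illustration of the idea (not the module's actual definitions), a font can be modelled as a record with one field per property, and a style request as a single property/value pair. All type, constructor, and function names below are hypothetical and chosen only for this sketch.

```haskell
-- Hypothetical sketch; the real ANSIFont module's types and names may differ.
data Weight = Normal | Bold deriving (Show, Eq)
data Shape = Roman | Italic deriving (Show, Eq)
data Underline = ULNone | ULSingle deriving (Show, Eq)

-- A Font records every property that applies to a span of text.
data Font = Font
  { fontWeight    :: Weight
  , fontShape     :: Shape
  , fontUnderline :: Underline
  } deriving (Show, Eq)

-- The font in effect when no styling has been requested.
baseFont :: Font
baseFont = Font Normal Roman ULNone

-- A StyleReq names one property and the value it should take.
data StyleReq
  = RWeight Weight
  | RShape Shape
  | RUnderline Underline
  deriving (Show, Eq)

-- Applying a request replaces only the requested property;
-- every other field of the original font is kept.
applyReq :: StyleReq -> Font -> Font
applyReq (RWeight w)    f = f { fontWeight = w }
applyReq (RShape s)     f = f { fontShape = s }
applyReq (RUnderline u) f = f { fontUnderline = u }
```

Under these assumptions, `applyReq (RShape Italic) (applyReq (RWeight Bold) baseFont)` yields a font that is both bold and italic, because the second request leaves the weight untouched.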

Attributed

The Attributed module introduces Attributed strings, which carry a Font along with an inner string type. It instantiates the HasChars class so that various features of the existing DocLayout code can somewhat seamlessly support rendering styled text.
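
To make that concrete, here is a minimal sketch of an attributed string specialised to `String`, with character-level operations that ignore fonts when measuring but preserve them when splitting. The `Span` type, the list representation, and the function names are assumptions for illustration; the module itself is polymorphic in the inner string type and goes through the `HasChars` class.

```haskell
-- Hypothetical stand-ins; not the library's actual definitions.
data Font = Font { isBold :: Bool, isItalic :: Bool } deriving (Show, Eq)

-- One run of text together with the font it should be drawn in.
data Span = Span Font String deriving (Show, Eq)

-- An attributed string is a sequence of such runs.
newtype Attributed = Attributed [Span] deriving (Show, Eq)

instance Semigroup Attributed where
  Attributed xs <> Attributed ys = Attributed (xs ++ ys)

instance Monoid Attributed where
  mempty = Attributed []

-- Fold over the characters from the right, ignoring fonts.
-- This is the kind of access a width measurement needs.
foldrChar :: (Char -> b -> b) -> b -> Attributed -> b
foldrChar f z (Attributed spans) =
  foldr (\(Span _ s) acc -> foldr f acc s) z spans

-- Number of characters, independent of styling.
charCount :: Attributed -> Int
charCount = foldrChar (\_ n -> n + 1) 0

-- Split on newlines while keeping each piece's font.
splitLines :: Attributed -> [Attributed]
splitLines (Attributed spans) = map Attributed (go [] spans)
  where
    -- cur holds the runs of the line being built, in reverse order.
    go cur [] = [reverse cur]
    go cur (Span f s : rest) =
      case break (== '\n') s of
        (pre, [])       -> go (Span f pre : cur) rest
        (pre, _ : post) -> reverse (Span f pre : cur) : go [] (Span f post : rest)
```

Because splitting never discards the font of a run, layout code that works line by line can stay unaware of styling.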

Building and rendering styled documents

Implementation outline:

  1. Consumers apply a smart constructor like bold to style a Doc. The inner doc gets wrapped in Doc’s Styled constructor, indicating the text style requested for that block.
  2. The renderer maintains a stack of Fonts. When a Styled element is encountered, its StyleReq is applied to the Font on the top of the stack and pushed, the inner document is rendered, and then the font is popped and rendering continues.
  3. The Attributed a returned by prerender can be rendered to a using renderPlain, which ignores all the font requests, or renderANSI, which adds the requisite control sequences to set the font every time the font changes.
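
The steps above can be sketched on a toy document type. This is a simplified model under stated assumptions, not the code in this PR: `Doc` here has only three constructors, the font has two properties, and `flatten` is a hypothetical stand-in for the real prerendering pass.

```haskell
-- Hypothetical, simplified model of the font-stack traversal.
data Font = Font { isBold :: Bool, isItalic :: Bool } deriving (Show, Eq)

data StyleReq = SetBold Bool | SetItalic Bool deriving (Show, Eq)

baseFont :: Font
baseFont = Font False False

applyReq :: StyleReq -> Font -> Font
applyReq (SetBold b)   f = f { isBold = b }
applyReq (SetItalic i) f = f { isItalic = i }

data Doc
  = Text String
  | Concat Doc Doc
  | Styled StyleReq Doc

-- Smart constructors: wrap the inner document in a Styled node.
bold, italic :: Doc -> Doc
bold   = Styled (SetBold True)
italic = Styled (SetItalic True)

type Span = (Font, String)

-- Walk the document with a stack of fonts. A Styled node pushes a new
-- font derived from the current top; the pop is implicit, because the
-- longer stack is only visible to the recursive call on the inner doc.
flatten :: Doc -> [Span]
flatten d = go [baseFont] d []
  where
    go stack (Text s)           acc = (top stack, s) : acc
    go stack (Concat a b)       acc = go stack a (go stack b acc)
    go stack (Styled req inner) acc = go (applyReq req (top stack) : stack) inner acc

    top (f : _) = f
    top []      = baseFont
```

With these definitions, text inside `bold (... italic ...)` comes out tagged as both bold and italic, while text outside the `bold` wrapper keeps the base font.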

Conceptually, the renderer from Doc a to Attributed a turns the nested styling requests into a linear structure where every span of text carries the full set of font attributes it should be rendered with.
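
Once the structure is linear, the two final renderers are straightforward. The sketch below assumes a list of `(Font, String)` spans and a three-property font, which is a simplification of what the PR does; the only external fact it relies on is the standard SGR parameters 0 (reset), 1 (bold), 3 (italic), and 4 (underline).

```haskell
import Data.List (intercalate)

-- Hypothetical, simplified font and span types for illustration.
data Font = Font
  { isBold       :: Bool
  , isItalic     :: Bool
  , isUnderlined :: Bool
  } deriving (Show, Eq)

baseFont :: Font
baseFont = Font False False False

type Span = (Font, String)

-- Plain rendering drops every font and concatenates the text.
renderPlain :: [Span] -> String
renderPlain = concatMap snd

-- SGR sequence selecting exactly this font: reset (0) first, then
-- switch on each property the font asks for.
sgr :: Font -> String
sgr f = "\ESC[" ++ intercalate ";" ("0" : codes) ++ "m"
  where
    codes = [c | (on, c) <- [(isBold f, "1"), (isItalic f, "3"), (isUnderlined f, "4")], on]

-- ANSI rendering emits a control sequence only when the font changes,
-- and resets at the end if the last span was styled.
renderANSI :: [Span] -> String
renderANSI = go baseFont
  where
    go cur [] = if cur == baseFont then "" else sgr baseFont
    go cur ((f, s) : rest)
      | f == cur  = s ++ go cur rest
      | otherwise = sgr f ++ s ++ go f rest
```

Resetting before every font change is the simplest correct choice here; a smarter renderer could emit only the parameters that differ from the previous font.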

The most interesting wrinkle to this implementation is that the contents of Block elements need to be prerendered by the block helper so they can be broken up into lines and filled, but we want to defer the decision of rendering plain text or ANSI-styled text until the final document is rendered. To support this, the Block constructor for a Doc a now carries an Attributed a in its lines field. When the next rendering pass merges blocks together, instead of using literal to construct Text elements carrying an a to render, it uses cook to construct Cooked elements with an Attributed a, which is copied directly to the output stream without looking at the font stack. This means that the contents of a block are only ever styled by style requests made in the inner document of the block: the contents of bold $ cblock n $ literal "x" will be printed in plain text, whereas the contents of cblock n $ bold $ literal "x" will be bold.
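
The consequence for blocks can be reproduced in a toy model. Everything here is an assumption made for illustration: the font has a single property, `Bolded` stands in for a `Styled` node, the "font stack" is reduced to the current font, and `block` skips all the line breaking and alignment the real helper performs.

```haskell
-- Hypothetical toy model of prerendered (cooked) block contents.
data Font = Font { isBold :: Bool } deriving (Show, Eq)

plain :: Font
plain = Font False

type Span = (Font, String)

data Doc
  = Text String     -- unstyled text; takes whatever font is current
  | Bolded Doc      -- stand-in for a Styled node
  | Cooked [Span]   -- prerendered spans; copied through untouched
  | Concat Doc Doc

prerender :: Doc -> [Span]
prerender = go plain
  where
    go font (Text s)     = [(font, s)]
    go _    (Bolded d)   = go (Font True) d
    go _    (Cooked sps) = sps            -- current font is ignored
    go font (Concat a b) = go font a ++ go font b

-- Stand-in for the block helper: prerender the contents right away
-- and wrap the result so later passes copy it verbatim.
block :: Doc -> Doc
block = Cooked . prerender
```

In this model `prerender (Bolded (block (Text "x")))` gives a plain span, because the block was cooked before the outer request was seen, while `prerender (block (Bolded (Text "x")))` gives a bold one.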

silby added 2 commits March 5, 2024 11:22
@jgm jgm merged commit 6bd5b83 into jgm:master Mar 6, 2024
7 checks passed
@jgm
Owner

jgm commented Mar 6, 2024

Looks good to me!

@jgm
Owner

jgm commented Mar 6, 2024

I'm running benchmarks:
after this PR:

benchmarking sample document 2
time                 11.13 μs   (11.11 μs .. 11.14 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 11.12 μs   (11.10 μs .. 11.15 μs)
std dev              67.15 ns   (35.39 ns .. 121.9 ns)

benchmarking reflow English
time                 279.5 μs   (277.6 μs .. 282.0 μs)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 279.2 μs   (278.0 μs .. 281.9 μs)
std dev              5.800 μs   (3.359 μs .. 10.51 μs)
variance introduced by outliers: 14% (moderately inflated)

benchmarking reflow Greek
time                 377.6 μs   (373.6 μs .. 382.3 μs)
                     0.999 R²   (0.999 R² .. 0.999 R²)
mean                 377.1 μs   (374.5 μs .. 380.8 μs)
std dev              10.27 μs   (7.674 μs .. 16.40 μs)
variance introduced by outliers: 20% (moderately inflated)

benchmarking tabular English
time                 5.521 ms   (5.480 ms .. 5.574 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 5.498 ms   (5.478 ms .. 5.521 ms)
std dev              67.53 μs   (52.46 μs .. 87.20 μs)

benchmarking tabular Greek
time                 7.715 ms   (7.104 ms .. 8.469 ms)
                     0.974 R²   (0.962 R² .. 0.999 R²)
mean                 7.154 ms   (7.035 ms .. 7.408 ms)
std dev              469.4 μs   (209.1 μs .. 792.7 μs)
variance introduced by outliers: 36% (moderately inflated)

benchmarking soft spaces at end of line
time                 5.711 μs   (5.676 μs .. 5.776 μs)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 5.742 μs   (5.717 μs .. 5.778 μs)
std dev              99.33 ns   (80.64 ns .. 140.8 ns)
variance introduced by outliers: 16% (moderately inflated)

vs before:

benchmarking sample document 2
time                 13.54 μs   (13.51 μs .. 13.59 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 13.56 μs   (13.52 μs .. 13.65 μs)
std dev              197.0 ns   (115.7 ns .. 285.9 ns)
variance introduced by outliers: 11% (moderately inflated)

benchmarking reflow English
time                 123.4 μs   (123.3 μs .. 123.5 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 123.3 μs   (123.2 μs .. 123.4 μs)
std dev              315.7 ns   (252.8 ns .. 379.8 ns)

benchmarking reflow Greek
time                 108.9 μs   (108.7 μs .. 109.1 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 109.6 μs   (109.4 μs .. 109.8 μs)
std dev              802.4 ns   (661.6 ns .. 980.5 ns)

benchmarking tabular English
time                 1.523 ms   (1.511 ms .. 1.542 ms)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 1.527 ms   (1.524 ms .. 1.532 ms)
std dev              13.76 μs   (9.709 μs .. 24.32 μs)

benchmarking tabular Greek
time                 1.901 ms   (1.896 ms .. 1.907 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 1.918 ms   (1.908 ms .. 1.936 ms)
std dev              44.25 μs   (26.99 μs .. 73.68 μs)
variance introduced by outliers: 11% (moderately inflated)

benchmarking soft spaces at end of line
time                 5.136 μs   (5.128 μs .. 5.146 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.133 μs   (5.127 μs .. 5.140 μs)
std dev              20.13 ns   (16.38 ns .. 27.61 ns)
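
For a quick read of the two runs, the slowdown factors can be computed from the reported `time` estimates. The snippet below only restates numbers from the criterion output above; the names `timings` and `slowdowns` are made up for this sketch.

```haskell
-- (benchmark, time before, time after), copied from the criterion
-- "time" lines above. Each pair uses the same unit, so the ratio is
-- unit-free.
timings :: [(String, Double, Double)]
timings =
  [ ("sample document 2",          13.54, 11.13)  -- μs
  , ("reflow English",             123.4, 279.5)  -- μs
  , ("reflow Greek",               108.9, 377.6)  -- μs
  , ("tabular English",            1.523, 5.521)  -- ms
  , ("tabular Greek",              1.901, 7.715)  -- ms
  , ("soft spaces at end of line", 5.136, 5.711)  -- μs
  ]

-- Ratio after/before: values above 1 are slowdowns.
slowdowns :: [(String, Double)]
slowdowns = [(name, after / before) | (name, before, after) <- timings]
```

This works out to roughly 2.3x and 3.5x for the two reflow benchmarks and roughly 3.6x and 4.1x for the two tabular ones, while sample document 2 is slightly faster (about 0.8x) and the soft-spaces benchmark is about 1.1x.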

@jgm
Owner

jgm commented Mar 6, 2024

The "reflow" and "tabular" benchmarks show a very significant slowdown, which might be reason to revert or rethink this change. I'll see if I can quantify the impact on pandoc's benchmarks.

Maybe you can see where the slowdown is coming from and do something to mitigate it?

@silby
Contributor Author

silby commented Mar 6, 2024

I’m not altogether surprised it’s slow, considering that I’m wrapping what would otherwise be a bunch of variously-optimized stringlike operations in a recursive ADT, and then doing a whole additional pass over it to render it to the final result. Sorry I didn’t think to quantify that before sending the PR.

My design here is pretty much shaped by the fact that Blocks have to be prerendered to something that is HasChars, measured with realLength, then chopped up into lines that all carry the font information with them. I didn’t want the block alignment to simply not work right in the ANSI context.

It seems like a less-naïve implementation of Attributed a should be possible, one that’s not just built out of raw cons cells that might recurse in either direction.

Maybe we can also win some performance back by replacing renderANSI = attrRender . prerender with a version that knows how to skip the Attributed step on request and just emit strings to the output. I think I’ll need another typeclass though.

@jgm
Owner

jgm commented Mar 6, 2024

OK, I can now quantify the impact on pandoc: a few selected examples from the writer benchmarks.

asciidoctor: 3.2 ms to 4.19 ms
commonmark: 5.3 ms to 7.96 ms
djot: 2.76 ms to 6.3 ms
docbook5: 3.4 ms to 14.1 ms
html: 3.64 ms to 8.28 ms
icml: 14.7 ms to 90.1 ms
latex: 2.78 ms to 7.40 ms
man: 1.85 ms to 3.64 ms
markdown: 5.67 ms to 9.00 ms
mediawiki (which doesn't use doclayout): 1.41 ms to 1.45 ms
opendocument: 8.70 ms to 27.7 ms
org: 1.76 ms to 5.37 ms

@jgm
Owner

jgm commented Mar 6, 2024

Do you mean replacing the definition of renderPlain?
It would be fine if renderANSI takes a while longer; after all, that's still better than the current situation, where you can't do ANSI at all. But I'd rather not have such a big performance regression for things that don't do ANSI.

@jgm
Owner

jgm commented Mar 6, 2024

I ran the benchmarks with profiling enabled and got the following profiteur report:

[image: profiteur report]

@silby
Contributor Author

silby commented Mar 6, 2024

My branch perf over here replaces the cons-based Attributed a with a newtype around a Seq of standalone Attr a thingies. Not moving that to a PR tonight b/c I was just bashing it til it worked, and I have no idea if using Seq is even a useful optimization, and so forth, but it does improve performance: the reflow English and reflow Greek benchmarks are comparable to baseline instead of taking twice as long, and the tabular English and tabular Greek benchmarks are 1.7-2x instead of 3.5x over baseline. If you want to pull that down now and try yourself go ahead. More to come.
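
For readers who have not looked at the branch, the shape described above might look roughly like the following. This is a guess at the general idea, not the code on that branch: the `Font` type is reduced to one field, and `snoc` with its run-merging behaviour is an addition made for this sketch. It relies only on `Data.Sequence` from the containers package, where adding an element at either end is constant time.

```haskell
import Data.Sequence (Seq, (|>))
import qualified Data.Sequence as Seq

-- Hypothetical, simplified font type.
data Font = Font { isBold :: Bool } deriving (Show, Eq)

-- A standalone run: a font plus the text it applies to.
data Attr a = Attr Font a deriving (Show, Eq)

-- An attributed string as a newtype around a Seq of runs.
newtype Attributed a = Attributed (Seq (Attr a)) deriving (Show, Eq)

instance Semigroup (Attributed a) where
  Attributed xs <> Attributed ys = Attributed (xs <> ys)

instance Monoid (Attributed a) where
  mempty = Attributed Seq.empty

-- Append one run at the right end. As an illustrative extra, the run is
-- merged into the previous one when both use the same font.
snoc :: Semigroup a => Attributed a -> Attr a -> Attributed a
snoc (Attributed xs) (Attr f s) =
  case Seq.viewr xs of
    rest Seq.:> Attr g t
      | f == g -> Attributed (rest |> Attr g (t <> s))
    _ -> Attributed (xs |> Attr f s)

-- Dropping the fonts is a single fold over the sequence.
renderPlain :: Monoid a => Attributed a -> a
renderPlain (Attributed xs) = foldMap (\(Attr _ s) -> s) xs
```

Whether `Seq` is the best container here is an open question, as noted above; the sketch only shows why a flat sequence avoids the nested recursion of a cons-based structure.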

@silby
Contributor Author

silby commented Mar 6, 2024

Looks like with the above-mentioned branch of DocLayout there are no performance regressions in the pandoc benchmarks vs baseline. I tried putting some extra tables in testsuite.txt just in case, and it's still OK. From cursory inspection there seem to be no pandoc writers that exercise the lblock/rblock/cblock functions of DocLayout as creatively as the tabular benchmarks here do. Now I'm actually done for the night.

@jgm
Owner

jgm commented Mar 6, 2024

That's really great! I'll wait for the PR.
