Writing Extensions
Extensions need to extend the parser, or the HTML renderer, or both. To use an extension, the builder objects can be configured with a list of extensions. Because extensions are optional, they live in separate artifacts, so additional dependencies need to be added as well.
The best way to create an extension is to start with a copy of an existing one and modify the source and tests to suit the new extension.
To track source location in the AST, all parsing is performed using BasedSequence, a special CharSequence implementation that wraps the original source character sequence with start and end offsets into the original sequence to represent its content. All subSequence() results return another BasedSequence instance with the original base sequence.
In this way the source location of any string being parsed can be obtained using the getStartOffset() and getEndOffset() methods. Any string stored in the AST has to be a subSequence() of the original source.
The fly in the ointment is that parsing unescaped text from the source is a bit more involved, since it is the escaped original which must be added to the AST. For this, methods were added to the Escaping utility class that take a BasedSequence and a ReplacedTextMapper. The returned result is a modified sequence whose contents can be mapped back to the original source using the methods of the ReplacedTextMapper object. This allows parsing of massaged text with the ability to extract the un-massaged counterpart for placement in the AST. See the AutolinkPostProcessor implementation in flexmark-ext-autolink for an example of how this is achieved in a working extension.
Similarly, when using regex matching you cannot simply take the string returned by group() but must extract a subSequence from the input using the start/end offsets for the group. Examples of this are abundant in the core parser implementation.
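The offset bookkeeping described above can be sketched without the library. TrackedSequence below is a hypothetical stand-in written for this illustration, not flexmark's BasedSequence. It only shows why sub-sequences that share a base keep their source position, and why regex matches should be taken by offset rather than through group().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for a source-tracking sequence; NOT flexmark's BasedSequence.
final class TrackedSequence implements CharSequence {
    private final String base; // the complete original source
    private final int start;   // inclusive offset into base
    private final int end;     // exclusive offset into base

    TrackedSequence(String base, int start, int end) {
        this.base = base;
        this.start = start;
        this.end = end;
    }

    static TrackedSequence of(String source) {
        return new TrackedSequence(source, 0, source.length());
    }

    int getStartOffset() { return start; }
    int getEndOffset() { return end; }

    @Override public int length() { return end - start; }
    @Override public char charAt(int index) { return base.charAt(start + index); }

    // Sub-sequences keep the same base, so offsets stay relative to the original source.
    @Override public TrackedSequence subSequence(int from, int to) {
        return new TrackedSequence(base, start + from, start + to);
    }

    @Override public String toString() { return base.substring(start, end); }

    // Collect matches as sub-sequences instead of the detached strings returned by group().
    static List<TrackedSequence> findAll(TrackedSequence input, Pattern pattern) {
        List<TrackedSequence> result = new ArrayList<>();
        Matcher matcher = pattern.matcher(input);
        while (matcher.find()) {
            result.add(input.subSequence(matcher.start(), matcher.end()));
        }
        return result;
    }
}
```

Each element returned by findAll() still reports where it sits in the full source, even when the input was itself a sub-sequence such as a single line of the document.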
This is a small price to pay for having complete source reference in the AST and ease of parsing without having to carry dedicated separate state to represent source position or use dedicated grammar tools.
Source tracking in the core was complicated by leading tab expansion and prefix removal from parsed lines, with later concatenation of these partial results for inline parsing, which must also track the original source position. This was addressed with additional BasedSequence implementation classes: PrefixedSubSequence for partially used tabs and SegmentedSequence for concatenated sequences. The result is an almost transparent propagation of source position throughout the parsing process.
If there are any missed or erroneous settings in the AST then these should be caught by tests that also validate the generated AST.
A generic options API was added to allow easy configuration of the parser, renderer and extensions. It consists of DataKey<T> instances defined by various components. Each data key defines the type of its value and a default value.
The values are accessed via the DataHolder and MutableDataHolder interfaces, with the former being a read-only container. Since the data key provides a unique identifier for the data, there is no collision for options.
The Parser.EXTENSIONS option holds a list of extensions to use for the Parser and HtmlRenderer. This allows configuring the parser and renderer with a single set of options.
To configure the parser or renderer, pass a data holder to the builder() method.
public class SomeClass {
    static final MutableDataHolder OPTIONS = new MutableDataSet()
            .set(Parser.REFERENCES_KEEP, KeepType.LAST)
            .set(HtmlRenderer.INDENT_SIZE, 2)
            .set(HtmlRenderer.PERCENT_ENCODE_URLS, true)
            .set(Parser.EXTENSIONS, Arrays.asList(TablesExtension.create()));

    static final Parser PARSER = Parser.builder(OPTIONS).build();
    static final HtmlRenderer RENDERER = HtmlRenderer.builder(OPTIONS).build();
}
In the code sample above, Parser.REFERENCES_KEEP defines the behavior of references when duplicate references are defined in the source. In this case it is configured to keep the last value, whereas the default behavior is to keep the first value.
The HtmlRenderer.INDENT_SIZE and HtmlRenderer.PERCENT_ENCODE_URLS keys define options to use for rendering. Similarly, other extension options can be added at the same time. Any options not set will default to their respective defaults as defined by their data keys.
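To make the idea of typed keys with defaults concrete, here is a minimal sketch. OptionKey, OptionSet and OptionDemo are hypothetical names invented for this illustration and are not the library's DataKey or DataHolder classes. The sketch only shows how a key can carry both the value type and the fallback value, and why key identity avoids collisions.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for illustration; these are NOT flexmark's DataKey/DataHolder classes.
final class OptionKey<T> {
    final String name;
    final T defaultValue;

    OptionKey(String name, T defaultValue) {
        this.name = name;
        this.defaultValue = defaultValue;
    }
}

final class OptionSet {
    // Keys are compared by identity, so two components using the same name do not collide.
    private final Map<OptionKey<?>, Object> values = new HashMap<>();

    <T> OptionSet set(OptionKey<T> key, T value) {
        values.put(key, value);
        return this; // allow chained calls
    }

    @SuppressWarnings("unchecked")
    <T> T get(OptionKey<T> key) {
        // The cast is safe because set() only stores a T under an OptionKey<T>.
        return values.containsKey(key) ? (T) values.get(key) : key.defaultValue;
    }
}

final class OptionDemo {
    static final OptionKey<Integer> INDENT_SIZE = new OptionKey<>("INDENT_SIZE", 0);
    static final OptionKey<Boolean> PERCENT_ENCODE_URLS = new OptionKey<>("PERCENT_ENCODE_URLS", false);

    public static void main(String[] args) {
        OptionSet options = new OptionSet().set(INDENT_SIZE, 2);
        int indent = options.get(INDENT_SIZE);             // explicitly set
        boolean encode = options.get(PERCENT_ENCODE_URLS); // falls back to the key's default
        System.out.println(indent + " " + encode);
    }
}
```

Because the key supplies the type, callers never cast, and an option that was never set simply reads as the default declared next to the key.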
All markdown element reference types should be stored using a subclass of NodeRepository<T>, as is the case for references, abbreviations and footnotes. This provides a consistent mechanism for overriding the default behavior of these references for duplicates from keep first to keep last.
By convention, data keys are defined in the extension class and, in the case of the core, in the Parser or HtmlRenderer.
BlockParserFactory instances are assumed to be stateless and will be re-used by other builders with a different set of options. If you need to store state for a document, the extension should define a data key for the class holding the extension state and use the get() or getOrCompute() method of the ParserState::getOptions() data holder to retrieve the value. The getOrCompute() method is there for convenience if you need to use more than just options to create the initial value.
The DataHolder argument passed to the DataValueFactory::create() method will be null when creating a read-only default value instance for use by the key. The class constructor should be able to handle this case seamlessly. To make it convenient to implement such classes, use the DataKey::getFrom(DataHolder) method instead of the DataHolder::get(DataKey) method to access the values of interest. The former will provide the key's default value if the data holder argument is null; the latter will fail at run time with a java.lang.ExceptionInInitializerError.
For example, the flexmark-ext-tables extension uses this method to store its options, eliminating the need to extract multiple values from the data holder on every line of every block that it tries to match to a table separator line:
class TableParserOptions {
    public final int maxHeaderRows;
    public final int minHeaderRows;
    public final boolean appendMissingColumns;
    public final boolean discardExtraColumns;
    public final boolean columnSpans;
    public final boolean headerSeparatorColumns;

    // options may be null when the key creates its default instance, hence getFrom()
    TableParserOptions(DataHolder options) {
        this.maxHeaderRows = TablesExtension.MAX_HEADER_ROWS.getFrom(options);
        this.minHeaderRows = TablesExtension.MIN_HEADER_ROWS.getFrom(options);
        this.appendMissingColumns = TablesExtension.APPEND_MISSING_COLUMNS.getFrom(options);
        this.discardExtraColumns = TablesExtension.DISCARD_EXTRA_COLUMNS.getFrom(options);
        this.columnSpans = TablesExtension.COLUMN_SPANS.getFrom(options);
        this.headerSeparatorColumns = TablesExtension.HEADER_SEPARATOR_COLUMNS.getFrom(options);
    }
}

public class TableBlockParser extends AbstractBlockParser {
    private static final DataKey<TableParserOptions> CACHED_TABLE_OPTIONS = new DataKey<>("CACHED_TABLE_OPTIONS", TableParserOptions::new);

    public static class Factory extends AbstractBlockParserFactory {
        @Override
        public BlockStart tryStart(ParserState state, MatchedBlockParser matchedBlockParser) {
            // options are computed once per document and then re-used for every line
            TableParserOptions options = state.getProperties().get(CACHED_TABLE_OPTIONS);
            // remainder of the method is omitted in this excerpt
        }
    }
}
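The null-holder case discussed above is easy to get wrong, so here is a library-free sketch of it. Holder, Key and TableSettings are hypothetical stand-ins with made-up defaults, and the real DataKey API is richer than this. The sketch assumes that a key builds its default instance once by calling its factory with a null holder, which is the behavior the text describes.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical stand-ins; flexmark's real DataKey and DataHolder have richer APIs.
final class Holder {
    final Map<Key<?>, Object> values = new HashMap<>();

    <T> Holder set(Key<T> key, T value) {
        values.put(key, value);
        return this;
    }
}

final class Key<T> {
    private final Function<Holder, T> factory;
    private final T defaultValue;

    private Key(Function<Holder, T> factory) {
        this.factory = factory;
        // The default instance is created once, with a null holder.
        this.defaultValue = factory.apply(null);
    }

    static <T> Key<T> constant(T value) { return new Key<>(holder -> value); }
    static <T> Key<T> computed(Function<Holder, T> factory) { return new Key<>(factory); }

    @SuppressWarnings("unchecked")
    T getFrom(Holder holder) {
        if (holder == null) return defaultValue; // null-safe path
        Object value = holder.values.get(this);
        if (value == null) {
            value = factory.apply(holder);       // compute once, then cache in the holder
            holder.values.put(this, value);
        }
        return (T) value;
    }
}

final class TableSettings {
    // Simple keys are declared before CACHED so they exist when its default is built.
    static final Key<Integer> MAX_ROWS = Key.constant(10);
    static final Key<Boolean> COLUMN_SPANS = Key.constant(true);
    static final Key<TableSettings> CACHED = Key.computed(TableSettings::new);

    final int maxRows;
    final boolean columnSpans;

    TableSettings(Holder options) {
        // getFrom() tolerates a null holder, so this constructor also serves the default instance.
        this.maxRows = MAX_ROWS.getFrom(options);
        this.columnSpans = COLUMN_SPANS.getFrom(options);
    }
}
```

Reading through the key rather than through the holder means the constructor needs no null check of its own, and the computed settings object is created only once per holder.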
Option data keys for the Parser:
Class | Static Field | Default Value | Description |
---|---|---|---|
Parser | BLOCK_QUOTE_PARSER | true | enable parsing of block quotes |
Parser | HEADING_PARSER | true | enable parsing of headings |
Parser | FENCED_CODE_BLOCK_PARSER | true | enable parsing of fenced code blocks |
Parser | HTML_BLOCK_PARSER | true | enable parsing of html blocks |
Parser | THEMATIC_BREAK_PARSER | true | enable parsing of thematic breaks |
Parser | LIST_BLOCK_PARSER | true | enable parsing of lists |
Parser | INDENTED_CODE_BLOCK_PARSER | true | enable parsing of indented code blocks |
Parser | REFERENCE_PARAGRAPH_PRE_PROCESSOR | true | enable parsing of reference definitions |
Parser | ASTERISK_DELIMITER_PROCESSOR | true | enable asterisk delimiter inline processing |
Parser | UNDERSCORE_DELIMITER_PROCESSOR | true | enable underscore delimiter inline processing |
Parser | REFERENCES | new repository | repository for the document's reference definitions |
Parser | REFERENCES_KEEP | KeepType.FIRST | which duplicates to keep |
Parser | EXTENSIONS | empty list | list of extensions to use for builders. Can use this option instead of passing extensions to the parser builder and renderer builder. |
Option data keys for the HtmlRenderer:
Class | Static Field | Default Value | Description |
---|---|---|---|
HtmlRenderer | SOFT_BREAK | "\n" | string to use for rendering soft breaks |
HtmlRenderer | ESCAPE_HTML | false | escape html found in the document |
HtmlRenderer | PERCENT_ENCODE_URLS | false | percent encode urls |
HtmlRenderer | INDENT_SIZE | 0 | how many spaces to use for each indent level of nested tags |
HtmlRenderer | SUPPRESS_HTML_BLOCKS | false | suppress html output for html blocks |
HtmlRenderer | SUPPRESS_INLINE_HTML | false | suppress html output for inline html |
The PhasedNodeRenderer and ParagraphPreProcessor interfaces were added with associated Builder methods for extending the renderer and the parser.
PhasedNodeRenderer allows an extension to generate HTML for various parts of the HTML document. These phases are listed in the order of their occurrence during document rendering:
HEAD_TOP
HEAD
HEAD_CSS
HEAD_SCRIPTS
HEAD_BOTTOM
BODY_TOP
BODY
BODY_BOTTOM
BODY_LOAD_SCRIPTS
BODY_SCRIPTS
The BODY phase is the standard HTML generation phase using the NodeRenderer::render(Node node) method. It is called for every node in the document.
The other phases are only called on the Document root node and only for custom renderers that implement the PhasedNodeRenderer interface, through the PhasedNodeRenderer::render(Node node, RenderingPhase phase) method.
The extension can call context.render(node) and context.renderChildren(node) during any rendering phase. The functions will process the node as they do during the BODY rendering phase. The FootnoteExtension uses the BODY_BOTTOM phase to render the footnotes referenced within the page. Similarly, a Table of Contents extension can use the BODY_TOP phase to insert the table of contents at the top of the document.
The HEAD... phases are not used by any extension but can be used to generate a full HTML document, with style sheets and scripts.
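The phase ordering can be pictured with a small, library-free sketch. Only the phase names come from the list above. Phase, PhasedRenderer and PhaseSketch are hypothetical stand-ins, and the real interface receives a Node and a rendering context rather than a StringBuilder.

```java
import java.util.List;

// Phase names follow the list above; everything else is a hypothetical stand-in.
enum Phase {
    HEAD_TOP, HEAD, HEAD_CSS, HEAD_SCRIPTS, HEAD_BOTTOM,
    BODY_TOP, BODY, BODY_BOTTOM, BODY_LOAD_SCRIPTS, BODY_SCRIPTS
}

interface PhasedRenderer {
    // Called once per phase for the document root.
    void render(Phase phase, StringBuilder out);
}

final class PhaseSketch {
    // Visit every phase in declaration order, which mirrors the order listed above.
    static String renderDocument(PhasedRenderer renderer) {
        StringBuilder out = new StringBuilder();
        for (Phase phase : Phase.values()) {
            renderer.render(phase, out);
        }
        return out.toString();
    }

    // A renderer that writes body text in BODY and collected footnotes in BODY_BOTTOM.
    static PhasedRenderer withFootnotes(List<String> footnotes) {
        return (phase, out) -> {
            if (phase == Phase.BODY) {
                out.append("<p>body</p>\n");
            } else if (phase == Phase.BODY_BOTTOM && !footnotes.isEmpty()) {
                out.append("<ol>\n");
                for (String note : footnotes) {
                    out.append("<li>").append(note).append("</li>\n");
                }
                out.append("</ol>\n");
            }
        };
    }
}
```

A renderer that ignores a phase simply writes nothing for it, which is why an extension only needs to handle the phases it cares about.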
The ParagraphPreProcessor interface allows customization of pre-processing of block elements at the time they are closed by the parser. This is done by the ParagraphParser to extract leading reference definitions from the paragraph. Special handling of the ParagraphParser block was removed from the parser and instead a generic mechanism was added to allow any BlockParser to perform similar functionality and to allow adding custom pre-processors to handle elements other than the built-in reference definitions.
Document level, extensible properties were added to allow extensions to have document level properties which are available during rendering. While parsing, these are available from the ParserState::getProperties() method of the state parameter, and during post-processing and rendering from the Document node reachable via the getDocument() method of any Node.
The DocumentParser and Document properties will also contain options passed or defined on the Parser.builder() object, in addition to any added in the process of parsing the document.
HtmlRenderer options are only available on the rendering context object.
NodeRenderer extensions should check for their options using the NodeRendererContext.getOptions() method, not the getDocument() method. If HtmlRenderer was customized with options which were not passed to Parser.Builder then these options will not be available through the document properties. The node renderer context options will contain all custom options defined for HtmlRenderer.builder() and all document properties, which will contain all options passed to the Parser.builder() plus any defined during the parsing process. If an option is customized or defined in the renderer, its value from the document will not be accessible through the context options. For these you will need to use the document available through the rendering context getDocument() method.
DataKey defines the property, its type and default value instantiation. DataHolder and MutableDataHolder interfaces are used to access or set properties, respectively.
NodeRepository is an abstract class used to create repositories for nodes: references, footnotes and abbreviations.
Since the AST now represents the source of the document not the HTML to be rendered, the text stored in the AST must be as it is in the source. This means that all un-escaping and resolving of references has to be done during the rendering phase. For example a footnote reference to an undefined footnote will be rendered as if it was a Text node, including any emphasis embedded in the footnote id. If the footnote reference is defined it will render both as expected.
Handling of the disparate end of lines used in the source must now also be done in the rendering phase. This means that text which contains end of lines must be normalized before it is rendered, since it is no longer normalized during parsing.
This extra processing is not difficult to implement since the necessary member methods were added to the BasedSequence class, which is used to represent all text in the AST.
Parser: unified options handling was added, which can also be used to selectively disable loading of core processors for greater customization.
Parser.builder() now implements MutableDataHolder so you can use get/set to customize properties.
New extension points for the parser:
- ParagraphPreProcessor is used by the ParagraphBlock to extract reference definitions from the beginning of the paragraph, but can be used by any other block for the same purpose. Any custom block pre-processors will be called first, in order of their registration. Multiple calls may result since removal of some text can expose text for another pre-processor. Block pre-processors are called until no changes to the block are made.
- InlineParserFactory is used to override the default inline parser. Only one custom inline parser factory can be set. If none are set then the default will be used.
- LinkRefProcessor is used to create custom elements that syntactically derive from link refs: [] or ![]. This will work correctly for nested [] in the element and allows for treating the leading ! as plain text if the custom element does not use it. Footnotes ([^footnote ref]) and wiki links ([[]] or [[text|link]]) are examples of such elements.
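The "called until no changes to the block are made" rule for block pre-processors can be sketched as a simple fixed-point loop. This illustrates the described behavior under simplifying assumptions and is not the parser's actual implementation. Pre-processors are modeled here as plain string functions, and PreProcessorLoop and stripLeadingLine are hypothetical names.

```java
import java.util.List;
import java.util.function.UnaryOperator;

// Hypothetical sketch of the repeat-until-stable rule; not the parser's real code.
final class PreProcessorLoop {
    // Each pre-processor returns the text left after removing what it recognizes,
    // or the input unchanged when it recognizes nothing at the start of the block.
    static String run(String paragraph, List<UnaryOperator<String>> preProcessors) {
        boolean changed;
        do {
            changed = false;
            for (UnaryOperator<String> preProcessor : preProcessors) { // registration order
                String remaining = preProcessor.apply(paragraph);
                if (!remaining.equals(paragraph)) {
                    paragraph = remaining;
                    changed = true; // removed text may expose work for another pre-processor
                }
            }
        } while (changed && !paragraph.isEmpty());
        return paragraph;
    }

    // Toy pre-processor: consumes the first line when it starts with the given prefix.
    static UnaryOperator<String> stripLeadingLine(String prefix) {
        return text -> {
            if (!text.startsWith(prefix)) return text;
            int eol = text.indexOf('\n');
            return eol < 0 ? "" : text.substring(eol + 1);
        };
    }
}
```

With one pre-processor for lines starting with [ and another for lines starting with *[, a paragraph that alternates between the two kinds of definitions is only fully consumed because the loop runs again after every change.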
Renderer: unified options handling was added. Existing configuration options were kept, but now they modify the corresponding unified property.
The renderer Builder() now has an indentSize(int) method to set the size of indentation for hierarchical tags. This is the same as setting the HtmlRenderer.INDENT_SIZE data key in options.
All the HtmlWriter methods now return this so method chaining can be used. Additionally, tag() and indentedTag() methods that take a Runnable will automatically close the tag, and un-indent after the run() method is executed. This makes seeing the HTML hierarchy easier in the rendered output.
Here is the before node renderer source:
class CustomNodeRenderer implements NodeRenderer {
    @Override
    public void visit(BlockQuote node) {
        html.line();
        html.tag("blockquote", getAttrs(node));
        html.line();
        visitChildren(node);
        html.line();
        html.tag("/blockquote");
        html.line();
    }
}
And here is the after:
class CustomNodeRenderer implements NodeRenderer {
    @Override
    public void visit(BlockQuote node) {
        html.withAttr().tagIndent("blockquote", () -> {
            context.visitChildren(node);
        });
    }
}
At the cost of increased stack use, the added benefits are:
- indenting child tags
- attributes are easier to handle since they only require setting the attributes with .attr() and using the .withAttr() call before the tag() method
- the tag is automatically closed
The previous behavior of using an explicit attribute parameter is still preserved.
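To show why a Runnable-based tag method can close and un-indent on its own, here is a miniature writer. MiniHtmlWriter is a hypothetical class for illustration only. Its method names echo the ones mentioned in this section, but the signatures are simplified assumptions and it is not the library's HtmlWriter.

```java
// Hypothetical miniature writer; NOT flexmark's HtmlWriter, signatures are simplified.
final class MiniHtmlWriter {
    private final StringBuilder out = new StringBuilder();
    private final int indentSize;
    private int depth = 0;

    MiniHtmlWriter(int indentSize) {
        this.indentSize = indentSize;
    }

    // Write one line at the current indentation level.
    private MiniHtmlWriter line(String text) {
        for (int i = 0; i < depth * indentSize; i++) {
            out.append(' ');
        }
        out.append(text).append('\n');
        return this;
    }

    // Opens the tag, indents whatever the Runnable writes, then un-indents and closes the tag.
    MiniHtmlWriter tagIndent(String name, Runnable children) {
        line("<" + name + ">");
        depth++;
        try {
            children.run();
        } finally {
            depth--; // restore indentation even if a child renderer throws
        }
        return line("</" + name + ">");
    }

    // A non-void tag with inline text, written on a line of its own.
    MiniHtmlWriter tagLine(String name, String text) {
        return line("<" + name + ">" + text + "</" + name + ">");
    }

    @Override
    public String toString() {
        return out.toString();
    }
}
```

Nesting the calls, for example a tagIndent("blockquote", ...) whose Runnable calls tagIndent("ul", ...), yields indented output without any explicit closing-tag or line() calls, at the price of one extra stack frame per nested tag.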
The indentation is useful for testing because it is easier to visually validate and correlate:
> - item 1
> - item 2
>     1. item 1
>     2. item 2
with the rendered html:
<blockquote>
  <ul>
    <li>item 1</li>
    <li>item 2
      <ol>
        <li>item 1</li>
        <li>item 2</li>
      </ol>
    </li>
  </ul>
</blockquote>
than this:
<blockquote>
<ul>
<li>item 1</li>
<li>item 2
<ol>
<li>item 1</li>
<li>item 2</li>
</ol>
</li>
</ul>
</blockquote>
Some methods of HtmlWriter were changed to be more descriptive instead of passing boolean arguments. New methods were added to allow accumulation of attributes without having to create a hash map and then invoke extendRenderingNodeAttributes():
- tagVoid() a void tag
- tagVoidLine() a void tag that should be by itself on a line, equivalent to .line().tag().line()
- tagLine() a non-void tag that should start on a new line and should have its closing tag the last one on the line.
- tagIndent() a tag that indents contained lines, takes a Runnable argument.
- attr(String name, String value) set an attribute for the next .withAttr() tag generation.
- withAttr() the next tag should take accumulated attributes as default and allow overrides by extensions.
- withCondIndent() will do an indent before an indenting child tag. Used by a tight list item, which does not normally do an indent, but if it contains other indenting tags then these should be indented.
- withCondLine() will output an EOL after the opening tag, but only if a child node produces output. Used to conditionally put parent open/close tags on separate lines and on the same line if there is no text between the tags.
❗ withCondLine() and withCondIndent() only work on tag..() functions that take a Runnable argument for handling child node output.