All objects in the AST are instances of subclasses of ASTNode, which in turn inherits from Array. First, the notable common properties, methods and behaviors shared by all nodes.
- ASTNode prototype properties
- ASTNode prototype methods
- Additional methods from Array
- Node Context and Membership
- Node Constructors
- General Nodes
- Comment
- Document
- Element
element.setAttribute(key, value)
- Attributes are Properties of Element
- ProcessingInstruction
- Declaration Nodes
All of these properties are unassignable.
The Doctype node of the Document where this node lives, if applicable.
The Document node within which this node lives, if applicable.
The index position of this node within its parent, if applicable.
Inherited from Array, but always 0 for "leaf" nodes like Comment.
The next adjacent node within the same parent node, if applicable.
The immediate parent of this node, if applicable.
The previous adjacent node within the same parent node, if applicable.
The root Element node of the Document where this node lives, if applicable.
Returns a new node of the same type with the same properties and descendents (also cloned). Note that the clone does not have a context (no parent, etc) until it is added as a child of another node. More on this below.
Variation of find that operates as a depth-first search of descendents.
Variation of filter that operates as a depth-first search of descendents.
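To illustrate what a depth-first variation of find and filter means, here is a minimal standalone sketch. It does not use the library: nodes are modelled as plain nested arrays, and the helper names (walkDeep, findDeepSketch, filterDeepSketch) are hypothetical. The pre-order visiting order (a node before its own children) is an assumption.

```javascript
// Toy model: a "node" is an array whose members are leaves or other arrays.
// Visits descendants depth-first, yielding each one before its own children.
function* walkDeep(node) {
  for (const child of node) {
    yield child;
    if (Array.isArray(child)) yield* walkDeep(child);
  }
}

// Like Array.prototype.filter, but over all descendants.
function filterDeepSketch(node, predicate) {
  return [...walkDeep(node)].filter(predicate);
}

// Like Array.prototype.find, but over all descendants.
function findDeepSketch(node, predicate) {
  for (const descendant of walkDeep(node)) {
    if (predicate(descendant)) return descendant;
  }
  return undefined;
}

const tree = ['a', ['b', ['c'], 'd'], 'e'];
const leaves = filterDeepSketch(tree, n => typeof n === 'string');
const firstSingleton = findDeepSketch(tree, n => Array.isArray(n) && n.length === 1);
```

Here leaves collects every string at any depth in document order, and firstSingleton is the innermost array holding only 'c'.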
Detaches the node from its parent. As with clone(), afterwards the element will not have a "context" unless it is added somewhere again. This can be useful to do things like mutative filtering, e.g.
.filter(node => node instanceof Comment)
.forEach(node => node.remove());
Returns an XML string. This may not be the same as the original source text. In addition to the more obvious cases of normalizing whitespace and formatting markup, there’s also the fact that entity references cannot be restored after parsing. There are likely approaches one could take to achieve this, but they are far from trivial, and I imagine they would increase the complexity of the processor by an order of magnitude. Rather I think it is reasonable to say that, in a sense, XML is a lossy format — not in terms of document content, but in terms of the specific ways that document content is delivered. Another example of this is that we do not retain knowledge of which attributes were supplied as defaults and which were explicitly included in the source text.
The formatted output tries to look good and gives you a number of options to control its appearance.
You can set a threshold above which the number of attributes on an element guarantees the attributes each get a newline of their own; by default this is 1.
In other words, given element foo with attribute bar and element baz with attributes qux and quux, the following will occur:
// attrInlineMax: 0
// attrInlineMax: 1
<foo bar="true"/>
// attrInlineMax: Infinity
<foo bar="true"/>
<baz qux="true" quux="true"/>
If true (default), attributes are sorted alphabetically by key.
If true (default), comment nodes are included.
Integer >= 0 indicating the current indentation depth. Mainly intended for internal use as serialization propagates downward. Begins at 0 by default.
If true (default), a doctype declaration is included if present.
If true (default), CDATA is formatted for clean multiline presentation in the output. This means whitespace is normalized and linebreaks may be inserted, so turn this off if CDATA whitespace should be considered significant. There are two exceptions built-in:
- The content of explicit CDATA sections is always left as it was found.
- The value of the nearest ancestral xml:space attribute is honored; if the value is "preserve", formatting is not applied.
If true (default), comment content is formatted with linebreaks if needed.
Integer >= 0 specifying the number of spaces to use per indent. Defaults to 2.
Integer >= 0 specifying the minimum number of characters available as a line’s length after the indent. Defaults to 30. See wrapColumn for more.
If true (default), processing instruction nodes are included.
If true, single quotes will be preferred for literals, e.g. attribute values and external IDs. Default is false.
Note that it is ‘preferred’ because certain cases (system or public ID literals) may demand one or the other delimiter based on their content.
If true (default), empty elements will be represented using self-closing tags.
Integer >= 0 specifying the target max line length. Defaults to 80.
The wrap column is not applied strictly. In documents with deep nesting, trying to apply the rule with only an allowance for single tokens that cannot be split could produce very awkward results. The minWidth option complements wrapColumn to address this.
For example, with the default options (80, 30), if your indentation depth is 60 characters, the effective wrapColumn ends up being 90, so that there are still at least 30 characters of width available to format within.
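The interaction between the two options can be sketched as a one-line calculation. This is an illustration inferred from the example above, not the library's actual code; the function name effectiveWrapColumn is hypothetical, and it takes the indent width in characters.

```javascript
// Assumed rule: the line may extend past wrapColumn whenever the indent
// would otherwise leave fewer than minWidth characters to format within.
function effectiveWrapColumn(indentChars, { wrapColumn = 80, minWidth = 30 } = {}) {
  return Math.max(wrapColumn, indentChars + minWidth);
}

const shallow = effectiveWrapColumn(10); // 80: the target column is honored
const deep = effectiveWrapColumn(60);    // 90: 60 of indent plus 30 guaranteed
```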
If true (default), an xml declaration is included at the start of the document. This declaration will not specify an encoding.
Returns a POJO representation of the node. Each object will have a "nodeType" property and any additional properties specific to the node. If it is a non-leaf node, it will have a "children" property as well to represent its content.
Throws an error if the node (including its descendents) is found in an invalid state. Expects to only be called when the node is within a Document.
All regular methods you expect from Array are present, except fill (which makes no sense here). Methods like map return regular arrays, not ASTNode instances.
Mutative operations that alter membership, like splice, have custom implementations on account of special needs concerning unique parentage and non-sparseness (explained below).
As in the DOM, a given node may only have a single parent. The Document context is just determined by looking up the chain, and this determines quite a bit in turn. For example, if a Document’s Doctype specifies that Element "foo" has a content type of "EMPTY", calling validate() will only apply this constraint if the element is actually a descendent of that document. Once detached, the element no longer has a definition until reattached.
Unless you have no DTD to worry about, generally you will only want to call validate() on nodes that are attached to a Document.
Unlike the DOM, I wanted to keep the API familiar and intuitive, so nodes are just arrays and you can move stuff around with push, pop, splice, indexed assignment, etc. Validity is not enforced as you do this; you must call validate() to confirm when you’re ready. This is to help keep things flexible. Trying to enforce constraints at the level of small operations would make it awkward to perform broad mutations, since the order the changes were made might end up mattering. You’d need to know that and think about it, and some cascading effects are not always intuitive, especially if you edit things like markup declarations.
What is enforced though is that ‘one parent’ rule (more precisely: one ‘slot’ on one parent).
nodeA.length; // 3
nodeA.length; // 2
Closely related is that the nodes can never be sparse, which has potential implications for doing membership mutation within a for loop (though that’s always a bad idea anyway). Although something like the following is not an error, it doesn’t make much sense:
nodeA.length; // 0
nodeA.push(nodeB, nodeB, nodeB);
nodeA.length; // 1
Most likely if someone did that, they really wanted something like:
nodeA.length; // 0
nodeA.push(nodeB, nodeB.clone(), nodeB.clone());
nodeA.length; // 3
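The ‘one slot on one parent’ rule can be modelled in a few lines. The following is a toy sketch, not the library's implementation: ToyNode is a hypothetical class, only push and remove are overridden, and the real nodes also handle splice, indexed assignment and the rest.

```javascript
class ToyNode extends Array {
  // Derived arrays (results of map, filter, splice) are plain arrays.
  static get [Symbol.species]() { return Array; }

  constructor() {
    super();
    this.parent = undefined;
  }

  // Detach from the current parent, closing the gap so it never goes sparse.
  remove() {
    if (this.parent) {
      const index = this.parent.indexOf(this);
      if (index !== -1) Array.prototype.splice.call(this.parent, index, 1);
      this.parent = undefined;
    }
    return this;
  }

  // Adding a node somewhere first vacates whatever slot it occupied before.
  push(...nodes) {
    for (const node of nodes) {
      node.remove();
      super.push(node);
      node.parent = this;
    }
    return this.length;
  }
}

const toyA = new ToyNode();
const toyB = new ToyNode();
toyA.push(new ToyNode(), new ToyNode(), new ToyNode());
toyB.push(toyA[0]);                // moves the child out of toyA
const movedLengths = [toyA.length, toyB.length];

const toyC = new ToyNode();
const repeated = new ToyNode();
toyC.push(repeated, repeated, repeated); // same node three times: one slot
```

After the move, toyA holds two children and toyB one; pushing the same node three times leaves toyC with a single member, which is why clone() is needed to get three.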
The subclasses of ASTNode share a common constructor pattern where the unique assignable properties associated with that class can be provided in an options object at construction time.
new NotationDeclaration({ name: 'foo', publicID: 'bar' });
Leaf node representing chardata (text). After parsing, the distinction between explicit CDATA sections and implicit CDATA is generally not important, but we do preserve that knowledge for the sake of consistent reserialization. The text may not be empty unless section is true (it would be a paradox, sort of).
Boolean, default false. If true, text must not contain the sequence "]]>".
String, any valid xml characters. Remember that, after parsing, entity references are replaced by their replacement text. CDATA is literal text by definition. If calling serialize(), ‘escaping’ any characters or character sequences that would be interpreted as markup is automatic if section is false.
cdata.text = 'M&Ms';
cdata.serialize(); // 'M&amp;Ms'
cdata.section = true;
cdata.serialize(); // '<![CDATA[M&Ms]]>'
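The difference between the two serialization modes can be sketched standalone. This is a simplified illustration, not the library's serializer: serializeCDataSketch is a hypothetical function, and the exact set of characters the library escapes is an assumption (here only &, < and the sequence ]]>).

```javascript
function serializeCDataSketch({ text, section = false }) {
  if (section) {
    // An explicit section is emitted literally, so it cannot contain its own terminator.
    if (text.includes(']]>')) {
      throw new Error('CDATA section text must not contain "]]>"');
    }
    return `<![CDATA[${text}]]>`;
  }
  // Implicit chardata: escape whatever would otherwise be read as markup.
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/\]\]>/g, ']]&gt;');
}

const implicit = serializeCDataSketch({ text: 'M&Ms' });
const explicit = serializeCDataSketch({ text: 'M&Ms', section: true });
```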
Leaf node representing a comment.
String, any valid xml characters — but it must not contain the illegal sequence "--".
The Document node can have, as children, any number of Comment nodes, one Element node (required) and one DoctypeDeclaration node (optional). If present, the DoctypeDeclaration must precede the Element.
Like all nodes, Document.prototype has doctype and root properties, but here, they are also assignable.
If there is a DoctypeDeclaration, it is an error if the root Element does not have the same name.
Returns the element that has an attribute of type ID whose value matches id.
The Element node can have, as children, CDATA, Comment, Element, and ProcessingInstruction nodes; however, the specific content permitted and its sequence may be constrained by a corresponding ElementDeclaration.
Reference to corresponding ElementDeclaration if applicable.
An element may have at most one attribute of type ID. Regardless of its name, if such an attribute exists, it is also available via an alias. An ID must always be unique. Note that this means across the whole document, not "per element with this ID-typed attdef".
This could probably use extra detail. An ID attdef can be thought of as exposing what’s really an intrinsic XML element feature. The attdef exposes this feature and permits customization of the name by which it will be exposed. In other words, the ID type should not be used to model data which just happens to ‘be an ID’ in some other sense that has nothing to do with identifying elements in a document.
An element with an ID attribute can be referenced by other nodes that have IDREF or IDREFS attributes. These three together comprise a significant feature of the language, I think. For one, it is one of the few ‘constructive’ things you can do with a DTD; most of a DTD can be summed up as a ‘list of what else is also an error now, actually’. I don’t think it’s very well known that XML has a native mechanism for defining relationships between nodes that are non-hierarchical (even cyclic).
String, a valid name. If there is a DoctypeDeclaration, must have a corresponding ElementDeclaration.
If an element has an attribute of type NOTATION (like ID, there can be only one such attribute per element), the property elem.notation will be a reference to the associated NotationDeclaration. Access only.
Returns map of attributes (key => value).
Returns an attribute value (as string).
Returns a node referenced by an attribute of type ENTITY.
Returns an array of referenced nodes for attributes of type ENTITIES.
Returns a set of token strings for an attribute whose type is NMTOKENS.
Returns boolean indicating whether the attribute exists.
Default attribute values from attdefs are provisioned initially. If you remove an attribute that had a default it is actually removed, not reset. You must call this method explicitly to restore the default.
Assigns value (as a string) to an attribute.
Attributes can be gotten and sotten as arbitrary additional properties of the element. In cases where the name would collide with an existing property, prefix the key with $.
If the document has a doctype, the name of each attribute must have a corresponding AttdefDeclaration to be valid, and the value must meet any constraints specified by that declaration. Unlike most markup declarations, though, AttdefDeclarations do not just declare constraints or reference stuff. They may also define the behaviors, grammar productions, and in some cases, the meanings of attributes. These can be called the ‘tokenized’ types, since what they all have in common is that their values are composed of one or more distinct tokens, not arbitrary chardata.
In the absence of a DTD, any attribute is legal and all attribute values are treated as type CDATA.
Leaf node. Processing instructions are like formatted comments that target specific agents: <?foo poop?>. I have never seen these in practice. The case most people are familiar with is PHP, but those are not really PIs, they just look like them; a real PI is parsed and included in the document itself, and has nothing in particular to do with templating. Perhaps they selected this syntax as a safety mechanism for cases where a template is accidentally rendered to the client without being processed?
String, a Name (but not "xml" — case insensitive).
String, any valid xml characters (but not including the sequence "?>").
Declaration nodes define the behaviors of validating documents.
Since support for these aspects of XML is the primary distinguishing feature of this library, I’ll go into a little extra depth here and try to badly explain what each of these actually does in addition to just the AST interface.
Several constructs related to DTD source text ‘dissolve’ during parsing. These include conditional sections and parameter references — the actual AST does not have knowledge of these, as they are essentially directives for the parser.
Content from an external DTD is treated as if the internal subset had had a parameter entity reference as its final source text: %ext_dtd;
Leaf node. An AttlistDeclaration has one or more of these as its child nodes. Each defines an attribute of the element referenced by its parent AttlistDeclaration. Somewhat surprisingly, this is the most involved kind of declaration in a DTD, by far. Some of the effects of different attribute types are explained more in Element.
String. If fixed is false and defaultValue is absent, that corresponds to the #IMPLIED keyword. If fixed is true the attribute must have this value, otherwise it is supplied only if absent.
Reference to corresponding ElementDeclaration; access only.
A Set
of strings, applicable only if type
, the strings must conform to the Nmtoken
If type is NOTATION, the strings must all correspond to the names of declared notations.
Boolean. If true, corresponds to the #FIXED keyword. This makes the default value the only permitted value, and further, it demands that it is explicitly included. I don’t know why you would ever want this, it doesn’t make any sense. When true, required is implied (this takes precedence).
Boolean, access only. True if the attribute has a defaultValue which is really the default value (i.e., not a fixed value).
Boolean, access only. True if the type is a list of space-delimited tokens.
Boolean, access only. True if the type’s grammar is NAME or NAMES.
Boolean, access only. True if the type is IDREF
Boolean, access only. True if the type is a valid type which is not CDATA.
String, any Name. If two AttdefDeclaration nodes specify the same attribute name for the same element, it is not an error, but only the first is binding.
Boolean. If true and fixed is false, corresponds to the #REQUIRED keyword. Always true if fixed is true.
String, one of: CDATA
, ID
. Note that unlike the others,
is not a keyword in the grammar. When there is no doctype, all
attributes behave as if their type is CDATA. The rest of these are tokenized types.
Returns boolean; used internally during Element validation to confirm that an attribute value fully conforms to the attribute definition.
Returns boolean; this is a subset of matchValue which confirms only that the grammar conforms. The distinction is useful because the complete check cannot be performed until the entire document has been parsed, while the grammatical check can be performed immediately.
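As an illustration of why the grammatical check can run immediately, here is a standalone sketch that needs nothing but the value and the type. It is not the library's code: conformsToGrammarSketch is a hypothetical function, the type keywords are the standard XML ones, and the Name and Nmtoken patterns are deliberately simplified to ASCII (the real productions admit many more characters).

```javascript
// ASCII-only simplifications of the XML Name and Nmtoken productions.
const NAME_SKETCH = /^[A-Za-z_:][A-Za-z0-9._:-]*$/;
const NMTOKEN_SKETCH = /^[A-Za-z0-9._:-]+$/;

const TOKEN_PATTERNS = {
  ID: NAME_SKETCH, IDREF: NAME_SKETCH, ENTITY: NAME_SKETCH,
  IDREFS: NAME_SKETCH, ENTITIES: NAME_SKETCH,
  NMTOKEN: NMTOKEN_SKETCH, NMTOKENS: NMTOKEN_SKETCH
};
const LIST_TYPES = new Set(['IDREFS', 'ENTITIES', 'NMTOKENS']);

// Checks form only. Whether an IDREF points at a real ID, or an ID is
// unique, can only be answered once the whole document is available.
function conformsToGrammarSketch(type, value) {
  if (type === 'CDATA') return true;
  const pattern = TOKEN_PATTERNS[type];
  if (!pattern) throw new Error(`Unknown attribute type: ${type}`);
  const tokens = LIST_TYPES.has(type) ? value.split(' ') : [value];
  return tokens.every(token => pattern.test(token));
}

const okTokens = conformsToGrammarSketch('NMTOKENS', 'a 1b -c');
const badRefs = conformsToGrammarSketch('IDREFS', 'a 1b');
```

A value such as "1b" is a fine Nmtoken but not a Name, so the same string passes for NMTOKENS and fails for IDREFS.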
Though logically you’d expect attribute definitions to be hierarchically part of ElementDeclaration, they are given as a distinct top-level markup declaration. Each AttlistDeclaration specifies an associated element and has one or more AttdefDeclaration children that describe individual attributes.
One ElementDeclaration may have multiple associated AttlistDeclaration nodes.
It is an error for an AttlistDeclaration to have no children. If an AttlistDeclaration specifies an element which was not previously declared, it is not an error, but the AttlistDeclaration will be ignored.
A reference to the associated ElementDeclaration, if applicable.
String, the Name of an ElementDeclaration.
A ContentSpecDeclaration is either a property of ElementDeclaration (contentSpec) or is the child of another ContentSpecDeclaration whose type is not "ELEMENT".
Boolean, access only. True if content spec is non-deterministic according to the XML spec.
String, element name. Should only be populated if type is ELEMENT.
String, may be *, +, ? or undefined.
May be "CHOICE", "SEQUENCE", or "ELEMENT". If type is "ELEMENT", there can be no children; otherwise, there must be children.
Returns an array of the names which could be the first elements matched by this contentSpec tree/subtree. This is used internally during validation to enforce what the spec calls ‘determinism’.
What is referred to as ‘deterministic’ in the XML spec did not match my personal idea of what constitutes determinism (which may just be wrong; not sure). I’ll explain it a bit in case anybody else also finds this unintuitive.
The spec provides ((b, c) | (b, d)) as an example of invalidity. If I am interpreting the text correctly, (a+,a), which also implies backtracking, is disallowed as well, and even (a*,a*), which does not imply backtracking but still makes it ambiguous which a was matched, as if that actually could matter.
This is all for the sake of SGML, but unfortunately it is marked ‘for compatibility’ rather than ‘for interoperability’, so we have to try to enforce it (the former indicates a formal requirement of XML processors while the latter items are ‘non-binding’). It’s actually a lot more work to disallow such patterns than to allow them, and the constraint seems to be an unnatural nuisance to authors, since none of these patterns are actually ambiguous in terms of meaning and effect.
Returns a RegExp pattern which is employed when validating an element against the content spec.
This was a fun realization — the spec refers to ‘language generated by the regular expression in the content model’ in discussing the application of content specifications to content. While they meant this in an abstract sense, it stuck in my head and I realized there was no need to implement the matching logic as such since we really can take advantage of a RegExp object. All we need to do is map the child elements to a testable string. Saves quite a bit of work!
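The idea of mapping child elements to a testable string can be sketched standalone. This is an illustration of the technique, not the library's actual pattern: the function names are hypothetical, content specs are modelled as plain objects using the type, name, qualifier and children fields described above, and the trailing-space delimiter is an assumption chosen because a space can never occur inside an XML name.

```javascript
const escapeRegExp = s => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Turns a content spec tree into regular expression source text.
// Each element name is followed by a space so "a" can never match inside "ab".
function specToSource(spec) {
  const qualifier = spec.qualifier || '';
  if (spec.type === 'ELEMENT') {
    return `(?:${escapeRegExp(spec.name)} )${qualifier}`;
  }
  const separator = spec.type === 'CHOICE' ? '|' : '';
  return `(?:${spec.children.map(specToSource).join(separator)})${qualifier}`;
}

// Maps the child element names to one string and tests it in a single pass.
function contentMatches(spec, childNames) {
  const pattern = new RegExp(`^${specToSource(spec)}$`);
  return pattern.test(childNames.map(name => `${name} `).join(''));
}

// Equivalent to the content model (a, (b | c)*, d?)
const exampleSpec = {
  type: 'SEQUENCE',
  children: [
    { type: 'ELEMENT', name: 'a' },
    { type: 'CHOICE', qualifier: '*', children: [
      { type: 'ELEMENT', name: 'b' },
      { type: 'ELEMENT', name: 'c' }
    ] },
    { type: 'ELEMENT', name: 'd', qualifier: '?' }
  ]
};

const valid = contentMatches(exampleSpec, ['a', 'b', 'c', 'b']);
const invalid = contentMatches(exampleSpec, ['a', 'd', 'b']);
```

Because the qualifiers *, + and ? mean the same thing in a content model and in a regular expression, they carry over unchanged.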
A DoctypeDeclaration may be the child of Document and may have any number of the following nodes as children: AttlistDeclaration, Comment, ElementDeclaration, EntityDeclaration, NotationDeclaration, and ProcessingInstruction.
If the DTD includes an external reference (publicID/systemID), this will be the node representing the external content.
String, a Name; should correspond to root Element name. Required.
String with restricted character set. If present, systemID is required.
String, any valid xml characters but ' and " cannot both appear.
Returns an array of the children of both the doctype declaration and (if applicable) the external subset, in that order.
Returns the child ElementDeclaration whose name is name.
Returns the child EntityDeclaration whose name is name.
Returns the child NotationDeclaration whose name is name.
Leaf node, though it may have a ContentSpecDeclaration as a property. ElementDeclaration is used to define an element and what kinds of content it may contain. Element attributes are declared outside of ElementDeclaration.
Boolean, access only. True if the element declaration permits CDATA child nodes.
RegExp, access only. This is used internally when validating whether an element conforms to its declared content spec. Attempted access will throw if the declaration is in an invalid state.
This may be one of the following: the string "ANY", the string "EMPTY", or a ContentSpecDeclaration node. In a document without a doctype, all elements behave as if they had the contentSpec "ANY".
Boolean. If true, contentSpec must be a ContentSpecDeclaration of type CHOICE, qualifier *, and which contains only ContentSpecDeclaration children of type ELEMENT and qualifier undefined.
String, a unique element name. It is an error to declare the same element twice.
Returns an AttdefDeclaration node associated with the element by that name.
Returns a map of all AttdefDeclaration nodes associated with the element.
Returns true if the Element passed in complies with the content constraints.
Same, but here the test is partial and confirms whether an element name
could be a valid continuation of the content so far, rather than whether the sum
of content is an entirely valid production.
Leaf node. Entities are kind of like variables. The terminology surrounding entities in XML is really, really confusing. So let’s start by trying to clear it up a little, or possibly, making it worse:
"Entity" means like ... it seems to mean practically everything in XML. For
example, a Document is an entity. But a document cannot be ‘declared’ as an
entity. An external DTD (doctype definition) is also an entity, and it does
get declared, but with <!DOCTYPE ..., not <!ENTITY ... (though a subset of
an external DTD can be declared with <!ENTITY ...).
There are two main categories of ‘entity’: internal and external. An external entity is one which is indicated by reference (‘external ID’) and must be resolved. An internal entity is one whose text is part of the entity declaration itself, provided as a string literal.
There are also ... uh ... two main categories of ‘entity’ ... parsed and unparsed. A parsed entity may be internal or external. An unparsed entity is always external. A parsed entity is an entity whose value can be interpreted as XML (or a fragment of XML), and an unparsed entity is one whose value is not going to be interpreted as XML. What might an unparsed entity be interpreted as? To answer that question, you use NotationDeclaration.
Within the category of parsed entities that can be declared with <!ENTITY there are two more types: GENERAL and PARAMETER. A general entity is one which can be referenced using &poop; and its value is "content", like CDATA or elements. A parameter entity is similar, but the syntax for references is %poop;, and these can be used inside DTDs, sometimes in really wacky ways.
95% of all the complexity of XML comes from the concept of ‘entities’.
For purposes related to EntityDeclaration, we can forget much of this and instead talk about there being three types: GENERAL, PARAMETER, and UNPARSED. Within the resulting AST, likely only UNPARSED entities will be of interest.
String, any Name. It is an error if there is a previously declared entity with the same name.
A reference to the associated NotationDeclaration, if this is an unparsed entity.
String, the Name of a previously declared notation.
String with restricted subset of legal characters. If present, this is an external entity and it does not require a value; also, if present, systemID is required.
String, any valid xml characters but it cannot contain both ' and ". If present, this is an external entity and it does not require a value. It is required if this is an unparsed entity.
One of 'GENERAL', 'PARAMETER', or 'UNPARSED'. If the type is UNPARSED, the notation name is required.
This is an array of raw codepoints rather than a string. It is only required if there is no external ID, implying an internal entity, though it will also be populated for external parsed entities which were dereferenced during parsing. Note that parsed entities are mainly an internal mechanism which loses meaning after parsing is complete; I do not recommend editing this property.
tl;dr Unlike the other markup declarations, and excluding the case of unparsed entities, which remain abstract references, entity declarations affect parsing but have no subsequent effect on the AST. Mutating them, or even simply keeping them around, is likely pointless. Assuming a parsed entity was actually used, those original references have all been dereferenced (a necessity to produce the AST to begin with) and hardcore cannot ‘rereference’ them automatically. If reserializing the document, it will have become standalone and it will no longer contain entity references.
Some markup declarations directly influence parsing itself ‘in real time’. Entity declarations are the most important of these, and the only one which is entirely unavoidable. It could be argued that, at least at the lexical level, XML without DTDs is a context-free language, but once entity references — in particular parameter entity references — enter the picture, it is most certainly not. It is like ... aggressively not. The references need to be dereferenced and their values parsed in-context at the point of reference. Since parameter entity references are less constrained than general entity references (when appearing in external entities), it is impossible to tokenize what follows "%poop;" in the following example without dereferencing and parsing it first:
<!NOTATION %poop; "foo">
The string literal there is initially ambiguous: it could equally be a system ID literal or a public ID literal, but these two productions are defined as unique lexical tokens (allegedly ‘regular’ ones, too), each with distinct rules. Is the quoted sequence below to be tokenized as a valid SystemLiteral, or have we encountered a syntax error within a PublicLiteral?
<!NOTATION %poop; "foo\bar">
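To make the difference between the two productions concrete, here is a small standalone sketch. The character set is the PubidChar production from the XML 1.0 spec; the function name isValidPublicLiteralContent is hypothetical and this is not the library's lexer.

```javascript
// PubidChar: space, CR, LF, ASCII letters and digits, and -'()+,./:=?;!*#@$_%
const PUBID_CONTENT = /^[\x20\x0D\x0Aa-zA-Z0-9\-'()+,.\/:=?;!*#@$_%]*$/;

// A system literal may contain anything except its own delimiter,
// while a public ID literal is limited to the characters above.
function isValidPublicLiteralContent(content) {
  return PUBID_CONTENT.test(content);
}

const plain = isValidPublicLiteralContent('foo');            // fine as either kind
const withBackslash = isValidPublicLiteralContent('foo\\bar'); // only a system literal
```

So the same quoted text is acceptable or a syntax error depending on whether %poop; expands to SYSTEM or PUBLIC, which cannot be known without dereferencing it first.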
So I would add to the long StackOverflow thread about whether XML is context-free (which is oddly focused on whether ID uniqueness constraints count, when there are actually a bunch of similar cases in validating XML, like content specs, enumerated attributes, etc; and of course, even in terms of well-formedness constraints, matching element tags): yes, you can apply all of these constraints after lexical parsing, so things like ID constraints don’t push it over the edge in terms of a distinct lexing phase; you could even go as far as dereferencing and parsing general entity references recursively just to keep it context-free. However, despite these possible approaches, XML can still never be context-free even at the lexical level because of parameter entity references. In theory, you might create a custom set of definitions for the regular productions that conflates the potentially ambiguous tokens found in markup declarations, then resolve the ambiguities in a second pass; but at this point we’d have departed from XML’s own definition of its grammar.
Leaf node representing a "notation". This is a bit under-explained in the XML spec proper I think. It represents a reference to something external, but unlike other external references, the thing it points to is not called an ‘entity’. The main purpose is to associate unparsed entities (references to stuff that’s not XML) with some agent that should be used to interpret them, kind of like specifying the content-type header or something. They can also be used to provide definitions for the agents to whom processing instructions should be given, but PIs do not expressly require this, while unparsed entities do. Finally they can be associated with a specific element by having an attribute with type NOTATION for that element.
Note that, when used in the context of an unparsed entity definition, the keyword is NDATA, in order to maintain the high degree of esoteric mystery that gives DTDs their fundamental character.
It’s pretty open-ended; it could be used something like this (probably wrong, eh, ymmv):
<!NOTATION ecmascript PUBLIC "node">
<!ENTITY hardcore SYSTEM "node_modules/hardcore-xml" NDATA ecmascript>
<!ELEMENT dependencies (module*)>
<!ATTLIST module
<module name="Hardcore" file="hardcore"/>
String, a unique Name. It is an error for the same notation to be declared more than once.
String, restricted character set. Unlike other nodes that have external IDs, NotationDeclaration permits a publicID without also having a systemID.
String, any valid XML chars, but not both ' and ".