Matt Briggs, 10/6/2022
This document specifies the document interface for the query rules.
The ruleset document will be run when a document is validated for a specific content type. The content type is a specific document schema that has a specific purpose. The ruleset describes the required and recommended features of a document type. The ruleset will allow the system to programmatically check a markdown document and validate that conformance to the rules, and if not valid, produce a message for the user on how the markdown file can be fixed.
This specification document is more of the design. You can review the stab at implementation here.
The code is a set of Python scripts for validating markdown files used in static site builders such as docFX and Hugo.
You can read the Docs at: https://blue-forest-0a98c3610.azurestaticapps.net/
The ruleset is a JSON array. Each rule is defined with ten attributes. The rules are set of declarative assertions. They are broken into types. Metadata assertions and assertions about the body of the document. An executed rule requires as input the rule JSON declaration and a markdown file. The rule returns two values: True or False for the validity of the rule and an object found by the query.
For example, if a rule that checks for the existence of a H1 runs the query "/html/body/h1" for a markdown document. The rule will return both True and the string for the header.
You can find an example here: validationrule-examplev2.json
Markdown is an ambiguous format. For validation parsing, the validator will render markdown in a semantically unambiguous format. Since the target is a web site, the markdown will be rendered as XHTML.
The rules will run on a markdown document that will be rendered into two parts:
Part 1: Is a JSON array of the metadata.
Part2: Is an XHTML document that has been modified from the classic HTML DOM to reflect the table of contents organization of the document. A typical DOM is wrapped in HTML and then contains a header and body section. The body section contains a list of nodes that are children of the body node. In the validation XHTML schema, the body nodes reflect the outline levels of the document. That is an H3 is a child of a H2 and the H2 is a child of an H1.
A document is divided into two parts with two different modes of running a check. A validation will run a rule using an assert operation. This means it will find a value and then check if the operation is true. For example, it might check if the metadata value is ms.author == "mabrigg". This will be true or false.
Part 1: metadata. This will render the metadata as a set of key value pairs. Each key can be interrogated both for its existence and for the values.1 For example, you can check if key: ms.author exists. And you can check the value of ms.author.
Part 2: XHTML (with outline). The body of the document will be rendered as an XML tree that reflects the outline of the document. The nodes of the tree can be navigated and extracted using an XPath query. For example, you can check if the title of the document (the H1) is "A specific string" with an XPath query: /html/body/H1 == "A specific string".
Each query returns two values. It returns a bool (true/false) and the element that is found. For example, ms.author would return true, and the value of the key.
Queries can be concatenated by a reference to the query id attribute. This allows queries to build off of each other in a chain. For instance, we can check that the title metadata attribute equals the H1. Or you can grab a section of the document with one query and then run checks on it in another query.
Each rule contains the following attributes:
-
Name: a string to identify the rule.
-
Type: if this is a header (metadata query) or a body (xpath query)
-
ID: A unique identifier for the rule.
-
Query: the metadata or the xpath query. A query can reference a previous query. The query returns a value as either a string, a list of strings, or an XML collection of child/parent nodes.
-
Filter: Query filters reduce the amount of data retrieved from the query.
-
Flag: Selects the expression type for the operation. For instance, regex runs a regex check on the value(s) returned by the query.
-
Operation: Numerical operations are basic mathematical operations like addition, subtraction, multiplication, and division as well as other operations that are denoted by symbols like 'less than', 'greater than', 'equal to', 'not equal to'.
-
Value: The expected value.
-
Level: The level of the validation. Required or recommended. A required value that fails would fail the overall validation.
-
Mitigation: A text (string) explanation of how to fix the issue. The mitigation may require a mechanism to pass the identified value with the expected value and a line number where the validation fail occurred.
{
"name": "H1 must begin with tutorial",
"type:" : "body",
"id": "29",
"query": " /html/body/h1",
"filter": "/text",
"flag": "",
"operation": "regex",
"value": "^Tutorial",
"level": "Required",
"mitigation": "The H3 headings aren't numbered"
},
{
"name": "H1 must begin with tutorial",
"body",
"id": "29",
"query": "{29}",
"filter": "",
"flag": "text",
"operation": "regex",
"value": "Azure Monitor",
"level": "Required",
"mitigation": "The H3 headings aren't numbered"
}
Field | Datatype | Required | Description |
---|---|---|---|
Name | string | Yes | A unique string and name for the rule. |
ID | number | Yes and must be unique. | The alias for the rule that used when constructing the workflow. The ID must be a number and must be unique in the context of a document. |
Query | string | Yes | Header: The key for the metadata value. Body: This is the xpath query that returns a result from the document. Body rules also support references to other rules. For example: {29} as the query would run the new rule on the result of rule ID 29. |
Filter | string | No | Contains a regex. |
Flag | Enumeration: check value date type all |
Yes | Selects the modes of the check. For header: - Check will check if the value is present. Used for checking for the presence of a key in the metadata value (header). For a body query, will check against the node type in the document object model. - Value will evaluate the value of the metadata with a given value and operation. - Date will check a metadata value with a given date. Note, you can use "now" to indicate the current time. - Count will return the number of items found in the query. Text will return the text of the first item returned. - All will return the entire text of the page. |
Operation | Enumeration (See list below) | Perform an operation of the query and the value. | |
Value | string | Yes | The value used for the comparison in the assertion. Supports a ref to another rule. This will compare the return of the referenced rule to the value. For example, {1} |
Level | Enumeration: Required Suggested |
Required | This is the level of the validation. All required validation rules must pass for validation to pass. |
Fix | string | Yes | This is the message returned to the user when they run validation and a validation error is triggered. |
Comment | String | Yes |
The following operations are supported and will return a tuple of (Boolean, Value).
Operand | Name | Return | Description | Example |
---|---|---|---|---|
== | equals | Boolean | Will compare to values. | |
!= | Not equal | Boolean | Will compare to values. | |
< | Less than | Boolean | Will compare to values. | |
> | Greater than | Boolean | Will compare to values. | |
[: | Starts with | Boolean | Will check if the string occurs at the start of the item. | /html/body/h2[1] text [: Why use Azure Stack Hub? |
[] | Contains | Boolean | Will check if the string can be found within the item. | /html/body/h2[1] text [] Why use Azure Stack Hub? |
:] | End with | Boolean | Will check if the string occurs at the end of the item. | /html/body/h2[1] text :] Why use Azure Stack Hub? |
p0 | Part of speech | Boolean | Will check at the index the part of speech of the word. | /html/body/h2[1] text [1 verb |
l | Character length | Boolean | Will check if less than the length given | /html/body/h2[1] text [- 120 |
s | Sentence length | Boolean | Will check if less than the length given | /html/body/h2[1] text -] 6 |
r | regex | Boolean | Will check if the items matches a pattern. Uses Python flavored RegEx. | %%/%%/%% |
Alphabetical list of part-of-speech tags used in the Penn Treebank Project.
Token | Part of Speech |
---|---|
WRB | Wh-adverb |
WP$ | Possessive wh-pronoun |
WDT | Wh-determiner |
VBZ | Verb, 3rd person singular present |
VBP | Verb, non-3rd person singular present |
VBN | Verb, past participle |
VBG | Verb, gerund or present participle |
VBD | Verb, past tense |
SYM | Symbol |
RBS | Adverb, superlative |
RBR | Adverb, comparative |
PRP | Personal pronoun |
POS | Possessive ending |
PDT | Predeterminer |
NNS | Noun, plural |
NNP | Proper noun, singular |
JJS | Adjective, superlative |
JJR | Adjective, comparative |
WP | Wh-pronoun |
VB | Verb, base form |
UH | Interjection |
TO | to |
RP | Particle |
RB | Adverb |
NN | Noun, singular or mass |
MD | Modal |
LS | List item marker |
JJ | Adjective |
IN | Preposition or subordinating conjunction |
FW | Foreign word |
EX | Existential there |
DT | Determiner |
CD | Cardinal number |
CC | Coordinating conjunction |
PRP$ | Possessive pronoun |
NNPS | Proper noun, plural |
Alphabetical list of nodes types in the document object model.
todo
Footnotes
-
Metadata can be programmatically checked in a few ways. 1. The metadata section can be rendered as JSON. This JSON can than have JSON Queries run against it and the return value assessed. 2. The metadata section can be converted into a dictionary and then the dictionary compared with a validation dictionary. An implementation of this is the Python library Cerebus. 3. In this case, we can simply convert the metadata section into a set of key value pairs. Depending on the datatype of each value we can run operations to determine if is the expected value.
Another method of schema-based validation has suggested itself and that is using JSON schema validation. ↩