-
Notifications
You must be signed in to change notification settings - Fork 45
Some thoughts about CFIs
By Justin Hume
This document describes some thoughts and issues related to the EPUBC 3.0 specification for Canonical Fragment Identifiers (CFI), as well as some issues implementing it in a stand-alone Javascript library, now used in Readium. The rationale behind documenting these issues is to provide information that might be relevant to EPUB stakeholders, spec writers and reading system implementors. This seems especially important as this CFI library appears to be one of the first public implementations of CFI support. This is intended primarily for developers and people who have some understanding of the spec.
I'll cover a number of topics that came up during development of the CFI library. First I'll cover the CFI grammar and specification and second some practical issues related to doing things with CFIs in a reading system. General information about the implementation of the CFI library can be found in the EPUBCFI repository.
Comments are welcome, of course. I'm sure there are parts of the spec I missed, misinterpreted etc. I'm also sure others out there have better solutions for some of these issues, so please let me know if you do!
Most of the things I'll cover regarding the CFI grammar are probably what I'd classify as nit-picking, oddities and edge cases. Generally speaking, my opinion is that the CFI grammar and specification appear sound and can be implemented - although it is beyond my expertise to know how useful CFIs are as an approach to intra-EPUB referencing and indexing.
So, onto the issues...
The first two issues are related in that they deal with the "assertion," "CSV" and "parameter" described in the CFI grammar.
The "CSV" non-terminal is a part of the "assertion" and "parameter" production rules. This means that the following syntax productions are valid:
-
An id assertion can take the form: [value, value, value, value, ...]. While commas are allowed in an XML id and this string could thus be interpreted as a valid XML id, it would seem a somewhat confusing way to specify that an id assertion is simply an XML id made up of non-reserved unicode characters.
-
A text assertion can take the form: [value, value, value, value, ...]. As a text assertion specifies a string that should be found on either side of a character offset, it's not clear how a text assertion with more than two comma-separated values can be interpreted, based on how text assertions are described in the CFI specification.
-
A parameter can take the form: [;something=value, value, value, value, ...]. While it is possible that certain types of parameter may eventually take a form such as this, the only parameter described by the spec appears to be the "sided" parameter. It is unclear from the specification if this syntax structure is intended to support future parameter types.
-
A text assertion of the form [,aaa] would seem a valid way to specify a string that should follow a CFI character offset, without specifying the text that preceeds the offset (perhaps when the character offset is the first position in the text node). However, it appears that the grammar does not allow for this syntax, given the "CSV" rule, which is "value, {"," value}."
This clearly means that an id assertion could take the form [value, value; value=value, value, value, ...], while a text assertion could take the form [value]. While both of these forms would appear to be valid strings for an xml id attribute and the meaning of a "text assetion", respectively, I think it's a bit confusing for the grammar to be specified like this. It provides the strong implication that the assertions have the same semantics. In fact, the intended semantics for an id assertion and text assertion are different, as I understand it.
At present, the specification only describes semantics for a text assertion when it is part of a character offset terminus. There may be an intention here to utilize assertions for other terminus types, however, this is not mentioned in the specification. It is thus unclear as to the meaning of an assertion, when part of the other terminus types.
The "local_path" non-terminal is a bit odd in that it appears to be able to produce a string of indirection (!!!! etc.) operators as valid syntax. A string of indirection operators with assertions would also be valid. The meaning for such strings is not clear from the specification.
The following two issues relate to the grammar and other issues that arose from trying to implement the CFI specification.
1. Step references to child nodes use even and odd positive integers to specify the type of step reference
Section 3.1.1 of the CFI specification describes the semantics of a step reference to a child node. Even positive integers in a step reference point to xml elements, while odd positive integers point to collections of non-element nodes. The intention of these semantics is to ensure that "node identification is not sensitive to XML parser handling of whitespace text nodes, CDATA sections and entity references." This makes perfect sense from a logical perspective.
However, from a practical perspective, there seem to be two issues. First, the CFI grammar is able to produce valid syntax that has no meaning, given the above. For example: epubcfi(.../3/4...) specifies a step to a (non-element) text node, followed by a step to an element. While syntactically valid, this would have no meaning that I'm aware of as 1) the specification is explicit that odd integers point to non-element nodes and 2) non-element nodes are leaf nodes in XML and thus would not allow for subsequent CFI steps to child elements.
Second, while this indexing method makes node identification insensitive to xml parser behaviour on a logical level, it doesn't appear to be particularly helpful practically. Iterating through, or indexing into, an ordered set of nodes parsed from an XML document requires the developer to be aware of how a specific parser actually handles content (or lack thereof) between elements in an XML document.
The Chrome (Webkit) parser, accessed through window.DOMParser, produces text nodes for whitespace but does not create a non-element node if xml elements are adjacent "</el2." Essentially, the existence of non-element nodes between elements can change, depending on changes to the xml document that have no impact on its logical structure. My guess is that this could be problematic, as parser handling of non-element nodes in an XML document is probably not well-understood by authors and developers. For example, most people creating an xml document would probably consider the following two examples equivalent:
and
However, when using the Chrome window.DOMParser the first would produce the following list of child nodes for the "parent" element:
[textNode, elementNode, textNode, elementNode, textNode]
...where the textNodes would consist of whitepace and carriage returns.
The second example would produce the following list of child nodes:
[textNode, elementNode, elementNode, textNode]
The first case results in a list that would be consistent with the even and odd integer indexing scheme specified for CFIs; the element nodes are the even nodes and the non-element nodes are the odd nodes, assuming the index starts at 1. However, this is upset in the second case by "cosmetic" changes to the xml document - when the Chrome parser is used. Other parsers may have different behaviour (perhaps injecting empty text nodes between adjacent xml elements).
Clearly, a developer implementing the CFI specification needs to be acutely aware of how a specific xml parser behaves, and how cosmetic changes in the structure of xml elements in a document might change the parsed object representation of that document.
Given this issue, the most robust way to implement node identification using CFI steps is to explicitly count into a set of child nodes, including, excluding or inferring the position of relevant node types. As an example, if the step target is an element node (an even integer), the set of child nodes should be "filtered" to include only element nodes and the step value converted to its sequential integer representation (6 becomes 3; indicating the third element in the child set). This converted step value should then be used to index into the filtered set of xml elements.
A text-node step (an odd positive integer) has to be handled differently. As empty non-element nodes may not be created by a parser in some circumstances, their presence must be inferred to correctly index into the child list. The presence of a non-element node should be inferred in the following cases for a given set of child nodes:
- The first child in the set is an element node: As it is relevant to the CFI specification, the first node in the list should be a non-element node; increment the non-element index by 1.
- Two adjacent element nodes: A missing non-element node can be inferred here, and the non-element index should be incremented by 1.
- The last element in the list is an element node: As it is relevant to the CFI specification, there should be a non-element node in the last position; increment the non-element index by 1.
Given that the index positions of element nodes in the child set are required to infer missing non-element nodes, element nodes should NOT be filtered from the child set.
The takeaway is that a developer working with CFIs has to be very careful about how an XML parser is going to behave and how they use the step reference values (even or odd integers) to index into lists of child nodes.
In light of the discussion above, the use of even and odd positive integers in the CFI grammar to indicate step references of particular types seems somewhat odd. This approach seems to tie CFI syntax to specific assumptions about the physical structure of a document (or object) that represents parsed XML. The use of odd and even positive integers to denote each type of step suggests that it is the intention of the specification authors that developers can use the step reference values to directly index into lists of child nodes. For the reasons given above, this is probably not a very robust way to identify the intended target node. If it is not the case that the specification authors intended the step values to be used this way, it is not obvious why even and odd integers are used to denote each type of step. It could be safely assumed that every step up until the last one is a step to an element child, while the last step is an index into one of a set of text node children.
Additionally, the current grammar allows for syntactically valid productions of steps that have no obvious meaning, such as "...\2\3\4\3..." This sequence of four steps contains element steps after text node steps. As mentioned above, since non-element nodes are leaf nodes in XML, it would seem that this syntax describes a set of step references that cannot be followed.
There are a number of issues that came up related to implementing CFI support in a client-side reading system - Readium, in this case.
Generating or "following" a CFI requires that the object representation of an XML document (A DOM object in the browser context) have an identical structure to that of the xml text document contained in an .epub file. If this is not the case, the step references in the CFI will end up pointing to different elements, or text nodes, than intended. This is obvious conceptually. However, as a practical matter, reading system developers need to be careful to ensure that when they're working with CFIs, differences are accounted for between the object representation of EPUB documents and their xml text representation.
For example, when a content document is loaded in Readium, Readium injects a number of different elements into the document object so as to support other features of the reading system. This includes elements such as script tags for Readium-specific styling, elements that implement math symbol support with Mathjax and elements that have been injected to represent content at CFI locations. Once these elements have been injected, the document object representation of EPUB documents is no longer consistent with the text representation contained in the .epub file. Following a CFI, or generating one, in Readium requires that elements that are not part of the original text version of the EPUB document to be accounted for.
Related to the previous point, it would seem that very few guarantees can be made about generating and following CFIs if an EPUB includes scripted content, as scripted content can essentially change anything in the DOM.
The rationale for building a stand-alone CFI library was that it could provide both a reference implementation of CFI support for reading systems and/or so that it could be integrated in existing client-side reading systems as a proof-of-concept for using CFIs for intra-publication indexing.
Building a simple, useful library for CFI handling requires designing an API that provides support for the major abstractions related to using CFIs, and enscapsulating most of the complexity of interpretating CFIs.
Most use cases for CFIs will probably involve stepping through documents to some point in an EPUB, at which time something will be done with that part of the document. As a specific example, a CFI library method would interpret, navigate to, and inject some content at a point in a EPUB referenced by a CFI.
In the case of this example, we'd probably want an API method that takes a CFI as an argument, and some injectable content, and returns a result. The immediate complication for designing this method is that the CFI is referencing, and must navigate through, multiple resources to get to a specific part of the EPUB. This set of resources will definitely include the package document and most likely, a content document. Additional content documents, images etc. may also be referenced by the CFI through iframe or img elements, or other types detailed in the specification.
Clearly, this hypothetical API method would need to load at least some of these resources to fully interpret the CFI, inject content and return a result. Herein lies an inherent conflict between how a CFI needs to be processed and programming in the browser context. CFIs are inherently synchronous; each step must be executed in strict sequence. Programming with Javascript in the browser, especially as it relates to loading resources, in asynchronous. We typically request a resource and then go off and do something else, executing a callback when the resource has been successfully loaded. In the case of our hypothetical CFI method, the executing CFI script would have to block as soon as it made the resource request, waiting on its successful completion, before it could continue following the CFI.
This is clearly not an ideal solution because the abstraction provided by the CFI method leaks in a big way: First, a user has to know that it is going to make a blocking call for a resource, certainly not ideal in a single-threaded client-side web application. Second, the method needs to know how to make a resource request in the first place. This is likely to be specific to a particular reading system (AJAX? Local store? Something else? etc.) and may be different depending on the specific resource. The user of the method may also want to supply parameters and behaviour for timeouts if the requested resource cannot be retrieved.
The end result is that it is pretty difficult to fully abstract the interpretation of a CFI, given its multi-resource referencing and synchronous nature, at least in the context of client-side web application. I don't think this problem is unique to the creation of a stand-alone library, either. The synchronous blocking problem is going to exist in any single-threaded context. There are, of course, alternatives to this single-method approach; I'll describe the one used in the CFI library. Additionally, reading systems implemented with different technologies may have other options, as well.
The API for the CFI library was designed to provide methods that could return information about a CFI, allowing the Reading system to then execute arbitrary code to, say, load a resource. After the resource was loaded (maybe asynchronously while the calling script did something else) another CFI API method could be called to finish doing something with the CFI, such as injecting content into the retrieved content document.
This approach has the disadvantage that the user has to know more about the details of how CFIs navigate through the contents of an EPUB. It also introduces some hidden coupling between the CFI library methods, and between the values of arguments to some of the library methods. However, it seems to be a reasonable balance between these problems and those inherent in processing CFIs in the browser context, as described previously.
The takeaway is that interpreting a CFI requires multiple resources, accessed in sequence and so, reading system designers need to give thought to how CFI interpretation is going to integrate with other parts of the Reading system; parts such as resource retrieval, the object representation of EPUB documents and the rendering of documents to the screen. This is particularly true in the context of client-side web-application.