-
Notifications
You must be signed in to change notification settings - Fork 61
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Revised restriction on external identifiers #1368
Conversation
add restriction on external parsed and parameter entities
I think that the direction is right. It remains to create a list of public identifiers and system identifiers for document type declarations and to write some text about how RSes handle external DTD subsets. Strictly speaking, the original text further disallows declarations of external unparsed entities and notations in internal subsets. Although I don't think that they are used, I would like to mention them in the third bullet. |
@mattgarrish that works for me. Two remarks
|
Just an additional comment that came to my mind after I pushed the 'comment' button: we already have a bunch of registries and we will have to have these in future. Maybe the list of the DTD-s could be migrated to such a registry; this would make it easier to maintain them later. We can populate them with 1-2 DTD-s today and we can then move from there. |
Oh, absolutely. I just wanted to start creating something concrete that we can take to the working group, as the original issue has grown a little too immense to wade through. |
I thought of that, too, but it's been controversial because of validation. The remaining registries generally avoid defining aspects that require changes to epubcheck (or are tied to features that have largely died off, like specialized publication types). The only exception is the structural semantics vocabulary, but even for that we ultimately had to remove strict validation of the terms from the specification because they were causing problems with vendor ingestion. I like the approach of being as thorough as we can now knowing what we know about what people actually use. My impression is most new specifications aren't defining doctypes/DTDs for their content anymore, and we can always loosen epubcheck in advance of another revision if it absolutely becomes important to add a new one (the wait for vendors to implement will always be outside our control, though). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the direction is the right one, with the following caveats:
- editorial: s/parsed entity/general entity/
- editorial: I wonder how to better tie the list (or registry) of allowed doctype to the related formats
- general: while we're discussing that aspect of the spec, I think we should try to identify the conformance statements for RS processing. That can be done in a separate PR of course.
epub33/core/index.html
Outdated
<p id="confreq-xml-entities">It MUST NOT declare external <a | ||
href="https://www.w3.org/TR/2008/REC-xml-20081126/#dt-parsedent">parsed entities</a> or | ||
external <a href="https://www.w3.org/TR/2008/REC-xml-20081126/#dt-PE">parameter entities</a> | ||
in an internal DTD subset [[!XML]].</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "general entities" is more appropriate for the former, instead of just "parsed entities", which is broader.
Parameter entities are a kind of parsed entities. Here we want to disallow any external parsed entity, so we disallow both external general and parameter entities.
Note that we would still allow external unparsed entities, FWIW.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The taxonomy of entities in XML is extremely difficult to understand. I believe that I read every sentence in the first edition of XML at least 20 times. But this taxonomy is still confusing to me.
You correctly pointed out that parameter entities are parsed entities referenced within DTDs. Stand corrected. But general entities include unparsed entities as well. So, "parsed entities" are not broader than "general entities".
IMHO, internal DTD subsets must not declare
- external parameter entities
- external general parsed entities
- unparsed entities (which are always external and general)
The sum of the first three is "external entities". We can simply say that declarations of external entities are disallowed in internal DTD subsets. But this is just too difficult for everybody.
The original wording in EPUB 3 is quite good since it is free from the confusing taxonomy.
We might want to slightly change the original wording. How about this?
- External identifiers MUST NOT appear in the internal DTD subset [XML].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there terminology that could cover "all things external" so we don't have to get into a list of technical terms?
For example:
"It MUST NOT reference external resources from the internal DTD subset [[!XML]].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, we were writing at the same time! I'm fine with any wording that encompasses all we need to restrict without being overly detailed.
epub33/core/index.html
Outdated
<p id="confreq-xml-identifiers">It MAY only include a <a | ||
href="https://www.w3.org/TR/2008/REC-xml-20081126/#dt-doctype">document type | ||
declaration</a> that references the <a | ||
href="https://www.w3.org/TR/2008/REC-xml-20081126/#dt-pubid">public identifier</a> and | ||
<a href="https://www.w3.org/TR/2008/REC-xml-20081126/#dt-sysid">system identifier</a> | ||
for a format referenced in <a href="#app-identifiers-allowed"></a>, or that omits <a | ||
href="https://www.w3.org/TR/2008/REC-xml-20081126/#NT-ExternalID">external | ||
identifiers</a> [[!XML]].</p> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find the wording a bit confusing: doesn't it mean that any XML document, regardless of what it contains or how it's declared in the Package Document, can contain any of the public/system ID defined in the list?
For instance, a non-SVG XML document declaring the typical SVG doctype would conform to this statement.
Can we somehow better tie the doctype declaration to the media type of the document it is declared in?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya, I was thinking of that more in the table, but even here we could add something like "that is allowed for its media type" and then also add a media type column to the table.
So looking at the current XML conformance requirements, reading systems are already required to be non-validating processors (i.e., the current resolution is already handled). Since external identifiers are largely still banned in content, and we only allow select safe doctypes, is there actually anything new we need to restrict on the RS side? Do we mention not resolving external identifiers just to be on the safe side? |
I think so. This is kind of a obscure part of the spec, emphasizing can only improve it. |
I've added preview/diff links for the RS spec to the first comment. Please review the current state of both documents everyone and let me know if we've finally reached a happy place on this. (Let's take up the other DTDs to add in the existing open issues so we don't block this part trying to agree on those.) |
Looks good to me. If I were to nitpick, I would only reword:
to:
The only reason being that "external entity" has a proper definition in the XML spec, whereas "external identifier" has not (it's only kinda inferred from the combined definitions of system identifier and public identifier). So we can point to that definition instead of pointing to the non-terminal symbol of the grammar 😉. |
If we disallow external entities in the DTD subset, I think that the conformance section does not have to disallow fetching of external entities. But if we disallow it on the safe side, we need a note saying that external entities are already banned. |
You are correct. Another possibility is
This is technically equivalent. To me, this is easier to understand. Every user of XML understands SYSTEM and PUBLIC since they type these words often. But different guys think different things easy. |
This issue was discussed in a meeting.
View the transcriptWendy Reid: we had resolutions at the F2F, and further discussions on github… and came to a happy place Matt Garrish: #1368 Matt Garrish: where we ended up was… … we put in an allowance for a specific set of external identifiers that we have put in an appendix … we have SVG and MathML that are allowed to be used in content docs or in separate files … and we made a restriction against external entities in the internal DTD subset … so it prevents some security issues but eases authoring … so we’ll no longer force people to remove SVG DTDs from tool-generated files … I’m hoping this is it :) Ivan Herman: tech comment … in fact, the changes are such that … makes possible something that I’m not sure we really use … I can define as part of an internal entity something that won’t go out to the network … I’m not sure if this feature is in use … formal comment … there was a formal resolution on the previous version; this PR slightly changes that … can we get a formal resolution to merge, and also close a bunch of issues which were examples of the problem? Proposed resolution: Merge PR #1368 to address outstanding DTD issues, and close GH issues 1369-1373 (Wendy Reid) Garth Conboy: +1 Matt Garrish: +1 Ivan Herman: +1 Charles LaPierre: +1 Matthew Chan: +1 Wendy Reid: +1 Brady Duga: +1 George Kerscher: +1 Laura Brady: +1 Bill Kasdorf: +1 Ben Schroeter: +1 Resolution #1: Merge PR #1368 to address outstanding DTD issues, and close GH issues 1369-1373 |
Without getting into whether we should list other allowed doctypes, is there at least the possibility for agreement on this approach to updating the restriction @iherman @murata2makoto?
Fixes #1338.
Reading Systems preview
Reading Systems diff
Preview | Diff