Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revised restriction on external identifiers #1368

Merged
merged 10 commits into from
Nov 6, 2020
Merged

Conversation

mattgarrish
Copy link
Member

@mattgarrish mattgarrish commented Oct 26, 2020

Without getting into whether we should list other allowed doctypes, is there at least the possibility for agreement on this approach to updating the restriction @iherman @murata2makoto?

Fixes #1338.

Reading Systems preview
Reading Systems diff


Preview | Diff

add restriction on external parsed and parameter entities
@murata2makoto
Copy link
Contributor

I think that the direction is right. It remains to create a list of public identifiers and system identifiers for document type declarations and to write some text about how RSes handle external DTD subsets.

Strictly speaking, the original text further disallows declarations of external unparsed entities and notations in internal subsets. Although I don't think that they are used, I would like to mention them in the third bullet.

@iherman
Copy link
Member

iherman commented Oct 27, 2020

@mattgarrish that works for me. Two remarks

  1. Because this PR is a discussion that is based on overwriting a WG resolution, merging this PR should be subject of an explicit and documented WG resolution on a call. Cc @shiestyle @wareid @dauwhe
  2. As an overall reaction (without going into the details) on the content of Appendix B. my fundamental stand is that if there is any chance now or in the future that a well-known, stable and, possibly standard DTD might be used by any resource that might be part of the EPUB instance, then we should allow it (even if it is not in use today). Adding a DTD to that table today is painless and harmless, a process of adding a DTD later may become more complicated. (That is why I am in favour of, for example, adding MathML that can be used as an exchange format in educational or scholarly publications even if not embedded directly in the HTML content.)

@iherman
Copy link
Member

iherman commented Oct 27, 2020

Just an additional comment that came to my mind after I pushed the 'comment' button: we already have a bunch of registries and we will have to have these in future. Maybe the list of the DTD-s could be migrated to such a registry; this would make it easier to maintain them later. We can populate them with 1-2 DTD-s today and we can then move from there.

@mattgarrish
Copy link
Member Author

  1. Because this PR is a discussion that is based on overwriting a WG resolution, merging this PR should be subject of an explicit and documented WG resolution on a call.

Oh, absolutely. I just wanted to start creating something concrete that we can take to the working group, as the original issue has grown a little too immense to wade through.

@mattgarrish
Copy link
Member Author

Maybe the list of the DTD-s could be migrated to such a registry

I thought of that, too, but it's been controversial because of validation. The remaining registries generally avoid defining aspects that require changes to epubcheck (or are tied to features that have largely died off, like specialized publication types). The only exception is the structural semantics vocabulary, but even for that we ultimately had to remove strict validation of the terms from the specification because they were causing problems with vendor ingestion.

I like the approach of being as thorough as we can now knowing what we know about what people actually use. My impression is most new specifications aren't defining doctypes/DTDs for their content anymore, and we can always loosen epubcheck in advance of another revision if it absolutely becomes important to add a new one (the wait for vendors to implement will always be outside our control, though).

Copy link
Member

@rdeltour rdeltour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the direction is the right one, with the following caveats:

  • editorial: s/parsed entity/general entity/
  • editorial: I wonder how to better tie the list (or registry) of allowed doctype to the related formats
  • general: while we're discussing that aspect of the spec, I think we should try to identify the conformance statements for RS processing. That can be done in a separate PR of course.

Comment on lines 1027 to 1030
<p id="confreq-xml-entities">It MUST NOT declare external <a
href="https://www.w3.org/TR/2008/REC-xml-20081126/#dt-parsedent">parsed entities</a> or
external <a href="https://www.w3.org/TR/2008/REC-xml-20081126/#dt-PE">parameter entities</a>
in an internal DTD subset [[!XML]].</p>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "general entities" is more appropriate for the former, instead of just "parsed entities", which is broader.
Parameter entities are a kind of parsed entities. Here we want to disallow any external parsed entity, so we disallow both external general and parameter entities.

Note that we would still allow external unparsed entities, FWIW.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rdeltour

The taxonomy of entities in XML is extremely difficult to understand. I believe that I read every sentence in the first edition of XML at least 20 times. But this taxonomy is still confusing to me.

You correctly pointed out that parameter entities are parsed entities referenced within DTDs. Stand corrected. But general entities include unparsed entities as well. So, "parsed entities" are not broader than "general entities".

IMHO, internal DTD subsets must not declare

  1. external parameter entities
  2. external general parsed entities
  3. unparsed entities (which are always external and general)

The sum of the first three is "external entities". We can simply say that declarations of external entities are disallowed in internal DTD subsets. But this is just too difficult for everybody.

The original wording in EPUB 3 is quite good since it is free from the confusing taxonomy.

We might want to slightly change the original wording. How about this?

  • External identifiers MUST NOT appear in the internal DTD subset [XML].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there terminology that could cover "all things external" so we don't have to get into a list of technical terms?

For example:

"It MUST NOT reference external resources from the internal DTD subset [[!XML]].

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, we were writing at the same time! I'm fine with any wording that encompasses all we need to restrict without being overly detailed.

Comment on lines 1017 to 1024
<p id="confreq-xml-identifiers">It MAY only include a <a
href="https://www.w3.org/TR/2008/REC-xml-20081126/#dt-doctype">document type
declaration</a> that references the <a
href="https://www.w3.org/TR/2008/REC-xml-20081126/#dt-pubid">public identifier</a> and
<a href="https://www.w3.org/TR/2008/REC-xml-20081126/#dt-sysid">system identifier</a>
for a format referenced in <a href="#app-identifiers-allowed"></a>, or that omits <a
href="https://www.w3.org/TR/2008/REC-xml-20081126/#NT-ExternalID">external
identifiers</a> [[!XML]].</p>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find the wording a bit confusing: doesn't it mean that any XML document, regardless of what it contains or how it's declared in the Package Document, can contain any of the public/system ID defined in the list?

For instance, a non-SVG XML document declaring the typical SVG doctype would conform to this statement.

Can we somehow better tie the doctype declaration to the media type of the document it is declared in?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya, I was thinking of that more in the table, but even here we could add something like "that is allowed for its media type" and then also add a media type column to the table.

@mattgarrish
Copy link
Member Author

mattgarrish commented Oct 27, 2020

So looking at the current XML conformance requirements, reading systems are already required to be non-validating processors (i.e., the current resolution is already handled).

Since external identifiers are largely still banned in content, and we only allow select safe doctypes, is there actually anything new we need to restrict on the RS side? Do we mention not resolving external identifiers just to be on the safe side?

@iherman
Copy link
Member

iherman commented Oct 27, 2020

Do we mention not resolving external identifiers just to be on the safe side?

I think so. This is kind of a obscure part of the spec, emphasizing can only improve it.

@mattgarrish
Copy link
Member Author

I've added preview/diff links for the RS spec to the first comment. Please review the current state of both documents everyone and let me know if we've finally reached a happy place on this.

(Let's take up the other DTDs to add in the existing open issues so we don't block this part trying to agree on those.)

@rdeltour
Copy link
Member

I've added preview/diff links for the RS spec to the first comment. Please review the current state of both documents everyone and let me know if we've finally reached a happy place on this.

Looks good to me. If I were to nitpick, I would only reword:

External identifiers MUST NOT appear in the internal DTD subset

to:

It MUST NOT contain external entity declarations in the internal DTD subset

The only reason being that "external entity" has a proper definition in the XML spec, whereas "external identifier" has not (it's only kinda inferred from the combined definitions of system identifier and public identifier). So we can point to that definition instead of pointing to the non-terminal symbol of the grammar 😉.

@murata2makoto
Copy link
Contributor

If we disallow external entities in the DTD subset, I think that the conformance section does not have to disallow fetching of external entities. But if we disallow it on the safe side, we need a note saying that external entities are already banned.

@murata2makoto
Copy link
Contributor

@rdeltour

You are correct. Another possibility is

  • Public identifiers and system identifiers MUST NOT appear in the internal DTD subset.

This is technically equivalent. To me, this is easier to understand. Every user of XML understands SYSTEM and PUBLIC since they type these words often. But different guys think different things easy.

@iherman
Copy link
Member

iherman commented Nov 6, 2020

This issue was discussed in a meeting.

  • RESOLVED: Merge PR #1368 to address outstanding DTD issues, and close GH issues 1369-1373
View the transcript Wendy Reid: we had resolutions at the F2F, and further discussions on github
… and came to a happy place
Matt Garrish: #1368
Matt Garrish: where we ended up was…
… we put in an allowance for a specific set of external identifiers that we have put in an appendix
… we have SVG and MathML that are allowed to be used in content docs or in separate files
… and we made a restriction against external entities in the internal DTD subset
… so it prevents some security issues but eases authoring
… so we’ll no longer force people to remove SVG DTDs from tool-generated files
… I’m hoping this is it :)
Ivan Herman: tech comment
… in fact, the changes are such that
… makes possible something that I’m not sure we really use
… I can define as part of an internal entity something that won’t go out to the network
… I’m not sure if this feature is in use
… formal comment
… there was a formal resolution on the previous version; this PR slightly changes that
… can we get a formal resolution to merge, and also close a bunch of issues which were examples of the problem?
Proposed resolution: Merge PR #1368 to address outstanding DTD issues, and close GH issues 1369-1373 (Wendy Reid)
Garth Conboy: +1
Matt Garrish: +1
Ivan Herman: +1
Charles LaPierre: +1
Matthew Chan: +1
Wendy Reid: +1
Brady Duga: +1
George Kerscher: +1
Laura Brady: +1
Bill Kasdorf: +1
Ben Schroeter: +1
Resolution #1: Merge PR #1368 to address outstanding DTD issues, and close GH issues 1369-1373

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Problem with XML content requirements and external entities: make MathML and SVG1.1 invalid
4 participants