Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

XMLEventReader does not handle ' properly #72

Closed
mbeckerle opened this issue Jul 29, 2015 · 7 comments
Closed

XMLEventReader does not handle ' properly #72

mbeckerle opened this issue Jul 29, 2015 · 7 comments
Milestone

Comments

@mbeckerle
Copy link

(This issue migrated from https://issues.scala-lang.org/browse/SI-7796)

Of the five required predefined entities in XML, XMLEventReader does not handle ', returning an EvComment of " unknown entity apos; " instead of an EvEntityRef.

This test program:

import scala.xml.pull._
import scala.io.Source
object reader {
  val src = Source.fromString("<test>&amp;&lt;&gt;&apos;&quot;</test>")
  val er = new XMLEventReader(src)
  def main(args: Array[String]) {
    while (er.hasNext)
      Console.println(er.next)
  }
}

outputs:

EvElemStart(null,test,,)
EvEntityRef(amp)
EvEntityRef(lt)
EvEntityRef(gt)
EvComment( unknown entity apos; )
EvEntityRef(quot)
EvElemEnd(null,test)

Also, apos does not appear in XhtmlEntities.scala (may be unrelated).

Since these five entities are predefined, I would argue that the parser should auto-replace them with their equivalents so the user doesn't have to.

@Mark-L6n
Copy link

Furthermore, it does not handle other HTML entities well. There are well over 1,000 HTML entities (see list), and their values are simply tossed out with EvEntityRef. I am processing Wikipedia dumps and will encounter a wide range of them.
Why are only 4/5 entities processed, when any entity can occur in a text field? There shouldn't be a security concern, as a motivation for using entities is security.
Also, why are entities treated as an event at all? It'd be nice to have the option to disable this functionality so one could simply get all the text in a EvText() event.
Hopefully, there can be a way to either a) enable EvEntityRef to process all HTML entities or b) disable EvEntityRef events from occurring and breaking up EvText() events.
Example problem:

object testEntityErr {
  import scala.io.Source
  import scala.xml.pull._

  val testStr = "<text> &amp; &quot; &lt; &gt; </text>" +
    "<notext> &nbsp; &apos; &copy; &reg; &euro; &dollar; &cent; &pound; &yen; </notext>"
  val xml = new XMLEventReader(Source.fromString(testStr))
  for (event <- xml) {
    event match {
      case EvEntityRef(e) => println(e)
      case EvComment(_) => println(event)
      case _ => // ignore
    }
  }
}

Output:

import scala.io.Source
import scala.xml.pull._

testStr: String = <text> &amp; &quot; &lt; &gt; </text><notext> &nbsp; &apos; &copy; &reg; &euro; &dollar; &cent; &pound; &yen; </notext>

xml: scala.xml.pull.XMLEventReader = non-empty iterator
amp
quot
lt
gt
EvComment( unknown entity nbsp; )
EvComment( unknown entity apos; )
EvComment( unknown entity copy; )
EvComment( unknown entity reg; )
EvComment( unknown entity euro; )
EvComment( unknown entity dollar; )
EvComment( unknown entity cent; )
EvComment( unknown entity pound; )
EvComment( unknown entity yen; )
res0: Unit = ()

@ashawley
Copy link
Member

@Mark-L6n Thanks for sharing your comments. Would you be willing to create a separate issue with your concerns? It seems broader than the original issue.

@ashawley
Copy link
Member

It seems like apos was explicitly commented out for some historical issue about HTML4 and a Web browser called Internet Explorer -- see e1fadeb.

Nine years later, can we put back that line for apos support?

There is some commentary on the Entities representing special characters in XHTML at Wikipedia.

I couldn't find an an analysis of browser support for apos.

Presumably, all browsers support apos if the document is properly declared as XHTML. So, it seems that fixing this would require accepting that scala-xml no longer supports HTML4 or earlier?

ashawley added a commit to ashawley/scala-xml that referenced this issue Dec 15, 2015
	* src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	Escapes map.

	* src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
lespea added a commit to lespea/scala-xml that referenced this issue Mar 22, 2016
Fix scala#72 XMLEventReader does not handle &apos; properly
@lespea
Copy link

lespea commented Apr 21, 2016

It would be great if this was including in an upcoming release. I've been using #89 for a while now without issue.

@ashawley
Copy link
Member

@lespea Thanks for your feedback on #89.

I had suggested on the Release Plans wiki page to delay merging a PR on this. Maybe I'm making too big of a deal over compatability issues?

@SethTisue
Copy link
Member

SethTisue commented Apr 21, 2016

I'm certainly not an expert on the &apos; issue, but I did some Googling and it appears to me it's a historical thing that can now be dropped. It seems better to adhere to the XML standard than to an obsolete version of the HTML standard.

@ashawley
Copy link
Member

ashawley commented Apr 22, 2016

Seems fixing the XMLEventReader bug is not mutually exclusive with breaking HTML support.

The MarkupParser doesn't even use Utility.unescape but chooses instead to use the pairs hash in the imports at the top of the file:

import Utility.Escapes.{ pairs => unescape }

Then, the Utility.escape method doesn't even use the escMap hash. It's an entirely hand-coded imperative function written for performance optimization purposes. It doesn't bother to escape apostrophes. This is confirmed by the following unit test in UtitlityTest:

  @Test
  def aposEscaping: Unit = {
    val z = <bar>''</bar>
    val z1 = z.toString
    assertEquals("<bar>''</bar>", z1)
  }

ashawley added a commit to ashawley/scala-xml that referenced this issue Jun 13, 2016
	* src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	Escapes map.

	* src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Jun 13, 2016
	* src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	Escapes map.

	* src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Jun 13, 2016
	* src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	Escapes map.
	(escape): Add case for apos.

	* src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Jun 13, 2016
	* src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	Escapes map.
	(escape): Add case for apos.

	* src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Sep 14, 2016
	* src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	Escapes map.
	(escape): Add case for apos.

	* src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Sep 20, 2016
	* src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	Escapes map.
	(escape): Add case for apos.

	* src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Oct 18, 2016
	* src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	Escapes map.
	(escape): Add case for apos.

	* src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Dec 1, 2016
	* src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	Escapes map.
	(escape): Add case for apos.

	* src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Feb 4, 2017
	* src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	Escapes map.
	(escape): Add case for apos.

	* src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
fehmicansaglam added a commit to fehmicansaglam/scala-xml that referenced this issue Feb 6, 2017
* Upgrade junit-interface to 0.11
& Fixes scala#72
fehmicansaglam added a commit to fehmicansaglam/scala-xml that referenced this issue Feb 6, 2017
ashawley added a commit to ashawley/scala-xml that referenced this issue Feb 6, 2017
	* jvm/src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	Escapes map.
	(escape): Add case for apos.

	* jvm/src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Feb 6, 2017
	* jvm/src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	Escapes map.
	(escape): Add case for apos.

	* jvm/src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Feb 6, 2017
	* shared/src/main/scala/scala/xml/parsing/MarkupParser.scala:
	Import unescMap instead of pairs.

	* shared/src/main/scala/scala/xml/parsing/MarkupParserCommon.scala:
	Import unescMap instead of pairs.

	* jvm/src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	unescape map.

	* jvm/src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Feb 7, 2017
	* shared/src/main/scala/scala/xml/parsing/MarkupParser.scala:
	Import unescMap instead of pairs.

	* shared/src/main/scala/scala/xml/parsing/MarkupParserCommon.scala:
	Import unescMap instead of pairs.

	* jvm/src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	unescape map.

	* jvm/src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Feb 15, 2017
	* shared/src/main/scala/scala/xml/parsing/MarkupParser.scala:
	Import unescMap instead of pairs.

	* shared/src/main/scala/scala/xml/parsing/MarkupParserCommon.scala:
	Import unescMap instead of pairs.

	* jvm/src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	unescape map.

	* jvm/src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Apr 25, 2017
	* shared/src/main/scala/scala/xml/parsing/MarkupParser.scala:
	Import unescMap instead of pairs.

	* shared/src/main/scala/scala/xml/parsing/MarkupParserCommon.scala:
	Import unescMap instead of pairs.

	* jvm/src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	unescape map.

	* jvm/src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Apr 26, 2017
	* shared/src/main/scala/scala/xml/parsing/MarkupParser.scala:
	Import unescMap instead of pairs.

	* shared/src/main/scala/scala/xml/parsing/MarkupParserCommon.scala:
	Import unescMap instead of pairs.

	* jvm/src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	unescape map.

	* jvm/src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Apr 26, 2017
	* shared/src/main/scala/scala/xml/parsing/MarkupParser.scala:
	Import unescMap instead of pairs.

	* shared/src/main/scala/scala/xml/parsing/MarkupParserCommon.scala:
	Import unescMap instead of pairs.

	* jvm/src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	unescape map.

	* jvm/src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Apr 28, 2017
	* shared/src/main/scala/scala/xml/parsing/MarkupParser.scala:
	Import unescMap instead of pairs.

	* shared/src/main/scala/scala/xml/parsing/MarkupParserCommon.scala:
	Import unescMap instead of pairs.

	* jvm/src/main/scala/scala/xml/Utility.scala: Uncomment apos in
	unescape map.

	* jvm/src/test/scala/scala/xml/pull/XMLEventReaderTest.scala (entityRefTest):
	Unit test from Fehmi Can Saglam <fehmican.saglam@gmail.com>
ashawley added a commit to ashawley/scala-xml that referenced this issue Apr 28, 2017
	* shared/src/main/scala/scala/xml/Utility.scala: Uncomment
	apos in unescape map.

	* jvm/src/test/scala/scala/xml/pull/XMLEventReaderTest.scala
	(entityRefTest): Refactor unit test.
ashawley added a commit to ashawley/scala-xml that referenced this issue Apr 28, 2017
	* shared/src/main/scala/scala/xml/Utility.scala: Uncomment
	apos in unescape map.

	* jvm/src/test/scala/scala/xml/pull/XMLEventReaderTest.scala
	(entityRefTest): Refactor unit test.
ashawley added a commit to ashawley/scala-xml that referenced this issue May 5, 2017
	* shared/src/main/scala/scala/xml/Utility.scala: Uncomment
	apos in unescape map.

	* jvm/src/test/scala/scala/xml/pull/XMLEventReaderTest.scala
	(entityRefTest): Refactor unit test.
ashawley added a commit to ashawley/scala-xml that referenced this issue May 6, 2017
	* shared/src/main/scala/scala/xml/Utility.scala: Uncomment
	apos in unescape map.

	* jvm/src/test/scala/scala/xml/pull/XMLEventReaderTest.scala
	(entityRefTest): Refactor unit test.
ashawley added a commit that referenced this issue May 24, 2017
Fix #72 XMLEventReader does not handle &apos; properly
@ashawley ashawley added this to the 1.1.0 milestone Oct 7, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants