-
Notifications
You must be signed in to change notification settings - Fork 20
/
Copy pathREADME
445 lines (359 loc) · 19.4 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
[TagSoupLogo32.png] TagSoup - Just Keep On Truckin' [TagSoupLogo32.png]
Index
* [1]Introduction
* [2]Tagsoup 1.2.1 released
* [3]Taggle, a C++ port of TagSoup, available now
* [4]TagSoup 1.2 released
* [5]What TagSoup does
* [6]The TSaxon XSLT-for-HTML processor
* [7]Note: TagSoup in Java 1.1
* [8]Warning: TagSoup will not build on stock Java 5.x or 6.x!
* [9]TagSoup as a stand-alone program
* [10]SAX features and properties
* [11]Other TagSoups and related things
* [12]More information
Introduction
This is the home page of TagSoup, a SAX-compliant parser written in
Java that, instead of parsing well-formed or valid XML, parses HTML as
it is found in the wild: [13]poor, nasty and brutish, though quite
often far from short. TagSoup is designed for people who have to
process this stuff using some semblance of a rational application
design. By providing a SAX interface, it allows standard XML tools to
be applied to even the worst HTML. TagSoup also includes a command-line
processor that reads HTML files and can generate either clean HTML or
well-formed XML that is a close approximation to XHTML.
This is also the README file packaged with TagSoup.
TagSoup is free and Open Source software. As of version 1.2, it is
licensed under the [14]Apache License, Version 2.0, which allows
proprietary re-use as well as use with GPL 3.0 or GPL 2.0-or-later
projects. (If anyone needs a GPL 2.0 license for a GPL 2.0-only
project, feel free to ask.)
The TagSoup logo is courtesy Ian Leslie.
TagSoup 1.2.1 released
TagSoup 1.2.1 is a much-delayed bug fix release. The following bugs
have hopefully been repaired:
* DOCTYPE is now recognized even in lower case.
* We make sure to buffer the reader, eliminating a long-standing bug
that would crash on certain inputs, such as & followed by CR+LF.
* The HTML scanner's table is precompiled at run time for efficiency,
causing a 4x speedup on large input documents.
* ]] within a CDATA section no longer causes input to be discarded.
* Remove bogus newline after printing children of the root element.
* Allow the noscript element anywhere, the same as the script
element.
* Updated to the 2011 edition of the W3C character entity list.
[15]Download the TagSoup 1.2.1 jar file here. It's about 89K long.
[16]Download the full TagSoup 1.2.1 source here. If you don't have zip,
you can use jar to unpack it.
[17]Download the current CHANGES file here.
Taggle, a TagSoup in C++, available now
A company called [18]JezUK has released [19]Taggle, which is a straight
port of TagSoup 1.2 to C++. It's a part of [20]Arabica, a C++ XML
toolkit providing SAX, DOM, XPath, and partial XSLT. I have no
connection with JezUK (except apparently as source of inspiration).
The author says the code is alpha-quality now, so he'd appreciate lots
of testers to shake out bugs. C++ users, go to it! Having a C++ port
will be a real enhancement for TagSoup.
The code is currently in public Subversion: you can fetch it with
svn co svn://jezuk.dnsalias.net/jezuk/arabica/branches/tagsoup-port
.
TagSoup 1.2 released
There are a great many changes, most of them fixes for long-standing
bugs, in this release. Only the most important are listed here; for the
rest, see the CHANGES file in the source distribution. Very special
thanks to Jojo Dijamco, whose intensive efforts at debugging made this
release a usable upgrade rather than a useless mass of undetected bugs.
* As noted above, I have changed the license to Apache 2.0.
* The default content model for bogons (unknown elements) is now ANY
rather than EMPTY. This is a breaking change, which I have done
only because there was so much demand for it. It can be undone on
the command line with the --emptybogons switch, or programmatically
with parser.setFeature(Parser.emptyBogonsFeature, true).
* The processing of entity references in attribute values has finally
been fixed to do what browsers do. That is, a reference is only
recognized if it is properly terminated by a semicolon; otherwise
it is treated as plain text. This means that URIs like
foo?cdown=32&cup=42 are no longer seen as containing an instance of
the )U character (whose name happens to be cup).
* Several new switches have been added:
+ --doctype-system and --doctype-public force a DOCTYPE
declaration to be output and allow setting the system and
public identifiers.
+ --standalone and --version allow control of the XML
declaration that is output. (Note that TagSoup's XML output is
always version 1.0, even if you use --version=1.1.)
+ --norootbogons causes unknown elements not to be allowed as
the document root element. Instead, they are made children of
the default root element (the html element for HTML).
* The TagSoup core now supports character entities with values above
U+FFFF. As a consequence, the HTML schema now supports all 2,210
standard character entities from the [21]2007-12-14 draft of XML
Entity Definitions for Characters, except the 94 which require more
than one Unicode character to represent.
* The SAX events startPrefixMapping and endPrefixMapping are now
being reported for all cases of foreign elements and attributes.
* All bugs around newline processing on Windows should now be gone.
* A number of content models have been loosened to allow elements to
appear in new and non-standard (but commonly found) places. In
particular, tables are now allowed inside paragraphs, against the
letter of the W3C specification.
* Since the span element is intended for fine control of appearance
using CSS, it should never have been a restartable element. This
very long-standing bug has now been fixed.
* The following non-standard elements are now at least partly
supported: bgsound, blink, canvas, comment, listing, marquee, nobr,
rbc, rb, rp, rtc, rt, ruby, wbr, xmp.
* In HTML output mode, boolean attributes like checked are now output
as such, rather than in XML style as checked="checked".
* Runs of < characters such as << and <<< are now handled correctly
in text rather than being transformed into extremely bogus
start-tags.
What TagSoup does
TagSoup is designed as a parser, not a whole application; it isn't
intended to permanently clean up bad HTML, as [22]HTML Tidy does, only
to parse it on the fly. Therefore, it does not convert presentation
HTML to CSS or anything similar. It does guarantee well-structured
results: tags will wind up properly nested, default attributes will
appear appropriately, and so on.
The semantics of TagSoup are as far as practical those of actual HTML
browsers. In particular, never, never will it throw any sort of syntax
error: the TagSoup motto is [23]"Just Keep On Truckin'". But there's
much, much more. For example, if the first tag is LI, it will supply
the application with enclosing HTML, BODY, and UL tags. Why UL? Because
that's what browsers assume in this situation. For the same reason,
overlapping tags are correctly restarted whenever possible: text like:
This is <B>bold, <I>bold italic, </b>italic, </i>normal text
gets correctly rewritten as:
This is <b>bold, <i>bold italic, </i></b><i>italic, </i>normal text.
By intention, TagSoup is small and fast. It does not depend on the
existence of any framework other than SAX, and should be able to work
with any framework that can accept SAX parsers. In particular, [24]XOM
is known to work.
You can replace the low-level HTML scanner with one based on Sean
McGrath's [25]PYX format (very close to James Clark's ESIS format). You
can also supply an AutoDetector that peeks at the incoming byte stream
and guesses a character encoding for it. Otherwise, the platform
default is used. If you need an autodetector of character sets,
consider trying to adapt the [26]Mozilla one; if you succeed, let me
know.
The TSaxon XSLT-for-HTML processor
[27]I am also distributing [28]TSaxon, a repackaging of version 6.5.5
of Michael Kay's Saxon XSLT version 1.0 implementation that includes
TagSoup. TSaxon is a drop-in replacement for Saxon, and can be used to
process either HTML or XML documents with XSLT stylesheets.
Note: TagSoup in Java 1.1
If you go through the TagSoup source and replace all references to
HashMap with Hashtable and recompile, TagSoup will work fine in Java
1.1 VMs. Thanks to Thorbjørn Vinne for this discovery.
Warning: TagSoup will not build on stock Java 5.x or 6.x!
Due to a bug in the versions of Xalan shipped with Java 5.x and 6.x,
TagSoup will not build out of the box. You need to retrieve [29]Saxon
6.5.5, which does not have the bug. Unpack the zipfile in an empty
directory and copy the saxon.jar and saxon-xml-apis.jar files to
$ANT_HOME/lib. The Ant build process for TagSoup will then notice that
Saxon is available and use it instead.
In addition, if you are building on a Debian-derived distro, you will
need to install not only the ant package but the ant-optional package
as well.
TagSoup as a stand-alone program
It is possible to run TagSoup as a program by saying java -jar
tagsoup-1.2.1 [option ...] [file ...]. Files mentioned on the command
line will be parsed individually. If no files are specified, the
standard input is read.
The following options are understood:
--files
Output into individual files, with html extensions changed to
xhtml. Otherwise, all output is sent to the standard output.
--html
Output is in clean HTML: the XML declaration is suppressed, as
are end-tags for the known empty elements.
--omit-xml-declaration
The XML declaration is suppressed.
--method=html
End-tags for the known empty HTML elements are suppressed.
--doctype-system=systemid
Forces the output of a DOCTYPE declaration with the specified
systemid.
--doctype-public=publicid
Forces the output of a DOCTYPE declaration with the specified
publicid.
--version=version
Sets the version string in the XML declaration.
--standalone=[yes|no]
Sets the standalone declaration to yes or no.
--pyx
Output is in PYX format.
--pyxin
Input is in PYXoid format (need not be well-formed).
--nons
Namespaces are suppressed. Normally, all elements are in the
XHTML 1.x namespace, and all attributes are in no namespace.
--nobogons
Bogons (unknown elements) are suppressed.
--nodefaults
Default attribute values are suppressed.
--nocolons
Explicit colons in element and attribute names are changed to
underscores.
--norestart
don't restart any normally restartable elements.
--ignorable
Output whitespace in elements with element-only content.
--emptybogons
Bogons are given a content model of EMPTY rather than ANY.
--any
Bogons are given a content model of ANY rather than EMPTY
(default).
--norootbogons
Bogons are not allowed to be root elements; make them
subordinate to the root.
--lexical
Pass through HTML comments and DOCTYPE declarations. Has no
effect when output is in PYX format.
--reuse
Reuse a single instance of TagSoup parser throughout. Normally,
a new one is instantiated for each input file.
--nocdata
Change the content models of the script and style elements to
treat them as ordinary #PCDATA (text-only) elements, as in
XHTML, rather than with the special CDATA content model.
--encoding=encoding
Specify the input encoding. The default is the Java platform
default.
--output-encoding=encoding
Specify the output encoding. The default is the Java platform
default.
--help
Print help.
--version
Print the version number.
SAX features and properties
TagSoup supports the following SAX features in addition to [30]the
standard ones:
http://www.ccil.org/~cowan/tagsoup/features/ignore-bogons
A value of "true" indicates that the parser will ignore unknown
elements.
http://www.ccil.org/~cowan/tagsoup/features/bogons-empty
A value of "true" indicates that the parser will give unknown
elements a content model of EMPTY; a value of "false", a content
model of ANY.
http://www.ccil.org/~cowan/tagsoup/features/root-bogons
A value of "true" indicates that the parser will allow unknown
elements to be the root of the output document.
http://www.ccil.org/~cowan/tagsoup/features/default-attributes
A value of "true" indicates that the parser will return default
attribute values for missing attributes that have default
values.
http://www.ccil.org/~cowan/tagsoup/features/translate-colons
A value of "true" indicates that the parser will translate
colons into underscores in names.
http://www.ccil.org/~cowan/tagsoup/features/restart-elements
A value of "true" indicates that the parser will attempt to
restart the restartable elements.
http://www.ccil.org/~cowan/tagsoup/features/ignorable-whitespace
A value of "true" indicates that the parser will transmit
whitespace in element-only content via the SAX
ignorableWhitespace callback. Normally this is not done, because
HTML is an SGML application and SGML suppresses such whitespace.
http://www.ccil.org/~cowan/tagsoup/features/cdata-elements
A value of "true" indicates that the parser will process the
script and style elements (or any elements with type='cdata' in
the TSSL schema) as SGML CDATA elements (that is, no markup is
recognized except the matching end-tag).
TagSoup supports the following SAX properties in addition to [31]the
standard ones:
http://www.ccil.org/~cowan/tagsoup/properties/scanner
Specifies the Scanner object this parser uses.
http://www.ccil.org/~cowan/tagsoup/properties/schema
Specifies the Schema object this parser uses.
http://www.ccil.org/~cowan/tagsoup/properties/auto-detector
Specifies the AutoDetector (for encoding detection) this parser
uses.
Other TagSoups and related things
[32]TagSoup is written in [33]the world's finest imperative programming
language, as opposed to my TagSoup, which is written in [34]perhaps the
world's most widely used imperative programming language. As far as I
can make out, TagSoup only lexes its input, and does not attempt to
balance tags in the style of my TagSoup.
[35]BeautifulSoup is closer to my TagSoup, but is written in Python and
returns a parse tree. I believe its heuristics are hard-coded for HTML.
There is a port to Ruby called [36]RubyfulSoup.
There are a variety of other HTML SAX parsers written in Java, notably
[37]NekoHTML, [38]JTidy (a port of the C library and tool [39]HTML
Tidy), and [40]HTML Parser. All have their good and bad points: the
general view around the Web seems to be that TagSoup is the slowest,
but also the most robust and reliable.
Finally, there is a full port of my TagSoup to C++, but unfortunately
it is currently trapped inside IBM. The process to release it as Open
Source is under way, and I hope to feature it here some time soon.
More information
I gave a presentation (a nocturne, so it's not on the schedule) at
[41]Extreme Markup Languages 2004 about TagSoup, updated from the one
presented in 2002 at the New York City XML SIG and at XML 2002. This is
the main high-level documentation about how TagSoup works. Formats:
[42]OpenDocument [43]Powerpoint [44]PDF.
I also had people add [45]"evil" HTML to a large poster so that I could
[46]clean it up; View Source is probably more useful than ordinary
browsing. The original instructions were:
SOUPE DE BALISES (BE EVIL)!
Ecritez une balise ouvrante (sans attributs)
ou fermante HTML ici, s.v.p.
There is a [47]tagsoup-friends mailing list hosted at [48]Yahoo Groups.
You can [49]join via the Web, or by sending a blank email to
[50]tagsoup-friends-subscribe@yahoogroups.com. The [51]archives are
open to all.
Online TagSoup processing for publicly accessible HTML documents is now
[52]available courtesy of Leigh Dodds.
References
1. file://localhost/home/cowan/tagsoup/trunk/index.html#intro
2. file://localhost/home/cowan/tagsoup/trunk/index.html#1.2.1
3. file://localhost/home/cowan/tagsoup/trunk/index.html#taggle
4. file://localhost/home/cowan/tagsoup/trunk/index.html#1.2
5. file://localhost/home/cowan/tagsoup/trunk/index.html#what
6. file://localhost/home/cowan/tagsoup/trunk/index.html#tsaxon
7. file://localhost/home/cowan/tagsoup/trunk/index.html#java1.1
8. file://localhost/home/cowan/tagsoup/trunk/index.html#warning
9. file://localhost/home/cowan/tagsoup/trunk/index.html#program
10. file://localhost/home/cowan/tagsoup/trunk/index.html#properties
11. file://localhost/home/cowan/tagsoup/trunk/index.html#other
12. file://localhost/home/cowan/tagsoup/trunk/index.html#more
13. http://oregonstate.edu/instruct/phl302/texts/hobbes/leviathan-c.html
14. http://opensource.org/licenses/apache2.0.php
15. file://localhost/home/cowan/tagsoup/trunk/tagsoup-1.2.1.jar
16. file://localhost/home/cowan/tagsoup/trunk/tagsoup-1.2.1-src.zip
17. file://localhost/home/cowan/tagsoup/trunk/CHANGES
18. http://www.jezuk.co.uk/
19. http://www.jezuk.co.uk/arabica/log?id=3591
20. http://www.jezuk.co.uk/arabica
21. http://www.w3.org/TR/2007/WD-xml-entity-names-20071214
22. http://tidy.sf.net/
23. http://www.crumbmuseum.com/truckin.html
24. http://www.cafeconleche.org/XOM
25. http://gnosis.cx/publish/programming/xml_matters_17.html
26. http://jchardet.sourceforge.net/
27. http://www.ccil.org/~cowan
28. file://localhost/home/cowan/tagsoup/trunk/tsaxon
29. http://prdownloads.sourceforge.net/saxon/saxon6-5-5.zip
30. http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html
31. http://www.saxproject.org/apidoc/org/xml/sax/package-summary.html
32. http://www-users.cs.york.ac.uk/~ndm/tagsoup/
33. http://www.haskell.org/
34. http://java.sun.com/
35. http://www.crummy.com/software/BeautifulSoup
36. http://www.crummy.com/software/RubyfulSoup/
37. http://nekohtml.sourceforge.net/
38. http://jtidy.sourceforge.net/
39. http://tidy.sourceforge.net/
40. http://htmlparser.sourceforge.net/
41. http://www.extrememarkup.com/extreme/2004
42. file://localhost/home/cowan/tagsoup/trunk/tagsoup.odp
43. file://localhost/home/cowan/tagsoup/trunk/tagsoup.ppt
44. file://localhost/home/cowan/tagsoup/trunk/tagsoup.pdf
45. file://localhost/home/cowan/tagsoup/trunk/extreme.html
46. file://localhost/home/cowan/tagsoup/trunk/extreme.xhtml
47. http://groups.yahoo.com/group/tagsoup-friends
48. http://groups.yahoo.com/
49. http://groups.yahoo.com/group/tagsoup-friends/join
50. mailto:tagsoup-friends-subscribe@yahoogroups.com
51. http://groups.yahoo.com/group/tagsoup-friends/messages
52. http://xmlarmyknife.org/docs/xhtml/tagsoup/