Implement HTML rewriting stream (closes #222)
inikulin committed May 17, 2018
1 parent 9d872c4 commit 12d81cc
Showing 18 changed files with 560 additions and 101 deletions.
2 changes: 1 addition & 1 deletion docs/classes/parserstream.html
Original file line number Diff line number Diff line change
@@ -1023,7 +1023,7 @@ <h3>on</h3>
</aside>
<div class="tsd-comment tsd-typography">
<div class="lead">
-<p>Raised then parser encounters a <code>&lt;script&gt;</code> element.
+<p>Raised when parser encounters a <code>&lt;script&gt;</code> element.
If this event has listeners, parsing will be suspended once it is emitted.
So, if <code>&lt;script&gt;</code> has the <code>src</code> attribute, you can fetch it, execute and then resume parsing just like browsers do.</p>
</div>
2 changes: 1 addition & 1 deletion docs/classes/plaintextconversionstream.html
@@ -1019,7 +1019,7 @@ <h3>on</h3>
</aside>
<div class="tsd-comment tsd-typography">
<div class="lead">
-<p>Raised then parser encounters a <code>&lt;script&gt;</code> element.
+<p>Raised when parser encounters a <code>&lt;script&gt;</code> element.
If this event has listeners, parsing will be suspended once it is emitted.
So, if <code>&lt;script&gt;</code> has the <code>src</code> attribute, you can fetch it, execute and then resume parsing just like browsers do.</p>
</div>
8 changes: 4 additions & 4 deletions docs/classes/saxparser.html
@@ -1161,7 +1161,7 @@ <h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">this</spa
</aside>
<div class="tsd-comment tsd-typography">
<div class="lead">
-<p>Raised then parser encounters an end tag.</p>
+<p>Raised when parser encounters an end tag.</p>
</div>
</div>
<h4 class="tsd-parameters-title">Parameters</h4>
@@ -1211,7 +1211,7 @@ <h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">this</spa
</aside>
<div class="tsd-comment tsd-typography">
<div class="lead">
-<p>Raised then parser encounters a comment.</p>
+<p>Raised when parser encounters a comment.</p>
</div>
</div>
<h4 class="tsd-parameters-title">Parameters</h4>
@@ -1261,7 +1261,7 @@ <h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">this</spa
</aside>
<div class="tsd-comment tsd-typography">
<div class="lead">
-<p>Raised then parser encounters text content.</p>
+<p>Raised when parser encounters text content.</p>
</div>
</div>
<h4 class="tsd-parameters-title">Parameters</h4>
@@ -1311,7 +1311,7 @@ <h4 class="tsd-returns-title">Returns <span class="tsd-signature-type">this</spa
</aside>
<div class="tsd-comment tsd-typography">
<div class="lead">
-<p>Raised then parser encounters a <a href="https://en.wikipedia.org/wiki/Document_type_declaration">document type declaration</a>.</p>
+<p>Raised when parser encounters a <a href="https://en.wikipedia.org/wiki/Document_type_declaration">document type declaration</a>.</p>
</div>
</div>
<h4 class="tsd-parameters-title">Parameters</h4>
1 change: 1 addition & 0 deletions docs/list-of-packages.md
@@ -6,3 +6,4 @@
- [parse5-plain-text-conversion-stream](https://github.com/inikulin/parse5/tree/master/packages/parse5-plain-text-conversion-stream) - stream that converts plain text files into HTML documents.
- [parse5-sax-parser](https://github.com/inikulin/parse5/tree/master/packages/parse5-sax-parser) - streaming SAX-style HTML parser.
- [parse5-serializer-stream](https://github.com/inikulin/parse5/tree/master/packages/parse5-serializer-stream) - streaming HTML serializer.
+- [parse5-html-rewriting-stream](https://github.com/inikulin/parse5/tree/master/packages/parse5-html-rewriting-stream) - streaming HTML rewriter.
3 changes: 2 additions & 1 deletion package.json
@@ -2,11 +2,12 @@
"private": true,
"devDependencies": {
"@types/node": "*",
+"dedent": "^0.7.0",
"eslint": "^4.19.1",
"eslint-config-prettier": "^2.9.0",
"eslint-plugin-prettier": "^2.6.0",
"husky": "^0.14.3",
-"lerna": "^2.10.2",
+"lerna": "^2.11.0",
"mocha": "^5.1.1",
"prettier": "^1.12.0",
"r2": "^2.0.1",
34 changes: 34 additions & 0 deletions packages/parse5-html-rewriting-stream/README.md
@@ -0,0 +1,34 @@
<p align="center">
<a href="https://github.com/inikulin/parse5">
<img src="https://raw.github.com/inikulin/parse5/master/media/logo.png" alt="parse5" />
</a>
</p>

<div align="center">
<h1>parse5-html-rewriting-stream</h1>
<i><b>Streaming HTML rewriter.</b></i>
</div>
<br>

<div align="center">
<code>npm install --save parse5-html-rewriting-stream</code>
</div>
<br>

<p align="center">
📖 <a href="https://github.com/inikulin/parse5/tree/master/packages/parse5-html-rewriting-stream/docs/index.md"><b>Documentation</b></a> 📖
</p>

---

<p align="center">
<a href="https://github.com/inikulin/parse5/tree/master/docs/list-of-packages.md">List of parse5 toolset packages</a>
</p>

<p align="center">
<a href="https://github.com/inikulin/parse5">GitHub</a>
</p>

<p align="center">
<a href="https://github.com/inikulin/parse5/tree/master/docs/version-history.md">Version history</a>
</p>
133 changes: 133 additions & 0 deletions packages/parse5-html-rewriting-stream/lib/index.js
@@ -0,0 +1,133 @@
'use strict';

const SAXParser = require('parse5-sax-parser');
const Tokenizer = require('parse5/lib/tokenizer');

class RewritingStream extends SAXParser {
    constructor() {
        super({ sourceCodeLocationInfo: true });

        this.posTracker = this.locInfoMixin.posTracker;

        this.tokenEmissionHelpers = {
            [Tokenizer.START_TAG_TOKEN]: {
                eventName: 'startTag',
                reshapeToken: token => this._reshapeStartTagToken(token)
            },
            [Tokenizer.END_TAG_TOKEN]: {
                eventName: 'endTag',
                reshapeToken: token => this._reshapeEndTagToken(token)
            },
            [Tokenizer.COMMENT_TOKEN]: {
                eventName: 'comment',
                reshapeToken: token => this._reshapeCommentToken(token)
            },
            [Tokenizer.DOCTYPE_TOKEN]: {
                eventName: 'doctype',
                reshapeToken: token => this._reshapeDoctypeToken(token)
            }
        };
    }

    _transform(chunk, encoding, callback) {
        this._parseChunk(chunk);

        callback();
    }

    _getCurrentTokenRawHtml() {
        const droppedBufferSize = this.posTracker.droppedBufferSize;
        const start = this.currentTokenLocation.startOffset - droppedBufferSize;
        const end = this.currentTokenLocation.endOffset - droppedBufferSize;

        return this.tokenizer.preprocessor.html.slice(start, end);
    }

    // Events
    _handleToken(token) {
        if (token.type === Tokenizer.EOF_TOKEN) {
            return;
        }

        const { eventName, reshapeToken } = this.tokenEmissionHelpers[token.type];

        this.currentTokenLocation = token.location;

        const raw = this._getCurrentTokenRawHtml();

        if (this.listenerCount(eventName) > 0) {
            this.emit(eventName, reshapeToken(token), raw);
        } else {
            this.emitRaw(raw);
        }

        // NOTE: don't skip new lines after <pre> and other tags,
        // otherwise we'll have incorrect raw data.
        this.parserFeedbackSimulator.skipNextNewLine = false;
    }

    _emitPendingText() {
        if (this.pendingText !== null) {
            const raw = this._getCurrentTokenRawHtml();

            if (this.listenerCount('text') > 0) {
                this.emit('text', this._createTextToken(), raw);
            } else {
                this.emitRaw(raw);
            }

            this.pendingText = null;
        }
    }

    // Emitter API
    emitDoctype(token) {
        let res = `<!DOCTYPE ${token.name}`;

        if (token.publicId !== null) {
            res += ` PUBLIC "${token.publicId}"`;
        } else if (token.systemId !== null) {
            res += ' SYSTEM';
        }

        if (token.systemId !== null) {
            res += ` "${token.systemId}"`;
        }

        res += '>';

        this.push(res);
    }

    emitStartTag(token) {
        let res = `<${token.tagName}`;

        const attrs = token.attrs;

        for (let i = 0; i < attrs.length; i++) {
            res += ` ${attrs[i].name}="${attrs[i].value}"`;

Review comment by RReverser (Collaborator), May 17, 2018:
No escaping of values?

Reply by inikulin (Author, Owner), May 17, 2018:
Fixed in 294856d. Thanks for pointing this out.

        }

        res += token.selfClosing ? '/>' : '>';

        this.push(res);
    }

    emitEndTag(token) {
        this.push(`</${token.tagName}>`);
    }

    emitText({ text }) {
        this.push(text);

Review comment by RReverser (Collaborator), May 17, 2018:
Same question: as far as I remember, the text in the token is in decoded form. Shouldn't entities be re-encoded for safety?

    }

    emitComment(token) {
        this.push(`<!--${token.text}-->`);
    }

    emitRaw(html) {
        this.push(html);
    }
}

module.exports = RewritingStream;
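Both review comments above concern output escaping: `emitStartTag` interpolates attribute values as-is, and `emitText` pushes decoded text. The following is a minimal sketch of what re-escaping could look like, based on the string escaping rules in the HTML Standard's fragment serialization algorithm. The helper names `escapeAttrValue`, `escapeText`, and `serializeStartTag` are hypothetical; they are not taken from this commit or from the follow-up fix 294856d, whose contents are not shown here.

```javascript
// Hypothetical helpers for illustration only; not part of this commit.

// Escape a decoded attribute value for use inside double quotes.
function escapeAttrValue(value) {
    return value
        .replace(/&/g, '&amp;') // must run first so later entities are not double-escaped
        .replace(/\u00a0/g, '&nbsp;')
        .replace(/"/g, '&quot;');
}

// Escape decoded text content for use between tags.
function escapeText(text) {
    return text
        .replace(/&/g, '&amp;')
        .replace(/\u00a0/g, '&nbsp;')
        .replace(/</g, '&lt;')
        .replace(/>/g, '&gt;');
}

// Same shape as emitStartTag in the diff, but it returns the string
// instead of pushing it, and it escapes every attribute value.
function serializeStartTag(token) {
    let res = `<${token.tagName}`;

    for (const attr of token.attrs) {
        res += ` ${attr.name}="${escapeAttrValue(attr.value)}"`;
    }

    res += token.selfClosing ? '/>' : '>';

    return res;
}
```

Text escaping of this kind would likely not be appropriate for the contents of raw-text elements such as `<script>` and `<style>`, so a real implementation would need to take the enclosing element into account.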
20 changes: 20 additions & 0 deletions packages/parse5-html-rewriting-stream/package.json
@@ -0,0 +1,20 @@
{
    "name": "parse5-html-rewriting-stream",
    "description": "Streaming HTML rewriter.",
    "version": "5.0.0",
    "author": "Ivan Nikulin <ifaaan@gmail.com> (https://github.com/inikulin)",
    "contributors": "https://github.com/inikulin/parse5/graphs/contributors",
    "homepage": "https://github.com/inikulin/parse5",
    "keywords": ["parse5", "parser", "stream", "streaming", "rewritter", "rewrite", "HTML"],
    "license": "MIT",
    "main": "./lib/index.js",
    "dependencies": {
        "parse5": "^5.0.0",
        "parse5-sax-parser": "^5.0.0"
    },
    "repository": {
        "type": "git",
        "url": "git://github.com/inikulin/parse5.git"
    },
    "files": ["lib"]
}
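
The central idea of `_handleToken` in `lib/index.js` above is that a token is handed to a listener only if one is registered for its event; otherwise the original raw HTML is forwarded unchanged. The sketch below illustrates that pattern using only Node's built-in `stream` module. The class `ListenerOrRawStream` and its `{ eventName, raw }` input objects are assumptions made for this illustration. It does no HTML parsing and is not the `RewritingStream` API.

```javascript
const { Transform } = require('stream');

// Hypothetical class for illustration; it accepts pre-split items of the
// form { eventName, raw } instead of tokenizing HTML itself.
class ListenerOrRawStream extends Transform {
    constructor() {
        // Objects go in on the writable side, bytes come out on the readable side.
        super({ writableObjectMode: true });
    }

    _transform(item, encoding, callback) {
        if (this.listenerCount(item.eventName) > 0) {
            // A listener exists: it decides what, if anything, to push.
            this.emit(item.eventName, item, item.raw);
        } else {
            // Nobody is listening for this event: forward the raw text untouched.
            this.push(item.raw);
        }

        callback();
    }
}

const rewriter = new ListenerOrRawStream();

// Rewrite start tags only. 'text' and 'endTag' have no listeners here,
// so those items pass through as raw.
rewriter.on('startTag', item => {
    rewriter.push(item.raw.toUpperCase());
});

rewriter.write({ eventName: 'startTag', raw: '<div>' });
rewriter.write({ eventName: 'text', raw: 'hello' });
rewriter.write({ eventName: 'endTag', raw: '</div>' });

// Everything pushed so far is buffered on the readable side.
const output = rewriter.read().toString('utf8');

console.log(output);
```

One consequence of this design is that parts of the document nobody listens for are reproduced from the raw input rather than re-serialized, so only the parts a handler actually touches can change.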