-
-
Notifications
You must be signed in to change notification settings - Fork 232
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor(tokenizer): Introduce events #404
Conversation
No need to construct objects, when flat parameters suffice
This allows us to start asserting expected values as we walk through the document. This will give us stack traces for all future errors. Improve generate tokenization tests imports
Once again, stack traces will make errors easier to identify
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a lot of changes. Including what seems like refactoring. That make it hard to review.
If possible, can you move part of this work to separate PRs?
Can you point me towards subsets that you want me to review?
protected override _handleToken(token: Token): boolean { | ||
if (!super._handleToken(token)) { | ||
this.emitRaw(this._getRawHtml(token.location!)); | ||
protected override emitIfListenerExists(eventName: string, token: SaxToken): boolean { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The SAX parser now handles events, so _handleToken
is gone. We can override emitIfListenerExists
instead for the same effect.
protected parserFeedbackSimulator: ParserFeedbackSimulator; | ||
private pendingText: CharacterToken | null = null; | ||
private pendingText: Text | null = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With _handleToken
gone, we keep a Text
object here for simplicity.
|
||
constructor(private tokenizer: Tokenizer) { | ||
constructor(options: TokenizerOptions, private handler: TokenHandler) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of passing tokens on after a getNextToken
call, the ParserFeedbackSimulator
now mirrors tokens received from the tokenizer to a passed handler.
'ParserFeedbackSimulator', | ||
'ParserFeedbackSimulator', | ||
feedbackPath.pathname, | ||
(handler) => new ParserFeedbackSimulator({}, handler).tokenizer |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
generateTokenizationTests
now only needs a tokenizer to be returned.
packages/parse5/lib/parser/index.ts
Outdated
@@ -1,4 +1,4 @@ | |||
import { Tokenizer, TokenizerMode } from '../tokenizer/index.js'; | |||
import { QueuedTokenizer, TokenizerMode } from '../tokenizer/index.js'; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The parser is mostly unchanged — only the wrapped tokenizer is imported.
generateTokenizationTests('tokenizer', 'Tokenizer', dataPath.pathname, ({ errors }) => { | ||
const tokenizer = new Tokenizer({ | ||
sourceCodeLocationInfo: true, | ||
onParseError(err): void { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The onParseError
method for the tokenizer moved to the handler. This is not a breaking change for anyone not explicitly using the tokenizer (and there are a lot of breaking changes here for those that are).
@@ -40,7 +40,41 @@ function appendToken(dest: Token[], token: Token): void { | |||
dest.push(token); | |||
} | |||
|
|||
function collectParserTokens(html: string): ReturnType<typeof convertTokenToHtml5Lib>[] { | |||
function convertTokenToHtml5Lib(token: Token): HtmlLibToken { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
convertTokenToHtml5Lib
was no longer needed in test/utils/generate-tokenization-tests.ts
, and therefore moved here.
initialState: getTokenizerSuitableStateName(initialStateName), | ||
initialStateName, | ||
lastStartTag: descr.lastStartTag, | ||
expectedErrors: descr.errors || [], | ||
expectedErrors: TestsWithBrokenErrors[descr.description] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Broken tests are now referenced here, instead of when comparing once all errors are gathered.
@@ -141,7 +164,7 @@ interface TestDescription { | |||
description: string; | |||
input: string; | |||
lastStartTag: string; | |||
errors?: string[]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
errors
was wrongly typed; it should have been a TokenError
to begin with.
I have pushed some commits that might help make this more digestible. I was debating whether or not to include updates to other modules & tests. I felt like they help to illustrate how code looks like after adopting this change, so I kept them for now. But happy to move them to a separate PR if needed.
The tokenizer unfortunately has a lot of changes due to all |
sorry, thanks for putting in that work, sorry that I didn’t review this! :( |
No worries — hope it's okay that I moved forward with this after two weeks. I have opened #419 to wrap up this refactoring and would appreciate a review there (if you have the time). |
Update the tokenizer to produce events. See #403 for details.