-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: remove newline entities #46
Conversation
src/index.ts
Outdated
@@ -1,6 +1,6 @@ | |||
const invalidProtocolRegex = /^([^\w]*)(javascript|data|vbscript)/im; | |||
const htmlEntitiesRegex = /&#(\w+)(^\w|;)?/g; | |||
const htmlTabEntityRegex = /&tab;/gi; | |||
const htmlSpaceEntityRegex = /&(tab|newline);/gi; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are there any other html entities that represent blank spaces?
I'm also wondering if we can just trash html entities entirely, or if there's a world in which an html entity is a valid part of a url?
I guess it technically is for query params. Should we be using the URL
class here and only remove the html entities from the domain itself, and leave them in the query params? (would depend again on whether or not we continue to support IE11)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The previous regex (htmlEntitiesRegex
) has the ;
as an optional piece. Is there a world in which new lines or tabs can be added without using the ;
part of it?
I'm still concerned that there are other space characters that are vulnerable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe the problem came from whitespace ctrl characters specifically that were making it through our checks - other whitespace html entities appear to be properly encoded in the browser and i think are valid in a url, so i'm inclined to not strip those (or any other html entity) out.
based on what i can find, the ;
is always required for tabs and new lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, perfect ❤️
src/index.ts
Outdated
@@ -12,7 +12,7 @@ function isRelativeUrlWithoutProtocol(url: string): boolean { | |||
|
|||
// adapted from https://stackoverflow.com/a/29824550/2601552 | |||
function decodeHtmlCharacters(str: string) { | |||
str = str.replace(htmlTabEntityRegex, "	"); | |||
str = str.replace(htmlSpaceEntityRegex, "	"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a little odd that &tab;
and &newline;
both get converted to 	
, which is a tab. The 	
then gets decoded and stripped out as a control character. Since we know &tab;
and &newline;
are both control characters, can we just eliminate them here?
str = str.replace(htmlSpaceEntityRegex, "");
That does change the nature of what decodeHtmlCharacters
does, though, but decoding &newline;
as a tab is already doing that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah - i like that a lot better (eliminating them entirely). fixed!
No description provided.