-
Notifications
You must be signed in to change notification settings - Fork 30.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace C HTTP parser with JS HTTP parser #1457
Conversation
@@ -0,0 +1,890 @@ | |||
/* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Might be best to make this an internal module, a la freelist.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
two things
- this is deprecating
process.binding('http_parser')
which means we're gonna break some people, maybe too many to be able to remove this right away. - if we don't expose this parser for use externally then I think it should be it's own library and vendored in like
readable-stream
is so that people can require it on npm. if we do that then I think it's reasonable to make it internal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is deprecating process.binding('http_parser') which means we're gonna break some people, maybe too many to be able to remove this right away.
process.binding('http_parser')
, like other bindings, is and always has been internal. If people are using it directly and removal breaks their code, the onus is on them.
Besides, they can probably substitute it with https://github.com/creationix/http-parser-js.
if we don't expose this parser for use externally then I think it should be it's own library and vendored in like readable-stream is so that people can require it on npm. if we do that then I think it's reasonable to make it internal.
I'm not sure I follow this logic. If someone feels strongly enough about it, they can always break out the code into a standalone module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be it's own library and vendored in like readable-stream is
readable-stream still pulls from iojs itself, because the library has different goals than iojs itself (it favors broad compatibility where io.js favors performance.) It seems to me like the JS parser is likely to make similar tradeoffs and thus run into the same problems with vendoring.
Crazy excited about this. |
Test-parser-durability seems to fail on multiple platforms: https://jenkins-iojs.nodesource.com/job/iojs+any-pr+multi/nodes=freebsd101-32/543/console |
@jbergstroem Yes, I noted the timeout issue in my Final Notes section. I'm not sure what to do about it... |
Great work. I will review this when I have some time. +1 to move this to its own repository like readable-stream and bundle it. I should also mention that I've done a little experimentation on a http/2 implementation in pure js. It does somewhat works, but is not nearly finished. But my hpack implementation should be complete and is available here. I have not read your code yet, but when we are doing a big change to the http implementation, we should make it ready for a http/2 addition later. I'll come back to you on that. As @mikeal said, this is crazy exiting. |
@mscdex sorry, I missed that note. Suggesting we move |
Oh man! This is really great. Can't wait for this to be in 👍 |
Great!! 👍 I benchmarked using
1.04 % faster ! machine spec: |
|
||
var RE_CLOSE = /close/i; | ||
var RE_KEEPALIVE = /keep\-alive/i; | ||
var RE_UPGRADE = /upgrade/i; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be /^upgrade$/i
instead? ( #828 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Technically the Connection
header is a comma-separated list, so there could be more than one value. I'm not sure how often that occurs in practice though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've just pushed a commit to improve multi-value matching.
Nice work, Brian. You mention that the parser buffers and rescans? How does it handle slowloris attacks?
Requests with only LF line endings happens in the wild. It must be supported.
Proxy-Connection is rare but it is used in the wild. On the other hand, I believe support for it in the current parser is broken as well. I remember raising a couple of issues about it at joyent/node and I don't think they have been fixed.
I remember there was one guy that raised a bug about some network appliance that only spoke HTTP/0.9. Probably an extreme fringe case, though. |
For what it is worth, RFC 7230 explicitly says there is no longer any obligation to support HTTP/0.9. I think it is wise to not pretend to support it in io.js. Whatever code paths there are for HTTP/0.9 will likely only be exercised by the test suite, or by people looking for bugs to exploit. Seems like a magnet for code rot. One can always put nginx in front of io.js if this capability is needed. |
The js parser currently only executes the regexp once a full line is received. I was not aware of the "slowloris" attack beforehand, but I imagine one mitigation might be to add max line lengths for start lines, header lines, chunk size lines, etc. in addition to the header bytes limit that covers both start line + header lines. EDIT: I should add that having additional max line lengths will only help so much. It won't prevent someone from opening a ton of connections and trickling in valid requests to use up server resources though.
I've never seen this in reality. Do you have any modern examples? |
This is awesome, great work @mscdex! |
* {Array} | ||
|
||
A list of the HTTP methods that are supported by the parser. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a decent amount of code that accesses this. Maybe we should deprecate in docs first?
There is a full replacement available [1], but if we are still using the list in core, is there a reason to not export it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move to a model where we just support any method? I argued with ry back in the day about this, but people at the IETF add HTTP methods for crazy new protocols all the time. Ryan hated this and so he intentionally limited the number of methods supported but there's no real technical reason for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For reference, here's one of my favorites http://greenbytes.de/tech/webdav/draft-dusseault-caldav-05.html#METHOD_MKCALENDAR CalDAV's MKCALENDAR event which "makes a calendar" and totally needs its own HTTP method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think one of the reasons for limiting the supported verbs is to be able to fail earlier on bad input. That is just a guess though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as the spec is concerned new verbs can be added at any time and it's not the parsers job to fail on any specific verb except where semantics are defined specific to that verb and do not occur.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In C parser verbs are hardcoded into state machine, so we didn't really have a choice.
But the new parser should support any valid verbs imho, it's not its job to enforce any limitations like this.
PS: I want BREW method, 'cause hey it's RFC :P
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we expose a way for user-land to maybe add verbs they explicitly want to support? For all verbs that aren't hard coded (i.e. common verbs) I'd want them to just fail as bad input. If someone has a specific need for an unsupported verb they can make a PR or if we have a way to add new verbs in user-land, use that. We (at Grooveshark) get crazy verb requests all the time (for instance thousands of LOAD requests per day) and drop them since they're not defined anywhere and we have no idea how we'd even respond.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how standard verb like MKCALENDAR
is any different from non-standard one like HELLOWORLD
. Your application will likely respond the same way for both of them.
And since treating all uncommon methods as bad input is undesirable, we could just allow all methods.
Seriously, what is the benefit of enforcing any limits on that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how standard verb like
MKCALENDAR
is any different from non-standard one likeHELLOWORLD
.
One you need to make a calendar for and the other you should probably respond with a HELLO. If you're a CalDAV app you should respond to MKCALENDAR
by making a calendar and respond to HELLOWORLD
with a 405 Invalid Method. But majority of people (if not all) using this will respond to MKCALENDAR
by NOT making a calendar and also probably not responding with 405, therefore not doing what the client that made the request expected.
A few remarks:
I don't say that the JS parser is worse (or better), I just wanted to shed some light on those two aspects. Unfortunately, I don't have time to investigate them. |
RE_AUTHORITY.source + '|\\*)'); | ||
var RE_REQUEST_LINE = new RegExp('^([!#$%\'*+\\-.^_`|~0-9A-Za-z]+) (' + | ||
RE_REQUEST_TARGET.source + | ||
')(?: HTTP\\/1\\.([01]))?$'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would using a template string be more sensible for some of these?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe, I'm not up on the ES6 stuff.
As an example on my computer the following regexp takes 23 seconds! to execute: /a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?a?aaaaaaaaaaaaaaaaaaaaaaaaaaaaa/.exec("aaaaaaaaaaaaaaaaaaaaaaaaaaaaa")
|
|
That's a bad idea, because it does not allow simple testing of an http server with "netcat".
I'm not sure why this test was included in joyent/http-parser, but we'd better make sure it doesn't break backward compatibility. |
HTTPParser.RESPONSE = 1; | ||
module.exports = HTTPParser; | ||
|
||
function indexOfCRLF(buf, buflen, offset) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dchusovitin As mentioned in my performance improvement techniques, the custom indexOfCRLF()
is faster than buffer.indexOf()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds about right. Buffer#indexOf()
is a naive implementation ATM.
@dcousens keyword being a WG vs all WG's. |
A reason I would like this to be held in a separate repo is for the utility. One example is that businesses have asked me how they can inject themselves into the data stream in between the TCP and HTTP parser. They need to do this for a variety of reasons. For example, decrypting cookies. Unfortunately this is impossible to do today without horribly hacking into what should be core internal functionality. Most of the cases like the one above could be solved by allowing devs the ability to use the parser similar to a transform stream. Then it could just be plugged into the pipeline. Though at that point it seemed to me like it would live more appropriately in its own repo and io.js would simply bring it in. |
+1 to the code living in its own repo. I don't think this will get its own WG yet, there aren't enough people actively engaged in working on it for one thing. Perhaps if this module starts to take on HTTP2 support the contributors would go and it would be a good idea to get its own WG. |
See this[1] comment for more details and benchmark results. Notable changes: * Removed all regexps. Only request URLs are not validated currently. * Added globally configurable start line length limits * Chunk size parsing no longer line buffers * Lines can end in just LF too * Faster than the original regexp-based implementation [1] nodejs#1457 (comment)
One concern: even if faster than the C version, what about GC? Will we see increased GC freezes because of this? |
return false; | ||
} | ||
|
||
if ((flags & FLAG_CHUNKED) > 0 || this._contentLen !== null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition already yields a boolean, so you shouldn't need an if-statement (and potential branch misprediction). Micro-optimization, but I guess that's where you're at at this point :)
return (flags & FLAG_CHUNKED) > 0 || this._contentLen !== null;
@Fishrock123 I'm not sure I understand how that issue could have been avoided in a js parser like this. At some point there has to be Buffer-to-string conversion in order to easily work with (not to mention backwards compatibility) things like headers, request urls, request methods, response reason text, etc. |
@mscdex Sure, but that should take place at the last moment, just before the surface API. Everything before the HTTP body in a legit request is pure 7-bit ASCII, so there's no real complication in handling this correctly. (no multi-byte codepoints, etc) |
There's also the fact that this retains |
@nathan7 Actually request URLs are treated as UTF-8 in node/io.js, since node/io.js utilizes the less-strict parsing option of I'm not sure what you mean about |
Oh, whoops — I misread the diff apparently. Goodbye, |
My thoughts on a separate repo are:
|
Is the C parser actually slower or is the boundary the problem? |
@Matt-Esch I don't recall now. I haven't ran any benchmarks since changes were made to the way the http parser receives incoming data from the socket (it reads directly from the socket in C++ land instead of going through JS). It's entirely possible these changes could have made a considerable improvement in the benchmarks. |
@Matt-Esch The boundry is the problem. As you can see by my pending PR #3780 that I was able to increase the performance of the header parser using some v8 cheats. There is more we can do there as well. |
Status? |
7da4fd4
to
c7066fb
Compare
I'm going to close this as no activity has been had here for about half a year. Feel free to reopen. |
Background
There had been brief mention during a TC meeting some time back about the
possibility of having a js HTTP parser for io.js core. I was looking for a new
and interesting project to work on in my free time, and I'm a protocol
enthusiast, so for the past week or so I've been working on writing a js HTTP
parser.
This parser uses regular expressions for most of the parsing. Others have
expressed concerns about using regular expressions due to the (negative)
performance impacts they would most likely bring. However I wanted something
that would be easy to get right (since the RFCs provide the appropriate
grammars) and even though the resulting regular expressions would turn out to be
pretty large (for request line parsing), they are broken up in the source code
to make it easier to follow. I was also curious to see just how much of a
negative performance hit there would be.
Right now all io.js tests pass, including the (relevant) tests I ported from
joyent/http-parser's tests.
Non-breaking Behavioral Changes
Only one "onHeaders" callback (no separate onHeaders and onHeadersComplete
callbacks)
pause()
andresume()
are no-ops. The only place these methods are usedcurrently is in the http server and even there they are not used correctly
(as far as I can tell). From what I have seen in the C parser implementation,
pause()
andresume()
only change the current state and nothing more(e.g. no saving the existing chunk or automatically re-parsing the existing
chunk on resume or anything like that). As far as I can tell it's up to the
end user (http server in this case) to re-
execute()
the same, previouschunk after calling
resume()
in order for the rest of the original data tobe processed.
If any of this is incorrect, please let me know.
Backwards Incompatibilities
Currently there are no
.code
or.bytesParsed
properties added to Errorobjects returned by the parser. This should be trivial to re-add though.
Because of the difference in parsing strategies, the js parser cannot
terminate as quickly because it tries to buffer until a CRLF is seen for
request/response and header lines for example. I've attempted to workaround
this somewhat when parsing the first bytes for request/response lines since
there is one test (
test-https-connecting-to-http
) where TLS handshaking iswritten to the parser. Basically I check that the first byte is a printable
ASCII byte before continuing. This will keep out binary TLS data, but it is
obviously not foolproof. I am open to suggestions to remedy this problem.
Folding whitespace behavior is conformant with RFC 7230:
"A user agent that receives an obs-fold in a response message that is
not within a message/http container MUST replace each received
obs-fold with one or more SP octets prior to interpreting the field
value."
It should also be noted that RFC 7230 now deprecates line folding for HTTP
parsing, FWIW. This parser replaces folds with a single SP octet.
Optional whitespace is removed before interpreting a header field value, as
suggested by RFC 7230:
"A field value might be preceded and/or followed by optional
whitespace (OWS); a single SP preceding the field-value is preferred
for consistent readability by humans. The field value does not
include any leading or trailing whitespace: OWS occurring before the
first non-whitespace octet of the field value or after the last
non-whitespace octet of the field value ought to be excluded by
parsers when extracting the field value from a header field."
joyent/http-parser keeps trailing whitespace. This parser keeps neither
preceding nor trailing whitespace.
Enforces CRLF for line endings instead of additionally allowing just LF.
Does not allow spaces (which are invalid) in header field names.
Smaller maximum chunk/content length (2^53-1 vs 2^64-2). Obviously it's
not impossible to handle a full 64-bit length, but it would mean adding
some kind of "big integer" library for lengths > 2^53-1.
No special handling for
Proxy-Connection
header. The reason for this isthat
Proxy-Connection
was an experimental header for HTTP 1.0 user agentsthat ended up being a bad idea because of the confusion it can bring. You
can read a bit more about this in
RFC 7230 A.1.2
Various Performance Improvement Techniques
buffer.indexOf()
that only looks for CRLFs and avoids trips toC++ land.
parseInt(..., 16)
. Thisconsists of using a lookup table and multiplying and adding. This technique
is borrowed from joyent/http-parser.
jsperf
/^foo$/i.test(val)
,val.toLowerCase().indexOf('foo') === 0
, or somethingsimilar. This technique is again borrowed from joyent/http-parser, which
does
charcode | 0x20
to get the "lowercase" version of the character.Additionally, initial string length comparisons allow early terminations
since we look for exact matches.
jsperf
trim()
that loops on the string twice, once to find thefirst non-whitespace character and second to find the last non-whitespace
character. Returning the
str.slice()
between these two indexes turns outto be much faster than
str.trim()
. The performance difference is muchlarger when the string has no leading or trailing whitespace.
jsperf
val !== val
instead ofisNaN(val)
.NaN
is the only value inJavaScript that does not equal itself or anything else.
jsperf
Benchmarks
Using the new (included) parser benchmark, here are the results I'm seeing:
(units are
execute()
s per second)DNF = terminated after waiting a few minutes to finish.
I was kind of suspicious about the large differences in some of the benchmarks,
so I attempted to explicitly call
parser.reinitialize(...)
afterexecute()
ing each full message for the C parser, thinking maybe that was needed toget correct data. When I made that change I got these numbers:
As you can see that resulted in speedups for most benchmarks, but a reduction in
the
small-res
benchmark. Either way, the numbers still are not close to thosefor the js parser. I tried to use varied types of inputs and tried to make the
benchmarks as fair as I could tell.
If I am missing something or measuring wrong, please let me know.
Final Notes
I think it's safe to say nobody is using HTTP/0.9 anywhere anymore. HTTP/0.9
responses simply consist of a body (no response line or headers or anything),
so there is no way to differentiate an HTTP/0.9 response from bad input.
_http_*.js
code, I foundthat the old binding API was still being used
(e.g.
.execute(data, 0, data.length)
). I have now corrected that, whichmay bring some speed improvements by itself.
possible to further improve upon them (performance-wise) while still being
correct. For example, catastrophic backtracking could be minimized by
emulating atomic groups by using positive lookaheads with capturing groups
(this means more data is saved because of the capture groups, but maybe that
won't be as bad as it might sound):
(?=(atomic pattern))\1
make test
will time out for the ported durability tests, but I'mnot sure how to handle that, especially since it would take even longer on
some of the embedded devices we currently test on.