The streaming implementation does not take in account that a token may be split across two chunks #632
One way to fix this: the Parser already expects production rules to be delimited by newlines. Making the change below at https://github.com/kach/nearley/blob/master/lib/nearley.js#L279 works for your grammar, and should work without additional ripple effects:

```js
var tailPrevChunk = '';
Parser.prototype.feed = function(chunk) {
    var idxLastNewline = chunk.lastIndexOf('\n');
    if (idxLastNewline === -1) {
        // No newline in this chunk; buffer it until one arrives.
        tailPrevChunk += chunk;
        return;
    }
    // Feed everything up to the last newline, carry the rest over.
    var alteredChunk = tailPrevChunk + chunk.substring(0, idxLastNewline);
    tailPrevChunk = chunk.substring(idxLastNewline);
    var lexer = this.lexer;
    lexer.reset(alteredChunk, this.lexerState);
    // . . .
```

Having shared that, it is worth debating whether supporting an arbitrarily long production rule is a good idea. It seems to me that a defined maximum length would be better.
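The buffering idea above can be demonstrated on its own, independently of nearley. A minimal sketch (`makeBufferedFeeder` is a hypothetical helper name, not part of nearley's API; unlike the snippet above, this variant feeds the trailing newline through rather than carrying it over):

```javascript
// Sketch of the newline-buffering idea: wrap a feed(text) callback so
// that only newline-terminated input is passed through, and any trailing
// partial line is held back until the next chunk completes it.
// makeBufferedFeeder is a hypothetical helper, not part of nearley's API.
function makeBufferedFeeder(feed) {
  var tailPrevChunk = '';
  return function (chunk) {
    var idxLastNewline = chunk.lastIndexOf('\n');
    if (idxLastNewline === -1) {
      // No complete line yet; keep buffering.
      tailPrevChunk += chunk;
      return;
    }
    // Everything up to and including the last newline is safe to feed.
    feed(tailPrevChunk + chunk.substring(0, idxLastNewline + 1));
    tailPrevChunk = chunk.substring(idxLastNewline + 1);
  };
}

// A token ("longSymbolName") split across two chunks is reassembled:
var received = [];
var feed = makeBufferedFeeder(function (text) { received.push(text); });
feed('rule1 -> longSym');   // buffered, nothing fed yet
feed('bolName\nrule2 ');    // completes the line
console.log(JSON.stringify(received)); // ["rule1 -> longSymbolName\n"]
```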
@naraen Thank you, it is a good idea. This will likely fix parsing longer .ne files, but will it fix arbitrary user data fed to a Parser? One can imagine a valid data format without any newlines (or even any set delimiters). For such corner cases, though, we could let the user configure when to stop working with the current chunk.
Agreed. This isn't the right location in the code flow for the fix; I wanted to do a quick proof of concept.
Opened a PR with the change in StreamWrapper. As far as I can tell, only nearleyc uses the StreamWrapper, so this should fix the issue identified without changing the behavior of the generated parser.
Consider this code in the nearleyc implementation: https://github.com/kach/nearley/blob/master/bin/nearleyc.js#L29

It feeds the fs stream, via a StreamWrapper instance, to the parser's feed() function as is. The Parser.feed() implementation calls lexer.reset() with the new chunk. If the fs stream splits a token between two chunks and we are lucky, we will get a lexer error either at the end of the previous chunk or at the beginning of the next chunk. (Otherwise the result is valid, but may not be what the user expects, as two tokens instead of one appear silently in the result.)
Ideally, fs streams should be handled at the lexer level, not the parser level. Barring that, the easiest solution would probably be to drop streaming support, at least in nearleyc, and to deprecate the use of streaming in the API. In practice, this issue affects users with longer grammars and longer symbol names.
Here is the code to reproduce (it can probably be whittled down to something simpler): gen-long-grammar.ts

Running this:

```sh
npx tsx gen-long-grammar.ts | npx nearleyc
```

results in an error similar to this: