Soft Keywords and How to Implement Them #138
CC: @davidhalter @bgw
@isidentical Interesting work. I'm not sure we should go this route. An alternative way to deal with "soft keywords" is using something like … However, I would probably prefer a PEG parser at that point. But at the same time I haven't really thought about that either.

PS: I hope that the PEP gets rejected. IMO this is just additional syntax without major benefits. If pattern matching was at least an expression, then I would maybe see it, but this way it's just useless (you can do most things with if).
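For readers who haven't followed PEP 622/634: a soft keyword is a word that only acts as a keyword in a specific syntactic position and stays a legal identifier everywhere else. A minimal illustration with `match` (requires Python 3.10+):

```python
# "match" is a soft keyword in Python 3.10+: it only starts a match
# statement in statement position and remains an ordinary name elsewhere,
# which is exactly what makes it awkward for an LL(1) parser.
match = [1, 2]        # "match" used as a plain identifier

match match:          # here the first "match" begins a match statement
    case [x, y]:
        print(x + y)  # prints 3
```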
Will look into that (and possibly draft a PoC), but at least for now, I don't think it would be the right fit.
Yes, me too (if we can). As I said, this is just a workaround for that problem. Unfortunately, I can't start such an effort myself, but I can participate whenever I'm free. (If anyone is interested in the idea, maybe with a Rust/C backend.)
I ~partially agree. I don't really mind if it gets accepted, but not as is. IMHO it needs major changes, including very big simplifications. It took me hours and hours to finally comprehend the whole document, which I believe shows it is a very complex and dense language addition, at least as a single change. It might be worth dividing the PEP into parts and gradually including/excluding some of them. Let's wait and see, no need to rush :P
Why can't you start? Because it would be written in a lower level language?
Yeah, I also didn't read the whole thing; it's simply too long and complicated. But that might just be this iteration of the draft.
No, I'm currently in the process of getting a job for this summer, and I'm not sure if I can dedicate time to it, at least not solely. If you or anybody else attempts it, I can help in my free time.
@davidhalter would you check this out (and comment if you have anything to say)? https://bugs.python.org/issue40360
I have thought quite a bit about this. It's similar to your plan, but not exactly the same. We can probably use our existing LL parser and make it PEG-compatible with a small trick: … This is very similar to PEG and would need some work, but it's probably not even that hard to do. The biggest challenge will probably be keywords/soft keywords and backtracking. It seems like your solution is quite similar to this, but you seem to transition ahead of time to make it clear which plan to take. That works as well, of course.
Interesting. Do you have time to come up with a draft that supports both soft keywords and the multiline context managers? Obviously, we don't need an ideal strategy to parse, so we could just embed the backtracking into the LL parser. For reference, I've written about it: https://tree.science/what-the-backtracking.html
At the moment I have other priorities, but I will definitely come up with something before 3.10 releases if you don't do the work :) My goal for parso is just to get it running so people can use it. My priorities lie with creating a parser that is both faster and better than parso (and is probably a different project).
:-) [inserts a november song]
I like hacky ways 😎
I think your general approach is pretty good, I would probably just use a slightly different approach in the generator (if possible): I would move some of the logic there to calculate the plans beforehand. But I haven't really thought about everything and your solution might be what we need.
I started some stuff, but calculating the rules beforehand is a bit tricky. The thing that makes backtracking harder for us is error recovery. If we could assume all input is valid, we would iterate all possible plans (e.g. for the optional parentheses in a with-statement), perform a dry run, and see which of the alternatives completes; as soon as one completes, we could return it. But since we also do error recovery, we need to take one token at a time, feed all the plans, get the next possible plans, complete those, and repeat this until we find a plan that takes more input than the others. The proposed patch does this on a very small scale, but for ambiguous rules I might have an alternate patch (still working on it, but no luck so far). Basically what it does is, instead of keeping …
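To make the token-at-a-time idea above concrete, here is a small editorial sketch; `Plan` and `feed()` are invented names for illustration and are not parso's actual API. Every candidate plan is fed one token at a time, and the plan that keeps consuming input the longest wins:

```python
# Illustrative sketch only: because error recovery rules out a simple dry run
# of whole alternatives, each candidate plan is fed one token at a time and
# the plan that consumes the most input is selected.

class Plan:
    def __init__(self, name, expected):
        self.name = name
        self._expected = expected  # token types this plan consumes, in order
        self.consumed = 0

    def feed(self, token):
        """Return True if the plan accepts the next token."""
        if self.consumed < len(self._expected) and self._expected[self.consumed] == token:
            self.consumed += 1
            return True
        return False


def pick_plan(plans, tokens):
    alive = list(plans)
    for token in tokens:
        survivors = [plan for plan in alive if plan.feed(token)]
        if not survivors:
            break              # nobody accepts this token; keep the best so far
        alive = survivors
    return max(alive, key=lambda plan: plan.consumed)


# e.g. disambiguating parenthesized with-items from a parenthesized tuple:
plans = [
    Plan("with_items_in_parens", ["with", "(", "NAME", ",", "NAME", ")", ":"]),
    Plan("tuple_subject", ["with", "(", "NAME", ",", "NAME", ")"]),
]
winner = pick_plan(plans, ["with", "(", "NAME", ",", "NAME", ")", ":"])
print(winner.name)  # -> with_items_in_parens
```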
I will answer later, but maybe you can answer something about CPython's PEG grammar. I don't really understand why the tilde in the following code is necessary. I have read the PEP about it and understand that it's usually used to avoid considering alternatives. But in this case there are no alternatives, so why is it there?
I presume you found it on the …
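As a side note, here is an editorial sketch of what the `~` does in a pegen grammar (per PEP 617); it is not the snippet discussed above and is not copied from python.gram. Once the parser has passed a cut inside an alternative, it commits to that alternative and will not fall back to the remaining ones, so a later mismatch becomes a hard syntax error:

```python
# Hand-written sketch of cut semantics for a rule like:  group: '(' ~ expr ')'
# Once '(' has been consumed and the cut has been passed, failing to find the
# closing ')' raises an error instead of falling through to other alternatives.

def parse_rule(tokens, pos, alternatives):
    for alt in alternatives:
        p = pos
        committed = False
        failed_at = None
        for item in alt:
            if item == "~":
                committed = True       # cut reached: no more backtracking here
            elif p < len(tokens) and tokens[p] == item:
                p += 1
            else:
                failed_at = item
                break
        if failed_at is None:
            return p                   # this alternative matched fully
        if committed:
            raise SyntaxError(f"expected {failed_at!r}")
    return None                        # no alternative matched; caller may backtrack


try:
    parse_rule(["(", "expr"], 0, [["(", "~", "expr", ")"], ["expr"]])
except SyntaxError as exc:
    print("cut made this a hard error:", exc)
```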
Thanks! I was indeed looking at the wrong file. I was aware of the other file, but I thought the …

About the proposed patch: I'm currently trying a technique in Rust that might work for us as well. I will keep you posted.
👀
I was reading a paper about error recovery for PEG parsers, especially the ones used by IDEs for auto-completion, and noticed that I was missing an important aspect regarding backtracking on ambiguities in my imitated fork of parso's LL(1) parser. I believe with something similar to the commit (…
Status update: PEP 634 was accepted, with soft keywords.
Unfortunately, yes. :) If we don't have a good solution for PEG/LL(*) parsing in autumn, we can maybe also use error recovery. It could work well enough.
Yeah, that is what I kind of used (not entirely). Do you have any opinions regarding my initial approach? I'm thinking about also introducing a …
Give me one more month. I'm pretty far with my approach and would then show it to you.
@isidentical I'm a bit confused by how the new Grammar uses lookaheads. Negative lookaheads seem to be used in some places for speedups (maybe like 2-3 times), but in all other places they are pretty much useless and might be omitted. Do you happen to know why they are there?
Can you show a concrete example?
I realized that again I wasn't reading https://github.com/python/cpython/blob/master/Grammar/python.gram properly. I usually read https://docs.python.org/3.10/reference/grammar.html first to try to understand the grammar somewhat, and then refer to the official grammar to see if I understood it right, but in this case I didn't see that most …
@isidentical I'm doing my first few benchmarks with my new Rust parser. I'm using a 2 MB file that I repeat ten times, so 20 MB of code. Generally parso is like 30 times slower than both the CPython parser and my Rust parser.
Numbers are not extremely precise, just there for a general tendency. I'm a bit confused by the memory usage. Can you explain this? Especially ast.parse seems to be pretty extreme: 20 MB of code should IMO not take up 1.93 GB of space. Everything else seems to make sense. AST generation seems to take a lot of time, though. My Rust parser still has some space for user-definable stuff on the nodes, so that's essentially 250 MB that could be reduced even more if the nodes were compressed properly. I guess <200 MB would be feasible for a full syntax tree for 20 MB of code. It's especially interesting that a pure Python version of a syntax tree (parso) uses less RAM than the CPython version (however, it's also extremely slow; not sure how much interning of values is going on there).

Is there a way to test the pure speed of the CPython PEG parser? I'm not interested in the AST stuff.
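For what it's worth, a rough sketch (not the script used for the numbers above) of how such a time/memory comparison could be reproduced. Note that `tracemalloc` only sees Python-level allocations, so results for `ast.parse`, which allocates on the C side, will not match process-level measurements such as RSS; `big_file.py` is a placeholder for the 20 MB input:

```python
import ast
import time
import tracemalloc

import parso

with open("big_file.py") as f:      # hypothetical 20 MB input file
    source = f.read()

for name, parse in [("ast.parse", ast.parse), ("parso.parse", parso.parse)]:
    tracemalloc.start()
    start = time.perf_counter()
    tree = parse(source)
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    print(f"{name}: {elapsed:.2f}s, peak traced memory {peak / 1e6:.0f} MB")
```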
The difference seems quite big, and the only thing that I can think of is the memoization cost, though it shouldn't cause that much of a difference. There is a way to ~dump memoization stats, but it is not really explicit, though feel free to try (the …
Since the parser's only output is the AST, I don't think there is an easy way to see that speed. The only overhead that I see between those is the serialization phase, which shouldn't take much time. Though if you want to dive really deep, you could use a C profiler to profile the …
Thanks for your comment! After an hour or two of optimizing I got it down from 2.9s to 2.6s and from about 250 MB to 185 MB. It will probably stay there for a while. I could reduce a tree node further to get to 150 MB and then compress it to 120 MB. Not sure if that's all worth it, so I won't do it for now. A tree with ten million nodes would then use 85 MB RAM + 20 MB code + program. Pretty cool :). As a comparison: the Python standard library has ~10 MB of source code. I'm almost at the place where I can show my concepts to you. I feel like they would be pretty nice for parso as well, but they might be a bit of work. And I'm definitely too lazy to profile …
@davidhalter It's interesting that you say that you're working on a Rust parser. I've slowly been working on porting bits of LibCST to Rust for better performance (starting with Instagram/LibCST#452). I've currently got a 1:1 Rust port of CPython's tokenizer.c (the code isn't published yet, but I can put it up if you want to look at it), and I'm working on adding support for Parso's f-string splitting behavior. It doesn't currently support any sort of error recovery; I don't know if that would be useful to you. It'd be cool if we could continue to share some infrastructure. I haven't started on trying to move the parser to Rust yet (though Python 3.10 gives it some urgency), but I've been considering rust-peg for the job. IMO, the parsing needs of Parso and LibCST are different enough (LibCST doesn't need error recovery or incremental parsing) that I think it's fine if we diverge there.
I have also been enjoying working with you. However, I'm considering releasing the software as commercial software, so that will probably have to wait for a bit. :) It will probably also be released open source and for free for small companies and individuals, but your company obviously isn't part of that group :).
Up to you; commercial software can pull in PSF- or MIT-licensed dependencies! I've benefited immensely from your open-source work. It's totally fine if you want to take any of my code and use it in a commercial project, and it's okay if you choose not to share changes back. I don't really see Jedi/Parso and Fixit/LibCST as competing products: LibCST makes no attempt to target as-you-type IDE use-cases. It sounds like you already have a tokenizer, but if you're interested, here's my Rust tokenizer implementation. It pretty closely matches the behavior of Parso's tokenizer, except that it doesn't have any error recovery. Best of luck on your possible commercial software endeavor. A faster and more capable version of Parso/Jedi would be a no-brainer for lots of companies.
@bgw Thanks a lot for sharing. I'm really interested in comparing the two tokenizers in terms of performance. I might not do it now, but I will definitely compare them eventually.
For what it's worth, here is a (~decent-looking) Python parser written in Rust that I stumbled upon a few days ago: https://github.com/ProgVal/rust-python-parser/
@isidentical, I've seen that. It does look decent, except:
If you're looking for a Python AST parser in Rust, you should also consider RustPython's parser. I started by trying to wrap RustPython's lexer for LibCST, but I wanted the Rust tokenizer to match our Python tokenizer's behavior exactly so that it could be a drop-in replacement. There were enough differences that it ultimately ended up being easier for me to write my own implementation instead.
@bgw I agree. I'm not a big fan of parser combinators for language grammars. I like proper BNF grammars much more. Nom is definitely a good library, but I think language parsing is not what it's good for. I'm still a big fan of LL parsing, because of its mathematical beauty :) It also lends itself better towards error recovery than LR parsing. I have never really understood why people would prefer the latter.

@isidentical I have given you access to my private project. My approach does pretty much the following: …
What do you think about this approach? It is IMO interesting, because it does not need to calculate the …

Please also let me know if you don't understand what I just tried to explain. It's pretty hard to explain well, and I don't feel like I've done a good job. :)
I think it makes sense, especially considering that my approach was more of a brute force than a proper calculation.
Do you want to implement it with a different approach? I wouldn't be too unhappy, since I've already implemented it in another language :)
Nope, feel free to go with the pre-calculation. I might try to give it a shot, but I'm pretty stacked up until next month, so if anybody is up for the implementation, that would be great!
Hello. Do we have a plan for this feature needed for full Python 3.10 support?
Currently no. I know how this would work with an LL grammar and I'm happy to help people work on it, but for now I have other priorities. For Jedi this is mostly not really an issue, since autocompletions/goto will still work. The error listing is probably broken in 3.10, unfortunately. For everyone who wants to parse 3.10 grammar and work with that, I guess you are out of luck at the moment.
@davidhalter Is there a possibility to sponsor this feature?
Yes. Just note that this is a few days of work (might even be weeks), so giving me $5 is not going to cut it :)
Totally understandable. If you don't mind, I'll reach out via email?
Yes, sure.
@stnley Did my email answers reach you? I'm just rechecking here.
I've been thinking for a day about how we can possibly implement PEP 622 in case it gets accepted. While thinking, I drafted a piece of code which might seem totally unreasonable (since it changes the parser from LL(1) to semi-LL(1)), but I wanted to share it as a maybe-last-resort option. I tried to document the strategy in inline comments, so feel free to read and comment on it:
https://github.com/davidhalter/parso/compare/master...isidentical:soft-keywords-very-very-bad-demo?expand=1
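To illustrate why soft keywords push past plain LL(1) (my own example, not taken from the linked branch, requires Python 3.10+): the two statements below only diverge at the colon after the closing parenthesis, so a single token of lookahead after `match` is not enough to decide between an expression statement and a match statement.

```python
import ast

# "match" as a plain name: a call expression statement.
print(ast.dump(ast.parse("match (x)").body[0]))                      # Expr(value=Call(...))

# The same prefix, but the trailing ":" turns it into a match statement.
print(ast.dump(ast.parse("match (x):\n    case _: pass").body[0]))   # Match(...)
```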