-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[native] Add a rust implementation of whitespace_parser #452
Conversation
This adds an experimental new native extension, with the goal of improving performance, currently just reimplementing `whitespace_parser`. This appears to make parsing marginally faster overall, but the performance gains are pretty limited because whitespace parsing is a small portion of the overall parsing process, and there's some added overhead of converting types (mostly unicode strings) between rust and python. Instead, this helps show what's needed to port part of the codebase over to rust in an example that's small enough to be digestible, but big enough to be more than a trivial toy. I originally started by trying to port the tokenizer to rust, since it takes a larger portion of time and it's infrequently modified (making it a good candidate), but found issues with the implementation direction I was taking, so I scrapped that for now.
Interesting! I wonder how this compares to a
Would be really curious about the bottlenecks and if they're easy to fix.
Yep, agreed. We can keep the experiment in this PR for now until it provides a significant enough perf advantage to outweigh the maintenance and convenience (to users) cost |
Status update! I've got a working Rust-based tokenizer here: bgw@ad0d900 It's based on CPython's C-based tokenizer (not the regex-heavy pure-python parso tokenizer), so it's a little more complicated, but it can also (at least in theory) generate better error messages. It should also handle a few very weird edge cases around formfeed characters (#446) and mixed tab/space indentation better. Benchmark on master:
Benchmark on top of bgw@ad0d900:
We're getting to the point where there's actually some useful performance gain! However, I think most of the possible performance gains are in porting the CSTNode types, so that's likely my next target. The CST type definitions involve a lot more code than the tokenizer or whitespace parser though, so it'll take even more time. We currently frequently copy and traverse over CSTNodes; I think we could do a lot better with codegen and a native implementation. |
Porting all of LibCST to mypyc would be a difficult project. The tokenizer and whitespace parser have type annotations, but the LL(1) parser is nearly impossible to type annotate as it's currently designed. I suppose you could just use mypyc on the parts that are already typechecked. The CST node types could be compile with mypyc, but the biggest problems with them (unnecessary object clones during traversal) are actually going to need some sort of codegen (or macro) system to solve, so it's not actually a very compiler/language-specific problem. My real (selfish) motivation behind this PR is that I've fallen in love with Rust, and I want to write more Rust code. If I was still working on LibCST as part of my day job, I'd probably aim for much lower-hanging fruit than rewriting everything in Rust.
When we looked at this before, CST node traversal was disproportionately awful. As stated above, we'd need some sort of codegen (or macro system, cough Rust proc-macros cough) to solve this in any sort of sane way. |
Should we close this? I believe #566 built off of a version of this work, so it's effectively been merged already? |
Summary
This adds an experimental new native extension, with the goal of improving performance, currently just reimplementing
whitespace_parser
.The implementation is in Rust with PyO3.
This appears to make parsing marginally faster overall, but the performance gains are pretty limited because whitespace parsing is a small portion of the overall parsing process, and there's some added overhead of converting types (mostly unicode strings) between Rust and Python. I haven't done much benchmarking or any profiling yet.
Instead, this helps show what's needed to port part of the codebase over to rust in an example that's small enough to be digestible, but big enough to be more than a trivial toy.
I originally started by trying to port the tokenizer to Rust, since it takes a larger portion of time and it's infrequently modified (making it a good candidate), but found issues with the implementation direction I was taking, so I scrapped that for now.
Test Plan
Performance Numbers
Here's some very rough performance numbers (Debian 10 with Python 3.7). I've run this a few times, and
libcst_native
seems to be consistently faster by a very small margin.With
libcst_native
(compiled with--release
)Without
libcst_native
Stuff I've Learned
Passing around Python objects in Rust (e.g.
PyObject
) is cheap, but convertingPyString
s to RustString
s (or vice-versa) can easily add up and destroy performance. I added aPyCached
struct to reduce the number of conversions I was doing, but it'd be best to pick what stuff to port tolibcst_native
based on how much data needs to be converted between languages.Python subclasses and Rust are an odd couple. PyO3 doesn't allow Rust classes to extend non-native Python classes. PyO3 uses
__new__
for construction instead of__init__
, and when subclassing a rust class in Python, this may mean that you have to use__new__
instead of__init__
too. There are ways around these problems, but it requires some cleverness.This isn't much of a surprise, but there's no
dataclasses
module in Rust, so things that exposed dataclasses will no longer expose dataclasses. This may result in some small but technically breaking API changes.Meta-Commentary
It's totally reasonable if people don't feel like this should be merged yet. It doesn't currently provide any useful performance advantage, and I can't guarantee that I'll ever get
libcst_native
this to a state where it provides a useful performance advantage. I'm just playing with this in my spare time.If we actually started shipping
libcst_native
as anything more than an experimental feature, I expect that we'd want to get rid of the Python versions of the modules thatlibcst_native
provides implementations for since maintaining two implementations seems insane.If we start taking this seriously, I can find somebody in the internal "Rust at Facebook" group with the expertise to help review this.