-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Completion should take <16ms most of the time #7542
Comments
Related issue about the benchmarks: #6633 A few comments on the interesting places in the
As an optimisation idea, we can use a better key in There we have an fst (the index we store textual full path strings in) that does a blazingly fast fuzzy search on the completion input, for our test profiling case.
to retain only the traits that the completion receiver ( There's a way to find a part of the traits that are implemented for a certain struct, which is used in the Alas, it's limited and does not find some classes of trait impls entirely, and getting us there might be hard. Yet, the |
Some more assorted thoughts:
|
#8254 should help with perf optimization workflow. |
9017: internal: Reduce the number of traits passed through chalk during applicable trait lookup r=SomeoneToIgnore a=SomeoneToIgnore Inherent traits can be omitted before trait solving, presumably slightly helping #7542 and slightly simplifying the code. Co-authored-by: Kirill Bulatov <mail4score@gmail.com>
9206: minor: Speed up fst items lookup during completions r=SomeoneToIgnore a=SomeoneToIgnore Part of #7542 A number of profile calls added for `import_on_the_fly` contents. Before: <img width="606" alt="Screenshot 2021-06-11 at 00 19 13" src="https://user-images.githubusercontent.com/2690773/121598998-22321e80-ca4b-11eb-9a3d-dc9cb2936705.png"> After: <img width="859" alt="Screenshot 2021-06-11 at 00 19 27" src="https://user-images.githubusercontent.com/2690773/121599022-2a8a5980-ca4b-11eb-82b6-13ab0ed56d58.png"> As a result, low hanging fruit was spotted: crazy amount of `fst_path` calls. Reducing that won ~200ms in the `import_on_the_fly @ sel` case in the `integrated_completion_benchmark`: <img width="861" alt="Screenshot 2021-06-11 at 00 19 38" src="https://user-images.githubusercontent.com/2690773/121599277-7d641100-ca4b-11eb-8667-53206994de27.png"> I'm not sure how to proceed with the remaining `???` marks in such methods as `collect_import_map` though: there's nothing but library calls in cycles, but maybe I'll come up with something later. Co-authored-by: Kirill Bulatov <mail4score@gmail.com>
9208: minor: Populate import maps eagerly to speed up flyimports r=SomeoneToIgnore a=SomeoneToIgnore Part of #7542 Follow up of #9206 (comment) Reduces `import_on_the_fly @ sel` case in the `integrated_completion_benchmark` by ~300ms. Also enables cache priming for manual workspace loading to reflect the results in the benchmarks. Before: <img width="1198" alt="image" src="https://user-images.githubusercontent.com/2690773/121606148-4a734a80-ca56-11eb-812a-7955e93817f1.png"> After: <img width="1200" alt="image" src="https://user-images.githubusercontent.com/2690773/121606156-4e06d180-ca56-11eb-891b-1ed878b53d7e.png"> Co-authored-by: Kirill Bulatov <mail4score@gmail.com>
Hello. I'd like to tackle some of this work during my company's hackathon this week. Commenting here to hopefully claim it. |
Splendid! @mirkoRainer running this test in release mode should be a good start. |
I'm terribly sorry, but I was not able to make any good progress on this issue during the Hackathon. I'm fine with leaving it assigned to me but I can't guarantee any timely progress on it. |
I tested in a recent rust-analyzer version:
flyimport now takes a small portion of the time. The majority of time is spent in analyze_impl |
I tested Gnome-Builder with Rust-analyzer. The second one loads instantly even though the first one takes some time to process. |
Remove redundant 'resolve_obligations_as_possible' call Hi! I was looking for a "good first issue" and saw this one: #7542. I like searching for performance improvements, so I wanted to try to find something useful there. There are two tests in integrated_benchmarks.rs, I looked at 'integrated_highlighting_benchmark' (not the one discussed in the issue above). Profile from that test looks like this: ``` $ RUN_SLOW_BENCHES=1 cargo test --release --package rust-analyzer --lib -- integrated_benchmarks::integrated_highlighting_benchmark --exact --nocapture Finished release [optimized] target(s) in 0.06s Running unittests src/lib.rs (target/release/deps/rust_analyzer-a80ca6bb8f877458) running 1 test workspace loading: 358.45ms initial: 9.60s change: 13.96µs cpu profiling is disabled, uncomment `default = [ "cpu_profiler" ]` in Cargo.toml to enable. 273ms - highlight 143ms - infer:wait @ per_query_memory_usage 143ms - infer_query 0 - crate_def_map:wait (3165 calls) 4ms - deref_by_trait (967 calls) 96ms - resolve_obligations_as_possible (22106 calls) 0 - trait_solve::wait (2068 calls) 21ms - Semantics::analyze_impl (18 calls) 0 - SourceBinder::to_module_def (20 calls) 36ms - classify_name (19 calls) 19ms - classify_name_ref (308 calls) 0 - crate_def_map:wait (461 calls) 4ms - descend_into_macros (628 calls) 0 - generic_params_query (4 calls) 0 - impl_data_with_diagnostics_query (1 calls) 45ms - infer:wait (37 calls) 0 - resolve_obligations_as_possible (2 calls) 0 - source_file_to_def (1 calls) 0 - trait_solve::wait (42 calls) after change: 275.23ms test integrated_benchmarks::integrated_highlighting_benchmark ... ok ``` 22106 calls to `resolve_obligations_as_possible` seem like the main issue there. One thing I noticed (and fixed in this PR) is that `InferenceContext::resolve_ty_shallow` first calls `resolve_obligations_as_possible`, and then calls `InferenceTable::resolve_ty_shallow`. But `InferenceTable::resolve_ty_shallow` [inside](https://github.com/rust-lang/rust-analyzer/blob/2e9f1204ca01c3e20898d4a67c8b84899d394a88/crates/hir-ty/src/infer/unify.rs#L372) again calls `resolve_obligations_as_possible`. `resolve_obligations_as_possible` inside has a while loop, which works until it can't find any helpful information. So calling this function for the second time does nothing, so one of the calls could be safely removed. `InferenceContext::resolve_ty_shallow` is actually quite a hot place, and after fixing it, the total number of `resolve_obligations_as_possible` in this test is reduced to 15516 (from 22106). "After change" time also improves from ~270ms to ~240ms, which is not a very huge win, but still something measurable. Same profile after PR: ``` $ RUN_SLOW_BENCHES=1 cargo test --release --package rust-analyzer --lib -- integrated_benchmarks::integrated_highlighting_benchmark --exact --nocapture Finished release [optimized] target(s) in 0.06s Running unittests src/lib.rs (target/release/deps/rust_analyzer-a80ca6bb8f877458) running 1 test workspace loading: 339.86ms initial: 9.28s change: 10.69µs cpu profiling is disabled, uncomment `default = [ "cpu_profiler" ]` in Cargo.toml to enable. 236ms - highlight 110ms - infer:wait @ per_query_memory_usage 110ms - infer_query 0 - crate_def_map:wait (3165 calls) 4ms - deref_by_trait (967 calls) 64ms - resolve_obligations_as_possible (15516 calls) 0 - trait_solve::wait (2068 calls) 21ms - Semantics::analyze_impl (18 calls) 0 - SourceBinder::to_module_def (20 calls) 34ms - classify_name (19 calls) 18ms - classify_name_ref (308 calls) 0 - crate_def_map:wait (461 calls) 3ms - descend_into_macros (628 calls) 0 - generic_params_query (4 calls) 0 - impl_data_with_diagnostics_query (1 calls) 45ms - infer:wait (37 calls) 0 - resolve_obligations_as_possible (2 calls) 0 - source_file_to_def (1 calls) 0 - trait_solve::wait (42 calls) after change: 238.15ms test integrated_benchmarks::integrated_highlighting_benchmark ... ok ``` The performance of this test could be further improved but at the cost of making code more complicated, so I wanted to check if such a change is desirable before sending another PR. `resolve_obligations_as_possible` is actually called a lot of times even when no new information was provided. As I understand, `resolve_obligations_as_possible` could do something useful only if some variables/values were unified since the last check. We can store a boolean variable inside `InferenceTable`, which indicates if `try_unify` was called after last `resolve_obligations_as_possible`. If it wasn't called, we can safely not call `resolve_obligations_as_possible` again. I tested this change locally, and it reduces the number of `resolve_obligations_as_possible` to several thousand (it is not shown in the profile anymore, so don't know the exact number), and the total time is reduced to ~180ms. Here is a generated profile: ``` $ RUN_SLOW_BENCHES=1 cargo test --release --package rust-analyzer --lib -- integrated_benchmarks::integrated_highlighting_benchmark --exact --nocapture Finished release [optimized] target(s) in 0.06s Running unittests src/lib.rs (target/release/deps/rust_analyzer-a80ca6bb8f877458) running 1 test workspace loading: 349.92ms initial: 8.56s change: 11.32µs cpu profiling is disabled, uncomment `default = [ "cpu_profiler" ]` in Cargo.toml to enable. 175ms - highlight 21ms - Semantics::analyze_impl (18 calls) 0 - SourceBinder::to_module_def (20 calls) 33ms - classify_name (19 calls) 17ms - classify_name_ref (308 calls) 0 - crate_def_map:wait (461 calls) 3ms - descend_into_macros (628 calls) 0 - generic_params_query (4 calls) 0 - impl_data_with_diagnostics_query (1 calls) 97ms - infer:wait (38 calls) 0 - resolve_obligations_as_possible (2 calls) 0 - source_file_to_def (1 calls) 0 - trait_solve::wait (42 calls) after change: 177.04ms test integrated_benchmarks::integrated_highlighting_benchmark ... ok ``` Let me know if adding a new bool field seems like a reasonable tradeoff, so I can send a PR.
Code completion performance is more important than it seems. Obviously, completion shouldn't take a second or two. But even for "almost instant" completion, theres' a huge difference in experience when going 300ms -> 150ms -> 50 ms -> 16ms. It's difficult to perceive such small differences by just eyeballing a single completion, but it does affect fluency of typing a lot.
Historically, we haven't payed much attention to completion, barring obvious regressions. It was, and it is, generally fast enough. But it seems like we are at the point where it makes sense to push performance here a bit.
The 16ms is the boundary such that it doesn't make much sense to go beyond it (as thats comparable to typing latency), and which should be achievable. The size of the result is small, and the work should be roughly linear in the size of the result.
The best way to start here is to set this config:
and check Code's output panel for profiling info, which looks like this:
Keep in mind that rust-analyzer employes lazy evaluation. That means that if both
f
andg
calls
, ands
is slow, thens
time will be attributed to eitherf
org
depending on their relative order.Another place to look at is the end-to-end request flow in the main loop. Profiling captures only the computation, but it's important that completion is not blocked by other requests.
It would also be sweet to implement some maintainable benchmarks here. This would have high impact, but I don't know how to best approach this.
Other that that, just look at the core with the profiler and try to make slow things faster!
The text was updated successfully, but these errors were encountered: