Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Completion should take <16ms most of the time #7542

Open
matklad opened this issue Feb 3, 2021 · 9 comments
Open

Completion should take <16ms most of the time #7542

matklad opened this issue Feb 3, 2021 · 9 comments
Labels
A-completion autocompletion A-perf performance issues E-has-instructions Issue has some instructions and pointers to code to get started fun A technically challenging issue with high impact good first issue S-actionable Someone could pick this issue up and work on it right now

Comments

@matklad
Copy link
Member

matklad commented Feb 3, 2021

Code completion performance is more important than it seems. Obviously, completion shouldn't take a second or two. But even for "almost instant" completion, theres' a huge difference in experience when going 300ms -> 150ms -> 50 ms -> 16ms. It's difficult to perceive such small differences by just eyeballing a single completion, but it does affect fluency of typing a lot.

Historically, we haven't payed much attention to completion, barring obvious regressions. It was, and it is, generally fast enough. But it seems like we are at the point where it makes sense to push performance here a bit.

The 16ms is the boundary such that it doesn't make much sense to go beyond it (as thats comparable to typing latency), and which should be achievable. The size of the result is small, and the work should be roughly linear in the size of the result.

The best way to start here is to set this config:

    "rust-analyzer.server.extraEnv": {
        "RA_PROFILE": "handle_completion>16",
    },

and check Code's output panel for profiling info, which looks like this:

   85ms - handle_completion
       68ms - import_on_the_fly @ 
           67ms - import_assets::search_for_relative_paths
                0ms - crate_def_map:wait (804 calls)
                0ms - find_path (16 calls)
                2ms - find_similar_imports (1 calls)
                0ms - generic_params_query (334 calls)
               59ms - trait_solve_query (186 calls)
            0ms - Semantics::analyze_impl (1 calls)
            1ms - render_resolution (8 calls)
        0ms - Semantics::analyze_impl (5 calls)

Keep in mind that rust-analyzer employes lazy evaluation. That means that if both f and g call s, and s is slow, then s time will be attributed to either f or g depending on their relative order.

Another place to look at is the end-to-end request flow in the main loop. Profiling captures only the computation, but it's important that completion is not blocked by other requests.

It would also be sweet to implement some maintainable benchmarks here. This would have high impact, but I don't know how to best approach this.

Other that that, just look at the core with the profiler and try to make slow things faster!

@matklad matklad added good first issue fun A technically challenging issue with high impact A-completion autocompletion S-actionable Someone could pick this issue up and work on it right now labels Feb 3, 2021
@SomeoneToIgnore
Copy link
Contributor

SomeoneToIgnore commented Feb 3, 2021

Related issue about the benchmarks: #6633

A few comments on the interesting places in the handle_completion test profiling trace:

  • import_on_the_fly @ shows that we look up imports for blank input, which we only do, when searching for the associated items' completion (autoimporting traits).
    We optimize away struct search, performing the actual search after the 3rd symbol in the completion input, but we cannot optimize the associated items this way: showing methods and functions not right away is awkward

  • the time difference between import_assets::search_for_relative_paths and trait_solve_query is 8ms, but 6ms are not shown in the profiling info, a few more profile::span placed in the modules used inside the import_assets::search_for_relative_paths should help to locate them

  • interesting to note, that it took only 2ms to look up all traits with at least one associated item in the project, as the find_similar_imports profile span tells.
    We can still improve it nonetheless, this might be beneficial for other import_on_the_fly completion cases such as struct completions, we can remove that "after the 3rd symbol" "optimisation", but this all requires a separate set of checks.

As an optimisation idea, we can use a better key in import_map
https://github.com/rust-analyzer/rust-analyzer/blob/2c735ed734be9b9041921478e0049fffd7160f78/crates/hir_def/src/import_map.rs#L161

There we have an fst (the index we store textual full path strings in) that does a blazingly fast fuzzy search on the completion input, for our test profiling case.
Now, we use a lowercased path string, as a weird idea, we can prepend the paths in keys with special symbols (©, Ѭ?), depending on the entity type, thus decreasing the number of checks we need to do after getting the search results from fst.

  • trait_solve_query is called within the depth of the trait resolution, where we send all traits looked up on the previous step into chalk:

https://github.com/rust-analyzer/rust-analyzer/blob/342bf41022fa78856a48560b521ea32aa76f6f06/crates/ide_db/src/helpers/import_assets.rs#L267-L283

to retain only the traits that the completion receiver (some::path:: or struct.) implements.
We can limit the number of traits to check, but most probably not fully eliminate it.

There's a way to find a part of the traits that are implemented for a certain struct, which is used in the implementations code lens: https://github.com/rust-analyzer/rust-analyzer/blob/7967ce85cfc5fc2b1996425b44f2a45d0841c8ff/crates/ide/src/goto_implementation.rs#L41

Alas, it's limited and does not find some classes of trait impls entirely, and getting us there might be hard.
For example, one of those trait categories are the blanket trait impls: impl <T> MyTrait for T.
The following FIXME has two links with deeper ideas on the topic:
https://github.com/rust-analyzer/rust-analyzer/blob/342bf41022fa78856a48560b521ea32aa76f6f06/crates/ide_db/src/helpers/import_assets.rs#L177-L181

Yet, the fst lookup collects the all possible traits in the project for solving, including the ones that are not detected by the lens' code.
If we could come up with the way to filter out fst results to retain only the traits from such categories, we can reduce the number of trait solvings, optimising the slowest place in the current profile trace.
Additionally, we'll need to extract and profile the lens' code for finding the impls and add some code to filter out the traits, extract their associated item names and compare those to the fst search results (to reuse the approach for non-empty completion input)

@matklad
Copy link
Member Author

matklad commented Feb 5, 2021

Some more assorted thoughts:

  • on the fly imports are fundamentally costlier than anything else in completion, so it makes sense to specifically optimize that
  • the set of auto-imporatble items doesn't change between funcitons, so we might want to cache that
  • more generally, completions for xs.f and xs.fo are virtually identical, we should ensure that the second one takes almost no time
  • there's discontinuity in completion speed: it the completion is fast enough, code shows completion list without ctrl+space. We must always hit this breakpoint
  • on the protocol/API level, it might be worthwhile to separate immediate completions and slow completions. Itis imrortant that, eg, snippets are always instant, but its ok if flyimport is async
  • on the protocol level, we should measure the delay between the time we get a notification about a new char typed, and the time we get the completion request.

@matklad
Copy link
Member Author

matklad commented Mar 30, 2021

#8254 should help with perf optimization workflow.

@matklad matklad mentioned this issue May 24, 2021
6 tasks
bors bot added a commit that referenced this issue May 26, 2021
9017: internal: Reduce the number of traits passed through chalk during applicable trait lookup r=SomeoneToIgnore a=SomeoneToIgnore

Inherent traits can be omitted before trait solving, presumably slightly helping #7542 and slightly simplifying the code.

Co-authored-by: Kirill Bulatov <mail4score@gmail.com>
bors bot added a commit that referenced this issue Jun 10, 2021
9206: minor: Speed up fst items lookup during completions r=SomeoneToIgnore a=SomeoneToIgnore

Part of #7542

A number of profile calls added for `import_on_the_fly` contents.
Before:
<img width="606" alt="Screenshot 2021-06-11 at 00 19 13" src="https://user-images.githubusercontent.com/2690773/121598998-22321e80-ca4b-11eb-9a3d-dc9cb2936705.png">

After:
<img width="859" alt="Screenshot 2021-06-11 at 00 19 27" src="https://user-images.githubusercontent.com/2690773/121599022-2a8a5980-ca4b-11eb-82b6-13ab0ed56d58.png">

As a result, low hanging fruit was spotted: crazy amount of `fst_path` calls. Reducing that won ~200ms in the `import_on_the_fly @ sel` case in the `integrated_completion_benchmark`:

<img width="861" alt="Screenshot 2021-06-11 at 00 19 38" src="https://user-images.githubusercontent.com/2690773/121599277-7d641100-ca4b-11eb-8667-53206994de27.png">

I'm not sure how to proceed with the remaining `???` marks in such methods as `collect_import_map` though: there's nothing but library calls in cycles, but maybe I'll come up with something later.

Co-authored-by: Kirill Bulatov <mail4score@gmail.com>
bors bot added a commit that referenced this issue Jun 11, 2021
9208: minor: Populate import maps eagerly to speed up flyimports r=SomeoneToIgnore a=SomeoneToIgnore

Part of #7542
Follow up of #9206 (comment)
Reduces `import_on_the_fly @ sel` case in the `integrated_completion_benchmark` by ~300ms.


Also enables cache priming for manual workspace loading to reflect the results in the benchmarks.

Before:
<img width="1198" alt="image" src="https://user-images.githubusercontent.com/2690773/121606148-4a734a80-ca56-11eb-812a-7955e93817f1.png">


After:
<img width="1200" alt="image" src="https://user-images.githubusercontent.com/2690773/121606156-4e06d180-ca56-11eb-891b-1ed878b53d7e.png">


Co-authored-by: Kirill Bulatov <mail4score@gmail.com>
@matklad matklad mentioned this issue Jul 12, 2021
7 tasks
@jonas-schievink jonas-schievink added the A-perf performance issues label Jul 28, 2021
@mirkoRainer
Copy link
Contributor

Hello. I'd like to tackle some of this work during my company's hackathon this week. Commenting here to hopefully claim it.

@matklad
Copy link
Member Author

matklad commented Oct 11, 2021

Splendid! @mirkoRainer running this test in release mode should be a good start.

@matklad
Copy link
Member Author

matklad commented Oct 11, 2021

@mirkoRainer
Copy link
Contributor

Splendid! @mirkoRainer running this test in release mode should be a good start.

I'm terribly sorry, but I was not able to make any good progress on this issue during the Hackathon. I'm fine with leaving it assigned to me but I can't guarantee any timely progress on it.
Thank you though! I was able to learn a lot with the video series and reading the discussions around this issue.

@hf29h8sh321
Copy link

I tested in a recent rust-analyzer version:

  288ms - handle_completion
       31ms - CompletionContext::new
           30ms - descend_into_macros
               30ms - Semantics::analyze_impl
                   29ms - SourceBinder::to_module_def
                       29ms - crate_def_map:wait
                        0   - relevant_crates (1 calls)
                    0   - body_with_source_map_query (1 calls)
                    0   - crate_def_map:wait (2 calls)
            0   - CompletionContext::expand_and_fill (1 calls)
            0   - Semantics::analyze_impl (1 calls)
            0   - crate_def_map:wait (1 calls)
      202ms - Semantics::analyze_impl
          202ms - infer:wait @ from_value
            0   - SourceBinder::to_module_def (1 calls)
            0   - crate_def_map:wait (1 calls)
       40ms - import_on_the_fly @ 
           40ms - import_assets::search_for_imports
               40ms - import_assets::search_for
                   40ms - import_assets::trait_applicable_items
                        0   - applicable_inherent_traits (1 calls)
                        0   - deref (2 calls)
                        0   - env_traits (1 calls)
                       20ms - get_name_definition (40 calls)
                       13ms - items_with_name (1 calls)
                        5ms - iterate_method_candidates (1 calls)
                    0   - import_assets::scope_definitions (1 calls)
            0   - Semantics::analyze_impl (1 calls)
            0   - render_resolution (1 calls)
        0   - Semantics::analyze_impl (1 calls)
        0   - complete_unqualified_path (1 calls)
        0   - crate_def_map:wait (226 calls)
        0   - deref (1 calls)
        0   - find_path_prefixed (3 calls)
        0   - item::Builder::build (17 calls)
       12ms - iterate_method_candidates (1 calls)
        0   - trait_solve::wait (3 calls)

flyimport now takes a small portion of the time. The majority of time is spent in analyze_impl

@hf29h8sh321
Copy link

more generally, completions for xs.f and xs.fo are virtually identical, we should ensure that the second one takes almost no time

I tested Gnome-Builder with Rust-analyzer. The second one loads instantly even though the first one takes some time to process.

bors added a commit that referenced this issue Sep 12, 2022
Remove redundant 'resolve_obligations_as_possible' call

Hi! I was looking for a "good first issue" and saw this one: #7542. I like searching for performance improvements, so I wanted to try to find something useful there.

There are two tests in integrated_benchmarks.rs, I looked at 'integrated_highlighting_benchmark' (not the one discussed in the issue above).

Profile from that test looks like this:
```
$ RUN_SLOW_BENCHES=1 cargo test --release --package rust-analyzer --lib -- integrated_benchmarks::integrated_highlighting_benchmark --exact --nocapture
    Finished release [optimized] target(s) in 0.06s
     Running unittests src/lib.rs (target/release/deps/rust_analyzer-a80ca6bb8f877458)

running 1 test
workspace loading: 358.45ms
initial: 9.60s
change: 13.96µs
cpu profiling is disabled, uncomment `default = [ "cpu_profiler" ]` in Cargo.toml to enable.
  273ms - highlight
      143ms - infer:wait @ per_query_memory_usage
          143ms - infer_query
                0   - crate_def_map:wait (3165 calls)
                4ms - deref_by_trait (967 calls)
               96ms - resolve_obligations_as_possible (22106 calls)
                0   - trait_solve::wait (2068 calls)
       21ms - Semantics::analyze_impl (18 calls)
        0   - SourceBinder::to_module_def (20 calls)
       36ms - classify_name (19 calls)
       19ms - classify_name_ref (308 calls)
        0   - crate_def_map:wait (461 calls)
        4ms - descend_into_macros (628 calls)
        0   - generic_params_query (4 calls)
        0   - impl_data_with_diagnostics_query (1 calls)
       45ms - infer:wait (37 calls)
        0   - resolve_obligations_as_possible (2 calls)
        0   - source_file_to_def (1 calls)
        0   - trait_solve::wait (42 calls)
after change: 275.23ms
test integrated_benchmarks::integrated_highlighting_benchmark ... ok
```
22106 calls to `resolve_obligations_as_possible` seem like the main issue there.

One thing I noticed (and fixed in this PR) is that `InferenceContext::resolve_ty_shallow` first calls `resolve_obligations_as_possible`, and then calls `InferenceTable::resolve_ty_shallow`. But `InferenceTable::resolve_ty_shallow` [inside](https://github.com/rust-lang/rust-analyzer/blob/2e9f1204ca01c3e20898d4a67c8b84899d394a88/crates/hir-ty/src/infer/unify.rs#L372) again calls `resolve_obligations_as_possible`.

`resolve_obligations_as_possible` inside has a while loop, which works until it can't find any helpful information. So calling this function for the second time does nothing, so one of the calls could be safely removed.

`InferenceContext::resolve_ty_shallow` is actually quite a hot place, and after fixing it, the total number of `resolve_obligations_as_possible` in this test is reduced to 15516 (from 22106). "After change" time also improves from ~270ms to ~240ms, which is not a very huge win, but still something measurable.

Same profile after PR:
```
$ RUN_SLOW_BENCHES=1 cargo test --release --package rust-analyzer --lib -- integrated_benchmarks::integrated_highlighting_benchmark --exact --nocapture
    Finished release [optimized] target(s) in 0.06s
     Running unittests src/lib.rs (target/release/deps/rust_analyzer-a80ca6bb8f877458)

running 1 test
workspace loading: 339.86ms
initial: 9.28s
change: 10.69µs
cpu profiling is disabled, uncomment `default = [ "cpu_profiler" ]` in Cargo.toml to enable.
  236ms - highlight
      110ms - infer:wait @ per_query_memory_usage
          110ms - infer_query
                0   - crate_def_map:wait (3165 calls)
                4ms - deref_by_trait (967 calls)
               64ms - resolve_obligations_as_possible (15516 calls)
                0   - trait_solve::wait (2068 calls)
       21ms - Semantics::analyze_impl (18 calls)
        0   - SourceBinder::to_module_def (20 calls)
       34ms - classify_name (19 calls)
       18ms - classify_name_ref (308 calls)
        0   - crate_def_map:wait (461 calls)
        3ms - descend_into_macros (628 calls)
        0   - generic_params_query (4 calls)
        0   - impl_data_with_diagnostics_query (1 calls)
       45ms - infer:wait (37 calls)
        0   - resolve_obligations_as_possible (2 calls)
        0   - source_file_to_def (1 calls)
        0   - trait_solve::wait (42 calls)
after change: 238.15ms
test integrated_benchmarks::integrated_highlighting_benchmark ... ok
```

The performance of this test could be further improved but at the cost of making code more complicated, so I wanted to check if such a change is desirable before sending another PR.

`resolve_obligations_as_possible` is actually called a lot of times even when no new information was provided. As I understand, `resolve_obligations_as_possible` could do something useful only if some variables/values were unified since the last check. We can store a boolean variable inside `InferenceTable`, which indicates if `try_unify` was called after last `resolve_obligations_as_possible`. If it wasn't called, we can safely not call `resolve_obligations_as_possible` again.

I tested this change locally, and it reduces the number of `resolve_obligations_as_possible` to several thousand (it is not shown in the profile anymore, so don't know the exact number), and the total time is reduced to ~180ms. Here is a generated profile:
```
$ RUN_SLOW_BENCHES=1 cargo test --release --package rust-analyzer --lib -- integrated_benchmarks::integrated_highlighting_benchmark --exact --nocapture
    Finished release [optimized] target(s) in 0.06s
     Running unittests src/lib.rs (target/release/deps/rust_analyzer-a80ca6bb8f877458)

running 1 test
workspace loading: 349.92ms
initial: 8.56s
change: 11.32µs
cpu profiling is disabled, uncomment `default = [ "cpu_profiler" ]` in Cargo.toml to enable.
  175ms - highlight
       21ms - Semantics::analyze_impl (18 calls)
        0   - SourceBinder::to_module_def (20 calls)
       33ms - classify_name (19 calls)
       17ms - classify_name_ref (308 calls)
        0   - crate_def_map:wait (461 calls)
        3ms - descend_into_macros (628 calls)
        0   - generic_params_query (4 calls)
        0   - impl_data_with_diagnostics_query (1 calls)
       97ms - infer:wait (38 calls)
        0   - resolve_obligations_as_possible (2 calls)
        0   - source_file_to_def (1 calls)
        0   - trait_solve::wait (42 calls)
after change: 177.04ms
test integrated_benchmarks::integrated_highlighting_benchmark ... ok
```
Let me know if adding a new bool field seems like a reasonable tradeoff, so I can send a PR.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-completion autocompletion A-perf performance issues E-has-instructions Issue has some instructions and pointers to code to get started fun A technically challenging issue with high impact good first issue S-actionable Someone could pick this issue up and work on it right now
Projects
None yet
Development

No branches or pull requests

5 participants