Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Environments for code fragments/buffers (i.e. elixirsense) #12645

Closed
josevalim opened this issue Jun 10, 2023 · 69 comments
Closed

Environments for code fragments/buffers (i.e. elixirsense) #12645

josevalim opened this issue Jun 10, 2023 · 69 comments

Comments

@josevalim
Copy link
Member

Hi everyone,

This is as second take on "building environments for code fragments" for IDEs (follow up on #12566). The goal of this issue is to propose a rich data structure for tracking fragment/buffers. Let's start with some background.

Status quo

Currently, Code.Fragment contains functions that analyze where we are in a code snippet:

  • cursor_context tells us where we are in a token

  • container_cursor_to_quoted tells us where we are in the AST (improved in Elixir v1.15 to provide additional context)

So while Elixir can tell that you are inside a alias Enumera call or that you are in the middle of typing Code.Fr, it does not provide any mechanism for telling you that Code.Fragment is a valid autocompletion for Code.Fr.

In order for us to provide such completions, we need to consider three environments:

  1. The fragment/buffer environment. This is a file in your editor, a cell in Livebook, or a prompt in IEx. This file may or may not be valid Elixir.

  2. The eval environment. This comes from previous cells in Livebook and previously executed code in IEx. AFAIK, it does not exist in IDEs.

  3. The global environment. This is all modules, dependencies, etc. Exactly how those modules are compiled and searchable is a separate problem.

The goal of this issue is to build infrastructure to solve the fragment/buffer environment.

Fragment/buffer environment

Take the following code (think of text in your editor):

defmodule Foo do
  import Baz

  def bat do
    var = 123
    {<CURSOR>
  end

  def local_function do
    ...
  end
end

Our goal is to know what is the fragment/buffer environment so we can use it for completion and other features. For example: what has been imported from Baz that is available inside bat? What are the local functions visible from bat? And so on.

After talking with @michalmuskala, we want to approach this problem using the following techniques:

  1. First the whole file is loaded as a fragment/buffer

  2. We will process the fragment using a robust parser that will be capable of ignoring incomplete expressions, such as {in the example above

  3. Once the fragment is processed, we will build a tree of scopes. For example, both bat and local_function, each delimited by their do/end blocks, depend on the scope from defmodule Foo do

  4. As the user changes those files, we will replicate those diffs on top of the fragment. For example, if a line is added, then we add said line to the buffer in a way we can reuse scopes and minimize additional computations

Therefore, the goal is to always be able to quickly answer which functions, variables, imports, aliases, are available at any point in a fragment, ultimately reducing latency in IDE features. I believe this will allow us to replace elixirsense in a way that is robust and performant.

The tasks above are major efforts in themselves, so we will need further issues to break them down, but this issue introduces a general direction.

PS: it is yet unclear if the buffer will track function/macro calls - in order to provide go-to definition and similar - or only the environment. However, note that today it is not trivial to provide mouse-over suggestions with surround_context because it doesn't tell us the arity. Livebook addresses this by explicitly looking after |> operators, but that does not cover all cases.

/cc @scohen @scottming @lukaszsamson @mhanberg

@mhanberg
Copy link
Member

Hi @josevalim! Thanks for including me on this.

For now I don't have too much to add other than observe, but in the coming weeks I will be much more up to speed.

Thanks again!

@lukaszsamson
Copy link
Contributor

The eval environment. This comes from previous cells in Livebook and previously executed code in IEx. AFAIK, it does not exist in IDEs.

It depends on how you classify things. Is app config a part of eval env? Is static analysis? ElixirSense can infer that my_var has property hour in the following snippet

my_var = %DateTime{}
my_var.
              ^

@josevalim
Copy link
Member Author

Anything that you conclude from looking outside of the current buffer (such as app config) is either part of the eval config (this is an environment that directly affects the __ENV__ the buffer is evaluated on) or part of the global config. So knowing a variable set in the buffer has a certain struct is part of the buffer environment.

@scottming
Copy link

scottming commented Jun 12, 2023

Hi, Jose. Thank you for putting more effort into supporting LS.

Let me try to understand your question and goal: so the main goal is:

to propose a rich data structure for tracking fragment/buffers

And first

The goal of this issue is to build infrastructure to solve the fragment/buffer environment.

And after achieving it, it will make it easier for LS authors to implement completion and navigation functions. I have no objections to this overall goal. I believe it will indeed make our life easier.

How does Lexical handle completion?

Then I want to talk about how Lexical handles completion. And what problems we encountered when using Code.Fragment. i think this will be helpful to you when considering use cases while implementing the above goals.

A key feature of Lexical's completion is context awareness: for example, only modules are completed after alias <CURSOR>; only structures are completed after %St<CURSOR>; parameters are reduced after |> fun<CURSOR>, and so on. It can be divided into two stages:

  1. one stage is to determine what kind of context the cursor is in;
  2. the second stage is to provide available completions for that context.

Currently, we use Env.in_context?( method, but it relies on Elixir's private API :elixir_tokenizer.tokenize/4. I know this is not good, and we are trying to avoid it as much as possible. For example, in the Env.in_context?(env, :struct_arguments)' implementation, we used container_cursor_to_quoted + Macro.path instead. However, we cannot completely avoid relying on tokenize/4 for all context implementations, such as function capture examples:

      text = ~q(
      defmodule MyModule do
        &MyModule.fun|
      end
      )

      env = new_env(text)

      quoted =
        env.document
        |> Document.fragment(env.position)
        |> Code.Fragment.container_cursor_to_quoted()

      quoted |> dbg()

The result is

quoted #=> {:ok,
 {:defmodule, [line: 1],
  [{:__aliases__, [line: 1], [:MyModule]}, [do: {:__cursor__, [line: 2], []}]]}}

We can't use that result to determine the cursor in a capture context, though it is. This is the problem we encountered in the first stage.

As for the second stage, I found it difficult to know what was around my cursor just by using the current API. Taking struct_fields as an example: %AliasedModule.Struct{field: |}, it is already in a struct_argument context, then I have to know what AliasedModule.Struct is. To achieve this goal, I have no choice but to traverse through the entire AST to implement an alias_mapping/2.

This may also pose similar problems when implementing navigation. For example, if my cursor is in Parent.MyM|odule.Child, then I need to know that the current cursor position is Parent.MyModule alias; currently, the API provided by Code.Fragment cannot achieve this goal. As for the confusing result provided by cursor_context, Steve has already explained it in detail in another issue.

@josevalim
Copy link
Member Author

It can be divided into two stages:

Exactly. This issue is about the second stage.

We can't use that result to determine the cursor in a capture context, though it is. This is the problem we encountered in the first stage.

Every time you run into a scenario like this, please open up an issue. I will tackle the & one shortly. So please submit bug reports for every case, so you no longer need to rely on tokenizer. :)

To achieve this goal, I have no choice but to traverse through the entire AST to implement an alias_mapping/2.

Correct! That's what we want to solve with this issue in a way that is general and efficient.

This may also pose similar problems when implementing navigation. For example, if my cursor is in Parent.MyM|odule.Child, then I need to know that the current cursor position is Parent.MyModule alias

This should be handled by surround_context + container_cursor_to_quoted (that's how we do it in Livebook).

@josevalim
Copy link
Member Author

The & behaviour is fixed on Elixir main and v1.15 branches.

@scottming
Copy link

The & behaviour is fixed on Elixir main and v1.15 branches.

Cool, thanks; so fast.

So please submit bug reports for every case, so you no longer need to rely on tokenizer. :)

Ok, I'll do that.

This should be handled by surround_context + container_cursor_to_quoted (that's how we do it in Livebook).

Ok, I think that part of the code is in identifier_matcher.ex#L362, I'll explore it when do the navigation task.

@scohen
Copy link
Contributor

scohen commented Jun 12, 2023

This is an interesting approach, @josevalim. I really like the tree approach, and in order to keep things fast, it might help to review how the LSP protocol works, as I would like to avoid any impedance mismatches between the proposal and the LSP spec.

Every time the user types a character, a TextEdit is generated, and sent to the language server. This edit is then applied to the document, at which point things like completion happen. Text edits contain two positions, a start and an end, and positions are before the character they mention (a character of 0 occurs before the start of a line). It's also worth noting that a "character" isn't really a character, but a utf16 code unit offset (dealing with utf16 is very...fun) .

If I'm understanding you correctly, we could couple this new data structure to a document when it's opened by the user, so we'd have both a textual representation (The Document in lexical) and an ast / context representation that we're discussing here. Then, when an edit comes in, it is simultaneously applied to both the textual and AST representations.

Generally, I think this would be fine for us to do; however, edits can affect both large portions of a document (think cut and paste, and search and replace) or very small ones (every character typed or deleted by the user comes to the server as a text edit). I would think that over time, it might become necessary to occasionally collapse the edit history of the documents in order to save memory, as we don't really care much abut history when building language servers.

One of the tougher problems I encountered before I started the lexical project was implementing order aliases. Aliases are a surprisingly complicated problem, and I think they illustrate why this scope approach would be greatly beneficial. If I'm understanding correctly, given the following document:

defmodule Testing do 
  alias Foo.Bar.Baz
  alias Baz.{Top.Bottom,
    Bottom.Right
   }
   alias Right.Left

we'd able to say things like:
aliases_at(doc, {6, 16})

and that would return:

%{'Right' => 'Foo.Bar.Baz.Bottom.Right'
    'Bottom' => 'Foo.Bar.Baz.Bottom',
    'Baz' => 'Foo.Bar.Baz'
}

If so, this would be a great help. We currently have to build this functionality ourselves.

@josevalim
Copy link
Member Author

You got it right. Further context:

  1. We will keep both textual and AST representations (we need both anyway)

  2. We won't keep the edit history ourselves

  3. Help: Can you please link to the TextEdit spec so we use that as a driving force?

  4. Question: column precision for aliases may be very hard. Is line precision enough (i.e. we show aliases defined in previous lines)

@scohen
Copy link
Contributor

scohen commented Jun 12, 2023

Help: Can you please link to the TextEdit spec so we use that as a driving force?

I sure can. Here's the "docs" for text edits. One thing to note (and that I've asked for before) is that text edits and their positions used zero-based lines and "characters", while almost all of elixir uses one-based lines and columns. I would love it if we could tell the compiler to be zero-based. I've relegated most of the conversions to framework code, but it'd be cool to not have to do this at all.

Question: column precision for aliases may be very hard. Is line precision enough (i.e. we show aliases defined in previous lines)

Check my work, but I don't think we can get away solely with lines, as this is valid in elixir

alias Foo.Bar.{Baz, Baz.First, Baz.Second}

the aliases for Baz.First and Baz.Second would definitely benefit from knowing that Baz was aliased. This works pretty darn well in iex though.

@josevalim
Copy link
Member Author

josevalim commented Jun 12, 2023

Check my work, but I don't think we can get away solely with lines, as this is valid in elixir

I am wondering if this is actually an Elixir bug? I am not sure if we should do left-to-right aliasing like that.

EDIT: On the other hand, {alias(:bar, as: Bar), Bar} returns {:bar, :bar}, so it looks correct.

@zachallaun
Copy link
Contributor

zachallaun commented Jun 12, 2023

4. Question: column precision for aliases may be very hard. Is line precision enough (i.e. we show aliases defined in previous lines)

Apologies if I'm missing something, but would it be sufficient to store the line/col start and end on each node of the "tree of scopes" as its being constructed? Then answering questions about scope for an exact line/col can be done by drilling down to the leaf containing that line/col and collecting the path of nodes along the way.

@scohen
Copy link
Contributor

scohen commented Jun 12, 2023

I am wondering if this is actually an Elixir bug? I am not sure if we should do left-to-right aliasing like that.

As far as I can tell, aliases have always supported this (which is just a special case of top-down aliases), but I didn't invent the language, so I don't know what the intention was. 😉 I'm not a fan of this kind of code, but it doesn't seem like a language bug to me.

These are all equivalent:

alias Foo.Bar.{Baz, Baz.First, Baz.Second}
alias Foo.Bar.{
  Baz,
  Baz.First,
  Baz.Second
}
alias Foo.Bar.Baz
alias Baz.First
alias Baz.Second

I'm pretty sure I've seen code like the above in the wild, so it might be too late to take it away.

As an aside, my organize imports code sought to normalize all aliases to eliminate interdependent aliases, and transform all the above cases into:

alias Foo.Bar.Baz
alias Foo.Bar.Baz.First
alias Foo.Bar.Baz.Second

It worked too, but it was extremely complicated.

@lukaszsamson
Copy link
Contributor

@josevalim There is no bug here. @scohen this syntax is a bit simpler than what you described. There is actually no aliasing left to right inside {} and Baz being aliased does not make any difference (at least on 1.14)

iex(1)> alias Foo.Bar.{Baz, Baz.First, Baz.Second}
[Foo.Bar.Baz, Foo.Bar.Baz.First, Foo.Bar.Baz.Second]
iex(2)> __ENV__.aliases
[{Baz, Foo.Bar.Baz}, {First, Foo.Bar.Baz.First}, {Second, Foo.Bar.Baz.Second}]
iex(1)> alias Foo.Bar.{Baz.First, Baz.Second}     
[Foo.Bar.Baz.First, Foo.Bar.Baz.Second]
iex(2)> __ENV__.aliases                      
[{First, Foo.Bar.Baz.First}, {Second, Foo.Bar.Baz.Second}]

But yeah, there are a few other interesting cases with aliases (unaliasing, submodule aliases, __MODULE__ in alias directives, aliasing erlang modules, alias via require, as:)

@scohen
Copy link
Contributor

scohen commented Jun 12, 2023

err, mine (elixir 1.14.3) does differ between the two attempts

iex(1)> alias Foo.Bar.{Baz, Baz.First, Baz.Second}
[Foo.Bar.Baz, Foo.Bar.Baz.First, Foo.Bar.Baz.Second]
iex(2)> __ENV__.aliases
[{Baz, Foo.Bar.Baz}, {First, Foo.Bar.Baz.First}, {Second, Foo.Bar.Baz.Second}]
iex(1)> alias Foo.Bar.{Baz.First, Baz.Second}
[Foo.Bar.Baz.First, Foo.Bar.Baz.Second]
__ENV__.aliases
[{First, Foo.Bar.Baz.First}, {Second, Foo.Bar.Baz.Second}]

(no Baz)

Those look different to me, but that's not the point I'm making here. The thing is that this is allowed, and it's important to know that Baz has been aliased to Foo.Bar.Baz when making completions inside the curly braces, and thus, it's important to know the column (I think).

I do think I see your point though:

iex(3)> alias Foo.Bar.{Baz.Super.Deeply.Nested, Nested.One, Nested.Two}
[Foo.Bar.Baz.Super.Deeply.Nested, Foo.Bar.Nested.One, Foo.Bar.Nested.Two]
iex(4)> __ENV__.aliases
[
  {Nested, Foo.Bar.Baz.Super.Deeply.Nested},
  {One, Foo.Bar.Nested.One},
  {Two, Foo.Bar.Nested.Two}
]

A bit surprising.

@josevalim
Copy link
Member Author

Those look different to me, but that's not the point I'm making here. The thing is that this is allowed, and it's important to know that Baz has been aliased to Foo.Bar.Baz when making completions inside the curly braces, and thus, it's important to know the column (I think).

From looking at the examples, this does not seem the case. They are all based from the root Foo.Bar. If you swap the order around, you get the same result. :) Good.

@scohen
Copy link
Contributor

scohen commented Jun 13, 2023

So, I think lines will be sufficient then

@scohen
Copy link
Contributor

scohen commented Jul 12, 2023

@josevalim I'm starting work on find references, and the first thing I've come across is context-aware knowledge of what exactly is under the cursor.

For example, say I have the following

defmodule ParentModule do 
  defstruct [:key, :value]

  def function(%__MODULE__|{} = parent) do 
    ...
  end
end

(the cursor is |) above
It would be great for me to be able to ask: "hey, what's going on under the cursor" to something and have it tell me "Why that's a struct reference, the module is __MODULE__.

I've tried using Code.Fragment for this purpose, but it returns {:local_or_var, '__MODULE__'}, which omits the important information that this is inside a struct. This is necessary, because the definition of the module is at a different place than the definition of the struct.

I've noticed that both elixir_sense and lexical need and implement this functionality in the same way: they use the tokenizer. I know that you don't like this, since the tokenizer is private, so I'd like to find some way to use a public API in the future.

So a couple of questions:

  1. Do you agree that this is something that core elixir should do?
    a. If not, what approach would be tenable aside from using the tokenizer.
  2. Is this the right place to chat about it, or should I make a new issue?

Thanks!

By the way, I've implemented an initial version of the environment that works with aliases, if you're interested.

@josevalim
Copy link
Member Author

josevalim commented Jul 12, 2023

@scohen I think the ElixirForum is a good place to have these conversations. However, the answer is not going to be different from what I answered in the past: you will always need a combination of cursor_context and container_cursor_to_quoted. I also think elixir_sense is moving away from the tokenizer but doing so requires depending on more recent Elixir versions, so it has been a gradual process.

@scohen
Copy link
Contributor

scohen commented Jul 12, 2023

This is more a question of dueling implemtattions.

@josevalim
Copy link
Member Author

It is up to you really. I won't be able to start on this immediately and, if this feature needs changes to the tokenizer/parser, it will only be available on Elixir v1.16. You can either:

  1. Wait 6 months until this feature is released (you can contribute your existing work if you believe it can speed up)

  2. Work on your version because you can't afford to wait until v1.16

@scohen
Copy link
Contributor

scohen commented Oct 19, 2023

One request that I have is that we make whatever solution more like Macro.Env and less like the map-based interface that iex uses. I find maps to be much too flexible of an API, and it's not clear what keys the maps have, or what values might be associated with those keys.

@mhanberg
Copy link
Member

mhanberg commented Dec 3, 2023

I've begun working on

We will process the fragment using a robust parser that will be capable of ignoring incomplete expressions, such as {in the example above

I will have a repo to share soon, please let me know @josevalim if you have any specific things in mind for it.

Happy holidays everyone!

@josevalim
Copy link
Member Author

The code will be important, of course, but I think the most important will be a test suite with many cases we would like to be automatically fixed. :)

@mhanberg
Copy link
Member

mhanberg commented Dec 3, 2023

The code will be important, of course, but I think the most important will be a test suite with many cases we would like to be automatically fixed. :)

Roger!

@mhanberg
Copy link
Member

I have a fully compliant and error tolerant parser ready (obviously not including any bugs or syntax I haven't encountered yet) here: https://github.com/elixir-tools/spitfire.

It has a test suite full of plenty of test cases, but it also tests against about 30+ popular Elixir projects on GitHub and can parse them identically to Code.string_to_quoted(token_metadata: true, columns: true) as of Elixir commit 52eaf1456182d5d6cce22a4f5c3f6ec9f4dcbfd9, which has the bug fixes that I and José have made since I started working on the parser.

There is still room for more and better error tolerance, but I am now moving onto the next step, which is thinking of datastructure/algorithm for building the environment and thinking of the API for querying the data structure.

The one thing for the future that I'm not sure how to accomplish yet is expanding macro, as they (particularly the use macro) often injects imports and aliases into a module, bringing them into scope.

@josevalim please let me know if you have any thoughts so far or concerns! I've already gone further than I originally planned before checking back in 😅.

@josevalim
Copy link
Member Author

@mhanberg excellent job!

My only concern is that you depend on :elixir_tokenizer, which is a private API. I wonder if it is worth vendorizing it (and :elixir_interpolation) to make it resilient to changes in Elixir's internals.

Btw, have you had the chance of benchmarking the parser? Because a hand-written parser can lead to better error messages and custom fault-tolerance, but I worry it makes maintenance harder. But I wonder if it also leads to better performance.

which is thinking of datastructure/algorithm for building the environment and thinking of the API for querying the data structure.

May I propose something else as the next step?

One of the things I have described above is that we need to track "blocks". For example, if you write:

defmodule Foo do
  def bar do
    ...
  end
end

If you do changes inside bar, you don't need to worry about changes elsewhere. So one of the things we need to solve is: how to receive diffs from editor and then apply it to whatever representation we will have of the source code structure.

Building of the environment and querying of the data is most likely something that needs to exist in Elixir, because expanding the macros and collecting AST info will depend on private APIs, so we need to figure out the appropriate moment and API to hand-off those to Elixir.

@lukaszsamson
Copy link
Contributor

There is still room for more and better error tolerance, but I am now moving onto the next step, which is thinking of datastructure/algorithm for building the environment and thinking of the API for querying the data structure.

elixir_sense has a pretty comprehensive implementation that extracts info from AST though I admit the data model used there is far from optimal. Most of the test suite can be found in https://github.com/elixir-lsp/elixir_sense/blob/master/test/elixir_sense/core/metadata_builder_test.exs

My only concern is that you depend on :elixir_tokenizer, which is a private API. I wonder if it is worth vendorizing it (and :elixir_interpolation) to make it resilient to changes in Elixir's internals.

ElixirLS, Lexical already vendors it as well as Sourceror library. There is a need for tokenizer as a library. It could also improve on error tolerance.

So one of the things we need to solve is: how to receive diffs from editor and then apply it to whatever representation we will have of the source code structure.

LSP editors send content modification messages https://microsoft.github.io/language-server-protocol/specifications/lsp/3.17/specification/#textDocument_didChange, the server applies those messages and constructs the current version of the document. Those messages are on character ranges level so theoretically we could map it to token ranges and then to AST nodes.

The one thing for the future that I'm not sure how to accomplish yet is expanding macro, as they (particularly the use macro) often injects imports and aliases into a module, bringing them into scope.

elixir_sense has some limited support for macro expansion, see
https://github.com/elixir-lsp/elixir_sense/blob/d7b12a11df4bcacfa64afb23b5566e64509b740d/lib/elixir_sense/core/metadata_builder.ex#L1298-L1305

@mhanberg
Copy link
Member

excellent job!

Thanks! 😄

My only concern is that you depend on :elixir_tokenizer, which is a private API. I wonder if it is worth vendorizing it (and :elixir_interpolation) to make it resilient to changes in Elixir's internals.

I had considered it, because there are some changes that could be made to tokenize as it parses, rather than parse it all up front. I punted it because I wasn't sure what parts of this work would get upstreamed into core (the Spitfire parser itself, or one of the other components yet to be built, or none or all of it).

If Spitfire or some section of it was in core, I would assume it would be able to use private API. I am open to anything tho!

Btw, have you had the chance of benchmarking the parser? Because a hand-written parser can lead to better error messages and custom fault-tolerance, but I worry it makes maintenance harder. But I wonder if it also leads to better performance.

No benchmarking yet, I've only observed parsing times for whole projects (includes tokenizing time). For example, it parses all 272 source files in elixir-lang/elixir (134,242 loc) in 265ms and 264 files in absinthe (24,753loc) in 17ms

I'll do some actual benchmarking here soon

May I propose something else as the next step?

One of the things I have described above is that we need to track "blocks". For example, if you write:
...

I was planning on punting this to later because I wasn't exactly sure how to accomplish it 😅.

I'm open to moving to this part tho is that is what your gut is telling you. I had just assumed it was an optimization and opted to getting the "full stack" thing working first.

I still need to do some research into how this is accomplished elsewhere, but the immediate thoughts/concerns I have are:

  • how to handle line/column related metadata? if we incrementally tokenize/parse changed ranges of the document, the metadata will get out of sync. I think this may have impact on getting the environment at the cursor position if the coordinates in the AST are incorrect.

I guess it's just the one concern for now, actually.

Building of the environment and querying of the data is most likely something that needs to exist in Elixir, because expanding the macros and collecting AST info will depend on private APIs, so we need to figure out the appropriate moment and API to hand-off those to Elixir.

Another callback to my previous comment about not knowing where the lines for inclusion into core should be called.


As I was source diving more compiler internals last night, the thought did occur to me that perhaps to accomplish this, we are somewhat building another compiler frontend. Also while reading through LSPs in other languages source code, I saw that rust-analyzer's description is "A Rust compiler front-end for IDEs"

It seems like eventually the steps may be similar to what exists, except it uses the error tolerant parser and after expanding everything it doesn't compile into Erlang and just leave it as a data structure and/or emit tracer events.

I'm curious to your thoughts on that, but we can leave it for another time to not get off topic.


Thank you José!

@josevalim
Copy link
Member Author

I'm open to moving to this part tho is that is what your gut is telling you. I had just assumed it was an optimization and opted to getting the "full stack" thing working first.

Hrm, I think you are right. It may make sense to optimize it only later. Doing a single pass on the code and expanding all macros at once may be fine for the initial version. My only concern is that you will have to reverse engineer all of our changes to Macro.Env and, one of the concerns elixir_sense maintainers reported in the past was keeping up with Elixir changes over time (I am not sure if this concern is still relevant). So, if we could accommodate more in Elixir, it would bring more stability to everyone (as well as avoid relying on private APIs). But I'd say there is still a very open question of where to draw the line.

how to handle line/column related metadata? if we incrementally tokenize/parse changed ranges of the document, the metadata will get out of sync.

@michalmuskala do you have practical suggestions on how to handle this? Do you simply bump everyone's line offset accordingly?

@zachallaun
Copy link
Contributor

@michalmuskala do you have practical suggestions on how to handle this? Do you simply bump everyone's line offset accordingly?

I vaguely recall that rust-analyzer handles this by storing line/column information as relative instead of absolute values and then lazily rebuilding the absolute values as you traverse the structure. So a node contains only its own extents (e.g. 10 cols, 1 line, or for multiline nodes, 5 cols start, 3 lines, 9 cols end). This allows you to replace nodes without concern for updating offsets elsewhere.

@josevalim
Copy link
Member Author

I vaguely recall that rust-analyzer handles this by storing line/column information as relative

Hrm, relative to what? Maybe relative to constructs that start a new scope, such as a new function or a new conditional?

@zachallaun
Copy link
Contributor

zachallaun commented Feb 17, 2024

I vaguely recall that rust-analyzer handles this by storing line/column information as relative

Hrm, relative to what? Maybe relative to constructs that start a new scope, such as a new function or a new conditional?

I'm definitely missing something in my recollection, but I think you could do this by storing, for each node, the position relative to its parent's start line/column as well as its own extent. I'll try to find where I read about this.

Edit: Nope, that wouldn't work because a node changing its extent would affect the relative position of all later siblings, so you have the same issue.

Edit the 2nd: I believe this reference is where the idea came from.

@lukaszsamson
Copy link
Contributor

one of the concerns elixir_sense maintainers reported in the past was keeping up with Elixir changes over time (I am not sure if this concern is still relevant)

It still is but it's not that bad. The tokenizer is not a stable API and AST metadata changes with every release but the high level concepts they represent (modules, protocols, defs, structs, attributes, vars) haven't changed much since elixir ~1.3

As I was source diving more compiler internals last night, the thought did occur to me that perhaps to accomplish this, we are somewhat building another compiler frontend. Also while reading through LSPs in other languages source code, I saw that rust-analyzer's description is "A Rust compiler front-end for IDEs"

Exactly. The tools end up reimplementing parts of the compiler (e.g. alias resolution, variable tracking, function call tracking, type inference) to be able to answer questions about high level concepts represented by code. And rust-analyzer isn't the only example of a compiler frontend serving as a base for LSP servers. MS did something similar with Roslyn for C#.

I vaguely recall that rust-analyzer handles this by storing line/column information as relative instead of absolute values and then lazily rebuilding the absolute values as you traverse the structure

@zachallaun @josevalim I've seen something similar in LSP semantic tokens that editors can use for semantic highlighting. The idea is to store token position relative to the previous token. That way whenever a token is added, moved or removed, only this token or its neighbours are affected. Maybe this could be extended to syntax trees as well. If an edit is made in a block, it should not affect the inside of a next or previous block. We would need to store which tokens (or at least first and last token) generate a particular AST node.

@michalmuskala
Copy link
Member

michalmuskala commented Feb 19, 2024

@michalmuskala do you have practical suggestions on how to handle this? Do you simply bump everyone's line offset accordingly?

As far as I know, most of the interactive-first IDEs have a setup with effectively 2 AST data structures - this was popularised by Roslyn (the C# compiler) and is roughly similar for Rust Analyzer & ELP (through Rowan), and Swift, and perhaps others.

I can talk more about how it looks like in RA & ELP.

First, the AST only stores byte offsets. Storing line + column information is extremely expensive. Early in building ELP we noticed that large amount of memory of the LSP was spent just storing position information. In particular, in Erlang [{line, Int}, {column, Int}] is 80 bytes, if you want to store full span, this becomes 160 bytes (and we really wanted full spans for good errors). In RA & ELP, most of the time position information is stored as 32-bit byte offsets, so 4 bytes for simple position, 8 bytes for full (beginning, end) span.
We convert to line + column information at the edge with a simple side-lookup table that stores offsets of all newlines in the file and simple binary search.

The first AST, and the one that is persisted, is built on "Green Nodes". Those nodes only store their own length in bytes & a list of child nodes. In particular you can cheaply replace a node in the "green tree" and you only need to update the spine of the tree. By storing only the size of the node itself (including the size of children) it naturally becomes "relative" and is not invalidated by unrelated changes.
Furthermore, the nodes don't have "identity", so they get de-duplicated - e.g. every def in the file becomes exactly the same def node referenced from multiple places. This further lowers memory pressure. This is also done for whole trees, e.g. all 1 + 1 pieces will be the same node (and both 1 nodes inside that will be the same node).

Storing the sizes also allows pretty quickly finding a right node given a position - starting from the root, you can do a binary search on each layer to get to the right node in O(log n) time.

However, this is not very practical for actual traversal, since position information is useful to have. To get it, during traversal you build a "red tree" that maintains proper position information from the beginning of the file. This "red node" acts, in a way, similar to a zipper - storing the current computed information + references to current, parent, and sibling green nodes. The red nodes are never stored/persisted, instead you store a (node_type, begin_offset, end_offset) tuple that can be re-formed efficiently into a red node in the next traversal from a green tree. In RA/ELP those red nodes are purely ephemeral, I think Roslyn has some caching mechanism.

@zachallaun
Copy link
Contributor

Thank you for the proper explanation, @michalmuskala! This is indeed what I was remembering above.

Harking back to the original premise of this issue, I think this would fit well into a %Buffer{} of some kind. The buffer would not store AST as we currently know it in Elixir, but the "green nodes" that Michal described. Accessing/traversing the buffer might yield something like a %Buffer.Cursor{} (a zipper) that has a reconstructed node if valid and an error if not...?

@josevalim
Copy link
Member Author

Thank you @michalmuskala! One of the things that are unclear to me is if the structure above is used for all files or only the currently open buffers. I understanding storing line+column (especially as a keyword) can be expensive, but I think that we can treat a buffer and a compiled module differently. Especially because in the buffer case you need to know line by line which imports, aliases, and so on are available to show for completion and you don't need that for compiled modules. So, in your experience, storing line+column is expensive even for buffers?


In any case, it is clear there is plenty of opportunity for experimentation and depending on the release cycle of Elixir will only serve to slow things down. So I think the best way Elixir can support this is by making sure we grow our compiler API surface enough to reduce the use of private APIs.

Any sort of buffer analysis tool in Elixir needs to expand macros but also specially handle some nodes. For example, Kernel.defmodule is a special node, it doesn't make sense to expand it, but rather further analyze it. Kernel.def is another.

Therefore, Elixir definitely needs to provide a lower level API for expanding macros. We do have Macro.expand_once/1, but it does the whole thing, including tracing, error handling, etc, which may not be desired.

Then you also need to be able to handle all special forms. To be honest, that does not sound very complicated. Here is a rudimentary pass for the type system in less than 300LOC. It is basically an AST traversal where you must consider which parts of the special forms are patterns/guards and how each special form modifies the environment.

The environment modification is another part that I worry about. Some of the fields in the environment are very straight-forward to manage, such as module, function, file, etc. But the fields aliases, functions, macro_aliases, macros, requires, and versioned_vars are pretty much private. So, at the very minimum, we need to add APIs for storing new aliases, imports, requires, and variables in the environment that you can use throughout expansion.

Does that sound like a viable first step to everyone? @lukaszsamson, is there anything elixir_sense needs in particular you believe is worth adding to the scope?

@mhanberg
Copy link
Member

I think exposing more compiler apis is a good next step 👍.

@scohen
Copy link
Contributor

scohen commented Feb 21, 2024

@josevalim I think your last post has gotten me thinking. We've spent an inordinate amount of time answering the question of "Given a document, what does the environment look like at a given line and column", and have had to recapitulate a lot of what happens in the compiler and is surfaced via compilation tracers. In addition to what you've highlighted above, imports were also particularly tricky. It would be fantastic to get an API like the following:

@spec env_at(document :: String.t(), line :: pos_integer(), column :: pos_integer()) :: Macro.Env.t()

I also think it might be nice to somehow have this in a standalone library. While it's great to get changes into core elixir, it does mean that we have to do things two ways; one way for versions that don't support the new API and another way for versions that do. Lexical supports elixir > 1.13, and it will be years before it can transition to any new way of doing things post 1.16.

@josevalim
Copy link
Member Author

@scohen, that's pretty much what we have been discussing. :) We want to have a similar API to what you propose for any active buffer (we don't want to do that for all modules because that would be very expensive and, for compiled modules, you only need to know the result (i.e. this function was imported from X but not all possible imports)).

And I agree, this solution needs to be available outside of Elixir, for availability and development speed. I'd recommend you folks to discuss at some point what some shared solution would look like when it comes to global environment and buffer environment, as originally outlined in this issue. For now I will personally only focus on providing the basic environment and macro manipulation functions in Elixir to aid building the buffer environment. For the compiler environment, tracing is pretty much the way to go (we need to add new traces though related to Mix). :) I will submit a proposal for the new functions relatively soon.

@mhanberg
Copy link
Member

Just to clarify for everyone, I am actively working on this issue, and it lives in a 3rd party library called Spitfire.

I am going to continue to harden the parser while José/me/whomever sets/creates a public API for more compiler features in core and when those become available, I will use them in Spitfire to produce more or less the function spec that scohen shared (which I more or less already had living on my computer when I cam back to this issue to share my progress, haha).

In the end, I am very open to look at Spitfire and see if any parts (the parser, the API) can be upstreamed into core, but for now it sounds like keeping it in Spitfire is what we want in order to make faster progress.

@michalmuskala
Copy link
Member

michalmuskala commented Feb 21, 2024

Thank you @michalmuskala! One of the things that are unclear to me is if the structure above is used for all files or only the currently open buffers. I understanding storing line+column (especially as a keyword) can be expensive, but I think that we can treat a buffer and a compiled module differently. Especially because in the buffer case you need to know line by line which imports, aliases, and so on are available to show for completion and you don't need that for compiled modules. So, in your experience, storing line+column is expensive even for buffers?

For RA and ELP there's generally no difference between the current file and other files - everything goes through the same API. In particular they don't rely on regular compiler to do analysis (other than some diagnostics triggered on-save). This generally isn't a big problem apart from macros, and makes the analysis significantly faster (and possible to do as-you-type, rather than on-save). Conceptually, once the LSP state is loaded, the entire state is held in the process and disk or external programs are not consulted for anything.

In Erlang's case regular macros are relatively easy to re-implement on top of our syntax tree. For parse transforms we just don't support them, outside of limited number of standard parse transforms that we effectively re-implement to work on top of our syntax tree.
For eqWAlizer we actually do use the Erlang compiler ASTs to do the analysis (since it was built at a time ELP didn't exist yet). To do that ELP runs a small escript as a sidecar process - there we maintain a full fork of the relevant compiler passes (lexer, pre-processor, parser, and erl_lint) to customise behaviour - in particular maintaining position as byte offsets and smarter include resolution - long-term we indent to modify those forks further to use ELP itself for include resolution and providing original source, rather than using state from disk since we identified this as a source of flakiness (LSP state and on-disk state can diverge especially during rebasing and other many-file operations).

For Rust it's somewhat more complex. For macro_rules they also re-implement them on top of their syntax trees. For procedural macros (arbitrary Rust programs somewhat closer to Elixir macros), they have a very well defined compiler/build system interface and have to be contained in a separate crate as a dependency, so they generally don't need to be dynamically re-compiled during regular operation. That said, they have, in places, limited support and quite a lot of complex integration logic to map from internal syntax trees, to the public interface of the compiler and then back, while trying to preserve position information.

@josevalim
Copy link
Member Author

Perfect, thank you. I think it may be worth keeping them separate in Elixir, due to the amount of work it may be required to store precise imports, aliases, and variables based on macros based on position. I guess we will learn as we move forward.

@scohen
Copy link
Contributor

scohen commented Feb 22, 2024

Just to clarify for everyone, I am actively working on this issue, and it lives in a 3rd party library called Spitfire.

Yes, I understood that, my worry is that spitfire will then depend on what José comes up with, though I believe you're in the same boat as lexical and elixirLS, so upon further reflection, my concern might not be particularly warranted. I was actually wondering if we could have the Elixir changes as a separate library as well. My memory might be a little fuzzy, but I think I remember GenStage had a particularly clever packaging model where it was released with an Experimental prefix that you could alias away via alias Experimental.GenStage with the eventual goal of placing it inside the elixir standard library.

Maybe we can do something similar with the new compiler enhancements? I'd like to ship them sooner rather than later.

@josevalim
Copy link
Member Author

josevalim commented Feb 22, 2024

What @lukaszsamson does in ElixirLS is to have a copy of the functionality inside ElixirLS. For example, if new functions are added to Macro.Env, he keeps a copy as ElixirLS.Macro.Env (for example) and remove it once it is all up to date.

@lukaszsamson
Copy link
Contributor

Does that sound like a viable first step to everyone?

It does for me

Some of the fields in the environment are very straight-forward to manage, such as module, function, file, etc.

It may be straight-forward to manage on the compiler end but mapping a cursor position to an env is not that easy. Due to macros it is a 1 to many relationship e.g. in this code the cursor env is module is both MyProtocol. List and MyProtocol.Map

defimpl MyProtocol, for: [List, Map] do
  # |
end

Or what is the env module and current function for quoted? Looking at the code it's easy to tell the function is go. What imports and locals are available inside go body?

defmodule My do
  defmacro __using__(_opts) do
    quote do
      def go(x) do
        # |
      end
  end
end

is there anything elixir_sense needs in particular you believe is worth adding to the scope?

@josevalim there are other things that are useful. Besides what you mentioned elixir_sense tries to extract from AST more:

  • attributes with their value/type and the location of definition/usages. This is needed to be able to answer what @some.fun(123) resolves to and e.g. navigate to definition or find references
  • variables with their values/types and the location of usages/rebinding. Same as above
  • defined behaviours with their callbacks, optional callbacks
  • private functions in module. Those are not visible in Code.fetch_docs and exports
  • private macros available at location.
  • delegates with their delegated to function. Usable for navigation
  • types and private types available in module scope
  • overridable defs available in module scope.
  • the location of super
  • structs and exceptions with their fields defined in the module with list of enforced fields
  • protocols and their implementations. Env should be able to answer if we are in protocol or implementation body and which protocol defs are available
  • records with their fields. Cant suggest record fields easily
  • locations of function calls. Needed for finding references

A lot of the above are not things that the compiler is aware of. They are built by standard library macros but that macros do not provide necessary introspection features.

Speaking of API, I'd see it similarly to @scohen 's proposal. Id change the return type to a list of possible envs and the first argument to some syntax tree (Macro.t() or something other) so we don't need to reparse the document on every invocation.

@spec env_at(ast :: Macro.t(), line :: pos_integer(), column :: pos_integer()) :: [Macro.Env.t()]

but the column parameter is tricky. Consider this simple example. In which column can we say that variable foo is defined?

%{foo: foo} = some_fun()

@josevalim
Copy link
Member Author

A lot of the above are not things that the compiler is aware of. They are built by standard library macros but that macros do not provide necessary introspection features.

Exactly. None of this is compiler behaviour. The best way is probably for the language server tooling to intercept those macros instead of expanding them, and then add it to its own environment.

@josevalim
Copy link
Member Author

Hi everyone, I have created #13361 for a minimal proposal on the basic building blocks.

Outside of that, my current mindset is the following. This issue argues we need two contexts: buffer context and global context. The global context can be built with tracers and reflection, and there are no plans to make it part of Elixir. We are always happy to add more traces and reflections though.

The buffer context could be made part of Elixir, but right now it should not be, as doing so will only slow down progress. Instead, we are adding functions so you depend less on Elixir internals. If we make this part of Elixir in the future, we would probably make it so it receives Elixir AST as input, so it works both with Elixir's parser and external fault-tolerant parsers as Spitfire. While inclusion in Elixir can be useful, as we would be able to use it in both Livebook and IEx, it is not high priority at the moment, especially because IDEs require much more information.

Thank you for all of the excellent discussions.

Closing in favor of #13361.

@josevalim josevalim closed this as not planned Won't fix, can't repro, duplicate, stale Feb 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

8 participants