
Tracking issue: debugging improvements for v0.6 #866

Closed
7 of 11 tasks
grjte opened this issue Apr 18, 2023 · 11 comments
@grjte

grjte commented Apr 18, 2023

Goal(s)

Our main goal is to provide source mappings (e.g., line numbers) between a given VM operation and the Miden assembly source file (and, in the future, higher-level language source files too).

Details

We want to add basic source mapping and debugger improvements.

Must have

  1. assembly enhancement good first issue
    vlopes11
  2. assembly processor
    vlopes11
  3. enhancement tools
  4. assembly
    vlopes11
  5. documentation enhancement
    vlopes11

Pending review & update

  1. assembly on hold processor
  2. assembly
    Fumuran

Nice to have

  1. good first issue
@grjte grjte changed the title Tracking issue: debugging improvements for v0.5 Tracking issue: debugging improvements for v0.6 Apr 18, 2023
@grjte grjte added the v0.6 label Apr 18, 2023
@vlopes11

vlopes11 commented Apr 21, 2023

The goal is to provide a reliable, extensible debug structure that can be used in different platforms.

Long term simplified structure

This is a simplified vision on how the architecture should look in the long run. The components flagged in orange are currently unimplemented.

[architecture diagram]

Tokenizer

It is responsible for taking a string and breaking it into a token stream that may or may not be semantically correct. Tokenizers should strive to be as robust as possible, given the diversity of their inputs: we might see different line-break styles, tabulation, character sets, etc. This layer is responsible for standardizing such inputs into a common, strict, and well-formed output.

Whenever we can make a tokenizer infallible, we should do so, as the parser will then be able to skip error checking, reducing its complexity. MASM is designed for simplicity; we can take advantage of that and extend that simplicity into the tokenizer, making it infallible. This will allow us to easily build plugins for different architectures, as the work reduces to passing around a handful of string pointers.

One example is tree-sitter, which has its parsers implemented in C: we can simply implement a dylib in Rust, receive a pointer to a string, and populate another pointer with a sequence of strings (tokens). Such a simplified structure will allow us to highlight the syntax even for malformed programs.
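As a rough illustration (all names here are hypothetical, and splitting on whitespace is only a stand-in for the real MASM tokenization rules), an infallible tokenizer could look like this:

```rust
/// A token paired with its source position. Illustrative only:
/// the real MASM tokenizer tracks more than (line, column).
#[derive(Debug, PartialEq)]
pub struct Token<'a> {
    pub text: &'a str,
    pub line: u32,   // 1-based
    pub column: u32, // 1-based byte column
}

/// Infallible tokenizer: it normalizes line breaks, splits on
/// whitespace, and never errors, even for malformed programs.
pub fn tokenize(source: &str) -> Vec<Token<'_>> {
    let mut tokens = Vec::new();
    for (line_idx, line) in source.lines().enumerate() {
        let mut cursor = 0usize;
        for word in line.split_whitespace() {
            // Words occur in order, so searching from `cursor` finds
            // this word's true offset within the line.
            let offset = line[cursor..].find(word).unwrap() + cursor;
            tokens.push(Token {
                text: word,
                line: line_idx as u32 + 1,
                column: offset as u32 + 1,
            });
            cursor = offset + word.len();
        }
    }
    tokens
}

fn main() {
    for t in tokenize("begin\n  push.1 add\nend") {
        println!("{}:{}\t{}", t.line, t.column, t.text);
    }
}
```

Because the function cannot fail, a plugin host (e.g., a tree-sitter dylib) only needs to pass a string in and read a token list out.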

Parser

From the token sequence, this component will output either procedures (in AST form) or a main program. The procedures should be indexed in a repository (here called the procedures cache), and they must be visible during runtime. Ideally, they should also be inspectable, meaning we have access to the high-level code (probably MASM; sometimes IR, Sway, Move, etc.) that generated them.

Debug Adapter Protocol

The current standard is to use a debug adapter protocol, as this reduces the programming overhead of supporting different IDEs/interfaces. A correct implementation of DAP means minimal additional effort to support applications such as NeoVim, VSCode, Emacs, etc.

MVP

For a minimum viable product, we need to introduce:

  1. A definition for a context-free language atomic unit. It can be as simple as atomic keywords. For an initial version, we can assume any output of the tokenizer as a language atomic unit (i.e. push.0.1, add, etc).
  2. A robust, infallible Tokenizer
  3. A procedure cache Rust API
  4. A Debug Adapter Protocol API for MASM runtime
  5. A frontend plugin for an editor of choice (Vscode, Neovim, etc)

@vlopes11

For a minimal working version, we can initially support ProgramAst alone.

The first requirement is to link [Token]s to a [SourceLocation], which is achieved by #861.

The main attribute of [ProgramAst] is body: Vec<Node>. We need a new structure that will contain this sequence of nodes and, optionally, associate them with their original token locations.

One option is to create [NodeAst], which will contain body: Vec<Node> and locations: Vec<Vec<SourceLocation>>. A 1:1 link is possible because a node is just a contiguous set of [Instruction]s. While this is not optimal in terms of performance, it is a minimal working version that can easily be optimized later without breaking the general structure or causing a major refactor, as the 1:1 relationship can be maintained in different ways. The advantage of this approach is that we can also extend it to ProcedureAst in the future, as it is also a body: Vec<Node>.
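A minimal sketch of that option, with Node and SourceLocation reduced to stand-ins (the real types are richer), could be:

```rust
/// Stand-in for the real source location type (integers only).
#[derive(Debug, Clone, PartialEq)]
pub struct SourceLocation {
    pub line: u32,
    pub column: u32,
}

/// Stand-in for the real Node enum.
#[derive(Debug)]
pub enum Node {
    Instruction(String),
}

/// Hypothetical [NodeAst]: a body of nodes paired 1:1 with the
/// locations of the tokens that produced each node.
pub struct NodeAst {
    body: Vec<Node>,
    locations: Vec<Vec<SourceLocation>>,
}

impl NodeAst {
    /// The 1:1 invariant is enforced at construction time.
    pub fn new(body: Vec<Node>, locations: Vec<Vec<SourceLocation>>) -> Self {
        assert_eq!(body.len(), locations.len(), "one location set per node");
        Self { body, locations }
    }

    pub fn len(&self) -> usize {
        self.body.len()
    }

    /// All source locations behind the i-th node.
    pub fn locations_of(&self, node_idx: usize) -> &[SourceLocation] {
        &self.locations[node_idx]
    }
}

fn main() {
    let ast = NodeAst::new(
        vec![
            Node::Instruction("add".into()),
            Node::Instruction("mul".into()),
        ],
        vec![
            vec![SourceLocation { line: 2, column: 3 }],
            vec![SourceLocation { line: 2, column: 7 }],
        ],
    );
    println!("{} nodes, first at {:?}", ast.len(), ast.locations_of(0));
}
```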

The next step is to handle CodeBlock. Like [ProgramAst], it is just a sequence of [Operation]s. Therefore, under the assumption that combine_blocks won't effectively change the order of spans (i.e., contiguous [Operation]s), we can safely map the root of each of these blocks to the list of source locations that generated it. Even if the blocks themselves are reorganized, the spans are still contiguous instructions that map 1:many to operations; we just use the same location for instructions that produce multiple operations.

Such a map should live in a context that will be available to the CLI, and it is inserted optionally. We can use the in_debug_mode flag of Assembler to either feed such a context or leave it empty. We can name it [DebugContext].

Given that, every time we append a new [CodeBlock] to [Assembler], we also append its list of [SourceLocation] to [DebugContext], mapped by the root of [CodeBlock].

Therefore, every [Operation] in a Program, given its [CodeBlock], will be fully mapped to its [SourceLocation].

The next step is to yield [DebugContext] from [Assembler] and expose it to processor::execute. We can encapsulate such a structure into an [ExecutionContext] that might hold other components, such as [AdviceProvider].

Given the map Operation |-> SourceLocation described above, execute_iter can easily keep track of the origin of a program step, allowing the frontend to display this information to the user in a friendly way.
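Put together, [DebugContext] could be as small as a map from a block's root digest to the per-operation locations of that block. A minimal sketch, with the digest and location types simplified (the real root would be a hash digest type, and the real location may also carry a source id):

```rust
use std::collections::HashMap;

/// Simplified stand-in for a code block's root digest.
type Digest = [u8; 32];

/// Simplified stand-in for the real source location type.
#[derive(Debug, Clone, PartialEq)]
pub struct SourceLocation {
    pub line: u32,
    pub column: u32,
}

/// Hypothetical [DebugContext]: populated by the assembler only
/// when `in_debug_mode` is set, and left empty otherwise.
#[derive(Default)]
pub struct DebugContext {
    locations: HashMap<Digest, Vec<SourceLocation>>,
}

impl DebugContext {
    /// Called each time a new [CodeBlock] is appended to the assembler.
    pub fn insert(&mut self, root: Digest, locs: Vec<SourceLocation>) {
        self.locations.insert(root, locs);
    }

    /// Resolve the i-th operation of a block back to its source location.
    pub fn resolve(&self, root: &Digest, op_idx: usize) -> Option<&SourceLocation> {
        self.locations.get(root).and_then(|locs| locs.get(op_idx))
    }
}

fn main() {
    let mut ctx = DebugContext::default();
    let root = [7u8; 32];
    ctx.insert(root, vec![SourceLocation { line: 3, column: 5 }]);
    println!("{:?}", ctx.resolve(&root, 0));
}
```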

@bobbinth

One option is to create [NodeAst] that will contain body: Vec<Node> and locations: Vec<Vec<SourceLocation>>

I'm not sure this will work because the Node struct could contain an arbitrarily deep tree. We could, however, modify the Node struct itself in a manner similar to the one you've suggested - and maybe that'll work?

One other thing to think about: how will we serialize/deserialize this extra info in ProgramAst, ModuleAst etc.?

The next step is to handle CodeBlock.

I think this may be the most complex part in the whole thing.

we can safely map the root of each of these blocks to the list of source locations that generated it

This is not entirely correct. A root of a code block is not unique - i.e., many code blocks within a program may end up having the same root. The path from the MAST root to the code block is unique; however, this path won't be available until the whole MAST is built (see #862 (comment)).

I think we need to flesh this part out a bit more, as it is not clear to me yet how we'll map code blocks to locations and how we'll represent this mapping.

The next step is to yield [DebugContext] from [Assembler] and expose it to processor::execute. We can encapsulate such structure into an [ExecutionContext] that might hold other components such as [AdviceProvider].

I probably wouldn't change the processor::execute method - it will be used for non-debug executions. processor::execute_iter, however, is the right one to update. I didn't quite understand the idea behind ExecutionContext. Could we just change execute_iter to look something like this?

pub fn execute_iter<A>(
    program: &Program,
    stack_inputs: StackInputs,
    advice_provider: A,
    debug_info: DebugInfo,
) -> VmStateIterator
where
    A: AdviceProvider,

Given the map described above Operation |-> SourceLocation, execute_iter will easily keep track of the origin of a program step, allowing the frontend to display such information to the user in a friendly way.

I'm assuming we'll need to modify VmState struct for this. How are you thinking of changing it?

@vlopes11

This is not entirely correct. A root of a code block is not unique - i.e., many code blocks within a program may end up having the same root. The path from the MAST root to the code block is unique, however, this path won't be available until the whole MAST is built (see #862 (comment)).

You have a point there. Before we jump into execute_iter, we'd better have this part figured out.

The MAST path alone might not be enough to entirely solve the equation. Given the following example

export.foo
  add mul
end

export.bar
  add mul
end

begin
  call.foo
  call.bar
end

I think we will end up with the same MAST root & path for both foo and bar, but they should have different source locations (this extends to all loaded libraries). We could treat this as a negligible, zero-impact problem, but a robust solution would also cover that scenario.
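The collision is easy to demonstrate with any digest computed from the body alone (DefaultHasher stands in for the real MAST hashing here, purely for illustration):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in for MAST root computation: hash only the instruction
/// sequence, which is all the real root is derived from.
fn body_root(instructions: &[&str]) -> u64 {
    let mut h = DefaultHasher::new();
    instructions.hash(&mut h);
    h.finish()
}

fn main() {
    // foo and bar from the example above have identical bodies,
    // hence identical roots, despite different source locations.
    let foo = body_root(&["add", "mul"]);
    let bar = body_root(&["add", "mul"]);
    assert_eq!(foo, bar);
    println!("foo root == bar root: {}", foo == bar);
}
```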

I'm thinking of an alternative for this.

@bobbinth

An even simpler example would be:

begin
  push.1
  if.true
    add mul
  else
    add mul
  end
end

The hash of both branches in the if/else block would be the same, as they contain the same sequence of instructions.

@vlopes11

Summary

We have two pending decisions.

First:

  • Append metadata to leaves of MAST, including the [SourceLocation]
  • Map ([ProcedureId] + MAST path) to [SourceLocation]

Second:

  • Pass the path as argument of parse_module
  • Introduce a namespace directive to declare MASL

Details

We need a mechanism to disambiguate duplicated bodies across procedures & blocks. Example:

export.foo add mul end
export.bar add mul end

These two procedures have exactly the same body in different source locations. Therefore, we cannot map the root of the MAST to a set of locations alone; the absolute module path + procedure name should be part of this lookup. Moreover, this solution must be generic enough to also map blocks under if/else spans; example:

begin
  if.true
    add mul
  else
    add mul
  end
end

This will be parsed to:

[IfElse([add, mul], [add, mul])]

It means we will have the same roots for a and b in IfElse(a, b).

Using the MAST path as an index to a source location would solve this particular example, as in:

[ifelse(true), add] |-> (3, 5),
[ifelse(true), mul] |-> (3, 9),
[ifelse(false), add] |-> (5, 5),
[ifelse(false), mul] |-> (5, 9),
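As a sketch, such an index could be an ordinary map keyed by the MAST path; the entries below mirror the example above (the path-segment strings and types are illustrative):

```rust
use std::collections::HashMap;

/// Path segments such as "ifelse(true)", keyed to (line, column).
type MastPath = Vec<String>;
type SourceLocation = (u32, u32);

/// Build the index for the if/else example above.
fn build_index() -> HashMap<MastPath, SourceLocation> {
    let mut index = HashMap::new();
    index.insert(vec!["ifelse(true)".into(), "add".into()], (3, 5));
    index.insert(vec!["ifelse(true)".into(), "mul".into()], (3, 9));
    index.insert(vec!["ifelse(false)".into(), "add".into()], (5, 5));
    index.insert(vec!["ifelse(false)".into(), "mul".into()], (5, 9));
    index
}

fn main() {
    let index = build_index();
    // The two `add` instructions resolve to distinct locations.
    let a = index[&vec!["ifelse(true)".to_string(), "add".to_string()]];
    let b = index[&vec!["ifelse(false)".to_string(), "add".to_string()]];
    println!("{:?} vs {:?}", a, b);
}
```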

However, this wouldn't work for procedures & exec calls. The following example would produce a conflicting location:

export.foo add end
begin add end

We would have a duplicated index [add] and would have to pick one of the two (probably the main procedure), but that solution wouldn't be robust enough to describe procedures and their contexts.

As for exec, we would have the following problem:

export.foo add end
begin exec.foo end

The MAST path is [add] inside of begin, but we need to point the location to foo. This means we need to preserve the full path of the procedure during parsing.

We have different approaches for attacking this problem. The most robust solution (though also the one with the biggest impact) is to annotate every leaf of the MAST with metadata containing its underlying source location. The benefit is that we gain the ability to annotate instructions in the future with other types of metadata (docs, optimization suggestions, assertions to be performed, etc.). The downside is that this would involve a bigger refactor of the code base.

Initially, the metadata would be a simple [SourceLocation]. Whenever we are compiling from MAST to [CodeBlock], we just consume this metadata and create a similar tree-structure for [CodeBlock] leaves.

Another possibility is to create a parallel tree data structure and map the leaves of the MAST 1:1 with such a metadata structure. We keep the advantages of annotating instructions individually and greatly reduce the refactor impact, but we generate runtime overhead to compute & store this additional structure. Another advantage of this approach is that we can attach this structure to generated programs and expect them to have the debug information injected on the fly.

Finally, we can create a mapping with the absolute path of the module + the path in the MAST as the index to a source location. @bobbinth suggested we also append the operation index there, but it's not entirely clear to me why that is needed. One instruction generates multiple operations, but all operations generated by one instruction map to the same source location of that instruction. Further elaboration on that would be helpful.

Exploring the last option further: we can use ProcedureId as a prefix of the MAST path, falling back to ProcedureId::main when compiling a main procedure.

However, this introduces a new challenge: defining the module path of an AST module. Currently, the signature for parsing modules is the following:

fn parse_module(source: &str) -> Result<ModuleAst, ParsingError>

As we can see, we can't extract the [ProcedureId] from the source contents alone. We can simply add the module path as an argument:

fn parse_module(path: &str, source: &str) -> Result<ModuleAst, ParsingError>
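With the path available, deriving a [ProcedureId] from the fully qualified name becomes straightforward. A purely illustrative sketch (the real id would use a cryptographic digest, not DefaultHasher, and the function name is hypothetical):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical [ProcedureId] derivation: digest of the fully
/// qualified name, e.g. "std::math::u64::overflowing_add".
fn procedure_id(module_path: &str, name: &str) -> u64 {
    let mut h = DefaultHasher::new();
    format!("{}::{}", module_path, name).hash(&mut h);
    h.finish()
}

fn main() {
    let a = procedure_id("std::math::u64", "overflowing_add");
    let b = procedure_id("std::math::u256", "overflowing_add");
    // The same procedure name under different module paths
    // yields distinct ids, resolving the ambiguity above.
    assert_ne!(a, b);
    println!("{:x} {:x}", a, b);
}
```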

However, this problem of defining the module path in order to compute a [ProcedureId] is not new. Currently, we are using the disk path, but that might not be ideal, as we don't really have the concept of a workspace in MASM. In Rust, this is resolved by searching upwards for a Cargo.toml, defining that location as the root of the project, and computing the full module path from that point.

In MASL, we query the user on what is the library bundle prefix, and what is the root path. The implementation is here.

Alternatively, given that we don't have an open issue to implement a workspace structure, we can introduce a namespace directive in MASL that will define the module path. This would allow the source alone to fully produce a [ProcedureId], solving the issue above.

@bobbinth

Before we get to MAST, we need to figure out how to track location data in the AST structs (ProgramAst, ModuleAst, and ProcedureAst). My preliminary thinking is that we'll integrate SourceLocation tracking into these structs. Is this how you were thinking about it as well?

One question here is how we serialize these AST structs when they contain source mappings. Specifically, do they get saved as part of .masl files? Or do they go into separate files?

This is related to your second question. Here I'd prefer to go with the first approach - i.e., passing the path to parse_module (and other) methods. The way I'm thinking about this is that the path passed in would be the namespace path (e.g., std::math::u64) rather than the filesystem path.

Regarding the first question: I would prefer something more like the second approach, as I don't want to complicate the MAST structure further (I'd rather simplify it as much as possible). This would also allow us to simplify the VM processor. Ideally, the MAST structures should not contain any debug or source-map related data - all of this extra data would be provided separately when we call the execute_iter() method.

@bobbinth

Btw, I refactored things a bit in the latest commit. For example, there is no longer parse_module() or parse_program() - instead we have ModuleAst::parse() and ProgramAst::parse().

@bobbinth

bobbinth commented Apr 26, 2023

One thing to think about: we probably want to instantiate TokenStream like so:

pub fn new(source: &'a str, source_id: SourceId) -> Result<Self, ParsingError>

We could use LibraryPath as the source ID, but that might be too heavy. So, we should think about how we can map a LibraryPath to some short numeric value (e.g., a u32).
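One way to get such a mapping cheaply is a small interner that assigns sequential u32 ids in first-seen order. A sketch, with names and types as assumptions (LibraryPath reduced to a String):

```rust
use std::collections::HashMap;

/// Hypothetical source-id interner: maps each library path to a
/// compact u32, assigned in first-seen order.
#[derive(Default)]
pub struct SourceIds {
    ids: HashMap<String, u32>,
    paths: Vec<String>,
}

impl SourceIds {
    /// Return the existing id for `path`, or assign the next one.
    pub fn intern(&mut self, path: &str) -> u32 {
        if let Some(&id) = self.ids.get(path) {
            return id;
        }
        let id = self.paths.len() as u32;
        self.ids.insert(path.to_string(), id);
        self.paths.push(path.to_string());
        id
    }

    /// Resolve an id back to its path.
    pub fn resolve(&self, id: u32) -> Option<&str> {
        self.paths.get(id as usize).map(String::as_str)
    }
}

fn main() {
    let mut ids = SourceIds::default();
    let a = ids.intern("std::math::u64");
    let b = ids.intern("std::mem");
    // Re-interning an existing path returns the same id.
    assert_eq!(ids.intern("std::math::u64"), a);
    println!("{} {} {:?}", a, b, ids.resolve(b));
}
```

Tokens would then carry only the u32, and the interner would resolve it back to the full path when locations are displayed.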

Edit: now that I think about it more, we may actually not need to track source IDs at the token level - this really depends on how we add the source location data to the ProgramAst, ModuleAst, and ProcedureAst structs. So, I'd say the first thing to figure out is how we want to modify these (and related) structs. One idea I had is to embed the source location into the Node enum.

For example, we could change Node from its current form:

pub enum Node {
    Instruction(Instruction),
    IfElse(Vec<Node>, Vec<Node>),
    Repeat(u32, Vec<Node>),
    While(Vec<Node>),
}

To something like:

pub enum Node {
    Instruction(Instruction),
    IfElse(IfElseNode),
    Repeat(RepeatNode),
    While(WhileNode),
}

And then WhileNode, for example, could look like:

pub struct WhileNode {
    body: Vec<Node>,
    locations: Vec<SourceLocation>
}

@vlopes11 vlopes11 self-assigned this Apr 27, 2023
@vlopes11

vlopes11 commented Apr 27, 2023

Regarding MAST vs. ProgramAst, ModuleAst, and ProcedureAst: the proposed changes focus on the latter (not MAST).

One question here is how do we serialize these AST structs when they contain source mappings. Specifically, do they get saves as a part of .masl files? Or do they go into separate files?

I think they should go in different files. If we take regular ELF binaries as an example: they might contain debug information, and this might be loaded into memory when executing the binary. In MASL, we will similarly need a strategy to avoid deploying such metadata in production environments, where it would only consume unnecessary storage. If we want to embed this source mapping into MASL files, we will need to introduce support for such modes - added complexity for this first minimalistic implementation.

This is related to your second question. In this question I'd prefer to go with the first approach - i.e., passing path to parse_module (and other) methods. The way I am thinking about this is that the path that will be passed in is the namespace path (i.e. std::math::u64) rather than filesystem path.

Sounds good! To me, it makes more sense to have a dedicated struct for that instead of a raw string - so we entirely delegate the responsibility of iterating the path of the module to such a structure (instead of splitting on the :: delimiter, etc.). Do you think this is a good moment to introduce such a structure? That might save us a refactor later on.

we may actually not need to track source ID at the token level

We have a couple of preliminary questions for that. [SourceId] makes sense only if such an ID is indexable somewhere. We might be talking about a [ParserContext] here that will hold a SourceId -> Path relationship.

As I see it, the main goal of the [SourceId] is to prevent us from duplicating the concrete path for every node of the tree: instead, we have something like a u64 that just points to the full path.

I think it would be very positive to introduce such an id now, but we will also need this context structure. It wouldn't necessarily be the same as the one we might serialize for MASL, as the latter cannot depend on the host filesystem, and the former could contain just a small set of MASM paths.

Such [ParserContext] can be an implementation detail of [Assembler] and never exposed publicly. It will support the parsing and compilation (i.e. transform [ProgramAst] into [Program]), and could be discarded afterwards.

The [ParserContext] wouldn't support the debug operation either, as we will need a different structure to resolve the locations (because of the problem described above of host-independent, serialized debug information).

If we state that a source mapping is always host-dependent, then we can use the same context everywhere - from tokenization to execution. If we opt to go otherwise, these [SourceId]s won't be relevant beyond parsing, and we might want to include as metadata the original MASM string that generated the procedures - probably indexed by their path - making [SourceId] useless in that context.

@bobbinth

I think they should go on different files.

Yeah - I like this approach better too. It will make deserialization of ModuleAst and other structs more difficult - but maybe that's OK.

We might be talking about a [ParserContext] here that will hold a SourceId -> Path relationship.

Our current ParserContext is for parsing individual modules - so, I don't think it could hold such a map. But as I mentioned above, I'm actually not sure tracking either source ID or path on a per-location basis would be meaningful in the structure we have now (at least during the parsing stage). The reason is that at the library level all modules already track their associated paths (i.e., in the Module struct). So, any time we need to access a ModuleAst, we should know the path of that module.

vlopes11 added a commit that referenced this issue May 2, 2023
This commit adds source locations to parsed nodes, allowing the
construction of source mappings.

It modifies the [Node] structure based on the discussion that took place
here:

#866 (comment)

The implementation is slightly different from the one discussed, as
every instruction will individually contain a [SourceLocation] - it is
less expensive to store it this way than to create sets for each of the
block variants.

The [SourceLocation] itself is a cheap type that contains only integers.

This commit doesn't touch the serialization question for ModuleAst with
SourceLocation. Including these additional bytes might be optional, as
they consume additional production storage while providing functionality
only for debug/development environments.