Miden assembly debugging improvements #580

Closed
Tracked by #866
bobbinth opened this issue Dec 3, 2022 · 16 comments
@bobbinth
Contributor

bobbinth commented Dec 3, 2022

Currently, debugging Miden assembly is not easy. We do have the ability to step through program execution via the execute_iter() function, but to use this function one has to write Rust code, which is not always convenient. So, here are a few thoughts on how we could improve the debugging experience (other thoughts are welcome):

  1. Add a debug command to the CLI. This command would be similar to run, but it would step through program execution cycle by cycle. We'll need to come up with an interface for moving forward and backward, and decide what to print on the screen - but that shouldn't be too complicated. It would also be cool to come up with a way to graphically map VM operations to assembly instructions.
  2. The debug CLI command would be useful for small programs, but if a program is big, manually stepping through thousands of cycles will probably not be practical. One way to address this is to add a debug (or debugger) instruction to the assembly (this would actually be a decorator). Then, when we run a program via the miden debug command, the program could run until the next occurrence of the debug instruction. From that point on, we'd be able to step through the execution (both forward and backward) in the same way as described in point 1 above.
  3. It would be really nice to integrate the above functionality into Miden playground - something similar to what can be done on evm.codes. But this is probably a pretty involved project.
  4. Lastly, it would be really cool to provide source mappings (e.g., line numbers) between a given VM operation and the Miden assembly source file (and, in the future, higher-level language source files too). However, this is a bigger issue which will probably require some refactoring of:
    a. The way we retain source info when we go from source code to MAST (I think we should do it via complementary source-map files).
    b. The way we execute programs in debug mode (I think the current way of handling instruction to operation mapping is not the best).

For the last part, I will write up a separate issue.

And again, if anyone has other thoughts/suggestions - would love to hear them.
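
As a rough illustration of point 1, here is a minimal sketch of what the stepping loop behind a debug command could look like; collecting the states up front also makes backward stepping trivial. The VmState shape and field names below are assumptions for illustration, not the actual processor API.

// Minimal sketch of an interactive stepping loop over pre-collected VM states.
// `VmState` here is a hypothetical stand-in for the snapshots produced by
// execute_iter(); the field names are assumptions, not the processor's API.
use std::io::{self, BufRead, Write};

struct VmState {
    clk: u64,
    stack: Vec<u64>,
}

fn debug_loop(states: &[VmState]) -> io::Result<()> {
    if states.is_empty() {
        return Ok(());
    }
    let mut pos = 0usize;
    let stdin = io::stdin();
    loop {
        let state = &states[pos];
        // Show the current cycle and the stack so the user can inspect them.
        println!("clk={} stack={:?}", state.clk, state.stack);
        print!("(n)ext, (p)rev, (q)uit > ");
        io::stdout().flush()?;
        let mut line = String::new();
        stdin.lock().read_line(&mut line)?;
        match line.trim() {
            "n" if pos + 1 < states.len() => pos += 1,
            "p" if pos > 0 => pos -= 1,
            "q" => break,
            other => println!("unknown command or boundary reached: {other}"),
        }
    }
    Ok(())
}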

bobbinth added the assembly and processor labels on Dec 3, 2022
@skaller

skaller commented Dec 27, 2022

The right, if not only, way to do this is with an emulator that evaluates the assembly code, one step at a time. There are two obvious hosts:

  • Web playground
  • Visual Application

A command line debugger could be done too, but these things are way too hard to operate, IMHO.

Writing the emulator is non-trivial; it has to faithfully do what the real machine would do. Hopefully there are Rust functions which can evaluate opcodes and get the maths right. Given that, the control structures are easy.

What I would like is

  • a pane with the code on the left and an arrow pointing at the current instruction
  • two panes with the stack:
    • one for before the instruction, with
      • the slots that will be destroyed in red
      • the slots that are accessed but not touched in black
    • one for after the instruction, with the new values pushed in green

More fancifully, on seeing an if, the code pane under the current instruction can split in two, showing both branches, with a visual indication of which branch will be taken. It should handle while as well.

The traditional operations are:

  • step
  • set breakpoint
  • run to breakpoint
  • movie at a given fps (steps per second)

The movie mode could use a control pane where the mouse can be used to speed up or slow down the execution.

Backtracking is essential later: you get an error, you want to backstep to see how you got there.

The opcode evaluation can be left out in the first cut. Instead, you get a button for true and false which pretends the stack has 0 or 1 on it for the purpose of evaluating an if or while. So you can step through the program up to one of these instructions and then choose one branch, see how control flows, and maybe reset and try the other branch.

In addition, a way to view the advice tape and memory is also required for a complete picture. Furthermore, some way of handling calls to library procedures is required; by default these would be treated as a single-step operation, unless the user hits a key to recurse into the library code.

The reason an emulator is required should be clear: the real machine doesn't operate on assembly instructions.
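
For illustration, the traditional operations listed above could map onto a small command set over a hypothetical emulator interface; the trait and method names below are assumptions, not an existing Miden API.

// A minimal sketch of the command set described above, over a hypothetical
// Emulator trait; names are illustrative, not an existing Miden API.
use std::collections::BTreeSet;

enum DebugCommand {
    Step,
    StepBack,
    SetBreakpoint(usize),   // clock cycle at which to stop
    RunToBreakpoint,
    Movie { fps: u32 },     // auto-step at a fixed rate (steps per second)
}

trait Emulator {
    fn clk(&self) -> usize;
    fn step(&mut self) -> bool;      // returns false when execution is finished
    fn step_back(&mut self) -> bool; // returns false at the start of the trace
}

fn dispatch<E: Emulator>(emu: &mut E, breakpoints: &mut BTreeSet<usize>, cmd: DebugCommand) {
    match cmd {
        DebugCommand::Step => { emu.step(); }
        DebugCommand::StepBack => { emu.step_back(); }
        DebugCommand::SetBreakpoint(clk) => { breakpoints.insert(clk); }
        DebugCommand::RunToBreakpoint => {
            // Keep stepping until execution ends or a breakpoint cycle is hit.
            while emu.step() && !breakpoints.contains(&emu.clk()) {}
        }
        DebugCommand::Movie { fps } => {
            let delay = std::time::Duration::from_secs_f64(1.0 / fps.max(1) as f64);
            while emu.step() {
                std::thread::sleep(delay);
            }
        }
    }
}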

@frisitano
Contributor

Some additional thoughts on the topic:

  • A typical entry point for accessing debugging / tracing information is via an RPC invocation on a node client. For example in Geth there are a number of RPC methods that allow you to trace a transaction given a particular state / block number. We could consider exposing similar RPC methods on the Miden node client.
  • We could consider adding the ability to toggle the verbosity / data included in the debug trace. These are the options provided by geth:
    enableMemory: true
    disableStack: false
    disableStorage: false
    enableReturnData: true
  • Start thinking about higher level tracers which consume the raw debug trace and extract insights relevant to that particular tracer. See some examples that Geth provides here: https://geth.ethereum.org/docs/developers/evm-tracing/built-in-tracers
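
For illustration, a Rust equivalent of the geth options quoted above might look like the struct below; the field names are hypothetical, not an existing Miden API.

// Hypothetical trace-config struct mirroring the geth options quoted above;
// field names here are illustrative assumptions, not an existing Miden API.
pub struct TraceConfig {
    /// Include full memory snapshots in the trace (expensive).
    pub enable_memory: bool,
    /// Omit the operand stack from each step of the trace.
    pub disable_stack: bool,
    /// Omit storage-like state (e.g., VM memory / advice data) from the trace.
    pub disable_storage: bool,
    /// Include data returned by the program in the trace output.
    pub enable_return_data: bool,
}

impl Default for TraceConfig {
    fn default() -> Self {
        Self {
            enable_memory: false,
            disable_stack: false,
            disable_storage: false,
            enable_return_data: true,
        }
    }
}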

@hackaugusto
Contributor

A typical entry point for accessing debugging / tracing information is via an RPC invocation on a node client. For example in Geth there are a number of RPC methods that allow you to trace a transaction given a particular state / block number. We could consider exposing similar RPC methods on the Miden node client.

That is a great idea. My two cents: I think we should have endpoints to download the data and run the debugging locally. This way we can improve the debugging tooling without having to update the client RPC, we can support private transactions, and we can't simulate the execution of the transaction on a different block than what happened on-chain.

@bobbinth
Contributor Author

The first item from the original post was addressed in #693

@grjte
Contributor

grjte commented Feb 27, 2023

@vlopes11 is taking a look at point 2 from the original post

vlopes11 added a commit that referenced this issue Mar 6, 2023
This commit introduces `breakpoint`, a transparent instruction that will
cause a debug execution to break when reached.

This instruction will not be serialized into libraries, and will not
have an opcode or be part of the code block tree.

For debug executions, it will produce a `NOOP`, so the internal clock of
the VM will be affected.

The decision of not including this as part of the code block is to avoid
further consuming variants of opcodes as they are already scarce.

related issue: #580
@bitwalker
Contributor

Lastly, it would be really cool to provide source mappings (e.g., line numbers) between a given VM operation and the Miden assembly source file (and, in the future, higher-level language source files too). However, this is a bigger issue which will probably require some refactoring of:

a. The way we retain source info when we go from source code to MAST (I think we should do it via complementary source-map files).
b. The way we execute programs in debug mode (I think the current way of handling instruction to operation mapping is not the best).

I have some thoughts on this:

  • At the VM level, I'd recommend mapping instruction ranges to locations, each location can be mapped back to the full source span (if available). This provides full fidelity source-level debugging information with the least amount of overhead (as opposed to naively attaching source information to each opcode).
  • As mentioned by @bobbinth, we'll want to generate "split" debug information, i.e. the debug info is stored in a separate file and can be loaded when source-level debugging is desired. This is useful when trying to debug a program in production without having to recompile it. One needs to provide both the split debug info and the sources, but they don't have to be shipped with the program itself. This may very well need to work differently in a blockchain context, but my suspicion is that shipping full source-level debugging information with the MAST is a lot of overhead that would be best avoided. The split debug info file would simply contain a binary-encoded representation of the mappings from instruction ranges to locations as mentioned in my first point. When loaded, you'd want to specify the root path where the sources are located, and all locations would be relative to that root.
  • Traces (printing a stack trace on error) can be constructed by mapping instruction offsets to ranges of instructions generated from a procedure, and having calls store a continuation pointer somewhere (i.e. the instruction pointer to which control will transfer upon return), with that in place you can "unwind" the stack to generate a trace. Presumably the latter already exists in some form. The metadata for what function/procedure is mapped to what instruction range can be stored in the debug info, or generated by the VM when loading a program, whichever is most convenient. This gives you useful stack traces even if you don't have sources available.

On the compiler end, we'll make sure we carry through full source information (when available) to Miden asm, but as far as how to feed that to the VM, that's really the part which is unspecified/up for debate at this point.
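
To make the "split" debug info point a bit more concrete, here is a small sketch of resolving relative source paths against a user-supplied root when source-level debugging is requested; the DebugInfo type and its fields are placeholders, not a proposed format.

// Sketch of loading split debug info and resolving its relative source paths
// against a root directory supplied by the user; `DebugInfo` is a placeholder.
use std::fs;
use std::path::{Path, PathBuf};

pub struct DebugInfo {
    /// Relative paths of the source files referenced by the debug info.
    pub files: Vec<PathBuf>,
}

/// Resolve every relative path in the debug info against the given source root,
/// returning only the files that are actually present on disk.
pub fn resolve_sources(root: &Path, info: &DebugInfo) -> Vec<PathBuf> {
    info.files
        .iter()
        .map(|rel| root.join(rel))
        .filter(|path| path.is_file())
        .collect()
}

/// Read the source text for a single relative location, if available.
pub fn load_source(root: &Path, relative: &Path) -> Option<String> {
    fs::read_to_string(root.join(relative)).ok()
}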

@bobbinth
Contributor Author

bobbinth commented Mar 9, 2023

I'd recommend mapping instruction ranges to locations, each location can be mapped back to the full source span (if available).

Are there some commonly used formats for this? Anything you'd recommend?

@bitwalker
Contributor

@bobbinth In my own virtual machine implementations, I've just used a custom encoding best suited to the bytecode format and what the runtime desired in terms of metadata. There are open formats for debug info, e.g. DWARF, but that makes more sense when compiling to native code, rather than for a virtual machine target. DWARF in particular is intended for pairing up with a debugger like lldb, by providing richer information about what types of data live where on the stack/registers, and even things like how to recover values like the frame pointer/return address for use in unwinding. Typically you have to roll your own debugger with a virtual machine though, so I'm pretty firmly in the camp that thinks you are better off rolling your own custom format tailored towards the needs of your VM/runtime/debugger stack.

Here's a sketch (in Rust terms) of one simple, but effective approach you can use for source-level debug info (based on what I've done in the past). It is probably worth noting that the compiler using this approach generates bytecode into a single bytecode module, and does so using a pattern where each function is generated independently using a FunctionBuilder that keeps all the state about a bytecode function isolated until it has been fully built (e.g. mapping blocks to instruction offsets, which are later resolved to relative offsets in the final module). When the function builder is dropped, all of the instructions for that function are written to the backing module, resolving all of the instruction offsets, and of particular interest here - registering offsets with their source locations, de-duplicating runs of instructions with the same location.

It's also worth noting that there is a separate table in the bytecode which maps function symbols to their absolute instruction offset in the module (as well as other metadata about functions, such as the call frame size). In the code below, a "function pointer" is really the absolute offset to the instruction where the definition of a function begins.

use std::collections::{BTreeMap, HashMap};
use std::num::NonZeroU32;
use std::rc::Rc;

/// This is used to represent the fully-qualified name of a function in the source program
pub type Symbol = String;
/// Filenames are reference-counted so they only need be allocated once per-file
pub type File = Rc<str>;
/// Line numbers start at 1
pub type Line = NonZeroU32;
/// Column numbers start at 1
pub type Column = NonZeroU32;
/// This represents a fully-decoded source location
pub struct SourceLocation {
    pub file: File,
    pub line: Line,
    pub column: Column,
}

/// This uniquely identifies a file in the debuginfo table
pub type FileId = u32;
/// This identifies a unique location in the source program (i.e. combination of file/line/column)
pub type LocationId = u32;
/// This is the type of value referred to by a `LocationId`, and doubles as the encoded form of a `SourceLocation`
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Location {
    pub file: FileId,
    pub line: Line,
    pub column: Column,
}

/// This is the in-memory representation of the debuginfo table, which can be trivially encoded/decoded to binary format.
pub struct DebugInfoTable {
    /// During compilation, this is populated with files as they are seen for the first time,
    /// and the `FileId` corresponds to the index of a `File` in this vector.
    pub files: Vec<File>,
    /// This is used to look up a file in the table before inserting into `files`
    pub files_to_id: BTreeMap<File, FileId>,
    /// During compilation, this is populated with locations as they are seen for the first time,
    /// and the `LocationId` corresponds to the index of a `Location` in this vector.
    pub locations: Vec<Location>,
    /// This is used to look up a location in the table before inserting into `locations`
    pub locations_to_id: HashMap<Location, LocationId>,
    /// This maps instruction offsets to unique locations.
    /// 
    /// During compilation, runs of instructions with the same `LocationId` are collapsed into the first
    /// offset at which the run starts. The `BTreeMap` API can be used to recover the `LocationId` for
    /// any offset as long as all instructions were mapped to some location (using a sentinel location
    /// which represents absence of sources as a fallback)
    pub offsets: BTreeMap<usize, LocationId>,
}
impl DebugInfoTable {
    pub fn get_or_insert_file<F: AsRef<str>>(&mut self, file: F) -> FileId {
        let file = file.as_ref();
        match self.files_to_id.get(file) {
            None => {
                assert!(self.files.len() < FileId::MAX as usize);
                let id = self.files.len() as FileId;
                let file: Rc<str> = file.into();
                self.files.push(file.clone());
                self.files_to_id.insert(file, id);
                id
            }
            Some(id) => *id,
        }
    }

    pub fn get_or_insert_location(&mut self, location: Location) -> LocationId {
        assert!((location.file as usize) < self.files.len());
        match self.locations_to_id.get(&location) {
            None => {
                assert!(self.locations.len() < LocationId::MAX as usize);
                let id = self.locations.len() as LocationId;
                self.locations.push(location);
                self.locations_to_id.insert(location, id);
                id
            }
            Some(id) => *id,
        }
    }

    /// Specify the location for an offset
    ///
    /// NOTE: This may result in multiple consecutive entries in the map with the same `LocationId`, 
    /// and while that has no effect on correctness, it does incur some overhead for lookups, so it is
    /// expected that when encoding to binary, these "duplicates" are compressed into a single run.
    /// Ideally though, the compiler will avoid registering consecutive offsets for the same location
    /// to avoid causing this "problem" in the first place.
    pub fn register_offset(&mut self, offset: usize, loc: LocationId) {
        self.offsets.insert(offset, loc);
    }

    /// Try to locate a source location for the current instruction pointer, ensuring that the
    /// search stays within the bounds of the current function.
    pub fn offset_to_source_location(&self, fp: usize, ip: usize) -> Option<SourceLocation> {
        let mut iter = self.offsets.range(fp..=ip);
        let (_, id) = iter.next_back()?;
        self.location_to_source_location(*id)
    }

    /// Get a source location corresponding to the first instruction in a function
    #[inline]
    pub fn function_pointer_to_source_location(&self, fp: usize) -> Option<SourceLocation> {
        self.location_to_source_location(self.offsets.get(&fp).copied()?)
    }

    /// Get the "rich" location data for a `LocationId`
    pub fn location_to_source_location(&self, loc: LocationId) -> Option<SourceLocation> {
        let loc = self.locations.get(loc as usize)?;
        let file = self.files[loc.file as usize].clone();
        Some(SourceLocation {
            file,
            line: loc.line,
            column: loc.column,
        })
    }
}

Encoding the above debug info to binary form can be done in a single pass. The layout in binary is something like the following, starting with the header:

| NUM_FILES # A u32 value representing the number of unique file names
| NUM_LOCS # A u32 value representing the number of unique locations
| NUM_OFFS # A u32 value representing the number of unique offset/location id pairs

The above can either directly precede the rest of the data, or be written to the bytecode module header. In my own case, I chose the latter, but it really makes no difference - the important bit is that we know ahead of time how many records of each type are encoded.

This is the encoded layout for the actual data:

| [usize; NUM_FILES] # Vector of file name sizes (in bytes), in big-endian encoding
| ... # all of the file name string data concatenated together in the same order as the vector of sizes above
| [Location; NUM_LOCS] # Vector of locations (file id/line/column using big-endian encoding)
| [(offset, LocationId); NUM_OFFS] # Vector of offset/location id pairs (in big-endian encoding)

Encoding is done as follows:

  1. Allocate a scratch buffer, using e.g. Vec, which will contain all of the string data for the filenames
  2. Traverse the vector of files, appending the raw bytes of the filename to the scratch buffer, and writing a u32 value to the underlying binary representing the length of the file name (in bytes).
  3. Once all of the files have been written to the scratch buffer, write that buffer to the underlying binary
  4. For each location in the table, write a sequence of 3 u32 values to the underlying binary, representing the file id (offset into the files vector), line number, and column number respectively.
  5. For each offset/location id pair in the table, write the offset followed by the location id to the underlying binary.

Decoding is more or less the inverse:

  1. Allocate an empty debug info table
  2. Read usize * NUM_FILES bytes from the input as a slice (let's call it size_data)
  3. For 0..NUM_FILES, parse a big-endian usize from size_data (let's call it size), then read size bytes from the input as a str, allocate a Rc<str> from it, and insert it into the debug info table. Once all file names have been read, we will have reified the Vec<File> from the original debug info table. When this step is done, the input is pointing to the first Location
  4. For 0..NUM_LOCS, parse 3 big-endian u32 values, validate that the first value (the file id) is within bounds of the parsed filenames vector. Insert each one into the debug info table.
  5. For 0..NUM_OFFS, parse a usize followed by a u32, in big-endian format. Validate that the latter (the location id) is within the bounds of the parsed locations. Insert each offset/location id pair in the debug info table

There are likely ways to make this even more of a compact encoding, and do things like avoid creating the Rc<str> for each file name by referring to the string data directly in the bytecode module, but those are very secondary concerns, and this representation is simple, uniform, and easy to extend.
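
As a concrete reference, a single-pass encoder for the layout described above could look roughly like this, written against the DebugInfoTable sketch earlier in this comment; the choice of u64 for the usize-width fields is an assumption for illustration.

// A sketch of the single-pass encoder for the layout described above, as an
// extension of the DebugInfoTable sketch; encoding usize fields as u64 is an
// illustrative assumption.
impl DebugInfoTable {
    pub fn encode(&self, out: &mut Vec<u8>) {
        // Header: number of files, locations, and offset/location pairs.
        out.extend_from_slice(&(self.files.len() as u32).to_be_bytes());
        out.extend_from_slice(&(self.locations.len() as u32).to_be_bytes());
        out.extend_from_slice(&(self.offsets.len() as u32).to_be_bytes());

        // File name sizes, followed by the concatenated string data.
        let mut scratch = Vec::new();
        for file in &self.files {
            out.extend_from_slice(&(file.len() as u64).to_be_bytes());
            scratch.extend_from_slice(file.as_bytes());
        }
        out.extend_from_slice(&scratch);

        // Locations: file id, line, column as big-endian u32 values.
        for loc in &self.locations {
            out.extend_from_slice(&loc.file.to_be_bytes());
            out.extend_from_slice(&loc.line.get().to_be_bytes());
            out.extend_from_slice(&loc.column.get().to_be_bytes());
        }

        // Offset -> location id pairs, in ascending offset order.
        for (offset, loc_id) in &self.offsets {
            out.extend_from_slice(&(*offset as u64).to_be_bytes());
            out.extend_from_slice(&loc_id.to_be_bytes());
        }
    }
}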

@vlopes11
Contributor

@bitwalker that looks overall like a good approach. As you suggested, we should indeed stick to the standard and use file:line:column for the debug mapping, as it will be friendly to all modern tooling, including, but not limited to, the Debug Adapter Protocol (DAP).

We currently have assembly library bundles that will be pre-compiled and eventually shipped to different hosts. This means we will perform compilation in one place and debug it elsewhere.

This brings a couple of decisions to the table:

  • Should we use absolute paths, relative paths, or filename identifiers for our filename field?
  • Should we embed the original MASM into the bundles so we can easily source map from a foreign host?
  • Should we support inlining trace? This would require a relatively complex solution to map a compiled operation that was optimized via the IR. It is ofc possible, but will demand non-trivial effort.

@bitwalker
Contributor

We currently have assembly library bundles that will be pre-compiled and eventually shipped to different hosts. This means we will perform compilation in one place and debug it elsewhere.
...
Should we use absolute paths, relative paths, or filename identifiers for our filename field?

I think a good default is to make all source paths relative to the project root from which they were compiled, so all that is needed to show source spans in a debugger/print in a terminal is to point the runtime to a directory which should be treated as the root. This is also nice since you can avoid shipping source with the program, but drop it in later when debugging as-needed.

But it is also fairly common that some sources are located in a different root directory (e.g. standard library sources). If there is only one set of separate sources (e.g. you have the user code and dependencies in one directory, and the standard library sources in another), then you can just handle that as two separate directories in the runtime; but if you can have arbitrarily many source roots (such as one for each dependency/library), that's where it gets messy, since you need a way to uniquely identify roots.

As an example, you might do so at the level of a codegen unit - basically at the level of a Rust crate - since the sources for those units should all come from the same project directory. You could then provide sources on a per-codegen-unit basis using a single directory tree (where at the root of the tree, each child is a directory with the same name as the codegen unit, with the sources stored relative to their origin codegen unit). That's just a sketch of an idea, but I think one way or the other, absolute paths are generally useless in all but development, since they are completely unportable.

Should we embed the original MASM into the bundles so we can easily source map from a foreign host?

Ideally users won't have to write MASM by hand once the compiler is ready, and instead you'd be bundling whatever they actually had a hand in writing - but in any case, if the overhead of bundling in the sources is not an issue, then it's certainly the best option in terms of portability. But it is definitely a heavyweight option, depending on the amount of source code that has to be shipped with the bytecode.

Should we support inlining trace? This would require a relatively complex solution to map a compiled operation that was optimized via the IR. It is ofc possible, but will demand non-trivial effort.

I would, yes, and I don't think it's as complex as you're expecting. Since source locations are threaded through the compiler from end-to-end, and transformations must be written to ensure that source locations are preserved, there isn't any reason why you can't recover the original function symbol to which a specific instruction belongs (from the source code perspective). The inlining transformation shouldn't be rewriting the source locations of the inlined code, so that won't be an issue. A bigger issue is usually compiler-generated code which doesn't correspond to any source code at all, but those sorts of things can be handled on a case-by-case basis to ensure the debugging output is still useful for identifying where the code came from.

The main thing needed for traces is a proper call stack, which we may already have - but in short, the VM needs to recognize when control transfers to a new function, and record that call frame information somewhere (the stack is easiest, but it can be its own parallel structure as well). In the virtual machine that Firefly targets, call instructions are responsible for pushing new call frames when appropriate (i.e. tail calls reuse the callers frame, so no new frame is recorded), and those call frames contain a continuation pointer at the base of the frame (essentially the return address). When capturing a stack trace, the current function symbol is obtained using the instruction pointer, but the rest of the trace is discovered by unwinding the stack frame-by-frame and looking up the symbol associated with the continuation pointer at the base of each frame, until the bottom of the stack is reached. A frame pointer is usually needed to support unwinding, but that's assuming a C-like stack regime - if some other parallel call frame structure is used, it's probably not needed.
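
A small sketch of that unwinding scheme is below; the frame layout and symbol table are illustrative assumptions, not an existing Miden VM structure.

// Sketch of trace capture via call frames with continuation pointers, as
// described above; the frame layout and symbol lookup are illustrative
// assumptions.
use std::collections::BTreeMap;

pub struct CallFrame {
    /// Instruction pointer to which control returns when this frame pops.
    pub continuation: usize,
}

pub struct SymbolTable {
    /// Maps the starting offset of each function to its fully-qualified name,
    /// so a range lookup finds the function containing a given instruction pointer.
    pub functions: BTreeMap<usize, String>,
}

impl SymbolTable {
    fn symbol_at(&self, ip: usize) -> Option<&str> {
        self.functions.range(..=ip).next_back().map(|(_, name)| name.as_str())
    }
}

/// Walk the call frames from the top of the stack down, resolving each
/// continuation pointer to the function it belongs to.
pub fn capture_trace(symbols: &SymbolTable, ip: usize, frames: &[CallFrame]) -> Vec<String> {
    let mut trace = Vec::new();
    if let Some(current) = symbols.symbol_at(ip) {
        trace.push(current.to_string());
    }
    for frame in frames.iter().rev() {
        if let Some(name) = symbols.symbol_at(frame.continuation) {
            trace.push(name.to_string());
        }
    }
    trace
}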


The most complicated thing under the debugging umbrella IMO is really interactive debugging, i.e. setting breakpoints, stepping over instructions, in/out of functions, alongside a visual representation of the source code; particularly with multiple threads of execution - and that's really where I hit the limit of my own experience, as I haven't really delved into implementing that kind of tooling before.

@vlopes11
Contributor

The most complicated thing under the debugging umbrella IMO is really interactive debugging, i.e. setting breakpoints, stepping over instructions, in/out of functions, alongside a visual representation of the source code; particularly with multiple threads of execution - and that's really where I hit the limit of my own experience, as I haven't really delved into implementing that kind of tooling before.

I don't think this is a particularly hard problem. We tick per operation; as long as we can map each operation to the frontend (be that a MASM file, a source file, or anything else that compiled to the operation), it is quite straightforward to implement breakpoints in terms of filename:line, or even conditional breakpoints, as nothing really stops us from evaluating expressions on the fly.

To step over, we can have some supporting data that will define the ranges for the procedures of a file. We do have a couple of edge cases such as the difference between inline/exec and call, but then I don't see a big problem there as we would always point to the function definition, regardless of the inline approach we took. It all falls to the source mapping and how robust it is - but once the source mapping is done, the rest becomes trivial.

I did implement a couple of debuggers in the past, and our case is quite simple, so I wouldn't worry about the feasibility of this issue too much.
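
As a reference point, here is a sketch of file:line breakpoints (including conditional ones) checked against the per-operation source mapping described above; every type and field name here is an illustrative assumption.

// Sketch of file:line breakpoints over a per-operation source mapping; the
// Breakpoint type and the predicate signature are illustrative assumptions.
use std::collections::HashSet;

#[derive(Clone, PartialEq, Eq, Hash)]
pub struct Breakpoint {
    pub file: String,
    pub line: u32,
}

pub struct Breakpoints {
    locations: HashSet<Breakpoint>,
    /// Conditional breakpoints: (location, predicate over the current stack).
    conditional: Vec<(Breakpoint, Box<dyn Fn(&[u64]) -> bool>)>,
}

impl Breakpoints {
    /// Called once per VM operation: breaks if the operation maps to a plain
    /// breakpoint, or to a conditional one whose predicate holds for the
    /// current stack.
    pub fn should_break(&self, loc: &Breakpoint, stack: &[u64]) -> bool {
        self.locations.contains(loc)
            || self
                .conditional
                .iter()
                .any(|(bp, pred)| bp == loc && pred(stack))
    }
}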

@bobbinth
Contributor Author

I think there are a few parts to this:

  1. When compiling .masm files, we need to provide source maps from higher-level languages. For this, we'll probably need to update the parsing methods (e.g., parse_program()) to accept the source maps (in whatever form).
  2. The parsers would need to "re-map" the source maps to now refer to AST nodes. These new source mappings could be output as separate files, or could be encoded into .masl files somehow. Separate files would probably be better.
    a. Even if we don't provide higher-level source maps to parsers, we may still want them to output source mappings for Miden assembly source.
  3. When the assembler compiles programs/modules, we would need to provide mappings generated in point 2 above. For this we'll probably need to update the compile() method (or provide another variant).
  4. The assembler would need to then "re-map" this data again to associate it with a specific MAST node/instruction. Identifying the node in the MAST should be pretty straightforward: code_block_hash + op_index should do. So, the mapping file should be able to map something like code_block_hash::op_index to the original source location (a rough sketch of this mapping is included below).
    a. The assembler would need to be able to handle a mix of modules where some modules may have debug info and others would not have it.
  5. We can update VmState struct to include code_block_hash and op_index for every cycle. This will simplify the whole thing as we would no longer need to keep track of AsmOpInfo.
  6. Finally, when we execute a program using execute_iter() function, we'd get back an iterator over VmStates, and using the debug info from step 4 above, we should be able to map any cycle to the corresponding location in the original source file.

I think we could start with step 2 (step 1 can be implemented later). The most complex parts would be steps 2, 3, and 4. Steps 5 and 6 should be pretty easy.
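
To make point 4 a bit more concrete, here is a rough sketch of a mapping keyed by code_block_hash + op_index; Digest is a stand-in for the code block hash type and the overall layout is an illustrative assumption, not the assembler's actual format.

// Sketch of the point 4 mapping: (code_block_hash, op_index) -> source location.
// `Digest` and the field layout are illustrative assumptions.
use std::collections::HashMap;

/// Stand-in for the hash of a MAST code block.
pub type Digest = [u8; 32];

#[derive(Clone)]
pub struct SourceLocation {
    pub file: String,
    pub line: u32,
    pub column: u32,
}

#[derive(Default)]
pub struct ProgramDebugInfo {
    /// Maps (code_block_hash, op_index) to the originating source location.
    ops: HashMap<(Digest, usize), SourceLocation>,
}

impl ProgramDebugInfo {
    pub fn record(&mut self, block: Digest, op_index: usize, loc: SourceLocation) {
        self.ops.insert((block, op_index), loc);
    }

    /// Used in step 6: given the code_block_hash and op_index reported for a
    /// cycle, recover the original source location (if the module had debug info).
    pub fn lookup(&self, block: &Digest, op_index: usize) -> Option<&SourceLocation> {
        self.ops.get(&(*block, op_index))
    }
}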

@vlopes11
Contributor

I think it would fall into the following diagram:

[diagram: mapping from opcode/decorator through ASM and IR back to the source file]

All these mappings from the debug info must provide a reliable link from the opcode/decorator into the ASM, IR, & source file. I think it's feasible, wdyt?

@bitwalker
Contributor

bitwalker commented Mar 15, 2023

At each stage where you "lower" to a new representation, it is the responsibility of that lowering translation to carry forward source locations in the higher-level representation to the lower-level representation. The result is that when you reach the lowest level representation, you don't need to map things back to each of the previous levels, instead you have a direct link from individual instructions to the source location from which it is derived. The original sources can be a high-level language, or hand-written Miden ASM, and the end result is the same. The actual debug info can be stored separately, but a reference into that debug info should be stored in the MAST itself (or if not in the MAST, in an adjacent structure which maps a node in the tree to metadata about that node) - something akin to LocationId in my earlier code example.

For this, we'll probably need to update the parsing methods (e.g., parse_program()) to accept the source maps (in whatever form).

@bobbinth My working assumption at the moment is that the compiler is lowering to MAST directly (via Miden ASM, at least initially; but in either case, not as a string, but in Rust directly). In other words, we can directly attach source location metadata to ops in ASM (and/or MAST), and even use the same representation in both the compiler and VM. If parsing Miden ASM from a file/string, I think we should assume that it is hand-written, and treat it as the original source code - that is certainly true today given that there is no other way to write programs for the Miden VM (that I know of), and will only be reinforced when the compiler is available.

@vlopes11
Contributor

Yes, we just need to pay attention to the "rogue" stdlib in the flow. It means that the debug info structure will be per opcode, and not per linked bundle. Everything else should be straightforward.

@grjte
Contributor

grjte commented Apr 18, 2023

Superseded by #866

grjte closed this as completed on Apr 18, 2023