Miden assembly debugging improvements #580
The right, if not only, way to do this is with an emulator that evaluates the assembly code, one step at a time. There are two obvious hosts:
A command line debugger could be done too, but these things are way too hard to operate, IMHO. Writing the emulator is non-trivial: it has to faithfully do what the real machine would do. Hopefully there are Rust functions which can evaluate opcodes and get the maths right. Given that, the control structures are easy. What I would like is:
More fanciful, on seeing an
The traditional operations are:
The movie mode could use a control pane where the mouse can be used to speed up or slow down the execution. Backtracking is essential later: you get an error, you want to backstep to see how you got there. The opcode evaluation can be left out in the first cut. Instead, you get a button for
In addition, a way to view the advice tape and memory is also required for a complete picture. Furthermore, some way of handling calls to library procedures is required; by default these would be treated as a single-step operation, unless the user hits a key to recurse into the library code. The reason an emulator is required should be clear: the real machine doesn't operate on assembly instructions.
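The step/backstep loop described above can be sketched in Rust. This is a minimal sketch under assumed types: `State` and `step` are toy stand-ins for the real VM state and transition function, not part of any Miden VM API; backtracking is done naively by snapshotting the whole state each cycle.

```rust
#[derive(Clone, Debug, PartialEq)]
struct State {
    clk: u32,
    stack: Vec<u64>,
}

/// One emulated cycle: a toy transition that pushes the clock value.
/// A real emulator would decode and evaluate the current opcode here.
fn step(state: &mut State) {
    state.stack.push(state.clk as u64);
    state.clk += 1;
}

/// Debugger loop state: current VM state plus a history of snapshots,
/// so that `back` can restore any earlier cycle.
struct Debugger {
    current: State,
    history: Vec<State>,
}

impl Debugger {
    fn new(initial: State) -> Self {
        Debugger { current: initial, history: Vec::new() }
    }

    /// Step forward one cycle, remembering the previous state.
    fn forward(&mut self) {
        self.history.push(self.current.clone());
        step(&mut self.current);
    }

    /// Step backward one cycle by restoring the last snapshot;
    /// returns false when already at the first cycle.
    fn back(&mut self) -> bool {
        match self.history.pop() {
            Some(prev) => {
                self.current = prev;
                true
            }
            None => false,
        }
    }
}
```

For long runs, snapshotting every cycle is wasteful; periodic checkpoints plus re-execution from the nearest checkpoint would give the same backstep behavior with far less memory.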
Some additional thoughts on the topic:
That is a great idea. My two cents: I think we should have endpoints to download the data and run the debugging locally. This way we can improve the debugging tooling without having to update the client RPC, we can support private transactions, and we can simulate the execution of the transaction on a different block than what happened on-chain.
The first item from the original post was addressed in #693
@vlopes11 is taking a look at point 2 from the original post |
This commit introduces `breakpoint`, a transparent instruction that will cause a debug execution to break when reached. This instruction will not be serialized into libraries, and will not have an opcode or be part of the code block tree. For debug executions, it will produce a `NOOP`, so the internal clock of the VM will be affected. The decision of not including this as part of the code block is to avoid further consuming variants of opcodes as they are already scarce. related issue: #580
I have some thoughts on this:
On the compiler end, we'll make sure we carry through full source information (when available) to Miden asm, but as far as how to feed that to the VM, that's really the part which is unspecified/up for debate at this point.
Are there some commonly used formats for this? Anything you'd recommend?
@bobbinth In my own virtual machine implementations, I've just used a custom encoding best suited to the bytecode format and what the runtime desired in terms of metadata. There are open formats for debug info, e.g. DWARF, but that makes more sense when compiling to native code, rather than for a virtual machine target. DWARF in particular is intended for pairing up with a native debugger.

Here's a sketch (in Rust terms) of one simple but effective approach you can use for source-level debug info (based on what I've done in the past). It is probably worth noting that the compiler using this approach generates bytecode into a single bytecode module, and does so using a pattern where each function is generated independently.

It's also worth noting that there is a separate table in the bytecode which maps function symbols to their absolute instruction offset in the module (as well as other metadata about functions, such as the call frame size). In the code below, a "function pointer" is really the absolute offset to the instruction where the definition of a function begins.

use std::collections::{BTreeMap, HashMap};
use std::num::NonZeroU32;
use std::rc::Rc;

/// This is used to represent the fully-qualified name of a function in the source program
pub type Symbol = String;
/// Filenames are reference-counted so they only need be allocated once per-file
pub type File = Rc<str>;
/// Line numbers start at 1
pub type Line = NonZeroU32;
/// Column numbers start at 1
pub type Column = NonZeroU32;
/// This represents a fully-decoded source location
pub struct SourceLocation {
pub file: File,
pub line: Line,
pub column: Column,
}
/// This uniquely identifies a file in the debuginfo table
pub type FileId = u32;
/// This identifies a unique location in the source program (i.e. combination of file/line/column)
pub type LocationId = u32;
/// This is the type of value referred to by a `LocationId`, and doubles as the encoded form of a `SourceLocation`
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
pub struct Location {
pub file: FileId,
pub line: Line,
pub column: Column,
}
/// This is the in-memory representation of the debuginfo table, which can be trivially encoded/decoded to binary format.
pub struct DebugInfoTable {
/// During compilation, this is populated with files as they are seen for the first time,
/// and the `FileId` corresponds to the index of a `File` in this vector.
pub files: Vec<File>,
/// This is used to look up a file in the table before inserting into `files`
pub files_to_id: BTreeMap<File, FileId>,
/// During compilation, this is populated with locations as they are seen for the first time,
/// and the `LocationId` corresponds to the index of a `Location` in this vector.
pub locations: Vec<Location>,
/// This is used to look up a location in the table before inserting into `locations`
pub locations_to_id: HashMap<Location, LocationId>,
/// This maps instruction offsets to unique locations.
///
/// During compilation, runs of instructions with the same `LocationId` are collapsed into the first
/// offset at which the run starts. The `BTreeMap` API can be used to recover the `LocationId` for
/// any offset as long as all instructions were mapped to some location (using a sentinel location
/// which represents absence of sources as a fallback)
pub offsets: BTreeMap<usize, LocationId>,
}
impl DebugInfoTable {
pub fn get_or_insert_file<F: AsRef<str>>(&mut self, file: F) -> FileId {
let file = file.as_ref();
match self.files_to_id.get(file) {
None => {
assert!(self.files.len() < FileId::MAX as usize);
let id = self.files.len() as FileId;
let file: Rc<str> = file.into();
self.files.push(file.clone());
self.files_to_id.insert(file, id);
id
}
Some(id) => *id,
}
}
pub fn get_or_insert_location(&mut self, location: Location) -> LocationId {
assert!((location.file as usize) < self.files.len());
match self.locations_to_id.get(&location) {
None => {
assert!(self.locations.len() < LocationId::MAX as usize);
let id = self.locations.len() as LocationId;
self.locations.push(location);
self.locations_to_id.insert(location, id);
id
}
Some(id) => *id,
}
}
/// Specify the location for an offset
///
/// NOTE: This may result in multiple consecutive entries in the map with the same `LocationId`,
/// and while that has no effect on correctness, it does incur some overhead for lookups, so it is
/// expected that when encoding to binary, these "duplicates" are compressed into a single run.
/// Ideally though, the compiler will avoid registering consecutive offsets for the same location
/// to avoid causing this "problem" in the first place.
pub fn register_offset(&mut self, offset: usize, loc: LocationId) {
self.offsets.insert(offset, loc);
}
/// Try to locate a source location for the current instruction pointer, ensuring that the
/// search stays within the bounds of the current function.
pub fn offset_to_source_location(&self, fp: usize, ip: usize) -> Option<SourceLocation> {
let mut iter = self.offsets.range(fp..=ip);
let (_, id) = iter.next_back()?;
self.location_to_source_location(*id)
}
/// Get a source location corresponding to the first instruction in a function
#[inline]
pub fn function_pointer_to_source_location(&self, fp: usize) -> Option<SourceLocation> {
self.location_to_source_location(self.offsets.get(&fp).copied()?)
}
/// Get the "rich" location data for a `LocationId`
pub fn location_to_source_location(&self, loc: LocationId) -> Option<SourceLocation> {
let loc = self.locations.get(loc as usize)?;
let file = self.files[loc.file as usize].clone();
Some(SourceLocation {
file,
line: loc.line,
column: loc.column,
})
}
}

Encoding the above debug info to binary form can be done in a single pass. The layout in binary is something like the following, starting with the header:
The above can either directly precede the rest of the data, or be written to the bytecode module header. In my own case, I chose the latter, but it really makes no difference - the important bit is that we know ahead of time how many of each record type is encoded. This is the encoded layout for the actual data:
Encoding is done as follows:
Decoding is more or less the inverse:
There are likely ways to make this an even more compact encoding.
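Since the layout tables above didn't survive extraction, here is a minimal sketch of what a single-pass encoder along these lines could look like. The exact field widths and record order are illustrative assumptions, not the format from the comment above: a fixed header carrying record counts, then length-prefixed file paths, fixed-width location records, and offset-to-location pairs, all little-endian.

```rust
/// Encode a debug-info table in one pass. `locations` are
/// (file_id, line, column) triples; `offsets` are
/// (instruction offset, location id) pairs.
fn encode(
    files: &[&str],
    locations: &[(u32, u32, u32)],
    offsets: &[(u64, u32)],
) -> Vec<u8> {
    let mut out = Vec::new();
    // header: counts of each record type, so the decoder can pre-allocate
    out.extend((files.len() as u32).to_le_bytes());
    out.extend((locations.len() as u32).to_le_bytes());
    out.extend((offsets.len() as u32).to_le_bytes());
    // file table: u16 length prefix + UTF-8 bytes
    for f in files {
        out.extend((f.len() as u16).to_le_bytes());
        out.extend(f.as_bytes());
    }
    // location records: (file_id, line, column), fixed width
    for &(file, line, col) in locations {
        out.extend(file.to_le_bytes());
        out.extend(line.to_le_bytes());
        out.extend(col.to_le_bytes());
    }
    // offset map: (instruction offset, location id)
    for &(off, loc) in offsets {
        out.extend(off.to_le_bytes());
        out.extend(loc.to_le_bytes());
    }
    out
}
```

A decoder is the inverse: read the three counts from the header, then consume exactly that many records of each kind.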
@bitwalker that looks overall like a good approach. As you suggested, we indeed should stick to the standard and use file:line:column for a debug mapping, as it will be friendly to all modern applications, including, but not limited to, the debug adapter protocol (DAP). We currently have assembly library bundles that will be pre-compiled, and eventually shipped, to different hosts. This means we will perform compilation in one place, and debug it elsewhere. This brings a couple of decisions to the table:
I think a good default is to make all source paths relative to the project root from which they were compiled, so all that is needed to show source spans in a debugger/print in a terminal is to point the runtime to a directory which should be treated as the root. This is also nice since you can avoid shipping source with the program, but drop it in later when debugging as-needed.

But it is also fairly common that some sources are located in a different root directory (e.g. standard library sources). If there is only one set of separate sources (e.g. you have the user code and dependencies in one directory, and the standard library sources in another), then you can just handle that as two separate directories in the runtime; but if you can have arbitrarily many source roots (such as one for each dependency/library), that's where it gets messy, since you need a way to uniquely identify roots.

As an example, you might do so at the level of a codegen unit - basically at the level of a Rust crate - since the sources for those units should all come from the same project directory. You could then provide sources on a per-codegen-unit basis using a single directory tree (where at the root of the tree, each child is a directory with the same name as the codegen unit, with the sources stored relative to their origin codegen unit). That's just a sketch of an idea, but I think one way or the other, absolute paths are generally useless in all but development, since they are completely unportable.
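The per-codegen-unit lookup sketched above amounts to a small resolver: given a unit name and a path recorded relative to that unit's root, join them. This is a hypothetical sketch; the unit names and directory layout are illustrative, not an existing convention.

```rust
use std::path::PathBuf;

/// Resolve a relative source path from debug info against a set of
/// (codegen unit name, root directory) pairs. Returns `None` when the
/// unit is unknown, i.e. its sources were not provided to the runtime.
fn resolve_source(
    roots: &[(&str, &str)],
    unit: &str,
    rel_path: &str,
) -> Option<PathBuf> {
    roots
        .iter()
        .find(|(name, _)| *name == unit)
        .map(|(_, root)| PathBuf::from(root).join(rel_path))
}
```

Missing units degrade gracefully: the debugger can still print file:line:column from the debug info even when it cannot open the file itself.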
Ideally users won't have to write MASM by hand once the compiler is ready, and instead you'd be bundling whatever they actually had a hand in writing - but in any case, if the overhead of bundling in the sources is not an issue, then it's certainly the best option in terms of portability. But it is definitely a heavyweight option, depending on the amount of source code that has to be shipped with the bytecode.
I would, yes, and I don't think it's as complex as you're expecting. Since source locations are threaded through the compiler from end to end, and transformations must be written to ensure that source locations are preserved, there isn't any reason why you can't recover the original function symbol to which a specific instruction belongs (from the source code perspective). The inlining transformation shouldn't be rewriting the source locations of the inlined code, so that won't be an issue. A bigger issue is usually compiler-generated code which doesn't correspond to any source code at all, but those sorts of things can be handled on a case-by-case basis to ensure the debugging output is still useful for identifying where the code came from.

The main thing needed for traces is a proper call stack, which we may already have - but in short, the VM needs to recognize when control transfers to a new function, and record that call frame information somewhere (the stack is easiest, but it can be its own parallel structure as well). In the virtual machine that Firefly targets, call instructions are responsible for pushing new call frames when appropriate (i.e. tail calls reuse the caller's frame, so no new frame is recorded), and those call frames contain a continuation pointer at the base of the frame (essentially the return address). When capturing a stack trace, the current function symbol is obtained using the instruction pointer, but the rest of the trace is discovered by unwinding the stack frame-by-frame and looking up the symbol associated with the continuation pointer at the base of each frame, until the bottom of the stack is reached. A frame pointer is usually needed to support unwinding, but that's assuming a C-like stack regime - if some other parallel call frame structure is used, it's probably not needed.

The most complicated thing under the debugging umbrella IMO is really interactive debugging, i.e. setting breakpoints, stepping over instructions, in/out of functions, alongside a visual representation of the source code; particularly with multiple threads of execution - and that's really where I hit the limit of my own experience, as I haven't really delved into implementing that kind of tooling before.
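The unwinding scheme described above can be sketched as follows. This is a toy model, not the Firefly or Miden implementation: functions are identified by their entry offset, and each frame is reduced to just its continuation pointer.

```rust
use std::collections::BTreeMap;

/// Map an arbitrary instruction pointer to a function symbol by finding
/// the nearest function entry offset at or below it.
fn symbol_at(
    functions: &BTreeMap<usize, &'static str>,
    ip: usize,
) -> Option<&'static str> {
    functions.range(..=ip).next_back().map(|(_, name)| *name)
}

/// Capture a stack trace: the current function comes from the
/// instruction pointer, the rest from unwinding frame-by-frame and
/// resolving the continuation pointer stored at the base of each frame.
fn stack_trace(
    functions: &BTreeMap<usize, &'static str>,
    ip: usize,
    frames: &[usize], // continuation pointers, innermost frame first
) -> Vec<&'static str> {
    let mut trace = Vec::new();
    if let Some(name) = symbol_at(functions, ip) {
        trace.push(name);
    }
    for &ret in frames {
        if let Some(name) = symbol_at(functions, ret) {
            trace.push(name);
        }
    }
    trace
}
```

The `BTreeMap` range lookup is the same trick the earlier `DebugInfoTable` sketch uses for `offsets`, so symbol and source-location resolution can share one mechanism.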
I don't think this is a particularly hard problem. We tick per operation; as long as we can map each operation to the frontend (being that a MASM file, a source, or anything that compiled to the operation), it is quite straightforward to implement breakpoints in terms of filename:line, or even conditional breakpoints, as nothing really stops us from evaluating expressions on the fly. To step over, we can have some supporting data that will define the ranges for the procedures of a file. We do have a couple of edge cases such as the difference between inline/exec and call, but then I don't see a big problem there, as we would always point to the function definition, regardless of the inline approach we took. It all comes down to the source mapping and how robust it is - but once the source mapping is done, the rest becomes trivial. I did implement a couple of debuggers in the past, and our case is quite simple, so I wouldn't worry about the feasibility of this issue too much.
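The per-tick breakpoint check described above is a small amount of code once the source mapping exists. This is a hypothetical sketch: the clock-to-location map and the condition-as-function-of-the-stack representation are illustrative assumptions.

```rust
use std::collections::HashMap;

struct Breakpoint {
    file: &'static str,
    line: u32,
    /// Optional condition evaluated against the current stack;
    /// `None` means an unconditional breakpoint.
    condition: Option<fn(&[u64]) -> bool>,
}

/// Returns true if execution should pause at this tick. `source_map`
/// maps a clock cycle to the (file, line) it originated from.
fn should_break(
    source_map: &HashMap<u64, (&'static str, u32)>,
    breakpoints: &[Breakpoint],
    clk: u64,
    stack: &[u64],
) -> bool {
    let Some(&(file, line)) = source_map.get(&clk) else {
        return false;
    };
    breakpoints.iter().any(|bp| {
        bp.file == file
            && bp.line == line
            && bp.condition.map_or(true, |cond| cond(stack))
    })
}
```

Step-over falls out of the same data: run ticks until the mapped (file, line) leaves the current procedure's range.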
I think there are a few parts to this:
I think we could start with step 2 (step 1 can be implemented later). The most complex parts would be steps 2, 3, and 4. Steps 5 and 6 should be pretty easy.
At each stage where you "lower" to a new representation, it is the responsibility of that lowering translation to carry forward source locations in the higher-level representation to the lower-level representation. The result is that when you reach the lowest-level representation, you don't need to map things back to each of the previous levels; instead you have a direct link from individual instructions to the source location from which they are derived. The original sources can be a high-level language, or hand-written Miden ASM, and the end result is the same. The actual debug info can be stored separately, but a reference into that debug info should be stored in the MAST itself (or if not in the MAST, in an adjacent structure which maps a node in the tree to metadata about that node).
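The "carry locations forward at every lowering" rule can be shown with a toy example. This is a sketch with made-up instruction names: a high-level instruction expands into several lower-level ops, and each emitted op simply inherits the `LocationId` of the instruction it came from.

```rust
type LocationId = u32;

struct HighInst {
    op: &'static str,
    loc: LocationId,
}

struct LowOp {
    op: &'static str,
    loc: LocationId,
}

/// A toy lowering pass: expand one high-level instruction into
/// low-level ops, each tagged with the originating source location.
fn lower(inst: &HighInst) -> Vec<LowOp> {
    let ops: Vec<&'static str> = match inst.op {
        // hypothetical compound op that lowers to two adds
        "add3" => vec!["add", "add"],
        other => vec![other],
    };
    ops.into_iter()
        .map(|op| LowOp { op, loc: inst.loc })
        .collect()
}
```

Because every pass follows the same rule, the final representation needs no back-mapping through intermediate levels: each op points straight at its source location.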
@bobbinth My working assumption at the moment is that the compiler is lowering to MAST directly (via Miden ASM, at least initially; but in either case, not as a string, but in Rust directly). In other words, we can directly attach source location metadata to ops in ASM (and/or MAST), and even use the same representation in both the compiler and VM. If parsing Miden ASM from a file/string, I think we should assume that it is hand-written, and treat it as the original source code - that is certainly true today given that there is no other way to write programs for the Miden VM (that I know of), and will only be reinforced when the compiler is available.
Yes, we just need to pay attention to the "rogue" stdlib in the flow. It means that the debug info structure will be per opcode, and not per linked bundle. Everything else should be straightforward.
Superseded by #866
Currently, debugging Miden assembly is not easy. We do have the ability to step through program execution via the `execute_iter()` function, but to use this function one has to write Rust code, which is not always convenient. So, here are a few thoughts about how we could improve the debugging experience (other thoughts are welcome):

1. Add a `debug` command to the CLI. This command would be similar to `run`, but it would step through program execution cycle by cycle. We'll need to come up with an interface for how to move forward and backward, and what to print on the screen - but that shouldn't be too complicated. Also, it would be cool to come up with a way to graphically map VM operations to assembly instructions.
2. The `debug` CLI command would be useful for small programs, but if a program is big, manually stepping through thousands of cycles will probably not be practical. One way to address this is to add a `debug` (or `debugger`) instruction to the assembly (this would actually be a decorator). Then, when we run a program via the `miden debug` command, the program could run until the next occurrence of the `debug` instruction. From that point on, we'd be able to step through the execution (both forward and backward) in the same way as described in point 1 above.

   a. The way we retain source info when we go from source code to MAST (I think we should do it via complementary source-map files).
   b. The way we execute programs in debug mode (I think the current way of handling instruction-to-operation mapping is not the best).

For the last part, I will write up a separate issue.

And again, if anyone has other thoughts/suggestions - would love to hear them.