-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tracking issue: debugging improvements for v0.6 #866
Comments
The goal is to provide a reliable, extensible debug structure that can be used in different platforms. Long term simplified structureThis is a simplified vision on how the architecture should look in the long run. The components flagged in orange are currently unimplemented. TokenizerIt will be responsible to extract a string and break it into a tokens stream that may or may not be semantically correct. Tokenizers always strive to be as robust as possible, given the diversity of their inputs. We might have different breaklines style, tabulation, char codeset, etc; this layer is responsible for standardizing such inputs into a common, strict, and well-formed output. Whenever we can make a tokenizer infallible, we should do so as the parser will be able to skip error checking, mitigating its complexity. MASM is designed for simplicity; we can take advantage of that and extend that simplicity into the tokenizer, making it infallible. This will allow us to easily build plugins in different architectures as it will fall to simply using a handful of string pointers. One example is treesitter that has its parsers implemented in C: we can simply implement a dylib in Rust, receive a pointer to a string, and populate another pointer with a sequence of strings (tokens). Such simplified structure will allow us to highlight the syntax for malformed programs. ParserFrom the tokens sequence, this component will output either procedures (AST form), or a main program. The procedures should be indexed on a repository (here defined as procedures cache), and they must be visible during runtime. Ideally, they must be inspectable; meaning we should have access to the high-level code (probably MASM; sometimes, IR, Sway, Move, etc) that generated them. Debug Adapter ProtocolThe current standard is to use a debug adapter protocol as this will reduce the programming overhead to support different IDE/interfaces. Having a correct implementation of DAP will imply we have a minimal effort to provide support for applications such as NeoVim, VsCode, Emacs, etc. MVPFor a minimum viable product, we need to introduce:
|
For a minimal, working version, we can initially support ProgramAst alone. The first requirement is to link [Token]s to a [SourceLocation], that is achieved by #861. The main attribute of [ProgramAst] is One option is to create [NodeAst] that will contain The next step is to handle CodeBlock. As [ProgramAst], it is but a sequence of [Operation]. Therefore, under the assumption that combine_blocks won't effectively change the order of spans (i.e. contiguous [Operation]), we can safely map the root of each of these blocks to the list of source locations that generated it. Even if the blocks themselves are reorganized, the spans are still contiguous instructions that maps Such map should live in a context that will be available to the CLI, and is optionally inserted. We can benefit from the Given that, every time we append a new [CodeBlock] to [Assembler], we also append its list of [SourceLocation] to [DebugContext], mapped by the root of [CodeBlock]. Therefore, every [Operation] in a Program, given its [CodeBlock], will be fully mapped to its [SourceLocation]. The next step is to yield [DebugContext] from [Assembler] and expose it to processor::execute. We can encapsulate such structure into an [ExecutionContext] that might hold other components such as [AdviceProvider]. Given the map described above |
I'm not sure this will work because the Node struct could contain an arbitrarily deep tree. We could, however, modify the One other thing to think about: how will we serialize/deserialize this extra info in
I think this may be the most complex part in the whole thing.
This is not entirely correct. A root of a code block is not unique - i.e., many code blocks within a program may end up having the same root. The path from the MAST root to the code block is unique, however, this path won't be available until the whole MAST is built (see #862 (comment)). I think we need to flash this part out a bit more as it is not clear to me yet how we'll map code blocks to locations and how we'll represent this mapping.
I probably wouldn't change pub fn execute_iter<A>(
program: &Program,
stack_inputs: StackInputs,
advice_provider: A,
debug_info: DebugInfo,
) -> VmStateIterator
where
A: AdviceProvider,
I'm assuming we'll need to modify |
You have a point there. Before we jump into the The MAST path alone might not be enough to entirely solve the equation. Given the following example
I think we will end up with the same MAST root & path for both foo and bar, but they should have different source locations (this would extend to all loaded libraries). We can naturally treat this as negligible, zero-impact problem, but a robust solution would also cover that scenario. I'm thinking of an alternative for this. |
An even simpler example would be:
Hash of both branches in the |
SummaryWe have two pending decisions. First:
Second:
DetailsWe need a mechanism to disambiguate duplicated bodies across procedures & blocks. Example:
These two procedures have exactly the same body in different source locations. Therefore, we cannot map the root of the MAST to a set of locations as the absolute module path + procedure name should be part of this lookup. Moreover, This solution must be generic enough to map also blocks under if/else spans; example:
This will be parsed to:
It means we will have the same roots for Using the MAST path as index to a source location would solve this particular example, as in:
However, this wouldn't work for procedures & exec calls. The following example would represent a conflicting location:
As we would have a duplicated index As for exec, we would have the following problem:
The MAST path is We have different approaches on how to attack this problem. The more robust solution - however, the one with the biggest impact - is to annotate every leaf of the MAST with metadata - this metadata will contain its underlying source location. The benefit of this is we will have the ability to annotate instructions in the future with other types of metadata (docs, optimization suggestions, assertions to be performed, etc). The negative point is this would involve a bigger refactor of the code base. Initially, the metadata would be a simple [SourceLocation]. Whenever we are compiling from MAST to [CodeBlock], we just consume this metadata and create a similar tree-structure for [CodeBlock] leaves. Another possibility is to create a parallel data structure that will be a tree and map the leaves of the MAST 1:1 with such metadata structure. We extend the advantages of annotating the instructions individually and we greatly reduce the refactor impact, but we generate runtime overhead to compute & store this additional structure. There is another advantage to this approach that is we can attach this structure to generated programs, and expect them to have the debug information injected on the fly. Finally, we can create a mapping with the absolute path of the module + path of the MAST as index to a source location. @bobbinth suggested we also append the operation index there, but it's not entirely clear to me why that is needed. One instruction generates multiple operations, but all operations generated by one instruction will map to the same source location of the instruction. Further elaboration on that would be helpful. Exploring further the last option; we can use ProcedureId as prefix of the MAST path, falling back to ProcedureId::main when compiling a main procedure. However, this introduces a new challenge: to define the module path of a AST module. Currently, the signature to parse modules is the following: fn parse_module(source: &str) -> Result<ModuleAst, ParsingError> As we can see, we can't extract the [ProcedureId] from the source contents. We can simply add an argument of the module path: fn parse_module(path: &str, source: &str) -> Result<ModuleAst, ParsingError> However, this problem of defining the module path in order to compute a [ProcedureId] is not new. Currently, we are using the disk path, but that might not be ideal as we don't really have the concept of a workspace in MASM. In Rust, that is resolved by searching upwards for a In MASL, we query the user on what is the library bundle prefix, and what is the root path. The implementation is here. Alternatively, given we don't have an open issue to implement workspace structure, we can introduce a |
Before we get to MAST, we need to figure out how to track location data in the AST structs ( One question here is how do we serialize these AST structs when they contain source mappings. Specifically, do they get saves as a part of This is related to your second question. In this question I'd prefer to go with the first approach - i.e., passing path to Regarding the first question: I would prefer something more like the second approach as I don't want to complicate the MAST structure more (I'd rather simplify it as much as possible). This would allow to simplify the VM processor as well. Ideally, the MAST structures should not contain any debug or source map related data - and all of this extra data would be provided separately when we call |
Btw, I refactored things a bit in the latest commit. For example, there is no longer |
One thing to think about: we probably want to instantiate pub fn new(source: &'a str, source_id: SourceId) -> Result<Self, ParsingError> We could use Edit: now that I think about it more, we may actually not need to track source ID at the token level - this really depends on how we add the source location data to For example, we could change pub enum Node {
Instruction(Instruction),
IfElse(Vec<Node>, Vec<Node>),
Repeat(u32, Vec<Node>),
While(Vec<Node>),
} To something like: pub enum Node {
Instruction(Instruction),
IfElse(IfElseNode),
Repeat(RepeatNode),
While(WhileNode),
} And then pub struct WhileNode {
body: Vec<Node>,
locations: Vec<SourceLocation>
} |
Regarding MAST vs ProgramAst, ModuleAst, ProcedureAst; the proposed changes focuses on the latter (not MAST).
I think they should go on different files. If we take as example regular ELF binaries, they might contain debug information, and this might be loaded to memory when executing a binary. For example, in MASL, we will need a strategy to avoid deploying such metadata in production environments to avoid unnecessary storage consumption. If we want to embed this source mapping onto MASL files, we will need to introduce support for such modes - that will be added complexity for this first minimalistic implementation.
Sounds good! To me, it makes more sense to have a dedicated struct for that instead of a raw string - so we entirely delegate the responsibility of iterating the path of the module to such structure (instead of split
We have a couple of preliminary questions for that. [SourceId] makes sense only if such ID is indexable somewhere. We might be talking about a [ParserContext] here that will hold a As I see, the main goal of the [SourceId] is to prevent us from duplicating the concrete path for every node of the tree: we, instead, have something like a I think it is very positive we introduce such id now, but we will also need this context structure. That wouldn't be necessarily the same as the one we might serialize for MASL, as the latter cannot depend on the host filesystem, and the former could contain just a small set of MASM paths. Such [ParserContext] can be an implementation detail of [Assembler] and never exposed publicly. It will support the parsing and compilation (i.e. transform [ProgramAst] into [Program]), and could be discarded afterwards. The [ParserContext] wouldn't support the debug operation either, as we will need a different structure to resolve the locations (for the problem described above of host-independent, serialized debug information). If we state that a source mapping is always host-dependant, then we can use the same context everywhere - from tokenization to execution. If we opt to go otherwise, these [SourceId] won't be relevant beyond parsing, and we might want to include as metadata the original string of the MASM that generated procedures - probably indexed by their path - deeming [SourceId] useless for that context |
Yeah - I like this approach better too. It will make deserialization of
Out current |
This commit adds source location to parsed nodes, allowing the construction of source mapping. It modifies the [Node] structure based on the discussion that took place here: #866 (comment) The implementation is slightly different than the one discussed as every instruction individually will contain a [SourceLocation] - it will be less expensive to store it like that instead of creating sets for each of the block variants. The [SourceLocation] itself is a cheap type that will contain integers only. This commit doesn't touch the serialization question for ModuleAst with SourceLocation. These additional bytes might be optional, and including them might be optional as they will consume additional production storage when they will provide functionality only for debug/development environments.
Goal(s)
Our main goal is to provide source mappings (e.g., line numbers) between a given VM operation and Miden assembly source file (and, in the future, higher-language source files too).
Details
We want to add basic source mapping and debugger improvements.
Must have
SourceLocation
for parsing error & debug #857Pending review & update
ImportedProcModuleNotFound
#840Nice to have
The text was updated successfully, but these errors were encountered: