Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AST Node ID #4188

Closed
Boshen opened this issue Jul 11, 2024 · 27 comments
Closed

AST Node ID #4188

Boshen opened this issue Jul 11, 2024 · 27 comments
Assignees
Labels
A-ast Area - AST A-transformer Area - Transformer / Transpiler blocker P-high Priority - High

Comments

@Boshen
Copy link
Member

Boshen commented Jul 11, 2024

Blocking issues:

Related discussions:

Previous attempts:

Research:

  • TODO
  • rustc
  • other compilers
@Boshen Boshen added A-ast Area - AST A-transformer Area - Transformer / Transpiler labels Jul 11, 2024
@Boshen Boshen added this to the Transformer Milestone 2 milestone Jul 11, 2024
@Boshen Boshen self-assigned this Jul 11, 2024
@Boshen Boshen changed the title AST Node ID AST Node ID (Research Phase) Jul 11, 2024
@overlookmotel
Copy link
Contributor

I'm planning to look into this question as part of review of semantic. I've suggested a solution in the meantime on rolldown/rolldown#1578 to unblock them.

@rzvxa

This comment was marked as resolved.

@rzvxa
Copy link
Contributor

rzvxa commented Jul 12, 2024

Another use case for AST node ID. We can remove these methods and free up a few CPU cycles and some memory while building the semantics.

fn record_ast_nodes(&mut self) {
self.ast_nodes_records.push(Vec::new());
}
fn retrieve_recorded_ast_nodes(&mut self) -> Vec<AstNodeId> {
self.ast_nodes_records.pop().expect("there is no ast nodes record to stop.")
}
fn record_ast_node(&mut self) {
if let Some(records) = self.ast_nodes_records.last_mut() {
records.push(self.current_node_id);
}
}

@Boshen
Copy link
Member Author

Boshen commented Aug 5, 2024

Rolldown needs to implement inject plugin https://esbuild.github.io/api/#inject, which requires an identifiable import statement to be inserted into the AST.

@overlookmotel
Copy link
Contributor

overlookmotel commented Aug 5, 2024

I have suggested to them that they could use either:

  1. Index of the statement in program.body.
  2. Memory address of the ImportDeclaration.
fn addr_of_import(decl: &Box<ImportDeclaration>) -> usize {
    &**decl as *const _ as usize
}

Either of the above will work as a unique identifier, and neither depends on span.

I'm not saying that we shouldn't add AstNodeId to all AST types - personally I'm undecided at present. But just that it's not the only way to solve Rolldown's use case.

@Boshen
Copy link
Member Author

Boshen commented Aug 6, 2024

I have suggested to them that they could use either:

  1. Index of the statement in program.body.
  2. Memory address of the ImportDeclaration.
fn addr_of_import(decl: &Box<ImportDeclaration>) -> usize {
    &**decl as *const _ as usize
}

Either of the above will work as a unique identifier, and neither depends on span.

I'm not saying that we shouldn't add AstNodeId to all AST types - personally I'm undecided at present. But just that it's not the only way to solve Rolldown's use case.

I may need to try it out myself at this point. I'll ask for a walk through to better understand the problem today.

@Boshen
Copy link
Member Author

Boshen commented Aug 10, 2024

Let's decide on this one next week. I'm a little bit worn out on discussing about this but constantly coming up with hacks.

@Boshen Boshen added the P-high Priority - High label Aug 11, 2024
@Boshen Boshen changed the title AST Node ID (Research Phase) AST Node ID Aug 12, 2024
@Boshen
Copy link
Member Author

Boshen commented Aug 12, 2024

I talked to @overlookmotel and here's what we agreed on:

To minimize code change and support Rolldown's use case as a start point, we should

@rzvxa It seems like this task is going to be on you again, feel free to ask questions to clear uncertainties.

@overlookmotel
Copy link
Contributor

I'm going to think about implementation over the course of this week and come back with a proposal.

A useful first step would be to gather the use cases we have for Node IDs. Can anyone who is aware of such needs please comment below?

For other discussion, we have a thread on Discord: https://discord.com/channels/1079625926024900739/1272506817238405160

@Boshen
Copy link
Member Author

Boshen commented Aug 12, 2024

AST node ids are super useful for caching data in a multi-pass compiler architecture.

A few points:

Concrete examples:

  • The type synthesizer prototype I wrote 2 years ago requires caching type for expressions
  • Our minifier will create out-of-band-storage for data in a previous ast pass and use it in a later pass

@Dunqing
Copy link
Member

Dunqing commented Aug 12, 2024

Example: #4767

@DonIsaac
Copy link
Contributor

DonIsaac commented Aug 12, 2024

I talked to @overlookmotel and here's what we agreed on:

To minimize code change and support Rolldown's use case as a start point, we should

@rzvxa It seems like this task is going to be on you again, feel free to ask questions to clear uncertainties.

I attempted making AstBuilder stateful while we working on #4367. The idea was to record the number of nodes/symbols/scopes encountered while parsing. It is, in fact, very painful. I saw three possible implementations, and tried the first two of them:

  1. Store an Rc<RefCell<Data>>. This is safe and doesn't require new lifetimes, but it makes it impossible to implement Copy and adds reference counting overhead. This becomes painful very quickly.
  2. Store a reference to &'s RefCell<Data>. This removes Rc overhead, but introduces a new lifetime parameter which makes this an even more painful refactor.
  3. Use ManuallyDrop<NonNull<UnsafeCell<Data>>> This uses unsafe rust, but it 1) doesn't add Rc overhead, 2) does not introduce any new lifetime parameters, and 3) preserves Copy. It should be sound since the parser is run synchronously and code mutating Data does so at the call site and never stores a mutable reference. However, it makes figuring out when to drop Data painful.

@DonIsaac
Copy link
Contributor

I'm going to think about implementation over the course of this week and come back with a proposal.

A useful first step would be to gather the use cases we have for Node IDs. Can anyone who is aware of such needs please comment below?

For other discussion, we have a thread on Discord: https://discord.com/channels/1079625926024900739/1272506817238405160

Lint rules use AstNodeId to

  1. Iterate up a node's parents, then match on kind
  2. Generalize logic that does not require knowledge about what kind of node is being used.

I'm currently in favor of a single AstNodeId approach, but my opinion isn't tightly held. I'm mostly concerned insofar as those two use cases are satisfied.

As a thought, if we want to store statement/expression-specific information, we could discriminate using an enum:

// nodes.rs in oxc_semantic
enum SpecificNodeData<'a> {
  Statement(StatementNodeData<'a>),
  Expression(ExpressionNodeData<'a>)
}

struct AstNodes<'a> {
  // ...
  specific_data: IndexVec<AstNodeId, SpecificNodeData<'a>>
}

@rzvxa
Copy link
Contributor

rzvxa commented Aug 12, 2024

If the goal is to add it to all the types that have an AstKind variant. We can use it to eliminate all the existing Cell<Option<ID_TYPE>> fields in the AST. We can get those via SOA. And there is no longer a need for a dedicated AstNode type. We can get rid of it since now AstKind is all we need.

I propose we do the following:

// since we no longer need the struct that had the same name before
trait AstNode {
    fn ast_node_id(&self) -> AstNodeId;
}

impl From<AstNode> for AstNodeId {
    fn from(node: &AstNodeId) -> AstNodeId {
        node.ast_node_id()
    }
}

impl<'a> AstNode for Statement<'a> {
    fn ast_node_id(&self) -> AstNodeId {
        self.node_id
    }
}

impl<'a> AstNode for AstKind<'a> {
    fn ast_node_id(&self) -> AstNodeId {
        match self {
            // kinds => kind.ast_node_id(),
        }
    }
}

Then we can rewrite all the methods in the AstNodes SOA to be like this:

// from:
    #[inline]
    pub fn kind(&self, ast_node_id: AstNodeId) -> AstKind<'a> {
        self.nodes[ast_node_id].kind
    }

// to:
    #[inline]
    pub fn kind(&self, ast_node_id: impl Into<AstNodeId>) -> AstKind<'a> {
        self.kinds[ast_node_id.into()]
    }

On the subject of specific ast IDs, We can use something like this to provide access to both types.

// we put this in our structs
struct AstKindId<T> {
    id: AstNodeId,
    _marker: PhantomData<T>,
}

impl<T> AstNode for AstKindId<T> {
    fn ast_node_id(&self) -> AstNodeId {
        self.id
    }
}

Then we can use this generic type to do all sorts of type-safe stuff like this:

pub fn reference_id(&self, ident_ref_id: AstKindId<IdentifierReference>) -> ReferenceId
    // ...
}

I'm not saying this is better, Both approaches have their pros and cons, But having a conversion from "more specific -> less specific" is possible, So we aren't losing that much, especially if we use the AstNode trait which abstracts all of these details away.

I'm looking forward to seeing what are your thoughts on it.

@overlookmotel
Copy link
Contributor

Thanks all for input into this.

@rzvxa Will come back on your proposals for APIs. But right now I'm still at the stage of trying to figure out what are all the use cases we want to support. I know you've brought some up previously - one related to CFG, if I remember right, but I think there were more. Could you please remind me?

In particular, are there any things we want to do which are currently impossible, or extremely painful/expensive without Node IDs?

@overlookmotel
Copy link
Contributor

overlookmotel commented Aug 12, 2024

Lint rules use AstNodeId to

  1. Iterate up a node's parents, then match on kind
  2. Generalize logic that does not require knowledge about what kind of node is being used.

@DonIsaac Do I understand right that both these things are already possible with what we have now? Is there anything in linter which is currently blocked but having node IDs stored inline in the AST types would make possible?

@overlookmotel
Copy link
Contributor

overlookmotel commented Aug 12, 2024

Just to throw one of my own ideas into the mix (which may or may not be a good idea):

If each AST type had node ID stored inline, Spans could be moved out of the AST into a separate side array indexed by the ID. It strikes me that Spans are pretty much only used for diagnostics (which is a cold path) and when compiling source maps, but otherwise they're almost never accessed - they just need to be maintained throughout transform process for purpose of source maps. Therefore, they're a good candidate to move out of the AST itself.

This would also allow us to add UTF-16 spans and/or line/column spans without bloating the AST (#959) - they'd also go in side arrays.

This may not be a good idea because it makes the AST less ergonomic to use (especially for those consuming Oxc as a library), but it does also have some positive effects.

@DonIsaac
Copy link
Contributor

Lint rules use AstNodeId to

  1. Iterate up a node's parents, then match on kind
  2. Generalize logic that does not require knowledge about what kind of node is being used.

@DonIsaac Do I understand right that both these things are already possible with what we have now? Is there anything in linter which is currently blocked but having node IDs stored inline in the AST types would make possible?

Correct, these are all currently possible. However, I don't think they'd be possible with an ExpressionId/StatementId as we were talking about on discord.

The main thing that is not possible right now is walking back up the AST in contexts where an AST node or an ID is not available. It's possible to work around this, but it would be quite helpful to have one on hand in each node.

@overlookmotel
Copy link
Contributor

Thanks for coming back Don. Can you give an example of where walking back up the AST from a node (ancestors) is useful, and what is the current workaround?

@DonIsaac
Copy link
Contributor

For examples of iterating up the AST, see no_unused_vars/usage.rs. Since iteration occurs over AstNodes, we have access to an AstNodeId on each loop iteration.

For examples of where upwards traversal could be made cleaner with AstNodeIds in each struct, see no_useless_spread.rs. On my phone so I can't elaborate too much further on how/why, but you should find a redundant match on AstKind between the first check function and the second two.

@rzvxa
Copy link
Contributor

rzvxa commented Aug 13, 2024

In particular, are there any things we want to do which are currently impossible, or extremely painful/expensive without Node IDs?

Here's my take on this, I think the root of our issue is that we can't walk down the tree using AstNodeId. If we get to walk down while keeping access to the AstNodeIds of our descendants we can do anything we want including walking back up. I've tried to work around it in #3800 but that's just too much redundant information to make it fast.
So currently we all use hacky stuff to get over this limitation, I personally had to use CFG to access this information in the no_fallthrough rule which to be honest is painful to look at:

let (tests, exit) = graph
.edges_directed(node.cfg_id(), Direction::Outgoing)
.fold((Vec::new(), None), |(mut conds, exit), it| {
let target = it.target();
if !matches!(it.weight(), EdgeType::Normal) {
(conds, exit)
} else if cfg
.basic_block(target)
.instructions()
.iter()
.any(|it| matches!(it.kind, InstructionKind::Condition))
{
let is_empty = graph
.edges_directed(target, Direction::Outgoing)
.filter(|it| matches!(it.weight(), EdgeType::Jump))
.exactly_one()
.ok()
.and_then(|it| {
cfg.basic_block(it.target())
.instructions()
.first()
.and_then(|it| it.node_id)
.map(|id| ctx.nodes().parent_kind(id))
.and_then(|it| match it {
Some(AstKind::SwitchCase(case)) => Some(case),
_ => None,
})
})
.is_some_and(|it| it.consequent.is_empty() || it.consequent.iter().exactly_one().is_ok_and(|it| matches!(it, Statement::BlockStatement(b) if b.body.is_empty())));
conds.push((target, is_empty));
(conds, exit)
} else {
(conds, Some(target))
}
});

There is also the need for Semantic::record_ast_node in our hot path which we can eliminate if we have access to the ast_node_id.

In my opinion, the most valuable thing is that we can now keep a bunch of secondary data all over the place - where we currently don't have access to AstNode - without paying for it upfront. Even if we have to use a hashmap, Hashing a number is much easier than a whole struct.

@overlookmotel
Copy link
Contributor

The type synthesizer prototype I wrote 2 years ago requires caching type for expressions

@Boshen Can I just clarify: Is that really type for expressions, or is it type of bindings? i.e. could that info be in an array indexed by SymbolId?

@overlookmotel
Copy link
Contributor

overlookmotel commented Aug 13, 2024

I attempted making AstBuilder stateful while we working on #4367. The idea was to record the number of nodes/symbols/scopes encountered while parsing.

I'm a bit confused why this is problematic in parser. All ParserImpl's methods have access to &mut ParserImpl and ParserImpl contains an AstBuilder. So can't methods access a &mut AstBuilder via self.ast without the need for RefCells and suchlike?

@DonIsaac
Copy link
Contributor

I don't think so, but I'm not certain. AstBuilder is Copy, and there are parts of the code that rely on this behavior. Copying a mutable reference would violate rust borrow checker invariants. However, I never tested this; we could definitely try it.

@overlookmotel
Copy link
Contributor

Yeah it was me that made AstBuilder Copy. It was a good idea at the time! But we can switch it back to how it was before and pass &mut AstBuilders around. As you say, a &mut cannot be copied, but it can be re-borrowed.

Dunqing added a commit that referenced this issue Aug 15, 2024
close: #3943

## Further improvements

There is a double visit here. We need to collect all react hooks calling in `Function` and `ArrowFunctionExpression`. If we want to remove this implementation, we may wait for #4188.

https://github.com/oxc-project/oxc/blob/d797e595d286c613848b773c256bd43124ad1981/crates/oxc_transformer/src/react/refresh.rs#L744-L947

## Tests

All tests copy from https://github.com/facebook/react/blob/main/packages/react-refresh/src/__tests__/ReactFresh-test.js

There are still 4 tests that have not been passed

**1. refresh/can-handle-implicit-arrow-returns/input.jsx**

Related to #4767. transform correct, just output doesn't match the expected output

**2. refresh/registers-identifiers-used-in-jsx-at-definition-site/input.jsx**
**3. refresh/registers-identifiers-used-in-react-create-element-at-definition-site/input.jsx**

Blocked by #4746

**4. refresh/supports-typescript-namespace-syntax/input.tsx**

oxc transforms ts to js first, so probably we can ignore this case. If we really want to pass this test, we also need to turn off `TypeScript` plugin.

## What's next?

### Options:

1. Support transform `refresh_reg` and `refresh_sig` options to `MemberExpression`. Currently `import.meta.xxxx` still is an `Identifier`
2. Support `emit_full_signatures` option

### Other
NAPI, testing in `monitor-oxc`, etc..
Dunqing added a commit that referenced this issue Aug 15, 2024
close: #3943

## Further improvements

There is a double visit here. We need to collect all react hooks calling in `Function` and `ArrowFunctionExpression`. If we want to remove this implementation, we may wait for #4188.

https://github.com/oxc-project/oxc/blob/d797e595d286c613848b773c256bd43124ad1981/crates/oxc_transformer/src/react/refresh.rs#L744-L947

## Tests

All tests copy from https://github.com/facebook/react/blob/main/packages/react-refresh/src/__tests__/ReactFresh-test.js

There are still 4 tests that have not been passed

**1. refresh/can-handle-implicit-arrow-returns/input.jsx**

Related to #4767. transform correct, just output doesn't match the expected output

**2. refresh/registers-identifiers-used-in-jsx-at-definition-site/input.jsx**
**3. refresh/registers-identifiers-used-in-react-create-element-at-definition-site/input.jsx**

Blocked by #4746

**4. refresh/supports-typescript-namespace-syntax/input.tsx**

oxc transforms ts to js first, so probably we can ignore this case. If we really want to pass this test, we also need to turn off `TypeScript` plugin.

## What's next?

### Options:

1. Support transform `refresh_reg` and `refresh_sig` options to `MemberExpression`. Currently `import.meta.xxxx` still is an `Identifier`
2. Support `emit_full_signatures` option

### Other
NAPI, testing in `monitor-oxc`, etc..
@Boshen Boshen assigned overlookmotel and unassigned Boshen Aug 23, 2024
@Boshen Boshen added this to the Transformer Milestone 2 milestone Aug 29, 2024
@overlookmotel
Copy link
Contributor

I have finally written up my thoughts: #5688

Side note: #5674 gets rid of Semantic::record_ast_node and demonstrates that we don't need node IDs stored on AST types for this use case. -1% perf regression on semantic benchmarks, but I'm not sure if that's spurious or not - you'd think it should be faster as it's less code. (The unsafe code could also be removed if we do oxc-project/backlog#109).

@Boshen
Copy link
Member Author

Boshen commented Sep 11, 2024

Continue in #5689

@Boshen Boshen closed this as completed Sep 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-ast Area - AST A-transformer Area - Transformer / Transpiler blocker P-high Priority - High
Projects
None yet
Development

No branches or pull requests

5 participants