-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Auto merge of #17037 - davidsemakula:token-set-collisions, r=Veykril
internal: improve `TokenSet` implementation and add reserved keywords The current `TokenSet` type represents "A bit-set of `SyntaxKind`s" as a newtype `u128`. Internally, the flag for each `SyntaxKind` variant in the bit-set is set as the n-th LSB (least significant bit) via a bit-wise left shift operation, where n is the discriminant. Edit: This is problematic because there's currently ~121 token `SyntaxKind`s, so adding new token kinds for missing reserved keywords increases the number of token `SyntaxKind`s above 128, thus making this ["mask"](https://github.com/rust-lang/rust-analyzer/blob/7a8374c162c64c17e865b98aad282d16b16e96d6/crates/parser/src/token_set.rs#L31-L33) operation overflow. ~~This is problematic because there's currently 266 SyntaxKinds, so this ["mask"](https://github.com/rust-lang/rust-analyzer/blob/7a8374c162c64c17e865b98aad282d16b16e96d6/crates/parser/src/token_set.rs#L31-L33) operation silently overflows in release mode.~~ ~~This leads to a single flag/bit in the bit-set being shared by multiple `SyntaxKind`s~~. This PR: - Changes the wrapped type for `TokenSet` from `u128` to `[u64; 3]` ~~`[u*; N]` (currently `[u16; 17]`) where `u*` can be any desirable unsigned integer type and `N` is the minimum array length needed to represent all token `SyntaxKind`s without any collisions~~. - Edit: Add assertion that `TokenSet`s only include token `SyntaxKind`s - Edit: Add ~7 missing [reserved keywords](https://doc.rust-lang.org/stable/reference/keywords.html#reserved-keywords) - ~~Moves the definition of the `TokenSet` type to grammar codegen in xtask, so that `N` is adjusted automatically (depending on the chosen `u*` "base" type) when new `SyntaxKind`s are added~~. - ~~Updates the `token_set_works_for_tokens` unit test to include the `__LAST` `SyntaxKind` as a way of catching overflows in tests.~~ ~~Currently `u16` is arbitrarily chosen as the `u*` "base" type mostly because it strikes a good balance (IMO) between unused bits and readability of the generated `TokenSet` code (especially the [`union` method](https://github.com/rust-lang/rust-analyzer/blob/7a8374c162c64c17e865b98aad282d16b16e96d6/crates/parser/src/token_set.rs#L26-L28)), but I'm open to other suggestions or a better methodology for choosing `u*` type.~~ ~~I considered using a third-party crate for the bit-set, but a direct implementation seems simple enough without adding any new dependencies. I'm not strongly opposed to using a third-party crate though, if that's preferred.~~ ~~Finally, I haven't had the chance to review issues, to figure out if there are any parser issues caused by collisions due the current implementation that may be fixed by this PR - I just stumbled upon the issue while adding "new" keywords to solve #16858~~ Edit: fixes #16858
- Loading branch information
Showing
4 changed files
with
101 additions
and
20 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters