feat(jmespath): add lexer component #2214
Merged
Description of your changes
This PR includes the implementation of the `Lexer` component, part of the JMESPath utility. As discussed in the linked issue, the purpose of a lexer is to break down the input JMESPath expression into smaller meaningful units (tokens). These tokens represent the building blocks of the JMESPath language and are defined in the language grammar (#2192).
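As an illustration of what these building blocks look like, tokens can be modeled as small typed objects. The shape and names below are assumptions for illustration only, not the actual types used in this PR:

```typescript
// Hypothetical token shape; the actual implementation may differ.
type Token = {
  type: string; // e.g. 'unquoted_identifier', 'dot', 'number'
  value: string; // the raw text of the token
  start: number; // index in the expression where the token begins
};

// For the expression 'foo.bar', a lexer would produce tokens like:
const example: Token[] = [
  { type: 'unquoted_identifier', value: 'foo', start: 0 },
  { type: 'dot', value: '.', start: 3 },
  { type: 'unquoted_identifier', value: 'bar', start: 4 },
];
```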
The `Lexer`'s main method (`public *tokenize()`) is implemented as a generator. This pattern allows the lexer to walk the expression iteratively and yield (aka return) tokens as it goes. While not a direct equivalent, the closest pattern to describe this implementation would be a recursive function that maintains an external state (aka a reducer). At each step, the lexer interprets a certain character and, based on its type, performs certain actions.
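To illustrate the generator pattern, here is a minimal sketch of a generator-based lexer. The class shape, token names, and helper method are illustrative assumptions, not the actual implementation in this PR, which handles many more token types:

```typescript
type Token = { type: string; value: string; start: number };

// Minimal, illustrative lexer sketch: whitespace is skipped, a dot is a
// simple single-character token, and identifiers are consumed greedily.
class Lexer {
  #expression = '';
  #position = 0;

  public *tokenize(expression: string): Generator<Token> {
    this.#expression = expression;
    this.#position = 0;
    while (this.#position < this.#expression.length) {
      const char = this.#expression[this.#position];
      if (char === ' ') {
        // Whitespace: advance with no further action
        this.#position++;
      } else if (char === '.') {
        // Simple token: a single character maps directly to a token
        yield { type: 'dot', value: '.', start: this.#position };
        this.#position++;
      } else if (/[A-Za-z_]/.test(char)) {
        // Identifier: consume until a non-identifier character is found
        yield this.#consumeIdentifier();
      } else {
        throw new Error(`Unexpected character at position ${this.#position}`);
      }
    }
  }

  #consumeIdentifier(): Token {
    const start = this.#position;
    while (
      this.#position < this.#expression.length &&
      /[A-Za-z0-9_]/.test(this.#expression[this.#position])
    ) {
      this.#position++;
    }
    return {
      type: 'unquoted_identifier',
      value: this.#expression.slice(start, this.#position),
      start,
    };
  }
}
```

With this sketch, `[...new Lexer().tokenize(' foo.bar')]` yields three tokens: `foo` (identifier, start 1), `.` (dot, start 4), and `bar` (identifier, start 5).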
To describe how the lexer works, let's take the expression ` foo.bar` as an example (the leading white space is intentional). With the expression above, the lexer will start looking at each character in the order they appear:
- **position 0** - since the first character is a white space, the lexer advances with no further action (source)
- **position 1** - next, the lexer encounters a valid character (`f`). At this point the lexer needs to understand how long this identifier is, so it advances the position until a non-identifier character (aka anything that is not a number or letter) is found (source). In this example it advances to position 4 and interprets `foo` as a single token.
- **position 4** - the next character is a dot (aka `.`), which in the context of a lexer is considered a simple token.
- **position 5** - next, the lexer encounters another character (`b`), so just like in one of the previous steps, it advances until a non-identifier character is found. In this case the lexer reaches the end of the expression.

This is a relatively simple example, but hopefully it helps clarify the flow of the processing. For simpler tokens the implementation is inlined in the `public *tokenize()` method, while in cases where processing a token requires more involved logic, a dedicated method was created.

Related issues, RFCs
Issue number: #2205
Checklist
Breaking change checklist
Is it a breaking change?: NO
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Disclaimer: We value your time and bandwidth. As such, any pull requests created on non-triaged issues might not be successful.