Introduce codegen IR #94

Merged
merged 23 commits into from
Jul 14, 2018

katef
Owner

@katef katef commented Jul 13, 2018

This PR introduces an IR data structure for libfsm, which sits after the DFA and before the code generation output.

The main idea is that code generation decisions are made here, rather than when printing code. This leaves code generation to simply walk the IR and print what's there, without mixing in logic. A few choices might still be made while printing, but those ought to be cosmetic only.

This IR is intended for producing code only. Some of the output from libfsm expresses state machines verbatim (especially libfsm/print/dot.c which attempts to show an FSM compactly but verbatim). Those still walk the FSM directly.

IR nodes correspond approximately to states in the FSM. This may change over time, since transformations may combine IR nodes.

There are several node types, for the various "strategies" of generating code for a particular state. Here's an example showing most of them:

; ./build/bin/re -pl ir -z '[^a-z].' 'm+[0-9a-f]([^a-zA-Z]A|[a-z]B|.B)?' \
    | dot -Tpng -o /tmp/x.png

[image: IR]

and its corresponding DFA:

[image: DFA]
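
As a rough illustration of what "one node type per strategy" could look like, here is a minimal sketch of a tagged union plus a printer which only walks it. Only `IR_NONE`, `enum ir_strategy` and `struct ir_state` are named in this PR; `IR_SAME`, `IR_DOMINANT`, `struct ir_range`, the union layout and `print_state()` are hypothetical names chosen for the example, and the real `ir.h` may well differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical tags: only IR_NONE appears in the excerpt from ir.h. */
enum ir_strategy {
	IR_NONE     = 1 << 0, /* no outgoing edges: every symbol errors */
	IR_SAME     = 1 << 1, /* every symbol leads to the same state */
	IR_DOMINANT = 1 << 2  /* explicit error ranges, default: to the mode */
};

/* An inclusive range of symbols (hypothetical layout). */
struct ir_range {
	unsigned char lo;
	unsigned char hi;
};

/* One IR node; the strategy tag says which union member is live. */
struct ir_state {
	enum ir_strategy strategy;
	union {
		struct { unsigned to; } same;
		struct {
			unsigned mode;                 /* target of default: */
			const struct ir_range *errors; /* symbols which error */
			size_t n;
		} dominant;
	} u;
};

/*
 * Print the C for one node, making no decisions beyond the tag.
 * Returns the number of case labels written.
 */
int
print_state(FILE *f, const struct ir_state *s)
{
	int labels = 0;
	size_t i;
	unsigned c;

	assert(f != NULL);
	assert(s != NULL);

	switch (s->strategy) {
	case IR_NONE:
		fprintf(f, "return TOK_UNKNOWN;\n");
		break;

	case IR_SAME:
		fprintf(f, "state = S%u; break;\n", s->u.same.to);
		break;

	case IR_DOMINANT:
		fprintf(f, "switch ((unsigned char) c) {\n");
		for (i = 0; i < s->u.dominant.n; i++) {
			const struct ir_range *r = &s->u.dominant.errors[i];

			/* c is unsigned so the loop terminates when hi is 255 */
			for (c = r->lo; c <= r->hi; c++) {
				fprintf(f, "case 0x%02x:\n", c);
				labels++;
			}
		}
		if (labels > 0) {
			fprintf(f, "\treturn TOK_UNKNOWN;\n");
		}
		fprintf(f, "default: state = S%u; break;\n", s->u.dominant.mode);
		fprintf(f, "}\n");
		break;
	}

	return labels;
}
```

The point of the sketch is that the printer contains a `switch` on the tag and nothing else: which strategy applies to a state was decided when the node was built, so a printer for another output language could reuse the same nodes.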

katef added 22 commits June 23, 2018 08:48
…; no need for the dependency on the state interface here.
Nodes here currently correspond to FSM states, although that need not hold in the future, especially if identifying parallel walks becomes an option.

The intention is for code generation decisions to be made explicit in this IR, so that the code generation outputs things roughly as given. This way, I hope various optimisations can be shared across multiple output languages, and the code output parts can be simpler.
This states the ranges which error explicitly, rather than leaving them to a `default:` clause. The `default:` clause can then be used for the dominant mode, the most common transition to another state.

The main situation I have in mind is for regexps like `/[^abc]/`, where writing out every matching symbol is a lot more cumbersome than writing out every symbol which doesn't match. Thus we can generate code like:
```
; ./build/bin/re -plc '[^abc][xyz]'

int
fsm_main(int (*fsm_getc)(void *opaque), void *opaque)
{
	int c;

	assert(fsm_getc != NULL);

	enum {
		S0, S1, S2
	} state;

	state = S0;

	while (c = fsm_getc(opaque), c != EOF) {
		switch (state) {
		case S0: /* start */
			switch ((unsigned char) c) {
			case 'a':
			case 'b':
			case 'c': return TOK_UNKNOWN;
			default: state = S1; break;
			}
			break;

		case S1: /* e.g. "d" */
			switch ((unsigned char) c) {
			case 'x':
			case 'y':
			case 'z': state = S2; break;
			default:  return TOK_UNKNOWN;
			}
			break;

		case S2: /* e.g. "dx" */
			return TOK_UNKNOWN;

		default:
			; /* unreached */
		}
	}

	/* end states */
	switch (state) {
	case S2: return 0x1; /* "[^abc][xyz]" */
	default: return EOF; /* unexpected EOF */
	}
}
```
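
One way to picture the choice between "list the matches" and "list the errors" is to find the commonest outcome for a state and give that outcome the `default:` clause. The sketch below assumes a flat 256-entry table of destinations per state; `SYMBOLS`, `NO_EDGE` and `pick_default()` are hypothetical names for this example, and it is not necessarily how libfsm computes the mode.

```c
#include <assert.h>
#include <stddef.h>

#define SYMBOLS 256
#define NO_EDGE (-1) /* hypothetical marker: this symbol has no transition */

/*
 * Return the commonest entry in to[], which is NO_EDGE when erroring is
 * the commonest outcome. On return, *listed holds how many symbols are
 * left over and so need explicit case labels. Ties go to the entry seen
 * first.
 */
int
pick_default(const int to[SYMBOLS], size_t *listed)
{
	size_t best_count = 0;
	int best = NO_EDGE;
	size_t i, j;

	assert(to != NULL);
	assert(listed != NULL);

	for (i = 0; i < SYMBOLS; i++) {
		size_t count = 0;

		/* count how many symbols share the outcome of symbol i */
		for (j = 0; j < SYMBOLS; j++) {
			if (to[j] == to[i]) {
				count++;
			}
		}

		if (count > best_count) {
			best_count = count;
			best = to[i];
		}
	}

	*listed = SYMBOLS - best_count;

	return best;
}
```

For the start state of `[^abc][xyz]` this picks S1 as the default and leaves three symbols to list as errors. For the second state it picks the error outcome as the default and leaves `x`, `y` and `z` to list, which matches the shape of the generated code above.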
*/

enum ir_strategy {
IR_NONE = 1 << 0,
Collaborator

Based on the comment below, is this being used both as a type tag for the struct ir_state union and as a set of allowed strategies in make_ir?

Owner Author

No, just the former. I originally considered adding a mask of which strategies to allow, but I tried that and didn't like it much. Currently I think it makes more sense to set options like "always make a table" in struct fsm_options instead.
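
For readers wondering why the question arises at all: giving each enumerator its own bit (`1 << n`) means a value can serve as a tag on its own, and several values can also be OR'd together into a set. The sketch below shows that second use, which is the idea described above as tried and dropped. `IR_SAME`, `IR_TABLE` and `strategy_allowed()` are hypothetical names for illustration only.

```c
#include <assert.h>

/* Only IR_NONE is taken from the PR; the other tags are hypothetical. */
enum ir_strategy {
	IR_NONE  = 1 << 0,
	IR_SAME  = 1 << 1,
	IR_TABLE = 1 << 2
};

/*
 * Hypothetical helper: test whether strategy s is present in a mask of
 * permitted strategies. Returns 1 if allowed, 0 otherwise.
 */
int
strategy_allowed(unsigned mask, enum ir_strategy s)
{
	return (mask & (unsigned) s) != 0;
}
```

A caller would build a mask such as `IR_NONE | IR_TABLE` and consult it when choosing a strategy per state. When the enum is used purely as a tag, exactly one bit is ever set, and the power-of-two values are harmless.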

#include "lx/ast.h"
#include "lx/print.h"

/* XXX: abstraction */
int
-fsm_print_cfrag(FILE *f, const struct fsm *fsm,
+fsm_print_cfrag(FILE *f, const struct ir *ir, const struct fsm_options *opt,
Collaborator

Once it's working with the IR, what is struct fsm_options *opt still needed for? Could that be stored within the IR instead?

Owner Author

For various rendering options, like "always hex". I could perhaps duplicate those into an equivalent struct ir_options containing just the relevant subset. I actually tried splitting the .c files here so that they don't include any of the FSM structs at all, but I decided there wasn't any benefit, since this is all internal anyway.

@katef katef merged commit dfc5b8a into master Jul 14, 2018
@katef katef deleted the codegen-ir branch July 14, 2018 16:34
@katef katef mentioned this pull request Dec 12, 2019