Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Codegen update -- Assemblers, and many new representations #52

Merged
merged 102 commits into from
Jun 15, 2020

Conversation

warpfork
Copy link
Collaborator

Warning: large diff.

This is a big update and rework of codegen.

Codegen was briefly left behind entirely by the winter update to the core interfaces which moved us to "assembler"-oriented operation -- wherein we mostly assign data into already-bulk-allocated memory -- over the previous "builders". As a result, this is very nearly a total reboot. The use of textual templating as the primary mechanism remains the same; and the overall pattern of {node, assembler, reprNode, reprAssembler} remains the same; but most of the actual code has churned.

Some of the highlight:

  • There's now working proof of a type with various representation strategies! Check out structs: they have a representation as a map, and also a representation of "stringjoin", which behaves very differently.
  • There's now an "AdjunctConfig" system, and this supports customization of anything that's goland-and-this-code-generator-specific (as opposed to anything schema-specified).
    • Among the things that can be done with this is overriding the text used in creating symbols in the generated code. This isn't quite wired up all the way to user-configurability, but the infra should all be in place to do so.
    • You can configure whether "MaybeT" types -- generated and used whenever something is either nullable or optional -- use pointers internally, or not. (This might be something an advanced user legitimately wants to tune for performance reasons.)
  • Everything is (of course) now generating NodeAssembler interfaces.
    • These get nested and embedded in each other. So, whole deep trees of data can be created and even the assemblers get amortized into a single allocation.
  • There's test infrastructure that's actually satisfying!
    • You can run 'go test' on the gen package and it will...
      1. exercise and test the generation itself
      2. ensure that the generated code actually compiles
      3. link the compiled objects back into the test process and run functional tests over them!
    • There's some caveats to this: unfortunately, some of the systems I used here don't work on windows, and they also require (blegh!) some cgo, and thus C compilers on your $PATH. I'm not thrilled with this, but it has much better outcomes than the alternatives -- you can see the commit messages in the development process for more detail, including fully worked attempts at alternatives.
  • ... much more.

This still has a decent number of TODOs, FIXMEs, and REVIEWPLZ comments in it, but may nonetheless already be an interesting read.

There's also a lot of code which is not yet even remotely "DRY". I've been intentionally not DRY'ing things out on the first pass: it's been reliably non-obvious so far what parts are actually going to turn out reusable and at what granularity and boundaries, so it seems better (and less frankly brain-melting) to do things with high redundancy now, and factor it back out later once it's clear what works and what doesn't.

Despite the roughness, I might land this back on master soon. It's still not recommended for incautious general use... but it's getting done enough to kick it around if you're feeling experimental, and it's certainly closer to useful than what's on master presently, so it's probably more useful to move forward than to twiddle thumbs.

In research: I found that there are more low-cost ways to switch which
methods are available to call on a value than I thought.  Also, these
techniques work even for methods of the same name.  This is going to
improve some code in NodeAssemblers significantly -- there are several
situations where this will let us reuse existing pieces of memory
instead of needing to allocate new ones; even the basicnode package
now already needs updates to improve this.  It's also going to make
returning representation nodes from our typed nodes *significantly*
easier and and lower in performance costs.  (Before finding methodsets
are in fact so feasible, I was afraid this was going to require our
typed nodes to embed yet another small struct with a pointer back to
themselves so we can have amortized availability of value that contains
the representation's logic for the Node interface... which while it
certainly would've worked, would've definitely made me sigh deeply.)
Quite exciting for several reasons; only wish I'd noticed this earlier.

Also in research: I found a novel way to make it (I believe) impossible
to create zero values of a type, whilst also making a symbol available
for it in other packages, so that we can do type assertions, etc,
with that symbol.  This is neat.  We're gonna use this to make sure
that types in your schema package can never be created without passing
through any validation logic that the user applies.

In codegen: lots of files disappear.  I'm doing a tabula rasa workflow.
(A bunch of the old files stick around in my working directory, and
are being... "inspirational"... but everything is getting whitelisted
before any of it ports over to the new commits.  This is an effective
way to force myself to do things like naming consistency re-checks
across the board.  And there's *very* little that's getting zero change
since the changes to pointer strategy and assembler interface are so
sweeping, so... there's very little reason *not* to tabula rasa.)

Strings are reimplemented already.  *With* representations.

Most of the codegen interfaces stay roughly the same so far.
I've exported more things this time around.

Lots of "mixins" based on lessons learned in the prior joust.
(Also a bunch of those kind-based rejections look *much* nicer now,
since we also made those standard across the other node packages.)

Some parts of the symbol munging still up in the air a bit.
I think I'm going to go for getting all the infrastructure in place
for allowing symbol-rename adjunct configuration this time.
(I doubt I'll wire it all the way up to real usable configuration yet,
but it'll be nice to get as many of the interventions as possible into
topologically the right places to minimize future effort required.)

There's a HACKME_wip.md file which contains some other notes on
priorities/goals/lessoned-learned-now-being-applied in this rewrite
which may contain some information about what's changing at a higher
level than trying to track the diffs.  (But, caveat: I'm not really
writing it for an audience; more my own tracking.  So, it comes with
no guarantee it will make sense or be useful.)
I *think* the notes on Maybes are now oscillating increasingly closely
around a consistent centroid.  But getting here has been a one heck of
a noodle-scratcher.
Compiles and runs; but the results don't.
Some small issues; some big issues.  Just needed a checkpoint.

It's feeling like it's time to deal with breaking down the assembler
generators in the same way we did for all the node components, which
is going to be kind of a big shakeup.
…values.

(That line kind of says a lot about why this project is tricky, doesn't it.)
This cleans up a lot of stuff and reduces the amount of boilerplate
content that's just generating monomorphizing error method stubs.

The nodeGenerator mixins now also use the node mixins.  I don't know
why I failed to do that from the start; I certainly meant to.
It results in shorter generated code, and the compiler turns it into
identical assembly.
…sts AssignNode validation.

The HACKME_builderBehaviors.md file in the repo root says we should be
able to procede from these repeated keys conditions.

The AssignNode systems now use Finish methods more reasonably.
In the absense of actual validation logic, I don't know if this really
structurally mattered, but it feels better now.

There's one further comment in basicnode.Map and basicnode.List's
AssignNode method about semantics when it's used after *some* data has
already been added; this should probably be addressed at some point.
I'm pushing it off the "today" plate though, since it's not a critical
bug that violates immutability or some such.

All three of these issues were noticed while building the related
semantics in codegen for structs.
Lots of tasty bitshuffling in this commit.

Remaining: creating, embedding, and returning child value assemblers.
This might involve a full type per field: doing so would be direct,
have pretty excellent verbosity in debugging symbols,
and pointers back to parent suffer no interface indirection;
overall, it's probably going to have the fastest results.
However, it would also mean: we can't save resident memory size if we
have more than one field in a struct that has the same type;
and we we can't avoid increasing binary size if the same types are used
as fields in several places.
Going to pause to think about this for a while.
…ed using a pointer; attempt the finishCallback system for reusable child assemblers.

This... almost appears to be on a dang nice track.

Except one fairly critical thing.  It doesn't work correctly if your
maybe IS implemented as containing a pointer.

Because that pointer starts life as nil in the parent structure.

And that's hard to fix.

We can't have a pointer to a pointer in the assembler;
that'd cause the type bifructation we've been trying to avoid.
(We can give up and do that of course; it would be correct.
Just... more code and larger final assembly size.
And if we did this, a lot of this diff would be rewritten.)

We could have the parent structure allocate a blank of the field,
and put a pointer to it in the maybe and also in the child assembler.
Except... this is a wasted allocation if it turns out to be null.

Rock and hard place detected in a paired configuration.
Might have to back out of this approach entirely.  Not sure.
Easier than previous commit message seemed to think.

Since we already have finishCallback functions in play in any scene
where maybes can also be in play, we can just make those responsible
for copying out a pointer from the child assembler to the parent value.
That makes the child assembler able to create a new value late in
its lifecycle and still expect it can land in a proper home.

Tests needed badly; this is *oodles* of edge cases.
Unrelated fix needed first.  (Might make that elsewhere and rebase
this to include that, to reduce the carnage ever so slightly.)
Sanity check level: gen result compiles.

Next: time to target tests at all this jazz that actually exercise
and run the generated code.
I'm almost a little staggered.  We've got structures with maybes
both using pointers and not, and everything seems fine.

A few fixes were necessary, but they were mostly "d'oh" grade.
(The finishCallback logic is a bit hairy, but appears correct now.)

More cases for more variations in presence coming up next.
I started doing double-duty in tests to keep the volume down:
the block for nullables tests both 'nullable' and 'nullable optional';
and the optionals similarly.  Will make failures require more careful
reading to discern, but probably a fine trade.
I was particularly worried about trailing optionals.  Happily,
they appear to work fine.

Representation iterators and smooth stopping included.

Starting to wonder how best to organize these tests to continue.
333 lines already?  Questionable scalability.
Still have a lot of uncertainty to resolve around nullable and optional
in general.  But writing that up elsewhere -- trying to treat it from
the spec angle without getting the implementation too mixed up in it.
Some touches to ImplicitValue came along for the ride, but aren't
exercised yet.

Tests are getting unwieldy.  I think something must be done about
this before proceding much further or it's going to start resulting
in increasingly significant velocity loss.
Using golang's plugin feature, we can... well, *do* this.

To date, testing codegen has involved running the "test" in the gen
package to get it to emit code; and then switching to the emitted
package and _manually_ running the tests there.

Now, running `go test` in the gen package is sufficient to do
*everything*: both the generation, and the result compiling,
and we can even write tests against the interfaces and run those,
all in one step.

There's also lots of stuff that becomes possible now that we can easily
generate multiple separate packages with various codegen outputs:

- Overall: everything is granular.  We can test selections of things,
  rather than needing to have everything fall into place at once.
- Generally more organized results.
- We can more easily inspect the size of generated code.
- We can more easily inspect the size of the compiled result of gen!
  (Okay, not really.  I'm seeing a '.so' file that's 4MB is coming out
  from 200sloc of "String".  I don't think that's exactly informative.
  Some constant factor is thoroughly obscuring the data of interest.
  Nice idea in theory though, and maybe we'll get there later.)
- We can diff the generated type for variations in adjunct config!
  (Probably not something that will end up tested, but neat to be able
  to do during dev.)

Doing this with go plugins seemed like the neatest way to do this.
It's certainly not the only way, though.  And in particular, I will
confess that this will probably make developing from a windows host
pretty painful: go plugins aren't supported on windows.  Mind,
this doesn't mean you can't *use* codegen or its results on windows.
It just means the tests won't work.  So, someone doing development
_on the codegen itself_ would have to wait for the CI server to run
the tests on their behalf.  Hopefully this is survivable.

(There's also a fun wee wiggle in that loading a plugin has the
requirement that it be built with the same "runtime".  The definition
of "runtime" here happens to include whether or not things have been
built in "race" mode.  So, the '-race' flag disappears from our CI
config file in this diff; otherwise, CI will get the (rather confusing)
error "plugin was built with a different version of package runtime".
This isn't really worrying to ditch, though.  I'm... not even sure why
the '-race' was in our CI script in the first place.  Must've arrived
via cargo cult; we don't _have_ any concurrency in this library.)

An alternative way to go about all this would be to have the tests for
gen invoke `go test` (rather than `go build` in plugin mode) on each of
the generated packages.  It strikes me as similar but worse.
We still have to invoke the go tools from inside the test;
we'd have *more* work to do to either copy tests into the gen'd package
or else generate calls back to the parent package for test functions
(which still have to be written against interfaces, so that they can
compile even when the gen isn't done, as it indeed isn't when you
freshly check out the repo -- exact same as with the plugin approach);
and in exchange for the extra work, we get markedly worse output
('go test' doesn't nest nicely, afaict), and we can't separate the
compiling of the generated code from the evaluation of tests on it,
and we'd have no possibility of passing information via closures should
we wish to develop table-driven tests where this would be useful.
tl;dr: Highest cost, uglier, and worse.

No matter which way we go about this, there *is* a speed trade-off.
Invoking the compiler per package adds at least a half second of time
for each package, in practice.  Worth it, though.

And on the off chance that this plugin approach does burn later,
and we do want to switch to child process 'go test' invocations...
the good news is: we shouldn't actually have to rewrite very much.
The everything-starts-from-NodeStyle-and-tests-against-Node work is
exactly the same for making the plugin approach work, and will
trivially pivot to working fine in for child 'go test' approaches,
should we find it necessary to do so in the future.  So!  With our
butts covered: a plugin'ing we shall go!

Some of the code here still needs cleanup; this is a proof of concept
checkpointing commit.  (The real thing probably won't have such
function names as "TestFancier".)  But, we do get to see here:
plugins work; more than one in the process works; and they work even
when the same type names are in the generated packages.  All good.
In the parent commit, we tried using plugins.  Turns out there's one
big, big drawback to plugins.  Building one invokes 'gcc'.
Works; as long as you... have gcc, and don't mind that having that as
a dependency that you now have to track and potentially worry about.
I don't consider that minor.

So, okay.  What's using 'go test' child invocations look like?

Well, bad.  It looks... bad.

The output is a mess.  Every child process starts its output as if its
a top level thing.  Why this happens is perfectly understandable and
hard to be mad at, but nonetheless the result is maddeningly illegible.
Cleaning this up might be possible, but will require some additional
effort to be invested.

There's some more irritating problems than that, though.
Notice the entirely new package that's sprouted.

Not being able to use closures in the test assertions is unfortunate.

Having to put the test assertions in *entirely different files* than
their setup... now that's really painful.

(And yeah, sure, we can move the setup out to another package, too.
It's not pleasing, though.  Then what of the test is actually in the
package it's testing?  The entrypoint?  So then, I'd end up editting
the test setup and assertions in one package (but running tests on
that package would actually no-op; syke!) and then have to switch to
the other package to actually run it?  There's no angle this can be
sliced that results in any sort of _nice_.)

Flags like '-v' aren't inherited now, either.  Hooray.
We could try to hack that in, I suppose.  Detect '-test.v' arg.
But it's just another thing that needs patching up to be shipshape.

Speaking of which, '-run' arg isn't even *remotely* manageable
without massive effort.  The thing can contain regexps, for goodness
sake -- how can we possibly semantically modify those for passdown?

I'm not sure where to go with this.  Depending on gcc in tests is
undesirable.  But all these other effects from trying to use 'go test'
this way seem even *more* undesirable.
'go test' just really wasn't designed for this.  There's no first class
API-driven way to interact with it at all.  Even the 'go test -json'
capabilities, despite being built-in, appear to be strap-ons rather
than logically central consistent abstractions.

Where things can be parsed at all (and in general, they can't -- not
without significant potential for ambiguity slipping in from some
angle or another), trying to output them again in a normalized form
is seriously tricky business: a great deal of re-buffering is required.
(This diff doesn't have it: I started off far too optimistic about the
simplicity of the whole thing, and am putting it down rather than
continue.  I get the feeling I could tumble down the rabbit hole quite
some time and still only ever be *approaching* correctness
asymptotically -- never actually *getting* there.)

In addition, all the problems enumerated in the previous commit
('-v' needs to be passed down; '-run' is more or less impossible; and
all the sadness about closures and more helper files split across whole
package boundaries just for maximum distraction) are still here.

Yeah... let's... let's put this one back in the box.
I'm going to merge-ignore this and keep the plugin approach instead.
As much as I *really* don't want gcc as a test dep, all this stuff...
this seems worse.  (And maybe, someday, Go can be improved so plugins
don't have the gcc dep.  I'm going to hope for that.)
…degen-using-plugins

Both commits on the merged branch contain long comments explaining why
that approach was not satisfying.

I'm feeling a bit salty about the whole thing, truth be told.
I really expected there to be APIs somewhere in this vicinity.
Output is fairly fabulous.  Really happy with this.

Diff size is nice.  Took the opportunity to do some cleanups and the
net result is 193 insertions and 351 deletions counted by line.

Most importantly, though: running 'go test .' in the gengo package
actually generates, compiles, and tests **everything**.

The test file inside the "_test" dir that you previously had to jankily
run manually is gone.  Hooray!

And already, we've got three distinct packages being generated and
tested.  (Try 'diff _test/stroct*/tStroct.go'.  It's neat.)

The time overhead is acceptable so far.  The total time is up to 0.9sec
on my current machine.  Not as bad as feared.

Overall: nice.  I think this reduces the test friction to a point that
will significantly enable further progress.
warpfork added 23 commits May 8, 2020 10:34
Wanting this as I work on list representations, and making
target-of-opportunity improvements as I encounter them.
Trying to minimize gsloc here as much as feasible -- a lot of logic
*is* in common with the type-level node.

We'll see how this goes, though.  If we inspect the performance and
the assembly output at the end of this, and it's nuts, we might
reconsider how to approach this substantially.
A lot of the coments in earlier commit messages about choosing a path
and cutting through it roughly and leaving DRY for latter still
applies (see f1eeeaf).

The hardest part of writing this was giving up on having shared code
in the output.  It just can't be done along the current edges without
giving up some performance at runtime, and that's generally not a trade
we want to make.  Maybe there's different cuts for function boundaries
that would do better -- but we'll leave that for later fun work.

I did at least get a bunch of templates to be textually shared.

Comments abound in the diff for future possibilities, as well as what's
not possible to go further on without making undesired trades.
…ives.

Would cause the same memory to get reused inappropriately; need the nil
value there to kick the child assembler to do the new allocation on the
next round.
Having a nil pointer dereference cause a panic *during* error stringing
is really unpleasant.

We shouldn't "ever" have this problem, when the library is done.
But right no we certainly do.
Correct representational creation mode still coming up, as you can
see in the fixme comment.

Tests will come in the next commit along with the creation mode.
…e when they recurse.

This is pretty similar to the extractions that made it fly for lists.

I've added a few more comments to the "genparts*" files, because on
further reflection, a lot of those methods are... less reusable than
the name might seem to claim.  We'll see how those evolve.

Altered TestMapsContainingMaps quite a bit to now also test that
representation mode is tracked correctly across both read and creation
recursions, in a similar way to how TestListsContainingLists does
(e.g., uses some structs towards the leaf nodes, so it can use the
rename directives to create visible differences).
A fair amount of stuff is finished.

A fair amount of stuff isn't.

These are all the type kinds and representation strategies outlined in
https://specs.ipld.io/schemas/representations.html .
Some of them still lived over the "gendemo" package, and those are now
moved over here to the proper gen package.

Linked to more of the other documents from the main HACKME doc.
Less comments emitted in gen result.

Touch up the 'do not edit' comment.
Going to move it over to replace the (currently hand-written) gendemo
package shortly... but spread that over a few commits, in case the
diffs turn out interesting to look at.
Previously, it was manually written prototypes of what gen "would" look like.

Now it's the real deal :3
Includes one code fix, as well: generated maps now return 'nil' for
the value of lookups when returning an error for not-found.
(I waffled on this for a while, because it's arguable that `ipld.Undef`
is a perfectly reasonable sentinel value for this, and it's *arguable*
that any sentinel value is better than nil.  However... no.
A much, much clearer rule is "if a method returns an error, the value
is nil", and that the contrapositive is also true: and that's what
we're now going with and sticking with.)
All "genparts*" files are reusable bits of template;
all other "gen[A-Z](.*)" files are per kind.
I spent some time poking around at other applications that I'd want to
be able to use some of this stuff on, and the lines about "minimalist"
come from that.  I suspect that in ideal outcomes, I'd like to be able
to sometimes be able to generate the type structures and native/typed
APIs for a Schema definition... but perhaps not generate all the stuff
necessary for the interface monomorphization, because perhaps I just
want an enum purely for internal program purposes; surely we should be
able to just do that.  We'll see though.  Maybe that's scope creep
as much as anything else.
... almost.  In attempting to put the sections about absent value
handling into place, I realized there's a decently sizable bug there
already.  I'm going to document this for now, move on, and come back
to it later... so it will continue to take up space in a "wip" doc.
The description of the situation is at least cleaned up now.
@warpfork
Copy link
Collaborator Author

Okay -- I think it's about time to sync this with master and merge it.

It's still far from perfect, but 99 commits of work is far enough divergence. Work will continue on new (hopefully mostly smaller) branches after this.

@warpfork warpfork merged commit 41db59c into master Jun 15, 2020
@aschmahmann aschmahmann mentioned this pull request Feb 18, 2021
73 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants