Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fresh take on codec APIs, and some tokenization utilities. #101

Merged
merged 7 commits into from
Nov 14, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
50 changes: 50 additions & 0 deletions codec/api.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
package codec

import (
"io"

"github.com/ipld/go-ipld-prime"
)

// Encoder is the essential definition of a function that takes IPLD Data Model data in memory and serializes it.
// IPLD Codecs are written by implementing this function interface (as well as (typically) a matched Decoder).
//
// Encoder functions can be composed into an ipld.LinkSystem to provide
// a "one stop shop" API for handling content addressable storage.
// Encoder functions can also be used directly if you want to handle serial data streams.
//
// Most codec packages will have a ReusableEncoder type
// (which contains any working memory needed by the encoder implementation,
// as well as any configuration options),
// and that type will have an Encode function matching this interface.
//
// By convention, codec packages that have a multicodec contract will also have
// a package-scope exported function called Encode which also matches this interface,
// and is the equivalent of creating a zero-value ReusableEncoder (aka, default config)
// and using its Encode method.
// This package-scope function will typically also internally use a sync.Pool
// to keep some ReusableEncoder values on hand to avoid unnecesary allocations.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why use a sync.Pool instead of an Encoder implementation type with an Encode method, turning the type Encoder here into an interface?

That would also make ReusableEncoder non-necessary, as far as I can tell.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea of having a sync.Pool is mostly in service to the goal of convenience for that package-scope function. Some applications have logic where keeping one Encoder around and reusing it is easy; some don't. I want a handy function that will usually DTRT and reuse an Encoder in an alloc-efficient way, without bothering me about it and making me carry something around. (And this was found to be useful before in one of the generations of libraries I consider a learning experience to feed into this one. [‡])

Could we make a package-scope convenience method that's convenient and just ignores the alloc cost of creating a new ReusableEncoder every time? Well, yeah. Don't wanna though :) At least, I think I don't. sync.Pool is supposed to be pretty cheap, is my understanding. I guess this'll deserve re-checking when more of the code that this comment suggests actually gets implemented, though.

Regardless, ReusableEncoder will still need to be a concrete type for a couple of other reasons: namely, because it's often going to export codec-specific configuration fields, and because those things are codec-specific, it's not really possible to totally hide it behind any general interface.

(Suggesting that this concrete type should have "Reusable" right in the type name might be a little distracting? Maybe I should drop that word. It's something I'm concerned about -- ReusableEncoder tends to have at least one userland "stack" in it; I want them to be resettable and reusable in a way that zeros the length but retains the capacity of any such "stack" slice or other working memory buffers -- but maybe that detail should stay in my head or in docs rather than shouted in the type name.)

[‡] - there's probably a half-dozen red herrings in that older library PR too, admittedly. For example, I'm pretty sure the new encoders and decoders I'll be writing to go with this package will have many fewer allocations to initialize than that older library did, which may result in the performance impact of the pooling being less starkly visible. Still, I think overall I'm going to start by assuming the older library had a point.

//
// Note that a ReusableEncoder type that supports configuration options
// does not functionally expose those options when invoked by the multicodec system --
// multicodec indicators do not provide room for extended configuration info.
// Codecs that expose configuration options are doing so for library users to enjoy;
// it does not mean those non-default configurations will necessarly be available
// in all scenarios that use codecs indirectly.
// There is also no standard interface for such configurations: by nature,
// if they exist at all, they vary per codec.
type Encoder func(data ipld.Node, output io.Writer) error

// Decoder is the essential definiton of a function that consumes serial data and unfurls it into IPLD Data Model-compatible in-memory representations.
// IPLD Codecs are written by implementing this function interface (as well as (typically) a matched Encoder).
//
// Decoder is the dual of Encoder.
// Most of the documentation for the Encoder function interface
// also applies wholesale to the Decoder interface.
type Decoder func(into ipld.NodeAssembler, input io.Reader) error

type ErrBudgetExhausted struct{}

func (e ErrBudgetExhausted) Error() string {
return "decoder resource budget exhausted (message too long or too complex)"
}
84 changes: 84 additions & 0 deletions codec/codectools/token.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,84 @@
package codectools

import (
"fmt"

"github.com/ipld/go-ipld-prime"
)

type Token struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not a fan of the package name, because then the qualified names are a bit weird; codectools.Token sounds to me like it should be codec.Token.

I guess this is less important if the end user shouldn't see codectools, but then I would argue this should be an internal package.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, this is way too much of a mouthful, especially for the Token stuff.

Gonna run with it a while longer, because I've stacked up so more code atop this already, but some rename or reshuffle here would be an improvement.

Kind TokenKind

Length int // Present for MapOpen or ListOpen. May be -1 for "unknown" (e.g. a json tokenizer will yield this).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can someone talk me through the Go memory model decision process around structs like this that are essentially unions. If you make a new Token you usually only want to set Kind and one of the other properties, yet you have a memory layout that includes space for all of these things. An alternative might be to make everything a pointer, but then you allocate pointer space for each of them and you end up using the heap for the values. Is that the essential trade-off here, and where you have such a choice you should just choose this style? I've run into this a number of times and using pointers always feels more correct to me because I'm not putting anything into the fields I don't care about. But I suppose that's not the best way of thinking about it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I think you've got it already.

If Go had unions -- either discriminated or even nasty dirty unsafe undiscriminated -- I'd use 'em in a heartbeat. Unfortunately, it does not.

If one tries to unify the space cost by using any kind of interface{}, you end up forcing allocations to box up the values. (Interfaces have to be two words: one word is a pointer to the type info, and the other word is a pointer to the value. That means you have to get a pointer from somewhere. And at present: yes, for everything: It's a pointer to the value even if the value is a single word and could just as well fit without a pointer indirection. Uuffah. (I've heard talk future versions of Go might improve this situation for word-sized types, but it hasn't happened yet. And wouldn't entirely absolve our issues here either, since byte slices and strings are both multi-word too.))

("Forcing" allocations might be a little strong: if you can get a pointer to data that's already in the heap (or inside some other data that's already on the heap), then you can get that pointer without a new allocation. But this generally requires some careful logic... and while sometimes that careful logic is worth the effort, it turns out to be pretty hard to apply in practice around how these tokens get used, IME.)

If one rewrote this structure with the same fields but all pointers (and not trying to unify stuff into a single interface{} field), the struct would get about three words shorter ([]byte is three words while *[]byte is one; string is two words while *string is one; the rest are all already one word and would still be one word as a pointer; except, okay, Link is also still two words and would remain two words because you wouldn't turn it into a pointer-to-an-interface since interfaces can already contain pointers). So a lot of space would still be reserved. And you'd indeed still have the same issues with getting a pointer tending to provoke an allocation.

You'll see people still use pointers for optional fields in golang fairly often, but usually the reason for this is that they're wanting to be able to use nil as a distinct sentinel value. (And when you see this, implicitly, the author also either didn't know or just didn't care about the allocations provoked. Which -- sometimes, is fair! But in this context, we really do care; this structure is getting used in hot loops in parsers, so allocations are extremely, extremely noticeable.) A lot of other programming languages try to encapsulate this concept with a type called Maybe or Optional or Either or something like that just to make it clear when the goal is to have a sentinel value -- and we've done it ourselves with the optional keyword in IPLD Schemas -- but Golang doesn't have any of these, so, people frequently overload this semantic onto making something into pointer even though it otherwise wouldn't be.

(Sorry for the wall of text; you asked about the one thing that touches all three of my biggest beefs about Golang -- lack of unions, avoidable heap allocations for word-sized residents of interfaces, and the pointer/optional bamboozle -- at once, ahheh.)

Bool bool // Value. Union: only has meaning if Kind is TokenKind_Bool.
Int int64 // Value. Union: only has meaning if Kind is TokenKind_Int.
Float float64 // Value. Union: only has meaning if Kind is TokenKind_Float.
Str string // Value. Union: only has meaning if Kind is TokenKind_String. ('Str' rather than 'String' to avoid collision with method.)
Bytes []byte // Value. Union: only has meaning if Kind is TokenKind_Bytes.
Link ipld.Link // Value. Union: only has meaning if Kind is TokenKind_Link.

Node ipld.Node // Direct pointer to the original data, if this token is used to communicate data during a walk of existing in-memory data. Absent when token is being used during deserialization.

// The following fields all track position and progress:
// (These may be useful to copy into any error messages if errors arise.)
// (Implementations may assume token reuse and treat these as state keeping;
// you may experience position accounting accuracy problems if *not* reusing tokens or if zeroing these fields.)

pth []ipld.PathSegment // Set by token producers (whether marshallers or deserializers) to track logical position.
offset int64 // Set by deserializers (for both textual or binary formats alike) to track progress.
lineOffset int64 // Set by deserializers that work with textual data. May be ignored by binary deserializers.
columnOffset int64 // Set by deserializers that work with textual data. May be ignored by binary deserializers.
}

func (tk Token) String() string {
switch tk.Kind {
case TokenKind_MapOpen:
return fmt.Sprintf("<%c:%d>", tk.Kind, tk.Length)
case TokenKind_MapClose:
return fmt.Sprintf("<%c>", tk.Kind)
case TokenKind_ListOpen:
return fmt.Sprintf("<%c:%d>", tk.Kind, tk.Length)
case TokenKind_ListClose:
return fmt.Sprintf("<%c>", tk.Kind)
case TokenKind_Null:
return fmt.Sprintf("<%c>", tk.Kind)
case TokenKind_Bool:
return fmt.Sprintf("<%c:%v>", tk.Kind, tk.Bool)
case TokenKind_Int:
return fmt.Sprintf("<%c:%v>", tk.Kind, tk.Int)
case TokenKind_Float:
return fmt.Sprintf("<%c:%v>", tk.Kind, tk.Float)
case TokenKind_String:
return fmt.Sprintf("<%c:%q>", tk.Kind, tk.Str)
case TokenKind_Bytes:
return fmt.Sprintf("<%c:%x>", tk.Kind, tk.Bytes)
case TokenKind_Link:
return fmt.Sprintf("<%c:%v>", tk.Kind, tk.Link)
default:
return "<INVALID>"
}
}

type TokenKind uint8

const (
TokenKind_MapOpen TokenKind = '{'
TokenKind_MapClose TokenKind = '}'
TokenKind_ListOpen TokenKind = '['
TokenKind_ListClose TokenKind = ']'
TokenKind_Null TokenKind = '0'
TokenKind_Bool TokenKind = 'b'
TokenKind_Int TokenKind = 'i'
TokenKind_Float TokenKind = 'f'
TokenKind_String TokenKind = 's'
TokenKind_Bytes TokenKind = 'x'
TokenKind_Link TokenKind = '/'
)

type ErrMalformedTokenSequence struct {
Detail string
}

func (e ErrMalformedTokenSequence) Error() string {
return "malformed token sequence: " + e.Detail
}
258 changes: 258 additions & 0 deletions codec/codectools/token_consumers.go
Original file line number Diff line number Diff line change
@@ -0,0 +1,258 @@
package codectools

import (
"fmt"
"io"

"github.com/ipld/go-ipld-prime"
"github.com/ipld/go-ipld-prime/codec"
)

// TokenAssemble takes an ipld.NodeAssembler and a TokenReader,
// and repeatedly pumps the TokenReader for tokens and feeds their data into the ipld.NodeAssembler
// until it finishes a complete value.
//
// To compare and contrast to other token oriented tools:
// TokenAssemble does the same direction of information transfer as the TokenAssembler gadget does,
// but TokenAssemble moves completely through a value in one step,
// whereas the TokenAssembler accepts tokens pumped into it one step at a time.
//
// TokenAssemble does not enforce the "map keys must be strings" rule which is present in the Data Model;
// it will also happily do even recursive structures in map keys,
// meaning it can be used when handling schema values like maps with complex keys.
func TokenAssemble(na ipld.NodeAssembler, tr TokenReader, budget int) error {
tk, err := tr(&budget)
if err != nil {
return err
}
return tokenAssemble(na, tk, tr, &budget)
}

func tokenAssemble(na ipld.NodeAssembler, tk *Token, tr TokenReader, budget *int) error {
if *budget < 0 {
return codec.ErrBudgetExhausted{}
}
switch tk.Kind {
case TokenKind_MapOpen:
if tk.Length > 0 && *budget < tk.Length*2 { // Pre-check budget: at least two decrements estimated for each entry.
return codec.ErrBudgetExhausted{}
}
ma, err := na.BeginMap(tk.Length)
if err != nil {
return err
}
for {
// Peek one token. We need to see if the map is about to end or not.
tk, err = tr(budget)
if err != nil {
return err
}
// If the map has ended, invoke the finish operation and check for any errors.
if tk.Kind == TokenKind_MapClose {
return ma.Finish()
}
// Recurse to assemble the key.
*budget-- // Decrement budget by at least one for each key. The key content may also cause further decrements.
if err = tokenAssemble(ma.AssembleKey(), tk, tr, budget); err != nil {
return err
}
// Recurse to assemble the value.
// (We don't really care to peek this token, but do so anyway to keep the calling convention regular.)
tk, err = tr(budget)
if err != nil {
return err
}
*budget-- // Decrement budget by at least one for each value. The value content may also cause further decrements.
if err = tokenAssemble(ma.AssembleValue(), tk, tr, budget); err != nil {
return err
}
// Continue around the loop, to encounter either the next entry or the end of the map.
}
case TokenKind_MapClose:
return ErrMalformedTokenSequence{"map close token encountered while not in the middle of a map"}
case TokenKind_ListOpen:
if tk.Length > 0 && *budget < tk.Length { // Pre-check budget: at least one decrement estimated for each entry.
return codec.ErrBudgetExhausted{}
}
la, err := na.BeginList(tk.Length)
if err != nil {
return err
}
for {
// Peek one token. We need to see if the list is about to end or not.
tk, err = tr(budget)
if err != nil {
return err
}
// If the list has ended, invoke the finish operation and check for any errors.
if tk.Kind == TokenKind_ListClose {
return la.Finish()
}
// Recurse to assemble the value.
*budget-- // Decrement budget by at least one for each value. The value content may also cause further decrements.
if err = tokenAssemble(la.AssembleValue(), tk, tr, budget); err != nil {
return err
}
// Continue around the loop, to encounter either the next value or the end of the list.
}
case TokenKind_ListClose:
return ErrMalformedTokenSequence{"list close token encountered while not in the middle of a list"}
case TokenKind_Null:
return na.AssignNull()
case TokenKind_Bool:
*budget--
return na.AssignBool(tk.Bool)
case TokenKind_Int:
*budget--
return na.AssignInt(int(tk.Int))
case TokenKind_Float:
*budget--
return na.AssignFloat(tk.Float)
case TokenKind_String:
*budget -= len(tk.Str)
return na.AssignString(tk.Str)
case TokenKind_Bytes:
*budget -= len(tk.Bytes)
return na.AssignBytes(tk.Bytes)
case TokenKind_Link:
*budget--
return na.AssignLink(tk.Link)
default:
panic(fmt.Errorf("unrecognized token kind (%q?)", tk.Kind))
}
}

// --- the stepwise assembler system (more complicated; has a userland stack) is below -->

type TokenAssembler struct {
// This structure is designed to be embeddable. Use Initialize when doing so.

stk assemblerStack // this is going to end up being a stack you know
budget int64
}

type assemblerStackRow struct {
state uint8 // 0: assign this node; 1: continue list; 2: continue map with key; 3: continue map with value.
na ipld.NodeAssembler // Always present.
la ipld.ListAssembler // At most one of these is present.
ma ipld.MapAssembler // At most one of these is present.
}
type assemblerStack []assemblerStackRow

func (stk assemblerStack) Tip() *assemblerStackRow {
return &stk[len(stk)-1]
}
func (stk *assemblerStack) Push(na ipld.NodeAssembler) {
*stk = append(*stk, assemblerStackRow{na: na})
}
func (stk *assemblerStack) Pop() {
if len(*stk) == 0 {
return
}
*stk = (*stk)[0 : len(*stk)-1]
}

func (ta *TokenAssembler) Initialize(na ipld.NodeAssembler, budget int64) {
if ta.stk == nil {
ta.stk = make(assemblerStack, 0, 10)
} else {
ta.stk = ta.stk[0:0]
}
ta.stk.Push(na)
ta.budget = budget
}

// Process takes a Token pointer as an argument.
// (Notice how this function happens to match the definition of the visitFn that's usable as an argument to TokenWalk.)
// The token argument can be understood to be "borrowed" for the duration of the Process call, but will not be mutated.
// The use of a pointer here is so that a single Token can be reused by multiple calls, avoiding unnecessary allocations.
//
// Note that Process does very little sanity checking of token sequences itself,
// mostly handing information to the NodeAssemblers directly,
// which presumably will reject the data if it is out of line.
// The NodeAssembler this TokenAssembler is wrapping should already be enforcing the relevant logical rules,
// so it is not useful for TokenAssembler.Process to attempt to duplicate those checks;
// TokenAssembler.Process will also return any errors from the NodeAssembler without attempting to enforce a pattern on those errors.
// In particular, TokenAssembler.Process does not check if every MapOpen is paired with a MapClose;
// it does not check if every ListOpen is paired with a ListClose;
// and it does not check if the token stream is continuing after all open recursives have been closed.
// TODO: review this documentation; more of these checks turn out necessary anyway than originally expected.
func (ta *TokenAssembler) Process(tk *Token) (err error) {
if len(ta.stk) == 0 {
return io.EOF
}
tip := ta.stk.Tip()
switch tip.state {
case 0:
switch tk.Kind {
case TokenKind_MapOpen:
tip.ma, err = tip.na.BeginMap(tk.Length)
tip.state = 2
return err
case TokenKind_MapClose:
// Mostly we try to just forward things, but can't not check this one: tip.ma would be nil; there's reasonable target for forwarding.
return ErrMalformedTokenSequence{"map close token encountered while not in the middle of a map"}
case TokenKind_ListOpen:
tip.la, err = tip.na.BeginList(tk.Length)
tip.state = 1
return err
case TokenKind_ListClose:
// Mostly we try to just forward things, but can't not check this one: tip.la would be nil; there's reasonable target for forwarding.
return ErrMalformedTokenSequence{"list close token encountered while not in the middle of a list"}
case TokenKind_Null:
err = tip.na.AssignNull()
ta.stk.Pop()
return err
case TokenKind_Bool:
err = tip.na.AssignBool(tk.Bool)
ta.stk.Pop()
return err
case TokenKind_Int:
err = tip.na.AssignInt(int(tk.Int)) // TODO: upgrade all of ipld to use high precision int consistently
ta.stk.Pop()
return err
case TokenKind_Float:
err = tip.na.AssignFloat(tk.Float)
ta.stk.Pop()
return err
case TokenKind_String:
err = tip.na.AssignString(tk.Str)
ta.stk.Pop()
return err
case TokenKind_Bytes:
err = tip.na.AssignBytes(tk.Bytes)
ta.stk.Pop()
return err
case TokenKind_Link:
err = tip.na.AssignLink(tk.Link)
ta.stk.Pop()
return err
default:
panic(fmt.Errorf("unrecognized token kind (%q?)", tk.Kind))
}
return nil
case 1:
if tk.Kind == TokenKind_ListClose {
err = tip.la.Finish()
ta.stk.Pop()
return err
}
ta.stk.Push(tip.la.AssembleValue())
return ta.Process(tk)
case 2:
if tk.Kind == TokenKind_MapClose {
err = tip.ma.Finish()
ta.stk.Pop()
return err
}
tip.state = 3
ta.stk.Push(tip.ma.AssembleKey())
return ta.Process(tk)
case 3:
tip.state = 2
ta.stk.Push(tip.ma.AssembleValue())
return ta.Process(tk)
default:
panic("unreachable")
}
}
Loading