Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support HAMT #4

Merged
merged 7 commits into from
Apr 4, 2021
Merged

Support HAMT #4

merged 7 commits into from
Apr 4, 2021

Conversation

hannahhoward
Copy link
Collaborator

Goals

Support both regular directories and HAMTS for path lookup

Implementation

  • Implement a decoder / encoder for the UnixFS protobuf encoded in the Data field of DagPB nodes. Bonus: supports UnixFS v1.5 fields!
  • Implement an IPLD prime based UnixFS HAMT Directory ADL (read only)
  • Factor the iterators out of the original node so they are shared with only minor modifications between regular directors and HAMT Directorys
  • Rearrage the project so that we support multiple nodes. This means:
  • upgrading the reify function to decode the UnixFS Protobuf and then construct an appropriate node based on the protobuf
  • setting up a new NodePrototype design -- this is a hack to pipe through the LinkSystem all the way to the point of build. I put in a potential better fix that requires changes to IPLD Prime here: feat(linksystem): add reification to LinkSystem ipld/go-ipld-prime#158
  • otherwise we really should have NodePrototypes at all till we build the write side

add builders and utitlities for working with unixfs data protobufs
Add HAMT Implementation + Shared iterator to use with basic directory
move basic directory to its own package, redo root package based on more flexible refication and
more flexible node prototype
Copy link
Collaborator

@willscott willscott left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🥇 🎉 💯

"github.com/ipld/go-ipld-prime/fluent/qp"
)

func BuildUnixFs(fn func(*Builder)) (data.UnixFSData, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment to this toplevel method with pointer to example of using it to build a node

Comment on lines +39 to +64
type UnixFSData struct {
dataType Int
data optional Bytes
filesize optional Int;
blocksizes [Int]

hashType optional Int
fanout optional Int
mode optional Int
mtime optional UnixTime
} representation map
*/

ts.Accumulate(schema.SpawnStruct("UnixFSData",
[]schema.StructField{
schema.SpawnStructField("DataType", "Int", false, false),
schema.SpawnStructField("Data", "Bytes", true, false),
schema.SpawnStructField("FileSize", "Int", true, false),
schema.SpawnStructField("BlockSizes", "BlockSizes", false, false),
schema.SpawnStructField("HashType", "Int", true, false),
schema.SpawnStructField("Fanout", "Int", true, false),
schema.SpawnStructField("Mode", "Int", true, false),
schema.SpawnStructField("Mtime", "UnixTime", true, false),
},
schema.SpawnStructRepresentationMap(nil),
))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note to @warpfork we really should figure out how to generate this intermediate code-gen format from ipld schemas :)

gengo "github.com/ipld/go-ipld-prime/schema/gen/go"
)

func main() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make sure to add a go:generate line somewhere in this repo so that a go generate ./... at the root re-runs this.

switch fieldNum {
case Data_DataTypeWireNum:
if wireType != protowire.VarintType {
return fmt.Errorf("protobuf: (UnixFSData) invalid wireType, field: DataType, expected %d, got %d", protowire.VarintType, wireType)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

factor out a constant format string from this and below?

"google.golang.org/protobuf/encoding/protowire"
)

func DecodeUnixFSData(src []byte) (UnixFSData, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason not to follow the Decoder func(NodeAssembler, io.Reader) error interface here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea kinda -- it's never going to be registered as a multicodec, and this provides a bunch of weird type casts and artificial reader constructions all around. This isn't an IPLD format per se. We will always be decoding from within a larger block.

links := n._substrate.FieldLinks()
link := lookup(links, key)
if link == nil {
return nil, schema.ErrNoSuchField{Type: nil /*TODO*/, Field: ipld.PathSegmentOfString(key)}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leaving comment to flag this todo :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this one's actually an upstream design problem. (When I drafted that error type, I thought having codegen spit out values for self-description of the types it had just emitted would be a "just around the corner" feature, and so I'd put a pointer to type info here. That turned out to be misplaced optimism. It's likely this should be replaced by a convention of type name strings instead.)

I'll make an issue upstream to at least track this.


// satisfy schema.TypedNode
func (UnixFSBasicDir) Type() schema.Type {
return nil /*TODO:typelit*/
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why isn't this just TypeMap?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that's not really the same thing. There are a ton of restrictions on format. It's actually a struct from the schema systems point of view. Actually, given it doesn't adequately satisfy that function we should remove it.

shardCache := make(map[ipld.Link]*_UnixFSHAMTShard, substrate.FieldLinks().Length())
bf := BitField(data)
return &_UnixFSHAMTShard{
ctx: ctx,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

holding onto context past construction may be a mis-match that confuses things. currently context is passed in during a load and the caller may assume that's tracking the needed loads to the datastore to construct the node, such that once they get their constructed top-level node back they cancel the context.
if later they attempt to do a map iterator over the children in a directory they aren't necessarily going to make the connection that their previous construction-time context is still in play.

On the flip side, the whole point of the hamt is that we can lazy-load, and we do need a context of some sort while doing that, and I don't see any others to pull in that make any more sense.

thoughts @warpfork

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. the fix would be to pass context to all the ipld node methods, which is not a bad idea per say, but a real big change.

I mean the reality is all these loads are happening in LookupByString, which is a bit confusing I think, since who would expect that massive amounts of fetching from datastores and even the network is happening here. But, on the other hand, I don't have anything better right now.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup. understood that this is what the interface lets us do for now, but also pointing it out to @warpfork that the interface forces this in it's current design

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would very much prefer that interfaces are changed to allow context. Storing in a struct temporarily is problematic in that once more code gets built on this the embedded context gets harder to remove, and the context's lifetime may be confusing. If this cannot be changed now, please leave a TODO to indicate this is a workaround that needs to be addressed later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting feedback, thanks for this.

  • It is true that currently storing a context is really the only way it's possible to go with this.
  • Agree that storing context means lifetime can be unclear and that seems kind of problematic.
  • Confirm that adding contexts to all methods is a pretty big change.

@mvdan and I actually talked about this (the possibility of adding context parameters to everything) previously too, and at the time put it back down as insufficiently convinced that it provided value for the amount of line noise it adds. (Mostly because the amount of syntactic bulk in these APIs already often feels like it's kind of pushing it.) But it's definitely possible it's still worth reconsidering.

As an interesting comparison, the new golang fs packages also weighed a similar decision -- whether to add contexts to almost all file IO methods -- and they decided there to store context in a filesystem structure as well. I don't know if that's right or wrong either, but it's food for thought.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it helps to think about things, this article describes why it is important to pass Context rather than store it in another type. It also highlights a rare case where storing Context in a struct type may make sense, and how to do so safely:
https://blog.golang.org/context-and-structs

Perhaps @mvdan could offer his take on this. He has commented on this in the past: golang/go#22602 (comment)

hamt/util.go Outdated
}

if !nd.FieldHashType().Exists() || uint64(nd.FieldHashType().Must().Int()) != HashMurmur3 {
return fmt.Errorf("only murmur3 supported as hash function")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have these errors as constants, here & below?

have reifier add path selection to all protobuf nodes, even ones that are not unixfs
respond to PR comments -- make error types, etc
Copy link
Contributor

@gammazero gammazero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly just a few minor style comments. Looks very good.

data/datatypes.go Show resolved Hide resolved
b := &Builder{MapAssembler: ma}
fn(b)
if !b.hasBlockSizes {
qp.MapEntry(ma, "BlockSizes", qp.List(0, func(ipld.ListAssembler) {}))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider a const for "BlockSizes" and other map entry keys.

data/errors.go Outdated
if !ok {
actualName = "Unknown Type"
}
return fmt.Sprintf("Incorrect Node Type: (UnixFSData) expected type: %s, actual type: %s", expectedName, actualName)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: lower case error strings

}
if u.FieldDataType().Int() == Data_HAMTShard {
return HAMTShardPerimissionsDefault
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider using switch u.FieldDataType().Int() { ... } here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right :)

for {
if len(remaining) == 0 {
break
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider doing this test in the for:

for len(remaining) != 0 {
    ...
}

Same comment for other consumeX functions.

shardCache := make(map[ipld.Link]*_UnixFSHAMTShard, substrate.FieldLinks().Length())
bf := BitField(data)
return &_UnixFSHAMTShard{
ctx: ctx,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would very much prefer that interfaces are changed to allow context. Storing in a struct temporarily is problematic in that once more code gets built on this the embedded context gets harder to remove, and the context's lifetime may be confusing. If this cannot be changed now, please leave a TODO to indicate this is a workaround that needs to be addressed later.

hamt/util.go Outdated
out := int(mkmask(i) & curb)
hb.consumed += i
return out
} else if i < leftb {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: unnecessary else

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point -- copied code but you are totally right.

hamt/util.go Outdated
return nil
}

func Log2Size(nd data.UnixFSData) int {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be exported?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair

reification.go Outdated
}
builder, ok := reifyFuncs[data.FieldDataType().Int()]
if !ok {
return nil, fmt.Errorf("No reification for this UnixFS node type")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "No" --> "no"

utils/utils.go Show resolved Hide resolved
Copy link
Collaborator Author

@hannahhoward hannahhoward left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put in notes for @gammazero comments and otherwise pushed.

@hannahhoward hannahhoward merged commit 5a89cd4 into main Apr 4, 2021
@mvdan mvdan deleted the feat/support-hamt branch January 20, 2022 16:25
hacdias pushed a commit to ipfs/boxo that referenced this pull request Jan 12, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants