Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Trim Link / CID interface and introduce something like LinkView interface #214

Closed
Gozala opened this issue Oct 16, 2022 · 6 comments
Closed
Assignees

Comments

@Gozala
Copy link
Contributor

Gozala commented Oct 16, 2022

Given #213 #212 #208 and #211 I would like to propose trimming down Link interface to bare minimum where it acts like Model and introduce another interface e.g. LinkView (can call whatever) that extends Link with bunch of methods:

export interface Link<
  Data extends unknown = unknown,
  Format extends number = number,
  Alg extends number = number,
  V extends Version = 1
  > extends Phantom<Data> {
  readonly version: V
  readonly code: Format
  readonly multihash: MultihashDigest<Alg>
  readonly bytes: ByteView<Link<Data, Format, Alg, V>>
  readonly ['/']: this['bytes']
}

export interface LinkView<
  Data extends unknown = unknown,
  Format extends number = number,
  Alg extends number = number,
  V extends Version = 1
  > extends Link<Data, Format, Alg> {


  equals: (other: unknown) => other is Link<Data, Format, Alg, Version>
  toString: <Prefix extends string>(base?: MultibaseEncoder<Prefix>) => ToString<Link<Data, Format, Alg, Version>, Prefix>
  toJSON: () => { version: V, code: Format, hash: Uint8Array }
  link: () => Link<Data, Format, Alg, V>

  toV1: () => Link<Data, Format, Alg, 1>
}

Note I have removed following entirely

readonly byteOffset: number
readonly byteLength: number

Idea here would be that CID class will implement both interfaces, but all the stuff on LinkView will be implemented using CID.prototype which should ensure that value equality will hold.

This also would imply that isLink would become something like this:

const isLink = (value:unknown): value is Link => value && value['/'] === value.bytes
@Gozala
Copy link
Contributor Author

Gozala commented Oct 17, 2022

After looking at dag-json spec I'm realizing that / supposed to point to CID string not bytes so maybe this is not ideal. However general point still is:

  1. We need an own (signalling) property with a very low collision probability.
  2. This property should be retained across JS realms (implying it can not be symbol or getter)

So basically cid.wellKnownPropertyName === cid.bytes could check both boxes above and harder it is to type property name less likely it is to collide.

@Gozala
Copy link
Contributor Author

Gozala commented Oct 17, 2022

For what it's worth, I just checked and found out that node Buffer's also turn into plain Uint8Array in new realms (I was hoping we might borrow a trick from node, but it doesn't seem to exist)

@Gozala
Copy link
Contributor Author

Gozala commented Oct 17, 2022

I have been thinking about this more and I think it is worth calling out all the CID consumers to get down to what makes most sense for them. E.g. in absolute terms following representation of a CID does not seem ideal for any of the consumers (I'll go over them later)

interface CID {
  version: 0|1
  code: number
  multihash: MultihashDigest
  bytes: Uint8Array
}

Because:

  1. Because bytes pretty much is encoding of other three fields and so it is redundant.
  2. Above definition makes invalid CID's possible, because bytes MUST correspond to encoding of other two fields.

Consider alternative, which captures all of the information yet make invalid CIDs impossible:

type CID = CIDv0 | CIDv1

interface CIDv0 {
  version: 0
  multihash: MultihashDigest<0x12>
}

interface CIDv1<Code extends number = number, Alg extends number = number> {
  version: 1
  code: Code
  multihash: MultihashDigest<Alg>
}

Consumers

As far as I can tell there are sever types of consumers of CIDs:

  1. Encoders (which is mostly IPLD codecs), I think needs for those are as follows:

    1. Ability to distinguish CID from other data.
    2. Ability to encode CID into bytes (If they encode into string it's still is encoding into bytes and then encoding those into string)

    As far as I can tell (@rvagg can correct me if I'm wrong) encoders do not need anything else and in case they do need to actually look at version or code they probably are fine using static functions for that as they're likely have multiformats in dependency anyway since decoder part needs to produce more of CID / Link view than just a data decoded from it.

  2. Consumers, usually a high level code exposed to end user (or user interface). I think this code needs:

    1. All of the sugar it may want to toString, toV0, toV1, equals etc...
  3. Code incorporating CIDs. This one can fall anywhere between 1st and 2nd. If it's just incorporating data into larger structure it falls more in the Encoder case (which is basically incorporating it into lager byte array), however if it takes alternative code paths depending on CID version or codec it may fall more on the consumer end of the spectrum.

With above consumers in mind I think it might be a good idea to have optimal representation for Encoders and separate optimal representation for consumers. Incorporators would need to weight tradeoffs and choose to act as one or the other which would imply either importing static functions to operate on received CIDs or demand "CID Views" from it's consumers. I think this would also help with making code on the other side of wire bit more explicit, meaning it would have to act as encoder (well decoder actually, but encoder in our taxonomy) and not as consumer.

With that insight it seems to me that for encoders we simply need Uint8Array that we can distinguish from other Uint8Array. As of consumers we need a class that just wraps that Uint8Array and provides everything else as set of getters and methods. I think we could accomplish this via following types:

// Using unboxed Uint8Array would be too error prone since any array starting with 0 or 18 could
// be mistaken for CID. Any other prefix is probably going to be just as error prone. To avoid this
// we simply box bytes as { CID: bytes }
interface CID<C extends number = number, Alg extend number = number, V = 0 | 1 = 0 | 1>
  extends Phantom<{ version: V, code: C, MultihashDigest<Alg> }> {
   // can be any other name as long as collision with arbitrary other objects is highly unlikely
   // given using capital case names in JS is not idiomatic collision here seems unlikely.
   CID: ByteView<CID<C, Alg, V>>
}


interface CIDView<C extends number = number, Alg extend number = number, V = 0 | 1 = 0 | 1>
   extends CID<C, Alg, V> {
   readonly version: V
   readonly code: C
   readonly multihash: MultihashDigest<Alg>
   toBytes(): Uint8Array
   toString(): string
   toV1(): CIDView<C, Alg, 1>
   equals(other:unknown): other is CID<C, Alg, V>
   link(): CIDView<C, Alg, V>
   [Symbol.toStringTag]: 'CID'
}

In terms of implementation CIDView will be provided by the prototype and the CID via the instance (fields), which can look something along these lines:

const digestCache = new WeakMap()

class CID {
   constructor(bytes) {
     this.CID = bytes
   }
   get version() {
     return version(this)
   }
   get code() {
      return code(this)
   }
   get multihash() {
      return multihash(this)
   }
   equals(other) {
      return equals(this, other)
   }
   toBytes() {
     return toBytes(this)
   }
   toString() {
      return toString(this)
   }
}

export const code = ({ CID }) => varint.decode(CID, 1)
export const version = ({ CID }) => CID[0] === 18 ? 0 : 1

const multihash = (cid) => {
   const digest = digestCache.get(cid.CID)
    if (!digest) {
       const offset = varint.encodingLength(version(cid)) + varint.encodingLength(code(cid))
       const digest = Digest.decode(cid.CID.subarray(offset))
       digestCache.set(cid.CID, digest)
       return digest
    }
    return digest
}

export const equals = ({ CID }, other) => uint8arrays.equals(CID, other.CID)

export const decode = (bytes:Uint8Array) => {
   switch (bytes[0]) {
      case 18:
        return decodeV0(bytes)
      case 1:
        return decodeV1(bytes)
      default:
         throw new Error(`Unsupported CID version ${bytes[0]}`)
   }
}
export const { decode as fromBytes } 

const decodeV0 = (bytes) => {
   const byteOfffset = 1
   const digest = Digest.decode(bytes, byteOffset)
   digestCache.set(bytes, digest)
   return new CID(bytes)
}

const decodeV1 = (bytes) => {
   const byteOfffset = 1
   const [code, size] = varint.decode(bytes, 1)
   const digest = Digest.decode(bytes, byteOffset + size)
   digestCache.set(bytes, digest)
   return new CID(bytes)
)

export const toBytes = ({ CID }) => CID
export const toString = (cid) => {
   switch (version(cid)) {
      case 0:
        return base58btc.encode(toBytes(cid)).slice(1)
      case 1:
        return base32.encode(toBytes(cid))
      default:
        throw new Error(`Unsupported CID version ${version(cid)}`)
   }
}

export const parse = (source) => {
   switch (source[0]) {
      case 'Q':
         return decode(base58.decode(`${base58btc.prefix}${source}`))
      case base58btc.prefix: {
         return decode(base58btc.decode(source))
      case base32.prefix:
         return decode(base32.decode(source))
      default:
        throw new Error(`To parse non base32 or base58btc encoded CIDs use CID.fromBytes(baseDecoder.decode("${source}"))`)
    }
}
export { parse as fromString }

@Gozala
Copy link
Contributor Author

Gozala commented Oct 17, 2022

I would like to address following comment from @RangerMauve

I think the intention was to have an ArrayBuffer that could also be a CID itself (kind of like the Cid type in Go which is just a wrapped byte string) but we've not really evolved beyond just having a concrete class—does the new interface (and/or the newly proposed ones) get us any closer? is it worth the bother holding on to these as important properties?

I think it's doable, yet I'm no longer sure it can be practical in JS due to the following:

  1. I think we need a reliable and non error prone way to distinguish CID from some Uint8Array. I doubt we could use first couple of bytes in Uint8Array to do this (let me know if you think otherwise).
  2. Expando properties get discarded across realms so we can't simply add a field to Uint8Array and expect it to stick.
  3. Sub-classing Uint8Array (which we find useful in bunch of other places) also not going to survive thread / realm boundaries so this is not going to work either.
  4. Using { byteOffset: number, byteLength: number, buffer: ArrayBuffer } would work, but turns out it would introduce issues with equality just like CID equality #208 unless they're made non-enumerable, but making them non-enumerable has unfortunate performance implications as per Performance of Object.defineProperties use in the CID class #200

So while I would really love to have a way to not box Uint8Array's I can't seem to find a reliable way to distinguish ones that represent CIDs from ones that do not. If anyone has suggestion on how to accomplish this, I would very much love to do the necessary work to make it happen. Otherwise some form of lightweight boxing is probably best we can do, in which case I agree that byteOffset, byteLength fields provide no real value and we can remove them.

@rvagg
Copy link
Member

rvagg commented Dec 6, 2022

@achingbrain @Gozala do either of you want to move forward with more changes to CID / Link? Can I close all these issues out for now and expect a PR with concrete proposal at some point?

@rvagg rvagg moved this to 🏃‍♀️ In Progress in IPLD team's weekly tracker Dec 6, 2022
@Gozala
Copy link
Contributor Author

Gozala commented Dec 7, 2022

Only changes to CID / Link I'm actively interested would be in support of #227. Closing this issues sound good to me.

@rvagg rvagg closed this as completed Dec 8, 2022
Repository owner moved this from 🏃‍♀️ In Progress to 🎉 Done in IPLD team's weekly tracker Dec 8, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants