Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Repo checkouts #444

Merged
merged 28 commits into from
Jan 5, 2023
Merged

Repo checkouts #444

merged 28 commits into from
Jan 5, 2023

Conversation

dholms
Copy link
Collaborator

@dholms dholms commented Dec 27, 2022

This started out as introducing repo checkouts to the API but spun out into several related things that improve our storage & sync layer:

  • ripped out auth lib (we weren't actually making use of this & it was just adding extra song and dance to all of repo operations & sync verification)
  • hash blocks as they come in from car file to ensure the bytes match the purported CID
  • better error handling
  • split out repo storage class into the naive ReadableBlockstore and the higher level RepoStorage
  • alongside that, introduce ReadableRepo that allows for loading a repo off of the naive blockstore
  • enabled by both of those, we can remove temp from RepoStorage and introduce a new SyncStorage class on top of two readable blockstores. this is flexible such that you can handle sync in memory or with an on-disk cache
  • improve the sync & verification methods to make use of all of this
  • AND FINALLY add in repo checkouts
  • also added getCommitPath as a bonus

Note: this one does not yet include blobs for diffs/checkouts

@dholms dholms changed the base branch from main to repo-storage-migration December 27, 2022 23:49
Base automatically changed from repo-storage-migration to sync-revamp December 29, 2022 21:16
@dholms dholms changed the title Repo checkout - no history Repo checkouts Jan 3, 2023
packages/repo/src/util.ts Outdated Show resolved Hide resolved
if (!earliest && root.prev === null) {
return path.reverse()
} else if (earliest && root.prev.equals(earliest)) {
return path.reverse()
Copy link
Collaborator Author

@dholms dholms Jan 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to get some opinions on how getCommitPath should be ordered.

Instinct says it should be in increasing order, ie: [commitX, commitY, commitZ] (and that's how it is right now). This makes sense because, when processing commits, you want you'd probably want to do them in order

However, I've structured it similarly to array.slice where the first param is inclusive & the second is exclusive
This is because if you ask for a commit path form the current commitZ to commitX, you should already have knowledge of commitX & you don't actually need it, you only need commitZ & commitY.
And in sense, it would suggest that the you'd receive it in decreasing order ([commitZ, commitY])

You can feel this tension on display in pds/sync.test.ts where I ask for earliest: commit[2], latest: commit[15], but then have to compare to commit.slice(3, 16). Feels weird 🤔

You'll also notice that in getCommitPath on both memory-blockstore & sql-repo-storage, we have to reverse the path when returning because the path is by nature constructed from getting a reference latest and then working backwards to the earliest.

Any thoughts?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you got this right as-is, and the mismatch with slice() makes sense to me. This is similar but a little different than slice since we're slicing in "the opposite direction" if you can say that. I think the case that highlights it best is getCommitPath(x, null) vs. slice(x). In the former it's clear we mean to ask for all commits ending at (i.e. before) x; in the latter we want items starting from (i.e. after) x. To hammer this home, we could make earliest optional and default it to null: getCommitPath(x) vs. slice(x).

@@ -200,31 +209,17 @@ export class MST implements DataStore {
// -------------------

// Return the necessary blocks to persist the MST to repo storage
// If the topmost tree only has one entry and it's a subtree, we can eliminate the topmost tree
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my last MST bughunt adventures: we were trimming the top of the tree before saving, but i've moved that to the delete operation itself since it was screwing up our diff bookkeeping (& it's actually just more intuitive to do it on the operation anyway)

@dholms dholms marked this pull request as ready for review January 4, 2023 00:29
didResolver: DidResolver,
): Promise<VerifiedCheckout> => {
const repo = await ReadableRepo.load(storage, root)
const validSig = await didResolver.verifySignature(
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will say, I don't love having to thread the didResolver through all the sync & verify methods. But it's either that or doing the resolution beforehand & threading through the approved signingKey.
It also doesn't take into account the fact that a user's signing key can change yet. (& god forbid their DID changes, we just error out on that for now)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not seeing any clearly better options unless there's a natural default DidResolver, sort of similar to how in node there's a global/default http.Agent. Can also probably be smoothed over for consumers by assigning a DidResolver early on and threading it down here for them in some wrapper/convenience classes, etc.

import { Repo } from '@atproto/repo'
import SqlRepoStorage from '../../../sql-repo-storage'
import AppContext from '../../../context'
import { CID } from 'multiformats/cid'

export default function (server: Server, ctx: AppContext) {
server.com.atproto.sync.getRoot(async ({ params }) => {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another bikeshed, thoughts on switching this to getHead?

I've done this in the storage interfaces to avoid confusion with the similarly named root (which refers to the ipld node that the commit points to).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually i just decided to do it. lmk if you have any beef with it 🐮

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I'm cool with that

) => { success: true; data: T } | { success: false; error: ZodError }
}

export interface Def<T> {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wrapped the defs in this object so RepoStorage can give better errors

Copy link
Collaborator

@devinivy devinivy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dig this very much 💎

sig: Uint8Array,
): Promise<boolean> {
const signingKey = await this.resolveSigningKey(did)
return crypto.verifySignature(signingKey, data, sig)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see verifySignature's first argument is called did— safe to assume signingKey takes the form of a did:key?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup that's right. may be wroth making that explicit in the variable name in verifySignature however signingKey comes from the did spec & can either take the form of a did:key or i believe a jwk (tho at the moment, we require it to be a didKey)

packages/pds/src/api/com/atproto/sync.ts Outdated Show resolved Hide resolved
packages/pds/src/api/com/atproto/sync.ts Outdated Show resolved Hide resolved
packages/pds/src/api/com/atproto/sync.ts Show resolved Hide resolved
packages/repo/src/readable-repo.ts Outdated Show resolved Hide resolved
packages/repo/src/storage/memory-blockstore.ts Outdated Show resolved Hide resolved
packages/repo/src/sync.ts Outdated Show resolved Hide resolved
storage: RepoStorage,
repoCar: Uint8Array,
didResolver: DidResolver,
): Promise<{ root: CID; ops: RecordWriteOp[] }> => {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing to do about this here/now I don't think, but just a thought rattling around my brain. If someone is syncing these ops into e.g. their own index, and they want to keep their index consistent, they'd need to process all ops at once transactionally. If the ops were broken-up by commit, then they could gradually process commit-by-commit (or condensed into batches of commits), landing at a consistent state each step along the way. It would be useful to be able to ask something like "give me this history in ops batches of size roughly 100, always containing full commits."

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point. My motivation here was actually to simplify processing ops: for instance, if someone creates & deletes a record several times over a commit range, this will collapse it into one operation.

On second look, this should probably be an optional processing step done by the consumer: we return ops per commit & they can process them atomically by commit, or they can collapse them down into one big set of ops

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is actually trickier than i thought it'd be. you lose some info when going from diff -> op. specifically the CID associated with the del. It may make sense to start tracking cids in the operations as well as the records & include the cid in the delete operation as well 🤔

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah hmmm, I can see that. Could be nice! If it's too much of a rabbit hole for this PR though, I'm certainly not too hung-up on it right now.

Copy link
Collaborator Author

@dholms dholms Jan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Alright I ended up doing it, because it actually does make these functions much nicer & that extra info (relevant cids) is probably useful to some consumers.

Split it out into it's own PR here: #460

packages/repo/src/util.ts Show resolved Hide resolved
didResolver: DidResolver,
): Promise<VerifiedCheckout> => {
const repo = await ReadableRepo.load(storage, root)
const validSig = await didResolver.verifySignature(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not seeing any clearly better options unless there's a natural default DidResolver, sort of similar to how in node there's a global/default http.Agent. Can also probably be smoothed over for consumers by assigning a DidResolver early on and threading it down here for them in some wrapper/convenience classes, etc.

@dholms dholms mentioned this pull request Jan 4, 2023
@dholms dholms merged commit 423b6ab into sync-revamp Jan 5, 2023
@dholms dholms deleted the repo-checkout branch January 5, 2023 16:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants