Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Write state diffs directly to Postgres #30

Merged
merged 19 commits into from
Nov 11, 2020

Conversation

roysc
Copy link

@roysc roysc commented Oct 21, 2020

This provides a new RPC method statediff_WriteStateDiffAt to write diffs directly into PostgreSQL upon client request (per cerc-io/eth-statediff-service#11). To do this, I copied over and refactored some of the tooling from https://github.com/vulcanize/ipld-eth-indexer.

@i-norden How should we merge the code hash & storage output from #27? I suppose it depends on how that data is consumed.

More minor questions:

  • I added three new CLI parameters to pass in the DB connection URI, node ID, and client name. It may be cleaner to pass these as a single comma separated string or even a separate toml file.
  • Do we still want to configure the max open/idle/lifetime DB parameters, or leave these at default for now?
  • "Transformer" is not really appropriate anymore, maybe DBWriter?
  • I'm not sure it will be possible to perform writes concurrently and still wrap them in an atomic transaction - I suspect not, but haven't found a definitive answer.

@roysc roysc changed the title Statediff into postgres Write state diffs directly to Postgres Oct 21, 2020
Copy link
Collaborator

@i-norden i-norden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still need a chance to take a closer look at the code but so far this looks great! In regards to those questions:

  • One option is to go ahead and rebase this on top of the latest state of statediff_at_anyblock-1.9.11, which shouldn't be too messy since we touched mostly different sections of code, and then merge this and apply the code=>codehash update in a separate PR.
  • I think the CLI params like that is good. If anything I'd recommend splitting the URI out into separate CLI flags (hostname, port, database, user, password).
  • I think leaving open/idle/lifetime as default is fine for now, and we can add stuff like that in other PRs to keep this one more focused (already a very significant change w/ large diff!)
  • Agree on Transformer name- DBWriter makes more sense. Or what do think about renaming the current Indexer to DBWriter/PGWriter and the Transformer to Indexer?
  • Yeah it's not possible to write to a single sql.Tx from within separate goroutines, unfortunately. So it would require some additional rework to be able to use the concurrent iterator, but I think it's still possible e.g. we could have all the separate goroutines of the concurrent builder/iterator send their processed StateNode objects out a single channel where a lone worker receives and writes them into a single transaction.

@roysc
Copy link
Author

roysc commented Oct 23, 2020

  • Roger on rebasing for code/hash stuff
  • I'll leave the CLI like it is then, but if we change it I think I would vote for using a separate toml file just in the name of cleanliness of config handling (and allowing to reuse config files for the indexer service)
  • Agreed on naming, and in that case I may also want to rename the indexer's methods from index*CID to merge*CID or upsert*CID, to be more direct
  • Yeah, consuming thread + channels should work, and still improve performance. I would also really prefer to merge this and the standalone statediff service into a single codebase once we get to that point.

@AFDudley
Copy link

Yeah, consuming thread + channels should work, and still improve performance. I would also really prefer to merge this and the standalone statediff service into a single codebase once we get to that point.

Sooner rather than later, please.

@i-norden
Copy link
Collaborator

i-norden commented Oct 23, 2020

Will try to finish up my review today @roysc

In regards to the CLI vs .toml config- I don't have a strong preference but for the sake of usability inside of a docker container in staging environments it is often more convenient to be able to pass config params in by CLI flag or env variable than by .toml file.

I might challenge the idea we should merge eth-statediff-service into here, as the standalone service binary is much smaller and quicker to build and it allows us to provide focused top-level README documentation specific to that service. But I think I could be convinced of the advantages of merging in here.

Writes state diffs directly to postgres

Adds CLI flags to configure PG

Refactors builder output with callbacks

Copies refactored postgres handling code from ipld-eth-indexer
@roysc roysc force-pushed the statediff-into-postgres branch from edf2c37 to 51d2054 Compare October 23, 2020 13:48
@roysc roysc force-pushed the statediff-into-postgres branch from d2951c1 to 74a972f Compare October 23, 2020 14:04
@roysc
Copy link
Author

roysc commented Oct 23, 2020

Alright, I incorporated the SQL output tests from ipld-eth-indexer and otherwise this is pretty much covered by the existing test suite since it's behind the original Builder interface.

What remains -

  • handling the code & code hash, we could just copy over how it's handled in the indexer, but if that output is significant enough it can also be done iteratively
  • we will need to trigger these write calls from somewhere - I have some basic code in the ipld-eth-indexer to test it
  • update the metrics output

@AFDudley I can start on concurrency as soon as we merge this.

@roysc roysc force-pushed the statediff-into-postgres branch from ee7d659 to dc18a76 Compare October 24, 2020 14:01
Copy link
Collaborator

@i-norden i-norden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! The main thing is that we need to add a new service loop that automatically writes diffs at the head of chain. Only a couple small comments aside from that.

Eventually we'll want to figure out a better way of sharing packages between this and ipld-eth-indexer (use a 3rd repo for some stuff).

And eventually we should convert these ported tests away from ginkgo/gomega since go-ethereum uses the standard lib test package.

@@ -1633,12 +1645,15 @@ func RegisterGraphQLService(stack *node.Node, endpoint string, cors, vhosts []st
}

// RegisterStateDiffService configures and registers a service to stream state diff data over RPC
func RegisterStateDiffService(stack *node.Node) {
// dbParams are: Postgres connection URI, Node ID, client name
func RegisterStateDiffService(stack *node.Node, dbParams *[3]string) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it'd be better to define a new StateDiffParams struct or pass the three arguments explicitly rather than this dbParams array pointer, especially since it shows up as an argument in more than one function.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definitely - this was a bit of a punt, meant to write a TODO

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool cool yeah this doesn't need to be done now


// Env variables
const (
DATABASE_NAME = "DATABASE_NAME"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally we would pass individual db connection string components in by CLI flag in a manner that is overridable by these env variables

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: this doesn't need to be resolved now

@@ -405,3 +439,49 @@ func (sds *Service) StreamCodeAndCodeHash(blockNumber uint64, outChan chan<- Cod
}
}()
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need another method that we turn on by CLI param, which automatically writes statediffs to Postgres as they stream off the chainEvents channel of the syncing chain. This will look almost exactly like Loop() method on line 154, except instead of sending the diff to subscribers it writes directly to Postgres.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense - would that just be triggered via CLI or also toggleable by an RPC call?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think just turn it on by CLI is fine for now

@i-norden
Copy link
Collaborator

i-norden commented Oct 26, 2020

@roysc Btw I don't think we should try to port the concurrency stuff into here at this time, there will be data races with the regular ongoing sync processes plus the statediff service needs to operate through the same StateDB object as the BlockChain to utilize it's in-memory cache. This is another reason to maintain the separate repos IMO.

Rather, I think we should port these changes into eth-statediff-service and in doing so utilize the concurrent node iterator used there.

roysc added 2 commits October 28, 2020 14:38
todo - do something meaningful to test write loop
@roysc roysc force-pushed the statediff-into-postgres branch from 96265c1 to 2b25daf Compare October 28, 2020 06:38
@i-norden
Copy link
Collaborator

i-norden commented Nov 4, 2020

Hey @roysc, tested this locally and it appears to be working!

Using
./build/bin/geth --syncmode=full --gcmode=archive --statediff --statediff.writing --statediff.db=postgres://localhost:5432/vulcanize_testing?sslmode=disable --statediff.dbnodeid=yeee --statediff.dbclientname=yooo

Seeing results in every cid table (including uncles and receipts)

Also working without --gcmode=archive

Although there is this error when shutting it down in this mode: ERROR[11-04|11:06:19.592] Dangling trie nodes after full cleanup , which I think may actually be appropriate in our case since we purposely prevent dereferencing of nodes we have not diffed yet.

Copy link
Collaborator

@i-norden i-norden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting some changes that don't necessarily need to be made in this PR.

node "github.com/ipfs/go-ipld-format"
"github.com/jmoiron/sqlx"
"github.com/multiformats/go-multihash"
"github.com/sirupsen/logrus"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should update all the ported packages to use Geth's log package

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, we should update any tests we ported over to use the default test pkg that the rest of geth uses instead of ginko/gomega

traceMsg += fmt.Sprintf("postgres transaction commit duration: %s\r\n", tDiff.String())
}
traceMsg += fmt.Sprintf(" TOTAL PROCESSING DURATION: %s\r\n", time.Now().Sub(start).String())
log.Info(traceMsg)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should change this to the Debug level

@roysc
Copy link
Author

roysc commented Nov 5, 2020

@i-norden Refactored the logging and tests - (btw. noticed we have the gomega module vendored in, guessing we can remove it).

@i-norden
Copy link
Collaborator

i-norden commented Nov 5, 2020

@roysc yes please get rid of the vendor/ dir

Copy link
Collaborator

@i-norden i-norden left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I think we should go ahead and merge this so we can cut a release and apply the rebase on top. Running it locally things look good and I have confirmed the codehash=>code is being stored correctly. There is a pretty big lag between the head in the Postgres db and the head of the current block import but I'm not surprised considering how quickly it imports chunks of blocks while catching up to head and the limiting resources of my computer.

@i-norden i-norden merged commit 4f46ea0 into statediff_at_anyblock-1.9.11 Nov 11, 2020
@i-norden i-norden deleted the statediff-into-postgres branch September 10, 2021 19:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants