-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Write state diffs directly to Postgres #30
Write state diffs directly to Postgres #30
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still need a chance to take a closer look at the code but so far this looks great! In regards to those questions:
- One option is to go ahead and rebase this on top of the latest state of statediff_at_anyblock-1.9.11, which shouldn't be too messy since we touched mostly different sections of code, and then merge this and apply the code=>codehash update in a separate PR.
- I think the CLI params like that is good. If anything I'd recommend splitting the URI out into separate CLI flags (hostname, port, database, user, password).
- I think leaving open/idle/lifetime as default is fine for now, and we can add stuff like that in other PRs to keep this one more focused (already a very significant change w/ large diff!)
- Agree on Transformer name- DBWriter makes more sense. Or what do think about renaming the current Indexer to DBWriter/PGWriter and the Transformer to Indexer?
- Yeah it's not possible to write to a single sql.Tx from within separate goroutines, unfortunately. So it would require some additional rework to be able to use the concurrent iterator, but I think it's still possible e.g. we could have all the separate goroutines of the concurrent builder/iterator send their processed StateNode objects out a single channel where a lone worker receives and writes them into a single transaction.
|
Sooner rather than later, please. |
Will try to finish up my review today @roysc In regards to the CLI vs .toml config- I don't have a strong preference but for the sake of usability inside of a docker container in staging environments it is often more convenient to be able to pass config params in by CLI flag or env variable than by .toml file. I might challenge the idea we should merge eth-statediff-service into here, as the standalone service binary is much smaller and quicker to build and it allows us to provide focused top-level README documentation specific to that service. But I think I could be convinced of the advantages of merging in here. |
Writes state diffs directly to postgres Adds CLI flags to configure PG Refactors builder output with callbacks Copies refactored postgres handling code from ipld-eth-indexer
edf2c37
to
51d2054
Compare
less ambiguous
d2951c1
to
74a972f
Compare
Alright, I incorporated the SQL output tests from What remains -
@AFDudley I can start on concurrency as soon as we merge this. |
had to rf some types for this
ee7d659
to
dc18a76
Compare
migrations and metrics...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! The main thing is that we need to add a new service loop that automatically writes diffs at the head of chain. Only a couple small comments aside from that.
Eventually we'll want to figure out a better way of sharing packages between this and ipld-eth-indexer (use a 3rd repo for some stuff).
And eventually we should convert these ported tests away from ginkgo/gomega since go-ethereum uses the standard lib test package.
cmd/utils/flags.go
Outdated
@@ -1633,12 +1645,15 @@ func RegisterGraphQLService(stack *node.Node, endpoint string, cors, vhosts []st | |||
} | |||
|
|||
// RegisterStateDiffService configures and registers a service to stream state diff data over RPC | |||
func RegisterStateDiffService(stack *node.Node) { | |||
// dbParams are: Postgres connection URI, Node ID, client name | |||
func RegisterStateDiffService(stack *node.Node, dbParams *[3]string) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be better to define a new StateDiffParams struct or pass the three arguments explicitly rather than this dbParams array pointer, especially since it shows up as an argument in more than one function.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely - this was a bit of a punt, meant to write a TODO
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool cool yeah this doesn't need to be done now
|
||
// Env variables | ||
const ( | ||
DATABASE_NAME = "DATABASE_NAME" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally we would pass individual db connection string components in by CLI flag in a manner that is overridable by these env variables
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this doesn't need to be resolved now
@@ -405,3 +439,49 @@ func (sds *Service) StreamCodeAndCodeHash(blockNumber uint64, outChan chan<- Cod | |||
} | |||
}() | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need another method that we turn on by CLI param, which automatically writes statediffs to Postgres as they stream off the chainEvents channel of the syncing chain. This will look almost exactly like Loop()
method on line 154, except instead of sending the diff to subscribers it writes directly to Postgres.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense - would that just be triggered via CLI or also toggleable by an RPC call?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think just turn it on by CLI is fine for now
@roysc Btw I don't think we should try to port the concurrency stuff into here at this time, there will be data races with the regular ongoing sync processes plus the statediff service needs to operate through the same StateDB object as the BlockChain to utilize it's in-memory cache. This is another reason to maintain the separate repos IMO. Rather, I think we should port these changes into eth-statediff-service and in doing so utilize the concurrent node iterator used there. |
todo - do something meaningful to test write loop
96265c1
to
2b25daf
Compare
Hey @roysc, tested this locally and it appears to be working! Using Seeing results in every cid table (including uncles and receipts) Also working without Although there is this error when shutting it down in this mode: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noting some changes that don't necessarily need to be made in this PR.
node "github.com/ipfs/go-ipld-format" | ||
"github.com/jmoiron/sqlx" | ||
"github.com/multiformats/go-multihash" | ||
"github.com/sirupsen/logrus" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should update all the ported packages to use Geth's log package
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, we should update any tests we ported over to use the default test pkg that the rest of geth uses instead of ginko/gomega
statediff/indexer/indexer.go
Outdated
traceMsg += fmt.Sprintf("postgres transaction commit duration: %s\r\n", tDiff.String()) | ||
} | ||
traceMsg += fmt.Sprintf(" TOTAL PROCESSING DURATION: %s\r\n", time.Now().Sub(start).String()) | ||
log.Info(traceMsg) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should change this to the Debug level
drop ginkgo/gomega
@i-norden Refactored the logging and tests - (btw. noticed we have the gomega module vendored in, guessing we can remove it). |
@roysc yes please get rid of the vendor/ dir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. I think we should go ahead and merge this so we can cut a release and apply the rebase on top. Running it locally things look good and I have confirmed the codehash=>code is being stored correctly. There is a pretty big lag between the head in the Postgres db and the head of the current block import but I'm not surprised considering how quickly it imports chunks of blocks while catching up to head and the limiting resources of my computer.
This provides a new RPC method
statediff_WriteStateDiffAt
to write diffs directly into PostgreSQL upon client request (per cerc-io/eth-statediff-service#11). To do this, I copied over and refactored some of the tooling from https://github.com/vulcanize/ipld-eth-indexer.@i-norden How should we merge the code hash & storage output from #27? I suppose it depends on how that data is consumed.
More minor questions: