feat: add influxdb trace collection #970
```diff
@@ -24,12 +24,17 @@ import (
 	"github.com/tendermint/tendermint/libs/service"
 	tmsync "github.com/tendermint/tendermint/libs/sync"
 	"github.com/tendermint/tendermint/p2p"
+	"github.com/tendermint/tendermint/pkg/remote"
 	tmproto "github.com/tendermint/tendermint/proto/tendermint/types"
 	sm "github.com/tendermint/tendermint/state"
 	"github.com/tendermint/tendermint/types"
 	tmtime "github.com/tendermint/tendermint/types/time"
 )
 
+const (
+	consensusTable = "consensus"
+)
+
 // Consensus sentinel errors
 var (
 	ErrInvalidProposalSignature = errors.New("error invalid proposal signature")
```
```diff
@@ -140,6 +145,8 @@ type State struct {
 
 	// for reporting metrics
 	metrics *Metrics
+
+	eventCollector *remote.Client
 }
 
 // StateOption sets an optional parameter on the State.
```
```diff
@@ -170,6 +177,7 @@ func NewState(
 		evpool:  evpool,
 		evsw:    tmevents.NewEventSwitch(),
 		metrics: NopMetrics(),
+		eventCollector: &remote.Client{},
 	}
 
 	// set function defaults (may be overwritten before calling Start)
```
@@ -211,6 +219,11 @@ func StateMetrics(metrics *Metrics) StateOption { | |
return func(cs *State) { cs.metrics = metrics } | ||
} | ||
|
||
// StateEventCollector sets the remote event collector. | ||
func StateEventCollector(ec *remote.Client) StateOption { | ||
evan-forbes marked this conversation as resolved.
Show resolved
Hide resolved
|
||
return func(cs *State) { cs.eventCollector = ec } | ||
} | ||
|
||
// String returns a string. | ||
func (cs *State) String() string { | ||
// better not to access shared variables | ||
|
```diff
@@ -686,6 +699,12 @@ func (cs *State) newStep() {
 
 	cs.nSteps++
 
+	cs.eventCollector.WritePoint(consensusTable, map[string]interface{}{
+		"height": rs.Height,
+		"round":  rs.Round,
+		"step":   rs.Step,
+	})
```
**Review thread on the `WritePoint` call:**

> Instead of duplicating ourselves, can we implement the …

> Would we be able to collect arbitrary trace data that way? Like, if I wanted a summary of each message sent and received over the wire, would we be able to put that in a gauge or counter?
>
> I'm not exactly sure how we will use this yet. Initially, I assumed that we would create proprietary one-off branches that push very specific trace data to the db for us to examine some scenario or component. However, I could see having some default set of data collection that could be turned on simply by configuring a url and token also being useful.

> I've since removed the above code, so by default we're not actually sending any data to the db even if we have it configured. However, we can add things or refactor in the future!

> So from where I stand, the utility in this is for analysing feature developments and potentially for bug detection and introspection. I currently have use cases for the former, but I can also imagine the latter being helpful on occasion. I think you're right that for feature developments this would occur on some branch, but with any modifications it's essential that we compare against the baseline (i.e. the current network version).
>
> Secondly, we're probably not going to be looking at one or two data points but watching for changes across several dimensions (which metrics should be tracking), such as the number of rounds, the block time, bandwidth for certain messages, etc. We need to ensure not only that the hypothesis is correct, but also that there aren't any regressions in a different part of the system. These will often be the same data points, so by capturing all metrics by default we save having to rewrite this every time we want to analyse something.
>
> As well as feature development, we can and probably should be using this tool to analyse usage patterns. If done right, it may even help us understand what areas to focus on/prioritise (i.e. if bandwidth is still a bottleneck, do we need to start doing some research into a narwhalesque implementation?). Lastly, if we come across some major problems, tracking all metrics could be helpful in any retrospective work.
>
> So my conclusion would be to have the simple option of persisting all metrics as data points straight out of the box, while allowing developers to inject any custom tracing where they see fit.

> I agree, but wasn't sure how much data we want to collect atm and felt like that could be pushed to a different PR. Added #978! 🙂 For example, I think collecting the consensus info makes sense here, but what about a trace of all of the messages sent over the wire? That seems both very useful in some cases, but also very heavy, so idk.

> Also, sorry, I think I misunderstood this at first to say that we should refactor the eventCollector to use gauges and counters for collection. Being able to keep everything in a single db is a good idea 😅 #979

> We could look at filtering metrics by reactor (since they're already divided that way), similar to the filtered logger Tendermint had in the past.
```diff
+
 	// newStep is called by updateToState in NewState before the eventBus is set!
 	if cs.eventBus != nil {
 		if err := cs.eventBus.PublishEventNewRoundStep(rs); err != nil {
```
**Review thread on naming and configuration:**

> Kinda just threw these names together, so I'm definitely open to better names. I'm leaning towards just calling it an influxdb client, because that's basically what it is.

> Defer to you b/c `EventCollector` or `InfluxDBClient` both SGTM.

> The naming looks good to me, but it may be beneficial to include a brief comment about the service it provides.

> You don't think this should fall under `Instrumentation`?

> It could fall under instrumentation, but I'm viewing trace-level data collection as much more arbitrary and specific atm. It's not really something we use to get a glance at what is happening on the network; it's something we use in a more one-off way to debug really specific issues or measure very specific things. Perhaps in the future we'll have traces that we look at routinely, but if that's the case, then perhaps we should make that thing a metric instead. We can include it in that config though.

> Why not? :) Aren't you using it now to work out how often we have multi-round heights in our testnets? Anyway, this strikes me as `Instrumentation`.

> Okay cool, will add to the instrumentation config instead 👍
>
> If I'm interpreting the question correctly, I think this comes down to how much detail is being extracted or summarized. For example, if we simply wanted to view how many blocks took multiple rounds to come to consensus, then we could do that in a simple, lightweight way by using the existing metrics to store a summary of the data. That summary is useful to "glance" at. However, if we wanted to examine why those heights are taking multiple rounds, then we would need something significantly less summarized that inherently takes more time to analyze. I feel like I'm not answering the question though, so pls ask slightly differently if that's the case 🙂
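A rough sketch of what the collector settings could look like once folded into the existing `[instrumentation]` section of `config.toml` (the `influx_*` key names are illustrative guesses, not necessarily what the PR adopted):

```toml
[instrumentation]
# Existing Prometheus settings already live in this section.
prometheus = false
prometheus_listen_addr = ":26660"

# Hypothetical trace-collection settings: a URL and token are the two
# values the discussion above says are needed to enable collection.
influx_url = "http://localhost:8086"
influx_token = ""
```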
> Also, when we store the data in a detailed (as opposed to summarized) way, then in order to glance at it we have to write significantly more complex queries, which we don't currently have and which can be time-consuming to write.