Skip to content
This repository has been archived by the owner on Apr 29, 2020. It is now read-only.

Mpuncel/rc audit logs #877

Merged
merged 5 commits into from
Jul 10, 2017
Merged

Mpuncel/rc audit logs #877

merged 5 commits into from
Jul 10, 2017

Conversation

mpuncel
Copy link
Collaborator

@mpuncel mpuncel commented Jun 22, 2017

No description provided.

@@ -396,7 +401,8 @@ func (rc *replicationController) ensureConsistency(current types.PodLocations) e
}

rc.logger.WithField("node", pod.Node).WithField("intentManifestSHA", intentSHA).Info("Found inconsistency in scheduled manifest")
if err := rc.schedule(ctx, pod.Node); err != nil {

if err := rc.scheduleNoAudit(ctx, pod.Node); err != nil {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

potentially we could just do audit logging in ensureConsistency() as well. it's going to produce an extra audit log record unnecessarily which is a downside but the upside would mean we don't need this scheduleNoAudit() function and we also don't need the Context() getter on the auditing transaction, and it would prevent someone from doing transaction.Commit(txn.Context() in the wrong place (and thus not creating an audit log record)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually we need that Context() getter unless we're going to move node scheduling logic into the auditing transaction as well. i think things are fine the way they are

@mpuncel mpuncel force-pushed the mpuncel/rc-audit-logs branch 3 times, most recently from 3705d4d to 7c75acf Compare June 22, 2017 23:02
PodIDLabel = rcstore.PodIDLabel // TODO: put this in a different place now that multiple packages use it
AvailabilityZoneLabel = types.AvailabilityZoneLabel
ClusterNameLabel = types.ClusterNameLabel
PodIDLabel = types.PodIDLabel // TODO: put this in a different place now that multiple packages use it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the TODO comment because it has been satisfied?


type auditingTransaction struct {
ctx context.Context
nodes map[types.NodeName]struct{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this a map of types.NodeName to struct{}? I've been using map[types.NodeName]bool to create sets for fast lookup. is this analogous?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could have done bool, but struct{} occupies 0 bytes of memory so it's slightly more "efficient" even though it doesn't really matter here.

source: https://dave.cheney.net/2014/03/25/the-empty-struct

@@ -226,19 +229,19 @@ func (rc *replicationController) addPods(current types.PodLocations) error {

rc.logger.NoFields().Infof("Need to schedule %d nodes out of %s", toSchedule, possible)

ctx, cancelFunc := transaction.New(context.Background())
txn, cancelFunc := rc.newAuditingTransaction(context.Background(), currentNodes)
for i := 0; i < toSchedule; i++ {
// create a new context for every 5 nodes. This is done to make sure
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unrelated to the PR here, but would you mind adding a link to this issue in this comment? hashicorp/consul#2921
We'll be able to simplify this code soon.

@rudle
Copy link
Contributor

rudle commented Jul 10, 2017

What kind of metrics should we watch when this rolls out? I think it could cause a write spike, especially for pods like p2-heartbeat

@mpuncel
Copy link
Collaborator Author

mpuncel commented Jul 10, 2017

I don't think it will show up in our metrics, we're not increasing the number of operations performed by the RC farm, we're just adding an extra write in each operation. Bandwidth will increase but it will likely be invisible compared to other sources of bandwidth fluctuation

This allows expanding the labels.ApplicatorWithoutWatches interface
which makes it easier to use in various binaries that switch on using
HTTP applicators vs direct consul access.
An rc retargeting event represents an RC changing the set of nodes it
targets, e.g. due to a replica count change, selector change, or a
change to the labels of a node.

This commit creates the json schema for the details for an event as well
as a constructor for easily creating the json.RawMessage for use in an
audit log record.
RCs have a "pod_labels" field which represents a set of labels that
should be applied to every pod that is scheduled by that RC. Previously
there has been no requirement that that set of labels includes the
availability_zone and cluster_name to identify the pod cluster that the
RC and its pods are members of. pod_id can be inferred from the pod
manifest and is already automatically set.

This commit enforces that the set of labels have availability zone and
cluster name in anticipation of having RCs create audit log records
every time the set of nodes they target changes. This is because audit
log records are linked to pod clusters, and as a result RCs must be as
well.
An audit log record will be created whenever nodes are scheduled or
unscheduled by an RC. This will allow constructing a history of which
nodes were managed by an RC or pod cluster over time.
@mpuncel mpuncel merged commit 36f7849 into square:master Jul 10, 2017
@mpuncel mpuncel deleted the mpuncel/rc-audit-logs branch July 10, 2017 19:01
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants