Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improved Cloudwatch output plugin Connect() #3335

Merged

Conversation

adamchainz
Copy link
Contributor

@adamchainz adamchainz commented Oct 12, 2017

Fixes #3319. Use STS to validate the provided credentials work, rather than ListMetricData which requires a separate permission, and document the required permission.

  • Signed CLA.
  • Associated README.md updated.
  • Has appropriate unit tests.

I'm not sure about unit tests since none exist using mock AWS, but I did some manual testing.

I made a test IAM user with only cloudwatch:PutMetricData permission, policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}

I then used its access key as environment variables locally and ran telegraf in this environment, with this config:

[agent]
  debug = "true"

[[outputs.cloudwatch]]
  region = "eu-west-1"
  namespace = "InfluxData/Telegraf"

[[inputs.cpu]]

I then successfully saw data flushed when running telegraf like so:

$ ./telegraf --config test.conf
2017-10-12T22:23:07Z D! Attempting connection to output: cloudwatch
2017-10-12T22:23:07Z D! Successfully connected to output: cloudwatch
2017-10-12T22:23:07Z I! Starting Telegraf v1.5.0~c74c29b1
2017-10-12T22:23:07Z I! Loaded outputs: cloudwatch
2017-10-12T22:23:07Z I! Loaded inputs: inputs.cpu
2017-10-12T22:23:07Z I! Tags enabled: host=Adams-MacBook-Pro-2.local
2017-10-12T22:23:07Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"Adams-MacBook-Pro-2.local", Flush Interval:10s
2017-10-12T22:23:20Z D! Output [cloudwatch] buffer fullness: 9 / 10000 metrics.
2017-10-12T22:23:21Z D! Output [cloudwatch] wrote batch of 9 metrics in 976.776485ms
^C2017-10-12T22:23:26Z I! Hang on, flushing any cached metrics before shutdown
2017-10-12T22:23:26Z D! Output [cloudwatch] buffer fullness: 0 / 10000 metrics.

And I also checked with bad credentials and saw the error message:

$ AWS_ACCESS_KEY_ID=foobar ./telegraf --config test.conf
2017-10-12T22:24:50Z E! cloudwatch: Cannot use credentials to connect to AWS : InvalidClientTokenId: The security token included in the request is invalid.
    status code: 403, request id: 2824d227-af9c-11e7-aa9b-f5e6ab3e0c4c
2017-10-12T22:24:50Z E! Failed to connect to output cloudwatch, retrying in 15s, error was 'InvalidClientTokenId: The security token included in the request is invalid.
    status code: 403, request id: 2824d227-af9c-11e7-aa9b-f5e6ab3e0c4c'

Fixes influxdata#3316. Use STS to validate the provided credentials work, rather than `ListMetricData` which requires a separate permission, and document the required permission.

- [x] Signed [CLA](https://influxdata.com/community/cla/).
- [x] Associated README.md updated.
- [x] Has appropriate unit tests.

I'm not sure about unit tests since none exist using mock AWS, but I did some manual testing.

I made a test IAM user with only `cloudwatch:PutMetricData` permission, policy:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": [
                "*"
            ]
        }
    ]
}
```

I then used its access key as environment variables locally and ran telegraf in this environment, with this config:

```ini
[agent]
  debug = "true"

[[outputs.cloudwatch]]
  region = "eu-west-1"
  namespace = "InfluxData/Telegraf"

[[inputs.cpu]]
```

I then successfully saw data flushed when running telegraf like so:

```
$ ./telegraf --config test.conf
2017-10-12T22:23:07Z D! Attempting connection to output: cloudwatch
2017-10-12T22:23:07Z D! Successfully connected to output: cloudwatch
2017-10-12T22:23:07Z I! Starting Telegraf v1.5.0~c74c29b1
2017-10-12T22:23:07Z I! Loaded outputs: cloudwatch
2017-10-12T22:23:07Z I! Loaded inputs: inputs.cpu
2017-10-12T22:23:07Z I! Tags enabled: host=Adams-MacBook-Pro-2.local
2017-10-12T22:23:07Z I! Agent Config: Interval:10s, Quiet:false, Hostname:"Adams-MacBook-Pro-2.local", Flush Interval:10s
2017-10-12T22:23:20Z D! Output [cloudwatch] buffer fullness: 9 / 10000 metrics.
2017-10-12T22:23:21Z D! Output [cloudwatch] wrote batch of 9 metrics in 976.776485ms
^C2017-10-12T22:23:26Z I! Hang on, flushing any cached metrics before shutdown
2017-10-12T22:23:26Z D! Output [cloudwatch] buffer fullness: 0 / 10000 metrics.
```

And I also checked with bad credentials and saw the error message:

```
$ AWS_ACCESS_KEY_ID=foobar ./telegraf --config test.conf
2017-10-12T22:24:50Z E! cloudwatch: Cannot use credentials to connect to AWS : InvalidClientTokenId: The security token included in the request is invalid.
    status code: 403, request id: 2824d227-af9c-11e7-aa9b-f5e6ab3e0c4c
2017-10-12T22:24:50Z E! Failed to connect to output cloudwatch, retrying in 15s, error was 'InvalidClientTokenId: The security token included in the request is invalid.
    status code: 403, request id: 2824d227-af9c-11e7-aa9b-f5e6ab3e0c4c'
```
Copy link
Contributor

@danielnelson danielnelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, one thing that comes to mind is that previously we knew if we had ListMetricData authorization, but now we only know if we can authenticate. Granted, knowing about this permission was not helpful because it was the wrong one, but I wonder if there is a way to test up front if the user has the correct permissions setup.

Ultimately it will be reported on write so I don't think it is crucial.

@adamchainz
Copy link
Contributor Author

There is no way to test for specific permissions on AWS except to try the action. Some EC2 endpoints have a "dry run" mode for checking but they seem to have ditched this idea.

@danielnelson danielnelson merged commit bf9f94e into influxdata:master Oct 13, 2017
@danielnelson danielnelson added this to the 1.5.0 milestone Oct 13, 2017
@danielnelson
Copy link
Contributor

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants