Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cloudwatch_logs: remove sequence tokens from API calls #6741

Open
wants to merge 3 commits into
base: 1.9
Choose a base branch
from

Conversation

matthewfala
Copy link
Contributor

@matthewfala matthewfala commented Jan 26, 2023

Signed-off-by: Matthew Fala falamatt@amazon.com

This is a cleaned up version of the following branch https://github.com/PettitWesley/fluent-bit.git 1_9-sequence-token
This removes the sequence token feature entirely.

Also removes error handling for InvalidSequence and DataAlreadyAccepted exceptions

The sequence token is now ignored in PutLogEvents actions. PutLogEvents actions are always accepted and never return InvalidSequenceTokenException or DataAlreadyAcceptedException even if the sequence token is not valid. You can use parallel PutLogEvents actions on the same log stream.

https://docs.aws.amazon.com/AmazonCloudWatchLogs/latest/APIReference/API_PutLogEvents.html

This PR is in testing.


Enter [N/A] in the box, if an item is not applicable to your change.

Testing
Before we can approve your change; please submit the following in a comment:

  • Example configuration file for the change
  • Debug log output from testing the change
  • Attached Valgrind output that shows no leaks or memory corruption was found

If this is a change to packaging of containers or native binaries then please confirm it works for all targets.

  • Run local packaging test showing all targets (including any new ones) build.
  • Set ok-package-test label to test for all targets (requires maintainer to do).

Documentation

  • Documentation required for this feature

Backporting

  • Backport to latest stable release.

Fluent Bit is licensed under Apache 2.0, by submitting this pull request I understand that this code will be released under the terms of that license.

@matthewfala matthewfala temporarily deployed to pr January 26, 2023 19:05 — with GitHub Actions Inactive
@matthewfala matthewfala temporarily deployed to pr January 26, 2023 19:06 — with GitHub Actions Inactive
@matthewfala matthewfala temporarily deployed to pr January 26, 2023 19:22 — with GitHub Actions Inactive
@PettitWesley
Copy link
Contributor

Adding a note here that we discussed that we will update our cloudwatch_logs load tests to incrementally test higher throughput now that sequence tokens and ingestion limits are removed.

Right now we have 1, 2, 3 MB/s. We can incrementally increase it next release to 2, 4, 6 MB/s and see those results in our pipeline. If they are good, next time we can increment it again.

In theory, cloudwatch_logs after this change should be able to be tested at the same 30 MB/s that we test other plugins.

https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.30.0

Though @matthewfala noted that the cloudwatch_logs validation takes the longest, so we might need to improve that before we can increase the load too much or might have to stick with a lower load because of it.

CloudWatch can export logs to S3, we may want to look into how fast that is. Validating exported logs in S3 might possibly be faster for high throughput tests than validating against CW.

Comment on lines 1458 to 1505
error = flb_aws_error(c->resp.payload, c->resp.payload_size);
if (error != NULL) {
if (strcmp(error, ERR_CODE_INVALID_SEQUENCE_TOKEN) == 0) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flb_aws_error and flb_aws_print_error are basically the same.

Here were are now only still using flb_aws_error to check if we have an error so we choose to either print it with flb_aws_print_error or print the full API payload to debug if we can't parse the error.

If you made a small change to add a return value to flb_aws_print_error, it could do the check of whether or not here is an error that we can parse. This simplifies the code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does make the code simpler. I've updated the code to reflect this suggestion.

FLB_OUTPUT_RETURN(FLB_RETRY);
}

// TODO: this msg is innaccurate if events are skipped
flb_plg_debug(ctx->ins, "Sent %d events to CloudWatch", event_count);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You did remove it in the sequence token deprecation branch. That may be what you are thinking.

return -1;
}

struct cw_flush *new_buffer()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As you noted yourself to me, need to also remove the singleplex scheduler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The singleplex scheduler has been removed.

Copy link
Contributor Author

@matthewfala matthewfala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

About to force push my revision.

plugins/out_cloudwatch_logs/cloudwatch_api.c Show resolved Hide resolved
Comment on lines 1458 to 1505
error = flb_aws_error(c->resp.payload, c->resp.payload_size);
if (error != NULL) {
if (strcmp(error, ERR_CODE_INVALID_SEQUENCE_TOKEN) == 0) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does make the code simpler. I've updated the code to reflect this suggestion.

return -1;
}

struct cw_flush *new_buffer()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The singleplex scheduler has been removed.

FLB_OUTPUT_RETURN(FLB_RETRY);
}

// TODO: this msg is innaccurate if events are skipped
flb_plg_debug(ctx->ins, "Sent %d events to CloudWatch", event_count);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You did remove it in the sequence token deprecation branch. That may be what you are thinking.

@@ -605,7 +603,7 @@ static struct flb_config_map config_map[] = {
"errors, which may help improve throughput when there are transient/random "
"networking issues."
},

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove stray whitespace

@matthewfala matthewfala temporarily deployed to pr January 27, 2023 20:44 — with GitHub Actions Inactive
@matthewfala matthewfala temporarily deployed to pr January 27, 2023 20:45 — with GitHub Actions Inactive
@matthewfala matthewfala temporarily deployed to pr January 27, 2023 20:45 — with GitHub Actions Inactive
@matthewfala matthewfala temporarily deployed to pr January 27, 2023 20:59 — with GitHub Actions Inactive
flb_sds_destroy(error);
}
else {
/* some other error occurred; notify user */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: remove word "other"

* Returns 0 if an error cannot be found in the response,
* and 1 if the error is found and printed.
*/
void flb_aws_print_error(char *response, size_t response_len,
int flb_aws_print_error(char *response, size_t response_len,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why these return values?? Let's use either typical 0 success and -1 failure or FLB_TRUE and FLB_ERROR

/* some other error occurred; notify user */
is_error_parsed = flb_aws_print_error(c->resp.payload, c->resp.payload_size,
"PutRetentionPolicy", ctx->ins);
if (!is_error_parsed) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's use fluent bit best practice of comparing directly with FLB_TRUE or FLB_FALSE. I think its more readable too.

Comment on lines 1456 to 1458
if (!is_error_parsed) {
/* error could not be parsed, print raw response to debug */
flb_plg_debug(ctx->ins, "Raw response: %s", c->resp.payload);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think we should make this a non-debug log?? I kind of think we should, otherwise user is missing the clear error response.

Of course, this case shouldn't really happen unless we get a truncated response or something such that the AWS API error can't be parsed.

@matthewfala
Copy link
Contributor Author

Signed-off-by: Matthew Fala <falamatt@amazon.com>
Signed-off-by: Matthew Fala <falamatt@amazon.com>
Signed-off-by: Matthew Fala <falamatt@amazon.com>
@matthewfala matthewfala temporarily deployed to pr January 28, 2023 03:23 — with GitHub Actions Inactive
@matthewfala matthewfala temporarily deployed to pr January 28, 2023 03:23 — with GitHub Actions Inactive
@matthewfala matthewfala temporarily deployed to pr January 28, 2023 03:42 — with GitHub Actions Inactive
Copy link
Contributor

@PettitWesley PettitWesley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good idea on just printing the response in aws_print_error instead of my over complicated idea of the return value for the same!

@github-actions
Copy link
Contributor

github-actions bot commented May 1, 2023

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants