timeout while waiting for state to become 'success' (timeout: 2m0s) #780
Could be useful to introduce these for the operations as well, for additional control: https://developer.hashicorp.com/terraform/language/resources/syntax#operation-timeouts
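For context, the linked feature is Terraform's standard `timeouts` block, which a resource only honors if the provider implements it. A hypothetical sketch of what the suggestion would look like on a PagerDuty resource, assuming the provider added support (the resource and attribute names here are illustrative):

```hcl
resource "pagerduty_service" "example" {
  name              = "Example Service"
  escalation_policy = pagerduty_escalation_policy.default.id

  # Hypothetical: this block only takes effect if the provider wires it
  # into its CRUD operations; core Terraform does not enforce it alone.
  timeouts {
    create = "10m"
    read   = "5m"
  }
}
```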
What's strange with this issue is that (in our case) it works from our personal computers (100% of runs pass) but fails (>95% of runs fail) from GitHub Actions. And it seems that each failure has a "random" number of failed items, so maybe it's related to some PD rate limiting at the host level or something?
Overall this issue is annoying as hell.
Same here. Our developers apply terraform via a GitHub Action and we are seeing the same thing.
We have just run into this as well with our first GH Actions deploy, using a scoped OAuth client credential (app) which only this one project uses, for one deployment at a time. No issues during development on a local machine with multiple deploys and tear-downs of the stack, but going to staging and prod with this errored. I retried the staging job twice (the second time after waiting a while reading issues on GitHub) and then the prod one went through. Seems I may have gotten lucky on GitHub Actions with a new runner or exit IP... there may be an undocumented IP-address-based limit in play?
To test if it's a GH (network) issue, I created a self-hosted runner and tried a simple case run yesterday.
Something like that failed on the second run.
We are also testing moving our terraform actions to self-hosted runners and are monitoring to see if the timeouts go away
@ingwarsw legend, you just saved me from testing a self-hosted runner. I have even seen an error on this:
which fails after 5 minutes of spinning. Only seen this on GitHub Actions; local machines work 100% of the time.
Hey folks! I prepared this repository for trying to replicate the error, and after several attempts (new commits and Actions re-runs), I can tell that I haven't had success 😅 If I captured correctly what you all have been noting, the repository meets the following conditions for trying to reproduce the error:
On top of that, I added verbose (secured) logging for debugging the error and, in the end, finding out what's going on. As you have been pointing out, locally the TF plan/apply works flawlessly, and even in TF Cloud runners too (I did the test just in case). Therefore, I would really appreciate it if any of you could submit a few PRs, so you can help me replicate this error and find the culprit please 🙏🏽. I'll do my best to stay tuned and promptly merge your PRs till we reproduce the error and hopefully catch the bug in the logs. Thanks in advance for your help and patience.
@imjaroiswebdev can you try adding a bunch of user / team lookups? We are suspicious that our pagerduty schedule definitions cause a bunch of cascading requests having to look up each user by email and such |
Here's a sanitized example of how we define teams and schedules:

```hcl
locals {
  team = "DevInfra"
  members = [
    "bogus1@pagerduty.com",
    "bogus2@pagerduty.com",
    "bogus3@pagerduty.com",
    "bogus4@pagerduty.com",
    "bogus5@pagerduty.com",
  ]
  start   = "2023-11-27T14:30:00-07:00"
  manager = "bogus1@pagerduty.com"
}

resource "pagerduty_team" "default" {
  name = local.team
}

data "pagerduty_user" "team" {
  for_each = toset(local.members)
  email    = each.key
}

data "pagerduty_user" "manager" {
  email = local.manager
}

resource "pagerduty_schedule" "default" {
  name      = "${local.team} schedule"
  time_zone = "America/Los_Angeles"

  layer {
    name                         = "${local.team} Ops Leads"
    start                        = local.start
    rotation_virtual_start       = local.start
    rotation_turn_length_seconds = 60 * 60 * 24 * 7
    users                        = [for member in local.members : data.pagerduty_user.team[member].id]
  }

  teams = [pagerduty_team.default.id]
}

resource "pagerduty_escalation_policy" "default" {
  name  = "${local.team} Escalation Policy"
  teams = [pagerduty_team.default.id]

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "schedule_reference"
      id   = pagerduty_schedule.default.id
    }
  }

  rule {
    escalation_delay_in_minutes = 10
    target {
      type = "user_reference"
      id   = data.pagerduty_user.manager.id
    }
  }
}
```

edit: PR imjaroiswebdev/pd-tfprovider-issue-780-experiment#1
Hey @austinpray-mixpanel thank you very much for your help; however, this configuration wasn't enough to replicate the error, look 😩
We appreciate the effort @imjaroiswebdev. Are you able to check internally if there is any rate limiting at the host/IP level in addition to the new rate-limiting rules published publicly last year? That may explain this issue better than a standard reproduction. I have only had 1 of 5 new deployments fail since I posted; however, that job was failing repeatedly on
I was finally able to reproduce the issue here; I decided to re-run the job until it failed because of this. I believe last time I didn't try it enough. I just meant to update you all to let you know I'm researching further into this with other engineering teams to catch the culprit and get back to you with a solution, workaround, or something 💪🏽
@imjaroiswebdev sorry for the late reply, I run into the issue when running from an Azure DevOps MS-hosted agent (similar to a GH runner). The issue has not presented itself locally. I see someone else provided code, but here's what I'm running:

```hcl
terraform {
  required_providers {
    azurerm = {
      source = "hashicorp/azurerm"
    }
    pagerduty = {
      source = "pagerduty/pagerduty"
    }
  }
}

resource "pagerduty_service" "tsc_pagerduty_service" {
  name                    = "[TF] ${var.service_name}"
  description             = "[Managed by Terraform] - ${var.pagerduty_description}"
  auto_resolve_timeout    = var.pagerduty_auto_resolve_timeout
  acknowledgement_timeout = var.pagerduty_acknowledgement_timeout
  escalation_policy       = var.pagerduty_escalation_policy_id
  alert_creation          = "create_alerts_and_incidents"

  incident_urgency_rule {
    type    = var.pagerduty_incident_urgency == "high" ? "constant" : "use_support_hours"
    urgency = var.pagerduty_incident_urgency == "high" ? "high" : ""

    dynamic "during_support_hours" {
      for_each = var.pagerduty_incident_urgency == "high" ? [] : [1]
      content {
        type    = "constant"
        urgency = "high"
      }
    }

    dynamic "outside_support_hours" {
      for_each = var.pagerduty_incident_urgency == "high" ? [] : [1]
      content {
        type    = "constant"
        urgency = "low"
      }
    }
  }

  dynamic "support_hours" {
    for_each = var.pagerduty_incident_urgency == "high" ? [] : [1]
    content {
      type         = "fixed_time_per_day"
      time_zone    = "America/New_York"
      days_of_week = ["1", "2", "3", "4", "5"]
      start_time   = "09:00:00"
      end_time     = "17:00:00"
    }
  }

  dynamic "scheduled_actions" {
    for_each = var.pagerduty_incident_urgency == "high" ? [] : [1]
    content {
      type       = "urgency_change"
      to_urgency = "high"
      at {
        type = "named_time"
        name = "support_hours_start"
      }
    }
  }
}

resource "pagerduty_service_integration" "tsc_pagerduty_azure_service_integration" {
  name    = "Microsoft Azure"
  vendor  = var.pagerduty_microsoft_azure_vendor_id
  service = pagerduty_service.tsc_pagerduty_service.id
}

resource "pagerduty_slack_connection" "tsc_pagerduty_slack_connection" {
  source_id         = pagerduty_service.tsc_pagerduty_service.id
  source_type       = "service_reference"
  workspace_id      = var.slack_workspace_id
  channel_id        = var.slack_channel_id
  notification_type = "responder"

  config {
    events = [
      "incident.triggered",
      "incident.escalated",
      "incident.resolved",
      "incident.priority_updated",
      "incident.responder.added",
      "incident.responder.replied",
      "incident.status_update_published",
      "incident.reopened"
    ]
    priorities = ["*"]
  }
}

resource "azurerm_monitor_action_group" "tsc_pagerduty_action_group" {
  name                = "${trim(var.service_name, ":<>+/&%?@")} PagerDuty Action Group"
  resource_group_name = var.action_group_resource_group_name
  short_name          = "PD${var.pagerduty_incident_urgency}${substr(var.service_name, 0, 5)}"

  webhook_receiver {
    name                    = "PagerDuty"
    service_uri             = "https://events.pagerduty.com/integration/${pagerduty_service_integration.tsc_pagerduty_azure_service_integration.integration_key}/enqueue"
    use_common_alert_schema = true
  }

  lifecycle {
    ignore_changes = [
      tags["Environment"],
      tags["CostCenter"],
      tags["Product"],
      tags["lastModified"],
      tags["lastModifiedBy"]
    ]
  }
}

output "pagerduty_service_integration_id" {
  value = pagerduty_service_integration.tsc_pagerduty_azure_service_integration.id
}

output "tsc_pagerduty_action_group_id" {
  value = azurerm_monitor_action_group.tsc_pagerduty_action_group.id
}
```

I then invoke the module as so:

```hcl
module "tsc_services_action_group_high" {
  source                           = "../modules/pagerduty-action-group"
  for_each                         = toset(distinct(local.tscServices))
  service_name                     = "${each.value} - High"
  pagerduty_description            = "These are the high urgency alerts for ${each.value}"
  pagerduty_escalation_policy_id   = local.it_system_engineers_escalation_policy_id
  pagerduty_incident_urgency       = "high"
  action_group_resource_group_name = azurerm_resource_group.tscmonitoring_live.name
}
```

ATM I am working around this by breaking out my monitoring terraform into separate workspaces and then importing the Azure Action Groups as data objects in the other workspaces. Using data objects for the PagerDuty services themselves still leads to the same timeouts, unfortunately; thankfully, in my setup we don't need to alter the PD services very often. But it would be extremely useful to be able to do so in order to enable my team to rename things, update dependencies, etc. as they wish.
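The workaround described above (splitting workspaces and reading the Azure Action Groups back as data objects) could be sketched roughly like this; the `name` and `resource_group_name` values here are hypothetical placeholders:

```hcl
# In a separate workspace: look up the action group created elsewhere
# instead of managing it, so the PagerDuty API calls stay out of this run.
data "azurerm_monitor_action_group" "tsc_pagerduty" {
  name                = "My Service PagerDuty Action Group" # hypothetical
  resource_group_name = "rg-tscmonitoring"                  # hypothetical
}

# Downstream alerts can then reference the looked-up ID.
output "action_group_id" {
  value = data.azurerm_monitor_action_group.tsc_pagerduty.id
}
```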
Thank you so much for all your help; it was very valuable for figuring this out. This is not an issue affecting GH runners exclusively, as @erose96 detected; the reason for this issue is that the TF provider's API client doesn't have a configured timeout for API calls. A patch solving this will be released next Monday, Jan 22nd. Again, thanks for all your support and patience, folks.
@imjaroiswebdev I think we should reopen this issue.. Just tested v3.5.0 and still have the same issue.
Second run, second set of "random" errors.
Can confirm that this issue still looks present (maybe worse, since the timeout is lowered).
Is the issue here the lack of retries? I'm not too bothered about the timeout itself, but the provider/HTTP client doesn't appear to attempt a retry. It seems like any network issue during a run would be enough to fail the terraform run. (It seems that some resources have retries baked in and others do not, though none at the network level.)
Yes @ingwarsw, reopening for further investigation and tests. Please stay tuned, I'll get back to you ASAP with a patch or ETA for it.
I'm also seeing specific 500 errors on the new version for some endpoints, such as:
My issue is likely related to https://status.pagerduty.com/incident_details/PTUPX96
Hey folks! I encourage you to upgrade to v3.5.2; hopefully, the issue is finally addressed. Again, I want to thank you all for your patience and support in providing helpful error outputs for us to better figure out how to solve this. @tgoodsell-tempus the issue you were experiencing was due to a partial outage with the Slack integration at that moment; however, as far as I know, it should be working as usual again. Feel free to re-open this thread if any form of this error continues to appear.
@imjaroiswebdev I have run our pipeline 5 times and it didn't fail once.. so we can consider it a big success 🥇
Great! Thank you so much for the feedback @ingwarsw. Very appreciated 🎉
Upgrading to 3.6.0 fixed the issue. Thank you
Working perfectly using v3.7.0, thank you @imjaroiswebdev and everyone else who helped solve this issue!
#777 attempted to fix this issue but it persists in my environment.
I do not believe this is an issue caused by the rate limit.
Here is the section of the debug log where the exception in the title occurs:
The 200 that occurs right before this indicates the rate limit is not about to be hit:
The `WaitForState` messages in the logs make me think it's related to an issue upstream in the terraform-plugin-sdk. A fix was submitted for that issue a few years ago but was never reviewed. See past issues: #765 #760