
[BUG] AzureRM - Azure Postgres Flexible Server - Virtual Endpoint Attempts to re-create after Failover #27796

leonrob opened this issue Oct 28, 2024 · 12 comments



leonrob commented Oct 28, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave comments along the lines of "+1", "me too" or "any updates", they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment and review the contribution guide to help.

Terraform Version

0.13

AzureRM Provider Version

4.7.0

Affected Resource(s)/Data Source(s)

azurerm_postgresql_flexible_server_virtual_endpoint

Terraform Configuration Files

resource "azurerm_postgresql_flexible_server_virtual_endpoint" "testendpoint" {
  name              = "testendpoint1"
  source_server_id  = data.azurerm_postgresql_flexible_server.centralpg.id
  replica_server_id = data.azurerm_postgresql_flexible_server.eastpgreplica.id
  type              = "ReadWrite"

  depends_on = [
    data.azurerm_postgresql_flexible_server.centralpg,
    data.azurerm_postgresql_flexible_server.eastpgreplica
  ]
  
}
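
(The two data sources referenced above are not included in this report; based on the server IDs in the plan output below, they would look roughly like this. This is a sketch for context, not the exact code used:)

data "azurerm_postgresql_flexible_server" "centralpg" {
  name                = "centralus-test-demo-dev-fpg"
  resource_group_name = "centralus-development-dev-rg"
}

data "azurerm_postgresql_flexible_server" "eastpgreplica" {
  name                = "eastus2-replica-test-demo-dev-fpg"
  resource_group_name = "eastus2-cloudpipelines-dev-rg"
}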

Debug Output/Panic Output

A plan run after a whitespace-only change, before the failover, shows 0 changes.

After manually promoting the replica server to primary in the Azure UI, the same whitespace-only change produces this plan:

Terraform will perform the following actions:

  # azurerm_postgresql_flexible_server_virtual_endpoint.testendpoint will be created
  + resource "azurerm_postgresql_flexible_server_virtual_endpoint" "testendpoint" {
      + id                = (known after apply)
      + name              = "testendpoint1"
      + replica_server_id = "/subscriptions/XX/resourceGroups/eastus2-cloudpipelines-dev-rg/providers/Microsoft.DBforPostgreSQL/flexibleServers/eastus2-replica-test-demo-dev-fpg"
      + source_server_id  = "/subscriptions/XX/resourceGroups/centralus-development-dev-rg/providers/Microsoft.DBforPostgreSQL/flexibleServers/centralus-test-demo-dev-fpg"
      + type              = "ReadWrite"
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Expected Behaviour

Terraform should recognize that a functional virtual endpoint is already assigned to both servers and plan no changes.

Actual Behaviour

No response

Steps to Reproduce

Make a whitespace-only change and run a plan after manually promoting the replica server in the Azure UI (see Debug Output above).

Important Factoids

No response

References

No response


leonrob commented Oct 28, 2024

Also, for what it's worth: I've attempted to add a lifecycle block with prevent_destroy and it does not work.
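
The attempt looked roughly like this (a sketch, not the exact code):

resource "azurerm_postgresql_flexible_server_virtual_endpoint" "testendpoint" {
  name              = "testendpoint1"
  source_server_id  = data.azurerm_postgresql_flexible_server.centralpg.id
  replica_server_id = data.azurerm_postgresql_flexible_server.eastpgreplica.id
  type              = "ReadWrite"

  lifecycle {
    # prevent_destroy only blocks destroys; it does not stop the unwanted create planned after failover
    prevent_destroy = true
  }
}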

The only workaround I found was to create a variable:

variable "create_virtual_endpoint" {
type = bool
default = false # Change this based on your workspace context
}

Use the variable as a bool to decide whether to create the endpoint. On initial creation it would need to be set to true, then changed to false in a separate PR afterwards. I'm trying to reduce the number of steps required.

resource "azurerm_postgresql_flexible_server_virtual_endpoint" "testendpoint" {
count = var.create_virtual_endpoint ? 1 : 0
name = "testendpoint1"
source_server_id = data.azurerm_postgresql_flexible_server.centralpg.id
replica_server_id = data.azurerm_postgresql_flexible_server.eastpgreplica.id
type = "ReadWrite"

depends_on = [
data.azurerm_postgresql_flexible_server.centralpg,
data.azurerm_postgresql_flexible_server.eastpgreplica
]

lifecycle {
ignore_changes = ["*"]
}
}


neil-yechenwei commented Oct 29, 2024

Thanks for raising this issue. prevent_destroy needs to be added from the beginning. It seems I can't reproduce this issue. Could you double check whether the reproduction steps below match what you expect?

Reproduction steps:

  1. terraform apply with the tf config below
  2. Exchange the values for zone and standby_availability_zone (see the snippet after the config)
  3. terraform apply again

tf config:

provider "azurerm" {
  features {}
}

resource "azurerm_resource_group" "test" {
  name     = "acctestRG-postgresql-test01"
  location = "eastus"
}

resource "azurerm_postgresql_flexible_server" "test" {
  name                          = "acctest-fs-test01"
  resource_group_name           = azurerm_resource_group.test.name
  location                      = azurerm_resource_group.test.location
  version                       = "16"
  public_network_access_enabled = false
  administrator_login           = "adminTerraform"
  administrator_password        = "QAZwsx123"
  zone                          = "1"
  storage_mb                    = 32768
  storage_tier                  = "P30"
  sku_name                      = "GP_Standard_D2ads_v5"

  high_availability {
    mode                      = "ZoneRedundant"
    standby_availability_zone = "2"
  }
}

resource "azurerm_postgresql_flexible_server" "test_replica" {
  name                          = "acctest-ve-replica-test01"
  resource_group_name           = azurerm_postgresql_flexible_server.test.resource_group_name
  location                      = azurerm_postgresql_flexible_server.test.location
  create_mode                   = "Replica"
  source_server_id              = azurerm_postgresql_flexible_server.test.id
  version                       = "16"
  public_network_access_enabled = false
  zone                          = "1"
  storage_mb                    = 32768
  storage_tier                  = "P30"
}

resource "azurerm_postgresql_flexible_server_virtual_endpoint" "test" {
  name              = "acctest-ve-test01"
  source_server_id  = azurerm_postgresql_flexible_server.test.id
  replica_server_id = azurerm_postgresql_flexible_server.test_replica.id
  type              = "ReadWrite"
}
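
For step 2, the primary server block with the two values exchanged (everything else stays the same):

resource "azurerm_postgresql_flexible_server" "test" {
  name                          = "acctest-fs-test01"
  resource_group_name           = azurerm_resource_group.test.name
  location                      = azurerm_resource_group.test.location
  version                       = "16"
  public_network_access_enabled = false
  administrator_login           = "adminTerraform"
  administrator_password        = "QAZwsx123"
  zone                          = "2" # was "1"
  storage_mb                    = 32768
  storage_tier                  = "P30"
  sku_name                      = "GP_Standard_D2ads_v5"

  high_availability {
    mode                      = "ZoneRedundant"
    standby_availability_zone = "1" # was "2"
  }
}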


leonrob commented Oct 29, 2024

Apologies, you're using "ZoneRedundant" for the HA mode.

In my setup it's actually Replica.

CorrenSoft commented

According to the plan that you shared, it is just creating a virtual endpoint (there is no destroy step), which may suggest that the virtual endpoint was already destroyed during the failover. Could that be the case?

It is not uncommon in failover scenarios for the Terraform code to become outdated because of the changes made in the process. In those situations, you need to decide between restoring the original configuration once the situation that triggered the failover is no longer relevant, or updating the code to properly describe the new state.


leonrob commented Nov 5, 2024


Hey CorrenSoft, thanks for the reply. It actually does NOT destroy the endpoint. I have done some extremely extensive testing on this and can replicate it very easily.

If possible, would you be willing to hop on a call with me? No pressure or anything; that way I can show you this. My company is a Fortune 500 but we aren't a Terraform Enterprise customer. (Although we spend a large amount with Hashi :-D )

Thanks in advance

CorrenSoft commented

Not sure if it would be appropriate since I don't work for HashiCorp :p
Besides, I am not familiar enough (yet) with this resource; I just gave my input based on my experience with failover on other resources.

Just to add some context: did you say that the failover did not destroy the endpoint? If so, does the apply step actually create a new one?


leonrob commented Nov 5, 2024


Oh, I apologize, I thought you did! lol.

Yes, the failover did NOT destroy the endpoint, which is expected.

The database servers should be able to fail over between each other without anything being destroyed.

My concern is that Terraform doesn't see the virtual endpoint when it refreshes state, even though it already exists.

It's 100% a bug on HashiCorp's end. There was another bug related to this that I was able to get someone to fix, but that person no longer works at HashiCorp.


leonrob commented Nov 14, 2024

Has anyone from HashiCorp taken a peek at this yet?


zahi101 commented Dec 4, 2024

I ran into this problem too. After promoting the replica server, Terraform doesn't "know" about the endpoint and tries to create a new one (which fails with an error because the name already exists). After promoting back to the original server, it worked again.


leonrob commented Dec 4, 2024

@jackofallops could you take a look?


leonrob commented Dec 12, 2024

@stephybun could you take a look?


leonrob commented Dec 23, 2024

Unsure if anyone is planning on trying to fix this, so I gave it a shot here:
#28374
