
r/virtual_machine: Failsafe against disk name mismatch #304

Closed
wants to merge 1 commit

Conversation

vancluever
Contributor

There are certain situations that can lead to a disk ultimately losing
its name at a specific location without Terraform being able to keep
track of it:

  • Significant alteration of a VM config outside of Terraform: It's
    feasible that someone could go and completely replace a virtual disk in
    one location with a different one.
  • Snapshot heck: Snapshots alter the name of a virtual disk. We walk up
    the child disk path to ultimately try and confirm if we are dealing with
    the right virtual disk, but it's not out of the realm of possibility
    that with enough manipulation, the original name of the disk might be
    lost.
  • Linked clones and storage vMotion: This is currently a real
    possibility and the main reason we are putting in this validation.
    vMotions of linked clones ultimately assign delta disk-like names to
    virtual disks, completely removing the original name from the chain.

In these situations, the existing state of the disk does not facilitate
adding keep_on_remove (unless explicitly there already) as the logic
does not believe the disk to be orphaned (it's in state and was found in
config by key or device address). This can lead to potentially
destructive scenarios.

This update guards against that - if the base name of the disk that was
in state before (if any) and the disk ultimately read out of the
backing do not match, taking snapshots into account, an error is kicked
back during the read process and TF stops in its tracks. Any state that
is pending from DiskRefreshOperation is not committed, and an error
message is supplied prompting the user to manually correct the issues
with the VM before (carefully) proceeding again.
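As a rough illustration of the failsafe described above, the base-name comparison could look something like the following Go sketch. The helper names and the six-digit delta-disk suffix pattern are assumptions for illustration, not the provider's actual code:

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

// snapshotSuffix matches the delta-disk suffix vSphere appends when a
// snapshot is taken, e.g. "disk1-000002.vmdk" for a base of "disk1.vmdk".
var snapshotSuffix = regexp.MustCompile(`-\d{6}$`)

// baseDiskName strips the datastore path and any snapshot delta suffix,
// leaving the original disk name for comparison.
func baseDiskName(p string) string {
	name := strings.TrimSuffix(path.Base(p), ".vmdk")
	return snapshotSuffix.ReplaceAllString(name, "")
}

// checkDiskName returns an error when the disk name read from the
// backing no longer matches the name recorded in state, with snapshot
// delta names taken into account.
func checkDiskName(stateName, backingName string) error {
	if baseDiskName(stateName) != baseDiskName(backingName) {
		return fmt.Errorf(
			"disk name mismatch: state has %q but backing reports %q; correct the VM manually before proceeding",
			stateName, backingName,
		)
	}
	return nil
}

func main() {
	// A snapshot delta name still resolves to the same base name.
	fmt.Println(checkDiskName("[ds1] vm/disk1.vmdk", "[ds1] vm/disk1-000002.vmdk")) // prints <nil>
}
```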

Further commits will close the hole in the final scenario by preventing
storage vMotion on linked clones (i.e., when the root disk in the snapshot
chain does not match the file name of the disk we are managing). This
should prevent this error from coming up during normal TF operations.


Ref: #295 - this does not completely correct it (that will be coming later) but helps prevent the consequences of it.

@vancluever
Contributor Author

Does not break existing tests:

=== RUN   TestAccResourceVSphereVirtualMachine
=== RUN   TestAccResourceVSphereVirtualMachine/basic
=== RUN   TestAccResourceVSphereVirtualMachine/ESXi-only_test
=== RUN   TestAccResourceVSphereVirtualMachine/shutdown_OK
=== RUN   TestAccResourceVSphereVirtualMachine/multi-device
=== RUN   TestAccResourceVSphereVirtualMachine/add_devices
=== RUN   TestAccResourceVSphereVirtualMachine/remove_middle_devices
=== RUN   TestAccResourceVSphereVirtualMachine/remove_middle_devices,_change_disk_unit
=== RUN   TestAccResourceVSphereVirtualMachine/high_disk_unit_numbers
=== RUN   TestAccResourceVSphereVirtualMachine/high_disk_units_to_regular_single_controller
=== RUN   TestAccResourceVSphereVirtualMachine/cdrom
=== RUN   TestAccResourceVSphereVirtualMachine/maximum_number_of_nics
=== RUN   TestAccResourceVSphereVirtualMachine/upgrade_cpu_and_ram
=== RUN   TestAccResourceVSphereVirtualMachine/modify_annotation
=== RUN   TestAccResourceVSphereVirtualMachine/grow_disk
=== RUN   TestAccResourceVSphereVirtualMachine/swap_scsi_bus
=== RUN   TestAccResourceVSphereVirtualMachine/extraconfig
=== RUN   TestAccResourceVSphereVirtualMachine/extraconfig_swap_keys
=== RUN   TestAccResourceVSphereVirtualMachine/attach_existing_vmdk
=== RUN   TestAccResourceVSphereVirtualMachine/in_folder
=== RUN   TestAccResourceVSphereVirtualMachine/move_to_folder
=== RUN   TestAccResourceVSphereVirtualMachine/static_mac
=== RUN   TestAccResourceVSphereVirtualMachine/single_tag
=== RUN   TestAccResourceVSphereVirtualMachine/multiple_tags
=== RUN   TestAccResourceVSphereVirtualMachine/switch_tags
=== RUN   TestAccResourceVSphereVirtualMachine/clone_from_template
=== RUN   TestAccResourceVSphereVirtualMachine/clone,_multi-nic_(template_should_have_one)
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_different_timezone
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_bad_timezone
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_bad_eagerly_scrub_with_linked_clone
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_bad_thin_provisioned_with_linked_clone
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_bad_size_with_linked_clone
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_bad_size_without_linked_clone
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_different_hostname
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_extra_disks
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_cdrom
=== RUN   TestAccResourceVSphereVirtualMachine/cpu_hot_add
=== RUN   TestAccResourceVSphereVirtualMachine/memory_hot_add
=== RUN   TestAccResourceVSphereVirtualMachine/dual-stack_ipv4_and_ipv6
=== RUN   TestAccResourceVSphereVirtualMachine/ipv6_only
=== RUN   TestAccResourceVSphereVirtualMachine/windows_template,_customization_events_and_proper_IP
=== RUN   TestAccResourceVSphereVirtualMachine/host_vmotion
=== RUN   TestAccResourceVSphereVirtualMachine/resource_pool_vmotion
=== RUN   TestAccResourceVSphereVirtualMachine/storage_vmotion_-_global_setting
=== RUN   TestAccResourceVSphereVirtualMachine/storage_vmotion_-_single_disk
=== RUN   TestAccResourceVSphereVirtualMachine/storage_vmotion_-_pin_datastore
=== RUN   TestAccResourceVSphereVirtualMachine/import
=== RUN   TestAccResourceVSphereVirtualMachine/import_with_multiple_disks_at_different_SCSI_slots
--- PASS: TestAccResourceVSphereVirtualMachine (8179.98s)
    --- PASS: TestAccResourceVSphereVirtualMachine/basic (163.78s)
    --- SKIP: TestAccResourceVSphereVirtualMachine/ESXi-only_test (0.00s)
    	helper_test.go:85: set VSPHERE_TEST_ESXI to run ESXi-specific acceptance tests
    --- PASS: TestAccResourceVSphereVirtualMachine/shutdown_OK (159.96s)
    --- PASS: TestAccResourceVSphereVirtualMachine/multi-device (170.73s)
    --- PASS: TestAccResourceVSphereVirtualMachine/add_devices (183.09s)
    --- PASS: TestAccResourceVSphereVirtualMachine/remove_middle_devices (197.59s)
    --- PASS: TestAccResourceVSphereVirtualMachine/remove_middle_devices,_change_disk_unit (301.44s)
    --- PASS: TestAccResourceVSphereVirtualMachine/high_disk_unit_numbers (159.84s)
    --- PASS: TestAccResourceVSphereVirtualMachine/high_disk_units_to_regular_single_controller (312.25s)
    --- PASS: TestAccResourceVSphereVirtualMachine/cdrom (19.02s)
    --- PASS: TestAccResourceVSphereVirtualMachine/maximum_number_of_nics (191.32s)
    --- PASS: TestAccResourceVSphereVirtualMachine/upgrade_cpu_and_ram (308.61s)
    --- PASS: TestAccResourceVSphereVirtualMachine/modify_annotation (172.06s)
    --- PASS: TestAccResourceVSphereVirtualMachine/grow_disk (168.13s)
    --- PASS: TestAccResourceVSphereVirtualMachine/swap_scsi_bus (291.03s)
    --- PASS: TestAccResourceVSphereVirtualMachine/extraconfig (159.68s)
    --- PASS: TestAccResourceVSphereVirtualMachine/extraconfig_swap_keys (290.89s)
    --- PASS: TestAccResourceVSphereVirtualMachine/attach_existing_vmdk (161.01s)
    --- PASS: TestAccResourceVSphereVirtualMachine/in_folder (159.45s)
    --- PASS: TestAccResourceVSphereVirtualMachine/move_to_folder (172.37s)
    --- PASS: TestAccResourceVSphereVirtualMachine/static_mac (168.26s)
    --- PASS: TestAccResourceVSphereVirtualMachine/single_tag (170.25s)
    --- PASS: TestAccResourceVSphereVirtualMachine/multiple_tags (161.55s)
    --- PASS: TestAccResourceVSphereVirtualMachine/switch_tags (169.30s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_from_template (129.75s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone,_multi-nic_(template_should_have_one) (129.90s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_different_timezone (130.18s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_bad_timezone (0.01s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_bad_eagerly_scrub_with_linked_clone (4.08s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_bad_thin_provisioned_with_linked_clone (4.06s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_bad_size_with_linked_clone (3.96s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_bad_size_without_linked_clone (4.13s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_different_hostname (124.34s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_extra_disks (293.47s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_cdrom (129.06s)
    --- PASS: TestAccResourceVSphereVirtualMachine/cpu_hot_add (138.61s)
    --- PASS: TestAccResourceVSphereVirtualMachine/memory_hot_add (139.67s)
    --- PASS: TestAccResourceVSphereVirtualMachine/dual-stack_ipv4_and_ipv6 (123.54s)
    --- PASS: TestAccResourceVSphereVirtualMachine/ipv6_only (429.59s)
    --- PASS: TestAccResourceVSphereVirtualMachine/windows_template,_customization_events_and_proper_IP (319.13s)
    --- PASS: TestAccResourceVSphereVirtualMachine/host_vmotion (159.70s)
    --- PASS: TestAccResourceVSphereVirtualMachine/resource_pool_vmotion (129.98s)
    --- PASS: TestAccResourceVSphereVirtualMachine/storage_vmotion_-_global_setting (400.81s)
    --- PASS: TestAccResourceVSphereVirtualMachine/storage_vmotion_-_single_disk (260.93s)
    --- PASS: TestAccResourceVSphereVirtualMachine/storage_vmotion_-_pin_datastore (392.34s)
    --- PASS: TestAccResourceVSphereVirtualMachine/import (160.67s)
    --- PASS: TestAccResourceVSphereVirtualMachine/import_with_multiple_disks_at_different_SCSI_slots (160.46s)

@vancluever added and later removed the bug (Type: Bug) and enhancement (Type: Enhancement) labels on Dec 9, 2017
@tombuildsstuff

This update guards against that - if the base name of the disk that was in state before (if any) and the disk ultimately read out of the backing do not match, taking snapshots into account, an error is kicked back during the read process and TF stops in its tracks. Any state that is pending from DiskRefreshOperation is not committed, and an error message is supplied prompting the user to manually correct the issues with the VM before (carefully) proceeding again.

@vancluever perhaps I'm misunderstanding this, but from what I can tell this will break the terraform plan workflow (since it's in the Read) if a VM has drifted dramatically from its original state? If so, I feel this is the wrong approach here: the Plan should be sufficient, and we should highlight these changes to the end user (who can then veto them by not applying, or fix the underlying VM as needed).

@tombuildsstuff left a comment

The code itself looks logically sound (although, it needs an acceptance test to cover this use-case) - however I'm concerned this is going to break the terraform plan workflow (see the other comment)?

@vancluever
Contributor Author

@tombuildsstuff I would normally agree with you, but this is a unique case.

In this case, what is happening is that the name of a disk is drifting sufficiently from the original that Terraform cannot track it anymore (see #295). This is mainly happening in storage vMotion scenarios, but can indeed happen during regular reads if someone replaces the disk at a certain address with another.

We can track the latter in plan, but not the former, which is the more likely scenario at this time. This is not a permanent solution - this is a stop-gap until we can 1) block storage vMotion under certain scenarios and 2) start tracking disks by UUID instead of name, device address, and key (something I want to map out today).

PS: Test will be coming this morning.
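The UUID-based tracking mentioned above could be sketched roughly like this. The type and function names here are hypothetical simplifications, not the provider's actual API (the real provider wraps govmomi device types):

```go
package main

import "fmt"

// virtualDisk is a hypothetical, simplified stand-in for the provider's
// disk representation.
type virtualDisk struct {
	UUID string // backing UUID, stable across renames and vMotion
	Name string // current backing file name, which can drift
}

// findDiskByUUID locates a managed disk by its backing UUID rather than
// by name, device address, or key, so snapshot or storage-vMotion
// renames no longer break tracking.
func findDiskByUUID(disks []virtualDisk, uuid string) *virtualDisk {
	for i := range disks {
		if disks[i].UUID == uuid {
			return &disks[i]
		}
	}
	return nil
}

func main() {
	// Even after a rename to a delta-disk-style name, the UUID still
	// resolves to the same managed disk.
	disks := []virtualDisk{
		{UUID: "6000C29b-0000-0000-0000-000000000001", Name: "disk1-000002.vmdk"},
	}
	if d := findDiskByUUID(disks, "6000C29b-0000-0000-0000-000000000001"); d != nil {
		fmt.Println(d.Name) // prints disk1-000002.vmdk
	}
}
```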

@vancluever
Contributor Author

After internal discussion we've decided that turning this into a keep_on_remove scenario would be better. This will flag the disk the same way we treat all other orphaned disks.

Closing this in favour of a pending PR which will address this.
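A minimal sketch of that keep_on_remove flagging, with hypothetical simplified types (the provider's real state handling is more involved, and also accounts for snapshot delta-disk names before declaring a mismatch):

```go
package main

import "fmt"

// diskState is a hypothetical, simplified stand-in for the provider's
// per-disk state entry.
type diskState struct {
	Name         string
	KeepOnRemove bool
	Orphaned     bool
}

// flagMismatchedDisk treats a disk whose backing name no longer matches
// state the same way other orphaned disks are treated: it is flagged
// keep_on_remove so a later delete detaches the device instead of
// destroying the underlying VMDK.
func flagMismatchedDisk(d *diskState, backingName string) {
	if d.Name != backingName {
		d.Orphaned = true
		d.KeepOnRemove = true
		d.Name = backingName // track the disk under its actual name
	}
}

func main() {
	d := diskState{Name: "disk1.vmdk"}
	flagMismatchedDisk(&d, "replacement.vmdk")
	fmt.Println(d.KeepOnRemove) // prints true
}
```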

@vancluever vancluever closed this Dec 11, 2017
vancluever added a commit that referenced this pull request Dec 13, 2017
Lots of helper code added in this test to help manage datastore paths
and what not - things we are kind of doing in other places manually but
I'm getting tired of writing out. Also, a helper for moving a virtual
disk.

The main test case fails right now, which is intentional; it fails
correctly, which is the good thing. Again, this is testing for the case
described in #304: if a disk with a different name gets put into a place
TF is already tracking, that disk will get deleted on the next TF run.
We are working to fix it so that TF simply removes it from state instead.

Ref: #304.
@vancluever vancluever deleted the b-different-disk-fail-safe branch January 18, 2018 05:37
@ghost ghost locked and limited conversation to collaborators Apr 19, 2020
Labels: bug (Type: Bug)