
r/virtual_machine: Failsafe against disk name mismatch #304

Closed
wants to merge 1 commit

Conversation

vancluever
Contributor

There are certain situations that can lead to a disk ultimately losing
its name at a specific location without Terraform being able to keep
track of it:

  • Significant alteration of a VM config outside of Terraform: It's
    feasible that someone could go and completely replace a virtual disk in
    one location with a different one.
  • Snapshot heck: Snapshots alter the name of a virtual disk. We walk up
    the child disk path to ultimately try and confirm if we are dealing with
    the right virtual disk, but it's not out of the realm of possibility
    that with enough manipulation, the original name of the disk might be
    lost.
  • Linked clones and storage vMotion: This is currently a real
    possibility and the main reason we are putting in this validation.
    vMotions of linked clones ultimately assign delta disk-like names to
    virtual disks, completely removing the original name from the chain.

In these situations, the existing state of the disk does not facilitate
adding keep_on_remove (unless explicitly there already) as the logic
does not believe the disk to be orphaned (it's in state and was found in
config by key or device address). This can lead to potentially
destructive scenarios.

This update guards against that - if the base name of the disk that was
in state before (if any) and the disk ultimately read out of the
backing do not match, taking snapshots into account, an error is kicked
back during the read process and TF stops in its tracks. Any state that
is pending from DiskRefreshOperation is not committed, and an error
message is supplied prompting the user to manually correct the issues
with the VM before (carefully) proceeding again.
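As a rough illustration of the failsafe described above, the base-name comparison could look something like the following Go sketch. The helper names and the six-digit delta-disk suffix pattern are assumptions for illustration, not the provider's actual code:

```go
package main

import (
	"fmt"
	"path"
	"regexp"
	"strings"
)

// snapshotSuffix matches the delta-disk suffix vSphere appends when a
// snapshot is taken, e.g. "disk1-000002.vmdk" for a base of "disk1.vmdk".
var snapshotSuffix = regexp.MustCompile(`-\d{6}$`)

// baseDiskName strips the datastore path and any snapshot delta suffix,
// leaving the original disk name for comparison.
func baseDiskName(p string) string {
	name := strings.TrimSuffix(path.Base(p), ".vmdk")
	return snapshotSuffix.ReplaceAllString(name, "")
}

// checkDiskName returns an error when the disk name read from the
// backing no longer matches the name recorded in state, with snapshot
// delta names taken into account.
func checkDiskName(stateName, backingName string) error {
	if baseDiskName(stateName) != baseDiskName(backingName) {
		return fmt.Errorf(
			"disk name mismatch: state has %q but backing reports %q; correct the VM manually before proceeding",
			stateName, backingName,
		)
	}
	return nil
}

func main() {
	// A snapshot delta name still resolves to the same base name.
	fmt.Println(checkDiskName("[ds1] vm/disk1.vmdk", "[ds1] vm/disk1-000002.vmdk")) // prints <nil>
}
```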

Further commits will close the hole in the final scenario by preventing
storage vMotion on linked clones (i.e., when the root disk in the snapshot
chain does not match the file name of the disk we are managing). This
should prevent this error from coming up during normal TF operations.


Ref: #295 - this does not completely correct it (that will be coming later) but helps prevent the consequences of it.

@vancluever
Contributor Author

Does not break existing tests:

=== RUN   TestAccResourceVSphereVirtualMachine
=== RUN   TestAccResourceVSphereVirtualMachine/basic
=== RUN   TestAccResourceVSphereVirtualMachine/ESXi-only_test
=== RUN   TestAccResourceVSphereVirtualMachine/shutdown_OK
=== RUN   TestAccResourceVSphereVirtualMachine/multi-device
=== RUN   TestAccResourceVSphereVirtualMachine/add_devices
=== RUN   TestAccResourceVSphereVirtualMachine/remove_middle_devices
=== RUN   TestAccResourceVSphereVirtualMachine/remove_middle_devices,_change_disk_unit
=== RUN   TestAccResourceVSphereVirtualMachine/high_disk_unit_numbers
=== RUN   TestAccResourceVSphereVirtualMachine/high_disk_units_to_regular_single_controller
=== RUN   TestAccResourceVSphereVirtualMachine/cdrom
=== RUN   TestAccResourceVSphereVirtualMachine/maximum_number_of_nics
=== RUN   TestAccResourceVSphereVirtualMachine/upgrade_cpu_and_ram
=== RUN   TestAccResourceVSphereVirtualMachine/modify_annotation
=== RUN   TestAccResourceVSphereVirtualMachine/grow_disk
=== RUN   TestAccResourceVSphereVirtualMachine/swap_scsi_bus
=== RUN   TestAccResourceVSphereVirtualMachine/extraconfig
=== RUN   TestAccResourceVSphereVirtualMachine/extraconfig_swap_keys
=== RUN   TestAccResourceVSphereVirtualMachine/attach_existing_vmdk
=== RUN   TestAccResourceVSphereVirtualMachine/in_folder
=== RUN   TestAccResourceVSphereVirtualMachine/move_to_folder
=== RUN   TestAccResourceVSphereVirtualMachine/static_mac
=== RUN   TestAccResourceVSphereVirtualMachine/single_tag
=== RUN   TestAccResourceVSphereVirtualMachine/multiple_tags
=== RUN   TestAccResourceVSphereVirtualMachine/switch_tags
=== RUN   TestAccResourceVSphereVirtualMachine/clone_from_template
=== RUN   TestAccResourceVSphereVirtualMachine/clone,_multi-nic_(template_should_have_one)
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_different_timezone
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_bad_timezone
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_bad_eagerly_scrub_with_linked_clone
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_bad_thin_provisioned_with_linked_clone
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_bad_size_with_linked_clone
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_bad_size_without_linked_clone
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_different_hostname
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_extra_disks
=== RUN   TestAccResourceVSphereVirtualMachine/clone_with_cdrom
=== RUN   TestAccResourceVSphereVirtualMachine/cpu_hot_add
=== RUN   TestAccResourceVSphereVirtualMachine/memory_hot_add
=== RUN   TestAccResourceVSphereVirtualMachine/dual-stack_ipv4_and_ipv6
=== RUN   TestAccResourceVSphereVirtualMachine/ipv6_only
=== RUN   TestAccResourceVSphereVirtualMachine/windows_template,_customization_events_and_proper_IP
=== RUN   TestAccResourceVSphereVirtualMachine/host_vmotion
=== RUN   TestAccResourceVSphereVirtualMachine/resource_pool_vmotion
=== RUN   TestAccResourceVSphereVirtualMachine/storage_vmotion_-_global_setting
=== RUN   TestAccResourceVSphereVirtualMachine/storage_vmotion_-_single_disk
=== RUN   TestAccResourceVSphereVirtualMachine/storage_vmotion_-_pin_datastore
=== RUN   TestAccResourceVSphereVirtualMachine/import
=== RUN   TestAccResourceVSphereVirtualMachine/import_with_multiple_disks_at_different_SCSI_slots
--- PASS: TestAccResourceVSphereVirtualMachine (8179.98s)
    --- PASS: TestAccResourceVSphereVirtualMachine/basic (163.78s)
    --- SKIP: TestAccResourceVSphereVirtualMachine/ESXi-only_test (0.00s)
    	helper_test.go:85: set VSPHERE_TEST_ESXI to run ESXi-specific acceptance tests
    --- PASS: TestAccResourceVSphereVirtualMachine/shutdown_OK (159.96s)
    --- PASS: TestAccResourceVSphereVirtualMachine/multi-device (170.73s)
    --- PASS: TestAccResourceVSphereVirtualMachine/add_devices (183.09s)
    --- PASS: TestAccResourceVSphereVirtualMachine/remove_middle_devices (197.59s)
    --- PASS: TestAccResourceVSphereVirtualMachine/remove_middle_devices,_change_disk_unit (301.44s)
    --- PASS: TestAccResourceVSphereVirtualMachine/high_disk_unit_numbers (159.84s)
    --- PASS: TestAccResourceVSphereVirtualMachine/high_disk_units_to_regular_single_controller (312.25s)
    --- PASS: TestAccResourceVSphereVirtualMachine/cdrom (19.02s)
    --- PASS: TestAccResourceVSphereVirtualMachine/maximum_number_of_nics (191.32s)
    --- PASS: TestAccResourceVSphereVirtualMachine/upgrade_cpu_and_ram (308.61s)
    --- PASS: TestAccResourceVSphereVirtualMachine/modify_annotation (172.06s)
    --- PASS: TestAccResourceVSphereVirtualMachine/grow_disk (168.13s)
    --- PASS: TestAccResourceVSphereVirtualMachine/swap_scsi_bus (291.03s)
    --- PASS: TestAccResourceVSphereVirtualMachine/extraconfig (159.68s)
    --- PASS: TestAccResourceVSphereVirtualMachine/extraconfig_swap_keys (290.89s)
    --- PASS: TestAccResourceVSphereVirtualMachine/attach_existing_vmdk (161.01s)
    --- PASS: TestAccResourceVSphereVirtualMachine/in_folder (159.45s)
    --- PASS: TestAccResourceVSphereVirtualMachine/move_to_folder (172.37s)
    --- PASS: TestAccResourceVSphereVirtualMachine/static_mac (168.26s)
    --- PASS: TestAccResourceVSphereVirtualMachine/single_tag (170.25s)
    --- PASS: TestAccResourceVSphereVirtualMachine/multiple_tags (161.55s)
    --- PASS: TestAccResourceVSphereVirtualMachine/switch_tags (169.30s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_from_template (129.75s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone,_multi-nic_(template_should_have_one) (129.90s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_different_timezone (130.18s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_bad_timezone (0.01s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_bad_eagerly_scrub_with_linked_clone (4.08s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_bad_thin_provisioned_with_linked_clone (4.06s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_bad_size_with_linked_clone (3.96s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_bad_size_without_linked_clone (4.13s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_different_hostname (124.34s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_extra_disks (293.47s)
    --- PASS: TestAccResourceVSphereVirtualMachine/clone_with_cdrom (129.06s)
    --- PASS: TestAccResourceVSphereVirtualMachine/cpu_hot_add (138.61s)
    --- PASS: TestAccResourceVSphereVirtualMachine/memory_hot_add (139.67s)
    --- PASS: TestAccResourceVSphereVirtualMachine/dual-stack_ipv4_and_ipv6 (123.54s)
    --- PASS: TestAccResourceVSphereVirtualMachine/ipv6_only (429.59s)
    --- PASS: TestAccResourceVSphereVirtualMachine/windows_template,_customization_events_and_proper_IP (319.13s)
    --- PASS: TestAccResourceVSphereVirtualMachine/host_vmotion (159.70s)
    --- PASS: TestAccResourceVSphereVirtualMachine/resource_pool_vmotion (129.98s)
    --- PASS: TestAccResourceVSphereVirtualMachine/storage_vmotion_-_global_setting (400.81s)
    --- PASS: TestAccResourceVSphereVirtualMachine/storage_vmotion_-_single_disk (260.93s)
    --- PASS: TestAccResourceVSphereVirtualMachine/storage_vmotion_-_pin_datastore (392.34s)
    --- PASS: TestAccResourceVSphereVirtualMachine/import (160.67s)
    --- PASS: TestAccResourceVSphereVirtualMachine/import_with_multiple_disks_at_different_SCSI_slots (160.46s)

@vancluever added and later removed the bug (Type: Bug) and enhancement (Type: Enhancement) labels on Dec 9, 2017
@tombuildsstuff

This update guards against that - if the base name of the disk that was in state before (if any) and the disk ultimately read out of the backing do not match, taking snapshots into account, an error is kicked back during the read process and TF stops in its tracks. Any state that is pending from DiskRefreshOperation is not committed, and an error message is supplied prompting the user to manually correct the issues with the VM before (carefully) proceeding again.

@vancluever perhaps I'm misunderstanding this, but from what I can tell this will break the terraform plan workflow (since it's in the Read) if a VM has drifted dramatically from its original state? If so, I feel this is the wrong approach here: the Plan should be sufficient, and we should highlight these changes to the end user (who can then veto them by not applying, or fix the underlying VM as needed).

@tombuildsstuff left a comment

The code itself looks logically sound (although, it needs an acceptance test to cover this use-case) - however I'm concerned this is going to break the terraform plan workflow (see the other comment)?

@vancluever
Contributor Author

@tombuildsstuff I would normally agree with you, but this is a unique case.

In this case, what is happening is that the name of a disk is drifting sufficiently from the original that Terraform cannot track it anymore (see #295). This is mainly happening in storage vMotion scenarios, but can indeed happen during regular reads if someone replaces the disk at a certain address with another.

We can track the latter in plan, but not the former, which is the more likely scenario at this time. This is not a permanent solution - this is a stop-gap until we can 1) block storage vMotion under certain scenarios and 2) start tracking disks by UUID instead of name, device address, and key (something I want to map out today).

PS: Test will be coming this morning.
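The UUID-based tracking mentioned above could be sketched roughly like this. The type and function names here are hypothetical simplifications, not the provider's actual API (the real provider wraps govmomi device types):

```go
package main

import "fmt"

// virtualDisk is a hypothetical, simplified stand-in for the provider's
// disk representation.
type virtualDisk struct {
	UUID string // backing UUID, stable across renames and vMotion
	Name string // current backing file name, which can drift
}

// findDiskByUUID locates a managed disk by its backing UUID rather than
// by name, device address, or key, so snapshot or storage-vMotion
// renames no longer break tracking.
func findDiskByUUID(disks []virtualDisk, uuid string) *virtualDisk {
	for i := range disks {
		if disks[i].UUID == uuid {
			return &disks[i]
		}
	}
	return nil
}

func main() {
	// Even after a rename to a delta-disk-style name, the UUID still
	// resolves to the same managed disk.
	disks := []virtualDisk{
		{UUID: "6000C29b-0000-0000-0000-000000000001", Name: "disk1-000002.vmdk"},
	}
	if d := findDiskByUUID(disks, "6000C29b-0000-0000-0000-000000000001"); d != nil {
		fmt.Println(d.Name) // prints disk1-000002.vmdk
	}
}
```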

@vancluever
Contributor Author

After internal discussion we've decided that turning this into a keep_on_remove scenario would be better. This will flag the disk the same way we treat all other orphaned disks.

Closing this in favour of a pending PR which will address this.
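A minimal sketch of that keep_on_remove flagging, with hypothetical simplified types (the provider's real state handling is more involved, and also accounts for snapshot delta-disk names before declaring a mismatch):

```go
package main

import "fmt"

// diskState is a hypothetical, simplified stand-in for the provider's
// per-disk state entry.
type diskState struct {
	Name         string
	KeepOnRemove bool
	Orphaned     bool
}

// flagMismatchedDisk treats a disk whose backing name no longer matches
// state the same way other orphaned disks are treated: it is flagged
// keep_on_remove so a later delete detaches the device instead of
// destroying the underlying VMDK.
func flagMismatchedDisk(d *diskState, backingName string) {
	if d.Name != backingName {
		d.Orphaned = true
		d.KeepOnRemove = true
		d.Name = backingName // track the disk under its actual name
	}
}

func main() {
	d := diskState{Name: "disk1.vmdk"}
	flagMismatchedDisk(&d, "replacement.vmdk")
	fmt.Println(d.KeepOnRemove) // prints true
}
```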

@vancluever vancluever closed this Dec 11, 2017
vancluever added a commit that referenced this pull request Dec 13, 2017
Lots of helper code added in this test to help manage datastore paths
and what not - things we are kind of doing in other places manually but
I'm getting tired of writing out. Also, a helper for moving a virtual
disk.

The main test case fails right now, which is intentional; it fails
correctly, which is the good thing. Again, this is testing for the case
described in #304: if a disk with a different name gets put into a place
TF is already tracking, that disk will get deleted on the next TF run.
We are working to fix it so that TF simply removes it from state instead.

Ref: #304.
@vancluever vancluever deleted the b-different-disk-fail-safe branch January 18, 2018 05:37
@ghost ghost locked and limited conversation to collaborators Apr 19, 2020
Labels: bug (Type: Bug)