Create Framework Step that executes Snapshot/Restore upgrade #7

Closed
Tracked by #10
chelma opened this issue Nov 20, 2022 · 5 comments

Comments

@chelma
Member

chelma commented Nov 20, 2022

Blocked By

Task

Create a validation framework step that executes a snapshot/restore upgrade [1] on the cluster created in a preceding step. You may assume the starting and ending clusters have the same number of nodes. The ending cluster version must be user-configurable via a supplied Dockerfile.

Why snapshot/restore? It's widely used, it lets us have both the starting and ending clusters spun up in parallel at the end for testing, and it lets us develop our final validation testing script such that it can be pointed at any two starting/ending clusters (including "real" customer clusters).

[1] Snapshot/Restore steps - https://github.com/sumobrian/migration-tools/pull/4

Acceptance Criteria

  • Able to invoke the validation framework's CLI to perform a snapshot/restore upgrade using the starting cluster created by the previous step
  • Step should confirm the upgraded cluster is running w/ the expected version and node count before ending
  • Step should store information about the cluster in application state (IP/Port for each node, etc.) so it can be accessed later
  • Type hints and unit tests for any added code
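
The version and node-count check from the second criterion could be sketched roughly as below. This is a minimal sketch, not the framework's actual code: the function names, the injectable `fetch_json` callable, and the assumption that an `engine_version` such as `OS_1_3_6` maps to the version number `1.3.6` are all hypothetical. It relies only on the `version.number` field returned by `GET /` and on `GET /_cat/nodes?format=json` returning one entry per node. It records each node's name and IP but not its port.

```python
import json
from typing import Any, Callable, Dict, List
from urllib.request import urlopen


def http_get_json(url: str) -> Any:
    """Fetch a URL over the raw REST interface and decode the JSON body."""
    with urlopen(url, timeout=10) as response:
        return json.load(response)


def expected_version_number(engine_version: str) -> str:
    """Assumed mapping: 'OS_1_3_6' -> '1.3.6' (drop the engine prefix)."""
    _engine, _, version = engine_version.partition("_")
    return version.replace("_", ".")


def confirm_cluster(
    base_url: str,
    engine_version: str,
    node_count: int,
    fetch_json: Callable[[str], Any] = http_get_json,
) -> List[Dict[str, str]]:
    """Raise if version or node count is unexpected; else return node info."""
    actual = fetch_json(f"{base_url}/")["version"]["number"]
    expected = expected_version_number(engine_version)
    if actual != expected:
        raise RuntimeError(f"Expected version {expected}, found {actual}")

    nodes = fetch_json(f"{base_url}/_cat/nodes?format=json&h=ip,name")
    if len(nodes) != node_count:
        raise RuntimeError(f"Expected {node_count} nodes, found {len(nodes)}")

    # Shape is hypothetical; this is what a step might store in application state.
    return [{"name": node["name"], "ip": node["ip"]} for node in nodes]
```

Passing `fetch_json` as a parameter keeps the check unit-testable without a running cluster, which fits the last criterion.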
@chelma
Member Author

chelma commented Dec 12, 2022

Goal is to have a PR for this out by 2022-12-16. Think it's doable as long as the PR from the previous step (#40) isn't too contentious.

@chelma
Member Author

chelma commented Dec 15, 2022

Slightly ahead of schedule; got it working and ready for PR today.

(.venv) chelma@3c22fba4e266 upgrades % cat ./test_configs/snapshot_restore_es_7_10_2_to_os_1_3_6.json
{
    "clusters_def": {
        "source": {
            "engine_version": "ES_7_10_2",
            "image": "docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2",
            "node_count": 2,
            "additional_node_config": {
                "ES_JAVA_OPTS": "-Xms512m -Xmx512m"
            }
        },
        "target": {
            "engine_version": "OS_1_3_6",
            "image": "opensearchproject/opensearch:1.3.6",
            "node_count": 2,
            "additional_node_config": {
                "plugins.security.disabled": "true",
                "OPENSEARCH_JAVA_OPTS": "-Xms512m -Xmx512m"
            }
        }
    },
    "upgrade_def": {
        "style": "snapshot-restore"
    }
}
(.venv) chelma@3c22fba4e266 upgrades % ./run_utf.py --test_config ./test_configs/snapshot_restore_es_7_10_2_to_os_1_3_6.json
[FrameworkRunner] Running Step: LoadTestConfig
[LoadTestConfig] Loading test config file...
[LoadTestConfig] Loaded test config file successfully
[FrameworkRunner] Step Succeeded: LoadTestConfig
[FrameworkRunner] Running Step: BootstrapDocker
[BootstrapDocker] Checking if Docker is installed and available...
[BootstrapDocker] Docker appears to be installed and available
[BootstrapDocker] Ensuring the Docker image docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 is available either locally or remotely...
[BootstrapDocker] Docker image docker.elastic.co/elasticsearch/elasticsearch-oss:7.10.2 is available
[BootstrapDocker] Ensuring the Docker image opensearchproject/opensearch:1.3.6 is available either locally or remotely...
[BootstrapDocker] Docker image opensearchproject/opensearch:1.3.6 is available
[FrameworkRunner] Step Succeeded: BootstrapDocker
[FrameworkRunner] Running Step: SnapshotRestoreSetup
[SnapshotRestoreSetup] Creating shared Docker volume to share snapshots...
[SnapshotRestoreSetup] Created shared Docker volume cluster-snapshots-volume
[FrameworkRunner] Step Succeeded: SnapshotRestoreSetup
[FrameworkRunner] Running Step: StartSourceCluster
[StartSourceCluster] Creating source cluster...
[StartSourceCluster] Waiting up to 30 sec for cluster to be active...
Node source-cluster-node-1 is now active
Node source-cluster-node-2 is now active
[StartSourceCluster] Cluster source-cluster is active
[FrameworkRunner] Step Succeeded: StartSourceCluster
[FrameworkRunner] Running Step: TestSourceCluster
[TestSourceCluster] Querying cluster status...
[TestSourceCluster] ip            heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
192.168.240.3           31          52  62    1.63    0.60     0.29 dimr      *      source-cluster-node-2
192.168.240.2           30          52  71    1.63    0.60     0.29 dimr      -      source-cluster-node-1
[TestSourceCluster] Uploading sample document...
[TestSourceCluster] Retrieving uploaded document...
[TestSourceCluster] Retrieved uploaded doc sucessfully
[TestSourceCluster] {
    "_id": "1",
    "_index": "noldor",
    "_primary_term": 1,
    "_seq_no": 0,
    "_source": {
        "name": "Finwe"
    },
    "_type": "_doc",
    "_version": 1,
    "found": true
}
[FrameworkRunner] Step Succeeded: TestSourceCluster
[FrameworkRunner] Running Step: CreateSourceSnapshot
[CreateSourceSnapshot] Creating snapshot of source cluster...
[CreateSourceSnapshot] Confirming snapshot of source cluster exists...
[CreateSourceSnapshot] Source snapshot created successfully
[CreateSourceSnapshot] {
    "snapshots": [
        {
            "data_streams": [],
            "duration_in_millis": 200,
            "end_time": "2022-12-15T22:58:02.033Z",
            "end_time_in_millis": 1671145082033,
            "failures": [],
            "include_global_state": true,
            "indices": [
                "noldor"
            ],
            "shards": {
                "failed": 0,
                "successful": 1,
                "total": 1
            },
            "snapshot": "1",
            "start_time": "2022-12-15T22:58:01.833Z",
            "start_time_in_millis": 1671145081833,
            "state": "SUCCESS",
            "uuid": "NnQF30ulRxCo2LQWGPY8BA",
            "version": "7.10.2",
            "version_id": 7100299
        }
    ]
}
[FrameworkRunner] Step Succeeded: CreateSourceSnapshot
[FrameworkRunner] Running Step: StartTargetCluster
[StartTargetCluster] Creating target cluster...
[StartTargetCluster] Waiting up to 30 sec for cluster to be active...
Node target-cluster-node-2 is now active
Node target-cluster-node-1 is now active
[StartTargetCluster] Cluster target-cluster is active
[FrameworkRunner] Step Succeeded: StartTargetCluster
[FrameworkRunner] Running Step: TestTargetCluster
[TestTargetCluster] ip         heap.percent ram.percent cpu load_1m load_5m load_15m node.role master name
172.17.0.3           19          93  65    3.41    1.12     0.47 dimr      *      target-cluster-node-2
172.17.0.2           30          93  70    3.41    1.12     0.47 dimr      -      target-cluster-node-1
[FrameworkRunner] Step Succeeded: TestTargetCluster
[FrameworkRunner] Running Step: RestoreSourceSnapshot
[RestoreSourceSnapshot] Checking if source snapshots are visible on target...
[RestoreSourceSnapshot] Source snapshot visible to target cluster
[RestoreSourceSnapshot] Restoring source snapshot onto target...
[RestoreSourceSnapshot] Waiting a few seconds for the snapshot to be restored...
[RestoreSourceSnapshot] Attempting to retrieve source doc from target cluster...
[RestoreSourceSnapshot] Document retrieved, snapshot restored successfully
[RestoreSourceSnapshot] {
    "_id": "1",
    "_index": "noldor",
    "_primary_term": 1,
    "_seq_no": 0,
    "_source": {
        "name": "Finwe"
    },
    "_type": "_doc",
    "_version": 1,
    "found": true
}
[FrameworkRunner] Step Succeeded: RestoreSourceSnapshot
[FrameworkRunner] Running Step: StopSourceCluster
[StopSourceCluster] Stopping cluster source-cluster...
[StopSourceCluster] Cleaning up underlying resources for cluster source-cluster...
[FrameworkRunner] Step Succeeded: StopSourceCluster
[FrameworkRunner] Running Step: StopTargetCluster
[StopTargetCluster] Stopping cluster target-cluster...
[StopTargetCluster] Cleaning up underlying resources for cluster target-cluster...
[FrameworkRunner] Step Succeeded: StopTargetCluster
[FrameworkRunner] Running Step: SnapshotRestoreTeardown
[SnapshotRestoreTeardown] Removing shared Docker volume cluster-snapshots-volume...
[SnapshotRestoreTeardown] Removed shared Docker volume cluster-snapshots-volume
[FrameworkRunner] Step Succeeded: SnapshotRestoreTeardown
[FrameworkRunner] Ran through all steps successfully
[FrameworkRunner] Saving application state to file...
[FrameworkRunner] Application state saved
[FrameworkRunner] Application state saved to: /tmp/utf/state-file
[FrameworkRunner] Full run details logged to: /tmp/utf/logs/run.log.2022-12-15_16_57_43
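
For reference, the CreateSourceSnapshot and RestoreSourceSnapshot steps in the run above correspond roughly to the raw REST requests sketched here. This is an illustration under assumptions, not the code from the PR: the repository name `shared_repo` and the location `/snapshots` are hypothetical (a shared-filesystem location has to be listed in `path.repo` on every node), and the helper only builds the requests rather than sending them.

```python
from typing import Any, Dict, List, NamedTuple, Optional


class RestRequest(NamedTuple):
    method: str
    path: str
    body: Optional[Dict[str, Any]]


def snapshot_restore_plan(
    repo: str, location: str, snapshot: str
) -> Dict[str, List[RestRequest]]:
    """Build the REST calls for each cluster; nothing is sent here."""
    # Both clusters register the same shared-filesystem repository.
    register = RestRequest(
        "PUT",
        f"/_snapshot/{repo}",
        {"type": "fs", "settings": {"location": location}},
    )
    return {
        "source": [
            register,
            # Take the snapshot and block until it finishes.
            RestRequest(
                "PUT", f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true", None
            ),
            # Confirm the snapshot exists.
            RestRequest("GET", f"/_snapshot/{repo}/{snapshot}", None),
        ],
        "target": [
            register,
            # Check that the source snapshots are visible on the target.
            RestRequest("GET", f"/_snapshot/{repo}/_all", None),
            # Restore and block until the restore finishes.
            RestRequest(
                "POST",
                f"/_snapshot/{repo}/{snapshot}/_restore?wait_for_completion=true",
                None,
            ),
        ],
    }
```

Using `wait_for_completion=true` on the restore is one way to avoid a fixed sleep before reading the document back from the target.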

@chelma
Member Author

chelma commented Dec 15, 2022

PR posted: #47

@chelma
Member Author

chelma commented Dec 16, 2022

Demo'd the UTF to @sumobrian @kartg @okhasawn @lewijacn. Overall, folks thought it looked good with a few comments:

  • Kartik - Need to be careful that using a language-specific client SDK for ES/OS to check the results of operations doesn't impact the outcome of the test. Per discussion, seems like we should stick to the raw REST interface and make that clear to users.
  • Brian - The engine versions show up multiple times in the test_config: in the file name, in the engine_version field, and in the Docker image name. Maybe we can infer the version instead of hardcoding it in each spot? Per discussion, we can't depend on it being in the Docker image name, and might not always have it in the filename either, so it seems safest to have a separate field.
  • Brian - Would be good to be able to override test configuration in a hierarchy (command-line > test_config > core framework config values). Per discussion, makes sense.
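
The override hierarchy from the last comment could be sketched with the standard library's `collections.ChainMap`, which returns the value from the first mapping that defines a key. A minimal sketch only: the setting names (`cluster_wait_sec`, `state_file`, `upgrade_style`) and the flat key layout are made up for illustration, and nested sections such as `clusters_def` would need a recursive merge instead.

```python
from collections import ChainMap
from typing import Any, Mapping

# Hypothetical core framework defaults (lowest precedence).
CORE_DEFAULTS: Mapping[str, Any] = {
    "cluster_wait_sec": 30,
    "state_file": "/tmp/utf/state-file",
}


def resolve_settings(
    cli_args: Mapping[str, Any],
    test_config: Mapping[str, Any],
    defaults: Mapping[str, Any] = CORE_DEFAULTS,
) -> ChainMap:
    """Layer settings as command-line > test_config > core defaults."""
    # Drop CLI options the user did not pass so they don't mask lower layers.
    cli_set = {key: value for key, value in cli_args.items() if value is not None}
    return ChainMap(cli_set, dict(test_config), dict(defaults))


settings = resolve_settings(
    cli_args={"cluster_wait_sec": 60, "state_file": None},
    test_config={"cluster_wait_sec": 45, "upgrade_style": "snapshot-restore"},
)
print(settings["cluster_wait_sec"])  # value from the command line
print(settings["upgrade_style"])     # value from the test config
print(settings["state_file"])        # value from the core defaults
```

Filtering out `None` values matters because argument parsers typically report unset options as `None`, which would otherwise shadow the test_config value.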

@chelma
Copy link
Member Author

chelma commented Dec 20, 2022

PR merged; task complete. Resolving.

@chelma chelma closed this as completed Dec 20, 2022