-
Notifications
You must be signed in to change notification settings - Fork 32
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
UTF able to snapshot from ES 7.10.2 and restore onto OS 1.3.6 #47
Conversation
Signed-off-by: Chris Helma <chelma+github@amazon.com>
Signed-off-by: Chris Helma <chelma+github@amazon.com>
Signed-off-by: Chris Helma <chelma+github@amazon.com>
* Moved the unit tests to a directory tree that mirrors the lib * Added basic handling for Node state edge cases * Added basic handling for Cluster state edge cases Signed-off-by: Chris Helma <chelma+github@amazon.com>
* Fixed Git hook for unit test invocation * Fixed some bugs in Cluster/Node state management * Added code so that the app state is saved in full Signed-off-by: Chris Helma <chelma+github@amazon.com>
* Kept the _nodes around after cleanup so their state is stored. Previously, the field was cleared and the final state-file had no information about the node configuration in it. Signed-off-by: Chris Helma <chelma+github@amazon.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice addition for snapshot restore, and setting up outline for external tests to be plugged in the test steps. Only minor comments to look at
elif "OS" == raw_engine: | ||
engine = ENGINE_OPENSEARCH | ||
else: | ||
raise CouldNotParseEngineVersionException(version_string) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we could provide a little more details for each case we raise an exception in this file for example, "Provided engine should match ES or OS format". Another alternative may be to simply provide an example in the exception class, "i.e. ES_7_10_2 or OS_1_3_6" , for all cases
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, makes sense, will do
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Improved exception message in next revision.
@@ -13,17 +16,34 @@ | |||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we have improved upon this last paragraph some and may need to update this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep - will update. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated in next rev
from upgrade_testing_framework.core.framework_step import FrameworkStep | ||
import upgrade_testing_framework.core.shell_interactions as shell | ||
from upgrade_testing_framework.cluster_management.node import Node | ||
|
||
|
||
class TestSourceCluster(FrameworkStep): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm assuming most of this is placeholder for the actual tests we want to run
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's placeholder
self.fail(f"Unsupported upgrade style - {upgrade_style}") | ||
|
||
self.logger.info("Creating shared Docker volume to share snapshots...") | ||
snapshot_volume = docker_client.create_volume("cluster-snapshots-volume") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe a point of preference, seems like we should we be checking for an existing volume or are we relying on the exception of the docker sdk?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIRC, Docker will create as many volumes with the same name as you ask it to and instead rely on their ID meta-data field; it's networks and containers that collide on a name. The Docker SDK uses the ID behind the scenes instead of the name, too, I think, so you wouldn't have collision.
Regardless, this just falls into the same bucket of issues that @mikaylathompson raised in a previous PR about adding a more formal test setup/teardown mechanism to ensure a "clean" testbed. I'll make an issue for that momentarily.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Created an issue to cover this: #51
self.logger.info("Creating snapshot of source cluster...") | ||
port = source_cluster.rest_ports[0] | ||
|
||
register_cmd = f"""curl -XPUT 'localhost:{port}/_snapshot/noldor' -H 'Content-Type: application/json' -d '{{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we should be using a constant here for snapshot repo name and snapshot name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure - assuming this is our "real" code. It isn't, however, and will be ripped out/replaced in short order once we write our actual test code. Think of this as prototype/placeholder/demo code that is meant solely to illustrate how the UTF is designed to work.
"location": "{shared_volume.mount_point}" | ||
}} | ||
}}' | ||
""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would like to see at least a debug log for the name of the snapshot repo created and snapshot name later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The call_shell_command()
function logs all that. That said - this curl
code is going to be ripped out in the next few days/weeks anyways and replaced with "real" code, so...
} | ||
}, | ||
"upgrade_def": { | ||
"style": "snapshot-restore" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing to worry about now, but it would be interesting in the future if a customer could specify an existing snapshot repo they had and restore that on a target cluster we spin up
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially. Given the UTF is now primarily focused on OS developers rather than including cluster admins, I think the need for something like that maybe lessened, but I'm absolutely open to it based on user feedback.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, would want to hear some user desire for this before pursuing
@@ -32,7 +32,7 @@ Learn more about venv [here](https://docs.python.org/3/library/venv.html). | |||
The unit tests are executed by invoking Pytest: | |||
|
|||
``` | |||
python -m pytest tests/ | |||
python -m pytest unit_tests/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, is this to distinguish unit tests from tests run against the cluster? makes sense
}, | ||
"target": { | ||
"engine_version": "OS_1_3_6", | ||
"image": "opensearchproject/opensearch:1.3.6", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Naively (i.e. 4 files into the review), engine_version
and image
seem redundant--why do we need both?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thinking is that a user should be able to roll their own image and use it in the UTF, but the UTF still needs to know the engine version before it spins up any nodes. Consider the use case of a plugin or core developer wanting to run the UTF against a modified version of the codebase by sticking it into a locally build Docker image and using that as the basis of a UTF run.
] | ||
container_config = ContainerConfiguration(self._starting_image, network, port_mappings, self._volumes) | ||
|
||
# Assemble the container configuration based on the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment looks unfinished
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in next rev
@@ -116,7 +132,7 @@ def create_container(self, image: str, container_name: str, network: Network, po | |||
detach=True, | |||
environment=env_variables | |||
) | |||
self.logger.debug("Created container") | |||
self.logger.debug(f"Created container {container_name}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very minor question: you changed the logger.debug messages on lines 84 & 87 to info
level messages -- why not this one?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Info level logs are visible in the terminal history; debug level logs are only visible in the log file. The fact that we had to pull an image from an external repo seems like something that's more important to surface to that higher level than the fact we created a container, which is a given.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I think the fact that we're pushing INFO level logs here indicates that this particular method should be at a higher level of abstraction. Will refactor.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed in next rev
@@ -131,10 +147,18 @@ def remove_container(self, container: Container): | |||
self.logger.debug(f"Removed container {container.name}") | |||
|
|||
def run(self, container: Container, command: str) -> Tuple[int, str]: | |||
# TODO - Need to handle when container isn't stopped | |||
# TODO - Need to handle when container isn't running |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: I think the original is correct?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, wait, no -- my comment here is wrong. Leaving this because maybe it's confusing that run
doesn't actually refer to "running" the container, but "running" a command in the container, but I don't know if making the method name more verbose is the solution. 🤷♀️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Easy fix; will rename to be clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Renamed to run_command
def __init__(self, name: str, container_config: ContainerConfiguration, node_config: NodeConfiguration, | ||
docker_client: DockerFrameworkClient, container: Container = None): | ||
self.logger = logging.getLogger(__name__) | ||
self.name = name | ||
self.rest_port = container_config.rest_port |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rest port = port on which the rest api is available?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep!
def _get_engine_user_for_version(version: ev.EngineVersion) -> str: | ||
if ev.ENGINE_ELASTICSEARCH == version.engine: | ||
return "elasticsearch" | ||
return "elasticsearch" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be return "opensearch"
, yeah?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Surprisingly not; for Elastic 1.x my testing indicated the run-as user in the container was "elasticsearch". Snapshot/restore certainly seems to work with the OpenSearch container's snapshots volume having its linux ownership perms set to the "elasticsearch" user.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well that's weird (but as a side note, this explains why the original snapshot walkthrough only needed the owner of the repo to be set once).
why have the if
at all then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I'm betting OpenSearch 2.x+ has a different user, and this is looking ahead to that eventuality.
self.docker_client: DockerFrameworkClient = None | ||
self.source_cluster: Cluster = None | ||
self.target_cluster: Cluster = None | ||
self.test_config: TestConfig = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know we discussed the Type | None
vs Optional[Type]
thing last time--if they're being set to None
, should they all be Optionals
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dunno; does it matter either way?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, now I under your question. We currently expect these to be defined over the course of a UTF run, so they're not technically "optional" in a grammatical sense. I therefore didn't use Optional
, and instead declared/initialized them as None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a big deal, so not worth changing in this review, but for future related code: my take would be that if there's a chance they could be None (i.e. should callers check whether it's not-None before using it?), they should be Optional. Is that fair/line up with your understanding?
def _run(self): | ||
""" | ||
The contents of this are quick/dirty. Instead of raw curl commands we should probably be using the Python SDK | ||
for Elasticsearch/OpenSearch. At the very least, we should wrap the curl commands in our own pseudo-client. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
minor: want to make/link to a gh issue for this task? I agree that at least a pseudo-client sounds like a necessary approach and I'd like to have it on our radar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already made and in progress - #50
from upgrade_testing_framework.core.framework_step import FrameworkStep | ||
from upgrade_testing_framework.cluster_management.cluster import Cluster, ClusterNotStartedInTimeException | ||
|
||
class StartTargetCluster(FrameworkStep): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Discussion question:
Ideally do we want StartSouceCluster
and StartTargetCluster
steps to be totally different, or should we just have a StartCluster
step that takes in a config?
Changing this would be out of scope for this review. My first reaction is to object to the code duplication, but when I think about it more, source and target are pretty fundamental concepts for a migration/upgrade, and there's not (at this point) any reason to have more than the two, so maybe it's not a problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I had the same thought about making a base step for this, having them share library code, make a meta class, etc. I ended up deciding it wasn't worth messing with now as there's only two of them. Once we hit three, or it becomes a hassle to maintain the separate "copies", then we can revisit it. There are bigger fish to fry at the moment.
* opensearch-project#47 Signed-off-by: Chris Helma <chelma+github@amazon.com>
@@ -24,7 +24,7 @@ jobs: | |||
python -m pip install -r requirements.txt coverage pytest-cov | |||
- name: Run Tests with Coverage | |||
run: | | |||
python -m pytest tests/ --cov=upgrade_testing_framework --cov-report=xml --cov-branch |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch
Description
test_config
wrangling code substantially to handle more complex configurationtest_config
used to perform testing to a new directory in the repoIssues Resolved
Testing
Added unit tests for the new code
Ran the snapshot/restore workflow on my laptop:
State-file:
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.