Skip to content

Conversation

lsierant
Copy link
Contributor

@lsierant lsierant commented Sep 18, 2025

Summary

This PR adds testing for running search with Ops Manager 8.0.14 and with mongod 8.2.

Changes to search/search_enterprise_tls.py:

  • Added ability to run the test with either OM or cloud qa.
  • Added mongod migration steps from 8.0.14 to 8.2.0.
    • This adds automated verification that we can have search running and upgrade from mdb 8.0 (having polyfilled searchCoordinator role) to mdb 8.2 (with the role being a built-in one).

Proof of Work

EVG passing for search_enterprise_tls.py in both cloud-qa and om80 variants: evg link

Also the test is only run in om80 variant and not for earlier OM.

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added changelog file?

@lsierant lsierant self-assigned this Sep 18, 2025
@lsierant lsierant force-pushed the lsierant/multi-cloud-qa branch from 94a69a9 to b053f96 Compare September 18, 2025 21:18
@lsierant lsierant force-pushed the lsierant/multi-cloud-qa branch from b053f96 to 8689e2c Compare September 22, 2025 10:34
@lsierant lsierant force-pushed the lsierant/multi-cloud-qa branch 2 times, most recently from 6efa62e to 9d65420 Compare September 22, 2025 20:02
@lsierant lsierant force-pushed the lsierant/search-8.2 branch 2 times, most recently from d70d9e6 to 4363fa2 Compare September 22, 2025 20:19
@lsierant lsierant force-pushed the lsierant/multi-cloud-qa branch from 4964f58 to aaa6b41 Compare September 22, 2025 20:27
@lsierant lsierant force-pushed the lsierant/search-8.2 branch 2 times, most recently from 6d98958 to 23dc861 Compare September 22, 2025 20:28
@lsierant lsierant force-pushed the lsierant/multi-cloud-qa branch from aaa6b41 to 3aa8a72 Compare September 22, 2025 20:33
@lsierant lsierant force-pushed the lsierant/multi-cloud-qa branch from 3aa8a72 to 8f187b9 Compare September 22, 2025 21:44
@lsierant lsierant marked this pull request as ready for review September 23, 2025 06:19
@lsierant lsierant requested a review from a team as a code owner September 23, 2025 06:19
@lsierant lsierant requested review from m1kola, mircea-cosbuc and fealebenpae and removed request for a team September 23, 2025 06:19
Comment on lines 133 to 136
if ops_manager is not None:
ops_manager.update()
ops_manager.om_status().assert_reaches_phase(Phase.Running, timeout=1200)
ops_manager.appdb_status().assert_reaches_phase(Phase.Running, timeout=600)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would get_ops_manager() ever return None provided this step is decorated with skip_if_cloud_manager? Or maybe I'm not understanding why we're guarding against ops_manager being None before calling .update() but not for the following method calls?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right, removed the if. The annotation is sufficient.



@mark.e2e_search_enterprise_tls
def test_wait_for_database_resource_ready2(mdb: MongoDB):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we wait again here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it looks it's redundant. Removing it.

from kubetester.mongodb_user import MongoDBUser
from kubetester.omtester import skip_if_cloud_manager
from kubetester.phase import Phase
from mypyc.irbuild.function import check_native_override
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem used.

from tests.common.search.search_tester import SearchTester
from tests.conftest import get_default_operator
from tests.conftest import get_default_operator, get_issuer_ca_filepath
from tests.opsmanager.conftest import custom_om_prev_version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't seem used.

Comment on lines 272 to 267
@mark.e2e_search_enterprise_tlssh
def test_search_assert_search_query_2(mdb: MongoDB):
get_user_sample_movies_helper(mdb).assert_search_query(retry_timeout=60)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should move this as the last step in TestUpgradeMongod to make sure it doesn't get moved around.

@lsierant lsierant force-pushed the lsierant/multi-cloud-qa branch from fa19d3d to de8fd69 Compare September 23, 2025 20:30
@lsierant lsierant force-pushed the lsierant/multi-cloud-qa branch from de8fd69 to 8398705 Compare September 25, 2025 07:39
@lsierant lsierant force-pushed the lsierant/search-8.2 branch 2 times, most recently from a7d5aeb to 37d09a7 Compare September 25, 2025 09:55

resource.set_version(get_custom_om_version())
resource.set_appdb_version(get_custom_appdb_version())
resource.allow_mdb_rc_versions()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is 8.2 an RC version? Is it not available without enabling this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is not necessary anymore. Thanks for flagging.

@lsierant lsierant changed the base branch from lsierant/multi-cloud-qa to master September 29, 2025 08:48
Copy link

github-actions bot commented Sep 29, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.5.0 Release Notes

New Features

  • Improve automation agent certificate rotation: the agent now restarts automatically when its certificate is renewed, ensuring smooth operation without manual intervention and allowing seamless certificate updates without requiring manual Pod restarts.

Bug Fixes

  • MongoDBMultiCluster: fix resource stuck in Pending state if any clusterSpecList item has 0 members. After the fix, a value of 0 members is handled correctly, similarly to how it's done in the MongoDB resource.

@lsierant lsierant added the skip-changelog Use this label in Pull Request to not require new changelog entry file label Sep 29, 2025
@lsierant lsierant changed the title Running search test under OM 8.2 CLOUDP-342858: Running search test under OM 8.2 Sep 29, 2025
Copy link
Contributor

@m1kola m1kola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good overall, but the tests are not easy to follow. Left some suggestions.

@mark.e2e_search_enterprise_tls
def test_wait_for_database_resource_ready(mdb: MongoDB):
mdb.assert_abandons_phase(Phase.Running, timeout=300)
mdb.assert_reaches_phase(Phase.Running, timeout=300)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: doesn't look like a very useful test. We do nothing here apart from the assertion. In my mind the test should perform some change and then assert some condition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is useful, it's checking the state of the world. Not every test must change the world AND observe it as part of the same test case. Especially when you deal with multiple components deployed asynchronously. In this case we create users, search and mongodb which are reconciled by different controllers

Having smaller test cases makes it easier to develop, as you can retry smaller individual functions if needed. Also we should consider all the functions top to bottom as part of one testing steps and story.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of search, after it's deployed, the MongoDB reconciler will pick it up and perform changes to mongod - agents will perform rolling restart of mongods, even without changing to MongoDB CR at all.

Copy link
Contributor

@m1kola m1kola Oct 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is useful, it's checking the state of the world. Not every test must change the world AND observe it as part of the same test case. Especially when you deal with multiple components deployed asynchronously. In this case we create users, search and mongodb which are reconciled by different controllers

Having smaller test cases makes it easier to develop, as you can retry smaller individual functions if needed. Also we should consider all the functions top to bottom as part of one testing steps and story.

This is an end to end test and we shouldn't be concerned about specific controllers or other implementation details. We should focus on how the system behaves as a whole instead.

Perhaps it is easier to develop/debug the tests like this, but tests structured this way leave the reader guessing. Even if you consider all the functions top to bottom it raises a question - why we do this check here? We already check ready state in test_create_database_resource - why check it again? You explained this later, but imagine not having this context.

In case of search, after it's deployed, the MongoDB reconciler will pick it up and perform changes to mongod - agents will perform rolling restart of mongods, even without changing to MongoDB CR at all.

Given that explanation I would argue that in the end to end test we should be:

  1. Deploying everything (the DB resource, user and the search)
  2. Do the ready state waits and other checks after the waiting.

This is probably how the users will be using it (especially with GitOps). And it eliminates the questions about strange checks.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Your comments made me rethink this and I changed that function to test_wait_for_agents_ready. After search is deployed, mongod parameters are added to all processes and the agents are doing their work for at least few minutes.
Due to how we report (or rather not doing properly) changes to Pending when reconciling, we cannot see the agents are doing anything by checking the spec.status alone.

Therefore I've added in the test mdb.get_om_tester().wait_agents_ready() that checks if all the agents reached the goal state (it's exactly the same logic as the operator is waiting on in the reconcile loop).

I hope now it's more clear what's happening.

Comment on lines +215 to +232
@mark.e2e_search_enterprise_tls
def test_validate_tls_connections(mdb: MongoDB, mdbs: MongoDBSearch, namespace: str):
validate_tls_connections(mdb, mdbs, namespace)


@mark.e2e_search_enterprise_tls
def test_search_restore_sample_database(mdb: MongoDB):
get_admin_sample_movies_helper(mdb).restore_sample_database()


@mark.e2e_search_enterprise_tls
def test_search_create_search_index(mdb: MongoDB):
get_user_sample_movies_helper(mdb).create_search_index()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the correlation between test actions and the assertions here and in some other tests. Perhaps it is too granular split into separate tests?

I think it will be easier to follow if there is a test which performs an action and checks the outcome of the action.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is difficult to fix tbh and it would require a more broader alignment how we should write the test...
I don't know how to fix this exact case, but it's just part of the testing story: in order to query the search we need the index in the first place.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not going to insist on changing this. But IMO something like this makes more sense:

def test_can_search(mdb: MongoDB):
    get_admin_sample_movies_helper(mdb).restore_sample_database()
    get_user_sample_movies_helper(mdb).create_search_index()
    get_user_sample_movies_helper(mdb).assert_search_query(retry_timeout=60)

You can immediately see that the user should be able to search after creating a search index.

Comment on lines 238 to 250
custom_roles = mdb.get_automation_config_tester().automation_config.get("roles", [])
assert len(custom_roles) > 0
assert "searchCoordinator" in [role["role"] for role in custom_roles]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not clear to me - why we assert it here?


I had to look at the controller code to understand it better. I see that for DB versions prior to 8.2 we create a polyfill role. Would it then make more sense to assert this after creating the resource with MDB_VERSION_WITHOUT_BUILT_IN_ROLE? E.g. somewhere in test_create_database_resource?

And then in the upgrade test we can test that there is no polyfill role after the upgrade.

This way test actions are bundled together with a relevant assertions which makes it easier to understand. Also I think a comment with an explanation about the polyfill role wouldn't hurt somewhere around role assertions.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added some more comments around and reordered the steps a bit - PTAL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at this again I now feel like ideally we should not be checking for implementation details in an E2E test: we should not be concerned about polyfill role. It might be an useful unit/integration test, but on the E2E level we shouldn't wory about it. We care about an answer to the question - "does search work after upgrade?".

Not requesting you to change this (but I'll welcome it) as there is no harm in the additional checks apart from them adding a bit of extra complexity to the test.

Comment on lines 242 to 259
def test_mongod_version(self, mdb: MongoDB):
mdb.tester(ca_path=get_issuer_ca_filepath(), use_ssl=True).assert_version(MDB_VERSION_WITHOUT_BUILT_IN_ROLE)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what is the value of this test as part of the upgrade test.

Perhaps it is also should be an assertion after initial creation in test_create_database_resource?

And then in the upgrade test we will be:

  • Taking action: upgrading to a new version
  • Making assertions: checking running state, custom roles, version, etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this assertion alone is of little value as it stands as we deploy this version at the beginning and the assertion could be there.

OTOH, I've prepared this TestUpgradeMongod to be a potentially reusable bit of upgrade test/steps. If you look at this assertion as part of the steps grouped under a single TestClass, then it's OK to test whether we run the test under specific circumstances.

Let me think about the overall structure, perhaps some comments would be needed here.

except client.ApiException as e:
if e.status == 404:
logger.debug("project config map is not specified, trying my-project as the source")
src_cm = read_configmap(self.namespace, "my-project", api_client=api_client)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole except section is redundant due to default value assigned to src_project_config_map_name

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I refactored this part a bit.

@lsierant lsierant force-pushed the lsierant/search-8.2 branch from 2d1d13b to d8329a6 Compare October 3, 2025 06:47
@lsierant lsierant force-pushed the lsierant/search-8.2 branch from d8329a6 to e3fa3e7 Compare October 3, 2025 07:16
@lsierant lsierant requested review from m1kola and anandsyncs October 6, 2025 09:32
Copy link
Contributor

@m1kola m1kola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments. I'll be happy if you address them, but I'll be ok with the PR merged as is because I do not consider these comments critical.

Comment on lines +190 to +211
def test_wait_for_mongod_parameters(mdb: MongoDB):
# After search CR is deployed, MongoDB controller will pick it up
# and start adding searchCoordinator role and search-related
# parameters to the automation config.
def check_mongod_parameters():
parameters_are_set = True
pod_parameters = []
for idx in range(mdb.get_members()):
mongod_config = yaml.safe_load(
KubernetesTester.run_command_in_pod_container(
f"{mdb.name}-{idx}", mdb.namespace, ["cat", "/data/automation-mongod.conf"]
)
)
set_parameter = mongod_config.get("setParameter", {})
parameters_are_set = parameters_are_set and (
"mongotHost" in set_parameter and "searchIndexManagementHostAndPort" in set_parameter
)
pod_parameters.append(f"pod {idx} setParameter: {set_parameter}")

return parameters_are_set, f'Not all pods have mongot parameters set:\n{"\n".join(pod_parameters)}'

run_periodically(check_mongod_parameters, timeout=200)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think both this and test_wait_for_agents_ready are testing implementation details. I understand that test_wait_for_agents_ready is necessary (at least for now, while we are not reporting Pending). But do we need to do test_wait_for_mongod_parameters?

I'm trying to look at the flow from the user perspective. For me as a user it is important that I can:

  1. Deploy MongoDB
  2. Deploy MongoDBSearch
  3. Get some indication that everything is ready (we don't have that at the moment unfortunately as you explained in the comment for test_wait_for_agents_ready).
  4. Create search index
  5. And start searching.

Can we just drop test_wait_for_mongod_parameters and verify "I can search" functionality (you already do that later in the test)?

Comment on lines +215 to +232
@mark.e2e_search_enterprise_tls
def test_validate_tls_connections(mdb: MongoDB, mdbs: MongoDBSearch, namespace: str):
validate_tls_connections(mdb, mdbs, namespace)


@mark.e2e_search_enterprise_tls
def test_search_restore_sample_database(mdb: MongoDB):
get_admin_sample_movies_helper(mdb).restore_sample_database()


@mark.e2e_search_enterprise_tls
def test_search_create_search_index(mdb: MongoDB):
get_user_sample_movies_helper(mdb).create_search_index()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not going to insist on changing this. But IMO something like this makes more sense:

def test_can_search(mdb: MongoDB):
    get_admin_sample_movies_helper(mdb).restore_sample_database()
    get_user_sample_movies_helper(mdb).create_search_index()
    get_user_sample_movies_helper(mdb).assert_search_query(retry_timeout=60)

You can immediately see that the user should be able to search after creating a search index.

Comment on lines 238 to 250
custom_roles = mdb.get_automation_config_tester().automation_config.get("roles", [])
assert len(custom_roles) > 0
assert "searchCoordinator" in [role["role"] for role in custom_roles]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at this again I now feel like ideally we should not be checking for implementation details in an E2E test: we should not be concerned about polyfill role. It might be an useful unit/integration test, but on the E2E level we shouldn't wory about it. We care about an answer to the question - "does search work after upgrade?".

Not requesting you to change this (but I'll welcome it) as there is no harm in the additional checks apart from them adding a bit of extra complexity to the test.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
skip-changelog Use this label in Pull Request to not require new changelog entry file
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants