
[ci] Create a test MySQL server in test and dev namespaces #13030

Merged: 16 commits into hail-is:main on May 12, 2023

Conversation

daniel-goldstein (Contributor) commented May 10, 2023

This change creates MySQL pods for test and dev namespaces instead of sharing the CloudSQL database server. The areas of change are as follows:

Generation of the namespace's database-server-config

The current approach in main relies on a little trick. Since the current createDatabase step uses the database-server-config from the default namespace to generate admin/user SQL configs, the CI pipeline creates a dummy database test-database-instance to produce a sql-test-instance-admin-config that inherits the credentials from the production database-server-config, and then copies that into the test namespace as database-server-config.

In this change, since we are creating the server ourselves, we can replace all of this with a step that creates a database-server-config from scratch and then uses that for the DB pod.

Overall, making these changes really gave me the heebie-jeebies about the test and dev namespaces holding all these credentials to the CloudSQL server. I'm glad this gets rid of that.
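
For illustration, here is a minimal sketch of what creating the config from scratch might look like (the field values mirror the sql-config snippet quoted later in this conversation; the helper name and exact layout are hypothetical, not the PR's actual implementation):

import base64
import json
import secrets

from kubernetes_asyncio import client  # assumption: the async K8s client used elsewhere in this codebase


async def create_database_server_config(v1: client.CoreV1Api, namespace: str) -> None:
    # Build a fresh config rather than inheriting CloudSQL credentials from default.
    sql_config = {
        'host': f'db.{namespace}.svc.cluster.local',
        'port': 3306,
        'user': 'root',
        'password': secrets.token_urlsafe(16),  # randomly generated, per the review discussion below
    }
    secret = client.V1Secret(
        metadata=client.V1ObjectMeta(name='database-server-config'),
        # K8s Secret data values must be base64-encoded strings.
        data={'sql-config.json': base64.b64encode(json.dumps(sql_config).encode()).decode()},
    )
    await v1.create_namespaced_secret(namespace, secret)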

Accessing the database server

We use the DB pod's service DNS name as the host, so inside Kubernetes this Just Works. The one caveat is the CI pipeline, in which we run migrations in batch jobs. Those jobs need a way to reach the DB pod. I achieve this with a NodePort, then use the job's K8s credentials to resolve the node and port that the DB is on (a rough sketch follows below). The code I've added to do this resolution feels a bit janky, so I wouldn't mind some feedback on it. In terms of security, if a user job were somehow able to resolve the address of a test DB, it would still not have the credentials to access it, and this is also currently the case with the production database. Nevertheless, this raises an action item: we should only allow traffic to the K8s and DB subnets for network=private jobs, but I think that should be a separate PR.
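
As a rough illustration of that resolution (a sketch only; the PR's actual code appears in the diff excerpts further down, and the exact NodePort lookup here is an assumption):

from kubernetes_asyncio import client


async def resolve_nodeport_endpoint(v1: client.CoreV1Api, service_name: str, namespace: str):
    # Find the NodePort the DB service exposes for MySQL...
    svc = await v1.read_namespaced_service(service_name, namespace)
    node_port = svc.spec.ports[0].node_port
    # ...and the node the DB pod is currently scheduled on.
    pod = await v1.read_namespaced_pod(f'{svc.spec.selector["app"]}-0', namespace)
    return pod.status.host_ip, node_port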

Database creation

In order to test this properly in a dev deploy, I needed to make some changes to create_database.py. In main, dev deploys can't create databases. I think they should be able to, and those operations should just be idempotent. When allowing dev deploys to create databases, I hit the ALTER USER lines in create_database.py, which lock out any already-deployed application; that feels silly. Instead, I create the MySQL user and the corresponding secret only if the MySQL username does not already exist (see the sketch below).
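
A minimal sketch of that guard (names follow the diff excerpts below; the db helper and its execute methods are assumed from context):

async def create_user_if_doesnt_exist(db, mysql_username: str, mysql_password: str):
    # Skip creation entirely if the user already exists, so re-running a dev
    # deploy never ALTERs credentials out from under a running application.
    existing = await db.execute_and_fetchone('SELECT 1 FROM mysql.user WHERE user = %s', (mysql_username,))
    if existing is not None:
        return
    await db.just_execute(f"CREATE USER '{mysql_username}'@'%' IDENTIFIED BY '{mysql_password}';")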

create_initial_account.py

When starting with a fresh database, we encounter a bootstrapping problem: we need to add the developer's user. The current way to do so is through create_initial_account.py, which assumes that the developer's GSA key secret is not already in the namespace. Now it could be, so I delete the key in the namespace, if it exists, before creating it. This could all change with my other PR that adds devs to test namespaces, but this change to allow ephemeral dev databases will make testing that other PR much easier.


with open('/sql-config.cnf', 'wb') as f:
    f.write(base64.b64decode(admin_secret['data']['sql-config.cnf']))
    f.write(admin_sql_config.to_cnf().encode('utf-8'))

Check failure: Code scanning / CodeQL

Clear-text storage of sensitive information

This expression stores sensitive data (password) as clear text. This expression stores sensitive data (certificate) as clear text.
kubectl -n {{ default_ns.name }} get -o json secret {{ test_database_instance.admin_secret_name }} | jq '{apiVersion, kind, type, data, metadata: {name: "database-server-config"}}' | kubectl -n {{ default_ns.name }} apply -f -
set -ex

if ! kubectl get secret -n {{ default_ns.name }} database-server-config;
daniel-goldstein (Author):

We should delete all sql configs in dev namespaces because they currently point to the production server.

danking:

I think you're the only one using them this week, so I say blow it all away as soon as this is merged.

danking (Contributor) left a comment:

Some questions and comments. The thing about the cores doesn't need to be resolved in this PR. I had four things to take action on: the change to create_initial_account.py, line 90 of create_database.py, the unused arg in the shell script, and the root password.

-keyout server-ca-key.pem -out server-ca.pem

create_key_and_cert server db
create_key_and_cert client client
danking:

the second argument is unused, right?

daniel-goldstein (Author):

Yup, this is cruft I forgot to erase.

"host": "db.$NAMESPACE.svc.cluster.local",
"port": 3306,
"user": "root",
"password": "pw",
danking:

I feel like we should still randomly generate the root password even though it's a test DB.

daniel-goldstein (Author):

Ya good point, this is not hard to do.
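
For reference, a one-line sketch using Python's standard secrets module (where exactly this would land in the scripts is an assumption):

import secrets

# A URL-safe random root password instead of the hard-coded "pw".
root_password = secrets.token_urlsafe(16)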

ci/ci/build.py Outdated
regions=[REGION],
)

n_cores = 4 if scope == 'deploy' and not is_test_deployment else 1
danking:

We don't have resource plots for the first big DB migration, but we have them for b54aafb: https://batch.hail.is/batches/7287213/jobs/100. We stay mostly around 1.3 cores and never exceed 5GiB of RAM.

That job seems to take 40 minutes to construct the chunks, which then use 5GiB. Have we considered parallelizing across batch IDs (see the sketch below)? There's certainly a spread in batch size, so there might be stragglers, but that would use very little memory client-side. Or maybe parallelize by batch and job? That's more overhead (each job gets a DB call), but the client could do more parallelism to compensate. Either way, it feels like the client shouldn't need four cores and 16GiB.

It has an impact on CI latency, though maybe that's overcome by all the other high-latency activities (like compiling Hail). In the most recent deploy, the ci_utils_image finished at 17:24:32 and the database job didn't start until 17:26:08, which suggests we had to wait for a VM to come online to fit our four cores.
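
A rough sketch of the client-side parallelism being proposed (entirely hypothetical: migrate_batch stands in for whatever per-batch work the migration does, and the db helper follows the other snippets in this thread):

import asyncio


async def migrate_all_batches(db, batch_ids, parallelism=10):
    sem = asyncio.Semaphore(parallelism)

    async def migrate_one(batch_id):
        # One DB call per batch keeps client-side memory small;
        # the semaphore bounds concurrent DB work.
        async with sem:
            await migrate_batch(db, batch_id)  # hypothetical per-batch migration step

    await asyncio.gather(*(migrate_one(b) for b in batch_ids))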


daniel-goldstein (Author):

Ya, these are good points. Let's talk about this in a follow-up when Jackie's back, as she probably has the most insight on this.

if not admin_exists
else "ALTER USER '{admin_username}'@'%' IDENTIFIED BY '{admin_password}';"
)
async def create_user_idempotent(admin_or_user, mysql_username, mysql_password):
danking:

This isn't really idempotent because someone could race between your SELECT and the CREATE USER. I'm not sure that matters, though. There's only one person creating users at any time, right?

danking:

Well, I guess what I mean to say is: it is single-writer idempotent but not atomic, so with multiple writers it's not idempotent.

daniel-goldstein (Author):

Good point, I think I'm abusing "idempotent" a little bit here. What I really mean is just create_user_if_doesnt_exist.
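
For what it's worth, MySQL (5.7.6+) can fold the existence check and the creation into one atomic statement; a sketch, with the db helper assumed from the surrounding diff:

async def create_user_if_not_exists(db, mysql_username: str, mysql_password: str):
    # CREATE USER IF NOT EXISTS avoids a separate SELECT, so concurrent
    # writers cannot race between the check and the CREATE.
    await db.just_execute(f"CREATE USER IF NOT EXISTS '{mysql_username}'@'%' IDENTIFIED BY '{mysql_password}';")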

)
async def create_user_idempotent(admin_or_user, mysql_username, mysql_password):
existing_user = await db.execute_and_fetchone('SELECT user FROM mysql.user WHERE user=%s', (mysql_username,))
if existing_user is not None and existing_user.get('user') == mysql_username:
danking:

The second condition is always true if the first condition is true, right? I'm fine with making it an assertion below.

daniel-goldstein (Author), May 10, 2023:

Ya true. Just going to delete this check and SELECT 1.

@@ -77,6 +80,19 @@ async def aexit(acontext_manager, exc_type=None, exc_val=None, exc_tb=None):
return await acontext_manager.__aexit__(exc_type, exc_val, exc_tb)


async def resolve_test_db_endpoint(sql_config: SQLConfig) -> SQLConfig:
service_name, namespace = sql_config.host[: -len('.svc.cluster.local')].split('.', maxsplit=1)
danking:

we're so close to removesuffix!
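
(For context, a sketch of the Python 3.9+ alternative:)

# str.removesuffix trims the suffix only if present, replacing the manual slice arithmetic.
host = 'db.my-namespace.svc.cluster.local'
service_name, namespace = host.removesuffix('.svc.cluster.local').split('.', maxsplit=1)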

db_service = await client.read_namespaced_service(service_name, namespace)
db_pod = await client.read_namespaced_pod(f'{db_service.spec.selector["app"]}-0', namespace)
sql_config_dict = sql_config.to_dict()
sql_config_dict['host'] = db_pod.status.host_ip
danking:

This feels a bit flaky, no? Are we not able to access this through the service's DNS name?

danking:

Ah. The workers. And exposing the service to the workers would require a private load balancer, etc., which is annoying and costly.

daniel-goldstein (Author):

Ya, exactly. I didn't want to allocate a whole internal load balancer just to run a few couple-minute migrations.

daniel-goldstein (Author):

It does feel a bit flaky, which is why I specifically don't do this if you're running in K8s. I am implicitly assuming that the DB pod is not going to be relocated while the create_database job is running, which I guess would pose a problem anyway. This seemed the most reasonable, albeit slightly janky, approach in terms of cost and complexity, given that these are just test databases.

@@ -19,6 +19,14 @@ async def copy_identity_from_default(hail_credentials_secret_name: str) -> str:

secret = await k8s_client.read_namespaced_secret(hail_credentials_secret_name, 'default')

try:
    await k8s_client.delete_namespaced_secret(hail_credentials_secret_name, NAMESPACE)
danking:

what's up with this?

daniel-goldstein (Author):

Because I was doing this in my namespace, which already had a dgoldste-gsa-key secret but not a user in the database, create_namespaced_secret failed. I needed a way to force-create a secret even if one already existed, and this is how we do it in auth/driver.py.
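
The delete-then-create pattern in question looks roughly like this (a sketch assuming the kubernetes_asyncio client; the 404 handling mirrors what the diff above begins to show):

from kubernetes_asyncio import client
from kubernetes_asyncio.client.rest import ApiException


async def force_create_secret(v1: client.CoreV1Api, namespace: str, secret: client.V1Secret):
    # Delete any pre-existing secret with this name; a 404 just means it wasn't there.
    try:
        await v1.delete_namespaced_secret(secret.metadata.name, namespace)
    except ApiException as e:
        if e.status != 404:
            raise
    await v1.create_namespaced_secret(namespace, secret)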

danking (Contributor) left a comment:

Such a good change!

danking commented May 11, 2023:

Let's get this in so we can try scaling up the service backend tests.

danking commented May 11, 2023:

ci test hung

danking merged commit ae7cdee into hail-is:main on May 12, 2023.