[ci] Create a test MySQL server in test and dev namespaces #13030
Conversation
with open('/sql-config.cnf', 'wb') as f:
    f.write(base64.b64decode(admin_secret['data']['sql-config.cnf']))
    f.write(admin_sql_config.to_cnf().encode('utf-8'))
Check failure
Code scanning / CodeQL
Clear-text storage of sensitive information
set -ex

if ! kubectl get secret -n {{ default_ns.name }} database-server-config; then
    kubectl -n {{ default_ns.name }} get -o json secret {{ test_database_instance.admin_secret_name }} | jq '{apiVersion, kind, type, data, metadata: {name: "database-server-config"}}' | kubectl -n {{ default_ns.name }} apply -f -
We should delete all sql configs in dev namespaces because they currently point to the production server.
I think you're the only one using them this week, so I say blow it all away as soon as this is merged.
Some questions and comments. The thing about the cores doesn't need to be resolved in this PR. I had four things to take action on: the change to create_initial_account.py, line 90 of create_database.py, the unused arg in the shell script, and the root password.
tls/create_test_db_config.sh
Outdated
-keyout server-ca-key.pem -out server-ca.pem

create_key_and_cert server db
create_key_and_cert client client
the second argument is unused, right?
Yup, this is cruft I forgot to erase.
tls/create_test_db_config.sh
Outdated
"host": "db.$NAMESPACE.svc.cluster.local",
"port": 3306,
"user": "root",
"password": "pw",
I feel like we should still randomly generate the root password even though it's a test db.
Ya good point, this is not hard to do.
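A minimal sketch of what that could look like, using Python's stdlib `secrets` module (the variable name is illustrative):

```python
import secrets

# ~128 bits of entropy, URL-safe characters only, instead of a hard-coded "pw".
# Plenty for a throwaway test database, and free to do.
root_password = secrets.token_urlsafe(16)
```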
ci/ci/build.py
Outdated
regions=[REGION],
)

n_cores = 4 if scope == 'deploy' and not is_test_deployment else 1
We don't have resource plots for the first big db migration but we have them for b54aafb: https://batch.hail.is/batches/7287213/jobs/100. We stay mostly around 1.3 cores and never exceed 5GiB of RAM.
That job seems to take 40 minutes to construct the chunks which then use 5GiB. Have we considered parallelizing across batch ids? There's certainly a spread in batch size, so there might be stragglers, but that'll use very little memory client-side. Or maybe parallelize by batch and job? More overhead (each job gets a db call), but client could do more parallelism to compensate. Either way, it feels like the client shouldn't need four cores and 16GiB.
It has an impact on CI latency, though maybe overcome by all the other high latency activities (like compiling Hail). In the most recent deploy the ci_utils_image finished at 17:24:32 and the database job didn't start until 17:26:08 which suggests we had to wait for a VM to come online to fit our four cores.
Ya these are good points. Let's talk about this in a follow-up when Jackie's back as she probably has the most insight on this.
ci/create_database.py
Outdated
    if not admin_exists
    else "ALTER USER '{admin_username}'@'%' IDENTIFIED BY '{admin_password}';"
)

async def create_user_idempotent(admin_or_user, mysql_username, mysql_password):
This isn't really idempotent because someone could race between your SELECT and the CREATE USER. I'm not sure that matters though. There's only one person creating users at any time, right?
Well, I guess what I mean to say is: it is single-writer idempotent but not atomic, so with multiple writers it's not idempotent.
Good point, I think I'm abusing idempotent a little bit here. What I really mean is just create_user_if_doesnt_exist.
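For what it's worth, MySQL 5.7+ can make this atomic on the server side with `CREATE USER IF NOT EXISTS`, sidestepping the SELECT-then-CREATE race entirely. A minimal sketch (hypothetical helper; the driver binds the password parameter):

```python
def create_user_if_not_exists_sql(username: str, host: str = '%') -> str:
    # CREATE USER IF NOT EXISTS (MySQL 5.7+) is atomic on the server,
    # so there is no race window between an existence check and the CREATE.
    return f"CREATE USER IF NOT EXISTS '{username}'@'{host}' IDENTIFIED BY %s;"

# hypothetical usage with an async db wrapper:
#   await db.execute(create_user_if_not_exists_sql(mysql_username), (mysql_password,))
```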
ci/create_database.py
Outdated
)

async def create_user_idempotent(admin_or_user, mysql_username, mysql_password):
    existing_user = await db.execute_and_fetchone('SELECT user FROM mysql.user WHERE user=%s', (mysql_username,))
    if existing_user is not None and existing_user.get('user') == mysql_username:
the second condition is always true if the first condition is true, right? I'm fine making it an assertion below.
Ya true. Just going to delete this check and SELECT 1.
@@ -77,6 +80,19 @@ async def aexit(acontext_manager, exc_type=None, exc_val=None, exc_tb=None):
    return await acontext_manager.__aexit__(exc_type, exc_val, exc_tb)


async def resolve_test_db_endpoint(sql_config: SQLConfig) -> SQLConfig:
    service_name, namespace = sql_config.host[: -len('.svc.cluster.local')].split('.', maxsplit=1)
we're so close to removesuffix!
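For reference, on Python 3.9+ `str.removesuffix` replaces the manual slice arithmetic (the namespace name below is a made-up example):

```python
# Example host; 'my-namespace' is hypothetical.
host = 'db.my-namespace.svc.cluster.local'

# Python 3.9+: removesuffix avoids the [: -len(...)] slice.
service_name, namespace = host.removesuffix('.svc.cluster.local').split('.', maxsplit=1)
# service_name == 'db', namespace == 'my-namespace'
```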
    db_service = await client.read_namespaced_service(service_name, namespace)
    db_pod = await client.read_namespaced_pod(f'{db_service.spec.selector["app"]}-0', namespace)
    sql_config_dict = sql_config.to_dict()
    sql_config_dict['host'] = db_pod.status.host_ip
this feels a bit flaky, no? Are we not able to access this through the service's DNS name?
Ah. The workers. And exposing the service to the workers would require a private load balancer, etc., which is annoying and costly.
Ya, exactly. I didn't want to allocate a whole internal load balancer just to run a few couple-minute migrations.
It does feel a bit flaky, which is why I specifically don't do this if you're running in k8s. I am implicitly assuming here that the DB pod is not going to be relocated while the create_database job is running, which I guess would pose a problem anyway. This seemed the most reasonable, albeit a little janky, approach in terms of cost and complexity given that these are just test databases.
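The resolution step can be sketched as a small pure function (hypothetical helper; field names follow the Kubernetes API objects): the NodePort Service opens the DB port on every node, and the pod's `status.host_ip` identifies the node currently running the DB.

```python
def nodeport_endpoint(service, pod):
    """Given the DB's NodePort Service and its pod, return an (ip, port)
    endpoint reachable from a batch job outside the cluster network.

    Assumes, as discussed above, that the DB pod is not relocated while
    the migration job runs.
    """
    node_port = service.spec.ports[0].node_port  # same port on every node
    return pod.status.host_ip, node_port
```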
@@ -19,6 +19,14 @@ async def copy_identity_from_default(hail_credentials_secret_name: str) -> str:

    secret = await k8s_client.read_namespaced_secret(hail_credentials_secret_name, 'default')

    try:
        await k8s_client.delete_namespaced_secret(hail_credentials_secret_name, NAMESPACE)
what's up with this?
Because I was doing this in my namespace, which already had a dgoldste-gsa-key secret but not a user in the database, the create_namespaced_secret call failed. I needed a way to force-create a secret even if one already existed, and this is how we do it in auth/driver.py.
Such a good change!
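The force-create pattern described above boils down to delete-if-present, then create. A minimal sketch (hypothetical wrapper; `NotFoundError` stands in for the k8s client's 404 `ApiException`):

```python
class NotFoundError(Exception):
    """Stand-in for the kubernetes client's 404 ApiException."""

async def force_create_secret(client, name, namespace, body):
    # Delete any stale copy first; a missing secret is not an error.
    try:
        await client.delete_namespaced_secret(name, namespace)
    except NotFoundError:
        pass
    # Now the create cannot fail with "already exists".
    await client.create_namespaced_secret(namespace, body)
```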
c7a1003 to dfca8e1
Let's get this in so we can try scaling up the service backend tests.
ci test hung
This change creates MySQL pods for test and dev namespaces instead of sharing the CloudSQL database server. The areas of change are as follows:

Generation of the namespace's `database-server-config`

The current approach in main does a little trick. Since the current `createDatabase` step uses the `database-server-config` from default to generate admin/user sql configs, the CI pipeline creates a dummy database `test-database-instance` to create a `sql-test-instance-admin-config` that inherits the credentials from the production `database-server-config`, and then copies that within the test namespace to `database-server-config`.

In this change, since we are creating the server ourselves, we can replace these steps with one that creates a `database-server-config` from scratch and then uses it for the DB pod.

Overall, making these changes really gave me the heebie-jeebies that the test and dev namespaces have all these credentials to the CloudSQL server. I'm glad this gets rid of that.

Accessing the database server

We use the DB pod's service DNS name as the `host`, so inside Kubernetes this Just Works. The one caveat is the CI pipeline, in which we run migrations in batch jobs. Those jobs need a way to reach the DB pod. I achieve this with a NodePort, then use the job's K8s credentials to resolve the node and port that the DB is on. The code I've added to do this resolution feels a bit janky; I wouldn't mind some feedback on that. In terms of security, if a user job were somehow able to resolve the address of a test DB, it would still not have the credentials to access it, and this is currently also the case with the production database. Nevertheless, this does raise an action item: we should only allow traffic to the k8s and DB subnets for `network=private` jobs. I think we should make that a separate PR.

Database creation

In order to test this properly in a dev deploy, I needed to make some changes to `create_database.py`. In main, dev deploys can't create databases. I think they should be able to, and those operations should just be idempotent. When allowing dev deploys to create databases, I hit the `ALTER USER` lines in `create_database.py`, which lock out any already-deployed application, which feels silly. Instead, I create the MySQL user and the corresponding secret iff the MySQL username does not already exist.

`create_initial_account.py`

When starting from a fresh database, we hit a bootstrapping problem: we need to add the developer's user. The current way to do that is through `create_initial_account.py`, which assumes that the developer's gsa key secret is not already in the namespace. Now it could be, so I delete the key in the namespace, if it exists, before creating it. This could all change with my other PR to add devs to test namespaces, but this change to allow ephemeral dev databases will make testing that other PR much easier.