[ci] Create a test MySQL server in test and dev namespaces #13030
Conversation
with open('/sql-config.cnf', 'wb') as f:
    f.write(base64.b64decode(admin_secret['data']['sql-config.cnf']))
    f.write(admin_sql_config.to_cnf().encode('utf-8'))
Check failure
Code scanning / CodeQL
Clear-text storage of sensitive information
set -ex

if ! kubectl get secret -n {{ default_ns.name }} database-server-config; then
    kubectl -n {{ default_ns.name }} get -o json secret {{ test_database_instance.admin_secret_name }} | jq '{apiVersion, kind, type, data, metadata: {name: "database-server-config"}}' | kubectl -n {{ default_ns.name }} apply -f -
We should delete all sql configs in dev namespaces because they currently point to the production server.
I think you're the only one using them this week, so I say blow it all away as soon as this is merged.
Some questions and comments. The thing about the cores doesn't need to be resolved in this PR. I had four things to take action on: the change to create_initial_account.py, line 90 of create_database.py, the unused arg in the shell script, and the root password.
tls/create_test_db_config.sh
Outdated
-keyout server-ca-key.pem -out server-ca.pem

create_key_and_cert server db
create_key_and_cert client client
the second argument is unused, right?
Yup, this is cruft I forgot to erase.
tls/create_test_db_config.sh
Outdated
"host": "db.$NAMESPACE.svc.cluster.local",
"port": 3306,
"user": "root",
"password": "pw",
I feel like we should still randomly generate the root password even though it's a test db.
Ya good point, this is not hard to do.
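A minimal sketch of what that could look like, using Python's stdlib `secrets` module (the variable name is illustrative):

```python
import secrets

# ~128 bits of entropy, URL-safe characters only, instead of a hard-coded "pw".
# Plenty for a throwaway test database, and free to do.
root_password = secrets.token_urlsafe(16)
```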
ci/ci/build.py
Outdated
regions=[REGION],
)

n_cores = 4 if scope == 'deploy' and not is_test_deployment else 1
We don't have resource plots for the first big db migration but we have them for b54aafb: https://batch.hail.is/batches/7287213/jobs/100. We stay mostly around 1.3 cores and never exceed 5GiB of RAM.
That job seems to take 40 minutes to construct the chunks which then use 5GiB. Have we considered parallelizing across batch ids? There's certainly a spread in batch size, so there might be stragglers, but that'll use very little memory client-side. Or maybe parallelize by batch and job? More overhead (each job gets a db call), but client could do more parallelism to compensate. Either way, it feels like the client shouldn't need four cores and 16GiB.
It has an impact on CI latency, though maybe overcome by all the other high latency activities (like compiling Hail). In the most recent deploy the ci_utils_image finished at 17:24:32 and the database job didn't start until 17:26:08 which suggests we had to wait for a VM to come online to fit our four cores.
Ya these are good points. Let's talk about this in a follow-up when Jackie's back as she probably has the most insight on this.
ci/create_database.py
Outdated
    if not admin_exists
    else "ALTER USER '{admin_username}'@'%' IDENTIFIED BY '{admin_password}';"
)

async def create_user_idempotent(admin_or_user, mysql_username, mysql_password):
This isn't really idempotent because someone could race between your SELECT and the CREATE USER. I'm not sure that matters though. There's only one person creating users at any time, right?
Well, I guess what I mean to say is: it is single-writer idempotent but not atomic, so with multiple writers it's not idempotent.
Good point, I think I'm abusing idempotent a little bit here. What I really mean is just create_user_if_doesnt_exist.
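For what it's worth, MySQL 5.7+ can make this atomic on the server side with `CREATE USER IF NOT EXISTS`, sidestepping the SELECT-then-CREATE race entirely. A minimal sketch (hypothetical helper; the driver binds the password parameter):

```python
def create_user_if_not_exists_sql(username: str, host: str = '%') -> str:
    # CREATE USER IF NOT EXISTS (MySQL 5.7+) is atomic on the server,
    # so there is no race window between an existence check and the CREATE.
    return f"CREATE USER IF NOT EXISTS '{username}'@'{host}' IDENTIFIED BY %s;"

# hypothetical usage with an async db wrapper:
#   await db.execute(create_user_if_not_exists_sql(mysql_username), (mysql_password,))
```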
ci/create_database.py
Outdated
)

async def create_user_idempotent(admin_or_user, mysql_username, mysql_password):
    existing_user = await db.execute_and_fetchone('SELECT user FROM mysql.user WHERE user=%s', (mysql_username,))
    if existing_user is not None and existing_user.get('user') == mysql_username:
the second condition is always true if the first condition is true, right? I'm fine making it an assertion below.
Ya true. Just going to delete this check and SELECT 1.
@@ -77,6 +80,19 @@ async def aexit(acontext_manager, exc_type=None, exc_val=None, exc_tb=None):
    return await acontext_manager.__aexit__(exc_type, exc_val, exc_tb)


async def resolve_test_db_endpoint(sql_config: SQLConfig) -> SQLConfig:
    service_name, namespace = sql_config.host[: -len('.svc.cluster.local')].split('.', maxsplit=1)
we're so close to removesuffix!
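For reference, on Python 3.9+ `str.removesuffix` replaces the manual slice arithmetic (the namespace name below is a made-up example):

```python
# Example host; 'my-namespace' is hypothetical.
host = 'db.my-namespace.svc.cluster.local'

# Python 3.9+: removesuffix avoids the [: -len(...)] slice.
service_name, namespace = host.removesuffix('.svc.cluster.local').split('.', maxsplit=1)
# service_name == 'db', namespace == 'my-namespace'
```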
    db_service = await client.read_namespaced_service(service_name, namespace)
    db_pod = await client.read_namespaced_pod(f'{db_service.spec.selector["app"]}-0', namespace)
    sql_config_dict = sql_config.to_dict()
    sql_config_dict['host'] = db_pod.status.host_ip
this feels a bit flaky, no? Are we not able to access this through the service's DNS name?
Ah. The workers. And exposing the service to the workers would require a private load balancer, etc., which is annoying and costly.
Ya, exactly. I didn't want to allocate a whole internal load balancer just to run a few couple-minute migrations.
It does feel a bit flaky, which is why I specifically don't do this if you're running in k8s. I am implicitly assuming here that the DB pod is not going to be relocated while the create_database job is running, which I guess would pose a problem anyway. This seemed the most reasonable, albeit a little janky, approach in terms of cost and complexity given that these are just test databases.
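The resolution step can be sketched as a small pure function (hypothetical helper; field names follow the Kubernetes API objects): the NodePort Service opens the DB port on every node, and the pod's `status.host_ip` identifies the node currently running the DB.

```python
def nodeport_endpoint(service, pod):
    """Given the DB's NodePort Service and its pod, return an (ip, port)
    endpoint reachable from a batch job outside the cluster network.

    Assumes, as discussed above, that the DB pod is not relocated while
    the migration job runs.
    """
    node_port = service.spec.ports[0].node_port  # same port on every node
    return pod.status.host_ip, node_port
```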
@@ -19,6 +19,14 @@ async def copy_identity_from_default(hail_credentials_secret_name: str) -> str:

    secret = await k8s_client.read_namespaced_secret(hail_credentials_secret_name, 'default')

    try:
        await k8s_client.delete_namespaced_secret(hail_credentials_secret_name, NAMESPACE)
what's up with this?
Because I was doing this in my namespace, which already had a dgoldste-gsa-key secret but not a user in the database, the create_namespaced_secret call failed. I needed a way to force-create a secret even if one already existed, and this is how we do it in auth/driver.py.
Such a good change!
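The force-create pattern described above boils down to delete-if-present, then create. A minimal sketch (hypothetical wrapper; `NotFoundError` stands in for the k8s client's 404 `ApiException`):

```python
class NotFoundError(Exception):
    """Stand-in for the kubernetes client's 404 ApiException."""

async def force_create_secret(client, name, namespace, body):
    # Delete any stale copy first; a missing secret is not an error.
    try:
        await client.delete_namespaced_secret(name, namespace)
    except NotFoundError:
        pass
    # Now the create cannot fail with "already exists".
    await client.create_namespaced_secret(namespace, body)
```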
c7a1003 to dfca8e1
Let's get this in so we can try scaling up the service backend tests.
ci test hung
This change creates MySQL pods for test and dev namespaces instead of sharing the CloudSQL database server. The areas of change are as follows:

Generation of the namespace's `database-server-config`

The current approach in main does a little trick. Since the current `createDatabase` step uses the `database-server-config` from default to generate admin/user sql configs, the CI pipeline creates a dummy database `test-database-instance` to create a `sql-test-instance-admin-config` that inherits the credentials from the production `database-server-config`, and then copies that within the test namespace to `database-server-config`.

In this change, since we are creating the server ourselves, we can replace these steps with one that creates a `database-server-config` from scratch and then uses it for the DB pod.

Overall, making these changes really gave me the heebie-jeebies that the test and dev namespaces have all these credentials to the CloudSQL server. I'm glad this gets rid of that.

Accessing the database server

We use the DB pod's service DNS name as the `host`, so inside Kubernetes this Just Works. The one caveat is the CI pipeline, in which we run migrations in batch jobs. Those jobs need a way to reach the DB pod. I achieve this with a NodePort, then use the job's K8s credentials to resolve the node and port that the DB is on. The code I've added to do this resolution feels a bit janky; I wouldn't mind some feedback on that. In terms of security, if a user job were somehow able to resolve the address of a test DB, it would still not have the credentials to access it, and this is currently also the case with the production database. Nevertheless, this does raise an action item: we should only allow traffic to the k8s and DB subnets for `network=private` jobs. I think we should make that a separate PR.

Database creation

In order to test this properly in a dev deploy, I needed to make some changes to `create_database.py`. In main, dev deploys can't create databases. I think they should be able to, and those operations should just be idempotent. When allowing dev deploys to create databases, I hit the `ALTER USER` lines in `create_database.py`, which lock out any already-deployed application, which feels silly. Instead, I create the MySQL user and the corresponding secret iff the MySQL username does not already exist.

`create_initial_account.py`

When starting from a fresh database, we hit a bootstrapping problem: we need to add the developer's user. The current way to do that is through `create_initial_account.py`, which assumes that the developer's gsa key secret is not already in the namespace. Now it could be, so I delete the key in the namespace, if it exists, before creating it. This could all change with my other PR to add devs to test namespaces, but this change to allow ephemeral dev databases will make testing that other PR much easier.