-
Notifications
You must be signed in to change notification settings - Fork 84
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Casefold when processing email addresses #374
Conversation
Signed-off-by: H.Shay <shaysquared@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem here is that if anybody has already registered with Richard@Example.com
, then that's what's going to be stored in the database, so if we instead look up richard@example.com
, we're not going to find it.
I think we either need to migrate the existing database entries to a casefolded format, or we need to do two lookups. Did anyone more familiar than me with Sydent (@babolivier? @reivilibre) had any thoughts on this?
I think Brendan outlined a migration step here: https://github.com/matrix-org/matrix-doc/blob/babolivier/msc_email_case/proposals/2265-email-lowercase.md#conflicts-with-existing-associations, but that seems to be something that would be implemented by people actually running/maintaining a Sydent instance? |
For context, I was thinking of that migration step as something that'd need to be implemented in Sydent. It probably can just be a script, though we'd need to figure out how to make sure people running their own Sydent know they need to run it - maybe we can prevent Sydent from starting if we find not-casefolded addresses in the DB (with an error to tell them to run the script to fix their setup)? |
@babolivier: I am trying to think through the migration script portion of this, and I'm wondering if this makes sense to you (and forgive me if this is super obvious, I haven't done much work with databases yet):
My assumption is that we would have to repeat this process for the global_threepid_associations table as well- I wasn't totally sure about that. Does this approach make sense overall? Thanks in advance for your input! |
@H-Shay Gah I'm sorry I've completely missed your comment. I think your solution is heading in the right direction, but it looks to me like "check to see if there are any other mxids associated with the original, non-casefolded email address" cannot work. There's a unique index on If we need to do this, I think the easiest solution is, after
Which might have been what you suggested, sorry if I haven't parsed it correctly. Also this is just a suggestion, feel free to adapt it as you want. Bonus points for queuing up the DB operations and running them all at once with In Python, my proposal would look something like this: def update_assocs(conn: sqlite3.Connection):
# Do everything in a transaction: that way if a deletion or update fails, or the
# script is stopped halfway through, we don't end up in a weird state.
cur = conn.cursor()
res = cur.execute(
"""
SELECT address, mxid FROM local_threepid_associations
WHERE medium = 'email'
ORDER BY ts DESC
"""
)
assocs: Dict[str, List[Tuple[str, str]]] = {}
# Iterate over the raw database results to build our data set.
# In this case, fetchall returns a list of tuples with the address as first value and
# the mxid as second value.
for row in res.fetchall():
cf_address = row[0].casefold()
if cf_address not in assocs:
assocs[cf_address] = []
assocs[cf_address].append((row[0], row[1]))
# The UPDATE statement we'll give to executemany will need three arguments:
# * the casefolded address (SET address = ...)
# * the original address and the original MXID (WHERE address = ... AND mxid = ...)
update_args: List[Tuple[str, str, str]] = []
# The DELETE stetement we'll give to executemany will need one argument, which is the
# original address (WHERE address = ...)
delete_args: List[str] = []
# Now we've got our data set, let's use it to determine what changes we need to make
# to the database.
for cf_address, assoc_tuples in assocs.items():
# We know there's at least one tuple. If there's more than one, we've ordered the
# results by most recent association first, so we can just use that one.
update_args.append((cf_address, assoc_tuples[0][0], assoc_tuples[0][1]))
if len(assoc_tuples) > 1:
# List of the MXIDs we need to tell the email address's owner they've been
# disassociated from the address.
mxids: List[str] = []
# Iterate over all associations except for the first one, since we've already
# processed it.
for assoc_tuple in assoc_tuples[1:]:
delete_args.append(assoc_tuple[0])
mxids.append(assoc_tuple[1])
# Note: we could scope the mxids list outside of the for loop and process them
# only after we've finished iterating.
send_mail(mxids)
# Delete the extra rows.
if len(delete_args) > 0:
cur.executemany(
"DELETE FROM local_threepid_associations WHERE address = ?",
delete_args,
)
# Update the remaining rows with the casefolded versions of the addresses.
if len(update_args) > 0:
cur.executemany(
"UPDATE local_threepid_associations SET address = ? WHERE address = ? AND mxid = ?",
update_args,
)
# We've finished updating the database, committing the transaction.
conn.commit() (again, this is just a suggestion, feel free to adapt it however you want; if anything it'll also require more logging since there's none at all in this snippet)
Yes I'm afraid we'll have to. I think the best approach would be to get every row that's got a sydent/sydent/http/servlets/lookupservlet.py Lines 60 to 86 in 6626623
Hope this was helpful (and sorry for the wall of text), feel free to ping me if you've got any question 🙂 |
As pointed out in DM, we also need to update the |
So I added the functions to update the DB, but I wasn't sure where to put the functions, so I just popped them in a file in the DB folder for now, which I am sure is wrong. Obviously they will have to be modified slightly depending on where the file ends up living, but the core logic will stay the same and that's what I am looking for feedback on. I also wrote some tests to try and convince myself that the scrips were working, I pushed them just in case people wanted to look but I assume it might be better to drop them before actually merging? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is coming along!
Sorry for the wall of comments, but hopefully you should be able to get something out of them. I tried to explain what I'm thinking. Ask about anything you like (here or Matrix is fine)!
And by all means, please do include the tests! We will bite your hands off for tests!
N.B. I haven't checked that you're casefolding in all the needed places, but I'll go through and double check that in the next review.
I also should have a closer look at the tests.
sydent/db/migration.py
Outdated
res = cur.execute( | ||
"SELECT address, mxid FROM local_threepid_associations WHERE medium = 'email'" | ||
"ORDER BY ts DESC" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we know how many rows we're expecting? If this were e.g. Synapse, we'd split this up into batches somehow to avoid a really large query.
I don't know how big our Sydent deployments are, but to be honest I'm not sure it's vector.im that's the scary one -- perhaps some of the government deployments have everyone in the identity lookup, which I would guess could be several hundred thousand users? (@babolivier do you know?)
If you end up making this run in batches, you might want to save the position of this process to the database so that if Sydent is restarted, it can resume where it left off.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this I don't know-I do know that the deletion step is supposed to be rare, but as for how large a DB this code will run on, I have no idea-this might be a questions for Brendan.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some more comments :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hopefully we're coming close now. I think you saw the discussion about how this is going to work in #synapse-dev:
We would like to make this migration executable as a script.
We don't need to check for casefoldable addresses & warn -- we'll assume that the sysadmin will check the upgrade notes for now.
The current structure (where a lot of reading happens before a small amount of writing all happens at the end) is good for minimising the time that this will hold an EXCLUSIVE lock on the SQLite3 database, so it should be fine to run on a live system (though the SELECT will block other writers, so it might interrupt the ability to create new bindings for a short time, but I think we'll be OK as the database isn't very large).
It would be nice to have two options for this script:
- dry-run: don't actually make any changes or send any e-mails
- no-emails: don't send e-mails
Sorry for the brain-dump... but does it make sense? :)
Thanks!
sydent/db/migration.py
Outdated
sendEmail( | ||
self.sydent, | ||
templateFile, | ||
address, | ||
{"mxid": "mxid", "subject_header_value": "MatrixID Update"}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at this again, I think one thing we could do would be to not send an e-mail if all the bindings are for the same Matrix ID.
(For bob@example.org → @bob:example.org
and BOB@example.org → @bob:example.org
, it won't make any difference if you remove one of the two.)
Just thinking that this will reduce the number of e-mails sent and will reduce the number of people who might get confused by this.
scripts/casefold_db_no_email.py
Outdated
@@ -0,0 +1,199 @@ | |||
#!/usr/bin/env python |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto for this file; make it a flag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you, the structure is good
whilst you're here, do you want to add scripts/
as one of the directories to be checked in .github/workflows/pipeline.yml
? 😇
(If the other scripts are making the CI fail, then just add scripts/casefold_db.py
instead, since that's not your fault and we shouldn't be fixing them in this PR :))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking great; let's get another set of eyes from Patrick (edit: or Brendan; I think Patrick was looking at the other PR, that's me being confused) today?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for working on this! And sorry I didn't have a look sooner :/
It broadly looks good, there are just a few issues to fix before I'm 100% happy with it.
Hello, | ||
|
||
We’ve recently improved how people discover your Matrix account. | ||
In the past, identity services did not take capitalization into account when creating and storing Matrix IDs. We’ve now updated this behavior so anyone can find you, no matter how your email is capitalized. As part of this recent update, the duplicate Matrix ID %(mxid)s is no longer associated with this e-mail address. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think some parts of this paragraph are confusing, e.g.:
identity services did not take capitalization into account when creating and storing Matrix IDs
Isn't the issue that they did take capitalisation (btw I think case might be a better word?) into account and we don't want them to?
the duplicate Matrix ID %(mxid)s is no longer associated with this e-mail address
The Matrix ID might still be associated with the email address, just not the email address with this case.
Generally, I think this should be looked at by our wordsmiths before we can use it on vector.im and matrix.org, so I don't think fixing it should block this PR as long as we remember to ping the right people internally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Almost there!
First crack at resolving #341. I tried to make sure emails were being casefolded in the proper places (i.e. if an address was passed through a few functions the email address was casefolded in a function that made the most sense logically) and I tried to only casefold where necessary.