Casefold when processing email addresses #374

H-Shay · 2021-07-07T18:49:28Z

First crack at resolving #341. I tried to make sure emails were being casefolded in the proper places (i.e. if an address was passed through a few functions the email address was casefolded in a function that made the most sense logically) and I tried to only casefold where necessary.

Signed-off-by: H.Shay <shaysquared@gmail.com>

richvdh

I think the problem here is that if anybody has already registered with Richard@Example.com, then that's what's going to be stored in the database, so if we instead look up richard@example.com, we're not going to find it.

I think we either need to migrate the existing database entries to a casefolded format, or we need to do two lookups. Did anyone more familiar than me with Sydent (@babolivier? @reivilibre) had any thoughts on this?

H-Shay · 2021-07-12T18:24:47Z

I think Brendan outlined a migration step here: https://github.com/matrix-org/matrix-doc/blob/babolivier/msc_email_case/proposals/2265-email-lowercase.md#conflicts-with-existing-associations, but that seems to be something that would be implemented by people actually running/maintaining a Sydent instance?

babolivier · 2021-07-14T16:53:34Z

For context, I was thinking of that migration step as something that'd need to be implemented in Sydent. It probably can just be a script, though we'd need to figure out how to make sure people running their own Sydent know they need to run it - maybe we can prevent Sydent from starting if we find not-casefolded addresses in the DB (with an error to tell them to run the script to fix their setup)?

H-Shay · 2021-07-15T21:17:53Z

@babolivier: I am trying to think through the migration script portion of this, and I'm wondering if this makes sense to you (and forgive me if this is super obvious, I haven't done much work with databases yet):
Would it be fair to say that the script might look something like this:

from table local_threepid_associations, select all associations where the medium is 'email'
iterate through the selected associations, casefolding the email and storing the new casefolded email with it's associations
check to see if there are any other mxids associated with the original, non-casefolded email address
a. if there are, implement the steps you listed in the casefolding proposal:
1. list all MXIDs associated with a variant of the email address, and the timestamp of that association
2. delete all associations except for the most recent one [0]
3. inform the user of the deletion by sending them an email notice to the email address

My assumption is that we would have to repeat this process for the global_threepid_associations table as well- I wasn't totally sure about that. Does this approach make sense overall? Thanks in advance for your input!

babolivier · 2021-07-21T10:43:22Z

@H-Shay Gah I'm sorry I've completely missed your comment.

I think your solution is heading in the right direction, but it looks to me like "check to see if there are any other mxids associated with the original, non-casefolded email address" cannot work. There's a unique index on (medium, address) so only one association can exist for a given case of the same email address, and sqlite doesn't have a casefold function (so we can't do WHERE casefold(address) = '...'). This means that the only way to do this is to store every variation of a same email address we've encountered while iterating on the database results.

If we need to do this, I think the easiest solution is, after SELECTing all associations in the local_threepid_associations with the email medium, just iterating through them and storing them in a Dict[str, List[Tuple[str, str]]]. In this dict, the key is the casefolded address and the list contains associated tuple of MXIDs with the associated email address with its original case. Then you can just iterate over that dict and:

update the row for the first result
if there's any other result, delete them from the database, send an email with the MXIDs

Which might have been what you suggested, sorry if I haven't parsed it correctly. Also this is just a suggestion, feel free to adapt it as you want.

Bonus points for queuing up the DB operations and running them all at once with executemany instead of running them one by one, as a perf improvement.

In Python, my proposal would look something like this:

def update_assocs(conn: sqlite3.Connection):
    # Do everything in a transaction: that way if a deletion or update fails, or the
    # script is stopped halfway through, we don't end up in a weird state.
    cur = conn.cursor()

    res = cur.execute(
        """
        SELECT address, mxid FROM local_threepid_associations
        WHERE medium = 'email'
        ORDER BY ts DESC
        """
    )

    assocs: Dict[str, List[Tuple[str, str]]] = {}

    # Iterate over the raw database results to build our data set.
    # In this case, fetchall returns a list of tuples with the address as first value and
    # the mxid as second value.
    for row in res.fetchall():
        cf_address = row[0].casefold()

        if cf_address not in assocs:
            assocs[cf_address] = []

        assocs[cf_address].append((row[0], row[1]))

    # The UPDATE statement we'll give to executemany will need three arguments:
    #   * the casefolded address (SET address = ...)
    #   * the original address and the original MXID (WHERE address = ... AND mxid = ...)
    update_args: List[Tuple[str, str, str]] = []
    # The DELETE stetement we'll give to executemany will need one argument, which is the
    # original address (WHERE address = ...)
    delete_args: List[str] = []

    # Now we've got our data set, let's use it to determine what changes we need to make
    # to the database.
    for cf_address, assoc_tuples in assocs.items():
        # We know there's at least one tuple. If there's more than one, we've ordered the
        # results by most recent association first, so we can just use that one.
        update_args.append((cf_address, assoc_tuples[0][0], assoc_tuples[0][1]))

        if len(assoc_tuples) > 1:
            # List of the MXIDs we need to tell the email address's owner they've been
            # disassociated from the address.
            mxids: List[str] = []

            # Iterate over all associations except for the first one, since we've already
            # processed it.
            for assoc_tuple in assoc_tuples[1:]:
                delete_args.append(assoc_tuple[0])
                mxids.append(assoc_tuple[1])

            # Note: we could scope the mxids list outside of the for loop and process them
            # only after we've finished iterating.
            send_mail(mxids)

    # Delete the extra rows.
    if len(delete_args) > 0:
        cur.executemany(
            "DELETE FROM local_threepid_associations WHERE address = ?",
            delete_args,
        )

    # Update the remaining rows with the casefolded versions of the addresses.
    if len(update_args) > 0:
        cur.executemany(
            "UPDATE local_threepid_associations SET address = ? WHERE address = ? AND mxid = ?",
            update_args,
        )

    # We've finished updating the database, committing the transaction.
    conn.commit()

(again, this is just a suggestion, feel free to adapt it however you want; if anything it'll also require more logging since there's none at all in this snippet)

My assumption is that we would have to repeat this process for the global_threepid_associations table as well- I wasn't totally sure about that

Yes I'm afraid we'll have to. I think the best approach would be to get every row that's got a originServer value that matches our server name, then do basically the same operation as with local_threepid_associations; except we'll also need to update the sgAssoc with the new data and recalculate its signature. I think a good place in the code to look at to figure out how to do this would be here:

sydent/sydent/http/servlets/lookupservlet.py

Lines 60 to 86 in 6626623

    
           sgassoc = globalAssocStore.signedAssociationStringForThreepid(medium, address) 
        
           if not sgassoc: 
        
               return {} 
        
           sgassoc = json_decoder.decode(sgassoc) 
        
           if self.sydent.server_name not in sgassoc["signatures"]: 
        
               # We have not yet worked out what the proper trust model should be. 
        
               # 
        
               # Maybe clients implicitly trust a server they talk to (and so we 
        
               # should sign every assoc we return as ourselves, so they can 
        
               # verify this). 
        
               # 
        
               # Maybe clients really want to know what server did the original 
        
               # verification, and want to only know exactly who signed the assoc. 
        
               # 
        
               # Until we work out what we should do, sign all assocs we return as 
        
               # ourself. This is vaguely ok because there actually is only one 
        
               # identity server, but it happens to have two names (matrix.org and 
        
               # vector.im), and so we're not really lying too much. 
        
               # 
        
               # We do this when we return assocs, not when we receive them over 
        
               # replication, so that we can undo this decision in the future if 
        
               # we wish, without having destroyed the raw underlying data. 
        
               sgassoc = signedjson.sign.sign_json( 
        
                   sgassoc, self.sydent.server_name, self.sydent.keyring.ed25519 
        
               )

Hope this was helpful (and sorry for the wall of text), feel free to ping me if you've got any question 🙂

babolivier · 2021-07-22T10:10:41Z

As pointed out in DM, we also need to update the lookup_hashes, which I've missed.

H-Shay · 2021-07-29T22:18:41Z

So I added the functions to update the DB, but I wasn't sure where to put the functions, so I just popped them in a file in the DB folder for now, which I am sure is wrong. Obviously they will have to be modified slightly depending on where the file ends up living, but the core logic will stay the same and that's what I am looking for feedback on. I also wrote some tests to try and convince myself that the scrips were working, I pushed them just in case people wanted to look but I assume it might be better to drop them before actually merging?

reivilibre

This is coming along!

Sorry for the wall of comments, but hopefully you should be able to get something out of them. I tried to explain what I'm thinking. Ask about anything you like (here or Matrix is fine)!

And by all means, please do include the tests! We will bite your hands off for tests!

N.B. I haven't checked that you're casefolding in all the needed places, but I'll go through and double check that in the next review.
I also should have a closer look at the tests.

res/matrix-org/migration_template.eml

sydent/db/migration.py

reivilibre · 2021-07-30T08:57:44Z

sydent/db/migration.py

+    res = cur.execute(
+        "SELECT address, mxid FROM local_threepid_associations WHERE medium = 'email'"
+        "ORDER BY ts DESC"
+    )


Do we know how many rows we're expecting? If this were e.g. Synapse, we'd split this up into batches somehow to avoid a really large query.

I don't know how big our Sydent deployments are, but to be honest I'm not sure it's vector.im that's the scary one -- perhaps some of the government deployments have everyone in the identity lookup, which I would guess could be several hundred thousand users? (@babolivier do you know?)

If you end up making this run in batches, you might want to save the position of this process to the database so that if Sydent is restarted, it can resume where it left off.

So this I don't know-I do know that the deletion step is supposed to be rare, but as for how large a DB this code will run on, I have no idea-this might be a questions for Brendan.

sydent/db/migration.py

tests/test_migration.py

reivilibre

some more comments :)

res/matrix-org/migration_template.eml

sydent/threepid/bind.py

sydent/db/migration.py

tests/test_migration.py

reivilibre

Hopefully we're coming close now. I think you saw the discussion about how this is going to work in #synapse-dev:

We would like to make this migration executable as a script.
We don't need to check for casefoldable addresses & warn -- we'll assume that the sysadmin will check the upgrade notes for now.

The current structure (where a lot of reading happens before a small amount of writing all happens at the end) is good for minimising the time that this will hold an EXCLUSIVE lock on the SQLite3 database, so it should be fine to run on a live system (though the SELECT will block other writers, so it might interrupt the ability to create new bindings for a short time, but I think we'll be OK as the database isn't very large).

It would be nice to have two options for this script:

dry-run: don't actually make any changes or send any e-mails
no-emails: don't send e-mails

Sorry for the brain-dump... but does it make sense? :)

Thanks!

changelog.d/374.misc

sydent/db/invite_tokens.py

reivilibre · 2021-08-10T16:59:16Z

sydent/db/migration.py

+        sendEmail(
+            self.sydent,
+            templateFile,
+            address,
+            {"mxid": "mxid", "subject_header_value": "MatrixID Update"},
+        )


Looking at this again, I think one thing we could do would be to not send an e-mail if all the bindings are for the same Matrix ID.
(For bob@example.org → @bob:example.org and BOB@example.org → @bob:example.org, it won't make any difference if you remove one of the two.)
Just thinking that this will reduce the number of e-mails sent and will reduce the number of people who might get confused by this.

scripts/casefold_db.py

scripts/casefold_db_dry_run.py

reivilibre · 2021-08-13T14:55:15Z

scripts/casefold_db_no_email.py

@@ -0,0 +1,199 @@
+#!/usr/bin/env python


ditto for this file; make it a flag

reivilibre

thank you, the structure is good

whilst you're here, do you want to add scripts/ as one of the directories to be checked in .github/workflows/pipeline.yml? 😇
(If the other scripts are making the CI fail, then just add scripts/casefold_db.py instead, since that's not your fault and we shouldn't be fixing them in this PR :))

scripts/casefold_db.py

tests/test_casefold_migration.py

reivilibre

looking great; let's get another set of eyes from Patrick (edit: or Brendan; I think Patrick was looking at the other PR, that's me being confused) today?

scripts/casefold_db.py

tests/test_casefold_migration.py

babolivier

Thanks a lot for working on this! And sorry I didn't have a look sooner :/
It broadly looks good, there are just a few issues to fix before I'm 100% happy with it.

babolivier · 2021-08-23T16:36:08Z

res/matrix-org/migration_template.eml

+Hello,
+
+We’ve recently improved how people discover your Matrix account.
+In the past, identity services did not take capitalization into account when creating and storing Matrix IDs. We’ve now updated this behavior so anyone can find you, no matter how your email is capitalized. As part of this recent update, the duplicate Matrix ID %(mxid)s is no longer associated with this e-mail address.


I think some parts of this paragraph are confusing, e.g.:

identity services did not take capitalization into account when creating and storing Matrix IDs

Isn't the issue that they did take capitalisation (btw I think case might be a better word?) into account and we don't want them to?

the duplicate Matrix ID %(mxid)s is no longer associated with this e-mail address

The Matrix ID might still be associated with the email address, just not the email address with this case.

Generally, I think this should be looked at by our wordsmiths before we can use it on vector.im and matrix.org, so I don't think fixing it should block this PR as long as we remember to ping the right people internally.

tests/test_casefold_migration.py

scripts/casefold_db.py

babolivier

Almost there!

scripts/casefold_db.py

tests/test_casefold_migration.py

…is back

H-Shay added 3 commits July 7, 2021 10:59

add casefolding to bind and associated functions

087bd4a

add casefolding to necessary functions + lints

0988bdd

Signed-off-by: H.Shay <shaysquared@gmail.com>

add changelog

d80191a

callahad requested a review from a team July 8, 2021 21:49

richvdh reviewed Jul 12, 2021

View reviewed changes

richvdh requested review from babolivier and reivilibre July 13, 2021 16:50

H-Shay added 6 commits July 27, 2021 13:33

fix broken terms test

3227de7

draft migration script

5950186

draft test for migration script

80a33f3

add send email and email test

406c9ac

function for updating global table added

d20c2bc

lints

4449f00

reivilibre requested changes Jul 30, 2021

View reviewed changes

H-Shay added 3 commits August 2, 2021 12:11

requested changes

9b363c5

lints

a99a18e

fix lint fail

568b603

H-Shay requested a review from reivilibre August 2, 2021 21:03

reivilibre requested changes Aug 4, 2021

View reviewed changes

H-Shay added 4 commits August 4, 2021 13:10

update tests + misc requested changes

775f0ec

extract casefold logic to function and add to affected files

7ae6d36

update email template and associated test

5d3ecc8

lints

76ad036

H-Shay requested a review from reivilibre August 5, 2021 22:35

reivilibre requested changes Aug 10, 2021

View reviewed changes

H-Shay added 5 commits August 10, 2021 16:19

requested changes

e23232a

rename file, rework to be script, rework tests to test script

3871005

update tests and dry run/no email versions

1f8ebfe

update tests + scripts

69f654d

lints

4b83b25

H-Shay requested a review from reivilibre August 13, 2021 14:33

reivilibre requested changes Aug 13, 2021

View reviewed changes

H-Shay added 4 commits August 16, 2021 13:46

add ability to call script from command line and update tests

ac08265

remove deleted files

91d98d3

refine commandline invocation, add print statement

ec585a3

lints

ef40611

H-Shay requested a review from reivilibre August 16, 2021 21:54

reivilibre requested changes Aug 17, 2021

View reviewed changes

requested changes

e774fa1

H-Shay requested a review from reivilibre August 18, 2021 20:26

reivilibre requested changes Aug 19, 2021

View reviewed changes

scripts/casefold_db.py Outdated Show resolved Hide resolved

tests/test_casefold_migration.py Outdated Show resolved Hide resolved

requested changes + lints

83560a4

babolivier suggested changes Aug 23, 2021

View reviewed changes

H-Shay added 2 commits August 23, 2021 13:13

requested changes

eb5ad93

lints

f1b4f9e

H-Shay requested a review from babolivier August 23, 2021 20:21

babolivier suggested changes Aug 24, 2021

View reviewed changes

scripts/casefold_db.py Outdated Show resolved Hide resolved

tests/test_casefold_migration.py Outdated Show resolved Hide resolved

Actually let me just do that myself - doesn't have to wait till Shay …

969dda4

…is back

babolivier approved these changes Aug 24, 2021

View reviewed changes

H-Shay added 2 commits August 24, 2021 08:38

Merge branch 'matrix-org:main' into casefold

b6b95c6

lints

96eee48

reivilibre merged commit bc72eb5 into matrix-org:main Aug 24, 2021

callahad mentioned this pull request Aug 24, 2021

Support MSC2265: Casefold email addresses automatically #341

Closed

anoadragon453 mentioned this pull request Sep 13, 2021

Fix KeyError on unbinding addresses #397

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Casefold when processing email addresses #374

Casefold when processing email addresses #374

H-Shay commented Jul 7, 2021

richvdh left a comment

H-Shay commented Jul 12, 2021

babolivier commented Jul 14, 2021 •

edited

Loading

H-Shay commented Jul 15, 2021

babolivier commented Jul 21, 2021

babolivier commented Jul 22, 2021

H-Shay commented Jul 29, 2021

reivilibre left a comment

reivilibre Jul 30, 2021

H-Shay Aug 2, 2021

reivilibre left a comment

reivilibre left a comment •

edited

Loading

reivilibre Aug 10, 2021

reivilibre Aug 13, 2021

reivilibre left a comment

reivilibre left a comment •

edited

Loading

babolivier left a comment

babolivier Aug 23, 2021 •

edited

Loading

babolivier left a comment

Casefold when processing email addresses #374

Casefold when processing email addresses #374

Conversation

H-Shay commented Jul 7, 2021

richvdh left a comment

Choose a reason for hiding this comment

H-Shay commented Jul 12, 2021

babolivier commented Jul 14, 2021 • edited Loading

H-Shay commented Jul 15, 2021

babolivier commented Jul 21, 2021

babolivier commented Jul 22, 2021

H-Shay commented Jul 29, 2021

reivilibre left a comment

Choose a reason for hiding this comment

reivilibre Jul 30, 2021

Choose a reason for hiding this comment

H-Shay Aug 2, 2021

Choose a reason for hiding this comment

reivilibre left a comment

Choose a reason for hiding this comment

reivilibre left a comment • edited Loading

Choose a reason for hiding this comment

reivilibre Aug 10, 2021

Choose a reason for hiding this comment

reivilibre Aug 13, 2021

Choose a reason for hiding this comment

reivilibre left a comment

Choose a reason for hiding this comment

reivilibre left a comment • edited Loading

Choose a reason for hiding this comment

babolivier left a comment

Choose a reason for hiding this comment

babolivier Aug 23, 2021 • edited Loading

Choose a reason for hiding this comment

babolivier left a comment

Choose a reason for hiding this comment

babolivier commented Jul 14, 2021 •

edited

Loading

reivilibre left a comment •

edited

Loading

reivilibre left a comment •

edited

Loading

babolivier Aug 23, 2021 •

edited

Loading