Re-enable 2FA metrics, speed up queries by denormalizing counts #11701

Closed · di opened this issue Jun 27, 2022 · 11 comments · Fixed by #11724

Comments

@di
Member

di commented Jun 27, 2022

In #11699 we disabled the 2FA metrics temporarily due to DB performance issues.

We should improve the count performance with something similar to #745 and re-enable the metrics.

@di
Member Author

di commented Jun 28, 2022

After chatting w/ @dstufft, we came up with this plan:

After looking at the poorly performing query, it appears that the cause is not poor performance of count itself, but a nested SELECT statement being run for every user:

SELECT 
  count(*) AS count_1 
FROM 
  (
    SELECT 
      users.id AS users_id, 
      users.sitemap_bucket AS users_sitemap_bucket, 
      users.username AS users_username, 
      users.name AS users_name, 
      users.password AS users_password, 
      users.password_date AS users_password_date, 
      users.is_active AS users_is_active, 
      users.is_frozen AS users_is_frozen, 
      users.is_superuser AS users_is_superuser, 
      users.is_moderator AS users_is_moderator, 
      users.is_psf_staff AS users_is_psf_staff, 
      users.prohibit_password_reset AS users_prohibit_password_reset, 
      users.date_joined AS users_date_joined, 
      users.last_login AS users_last_login, 
      users.disabled_for AS users_disabled_for, 
      users.totp_secret AS users_totp_secret, 
      users.last_totp_value AS users_last_totp_value 
    FROM 
      users 
    WHERE 
      (
        SELECT 
          true AS anon_2 
        FROM 
          user_security_keys 
        WHERE 
          user_security_keys.user_id = users.id 
          OR users.id = users.id 
          AND users.totp_secret IS NOT NULL 
        LIMIT 
          1
      )
  ) AS anon_1

This is due to the expression for the has_two_factor hybrid property here:

@has_two_factor.expression  # type: ignore
def has_two_factor(self):
    return (
        select([True])
        .where(
            or_(
                WebAuthn.user_id == self.id,
                and_(User.id == self.id, User.totp_secret.is_not(None)),
            )
        )
        .limit(1)
    )

So rather than attempting to add triggers and events for counting each of these attributes in the database, we'll simplify this expression by denormalizing a User.has_webauthn column and using that instead.
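
To make that concrete, here is a minimal sketch (not the actual warehouse model; the integer primary key and base class are just for illustration) of what the denormalized column could look like on a SQLAlchemy declarative User model:

from sqlalchemy import Boolean, Column, Integer, sql
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class User(Base):
    __tablename__ = "users"

    id = Column(Integer, primary_key=True)
    # Denormalized flag, kept in sync with the presence of rows in
    # user_security_keys (e.g. by application code or a trigger).
    has_webauthn = Column(Boolean, nullable=False, server_default=sql.false())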

The steps towards this are roughly:

Optionally, if we find that the task is still running too long, we can make changes to add a readonly setting to tasks similar to what we have for views: https://github.com/pypi/warehouse/blob/main/warehouse/db.py#L190-L194
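
For illustration, a minimal hypothetical sketch of what such a read-only toggle for a task's session might look like, assuming the mechanism is simply marking the PostgreSQL transaction read-only (the helper name is made up; the linked db.py code is not reproduced here):

from sqlalchemy import text


def use_read_only_transaction(session):
    # Hypothetical helper: must run before any other statement in the
    # transaction, so that accidental writes fail fast.
    session.execute(text("SET TRANSACTION READ ONLY"))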

@dstufft
Member

dstufft commented Jun 28, 2022

Create new column for User.has_webauthn, make it non-nullable, default false

Double check that we're on PG 11+; prior to PG 11, adding a column with a default wasn't safe, but in PG 11 it became safe.

To make sure this doesn't maintain an exclusive lock on the db, commit and open transactions in upgrade method, similar to

Just make sure that you chunk the data backfill, and that your update query ignores things that have already been migrated. That may be easier to do with 3 migrations (a rough sketch follows the list):

  1. Add the column, nullable no default + triggers.
  2. Update batches, with a WHERE has_webauthn IS NULL limitation.
  3. Set the column to not nullable, default of false.
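
A rough, hedged sketch of that three-migration sequence (Alembic revision boilerplate omitted, the upgrade function names are invented for illustration, and the triggers from step 1 are left out):

import sqlalchemy as sa
from alembic import op


# Migration 1: add the column, nullable, no default.
def upgrade_add_column():
    op.add_column("users", sa.Column("has_webauthn", sa.Boolean(), nullable=True))


# Migration 2: backfill in batches, with a WHERE has_webauthn IS NULL
# limitation (see the autocommit_block example further down).


# Migration 3: once the backfill is done, lock the column down.
def upgrade_finalize():
    op.alter_column(
        "users",
        "has_webauthn",
        nullable=False,
        server_default=sa.text("false"),
    )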

I also just looked; our env.py for alembic contains these lines:

context.configure(
    connection=connection,
    target_metadata=db.metadata,
    compare_server_default=True,
)
with context.begin_transaction():
    context.run_migrations()

These lines live in the run_migrations_online() method. It appears that by default alembic will open a single transaction and run all of the migrations inside of it. I think we want to change that to:

context.configure(
    connection=connection,
    target_metadata=db.metadata,
    compare_server_default=True,
    transaction_per_migration=True,
)
with context.begin_transaction():
    context.run_migrations()

This will make the transaction opened by context.begin_transaction a logical transaction (rather than a database transaction); internally, inside of context.run_migrations(), alembic will open up a real database transaction before executing each upgrade() method.

I also see that there's actually a mode to let you turn on PG's auto commit inside of a migration:

from alembic import op


def upgrade():
    # Every statement inside of the autocommit block will be automatically committed.
    with op.get_context().autocommit_block():
        # We need to use get_bind() here because op.execute doesn't return a result.
        conn = op.get_bind()

        # Update users in chunks, until no more users have a null has_webauthn
        while True:
            r = conn.execute("UPDATE ... WHERE id IN (SELECT id FROM users WHERE has_webauthn IS NULL LIMIT 1000)")
            if r.rowcount == 0:
                break
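
Purely for illustration (the actual statement is elided above as UPDATE ...), a chunked backfill that derives has_webauthn from user_security_keys could look roughly like this; the exact SQL is an assumption, not the real migration:

# Illustrative only: assumes has_webauthn should be true exactly when the user
# has at least one row in user_security_keys.
BACKFILL_SQL = """
UPDATE users
SET has_webauthn = EXISTS (
    SELECT 1
    FROM user_security_keys
    WHERE user_security_keys.user_id = users.id
)
WHERE users.id IN (
    SELECT id FROM users WHERE has_webauthn IS NULL LIMIT 1000
)
"""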

Update task/expression to consider User.has_webauthn instead

You can refactor the User.has_two_factor to be something like:

@hybrid_property
def has_two_factor(self):
    return self.totp_secret is not None or self.has_webauthn

@has_two_factor.expression
def has_two_factor(self):
    return or_(self.totp_secret.is_not(None), self.has_webauthn)

Then just continue to use User.has_two_factor; since it'll just be using two columns on the users table, it won't involve any gnarly sub-selects and should be fast. Might be able to speed it up with an index, not sure.

@dstufft
Member

dstufft commented Jun 28, 2022

Also to be clear, I didn't actually EXPLAIN that query to figure out what the problem was; I'm just eyeballing it and guessing. It might be useful to do EXPLAIN or EXPLAIN ANALYZE on it and make sure that my assumption is correct... but I'm pretty sure it is. I can't imagine that subquery performing well.

@miketheman
Member

The local dev database obtained with make initdb shows a total user count of 40,239. Is this representative of the user count in production, or are there more?

Again, in development: the query performs two sequential scans, one on users and one on user_security_keys (0 rows in dev), so if either of those tables is largish, there may be a good partial index opportunity on users.totp_secret.
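
For what it's worth, a hedged sketch of what that partial index could look like in an Alembic migration (the index name is made up for illustration):

import sqlalchemy as sa
from alembic import op


def upgrade():
    # Partial index: only rows where totp_secret is set are indexed.
    op.create_index(
        "ix_users_totp_secret_partial",
        "users",
        ["id"],
        postgresql_where=sa.text("totp_secret IS NOT NULL"),
    )


def downgrade():
    op.drop_index("ix_users_totp_secret_partial", table_name="users")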

@dstufft
Member

dstufft commented Jun 29, 2022

warehouse=> SELECT COUNT(*) FROM users;
 count
--------
 604282
(1 row)

warehouse=> SELECT COUNT(*) FROM user_security_keys;
 count
-------
  4210
(1 row)
 Aggregate  (cost=6127375.34..6127375.35 rows=1 width=8) (actual time=346089.288..346089.292 rows=1 loops=1)
   ->  Seq Scan on users  (cost=0.00..6126620.24 rows=302040 width=0) (actual time=1.329..346068.003 rows=27481 loops=1)
         Filter: (SubPlan 1)
         Rows Removed by Filter: 576801
         SubPlan 1
           ->  Limit  (cost=0.00..10.11 rows=1 width=1) (actual time=0.571..0.571 rows=0 loops=604282)
                 ->  Seq Scan on user_security_keys  (cost=0.00..222.34 rows=22 width=1) (actual time=0.569..0.569 rows=0 loops=604282)
                       Filter: ((user_id = users.id) OR ((users.id = users.id) AND (users.totp_secret IS NOT NULL)))
                       Rows Removed by Filter: 4024
 Planning Time: 0.121 ms
 Execution Time: 346089.324 ms
(11 rows)

In this case, that one query was taking almost 6 minutes to run, there were other queries, and the task was running every 5 minutes.

@di
Member Author

di commented Jun 29, 2022

There are currently 604,282 users in production (these stats are on the front page of pypi.org) and 4210 rows in the user_security_keys table, so the doubly nested SELECT is very likely the cause of the performance issues here. But adding an index on totp_secret (and has_webauthn) probably wouldn't hurt, I think.

@dstufft
Member

dstufft commented Jun 29, 2022

Just because I nerd sniped myself into seeing if I could optimize that query without denormalization:

EXPLAIN ANALYZE SELECT COUNT(*)
FROM users
LEFT JOIN LATERAL (
    SELECT user_security_keys.id
    FROM user_security_keys
    WHERE user_security_keys.user_id = users.id
    FETCH FIRST 1 ROW ONLY
) webauthn ON true
WHERE users.totp_secret IS NOT NULL
    OR webauthn.id IS NOT NULL;


 Aggregate  (cost=5048886.60..5048886.61 rows=1 width=8) (actual time=1862.242..1862.246 rows=1 loops=1)
   ->  Nested Loop Left Join  (cost=0.28..5047376.40 rows=604080 width=0) (actual time=0.045..1852.948 rows=27481 loops=1)
         Filter: ((users.totp_secret IS NOT NULL) OR (user_security_keys.id IS NOT NULL))
         Rows Removed by Filter: 576817
         ->  Seq Scan on users  (cost=0.00..21430.80 rows=604080 width=37) (actual time=0.011..295.354 rows=604298 loops=1)
         ->  Limit  (cost=0.28..8.30 rows=1 width=16) (actual time=0.002..0.002 rows=0 loops=604298)
               ->  Index Scan using ix_user_security_keys_user_id on user_security_keys  (cost=0.28..8.30 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=604298)
                     Index Cond: (user_id = users.id)
 Planning Time: 0.182 ms
 Execution Time: 1862.356 ms

346089ms down to 1862ms with the same results, no denormalization, no schema changes at all.
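
For reference, a hedged sketch of roughly how that LEFT JOIN LATERAL count might be written with SQLAlchemy's 1.4-style select() (assuming the warehouse User and WebAuthn models and import path; untested):

from sqlalchemy import func, or_, select, true

from warehouse.accounts.models import User, WebAuthn  # assumed import path

# Lateral subquery: at most one security key row per user.
webauthn_lateral = (
    select(WebAuthn.id)
    .where(WebAuthn.user_id == User.id)
    .limit(1)
    .lateral()
)

two_factor_count = (
    select(func.count())
    .select_from(User)
    .outerjoin(webauthn_lateral, true())
    .where(
        or_(
            User.totp_secret.is_not(None),
            webauthn_lateral.c.id.is_not(None),
        )
    )
)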

@miketheman
Member

miketheman commented Jun 29, 2022

Nice! From this EXPLAIN ANALYZE I would hazard that an added index on users.totp_secret would reduce that sequential scan even further. Interestingly, webauthn.id makes no appearance. Whoops, that's a derived column.

@dstufft
Member

dstufft commented Jun 29, 2022

I don't think it would, because the filter on users.totp_secret happens outside of the Seq Scan, and logically you have to join all the rows to find out whether or not there's a webauthn credential attached to the user.

I managed to get it down to 326ms though:

EXPLAIN ANALYZE SELECT
    (SELECT COUNT(*)
        FROM users
        WHERE users.totp_secret IS NOT NULL) +
    (SELECT COUNT(*)
        FROM (
            SELECT DISTINCT users.id
            FROM users
            JOIN user_security_keys
                ON user_security_keys.user_id = users.id
            WHERE users.totp_secret IS NULL) j) AS total;


 Result  (cost=42238.53..42238.54 rows=1 width=8) (actual time=326.468..326.594 rows=1 loops=1)
   InitPlan 1 (returns $1)
     ->  Finalize Aggregate  (cost=18932.91..18932.92 rows=1 width=8) (actual time=71.467..71.534 rows=1 loops=1)
           ->  Gather  (cost=18932.69..18932.90 rows=2 width=8) (actual time=70.987..71.522 rows=3 loops=1)
                 Workers Planned: 2
                 Workers Launched: 2
                 ->  Partial Aggregate  (cost=17932.69..17932.70 rows=1 width=8) (actual time=68.291..68.293 rows=1 loops=3)
                       ->  Parallel Seq Scan on users  (cost=0.00..17907.00 rows=10278 width=0) (actual time=0.011..64.812 rows=8634 loops=3)
                             Filter: (totp_secret IS NOT NULL)
                             Rows Removed by Filter: 192801
   InitPlan 2 (returns $3)
     ->  Aggregate  (cost=23305.60..23305.61 rows=1 width=8) (actual time=254.993..255.049 rows=1 loops=1)
           ->  HashAggregate  (cost=23214.45..23254.96 rows=4051 width=16) (actual time=252.920..254.354 rows=1580 loops=1)
                 Group Key: users_1.id
                 ->  Gather  (cost=1254.02..23204.33 rows=4051 width=16) (actual time=5.238..250.978 rows=2196 loops=1)
                       Workers Planned: 2
                       Workers Launched: 2
                       ->  Hash Join  (cost=254.02..21799.23 rows=1688 width=16) (actual time=4.958..246.682 rows=732 loops=3)
                             Hash Cond: (users_1.id = user_security_keys.user_id)
                             ->  Parallel Seq Scan on users users_1  (cost=0.00..17907.00 rows=241422 width=16) (actual time=0.008..149.148 rows=192801 loops=3)
                                   Filter: (totp_secret IS NULL)
                                   Rows Removed by Filter: 8634
                             ->  Hash  (cost=201.23..201.23 rows=4223 width=16) (actual time=4.525..4.526 rows=4210 loops=3)
                                   Buckets: 8192  Batches: 1  Memory Usage: 262kB
                                   ->  Seq Scan on user_security_keys  (cost=0.00..201.23 rows=4223 width=16) (actual time=0.011..2.233 rows=4210 loops=3)
 Planning Time: 0.291 ms
 Execution Time: 326.685 ms
(27 rows)

@dstufft
Member

dstufft commented Jun 29, 2022

With that last query, you can possibly speed it up with an index on totp_secret; since the filter is happening inside of the Parallel Seq Scan, that would probably turn it into an Index Scan.

However, maintaining indexes isn't free, and either 1862ms or 326ms in a background task every 5 minutes is more than fast enough.

@di
Member Author

di commented Jun 29, 2022

We ended up not doing most of this, and just optimizing the queries with joins instead.

di reopened this Jun 29, 2022
di closed this as completed Jun 29, 2022