
services/horizon: fix long-running payment migration #5059

Closed
mollykarcher opened this issue Sep 19, 2023 · 2 comments · Fixed by #5056

@mollykarcher
Contributor

Running this migration takes exceedingly long on a full history db (>5 hours) in the version we released for testnet (2.27.0). We need to resolve this before pubnet release of protocol 20.

@tamirms
Contributor

tamirms commented Sep 21, 2023

The migration takes so long because it adds an index, history_operations_on_is_payment, on the history_operations table. But it turns out that the history_operations_on_is_payment index is never used:

horizon=> \d+ history_operations
                                                  Table "public.history_operations"
        Column        |         Type          | Collation | Nullable |        Default        | Storage  | Stats target | Description 
----------------------+-----------------------+-----------+----------+-----------------------+----------+--------------+-------------
 id                   | bigint                |           | not null |                       | plain    |              | 
 transaction_id       | bigint                |           | not null |                       | plain    |              | 
 application_order    | integer               |           | not null |                       | plain    |              | 
 type                 | integer               |           | not null |                       | plain    |              | 
 details              | jsonb                 |           |          |                       | extended |              | 
 source_account       | character varying(64) |           | not null | ''::character varying | extended |              | 
 source_account_muxed | character varying(69) |           |          |                       | extended |              | 
 is_payment           | boolean               |           |          |                       | plain    |              | 
Indexes:
    "index_history_operations_on_id" UNIQUE, btree (id)
    "index_history_operations_on_is_payment" btree (is_payment)
    "index_history_operations_on_transaction_id" btree (transaction_id)
    "index_history_operations_on_type" btree (type)
Check constraints:
    "valid_application_order" CHECK (application_order >= 0) NOT VALID
Access method: heap
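For reference, the migration that adds this index presumably does something along these lines (a sketch only; the exact migration SQL in the Horizon repo may differ):

```sql
-- Sketch of the kind of migration that produces the schema above.
-- A plain CREATE INDEX must scan every row of history_operations,
-- which is why the migration runs for hours on a full-history db.
ALTER TABLE history_operations ADD COLUMN is_payment boolean;
CREATE INDEX "index_history_operations_on_is_payment"
    ON history_operations USING btree (is_payment);
```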


EXPLAIN ANALYZE SELECT
   hop.id, hop.transaction_id, hop.application_order, hop.type, hop.details, hop.source_account,
   hop.source_account_muxed, COALESCE(hop.is_payment, false) as is_payment, ht.transaction_hash,
   ht.tx_result, COALESCE(ht.successful, true) as transaction_successful
FROM history_operations hop
LEFT JOIN history_transactions ht ON ht.id = hop.transaction_id
WHERE (hop.type IN (0,1,2,13,8) OR hop.is_payment = true)
   AND hop.id > 0 AND (ht.successful = true OR ht.successful IS NULL)
ORDER BY hop.id asc 
LIMIT 10

                                                                                       QUERY PLAN                                                                                        
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.29..48.52 rows=10 width=840) (actual time=0.749..19.029 rows=10 loops=1)
   ->  Nested Loop Left Join  (cost=1.29..22700733724.05 rows=4807019988 width=840) (actual time=0.749..19.027 rows=10 loops=1)
         Filter: (ht.successful OR (ht.successful IS NULL))
         Rows Removed by Filter: 3
         ->  Index Scan using index_history_operations_on_id on history_operations hop  (cost=0.71..1169458652.26 rows=8274175421 width=620) (actual time=0.741..15.294 rows=13 loops=1)
               Index Cond: (id > 0)
               Filter: ((type = ANY ('{0,1,2,13,8}'::integer[])) OR is_payment)
               Rows Removed by Filter: 17
         ->  Index Scan using hs_transaction_by_id on history_transactions ht  (cost=0.58..2.59 rows=1 width=228) (actual time=0.286..0.286 rows=1 loops=13)
               Index Cond: (id = hop.transaction_id)
 Planning Time: 62.566 ms
 Execution Time: 19.079 ms
(12 rows)

The index is never used because the query to select payments contains an ORDER BY hop.id asc clause, but the history_operations_on_is_payment index does not contain the id column. Moreover, there is a pre-existing index, history_operations_on_type, which exists to support the WHERE hop.type IN (0,1,2,13,8) clause of the payments query. That index is also never used, for the same reason: it does not contain the id column and the query sorts by id.

After running select schemaname || '.' || indexrelname as index, idx_scan from pg_stat_user_indexes; against the production read replica db, I was able to confirm that the history_operations_on_type index has never been scanned. We have been maintaining the history_operations_on_type index for years and it turns out that it's completely useless!
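A slightly extended version of that query (a sketch, not from the original comment) also surfaces how much disk each never-scanned index occupies, which helps prioritize which unused indexes to drop:

```sql
-- List user indexes that have never been scanned since the last stats
-- reset, largest first. pg_relation_size reports the index's on-disk size.
SELECT schemaname || '.' || indexrelname AS index,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

Note that idx_scan counts are cumulative since the last statistics reset, so a zero count is only meaningful if the stats cover a representative window of traffic.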

The payments queries have been working for us because the query scans through the rows of the history_operations table in sorted order of id, and the history_operations_on_id index allows us to traverse the rows in sorted order efficiently. Also, all payments queries have a cursor, so we never have to do a full table scan; instead we search for rows from a specific starting point.
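Concretely, the cursor just tightens the lower bound on id in the query shown earlier (a sketch using a hypothetical cursor value of 12345):

```sql
-- Same shape as the payments query above, but starting from a
-- client-supplied paging cursor instead of 0. The planner walks
-- index_history_operations_on_id from id = 12345 onward, filtering
-- rows until the LIMIT is satisfied.
SELECT hop.id, hop.type, COALESCE(hop.is_payment, false) AS is_payment
FROM history_operations hop
WHERE (hop.type IN (0,1,2,13,8) OR hop.is_payment = true)
  AND hop.id > 12345
ORDER BY hop.id ASC
LIMIT 10;
```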

Finally, the most important reason why this query is still efficient is that the distribution of operations happens to be dense with payments. The query scans through history_operations row by row until it finds enough payment rows to fill the page limit (at most 200). It is theoretically possible that we could have many ledgers without a single payment operation, but in practice that seems to never be the case. However, there are queries on the payments for account endpoint which time out because some accounts have many transactions containing only a few payments (see #1808 for more details).

In conclusion, we should remove the history_operations_on_is_payment index and also the pre-existing history_operations_on_type index. If the distribution of operation types in the Stellar network changes, the payments endpoint is vulnerable to timing out. To address that concern we would need an index on both the is_payment and id columns, or alternatively, an index on both the type and id columns. However, adding that index would take a very long time on a full history db.
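For completeness, the hypothetical composite indexes described above would look something like this (a sketch only; neither was actually added):

```sql
-- Either composite index would let the planner satisfy both the payment
-- filter and the ORDER BY hop.id from the same index. CONCURRENTLY avoids
-- blocking writes during the build, but the build would still take hours
-- on a full-history db, which is exactly the problem this issue is about.
CREATE INDEX CONCURRENTLY index_history_operations_on_is_payment_and_id
    ON history_operations (is_payment, id);
-- or
CREATE INDEX CONCURRENTLY index_history_operations_on_type_and_id
    ON history_operations (type, id);
```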

@tamirms

tamirms commented Oct 3, 2023

I'm closing this issue because we have removed the index which was causing the long-running migration. I created a separate issue (#5072) to track the concern about how the distribution of stellar operations affects the payments endpoint.

@tamirms tamirms closed this as completed Oct 3, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Platform Scrum Oct 3, 2023