
services/horizon: fix long-running payment migration #5059

Closed
mollykarcher opened this issue Sep 19, 2023 · 2 comments · Fixed by #5056

@mollykarcher
Contributor

Running this migration takes exceedingly long on a full history db (>5 hours) in the version we released for testnet (2.27.0). We need to resolve this before pubnet release of protocol 20.

@tamirms
Contributor

tamirms commented Sep 21, 2023

The migration takes so long because it adds an index, history_operations_on_is_payment, on the history_operations table. But it turns out that the history_operations_on_is_payment index is never used:

horizon=> \d+ history_operations
                                                  Table "public.history_operations"
        Column        |         Type          | Collation | Nullable |        Default        | Storage  | Stats target | Description 
----------------------+-----------------------+-----------+----------+-----------------------+----------+--------------+-------------
 id                   | bigint                |           | not null |                       | plain    |              | 
 transaction_id       | bigint                |           | not null |                       | plain    |              | 
 application_order    | integer               |           | not null |                       | plain    |              | 
 type                 | integer               |           | not null |                       | plain    |              | 
 details              | jsonb                 |           |          |                       | extended |              | 
 source_account       | character varying(64) |           | not null | ''::character varying | extended |              | 
 source_account_muxed | character varying(69) |           |          |                       | extended |              | 
 is_payment           | boolean               |           |          |                       | plain    |              | 
Indexes:
    "index_history_operations_on_id" UNIQUE, btree (id)
    "index_history_operations_on_is_payment" btree (is_payment)
    "index_history_operations_on_transaction_id" btree (transaction_id)
    "index_history_operations_on_type" btree (type)
Check constraints:
    "valid_application_order" CHECK (application_order >= 0) NOT VALID
Access method: heap
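For reference, the migration that adds this index presumably does something along these lines (a sketch only; the exact migration SQL in the Horizon repo may differ):

```sql
-- Sketch of the kind of migration that produces the schema above.
-- A plain CREATE INDEX must scan every row of history_operations,
-- which is why the migration runs for hours on a full-history db.
ALTER TABLE history_operations ADD COLUMN is_payment boolean;
CREATE INDEX "index_history_operations_on_is_payment"
    ON history_operations USING btree (is_payment);
```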


EXPLAIN ANALYZE SELECT
   hop.id, hop.transaction_id, hop.application_order, hop.type, hop.details, hop.source_account,
   hop.source_account_muxed, COALESCE(hop.is_payment, false) as is_payment, ht.transaction_hash,
   ht.tx_result, COALESCE(ht.successful, true) as transaction_successful
FROM history_operations hop
LEFT JOIN history_transactions ht ON ht.id = hop.transaction_id
WHERE (hop.type IN (0,1,2,13,8) OR hop.is_payment = true)
   AND hop.id > 0 AND (ht.successful = true OR ht.successful IS NULL)
ORDER BY hop.id asc 
LIMIT 10

                                                                                       QUERY PLAN                                                                                        
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.29..48.52 rows=10 width=840) (actual time=0.749..19.029 rows=10 loops=1)
   ->  Nested Loop Left Join  (cost=1.29..22700733724.05 rows=4807019988 width=840) (actual time=0.749..19.027 rows=10 loops=1)
         Filter: (ht.successful OR (ht.successful IS NULL))
         Rows Removed by Filter: 3
         ->  Index Scan using index_history_operations_on_id on history_operations hop  (cost=0.71..1169458652.26 rows=8274175421 width=620) (actual time=0.741..15.294 rows=13 loops=1)
               Index Cond: (id > 0)
               Filter: ((type = ANY ('{0,1,2,13,8}'::integer[])) OR is_payment)
               Rows Removed by Filter: 17
         ->  Index Scan using hs_transaction_by_id on history_transactions ht  (cost=0.58..2.59 rows=1 width=228) (actual time=0.286..0.286 rows=1 loops=13)
               Index Cond: (id = hop.transaction_id)
 Planning Time: 62.566 ms
 Execution Time: 19.079 ms
(12 rows)

The index is never used because the query to select payments contains an ORDER BY hop.id asc clause, but the history_operations_on_is_payment index does not contain the id column. Moreover, there is a pre-existing index, history_operations_on_type, which exists to support the WHERE hop.type IN (0,1,2,13,8) clause of the payments query. That index is also never used, for the same reason: it does not contain the id column and the query sorts by id.

After running select schemaname || '.' || indexrelname as index, idx_scan from pg_stat_user_indexes; against the production read replica db, I was able to confirm that the history_operations_on_type index has never been scanned. We have been maintaining the history_operations_on_type index for years and it turns out that it's completely useless!
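A slightly extended version of that query (a sketch, not from the original comment) also surfaces how much disk each never-scanned index occupies, which helps prioritize which unused indexes to drop:

```sql
-- List user indexes that have never been scanned since the last stats
-- reset, largest first. pg_relation_size reports the index's on-disk size.
SELECT schemaname || '.' || indexrelname AS index,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;
```

Note that idx_scan counts are cumulative since the last statistics reset, so a zero count is only meaningful if the stats cover a representative window of traffic.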

The payments queries have been working for us because the query scans through the rows of the history_operations table in sorted order of id, and the history_operations_on_id index allows us to traverse the rows in sorted order efficiently. Also, all payments queries have a cursor, so we never have to do a full table scan; instead we search for rows from a specific starting point.
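Concretely, the cursor just tightens the lower bound on id in the query shown earlier (a sketch using a hypothetical cursor value of 12345):

```sql
-- Same shape as the payments query above, but starting from a
-- client-supplied paging cursor instead of 0. The planner walks
-- index_history_operations_on_id from id = 12345 onward, filtering
-- rows until the LIMIT is satisfied.
SELECT hop.id, hop.type, COALESCE(hop.is_payment, false) AS is_payment
FROM history_operations hop
WHERE (hop.type IN (0,1,2,13,8) OR hop.is_payment = true)
  AND hop.id > 12345
ORDER BY hop.id ASC
LIMIT 10;
```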

Finally, the most important reason why this query is still efficient is that the distribution of operations happens to be dense with payments. The query scans through history_operations row by row until it finds enough payment rows to fill the page limit (at most 200). It is theoretically possible that we could have many ledgers without a single payment operation, but in practice that seems to never be the case. However, there are queries on the payments for account endpoint which time out because some accounts have many transactions containing only a few payments (see #1808 for more details).

In conclusion, we should remove the history_operations_on_is_payment index and also the pre-existing history_operations_on_type index. If the distribution of operation types in the Stellar network changes, the payments endpoint is vulnerable to timing out. To address that concern we would need an index on both the is_payment and id columns, or alternatively, an index on both the type and id columns. However, adding that index would take a very long time on a full history db.
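For completeness, the hypothetical composite indexes described above would look something like this (a sketch only; neither was actually added):

```sql
-- Either composite index would let the planner satisfy both the payment
-- filter and the ORDER BY hop.id from the same index. CONCURRENTLY avoids
-- blocking writes during the build, but the build would still take hours
-- on a full-history db, which is exactly the problem this issue is about.
CREATE INDEX CONCURRENTLY index_history_operations_on_is_payment_and_id
    ON history_operations (is_payment, id);
-- or
CREATE INDEX CONCURRENTLY index_history_operations_on_type_and_id
    ON history_operations (type, id);
```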

@tamirms

tamirms commented Oct 3, 2023

I'm closing this issue because we have removed the index which was causing the long-running migration. I created a separate issue (#5072) to track the concern about how the distribution of stellar operations affects the payments endpoint.

@tamirms tamirms closed this as completed Oct 3, 2023
@github-project-automation github-project-automation bot moved this from In Progress to Done in Platform Scrum Oct 3, 2023