
Large task_processor_task table #4271

Closed
1 of 4 tasks
mvrhov opened this issue Jul 2, 2024 · 12 comments
Assignees
Labels
bug Something isn't working

Comments

@mvrhov

mvrhov commented Jul 2, 2024

How are you running Flagsmith

  • Self Hosted with Docker
  • Self Hosted with Kubernetes
  • SaaS at flagsmith.com
  • Some other way (add details in description below)

Describe the bug

Our task_processor_task table has ~56 million records. It seems that the task processor is not keeping up, and the query SELECT * FROM get_tasks_to_process(10) takes ~70s to complete. This appears to saturate the connections to PgBouncer, and in turn to PostgreSQL, so Flagsmith itself is unable to operate efficiently.

If it helps, we run clients in local evaluation mode and use segments to be able to turn some features on, as this is the only option available in local evaluation mode.

Steps To Reproduce

  1. Why do I have to fill this in if I have no information?

Expected behavior

The task processor is able to clear the table and does not block other operations.

Screenshots

No response

@mvrhov mvrhov added the bug Something isn't working label Jul 2, 2024
@mvrhov
Author

mvrhov commented Jul 2, 2024

For now I have disabled the task processor so things can operate "normally".

@gagantrivedi
Member

For now I have disabled the task processor so things can operate "normally".

Can you confirm the value of the following env vars:

ENABLE_CLEAN_UP_OLD_TASKS
TASK_DELETE_RETENTION_DAYS

@gagantrivedi
Member

The second step would be to check if the clean_up_old_tasks task is locked or not. You should be able to do that either by accessing the database directly or through Django admin.
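
For reference, checking this directly in the database could look roughly like the following. The table and column names (task_processor_recurringtask, task_identifier, is_locked) are assumptions based on Django's default naming for the task processor models, so verify them against your own schema.

-- Assumed table/column names (Django default naming); verify against your schema.
SELECT task_identifier, is_locked
FROM task_processor_recurringtask
WHERE task_identifier = 'tasks.clean_up_old_tasks';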

@mvrhov
Author

mvrhov commented Jul 2, 2024

I haven't set those, so they must be at their default values. They are not set in the default docker-compose.yaml either.

I cannot confirm whether this was happening before, but I noticed this behavior today immediately after I upgraded from 1.109 to 1.126. Edit: that's not true; looking at the PostgreSQL logs, this has been going on for a while.

@mvrhov
Author

mvrhov commented Jul 2, 2024

Two recurring tasks are locked: tasks.clean_up_old_tasks and tasks.populate_bucket.
Can I remove all tasks from task_processor_task that have the completed column set to true?

SELECT completed, COUNT(*) FROM task_processor_task GROUP BY completed;
 completed |  count   
-----------+----------
 f         |  7072002
 t         | 48600095

@gagantrivedi
Member

Can I remove all tasks from task_processor_task that have the completed column set to true?

Yeah, you can do that, but be aware of the performance impact of deleting such a large amount of data.

For the future, you should unlock tasks.clean_up_old_tasks to make sure the task processor cleans up old tasks on its own.
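
A minimal sketch of deleting the completed rows in batches to limit lock time and I/O pressure; the batch size here is an arbitrary illustration, not a value recommended in this thread.

-- Delete completed tasks in batches; re-run until it reports 0 rows deleted.
DELETE FROM task_processor_task
WHERE id IN (
    SELECT id
    FROM task_processor_task
    WHERE completed = TRUE
    LIMIT 10000
);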

@mvrhov
Author

mvrhov commented Jul 2, 2024

I've cleaned things up, but the remaining ~7 million rows still pose a problem. SELECT times for get_tasks_to_process have not gone down, and the processor keeps printing task_processor.thread_monitoring WARNING Writing unhealthy threads: ['Thread-1', 'Thread-2', 'Thread-3', 'Thread-4', 'Thread-5'] to the log. Is there anything else that can be configured so the task processor doesn't bail out (or whatever it is doing) and stop processing rows, apart from setting --graceperiodms to 90000?

@gagantrivedi
Member

I've cleaned things up, but the remaining ~7 million rows still pose a problem. SELECT times for get_tasks_to_process have not gone down, and the processor keeps printing task_processor.thread_monitoring WARNING Writing unhealthy threads: ['Thread-1', 'Thread-2', 'Thread-3', 'Thread-4', 'Thread-5'] to the log. Is there anything else that can be configured so the task processor doesn't bail out (or whatever it is doing) and stop processing rows?

I think you'd have to run ANALYZE on that table to make sure the database is aware of the changes.
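
For reference, a minimal sketch of that step; the VACUUM is an assumed follow-up to reclaim the dead tuples left by the bulk delete, not something required by the comment above.

-- Refresh planner statistics for the table:
ANALYZE task_processor_task;
-- Optionally also reclaim dead tuples left behind by the bulk delete:
VACUUM (ANALYZE) task_processor_task;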

@mvrhov
Author

mvrhov commented Jul 2, 2024

This didn't help.

The problem is in the ORDER BY priority ASC. If I remove priority from the ORDER BY in the procedure, the query is fast.

After rewriting get_tasks_to_process to use a WITH clause, things seem a bit faster.

CREATE OR REPLACE FUNCTION get_tasks_to_process(num_tasks integer)
RETURNS SETOF task_processor_task AS $$
BEGIN
    RETURN QUERY
    -- Select the tasks that need to be processed
    WITH row_to_return AS (
        SELECT *
        FROM task_processor_task
        WHERE num_failures < 3 AND scheduled_for < NOW() AND completed = FALSE AND is_locked = FALSE
        ORDER BY priority ASC, scheduled_for ASC, created_at ASC
        LIMIT num_tasks
        -- Select for update to ensure that no other workers can select these tasks while in this transaction block
        FOR UPDATE SKIP LOCKED
    )
    -- Lock every selected task(by updating `is_locked` to true)
    UPDATE task_processor_task
    -- Lock this row by setting is_locked True, so that no other workers can select these tasks after this
    -- transaction is complete (but the tasks are still being executed by the current worker)
    SET is_locked = TRUE
    FROM row_to_return
    WHERE task_processor_task.id = row_to_return.id
    RETURNING task_processor_task.*;
END;
$$ LANGUAGE plpgsql;

Adding priority to the index doesn't convince Postgres to use it: CREATE INDEX CONCURRENTLY "incomplete_tasks_idx" ON "task_processor_task" ("priority", "num_failures", "scheduled_for") WHERE NOT "completed"; even though it does use the index when I remove priority from the ORDER BY.

@mvrhov
Author

mvrhov commented Jul 3, 2024

This index works...
CREATE INDEX CONCURRENTLY "incomplete_tasks_idx" ON "task_processor_task" ("priority", "scheduled_for", "created_at") WHERE NOT "completed";

EXPLAIN SELECT *
        FROM task_processor_task
        WHERE num_failures < 3 AND scheduled_for < NOW() AND completed = FALSE AND is_locked = FALSE
        ORDER BY priority ASC, scheduled_for ASC, created_at ASC LIMIT 10;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..1.50 rows=10 width=150)
   ->  Index Scan using incomplete_tasks_idx on task_processor_task  (cost=0.43..720112.75 rows=6775044 width=150)
         Index Cond: ((scheduled_for < now()) AND (num_failures < 3))
         Filter: (NOT is_locked)

The default index produces the following plan:

EXPLAIN SELECT *
        FROM task_processor_task
        WHERE num_failures < 3 AND scheduled_for < NOW() AND completed = FALSE AND is_locked = FALSE
        ORDER BY priority ASC, scheduled_for ASC, created_at ASC LIMIT 10;
                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=293373.36..293374.52 rows=10 width=150)
   ->  Gather Merge  (cost=293373.36..954084.15 rows=5662840 width=150)
         Workers Planned: 2
         ->  Sort  (cost=292373.33..299451.88 rows=2831420 width=150)
               Sort Key: priority, scheduled_for, created_at
               ->  Parallel Seq Scan on task_processor_task  (cost=0.00..231187.36 rows=2831420 width=150)
                     Filter: ((NOT completed) AND (NOT is_locked) AND (num_failures < 3) AND (scheduled_for < now()))
 JIT:
   Functions: 3
   Options: Inlining false, Optimization false, Expressions true, Deforming true

@gagantrivedi gagantrivedi self-assigned this Aug 14, 2024
@gagantrivedi
Member

This index works... CREATE INDEX CONCURRENTLY "incomplete_tasks_idx" ON "task_processor_task" ("priority", "scheduled_for", "created_at") WHERE NOT "completed";


What is the value of random_page_cost for your database? The planner usually avoids the index in favor of a sequential scan if that value is high.
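
For reference, a minimal sketch of inspecting and adjusting it; 1.1 is simply a value commonly used for SSD-backed storage (and the one mentioned in the next comment), not a recommendation specific to this setup.

-- Show the current planner cost estimate for random page access:
SHOW random_page_cost;
-- Example of lowering it cluster-wide and reloading the configuration:
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();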

@gagantrivedi
Member

gagantrivedi commented Aug 15, 2024

There are a lot of factors that determine the query plan generated by Postgres. For example, our database generates the following plan with the default index (with random_page_cost set to 1.1):

-----------------------------------------------------------------------------------------------------------------
 Sort  (cost=34728.50..34990.19 rows=104676 width=258)
   Sort Key: priority, scheduled_for, created_at
   ->  Index Scan using incomplete_tasks_idx on task_processor_task  (cost=0.25..18454.79 rows=104676 width=258)
         Index Cond: (scheduled_for < now())
         Filter: (NOT is_locked)

I am going to close this issue now because we can't really change the index.
