
Large task_processor_task table #4271

Closed
1 of 4 tasks
mvrhov opened this issue Jul 2, 2024 · 12 comments
Assignees
Labels
bug Something isn't working

Comments

@mvrhov

mvrhov commented Jul 2, 2024

How are you running Flagsmith

  • Self Hosted with Docker
  • Self Hosted with Kubernetes
  • SaaS at flagsmith.com
  • Some other way (add details in description below)

Describe the bug

Our task_processor_task table has ~56 million records. It seems that the task processor is not keeping up, and the query SELECT * FROM get_tasks_to_process(10) takes ~70s to complete. This appears to saturate the connections to PgBouncer, and in turn to PostgreSQL, so Flagsmith itself is unable to operate efficiently.

If it helps, we run clients in local evaluation mode and use segments to be able to turn some features on, as this is the only option available in local evaluation mode.

Steps To Reproduce

  1. Why do I have to fill this in if I have no information?

Expected behavior

The task processor is able to clear the table and does not block other operations.

Screenshots

No response

@mvrhov mvrhov added the bug Something isn't working label Jul 2, 2024
@mvrhov
Author

mvrhov commented Jul 2, 2024

For now I have disabled the task processor so things can operate "normally".

@gagantrivedi
Member

For now I have disabled the task processor so things can operate "normally".

Can you confirm the value of the following env vars:

ENABLE_CLEAN_UP_OLD_TASKS
TASK_DELETE_RETENTION_DAYS

@gagantrivedi
Member

The second step would be to check if the clean_up_old_tasks task is locked or not. You should be able to do that either by accessing the database directly or through Django admin.
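
For reference, checking this directly in the database could look roughly like the following. The table and column names (task_processor_recurringtask, task_identifier, is_locked) are assumptions based on Django's default naming for the task processor models, so verify them against your own schema.

-- Assumed table/column names (Django default naming); verify against your schema.
SELECT task_identifier, is_locked
FROM task_processor_recurringtask
WHERE task_identifier = 'tasks.clean_up_old_tasks';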

@mvrhov
Author

mvrhov commented Jul 2, 2024

I haven't set those, so they must be at their default values. They are not set in the default docker-compose.yaml either.

I cannot confirm whether this was happening before, but I noticed this behavior today immediately after I upgraded from 1.109 to 1.126. Edit: that's not true; looking at the PostgreSQL logs, this has been going on for a while.

@mvrhov
Author

mvrhov commented Jul 2, 2024

Two recurring tasks are locked: tasks.clean_up_old_tasks and tasks.populate_bucket.
Can I remove all tasks from task_processor_task that have the completed column set to true?

SELECT completed, COUNT(*) FROM task_processor_task GROUP BY completed;
 completed |  count   
-----------+----------
 f         |  7072002
 t         | 48600095

@gagantrivedi
Member

Can I remove all tasks from task_processor_task that have the completed column set to true?

Yeah, you can do that, but be aware of the performance impact of deleting such a large amount of data.

For the future, you should unlock tasks.clean_up_old_tasks to make sure the task processor cleans up old tasks on its own.
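
A minimal sketch of deleting the completed rows in batches to limit lock time and I/O pressure; the batch size here is an arbitrary illustration, not a value recommended in this thread.

-- Delete completed tasks in batches; re-run until it reports 0 rows deleted.
DELETE FROM task_processor_task
WHERE id IN (
    SELECT id
    FROM task_processor_task
    WHERE completed = TRUE
    LIMIT 10000
);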

@mvrhov
Author

mvrhov commented Jul 2, 2024

I've cleaned things up, but the remaining ~7 million rows still pose a problem. SELECT times for get_tasks_to_process have not gone down, and the processor keeps printing task_processor.thread_monitoring WARNING Writing unhealthy threads: ['Thread-1', 'Thread-2', 'Thread-3', 'Thread-4', 'Thread-5'] to the log. Is there anything else that can be configured so the task processor doesn't bail out (or whatever it is doing) and stop processing rows, apart from setting --graceperiodms to 90000?

@gagantrivedi
Member

I've cleaned things up, but the remaining ~7 million rows still pose a problem. SELECT times for get_tasks_to_process have not gone down, and the processor keeps printing task_processor.thread_monitoring WARNING Writing unhealthy threads: ['Thread-1', 'Thread-2', 'Thread-3', 'Thread-4', 'Thread-5'] to the log. Is there anything else that can be configured so the task processor doesn't bail out (or whatever it is doing) and stop processing rows?

I think you'd have to run ANALYZE on that table to make sure the database is aware of the changes.
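
For reference, a minimal sketch of that step; the VACUUM is an assumed follow-up to reclaim the dead tuples left by the bulk delete, not something required by the comment above.

-- Refresh planner statistics for the table:
ANALYZE task_processor_task;
-- Optionally also reclaim dead tuples left behind by the bulk delete:
VACUUM (ANALYZE) task_processor_task;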

@mvrhov
Author

mvrhov commented Jul 2, 2024

This didn't help.

The problem is in the ORDER BY priority ASC. If I remove priority from the ORDER BY in the procedure, the query is fast.

After rewriting get_tasks_to_process to use a WITH clause, things seem a bit faster.

CREATE OR REPLACE FUNCTION get_tasks_to_process(num_tasks integer)
RETURNS SETOF task_processor_task AS $$
BEGIN
    RETURN QUERY
    -- Select the tasks that need to be processed
    WITH row_to_return AS (
        SELECT *
        FROM task_processor_task
        WHERE num_failures < 3 AND scheduled_for < NOW() AND completed = FALSE AND is_locked = FALSE
        ORDER BY priority ASC, scheduled_for ASC, created_at ASC
        LIMIT num_tasks
        -- Select for update to ensure that no other workers can select these tasks while in this transaction block
        FOR UPDATE SKIP LOCKED
    )
    -- Lock every selected task(by updating `is_locked` to true)
    UPDATE task_processor_task
    -- Lock this row by setting is_locked True, so that no other workers can select these tasks after this
    -- transaction is complete (but the tasks are still being executed by the current worker)
    SET is_locked = TRUE
    FROM row_to_return
    WHERE task_processor_task.id = row_to_return.id
    RETURNING task_processor_task.*;
END;
$$ LANGUAGE plpgsql;

Adding priority to the index doesn't convince Postgres to use it: CREATE INDEX CONCURRENTLY "incomplete_tasks_idx" ON "task_processor_task" ("priority", "num_failures", "scheduled_for") WHERE NOT "completed"; even though it does use the index when I remove priority from the ORDER BY.

@mvrhov
Author

mvrhov commented Jul 3, 2024

This index works...
CREATE INDEX CONCURRENTLY "incomplete_tasks_idx" ON "task_processor_task" ("priority", "scheduled_for", "created_at") WHERE NOT "completed";

EXPLAIN SELECT *
        FROM task_processor_task
        WHERE num_failures < 3 AND scheduled_for < NOW() AND completed = FALSE AND is_locked = FALSE
        ORDER BY priority ASC, scheduled_for ASC, created_at ASC LIMIT 10;
                                                    QUERY PLAN                                                     
-------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.43..1.50 rows=10 width=150)
   ->  Index Scan using incomplete_tasks_idx on task_processor_task  (cost=0.43..720112.75 rows=6775044 width=150)
         Index Cond: ((scheduled_for < now()) AND (num_failures < 3))
         Filter: (NOT is_locked)

The default index produces the following plan:

EXPLAIN SELECT *
        FROM task_processor_task
        WHERE num_failures < 3 AND scheduled_for < NOW() AND completed = FALSE AND is_locked = FALSE
        ORDER BY priority ASC, scheduled_for ASC, created_at ASC LIMIT 10;
                                                      QUERY PLAN                                                      
----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=293373.36..293374.52 rows=10 width=150)
   ->  Gather Merge  (cost=293373.36..954084.15 rows=5662840 width=150)
         Workers Planned: 2
         ->  Sort  (cost=292373.33..299451.88 rows=2831420 width=150)
               Sort Key: priority, scheduled_for, created_at
               ->  Parallel Seq Scan on task_processor_task  (cost=0.00..231187.36 rows=2831420 width=150)
                     Filter: ((NOT completed) AND (NOT is_locked) AND (num_failures < 3) AND (scheduled_for < now()))
 JIT:
   Functions: 3
   Options: Inlining false, Optimization false, Expressions true, Deforming true

@gagantrivedi gagantrivedi self-assigned this Aug 14, 2024
@gagantrivedi
Member

This index works... CREATE INDEX CONCURRENTLY "incomplete_tasks_idx" ON "task_processor_task" ("priority", "scheduled_for", "created_at") WHERE NOT "completed";


What is the value of random_page_cost for your database? The planner usually avoids the index in favor of a sequential scan if that value is high.
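
For reference, a minimal sketch of inspecting and adjusting it; 1.1 is simply a value commonly used for SSD-backed storage (and the one mentioned in the next comment), not a recommendation specific to this setup.

-- Show the current planner cost estimate for random page access:
SHOW random_page_cost;
-- Example of lowering it cluster-wide and reloading the configuration:
ALTER SYSTEM SET random_page_cost = 1.1;
SELECT pg_reload_conf();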

@gagantrivedi
Member

gagantrivedi commented Aug 15, 2024

There are a lot of factors that determine the query plan generated by Postgres. For example, our database generates the following plan with the default index (with random_page_cost set to 1.1):

-----------------------------------------------------------------------------------------------------------------
 Sort  (cost=34728.50..34990.19 rows=104676 width=258)
   Sort Key: priority, scheduled_for, created_at
   ->  Index Scan using incomplete_tasks_idx on task_processor_task  (cost=0.25..18454.79 rows=104676 width=258)
         Index Cond: (scheduled_for < now())
         Filter: (NOT is_locked)

I am going to close this issue now because we can't really change the index.
