
[leo_backend_db] The number mq-stats displays can be different from the number leo_backend_db actually stores #731

Closed
mocchira opened this issue May 10, 2017 · 4 comments

@mocchira
Member

Found through the investigation of #725.
This inconsistency can happen when leo_storage is restarted by heart, because https://github.com/leo-project/leo_backend_db/blob/1.2.12/src/leo_backend_db_server.erl#L484-L489 may not be called on an abnormal shutdown.

Solution

To make the reported number accurate, we need to count the items stored in leo_backend_db in the init callback.
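
For reference, a minimal Erlang sketch of that idea, assuming the LevelDB backend via eleveldb (the module and function below are hypothetical, not the actual leo_backend_db_server code):

```erlang
%% Hypothetical sketch: count the records already on disk when the
%% backend-db server starts, so mq-stats reports a real number even after
%% an abnormal shutdown in which terminate was never called.
-module(count_on_init_sketch).
-export([init/1]).

init(DBPath) ->
    {ok, Ref} = eleveldb:open(DBPath, [{create_if_missing, true}]),
    %% fold/4 walks every key/value pair on disk, so its cost grows with
    %% the amount of (possibly uncompacted) data, not only the live records.
    Count = eleveldb:fold(Ref,
                          fun({_Key, _Value}, Acc) -> Acc + 1 end,
                          0,
                          []),
    {ok, #{db_ref => Ref, count => Count}}.
```

The fold cost is also why a heavily fragmented (not yet compacted) LevelDB can make this startup step noticeably slower, as discussed further down in this thread.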

@vstax
Contributor

vstax commented May 24, 2017

This fix works for me. After starting the node that had the wrong number in its queue normally, the numbers were still wrong, so I killed the beam.smp process with SIGTERM, and upon the next restart the number was fixed (it went from 14880 to 0). The node took 5-7 seconds longer to start (before its status switched to "running"), though, and there was high CPU usage during startup plus some extra memory usage (+200 MB over typical usage after startup); I don't know if that's because of message counting or something else when starting after being killed this way.

@mocchira
Member Author

mocchira commented May 25, 2017

@vstax thanks for confirming.

The node took 5-7 seconds longer to start (before its status switched to "running"), though, and there was high CPU usage during startup plus some extra memory usage (+200 MB over typical usage after startup); I don't know if that's because of message counting or something else when starting after being killed this way.

Probably yes, though I'm a little worried because the number of records was 0, right? (As you mentioned, it went from 14880 to 0.)
However, there are cases where counting the number can take a lot of resources when the LevelDB instances used by leo_mq are very fragmented (not compacted). I think you may have hit this case.

Just in case, we will run various benchmarks that cover not only a massive number of active records but also a small number of records in a fragmented (not compacted) leo_mq.
links: leo-project/leo_backend_db#10 (comment)

@vstax
Contributor

vstax commented May 25, 2017

@mocchira I think it's 0 because, when restarting the node again and again on the older version (1.3.4), I got more and more messages from the queue processed after each restart (the experiment described at #725 (comment)). For storage_2, the restarts eventually stopped doing anything - the number of messages was 14880 at that point, and it didn't try to do anything upon restart. So I assume that number is fake and it's actually 0.

And after doing SIGTERM and another restart on the latest dev version, as soon as the node finished starting up and showed up as "running" on the manager, the number of messages was 0.

There is one thing that bothers me, though:

[root@leo-s2 work]# du -sm leo_storage/work/queue/*
1	leo_storage/work/queue/1
1	leo_storage/work/queue/2
1	leo_storage/work/queue/3
47	leo_storage/work/queue/4
1	leo_storage/work/queue/5
1	leo_storage/work/queue/6
1	leo_storage/work/queue/7
1	leo_storage/work/queue/8
1	leo_storage/work/queue/membership

[root@leo-s2 work]# du -sm leo_storage/work/queue/4/*
1	leo_storage/work/queue/4/message0
24	leo_storage/work/queue/4/message0_63653883922
1	leo_storage/work/queue/4/message1
21	leo_storage/work/queue/4/message1_63653883922
1	leo_storage/work/queue/4/message2
1	leo_storage/work/queue/4/message2_63653883922
1	leo_storage/work/queue/4/message3
1	leo_storage/work/queue/4/message3_63653883922
1	leo_storage/work/queue/4/message4
1	leo_storage/work/queue/4/message4_63653883922
1	leo_storage/work/queue/4/message4_leo_async_deletion_queue_message_4.state
1	leo_storage/work/queue/4/message5
1	leo_storage/work/queue/4/message5_63653883922
1	leo_storage/work/queue/4/message5_leo_async_deletion_queue_message_5.state
1	leo_storage/work/queue/4/message6
1	leo_storage/work/queue/4/message6_63653883922
1	leo_storage/work/queue/4/message6_leo_async_deletion_queue_message_6.state
1	leo_storage/work/queue/4/message7
1	leo_storage/work/queue/4/message7_63653883922
1	leo_storage/work/queue/4/message7_leo_async_deletion_queue_message_7.state

I changed mq.num_of_mq_procs to 4 before starting this node, in preparation for the stress load experiments per #734; it was 8 before. And there are .state files in only 4 subdirectories now (it's the same for each queue on each node). Could it be that I just did something wrong, and one shouldn't change this value on an existing system?
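
For context, the change was a one-line edit of the kind below in leo_storage.conf (the file location and surrounding settings depend on the installation; only the value itself comes from this thread):

```
## Number of message-queue processes (this node previously used 8)
mq.num_of_mq_procs = 4
```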

@mocchira
Member Author

@vstax

And after doing SIGTERM and another restart on the latest dev version, as soon as the node finished starting up and showed up as "running" on the manager, the number of messages was 0.

Got it.

I changed mq.num_of_mq_procs to 4 before starting this node, in preparation for the stress load experiments per #734; it was 8 before. And there are .state files in only 4 subdirectories now (it's the same for each queue on each node). Could it be that I just did something wrong, and one shouldn't change this value on an existing system?

It's expected.
The new code on the develop branch writes the state file only when leo_mq finished properly (a clean leo_storage stop), reads it into memory when restarting, and then removes it, so that a state file containing a stale value can never be read at the next restart.
That said, the state file now only exists while leo_storage is stopped, so the leo_mq_[0-3] instances don't each have a state file in your case.
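
Roughly, that lifecycle looks like the sketch below (hypothetical module and function names, not the actual leo_mq code; only the standard file and term functions are real):

```erlang
%% Sketch of the state-file lifecycle described above (hypothetical names).
-module(mq_state_file_sketch).
-export([on_stop/2, on_start/1]).

%% Written only on a clean stop, so an abnormal shutdown never leaves a
%% freshly written count behind.
on_stop(StateFile, Count) ->
    ok = file:write_file(StateFile, term_to_binary(Count)).

%% On start, read the saved count if the state file exists, then delete it
%% so a stale value cannot be picked up at a later restart.
on_start(StateFile) ->
    case file:read_file(StateFile) of
        {ok, Bin} ->
            Count = binary_to_term(Bin),
            ok = file:delete(StateFile),
            {ok, Count};
        {error, enoent} ->
            %% No state file (e.g. abnormal shutdown): the count has to be
            %% rebuilt by scanning the backend DB instead.
            recount
    end.
```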
