
[leo_backend_db] The number mq-stats displays can be different from the number leo_backend_db actually stores #731

Closed
mocchira opened this issue May 10, 2017 · 4 comments

@mocchira
Member

Found through the investigation of #725.
This inconsistency can happen when leo_storage is restarted by heart, because https://github.com/leo-project/leo_backend_db/blob/1.2.12/src/leo_backend_db_server.erl#L484-L489 may not be called on an abnormal shutdown.

Solution

To make the reported number accurate, we need to count the items stored in leo_backend_db in the init callback.
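
For reference, a minimal Erlang sketch of that idea, assuming the LevelDB backend via eleveldb (the module and function below are hypothetical, not the actual leo_backend_db_server code):

```erlang
%% Hypothetical sketch: count the records already on disk when the
%% backend-db server starts, so mq-stats reports a real number even after
%% an abnormal shutdown in which terminate was never called.
-module(count_on_init_sketch).
-export([init/1]).

init(DBPath) ->
    {ok, Ref} = eleveldb:open(DBPath, [{create_if_missing, true}]),
    %% fold/4 walks every key/value pair on disk, so its cost grows with
    %% the amount of (possibly uncompacted) data, not only the live records.
    Count = eleveldb:fold(Ref,
                          fun({_Key, _Value}, Acc) -> Acc + 1 end,
                          0,
                          []),
    {ok, #{db_ref => Ref, count => Count}}.
```

The fold cost is also why a heavily fragmented (not yet compacted) LevelDB can make this startup step noticeably slower, as discussed further down in this thread.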

@vstax
Contributor

vstax commented May 24, 2017

This fix works for me. After starting the node that had the wrong number in its queue normally, the numbers were still wrong, so I killed the beam.smp process with SIGTERM, and upon the next restart the number was fixed (it went from 14880 to 0). The node took 5-7 seconds longer to start (before its status switched to "running"), though, and there was high CPU usage during startup plus some extra memory usage (+200 MB over typical usage after startup); I don't know if that's because of message counting or something else when starting after being killed this way.

@mocchira
Member Author

mocchira commented May 25, 2017

@vstax thanks for confirming.

The node took 5-7 seconds longer to start (before its status switched to "running"), though, and there was high CPU usage during startup plus some extra memory usage (+200 MB over typical usage after startup); I don't know if that's because of message counting or something else when starting after being killed this way.

Probably yes, though I'm a little worried because the number of records was 0, right? (As you mentioned, it went from 14880 to 0.)
However, there are cases where counting the number can take a lot of resources when the LevelDB instances used by leo_mq are very fragmented (not compacted). I think you may have hit this case.

Just in case, we will run various benchmarks that cover not only a massive number of active records but also a small number of records in a fragmented (not compacted) leo_mq.
links: leo-project/leo_backend_db#10 (comment)

@vstax
Contributor

vstax commented May 25, 2017

@mocchira I think it's 0 because, when restarting the node again and again on the older version (1.3.4), I got more and more messages from the queue processed after each restart (the experiment described at #725 (comment)). For storage_2, the restarts eventually stopped doing anything - the number of messages was 14880 at that point, and it didn't try to do anything upon restart. So I assume that number is fake and it's actually 0.

And after doing SIGTERM and another restart on the latest dev version, as soon as the node finished starting up and showed up as "running" on the manager, the number of messages was 0.

There is one thing that bothers me, though:

[root@leo-s2 work]# du -sm leo_storage/work/queue/*
1	leo_storage/work/queue/1
1	leo_storage/work/queue/2
1	leo_storage/work/queue/3
47	leo_storage/work/queue/4
1	leo_storage/work/queue/5
1	leo_storage/work/queue/6
1	leo_storage/work/queue/7
1	leo_storage/work/queue/8
1	leo_storage/work/queue/membership

[root@leo-s2 work]# du -sm leo_storage/work/queue/4/*
1	leo_storage/work/queue/4/message0
24	leo_storage/work/queue/4/message0_63653883922
1	leo_storage/work/queue/4/message1
21	leo_storage/work/queue/4/message1_63653883922
1	leo_storage/work/queue/4/message2
1	leo_storage/work/queue/4/message2_63653883922
1	leo_storage/work/queue/4/message3
1	leo_storage/work/queue/4/message3_63653883922
1	leo_storage/work/queue/4/message4
1	leo_storage/work/queue/4/message4_63653883922
1	leo_storage/work/queue/4/message4_leo_async_deletion_queue_message_4.state
1	leo_storage/work/queue/4/message5
1	leo_storage/work/queue/4/message5_63653883922
1	leo_storage/work/queue/4/message5_leo_async_deletion_queue_message_5.state
1	leo_storage/work/queue/4/message6
1	leo_storage/work/queue/4/message6_63653883922
1	leo_storage/work/queue/4/message6_leo_async_deletion_queue_message_6.state
1	leo_storage/work/queue/4/message7
1	leo_storage/work/queue/4/message7_63653883922
1	leo_storage/work/queue/4/message7_leo_async_deletion_queue_message_7.state

I changed mq.num_of_mq_procs to 4 before starting this node, in preparation for the stress load experiments per #734; it was 8 before. And there are .state files in only 4 subdirectories now (it's the same for each queue on each node). Could it be that I just did something wrong, and one shouldn't change this value on an existing system?
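
For context, the change was a one-line edit of the kind below in leo_storage.conf (the file location and surrounding settings depend on the installation; only the value itself comes from this thread):

```
## Number of message-queue processes (this node previously used 8)
mq.num_of_mq_procs = 4
```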

@mocchira
Member Author

@vstax

And after doing SIGTERM and another restart on the latest dev version, as soon as the node finished starting up and showed up as "running" on the manager, the number of messages was 0.

Got it.

I changed mq.num_of_mq_procs to 4 before starting this node, in preparation for the stress load experiments per #734; it was 8 before. And there are .state files in only 4 subdirectories now (it's the same for each queue on each node). Could it be that I just did something wrong, and one shouldn't change this value on an existing system?

It's expected.
The new code on the develop branch writes the state file only when leo_mq finished properly (a clean leo_storage stop), reads it into memory when restarting, and then removes it, so that a state file containing a stale value can never be read at the next restart.
That said, the state file now only exists while leo_storage is stopped, so the leo_mq_[0-3] instances don't each have a state file in your case.
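
Roughly, that lifecycle looks like the sketch below (hypothetical module and function names, not the actual leo_mq code; only the standard file and term functions are real):

```erlang
%% Sketch of the state-file lifecycle described above (hypothetical names).
-module(mq_state_file_sketch).
-export([on_stop/2, on_start/1]).

%% Written only on a clean stop, so an abnormal shutdown never leaves a
%% freshly written count behind.
on_stop(StateFile, Count) ->
    ok = file:write_file(StateFile, term_to_binary(Count)).

%% On start, read the saved count if the state file exists, then delete it
%% so a stale value cannot be picked up at a later restart.
on_start(StateFile) ->
    case file:read_file(StateFile) of
        {ok, Bin} ->
            Count = binary_to_term(Bin),
            ok = file:delete(StateFile),
            {ok, Count};
        {error, enoent} ->
            %% No state file (e.g. abnormal shutdown): the count has to be
            %% rebuilt by scanning the backend DB instead.
            recount
    end.
```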
