
Large number of failed PUTs due to container lock? #1108

Closed
vlakas opened this issue Aug 9, 2018 · 29 comments

@vlakas

vlakas commented Aug 9, 2018

Hello.

Summary

I'm trying to investigate an issue with a large number of failed PUTs. I have a cluster with 18 nodes (27 planned in the future). LeoFS version: 1.4.2

Gateway logs:

[W]	gateway_0@asp.pleiad.uaprom	2018-08-09 16:47:16.32338 +0300	1533822436	leo_gateway_rpc_handler:handle_error/5	298	[{node,'storage_0@boa.pleiad.uaprom'},{mod,leo_storage_handler_object},{method,get},{cause,"Recover failure"}]

On the storage nodes I see many log records like this:

[E]	storage_0@chuckwalla.pleiad.uaprom	2018-08-07 17:42:12.196108 +0300	1533652932	leo_storage_handler_object:put/2	389	[{from,gateway},{method,put},{key,<<"uaprom-image/612807456_w200_h200_cid2413932_pid341885190-cb582430.jpg">>},{req_id,59131282},{cause,unavailable}]

I think that my problem is here: https://github.com/leo-project/leo_object_storage/blob/v1/src/leo_object_storage_server.erl#L431

Is this problem due to container locks? I am not sure, because I am not very familiar with Erlang.

State of cluster

It seems to me that there are no issues with ring inconsistency:

 [System Confiuration]
-----------------------------------+----------
 Item                              | Value    
-----------------------------------+----------
 Basic/Consistency level
-----------------------------------+----------
                    system version | 1.4.2
                        cluster Id | pleiad_1
                             DC Id | pleiad
                    Total replicas | 3
          number of successes of R | 1
          number of successes of W | 1
          number of successes of D | 1
 number of rack-awareness replicas | 1
                         ring size | 2^128
-----------------------------------+----------
 Multi DC replication settings
-----------------------------------+----------
 [mdcr] max number of joinable DCs | 2
 [mdcr] total replicas per a DC    | 1
 [mdcr] number of successes of R   | 1
 [mdcr] number of successes of W   | 1
 [mdcr] number of successes of D   | 1
-----------------------------------+----------
 Manager RING hash
-----------------------------------+----------
                 current ring-hash | a9bd4eb1
                previous ring-hash | ed8c3e62
-----------------------------------+----------

 [State of Node(s)]
-------+------------------------------------------+--------------+-------------------+----------------+----------------+----------------------------
 type  |                   node                   |    state     |      rack id      |  current ring  |   prev ring    |          updated at         
-------+------------------------------------------+--------------+-------------------+----------------+----------------+----------------------------
  S    | storage_0@adder.pleiad.uaprom            | running      | bemobile          | a9bd4eb1       | ed8c3e62       | 2018-07-31 17:53:12 +0300
  S    | storage_0@anole.pleiad.uaprom            | running      | newtelco          | a9bd4eb1       | ed8c3e62       | 2018-08-08 00:30:29 +0300
  S    | storage_0@asp.pleiad.uaprom              | running      | bemobile          | a9bd4eb1       | ed8c3e62       | 2018-08-03 19:15:13 +0300
  S    | storage_0@boa.pleiad.uaprom              | running      | newtelco          | a9bd4eb1       | ed8c3e62       | 2018-08-03 19:01:34 +0300
  S    | storage_0@boomslang.pleiad.uaprom        | running      | gigacenter        | a9bd4eb1       | ed8c3e62       | 2018-08-08 00:35:00 +0300
  S    | storage_0@cantil.pleiad.uaprom           | running      | gigacenter        | a9bd4eb1       | ed8c3e62       | 2018-08-03 19:24:04 +0300
  S    | storage_0@cascavel.pleiad.uaprom         | running      | newtelco          | a9bd4eb1       | ed8c3e62       | 2018-08-08 00:30:16 +0300
  S    | storage_0@chameleon.pleiad.uaprom        | running      | gigacenter        | a9bd4eb1       | ed8c3e62       | 2018-08-08 00:34:53 +0300
  S    | storage_0@chuckwalla.pleiad.uaprom       | running      | gigacenter        | a9bd4eb1       | ed8c3e62       | 2018-08-08 00:34:54 +0300
  S    | storage_0@mamba.pleiad.uaprom            | running      | gigacenter        | a9bd4eb1       | ed8c3e62       | 2018-08-03 19:14:46 +0300
  S    | storage_0@mamushi.pleiad.uaprom          | running      | bemobile          | a9bd4eb1       | ed8c3e62       | 2018-07-30 16:00:59 +0300
  S    | storage_0@mussurana.pleiad.uaprom        | running      | bemobile          | a9bd4eb1       | ed8c3e62       | 2018-08-03 19:37:48 +0300
  S    | storage_0@rattlesnake.pleiad.uaprom      | running      | newtelco          | a9bd4eb1       | ed8c3e62       | 2018-08-08 00:30:25 +0300
  S    | storage_0@sonoran.pleiad.uaprom          | running      | bemobile          | a9bd4eb1       | ed8c3e62       | 2018-08-08 00:30:16 +0300
  S    | storage_0@taipan.pleiad.uaprom           | running      | bemobile          | a9bd4eb1       | ed8c3e62       | 2018-07-30 15:02:12 +0300
  S    | storage_0@urutu.pleiad.uaprom            | running      | newtelco          | a9bd4eb1       | ed8c3e62       | 2018-08-08 00:30:16 +0300
  S    | storage_0@viper.pleiad.uaprom            | running      | newtelco          | a9bd4eb1       | ed8c3e62       | 2018-08-08 00:34:53 +0300
  S    | storage_0@wutu.pleiad.uaprom             | running      | newtelco          | a9bd4eb1       | ed8c3e62       | 2018-08-08 00:34:54 +0300
  G    | gateway_0@adder.pleiad.uaprom            | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:43:07 +0300
  G    | gateway_0@anaconda.pleiad.uaprom         | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:37:45 +0300
  G    | gateway_0@anole.pleiad.uaprom            | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:40:02 +0300
  G    | gateway_0@asp.pleiad.uaprom              | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-03 19:14:20 +0300
  G    | gateway_0@boa.pleiad.uaprom              | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:40:45 +0300
  G    | gateway_0@boomslang.pleiad.uaprom        | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:39:02 +0300
  G    | gateway_0@cantil.pleiad.uaprom           | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:38:43 +0300
  G    | gateway_0@cascabel.pleiad.uaprom         | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:37:27 +0300
  G    | gateway_0@cascavel.pleiad.uaprom         | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:41:01 +0300
  G    | gateway_0@chameleon.pleiad.uaprom        | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:38:04 +0300
  G    | gateway_0@chuckwalla.pleiad.uaprom       | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:38:24 +0300
  G    | gateway_0@mamba.pleiad.uaprom            | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:39:27 +0300
  G    | gateway_0@mamushi.pleiad.uaprom          | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:43:44 +0300
  G    | gateway_0@mussurana.pleiad.uaprom        | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:42:29 +0300
  G    | gateway_0@raddysnake.pleiad.uaprom       | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:42:00 +0300
  G    | gateway_0@rattler.pleiad.uaprom          | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:42:18 +0300
  G    | gateway_0@rattlesnake.pleiad.uaprom      | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:41:23 +0300
  G    | gateway_0@sonoran.pleiad.uaprom          | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:43:26 +0300
  G    | gateway_0@taipan.pleiad.uaprom           | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:42:49 +0300
  G    | gateway_0@urutu.pleiad.uaprom            | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:39:43 +0300
  G    | gateway_0@viper.pleiad.uaprom            | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:41:42 +0300
  G    | gateway_0@wutu.pleiad.uaprom             | running      |                   | a9bd4eb1       | ed8c3e62       | 2018-08-09 16:40:23 +0300
-------+------------------------------------------+--------------+-------------------+----------------+----------------+----------------------------

But:

  1. I have a couple million (6-11 million) messages in the leo_per_object_queue queue after cluster rebalancing (4 nodes have been added to the cluster); this seems OK to me.
  2. I have a couple of messages in the delete queues, which does not seem good because they look stuck.
    I've tried to clear the delete queues with no success; they are still not empty.
    According to the haproxy logs (the balancer in front of the leo_gateways), there are no DELETE requests at all.
 leofs-adm mq-stats storage_0@asp.pleiad.uaprom
              id                |       state       | number of msgs | batch of msgs  |    interval    |                                 description                            
--------------------------------+-------------------+----------------|----------------|----------------|-------------------------------------------------------------------------
 leo_async_deletion_queue       |      idling       | 0              | 1600           | 500            | requests of removing objects asynchronously                             
 leo_comp_meta_with_dc_queue    |      idling       | 0              | 1600           | 500            | requests of comparing metadata w/remote-node                            
 leo_delete_dir_queue_1         |      running      | 1              | 1600           | 500            | requests of removing buckets #1                                         
 leo_delete_dir_queue_2         |      running      | 1              | 1600           | 500            | requests of removing buckets #2                                         
 leo_delete_dir_queue_3         |      running      | 1              | 1600           | 500            | requests of removing buckets #3                                         
 leo_delete_dir_queue_4         |      running      | 1              | 1600           | 500            | requests of removing buckets #4                                         
 leo_delete_dir_queue_5         |      idling       | 0              | 1600           | 500            | requests of removing buckets #5                                         
 leo_delete_dir_queue_6         |      idling       | 0              | 1600           | 500            | requests of removing buckets #6                                         
 leo_delete_dir_queue_7         |      idling       | 0              | 1600           | 500            | requests of removing buckets #7                                         
 leo_delete_dir_queue_8         |      idling       | 0              | 1600           | 500            | requests of removing buckets #8                                         
 leo_per_object_queue           |      running      | 11778849       | 1600           | 500            | requests of fixing inconsistency of objects                             
 leo_rebalance_queue            |      idling       | 0              | 1600           | 500            | requests of relocating objects                                          
 leo_recovery_node_queue        |      idling       | 0              | 1600           | 500            | requests of recovering objects of the node (incl. recover-consistency)  
 leo_req_delete_dir_queue       |      running      | 1              | 1600           | 500            | requests of removing directories                                        
 leo_sync_by_vnode_id_queue     |      idling       | 0              | 1600           | 500            | requests of synchronizing objects by vnode-id                           
 leo_sync_obj_with_dc_queue     |      idling       | 0              | 1600           | 500            | requests of synchronizing objects w/remote-node

I think that this may affect PUT/DELETE requests.

How can I investigate this problem further?

@vlakas
Author

vlakas commented Aug 9, 2018

Sorry, my mistake. I can purge the delete queues. But I still cannot understand the cause of the unavailable errors.

@vlakas
Author

vlakas commented Aug 9, 2018

From the client side I see the following:

 $ s3cmd -c ~/s3_loss/vlakas put put_issue.md s3://test/
upload: 'put_issue.md' -> 's3://test/put_issue.md'  [1 of 1]
 3728 of 3728   100% in    0s   373.67 kB/s  done
WARNING: Upload failed: /put_issue.md (503 (ServiceUnavailable): Please reduce your request rate.)
WARNING: Retrying on lower speed (throttle=0.01)
WARNING: Waiting 3 sec...
upload: 'put_issue.md' -> 's3://test/put_issue.md'  [1 of 1]
 3728 of 3728   100% in    0s   298.19 kB/s  done
WARNING: Upload failed: /put_issue.md (503 (ServiceUnavailable): Please reduce your request rate.)
WARNING: Retrying on lower speed (throttle=0.05)
WARNING: Waiting 6 sec...
upload: 'put_issue.md' -> 's3://test/put_issue.md'  [1 of 1]
 3728 of 3728   100% in    0s    66.28 kB/s  done
WARNING: Upload failed: /put_issue.md (503 (ServiceUnavailable): Please reduce your request rate.)
WARNING: Retrying on lower speed (throttle=0.25)
WARNING: Waiting 9 sec...
^CSee ya!

@mocchira
Member

@vlakas

I think that my problem is here https://github.com/leo-project/leo_object_storage/blob/v1/src/leo_object_storage_server.erl#L431

The unavailable error caused by "a container being locked" can happen only while compaction is ongoing, so if no compaction is running, there must be another reason for the unavailable errors.

I've summarized when an unavailable error can happen in the comment below.

I think the most probable one you've hit is the first one, "Any watchdog got triggered". To confirm whether my guess is correct, can you share the leo_storage.conf used by your cluster? If some watchdog is enabled, disabling it may solve the problem.

The second most probable one is the last one, "No available nodes found during the read-repair process". This can happen when leo_storage nodes are under heavy load. Can you check how many system resources (CPU, memory, disk/network I/O) are being consumed on the leo_storage nodes? If it's too high, suspending the queue with "leofs-adm mq-suspend ${storage_node} leo_per_object_queue" for a while may solve (or ease) the problem.
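
For example, a minimal sketch (the node name is taken from the mq-stats output above; mq-resume undoes the suspension):

## suspend the consistency-fix queue on a loaded storage node for a while
leofs-adm mq-suspend storage_0@asp.pleiad.uaprom leo_per_object_queue
## ...watch the error rate for a while, then bring the queue back
leofs-adm mq-resume storage_0@asp.pleiad.uaprom leo_per_object_queue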

@vlakas
Author

vlakas commented Aug 10, 2018

@mocchira Thank you for your answer.

It seems I did not provide enough info here. Sorry.

  1. All watchdogs are turned off in leo_storage.conf
  2. I did try to suspend leo_per_object_queue (and then all MQ queues) on all storage nodes earlier with no effect.

But I will try suspending the queues once more and report back here.

My leo_storage.conf is here:

leo_storage.conf.txt

@mocchira
Member

@vlakas Thanks for the info.

Your leo_storage.conf looks good (no watchdog enabled), so my second guess

The second most probable one is the last one, "No available nodes found during the read-repair process". This can happen when leo_storage nodes are under heavy load. Can you check how many system resources (CPU, memory, disk/network I/O) are being consumed on the leo_storage nodes? If it's too high, suspending the queue with "leofs-adm mq-suspend ${storage_node} leo_per_object_queue" for a while may solve (or ease) the problem.

might be the reason why the unavailable errors have happened.

But I will try suspending the queues once more and report back here.

I see.
If that doesn't work for you, we will have to look further with more system metrics and Erlang/LeoFS diagnostic tools, so let me know and I'll provide the info about those tools.

I have a couple million (6-11 million) messages in the leo_per_object_queue queue after cluster rebalancing (4 nodes have been added to the cluster); this seems OK to me.

Although it's possible to add 4 nodes and rebalance them at once, this puts the whole cluster under very high load, so I'd recommend attaching and rebalancing one node at a time to keep the load on the cluster relatively low during the rebalance.
We are going to cover this topic in our official documentation later (filed as #1109).
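
A minimal sketch of the one-node-at-a-time flow (the node name is an example):

## attach one new node and rebalance, then wait before doing the next one
leofs-adm attach storage_0@newnode.pleiad.uaprom
leofs-adm rebalance
## watch the rebalance/consistency queues drain before attaching the next node
leofs-adm mq-stats storage_0@newnode.pleiad.uaprom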

@vlakas
Author

vlakas commented Aug 13, 2018

Although it's possible to add 4 nodes and rebalance them at once, this puts the whole cluster under very high load, so I'd recommend attaching and rebalancing one node at a time to keep the load on the cluster relatively low during the rebalance.

@mocchira Thank you for the rebalancing tip. It makes sense.

But I will try suspending the queues once more and report back here.

I see.
If that doesn't work for you, we will have to look further with more system metrics and Erlang/LeoFS diagnostic tools, so let me know and I'll provide the info about those tools.

Unfortunately, suspending the MQ queues makes no difference. There are no compaction tasks running.

I'm now investigating the bottleneck using system metrics.

Erlang/LeoFS diagnostic tools, so let me know and I'll provide the info about those tools.

It would be helpful.

@mocchira
Member

Unfortunately, suspending the MQ queues makes no difference. There are no compaction tasks running.

I see. Just in case, please share the output of mq-stats from every storage node.

I'm now investigating the bottleneck using system metrics.

OK. Please share the system metrics if you find something weird.

Erlang/LeoFS diagnostic tools, so let me know and I'll provide the info about those tools.
It would be helpful.

Please try the tool below on your storage nodes (especially the ones under higher load).

This enables us to see how many times each module:function is called in a certain period (so CPU-bound processes can be found) and also how many queued (remaining) tasks each Erlang process has (so I/O-bound processes can be found).

The link below is a leofs_doctor example another LeoFS user provided us to look into a problem. I hope you find it helpful when running leofs_doctor in your environment.

Let me know if you have any questions/problems.

@vlakas
Author

vlakas commented Aug 14, 2018

I see. Just in case, please share the output of mq-stats from every storage node.

mq-stats-active.txt
mq-stats-full.txt

mq-stats-active.txt shows only the active queues; mq-stats-full.txt is the full output of the mq-stats command on all storage nodes, just in case.

OK. Please share the system metrics if you find something weird.

I am afraid the main bottleneck may be the disk subsystem. On each storage node I have 3x4TB HDDs (mdadm, stripe) and a 1x1TB SSD as a caching device (bcache in writeback mode). But in most cases disk utilization is no more than 70%.

But I have 2 nodes (of 18 total) that are really slow, with 100% disk utilization (because of a small SSD cache disk, 0.5TB). However, suspending them or shutting down the leofs storage on them does not make any difference.

Formerly we ran Riak with bitcask (without active anti-entropy) on the same nodes, and it worked almost perfectly (bitcask keeps key names in RAM). Currently no more than 30% of the 48GB RAM is consumed on the storage nodes; on average 15-20%. Maybe I need some tuning of the leveldb backend? Does that make sense?

My current storage config assumes that 50% of RAM may be used by LeoFS (the rest is left for the filesystem cache):

backend_db.eleveldb.write_buf_size = 268435456
obj_containers.path = [/data/leofs/storage/avs]
obj_containers.num_of_containers = [128]       
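
For reference, a rough upper bound for the leveldb write buffers alone (my assumption: one leveldb metadata instance per container):

## 128 containers x 256 MB write_buf_size, in GiB
echo $(( 128 * 268435456 / 1073741824 ))   ## => 32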

Please try the tool below on your storage nodes (especially the ones under higher load).

@mocchira Thank you very much. I'll give it a try.

@vlakas
Author

vlakas commented Aug 14, 2018

The link below is a leofs_doctor example another LeoFS user provided us to look into a problem. I hope you find it helpful when running leofs_doctor in your environment.

But I have 2 nodes (of 18 total) that are really slow, with 100% disk utilization (because of a small SSD cache disk, 0.5TB). However, suspending them or shutting down the leofs storage on them does not make any difference.

Here is the result of leofs_doctor for 2 loaded nodes (high disk usage):
leofs_docktor_mamba.txt
leofs_docktor_viper.txt

Here are the results for the nodes with a non-empty leo_per_object_queue:
leofs_docktor_asp.txt
leofs_docktor_wutu.txt

Here are the results for normal nodes (no system metric anomalies detected; empty leo_per_object_queue):
leofs_docktor_adder.txt
leofs_docktor_mamushi.txt

@mocchira
Member

@vlakas

Formerly we ran Riak with bitcask (without active anti-entropy) on the same nodes, and it worked almost perfectly (bitcask keeps key names in RAM). Currently no more than 30% of the 48GB RAM is consumed on the storage nodes; on average 15-20%. Maybe I need some tuning of the leveldb backend? Does that make sense?

Maybe; however, your leveldb settings described below look good to me, so I think some other tuning might be needed.

My current storage config assumes that 50% of RAM may be used by LeoFS (the rest is left for the filesystem cache):

Here are my suggestions based on the results of the diagnostic tools and the system metrics.

Here is the result of leofs_doctor for 2 loaded nodes (high disk usage):

From leofs_docktor_mamba.txt, I can see that logging-related processes are bursting, probably due to the massive amount of PUT errors you're facing now, and this can result in high disk utilization on that node. So the procedure below might ease the problem.

# do the below procedure on all storage nodes
## login remote console on leo_storage
${LEOFS_ROOT}/leo_storage/bin/leo_storage remote_console

## do the below procedure in remote_console
## Limit the number of messages per second allowed from error_logger
Handlers = gen_event:which_handlers(lager_event),
lists:foreach(fun(Handler) ->
    lager:set_loghwm(Handler, 50)
end, Handlers).
application:set_env(lager, killer_hwm, 100).

These configuration changes drop log messages when the rate of writing log messages exceeds a certain threshold, so they can help avoid high disk utilization on the storage nodes.

From the other leofs_docktor_*.txt files, I noticed that leo_object_storage_read_* processes run frequently compared to a normal LeoFS cluster, so many GET/HEAD operations might be causing frequent disk seeks, resulting in high disk utilization on the storage nodes. I also noticed your cluster has many leo_gateway nodes, so the cache hit rate on each leo_gateway is probably low (as the cache gets scattered across multiple leo_gateway nodes). As a result, many RPC calls from leo_gateway to the leo_storage nodes can happen. Therefore I'd like to suggest reducing the number of leo_gateway nodes and tuning their cache settings,

which will result in fewer RPC calls to the leo_storage nodes and also reduce disk utilization on them.

I hope this will work for you.

@vlakas
Author

vlakas commented Aug 16, 2018

@mocchira I appreciate your help.

Good point about the leo_gateways. I will try to reduce the number of leo_gateways (with cache tuning) and will try some balancing tricks (i.e. some sort of IP-hash balancing or so).

The "small" caching disks have now been replaced; it may help a bit.

@vlakas
Author

vlakas commented Aug 20, 2018

  1. Applied the log rate limit on each storage node:
# do the below procedure on all storage nodes
## login remote console on leo_storage
${LEOFS_ROOT}/leo_storage/bin/leo_storage remote_console

## do the below procedure in remote_console
## Limit the number of messages per second allowed from error_logger
Handlers = gen_event:which_handlers(lager_event),
lists:foreach(fun(Handler) ->
    lager:set_loghwm(Handler, 50)
end, Handlers).
application:set_env(lager, killer_hwm, 100).
  2. 2 servers have been upgraded with new 1TB SSD disks, so the load on each node is now almost the same.

  3. The number of leo_gateways has been reduced to 4 nodes, with consistent URI balancing on haproxy (here I assume that requests for a specific URI go to the same leo_gateway in most cases; a sketch of the haproxy side is shown right after this list).

  4. Tuned leo_gateway caching:

cache.http_cache = true
cache.cache_workers = 128
cache.cache_ram_capacity = 21474836480 (20 GB)
cache.cache_disc_capacity = 1099511627776 (1 Tb)

## When the length of the object exceeds this value, store the object on disk
## cache.cache_disc_threshold_len = 1048576
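
A minimal sketch of the haproxy side (standard haproxy directives; server names and the default gateway port 8080 are assumptions):

backend leo_gateways
    mode http
    ## hash the request URI so the same object tends to hit the same gateway cache
    balance uri
    hash-type consistent
    server gw1 anaconda.pleiad.uaprom:8080 check
    server gw2 boiga.pleiad.uaprom:8080 check
    server gw3 rattler.pleiad.uaprom:8080 check
    server gw4 serpent.pleiad.uaprom:8080 check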

Unfortunately, it seems that nothing has changed at all (I mean the error rate).

Can you tell me where exactly the container locks take place: locally on the storage nodes or on the leo_manager nodes (globally)?

@vlakas
Author

vlakas commented Aug 20, 2018

One more question. There are many requests for non-existent objects in LeoFS (about 1/4 of requests); for every not_found we try to write the object to LeoFS from a backup Amazon S3 bucket. Is this an issue for eleveldb?

Is it safe to purge the leo_rebalance_queue on a storage node that is under recover-consistency? Or is there another way to cancel the recover-consistency process?

@mocchira
Member

@vlakas What you did (#1108 (comment)) looks good to me, other than

cache.http_cache = true

Please try to set this field to "false". This may reduce the number of read operations on leo_storage(s).

Can you tell me where exactly the container locks take place: locally on the storage nodes or on the leo_manager nodes (globally)?

As I said in a previous comment, "container locks" are not the root cause (they can happen only while compaction is running). For more details, please read the comment here: #1108 (comment)

Given that many read operations (GET/HEAD) are running on the leo_storage nodes according to the leofs_doctor results, I think "No available nodes found during the read-repair process" can be the root cause of the unavailable errors. So could you run leofs_doctor again and share the result on GitHub? I'd also like to know the current disk utilization from iostat or something else that gives us detailed info about the disk load. The full error log from one of the leo_storage nodes would also be helpful for digging in further.
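
For example, a few samples like this from one of the loaded storage nodes would be enough:

## extended per-device stats every 5 seconds; watch %util and await on the AVS/leveldb disks
iostat -xm 5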

One more question. There are many requests for non-existent objects in LeoFS (about 1/4 of requests); for every not_found we try to write the object to LeoFS from a backup Amazon S3 bucket. Is this an issue for eleveldb?

Yes, it can be, as it may cause additional disk seeks across multiple leveldb files on the leo_storage nodes (the number of files that have to be searched depends on the level).

Is it safe to purge the leo_rebalance_queue on a storage node that is under recover-consistency?

As long as there are no broken nodes and every object stored in LeoFS has at least one replica, it's safe. However, if the node holding the only replica breaks while there is no other redundancy, data loss could happen, so I'd recommend issuing recover-consistency on all leo_storage nodes one by one ASAP.
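
A minimal sketch of doing that node by node (assuming recover-consistency takes the target storage node as its argument, like recover-node does):

## recover one node, then watch its queues drain before moving to the next
leofs-adm recover-consistency storage_0@adder.pleiad.uaprom
leofs-adm mq-stats storage_0@adder.pleiad.uaprom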

Or is there another way to cancel the recover-consistency process?

Nothing.

@mocchira
Member

mocchira commented Aug 21, 2018

@vlakas In order to replicate your situation in our testing environment, it would be great if you could tell us:

  • Operations per second (ops) for each method
    • ex) PUT: 100, GET: 1,000, DELETE: 10, HEAD: 2,000
  • Histogram of the object sizes
    • ex) 1-64KB: 10%, 64KB-512KB: 20%, 512KB-2MB: 55%, 2MB-10MB: 10%, 10MB-50MB: 5%
  • System Architecture
    • ex) haproxy version x.x.x with uri balancing enabled -> leo_gateway x 4 -> leo_storage x 18
  • S3 clients you are using
    • ex) aws-sdk-go version x.x.x from applications and s3cmd x.x.x for operational purposes
  • Disk/Network Bandwidth
    • ex) HDD: 128MB, SSD: 512MB, Network: 1Gbps

Once we get that info, we will try to replicate the problem in an environment as similar to yours as possible.

@mocchira
Member

@vlakas I've just remembered that another user faced a problem very similar to the one you're facing now, and their problem was solved by lowering obj_containers.num_of_containers (in their case, they changed it from 384 to 64). So it might work for you as well (for example, changing it from 128 to 64 or 32).

Please bear in mind that changing obj_containers.num_of_containers on a leo_storage node means all objects stored on that node will be lost, so you need to issue recover-node after changing the configuration and restarting the node.

@vlakas
Author

vlakas commented Aug 22, 2018

@mocchira Thank you for detailed explanation

As I said in a previous comment, "container locks" are not the root cause (they can happen only while compaction is running). For more details, please read the comment here: #1108 (comment)

Got it.

cache.http_cache = true

Please try to set this field to "false". This may reduce the number of read operations on leo_storage(s).

I forgot to say that I currently have dedicated leo_gateway servers, so it will not help much.

Regarding replicating the situation: I really do not want to bother you with that for now. First of all, it seems that I have a server performance issue (in particular the disk subsystem). We decided to add more nodes to the cluster (7 or 8 nodes).

It may take a week because of the large number of small objects.

Then I will post the results here.

@vlakas
Author

vlakas commented Aug 22, 2018

@vlakas I've just remembered that another user faced a problem very similar to the one you're facing now, and their problem was solved by lowering obj_containers.num_of_containers (in their case, they changed it from 384 to 64). So it might work for you as well (for example, changing it from 128 to 64 or 32).

@mocchira Thank you for the valuable info. Unfortunately, the procedure for changing the container count will be very slow (because of the large number of stored objects), but I think I have to give it a try. If I change the number of containers on a couple of nodes, can we tell whether the change is positive? Or do we need to change it on every storage node in the cluster?

Also, do I understand the procedure for changing the number of containers correctly:

  1. stop storage node
  2. remove containers data
  3. change containers number
  4. start storage node + recover-consistency

@vlakas
Author

vlakas commented Aug 22, 2018

I will post the numbers on requests and the system metrics here a little later.

@mocchira
Member

@vlakas

If I change the number of containers on a couple of nodes, can we tell whether the change is positive? Or do we need to change it on every storage node in the cluster?

Yes, you can.
However, keep in mind that there is one drawback while "obj_containers.num_of_containers" is not consistent across the cluster: you cannot issue leofs-adm recover-disk while it is inconsistent. (You can still issue the other recover-related commands: recover-(file|node|ring|cluster|consistency).)

Also, do I understand the procedure for changing the number of containers correctly:

stop storage node
remove containers data
change containers number
start storage node + recover-consistency

After starting the storage node, issue recover-node against the node whose data was removed, instead of recover-consistency.
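
A sketch of the whole procedure with that correction (the data path is taken from your config; the target container count is an example):

## on the target storage node
${LEOFS_ROOT}/leo_storage/bin/leo_storage stop
rm -rf /data/leofs/storage/avs            ## remove that node's stored object containers
## edit leo_storage.conf: obj_containers.num_of_containers = [64]
${LEOFS_ROOT}/leo_storage/bin/leo_storage start

## from a manager node, restore the node's data from its replicas
leofs-adm recover-node storage_0@<the-node>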

@vlakas
Author

vlakas commented Sep 3, 2018

@mocchira Here are some fresh leofs_doctor results. I currently have 24 nodes (recently it was only 18), with the same hardware configuration.

Some notes:

  1. Nodes with a reduced container count (from 64 to 32):
    • anaconda
    • rattler
    • serpent
    • boiga
  2. Nodes with a high leo_per_object_queue count:
    • adder
    • boa
    • boomslang
    • wutu

storage_0@adder.pleiad.uaprom.txt
storage_0@anaconda.pleiad.uaprom.txt
storage_0@anole.pleiad.uaprom.txt
storage_0@asp.pleiad.uaprom.txt
storage_0@boa.pleiad.uaprom.txt
storage_0@boiga.pleiad.uaprom.txt
storage_0@boomslang.pleiad.uaprom.txt
storage_0@cantil.pleiad.uaprom.txt
storage_0@cascabel.pleiad.uaprom.txt
storage_0@cascavel.pleiad.uaprom.txt
storage_0@chameleon.pleiad.uaprom.txt
storage_0@chuckwalla.pleiad.uaprom.txt
storage_0@mamba.pleiad.uaprom.txt
storage_0@mamushi.pleiad.uaprom.txt
storage_0@mussurana.pleiad.uaprom.txt
storage_0@raddysnake.pleiad.uaprom.txt
storage_0@rattler.pleiad.uaprom.txt
storage_0@rattlesnake.pleiad.uaprom.txt
storage_0@serpent.pleiad.uaprom.txt
storage_0@sonoran.pleiad.uaprom.txt
storage_0@taipan.pleiad.uaprom.txt
storage_0@urutu.pleiad.uaprom.txt
storage_0@viper.pleiad.uaprom.txt
storage_0@wutu.pleiad.uaprom.txt

It seems to me that the nodes with 32 containers work more smoothly (iowait is around 2-5%). But I am not sure that the container count is the cause of this effect. It would be good if you could look at the leofs_doctor results for the nodes with 32 containers (see the list above).

@vlakas
Author

vlakas commented Sep 3, 2018

leofs-adm status

 [System Confiuration]
-----------------------------------+----------
 Item                              | Value    
-----------------------------------+----------
 Basic/Consistency level
-----------------------------------+----------
                    system version | 1.4.1
                        cluster Id | pleiad_1
                             DC Id | pleiad
                    Total replicas | 3
          number of successes of R | 1
          number of successes of W | 2
          number of successes of D | 1
 number of rack-awareness replicas | 1
                         ring size | 2^128
-----------------------------------+----------
 Multi DC replication settings
-----------------------------------+----------
 [mdcr] max number of joinable DCs | 2
 [mdcr] total replicas per a DC    | 1
 [mdcr] number of successes of R   | 1
 [mdcr] number of successes of W   | 1
 [mdcr] number of successes of D   | 1
-----------------------------------+----------
 Manager RING hash
-----------------------------------+----------
                 current ring-hash | 8e2b866c
                previous ring-hash | 7cfdbdfc
-----------------------------------+----------

 [State of Node(s)]
-------+------------------------------------------+--------------+----------------+----------------+----------------------------
 type  |                   node                   |    state     |  current ring  |   prev ring    |          updated at         
-------+------------------------------------------+--------------+----------------+----------------+----------------------------
  S    | storage_0@adder.pleiad.uaprom            | running      | 8e2b866c       | 7cfdbdfc       | 2018-09-03 14:55:02 +0300
  S    | storage_0@anaconda.pleiad.uaprom         | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-28 18:36:45 +0300
  S    | storage_0@anole.pleiad.uaprom            | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-09 17:59:05 +0300
  S    | storage_0@asp.pleiad.uaprom              | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-16 13:47:17 +0300
  S    | storage_0@boa.pleiad.uaprom              | running      | 8e2b866c       | 7cfdbdfc       | 2018-09-03 16:15:56 +0300
  S    | storage_0@boiga.pleiad.uaprom            | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-24 00:30:16 +0300
  S    | storage_0@boomslang.pleiad.uaprom        | running      | 8e2b866c       | 7cfdbdfc       | 2018-09-03 14:59:50 +0300
  S    | storage_0@cantil.pleiad.uaprom           | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-18 15:16:37 +0300
  S    | storage_0@cascabel.pleiad.uaprom         | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-20 20:03:43 +0300
  S    | storage_0@cascavel.pleiad.uaprom         | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-09 18:03:06 +0300
  S    | storage_0@chameleon.pleiad.uaprom        | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-09 18:08:21 +0300
  S    | storage_0@chuckwalla.pleiad.uaprom       | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-09 18:12:20 +0300
  S    | storage_0@mamba.pleiad.uaprom            | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-16 13:44:20 +0300
  S    | storage_0@mamushi.pleiad.uaprom          | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-09 18:15:17 +0300
  S    | storage_0@mussurana.pleiad.uaprom        | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-03 19:37:48 +0300
  S    | storage_0@raddysnake.pleiad.uaprom       | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-20 20:03:43 +0300
  S    | storage_0@rattler.pleiad.uaprom          | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-24 00:30:16 +0300
  S    | storage_0@rattlesnake.pleiad.uaprom      | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-09 19:02:51 +0300
  S    | storage_0@serpent.pleiad.uaprom          | running      | 8e2b866c       | 7cfdbdfc       | 2018-09-03 16:20:00 +0300
  S    | storage_0@sonoran.pleiad.uaprom          | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-09 19:27:09 +0300
  S    | storage_0@taipan.pleiad.uaprom           | running      | 8e2b866c       | 7cfdbdfc       | 2018-07-30 15:02:12 +0300
  S    | storage_0@urutu.pleiad.uaprom            | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-08 00:30:16 +0300
  S    | storage_0@viper.pleiad.uaprom            | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-16 14:08:31 +0300
  S    | storage_0@wutu.pleiad.uaprom             | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-20 18:32:38 +0300
  G    | gateway_0@anaconda.pleiad.uaprom         | running      | 8e2b866c       | 7cfdbdfc       | 2018-09-03 19:06:32 +0300
  G    | gateway_0@boiga.pleiad.uaprom            | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-20 15:29:07 +0300
  G    | gateway_0@rattler.pleiad.uaprom          | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-20 15:29:21 +0300
  G    | gateway_0@serpent.pleiad.uaprom          | running      | 8e2b866c       | 7cfdbdfc       | 2018-08-28 18:44:31 +0300

@mocchira
Member

mocchira commented Sep 4, 2018

@vlakas Thanks for the info.

Nothing in the leofs_doctor results looks bad to me.

But I am not sure that the container count is the cause of this effect. It would be good if you could look at the leofs_doctor results for the nodes with 32 containers (see the list above).

In order to check whether num_of_containers is the cause, it would be better to compare iostat between a storage node with 32 containers and one with 64 containers. If disk utilization is lower on the node with 32 containers, then the change works.

@vlakas
Author

vlakas commented Sep 11, 2018

@mocchira I have good news. LeoFS works as expected. The reason for the failures was an over-tuned leo_storage config.

I've reverted to the default config with minor modifications and reassembled the LeoFS cluster.

Later I will post a diff of the configs (currently I do not understand the root cause of my problem).

@mocchira
Member

I have good news. LeoFS works as expected. The reason for the failures was an over-tuned leo_storage config.
I've reverted to the default config with minor modifications and reassembled the LeoFS cluster.

Great to hear that.

Later I will post a diff of the configs (currently I do not understand the root cause of my problem).

Yes, please; that will help me a lot in finding the root cause.

@vlakas
Author

vlakas commented Sep 14, 2018

leo_storage.conf-new.txt
leo_storage.conf-old.txt

@mocchira The new (currently working) config and the old (non-working) config are attached here.

Also, containers_num is currently 64 on each storage node. Rack awareness has been turned off (each node has the same rack_id).

@mocchira
Member

@vlakas Thanks for the great feedback.

> ## e.g. Case of plural pathes
> ## obj_containers.path = [/var/leofs/avs/1, /var/leofs/avs/2]
> ## obj_containers.num_of_containers = [32, 64]
>
41c45
< num_of_vnodes = 168
---
> #num_of_vnodes = 168
61c65
< max_num_of_procs = 3000
---
> #max_num_of_procs = 10000
66c70
< ## num_of_obj_storage_read_procs = 3
---
> #num_of_obj_storage_read_procs = 100
79c83
< watchdog.rex.is_enabled = true
---
> watchdog.rex.is_enabled = false
186c190
< compaction.limit_num_of_compaction_procs = 4
---
> compaction.limit_num_of_compaction_procs = 1
218c222
< mq.num_of_batch_process_max = 3000
---
> mq.num_of_batch_process_max = 10000
227c231
< mq.interval_between_batch_procs_max = 3000
---
> mq.interval_between_batch_procs_max = 1000
237,238c241
< backend_db.eleveldb.write_buf_size = 62914560
< #backend_db.eleveldb.write_buf_size = 268435456
---
> backend_db.eleveldb.write_buf_size = 268435456
241c244
< ## backend_db.eleveldb.max_open_files = 1000
---
> backend_db.eleveldb.max_open_files = 10000
333c336
< ## rpc.server.acceptors = 128
---
> rpc.server.acceptors = 5186
339c342
< ## rpc.server.listen_timeout = 30000
---
> #rpc.server.listen_timeout = 30000
342c345
< ## rpc.client.connection_pool_size = 8
---
> #rpc.client.connection_pool_size = 192
345c348
< ## rpc.client.connection_buffer_size = 8
---
> #rpc.client.connection_buffer_size = 192
442c445
< ## snmp_conf = ./snmp/snmpa_storage_0/leo_storage_snmp
---
> ## snmp_conf = ./snmp/snmpa_storage_0/leofs_storage_snmp

The differences that may affect the performance and error rate are:

  • mq.num_of_batch_process_max
  • mq.interval_between_batch_procs_max

Since those two settings control how fast the background processing (recover/rebalance) proceeds (in other words, how many system resources the background processing consumes), the old configuration caused the background processing to consume far more resources than the frontend (handling PUT/GET/DELETE/HEAD), which resulted in the problem you faced (many PUT failures).
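
Concretely, these are the values from the diff above (working values, with the old ones noted as comments):

## gentler background processing in the working config
## old: mq.num_of_batch_process_max = 10000
mq.num_of_batch_process_max = 3000
## old: mq.interval_between_batch_procs_max = 1000 (ms)
mq.interval_between_batch_procs_max = 3000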

I'm going to add this information to the FAQ section of our official documentation for those who may face the same problem in the future.

Thanks again.

@mocchira
Member

@vlakas If there are no remaining problems, please close the issue.

@vlakas
Author

vlakas commented Sep 18, 2018

@mocchira Thank you very much for the amazing support. This issue is completely solved.

@vlakas vlakas closed this as completed Sep 18, 2018