Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue on deleting ES Snapshot with Azure plugin #25424

Closed
etiennecarriere opened this issue Jun 27, 2017 · 11 comments
Closed

Issue on deleting ES Snapshot with Azure plugin #25424

etiennecarriere opened this issue Jun 27, 2017 · 11 comments
Labels
>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@etiennecarriere
Copy link
Contributor

Hi,

I have issues to delete snapshot on a repository managed with Azure plugin :

  • Our configuration is the following :
    = ElasticSearch version : 5.4.1 (via https://artifacts.elastic.co/packages/5.x/yum)
    = Plugins installed : x-pack, repository-azure
    = JVM version : Openjdk 1.8.0.121 (RPM : java-1.8.0-openjdk-1.8.0.121-0.b13.el7_3.x86_64)
    = OS Version : Centos 7.3 / Linux xxx 3.10.0-514.10.2.el7.x86_64 Query DSL: Terms Filter #1 SMP Fri Mar 3 00:04:05 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
    = Point to note : We use the Azure plugin but we are not in the Azure infrastructure

Description of the problem including expected versus actual behavior:

Expected behaviour :

  • Possibility to delete more than one snapshot from Azure repository
    Actual Behaviour:
  • The DELETE Request never ends up
  • When you ask the status of the Snapshot, you take a 404 snapshot_missing_exception
  • When you try to delete an other snapshot, you get a 503 concurrent_snapshot_execution_exception

Steps to reproduce:

I always achieve to reproduce it on my test environnement (happens first on production) :

  1. curl -v -X DELETE http://X.Y.Z.T:9200/_snapshot/testbackups-azure/snapshot_XXX . The request hangs up
  2. In the same time, on the second terminal , curl -v http://X.Y.Z.T:9200/_snapshot/testbackups-azure/snapshot_XXX. You get
    {"error":{"root_cause":[{"type":"snapshot_missing_exception","reason":"[testbackups-azure:snapshot_XXX] is missing"}],"type":"snapshot_missing_exception","reason":"[testbackups-azure:snapshot_XXX] is missing"},"status":404}
  3. On the second terminal, ask for curl -v -X DELETE http://X.Y.Z.T:9200/_snapshot/testbackups-azure/snapshot_YYY (a different Snapshot this time).
    {"error":{"root_cause":[{"type":"concurrent_snapshot_execution_exception","reason":"[testbackups-azure:snapshot_YYY/ZZZ] cannot delete - another snapshot is currently being deleted"}],"type":"concurrent_snapshot_execution_exception","reason":"[testbackups-azure:snapshot_YYY/ZZZ] cannot delete - another snapshot is currently being deleted"},"status":503}

When I do a rolling restart of the cluster ES, it come back to normal and the first snapshot (XXX in our case) is no more present and it is possible to delete an snapshot.

@tlrx
Copy link
Member

tlrx commented Jun 27, 2017

You cannot delete a snapshot if one is being executed or deleted. In your case, the first operation 1) hangs and it forbids any deletion like 3) to happen.

Before deleting a snapshot, the master node retrieves the list of snapshots from the remote repository (and it can takes time... in the meanwhile if your execute some get snapshot like 2) it will answer that it doesn't know the snapshot) and after that it update the cluster state to inform that a deletion is going to be executed. When the cluster state is updated, it deletes the snapshot on the repository.

It would be interesting to know why the first operation takes time. Maybe you have a lot of snapshots on the Azure repository?

@tlrx tlrx added :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs feedback_needed labels Jun 27, 2017
@abeyad
Copy link

abeyad commented Jun 27, 2017

The key will be finding out the cause of the first operation taking a while. Given the above described scenario, it looks like all of the snapshot files are deleted appropriately from Azure, but the snapshot deletion in progress is not removed from the cluster state, which is the last step in the process of deleting a snapshot. We will likely need to see your logs from the master node to help diagnose the problem. It seems like the master node is hanging on something.

@etiennecarriere
Copy link
Contributor Author

etiennecarriere commented Jun 27, 2017

Hi,

Thank you for your help. Here are the new elements :
@tlrx, today I waited a long time and after a lot of time the process of the remove of snapshot finished :

  • 7 minutes on the test cluster (1800 snapshot and a volumetry of the cluster estimated at 600 Gbytes)
  • on production, it doesn't finished after a timeout of 3 hours (aroud 50 snapshots for a volumetry of the cluster estimated to 6 Tbytes)
    @abeyad , I see no logs on any of our nodes on (we have log at info level).

Is there some request (HTTP or transport API) that can give me the process of delete of the snapshot . I see nothing on pending_tasks nor snapshot API endpoints ? or some log I could activate ?

@abeyad
Copy link

abeyad commented Jun 27, 2017

No, deletions should be quick, so we don't provide a status endpoint for those. Is there nothing in the logs whatsoever? Can you increase the logging for snapshotting? You can use:

PUT /_cluster/settings
{
    "transient" : {
        "logger.org.elasticsearch.snapshots" : "DEBUG",
        "logger.org.elasticsearch.cluster.service": "DEBUG"
    }
}

@tlrx
Copy link
Member

tlrx commented Jun 27, 2017

I suspect the large number of snapshots being an issue here, but I have no clue so I 👍 @abeyad suggestion to grab more information by logging at the debug level.

@etiennecarriere
Copy link
Contributor Author

I put the log in debug and it was acknowledged correctly

{
  "acknowledged": true,
  "persistent": {},
  "transient": {
    "logger": {
      "org": {
        "elasticsearch": {
          "snapshots": "DEBUG",
          "cluster": {
            "service": "DEBUG"
          }
        }
      }
    }
  }
}

I retried to remove a snapshot on my test platform :
11:18:56 change of the transient parameters
curator :
2017-06-28 11:20:34,648 INFO Preparing Action ID: 1, "delete_snapshots"
2017-06-28 11:20:34,657 INFO Trying Action ID: 1, "delete_snapshots": Delete snapshots from the selected repository older than 2 days
2017-06-28 11:24:05,369 INFO Deleting selected snapshots
2017-06-28 11:27:21,182 INFO Deleting snapshot snapshot_2017_06_10_02_56_11...
...
Take 4 minutes to have the list of the snapshots and never acknowledge the delete .

es master node :
[2017-06-28T11:18:56,066][DEBUG][o.e.c.s.ClusterService ] [es3-main] set local cluster state to version 197349
[2017-06-28T11:18:56,070][DEBUG][o.e.c.s.ClusterService ] [es3-main] processing [cluster_update_settings]: took [35ms] done applying updated cluster_state (version: 197349, uuid: lZr4L7UrQKyKJe42pAlnlg)
[2017-06-28T11:18:56,070][DEBUG][o.e.c.s.ClusterService ] [es3-main] processing [reroute_after_cluster_update_settings]: execute
[2017-06-28T11:18:56,076][DEBUG][o.e.c.s.ClusterService ] [es3-main] processing [reroute_after_cluster_update_settings]: took [5ms] no change in cluster_state
[2017-06-28T11:27:22,379][DEBUG][o.e.c.s.ClusterService ] [es3-main] processing [delete snapshot]: execute
[2017-06-28T11:27:22,379][DEBUG][o.e.c.s.ClusterService ] [es3-main] cluster state updated, version [197350], source [delete snapshot]
[2017-06-28T11:27:22,380][DEBUG][o.e.c.s.ClusterService ] [es3-main] publishing cluster state version [197350]
[2017-06-28T11:27:22,392][DEBUG][o.e.c.s.ClusterService ] [es3-main] applying cluster state version 197350
[2017-06-28T11:27:22,392][DEBUG][o.e.c.s.ClusterService ] [es3-main] set local cluster state to version 197350
[2017-06-28T11:27:22,409][DEBUG][o.e.c.s.ClusterService ] [es3-main] processing [delete snapshot]: took [30ms] done applying updated cluster_state (version: 197350, uuid: XXXXXXXXXXXXXXXXXXXXX)

It confirm that the delete part is fast but then it hangs up.
Unfortunately, I have no logs on Snapshot classes.

Regards,

Etienne

@etiennecarriere
Copy link
Contributor Author

Complement on the logfile with the end of the remove snapshot :
[2017-06-28T11:27:22,379][DEBUG][o.e.c.s.ClusterService ] [es3-main] processing [delete snapshot]: execute
[2017-06-28T11:27:22,379][DEBUG][o.e.c.s.ClusterService ] [es3-main] cluster state updated, version [197350], source [delete snapshot]
[2017-06-28T11:27:22,380][DEBUG][o.e.c.s.ClusterService ] [es3-main] publishing cluster state version [197350]
[2017-06-28T11:27:22,392][DEBUG][o.e.c.s.ClusterService ] [es3-main] applying cluster state version 197350
[2017-06-28T11:27:22,392][DEBUG][o.e.c.s.ClusterService ] [es3-main] set local cluster state to version 197350
[2017-06-28T11:27:22,409][DEBUG][o.e.c.s.ClusterService ] [es3-main] processing [delete snapshot]: took [30ms] done applying updated cluster_state (version: 197350, uuid: XXXXXXXXXXXXXXXXXXXXX)
[2017-06-28T12:35:10,224][DEBUG][o.e.c.s.ClusterService ] [es3-main] processing [remove snapshot deletion metadata]: execute
[2017-06-28T12:35:10,225][DEBUG][o.e.c.s.ClusterService ] [es3-main] cluster state updated, version [197351], source [remove snapshot deletion metadata]
[2017-06-28T12:35:10,225][DEBUG][o.e.c.s.ClusterService ] [es3-main] publishing cluster state version [197351]
[2017-06-28T12:35:10,236][DEBUG][o.e.c.s.ClusterService ] [es3-main] applying cluster state version 197351
[2017-06-28T12:35:10,236][DEBUG][o.e.c.s.ClusterService ] [es3-main] set local cluster state to version 197351
[2017-06-28T12:35:10,242][DEBUG][o.e.c.s.ClusterService ] [es3-main] processing [remove snapshot deletion metadata]: took [17ms] done applying updated cluster_state (version: 197351, uuid: XXXXXXXXXXXXXXXXXXXXX)

@abeyad
Copy link

abeyad commented Jun 28, 2017

This is problematic:

[2017-06-28T11:27:22,409][DEBUG][o.e.c.s.ClusterService ] [es3-main] processing [delete snapshot]: took [30ms] done applying updated cluster_state (version: 197350, uuid: XXXXXXXXXXXXXXXXXXXXX)
[2017-06-28T12:35:10,224][DEBUG][o.e.c.s.ClusterService ] [es3-main] processing [remove snapshot deletion metadata]: execute

The deletion started at 11:27 and we only begin to remove the snapshot deletion from the cluster state at 12:35, a full hour after the deletion started. It seems to me the Azure access is very slow.

Apologies, but can you kindly re-run the test with DEBUG logging enabled as follows:

PUT /_cluster/settings
{
    "transient" : {
        "logger.org.elasticsearch.snapshots" : "DEBUG",
        "logger.org.elasticsearch.cluster.service": "DEBUG",
        "logger.org.elasticsearch.repositories": "DEBUG",
        "logger.org.elasticsearch.cloud.azure": "DEBUG",
        "logger.com.microsoft.azure.storage": "DEBUG"
    }
}

and send the logs, as complete as possible?

@etiennecarriere
Copy link
Contributor Author

Hi,

In the same time, we upgraded to 5.4.3 our testing database so the new test was with 5.4.3 . The snapshot log is on gist : https://gist.github.com/etiennecarriere/cc21d7e079fb24d8b7d0c65449065f65
I remove all unique ID and replace them with XXX, YYY, ZZZ, TTT .
It seems to prove it is the listing of the azure container that is very slow (50 seconds at each time)

Regards,

Etienne

@etiennecarriere
Copy link
Contributor Author

etiennecarriere commented Jul 1, 2017

Hi,

I confirm what seems to be a very ineffective implementation of the list in the azure plugin. On a bad Internet access (6 Mbits download/ 1 Mbits upload / 50 ms latency to Azure DC), I come from 1162 seconds to remove one snapshot to 185 seconds with my patch to remove a snapshot from a repository with 45 snapshots of 90 shards each.

I see with my employer Monday how I can pull request it.

Regards,

Etienne

@etiennecarriere
Copy link
Contributor Author

@abeyad , I permit to come back to you to see if you had time to have an opinion on the pull request I proposed

rjernst pushed a commit that referenced this issue Sep 11, 2017
…pshot) (#25710)

This commit reworks the azure listing of snapshot files to do a single listing, instead of once per blob.

closes #25424
rjernst pushed a commit that referenced this issue Sep 11, 2017
…pshot) (#25710)

This commit reworks the azure listing of snapshot files to do a single listing, instead of once per blob.

closes #25424
rjernst pushed a commit that referenced this issue Sep 11, 2017
…pshot) (#25710)

This commit reworks the azure listing of snapshot files to do a single listing, instead of once per blob.

closes #25424
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs
Projects
None yet
Development

No branches or pull requests

4 participants