Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Etcd ignores error code and exits prematurely on Watch #26

Closed
abronan opened this issue Jul 17, 2015 · 8 comments
Closed

Etcd ignores error code and exits prematurely on Watch #26

abronan opened this issue Jul 17, 2015 · 8 comments
Labels

Comments

@abronan
Copy link
Contributor

abronan commented Jul 17, 2015

(From @springi99)

Hi,

I am newbie here, i don't know whether is this the right forum for my problem...
I have problem with using etcd as kv store:

docker -D -d --kv-store=etcd:10.0.0.105:4001 --label=com.docker.network.driver.overlay.bind_interface=eth1

failed because it cannot find keys in etcd. I tried to find a solution for this:

in vendor/src/github.com/docker/libkv/store/etcd/etcd.go in Get() you should add error code 100 as well:

-                       if etcdError.ErrorCode == 102 || etcdError.ErrorCode == 104 {
+                       if etcdError.ErrorCode == 102 || etcdError.ErrorCode == 104 || etcdError.ErrorCode == 100 {

furthermore in WatchTree you should not return back when List() returns with error:

        current, err := s.List(directory)
        if err != nil {
 -               return nil, err
 +               log.Info("watchtree error: ", directory)
        }

with these modification i can start docker with etcd. I am able to create overlay network, publish and attach service to a running container. The problem comes when i try to unpublish the service. in etcd the endpoint top directory is not deleted:

                        {
                            "createdIndex": 130,
                            "dir": true,
                            "key": "/docker/libnetwork/endpoint",
                            "modifiedIndex": 130,
                            "nodes": [
                                {
                                    "createdIndex": 130,
                                    "dir": true,
                                    "key": "/docker/libnetwork/endpoint/be2f0baf7de54e0499ee061c862bb3c405defa2d39e6fd8abfa8cbebbcbe451d",
                                    "modifiedIndex": 130
                                },
                                {
                                    "createdIndex": 144,
                                    "dir": true,
                                    "key": "/docker/libnetwork/endpoint/7e784942ec0c4430425bd6a68423c42d61db7eff249bced95f2a519cfa000dd0",
                                    "modifiedIndex": 144
                                }
                            ]
                        },

and it seems veth pairs are not deleted, too.

Actually i did not try whether traffic goes through or not. Did you meet with this problem? Thanks your help in advance,

robert.

@abronan
Copy link
Contributor Author

abronan commented Jul 17, 2015

(From @springi99)

Until the moving is done related to endpoint top directory (or KeyPrefix):

--- a/vendor/src/github.com/docker/libnetwork/store.go
+++ b/vendor/src/github.com/docker/libnetwork/store.go
@@ -87,7 +87,8 @@ func (c *controller) deleteNetworkFromStore(n *network) error {
        if err := cs.DeleteObjectAtomic(n); err != nil {
                return err
        }
-
+       tmpEp := endpoint{network: n}
+       cs.KVStore().DeleteTree(datastore.Key(tmpEp.KeyPrefix()...))
        return nil
 }

solves the problem. Sorry about veth pairs they are not belonging to docker.

@abronan
Copy link
Contributor Author

abronan commented Jul 17, 2015

@springi99 I think the default behavior of List exiting right away is the right one, this avoids launching a goroutine inappropriately because the failure on List might be related to a network connectivity issue or a wrong key.

It is to the users of the lib (in this case libnetwork) to deal with the Watch/WatchTree failure and catch the appropriate error as well as retry on failure if the Watch should be restarted.

The missing catch on error 100 still holds though so if you are interested in submitting a patch I'll be happy to review an merge :)

Same goes for the changes you made on libnetwork, it will definitely help on this side as well :)

@springi99
Copy link
Contributor

Hi,

The problem with the WatchTree is the following: networks for defaults (zero, host, bridge) are also creating this watch and it fails with etcd and not with consul. if you compare the consul.go and etcd.go you can see that in case of failure etcd returns with an error and consule not. This is why i thought it is enough to remove that return. For other networks (overlay and custom/remote e.g. openvswitch from shettyg/ovn-docker) creates the right tree structure. for remote networking some features are still missing, but you can substitute them with hand made scripts.

Related to commiting: do you have any process description for that? Unfortunately i am a newbie and I don't know how can I do it officially.

I have other questions related to networking, if you can provide me resources to learn it would be nice:

  • the difference between endpoint and service: i am a little bit confused as there are rest apis for both of them but on cli only services can be attached. Currently it seems to me that they are equivalaent but maybe they will mean different things in the future. E.g. on cli i can use only 1 publish and it is splitted up by the code and attach "endpoint" according to the result. If there are more publishes in the cli only the last one is taken. There were some words about "service container" but i did not catch the point. I think endpoint is somehow lighter by concept, mean multiple interfaces to the containers, but there is no cli for that.

Thanks your help!

@abronan
Copy link
Contributor Author

abronan commented Jul 20, 2015

@springi99 Oh I see, you are right! I think the Get and List calls of Watch/WatchTree are just misplaced for etcd and should be placed inside the goroutine so that the return actually closes the watch channel (like it's done with Consul). Just tested on the Leader Election recovery of Swarm and it works just fine, so I'd expect it to work as well with libnetwork. I will submit the PR for this as we also need it for docker swarm.

On the whole process of contributing I think that this contains all you need to get started:
https://docs.docker.com/project/make-a-contribution/

If you need some help on getting started with your first PR, do not hesitate to ping us on our IRC channel: #docker-swarm

Regarding your question on libnetwork, I think you can ask this on the docker/libnetwork repository, opening a new Issue. They will provide you with a more in-depth answer that I can provide you and this will also help other users with the same question. Don't want to make a mistake and mislead you on the notion of endpoint and service as things are moving fast on this side :)

Thanks for reporting the issue in the first place though, helps a lot to have users testing things around not only on this repository but on the interaction with other projects! This is a contribution on its own.

springi99 added a commit to springi99/libkv that referenced this issue Jul 27, 2015
Signed-off-by: Robert Springer <springi99@gmail.com>
@springi99
Copy link
Contributor

Hi,

sorry for the delay... I tried to follow the process, the result can be found here:

https://github.com/springi99/libkv/tree/etcd-fixes

if you think it is ok then tell me what to do. Thanks, robert.

@abronan
Copy link
Contributor Author

abronan commented Jul 29, 2015

Hi @springi99, sorry for the delay. If you go to your fork (https://github.com/springi99/libkv/) you should see a button "Create pull request" that appears on top. Click on that, and then click "Create pull request" on that new page. Don't worry putting a description there, it's fine for this PR. Let me know if you have any trouble. I can still create it for you and you'll get credit for your work anyway. Thanks!

@springi99
Copy link
Contributor

Hi,

I created it, please check it. Thanks in advance!

abronan added a commit that referenced this issue Aug 3, 2015
@abronan
Copy link
Contributor Author

abronan commented Aug 17, 2015

Closed by #35

@abronan abronan closed this as completed Aug 17, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants