
Vanflow: Duplicate message delivery per distinct router path #1365

Closed
c-kruse opened this issue Jan 4, 2024 · 10 comments · Fixed by #1685
c-kruse commented Jan 4, 2024

As discussed in the Skupper team chat, I am observing repeated delivery of vanflow messages. This was uncovered by an unrelated skupper collector bug. Thanks for following along @ted-ross! I suspect this issue is in the router, but I didn't know where to start trying to reproduce it without the control plane.

To reproduce:

  • Set up Skupper version 1.5.1 running in three namespaces, a, b, and c, and link all three:
skupper token create --namespace a --uses 2 token-a.yaml

skupper link create --namespace b token-a.yaml
skupper token create --namespace b token-b.yaml

skupper link create --namespace c token-a.yaml
skupper link create --namespace c token-b.yaml
  • Deploy this rough-draft vanflow logging utility (source) to each namespace to tail the logs:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vanflow-logging
  labels:
    app: vanflow-logging
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vanflow-logging
  template:
    metadata:
      labels:
        app: vanflow-logging
    spec:
      containers:
      - image: quay.io/ckruse/vanflow-tool:1.0-alpha3
        imagePullPolicy: Always
        name: main
        securityContext:
          runAsNonRoot: true
        volumeMounts:
        - mountPath: /etc/messaging/
          name: skupper-local-client
        args:
        - '-messaging-config=/etc/messaging/connect.json'
        - '-enable-flow-records=false'
        - 'log'
      volumes:
      - name: skupper-local-client
        secret:
          defaultMode: 420
          secretName: skupper-local-client
  • Tail the logs of the vanflow logging utilities as you interact with the cluster. Messages originating in the A namespace are logged once from the A namespace but twice from the B and C namespaces.
@jiridanek jiridanek added the with-skupper Involves skupper-router usage through skupper label Jan 4, 2024
jiridanek (Contributor) commented:

This is probably a workaround for this?

* [Fixes phantom process records observed in the console UI skupper#1341](https://github.com/skupperproject/skupper/pull/1341)


c-kruse commented Jan 5, 2024

> This is probably a workaround for this?
>
> * [Fixes phantom process records observed in the console UI skupper#1341](https://github.com/skupperproject/skupper/pull/1341)

They are related, @jiridanek, but I'd hesitate to call either a workaround. The flow collector should be idempotent (the fix in skupper/pull/1341), and, from what I understand, vanflow messages should be delivered only once to each consumer with some reliability, regardless of van topology (the topic brought up in this issue).
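
To illustrate the idempotency half of that, a minimal sketch: a collector keyed on record identity treats a re-delivered update as a no-op. (The identities and attribute names below are made up for illustration; they are not skupper's actual vanflow types.)

# Minimal sketch of idempotent record handling in a collector: applying the
# same record update twice leaves the store unchanged. Identities and
# attribute names are illustrative only.
class RecordStore:
    def __init__(self):
        self.records = {}  # record identity -> merged attributes

    def apply(self, identity, attributes):
        # Merging is a plain dict update, so re-delivery of an identical
        # update is a no-op rather than a phantom duplicate record.
        self.records.setdefault(identity, {}).update(attributes)

store = RecordStore()
store.apply("proc-1", {"name": "payment", "state": "up"})
store.apply("proc-1", {"name": "payment", "state": "up"})  # duplicate delivery
assert len(store.records) == 1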

Karen-Schoener (Collaborator) commented:

Hi, I'm interested in investigating this ticket. If there's any concern / issue with me investigating this ticket, please do not hesitate to let me know. Thanks, Karen


c-kruse commented Nov 21, 2024

Howdy everyone. Back at my old stuff again here, this time with a cleaner reproduction.

To trigger this behavior we need a few things:

  • A network topology with multiple potential routes between a router (a vanflow event source) and the router that a vanflow event collector is attached to. That is, at minimum, a fully connected three-router network.
  • A vanflow event collector listening on all of the routers (as the skupper control plane does, with config-sync).

Example docker-compose router network: https://github.com/c-kruse/collector-benchmark/tree/skupper-router-1365

  1. Stand up the podman-compose example project.
  2. Start the container.py script listening to router-01: ./container.py -a 127.0.0.1:15671
  3. Run the container.py script with a payload against any of the routers: ./container.py -a 127.0.0.1:15673 -p hello
  4. Observe the message is received exactly once. Yay.
  5. Start the container.py script on the other two routers as well:
     • ./container.py -a 127.0.0.1:15672
     • ./container.py -a 127.0.0.1:15673
  6. Run the script with a payload on one of the routers: ./container.py -a 127.0.0.1:15673 -p greetings
  7. Observe "greetings" once from the router used to send the message, and twice on the other two routers.

EDIT for posterity: I've mostly been focused on vanflow since that is the context in which I discovered this issue, but it appears to be a general multicast routing issue. The reproduction above circumvents vanflow entirely and uses plain multicast routing.
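
For reference, the shape of a duplicate-detecting multicast receiver can be sketched with python-qpid-proton. This is an illustrative stand-in for container.py, not its source; the address multicast/test and the port are assumptions.

# Sketch of a receiver that flags duplicate deliveries on a multicast
# address. Requires python-qpid-proton; address and port are illustrative.
from proton.handlers import MessagingHandler
from proton.reactor import Container

class DuplicateDetector(MessagingHandler):
    def __init__(self, url, address):
        super().__init__()
        self.url = url
        self.address = address
        self.seen = {}  # message body -> delivery count

    def on_start(self, event):
        conn = event.container.connect(self.url)
        event.container.create_receiver(conn, self.address)

    def on_message(self, event):
        body = event.message.body
        self.seen[body] = self.seen.get(body, 0) + 1
        if self.seen[body] > 1:
            print("DUPLICATE: %r received %d times" % (body, self.seen[body]))
        else:
            print("received: %r" % body)

if __name__ == "__main__":
    # Run one instance per router, e.g. 127.0.0.1:15671, :15672, :15673.
    Container(DuplicateDetector("127.0.0.1:15671", "multicast/test")).run()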

ganeshmurthy (Contributor) commented:

> Hi, I'm interested in investigating this ticket

Hi @Karen-Schoener, are you still interested in investigating this ticket?


Karen-Schoener commented Dec 3, 2024

@ted-ross @kgiusti @ganeshmurthy, I wanted to check in and ask for advice on which skupper-router test I should update to verify this fix...

Today, locally, I updated system_tests_multicast.py to create a mesh of three routers. Running without the fix, I see that the receivers attached to edge-router-A receive the expected number of messages, while the receivers attached to edge-router-B and edge-router-C receive twice the expected number of messages.

Right now, my patched test sleeps for 1 second to wait for all multicast packets to be received.
I remember hearing the suggestion: instead of sleeping, the test could send 1 final multicast message before verifying the received count. I planned to work on this next.

Do you prefer that I add a multicast test to system_tests_router_mesh.py?
The alternative is that I could add a mesh/3-router test to system_tests_multicast.py.
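
For context, a fully connected three-router mesh in the system-test harness might be set up roughly as in the sketch below. The names, ports, and helper calls follow patterns from other tests in the suite and are assumptions, not the actual patch.

# Sketch of a three-router full mesh using the system-test harness.
# Helper usage mirrors other tests; details here are illustrative.
from system_test import TestCase, Qdrouterd

class ThreeRouterMeshMulticastTest(TestCase):
    @classmethod
    def setUpClass(cls):
        super(ThreeRouterMeshMulticastTest, cls).setUpClass()
        cls.routers = []
        inter_a = cls.tester.get_port()  # inter-router listener on A
        inter_b = cls.tester.get_port()  # inter-router listener on B

        def router(name, extra):
            config = Qdrouterd.Config([
                ('router', {'mode': 'interior', 'id': name}),
                ('listener', {'port': cls.tester.get_port(), 'role': 'normal'}),
                ('address', {'prefix': 'multicast', 'distribution': 'multicast'}),
            ] + extra)
            cls.routers.append(cls.tester.qdrouterd(name, config, wait=False))

        # B connects to A; C connects to both A and B: a full mesh of three.
        router('A', [('listener', {'role': 'inter-router', 'port': inter_a})])
        router('B', [('listener', {'role': 'inter-router', 'port': inter_b}),
                     ('connector', {'role': 'inter-router', 'port': inter_a})])
        router('C', [('connector', {'role': 'inter-router', 'port': inter_a}),
                     ('connector', {'role': 'inter-router', 'port': inter_b})])
        cls.routers[2].wait_router_connected('A')
        cls.routers[2].wait_router_connected('B')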

ganeshmurthy (Contributor) commented:

> instead of sleeping, the test could send 1 final multicast message before verifying the received count

The sender could send, say, 10 messages with a string in the body, like this:

# i is the loop variable
msg = Message(body="Message" + str(i))
event.sender.send(msg)

On the receiving end, add each message body to a Python dictionary (one dict per receiver). Check whether the body string is already present in the dict; if it is, you have received a duplicate and the test fails. Let the test time out (the timeout defaults to 60 seconds) and, when the on_timeout function is called, check that all the dicts have 10 keys (see the sketch below).
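
A receiver-side sketch of that scheme follows; the receiver_index and fail helpers and the timer wiring are hypothetical, so the real test will differ in those details.

# Sketch of the per-receiver duplicate check described above. The
# receiver_index and fail helpers are hypothetical placeholders.
from proton.handlers import MessagingHandler

class MulticastDedupeCheck(MessagingHandler):
    def __init__(self, num_receivers=3, expected=10):
        super().__init__()
        self.expected = expected
        # one dict per receiver: message body -> True
        self.received = [dict() for _ in range(num_receivers)]

    def on_message(self, event):
        idx = self.receiver_index(event.receiver)  # hypothetical: map link -> slot
        body = event.message.body
        if body in self.received[idx]:
            # A repeated body means a duplicate delivery: fail immediately.
            self.fail("duplicate on receiver %d: %r" % (idx, body))  # hypothetical hook
            return
        self.received[idx][body] = True

    def on_timeout(self):
        # Called when the test's 60-second timer expires: every receiver
        # must hold exactly the expected number of distinct bodies.
        assert all(len(d) == self.expected for d in self.received)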

> The alternative is that I could add a mesh/3-router test to system_tests_multicast.py

My preference would be to add the test to system_tests_multicast.py


ted-ross commented Dec 4, 2024

I agree on putting the test in the system_tests_multicast group.

The key is getting the test to reliably fail with the fix not in place. Waiting 60 seconds is just going to make the test really long. Using a marker-message at the end may not work because the receivers might receive the marker before receiving any duplicates. A short delay (1-5 seconds) is probably not a bad solution.


c-kruse commented Dec 4, 2024

Having very little specific technical knowledge in the area: can we use management commands to poll and wait until the senders or receivers have no more blocked/unsettled/undelivered messages?
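
Something along these lines, say, assuming the skmanage CLI and the undeliveredCount/unsettledCount attributes on the router.link entity (a sketch, not a tested recipe):

# Sketch: poll each router until no link reports undelivered or unsettled
# deliveries. Assumes the skmanage CLI; exact flags/attributes may differ.
import json
import subprocess
import time

def links_quiescent(router_url):
    # --indent=-1 disables pretty-printing, so stdout is one JSON list.
    out = subprocess.check_output(
        ['skmanage', 'query', '--type=router.link',
         '--bus', router_url, '--indent=-1'])
    return all(link.get('undeliveredCount', 0) == 0 and
               link.get('unsettledCount', 0) == 0
               for link in json.loads(out))

def wait_quiescent(router_urls, timeout=10.0, interval=0.25):
    deadline = time.time() + timeout
    while time.time() < deadline:
        if all(links_quiescent(url) for url in router_urls):
            return True
        time.sleep(interval)
    return False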


ganeshmurthy commented Dec 4, 2024

> Can we use management commands to poll

Say a receiver connected to Router A receives the 10 non-duplicate messages first, and 10 more duplicate messages are slightly delayed and have not yet entered Router A. The test might then falsely assume that everything is OK, since the blocked/unsettled/undelivered counts will be zero at that instant (before the duplicates arrive). The problem here is that we will not know when all the messages and duplicates will arrive; in some slow CI environments we have seen delayed arrival of messages. That is why I thought that the default TIMEOUT (60 seconds) gives the test enough time to run.
