
Toxiproxy hangs #159

Closed · archelan opened this issue Jan 12, 2017 · 14 comments

archelan commented Jan 12, 2017

Scenario is:

  • Add proxy
    toxiproxy-cli create proxy_20000_to_20001 --listen localhost:20000 --upstream localhost:20001

  • Add latency toxic (also reproducing with others like bandwidth)
    toxiproxy-cli toxic add proxy_20000_to_20001 --upstream --toxicName LatencyIssue -t latency -a latency=500 -a jitter=0

  • Start sending some data through, e.g.
    while true; do netcat -l -p 20001; done
    for i in {1..100000}; do sleep 0.1; echo "$i" | tee /dev/tty; done | netcat localhost 20000
  • Add timeout toxic (data flow stops as expected)
    toxiproxy-cli toxic add proxy_20000_to_20001 --upstream --toxicName TimeoutIssue -t timeout -a timeout=0

  • Remove first toxic
    toxiproxy-cli toxic delete proxy_20000_to_20001 --toxicName LatencyIssue

  • As a result, the last command just hangs. If you try to run other commands like 'toxiproxy-cli list', they hang as well. I'm using the latest 2.1.0 release of toxiproxy and the latest 1.19.1 release of the CLI.
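
For reference, the same repro can be driven through the HTTP API; a rough sketch with the Go client (assuming the github.com/Shopify/toxiproxy/client package and the default API address):

```go
package main

import (
	"log"

	toxiproxy "github.com/Shopify/toxiproxy/client"
)

func main() {
	// Assumes a toxiproxy server on the default API address.
	client := toxiproxy.NewClient("localhost:8474")

	// Step 1: create the proxy.
	proxy, err := client.CreateProxy("proxy_20000_to_20001",
		"localhost:20000", "localhost:20001")
	if err != nil {
		log.Fatal(err)
	}

	// Step 2: latency toxic on the upstream direction.
	if _, err := proxy.AddToxic("LatencyIssue", "latency", "upstream", 1.0,
		toxiproxy.Attributes{"latency": 500, "jitter": 0}); err != nil {
		log.Fatal(err)
	}

	// ... push data through localhost:20000 here, as in step 3 ...

	// Step 4: timeout toxic; data flow stops as expected.
	if _, err := proxy.AddToxic("TimeoutIssue", "timeout", "upstream", 1.0,
		toxiproxy.Attributes{"timeout": 0}); err != nil {
		log.Fatal(err)
	}

	// Step 5: this is the call that never returns.
	if err := proxy.RemoveToxic("LatencyIssue"); err != nil {
		log.Fatal(err)
	}
}
```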


sirupsen commented Feb 2, 2017

Ugh, I don't have time to personally debug this currently. Do you have time to dig into this? I'm considering rolling back the 2.1.0 release for the time being.


xthexder commented Feb 2, 2017 via email


sirupsen commented Feb 2, 2017

Amazing @xthexder! 🎉 😍

I've given you admin access to the repo. 👍


sirupsen commented Feb 5, 2017

@xthexder do you need anything else from me?

@xthexder

Sorry for the delay on this, I've been basically offline for the last 2 weeks due to school events and assignments.

I had a deeper look into this, and while I have a somewhat temporary fix, it's not a complete one.
The timeout toxic goes against the architectural design assumption that a toxic will always be able to write its output.
Since the timeout toxic never reads any data, the latency toxic gets stuck trying to write, and hangs.

The easy fix is to give the timeout toxic a buffer, but this only delays the problem, since if enough data is passed through, the buffer will fill up.
I started trying a few things, but the problem is still going to be there in some form unless there's a huge code rewrite (which I'd rather not do again).
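
The deadlock is easy to reproduce in miniature (standalone Go, not toxiproxy code): if a downstream stage never reads, the upstream stage blocks on its write forever:

```go
package main

func main() {
	pipe := make(chan []byte) // link between two pipeline stages ("toxics")
	done := make(chan struct{})

	// Upstream stage, playing the latency toxic: forwards a chunk downstream.
	go func() {
		pipe <- []byte("chunk") // blocks forever: the next stage never reads
		close(done)
	}()

	// Downstream stage, playing the timeout toxic: deliberately never reads
	// from pipe. A buffered channel would only postpone the block until the
	// buffer fills.

	<-done // hangs, like the API call waiting on the toxic pipeline;
	// `go run` reports: fatal error: all goroutines are asleep - deadlock!
}
```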

I don't see the above scenario being particularly useful to anyone, since the timeout toxic is still active, and the hang can be avoided by ordering operations differently.
It may be possible to just disallow this scenario or adjust the timeout toxic functionality so it isn't an issue.

In-flight data is really hard to deal with...

@xthexder

The behaviour that's an issue right now is that if the user removes the timeout toxic, all the backed-up data will still be sent.
If we modified the timeout toxic to drop data instead of allowing a timeout to be "aborted", then everything would work fine internally.
The only issue is that removing the timeout would then corrupt streams, so we would need to close connections early rather than continuing them.
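
For illustration, a minimal sketch of that change, assuming the toxics package plumbing (a ToxicStub with Input and Interrupt channels and a Close method; not necessarily the exact current signatures):

```go
package toxics

import "time"

// Sketch only: a timeout toxic that drains its input instead of buffering it.
type TimeoutToxic struct {
	Timeout int64 // milliseconds; 0 means block forever
}

func (t *TimeoutToxic) Pipe(stub *ToxicStub) {
	var timer <-chan time.Time // nil channel: never fires when Timeout == 0
	if t.Timeout > 0 {
		timer = time.After(time.Duration(t.Timeout) * time.Millisecond)
	}
	for {
		select {
		case <-timer:
			stub.Close() // timeout elapsed: close the connection outright
			return
		case <-stub.Interrupt:
			return // toxic removed: the connection gets closed, not resumed
		case <-stub.Input:
			// Drop the chunk on the floor rather than buffering it, so
			// upstream toxics (like latency) never block on their writes.
		}
	}
}
```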

This change would likely fit well with the timeout suggestion in #149

Thoughts?

@xthexder

(another quick thought)
API operations should probably have a timeout and some form of recovery so that other API calls can still be made afterwards.
While we want to stop the API from ever hanging, I'm sure there'll be more edge cases in the future. Error codes are much nicer to look at than nothing.
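
As a sketch of the idea (generic Go, not tied to the actual API handlers): bound each operation with a deadline, and turn a blown deadline into an error response instead of a hang:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithTimeout is a hypothetical wrapper: it runs op in a goroutine and
// gives up after d, so a stuck operation yields an error instead of a hang.
func runWithTimeout(d time.Duration, op func() error) error {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- op() }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// Note: the goroutine may still be stuck; this only unblocks the API.
		return errors.New("operation timed out")
	}
}

func main() {
	err := runWithTimeout(time.Second, func() error {
		select {} // simulate a toxic removal that never returns
	})
	fmt.Println(err) // prints "operation timed out" after one second
}
```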


sirupsen commented Feb 11, 2017

> Sorry for the delay on this, I've been basically offline for the last 2 weeks due to school events and assignments.

All good Jacob! Hope it all went well. Congratulations on your ring 💍

You're right, I agree, the timeout toxic is indeed special. I think:

> If we modified the timeout toxic to drop data instead of allowing a timeout to be "aborted", then everything would work fine internally.

Makes sense. Using the timeout toxic with another toxic doesn't have obvious behaviour. If we decide on and document this behaviour, I think it's appropriate that it drops the data on the floor. For other cases, people should use the latency toxic with a high latency instead.

> This change would likely fit well with the timeout suggestion in #149

Goes hand in hand with that. It makes the timeout toxic stand out and have very interesting behaviour IMO.

> API operations should probably have a timeout and some form of recovery so that other API calls can still be made afterwards.

I'm not sure I completely understand this, can you explain how the timeout toxic caused the API to hang?

> While we want to stop the API from ever hanging, I'm sure there'll be more edge cases in the future. Error codes are much nicer to look at than nothing.

Definitely agree with this.

@sirupsen

Hey @xthexder, any update on this? Anything I can do?


jpittis commented Mar 19, 2017

Three options for fixing this bug are:

  1. Have the timeout toxic drop input data on the ground. On deletion, connections need to be closed to avoid stream corruption.
  2. Have the timeout toxic infinitely buffer input data. On deletion, the data needs to be forwarded.
  3. Create an intermediate infinite buffer between toxics which will guarantee that a toxic will never block when forwarding data to the next toxic.

Option 1 seems more logical to me, but it would break backwards compatibility.

I believe 1 and 2 both require a "cleanup" phase when the toxic is deleted. We could maybe add a user-specifiable toxic.Cleanup(stub) to the toxic interface. This would also break backwards compatibility with user-created toxics, because they would need to implement Cleanup before upgrading.
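
Concretely, the interface change could look something like this (a sketch; the method name and placement are only a suggestion, not existing code):

```go
package toxics

// Sketch of the proposed extension. Cleanup becomes a required method of the
// existing Toxic interface, which is exactly why user-created toxics would
// break: they'd all need to implement it before upgrading.
type Toxic interface {
	Pipe(stub *ToxicStub)
	// Cleanup runs when the toxic is deleted, e.g. so the timeout toxic can
	// close its connections instead of replaying buffered data.
	Cleanup(stub *ToxicStub)
}
```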

I can't see a way to fix this properly without a major version release. The change doesn't seem too hard, but assuming we're using semantic versioning properly, this is a 3.* fix because it's not backwards compatible.

Option 3 seems like an overhaul of the toxic flow, and I don't know what performance issues a bunch more intermediate buffers would cause. Rather than adding complex buffering, I think it makes sense to force all toxics to handle data on stub.Input, which means choosing 1 or 2.
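
For concreteness, option 3 would mean wiring something like this between every pair of toxics (a standalone sketch, not toxiproxy code; the unbounded in-memory queue is what raises the performance worry):

```go
package main

import "fmt"

// unboundedBuffer forwards chunks from in to out, queueing them in memory so
// that a writer to in never blocks, however slowly out is drained.
func unboundedBuffer(in <-chan []byte, out chan<- []byte) {
	var queue [][]byte
	for in != nil || len(queue) > 0 {
		var send chan<- []byte // nil unless we have something to send
		var head []byte
		if len(queue) > 0 {
			send = out
			head = queue[0]
		}
		select {
		case chunk, ok := <-in:
			if !ok {
				in = nil // input closed: flush the queue, then stop
				continue
			}
			queue = append(queue, chunk)
		case send <- head: // nil send channel blocks, so this only fires
			queue = queue[1:] // when the queue is non-empty
		}
	}
	close(out)
}

func main() {
	in := make(chan []byte)
	out := make(chan []byte)
	go unboundedBuffer(in, out)

	// All three sends are accepted even though nothing reads out yet.
	for i := 1; i <= 3; i++ {
		in <- []byte(fmt.Sprintf("chunk %d", i))
	}
	close(in)

	for chunk := range out {
		fmt.Println(string(chunk)) // chunk 1, chunk 2, chunk 3
	}
}
```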

TL;DR, what do I propose?

Release option 1 with Toxiproxy 3.* if we're following semantic versioning, or sooner if not. I'm not getting the feeling that this fix is urgent.

@sirupsen

I think we should go with (1); this is more of a bug than a feature we need to stay backwards compatible with. I don't think it's too obvious that the data would be returned after the timeout, so I think we should ignore that.

> I'm not getting the feeling that this fix is urgent.

It actually kind of is: we're still recommending 2.0.0 because of this bug (but maybe it's present in 2.0.0 as well, and I can remove this?) #162.

In short, I think we should go with (1) and bump to 2.1.1.


jpittis commented Mar 19, 2017

I was able to reproduce this bug with version 2.0.0 on sha 711c9e4. I'll open a PR reverting that fix and work on a fix with solution (1).


Awesome, good work @jpittis 🎉


jpittis commented Apr 4, 2017

This was fixed with #169 and #168 and will be released in toxiproxy version 2.1.1 shortly.

Here's a local screenshot of me recreating the test demonstrated above on the current master branch. Notice that deleting the latency toxic no longer hangs.

[screenshot: screen shot 2017-04-04 at 7.44.48 pm]
