Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discard ping replies if rtt larger than timeout (-t) #32

Closed
zalexua opened this issue Nov 11, 2012 · 33 comments
Closed

Discard ping replies if rtt larger than timeout (-t) #32

zalexua opened this issue Nov 11, 2012 · 33 comments

Comments

@zalexua
Copy link

zalexua commented Nov 11, 2012

If I got it correctly - in "none default" mode ("batch mode" next) fping will receive and will process ECHO replies for previous requests as success till it executes last request disregarding that replies have been received much later that specified in -t option.

For example ping to a site with long distance.

# time ./fping -c1 -t50 zabbix.jp

zabbix.jp : xmt/rcv/%loss = 1/0/100%

real    0m0.078s
user    0m0.000s
sys 0m0.000s

# time ./fping -c2 -t50 zabbix.jp
zabbix.jp : [0], 84 bytes, 175 ms (175 avg, 0% loss)

zabbix.jp : xmt/rcv/%loss = 2/1/50%, min/avg/max = 175/175/175

real    0m1.079s
user    0m0.000s
sys 0m0.000s


# time ./fping -c5 -t50 zabbix.jp
zabbix.jp : [0], 84 bytes, 174 ms (174 avg, 0% loss)
zabbix.jp : [1], 84 bytes, 174 ms (174 avg, 0% loss)
zabbix.jp : [2], 84 bytes, 174 ms (174 avg, 0% loss)
zabbix.jp : [3], 84 bytes, 174 ms (174 avg, 0% loss)

zabbix.jp : xmt/rcv/%loss = 5/4/20%, min/avg/max = 174/174/174

real    0m4.082s
user    0m0.000s
sys 0m0.000s

As we see above one last response always lost because fping finished its work before last response came.

And with none default -p option, so two last responses are lost while first response has been catched:

# time ./fping -c3 -t50 -p50 zabbix.jp
zabbix.jp : [0], 84 bytes, 174 ms (174 avg, 66% loss)

zabbix.jp : xmt/rcv/%loss = 3/1/66%, min/avg/max = 174/174/174

real    0m0.178s
user    0m0.000s
sys 0m0.000s


I'd consider this as a fping's bug because it's nasty inconsistency which depends on number of requests sent and it's absolutely not clear.

I believe that timeouts in none default "batch" mode should work as expected disregarding on number of requests.
So in all my examples I should get 100% lost responses.

@zalexua
Copy link
Author

zalexua commented Jan 16, 2017

I've tested the 3.16-rc1
I cannot confirm that it has changed described behavior in any way.

# time ./fping -c3 -t50 -p100 zabbix.jp
zabbix.jp : [0], 84 bytes, 238 ms (238 avg, 66% loss)

zabbix.jp : xmt/rcv/%loss = 3/1/66%, min/avg/max = 238/238/238

real    0m0.659s
user    0m0.000s
sys     0m0.000s

# ./fping -v
./fping: Version 3.16-rc1
./fping: comments to david@schweikert.ch

Don't want to be annoying, but this "issue" in NOT fixed.

@schweikert
Copy link
Owner

So, you mean that responses received after the timeout specified in -t should be discarded?

@schweikert schweikert reopened this Jan 16, 2017
@schweikert
Copy link
Owner

I can understand the argument, but I wonder if the current behavior is possibly also preferred by others and possibly relied upon.

@zalexua
Copy link
Author

zalexua commented Jan 16, 2017

Yes, it's an arguable thing - should be discarded or not.
My own opinion - the responses should be discarded.
I.e. options -t (timeout) and -c (count of packets) should not "depend each other" regarding received result.

I belong to a Zabbix team (NMS application), where fping (with "not default" mode, i.e. with -C option) is used to perform icmp ping checks.
Most of fping's options may be controlled in Zabbix, so for example someone wants to change parameters to use not default number of packets (-c) while leave the same default timeout (-t) - and in such case final result may get different.
https://www.zabbix.com/documentation/3.2/manual/config/items/itemtypes/simple_checks#icmp_pings

As for "backward compatibility" - I'm not sure how to resolve that, but current behavior is illogical for me, so I consider it as a bug, which is "to be fixed" instead of keep things "as is" because someone maybe relies it.

@schweikert schweikert changed the title relation between -t and -p options in "none default" mode. -t prossesed not as expected Discard ping replies if rtt larger than timeout (-t) Feb 14, 2017
@schweikert
Copy link
Owner

Here is an alternative proposal:

  • forbid usage of -t in conjunction with -c/-C/-l
  • set "timeout" automatically to the the value of -p (period), so also for the last ping we wait for the correct amount of time.

This is probably easier to understand for the users, instead of having a separate timeout option.

What do you think?

@richlv
Copy link

richlv commented Feb 14, 2017

@schweikert, thanks for considering this, @zalexua - thanks for reporting :)
after thinking this over, the overriding of the timeout by period seems rather unintuitive to me.
it is also inconsistent - normally -t sets the timeout, then suddenly it's -p...
it seems to me that obeying -t always normally and not tying it to -c in slightly obscure ways would still be the best.

also, if i would like to do 5 pings with 5 seconds between them, and one second timeout, the -p proposal would make that impossible (without a bit messy wrapper script).

@zalexua
Copy link
Author

zalexua commented Feb 14, 2017

@schweikert after thinking and discussing your suggestion with one more Zabbix folk here, we do not consider it as an optimal solution. Saying more specifically - it would partially break fping usage in Zabbix.
As @richlv said above - we also will lose flexibility, which currently is available for end users.

Today you have updated issue title and l want to say that I like it - it cleanly reflects what should be changed. I suppose it should not be any hard technically.

@schweikert
Copy link
Owner

schweikert commented Feb 14, 2017

The problem with that is that in loop/count mode, the default timeout of 500ms is really low. In non-loop/count mode you have retries and the backoff factor, but not in loop/count mode. So this would mean starting to discard perfectly fine ping replies, and it would surprise fping users.

In which way it would break Zabbix?

Here is another proposal:

  • Implement the discarding of ping replies that come after the timeout specified as -t
  • Change the default timeout from 1 second to the value of -p, if -t is not specified (which would mostly match the current behavior of fping, except for the last ping).

@zalexua
Copy link
Author

zalexua commented Feb 14, 2017

You said that default value for -t is 1 second, but "man fping" says it's 500 ms. I see the same in sources (#define DEFAULT_TIMEOUT 500). So I'll use 500 ms further, fix me please if I'm wrong.

Thinking on - what people use fping with -C option ... I don't think these people want to measure ICMP pings which are longer than 500 ms, by default. If they use count of packets (-C) more than 1 (which is logical), so they should understand that there is some timeout and it has some affect. If I would expect replies with more than 500 ms, I'd need to explicitly allow that by big value for -t option, but not juggling by value for -C option.
Replies which are longer than timeout -t (default one or customized) are not fine, IMO :)

I said it partially would break fping usage in Zabbix. In Zabbix it's possible to specify custom values for -t and -p options separately. If they would be "merged" on fping side, then current zabbix functionality for icmp ping checks would be partially broken.
If you will forbid usage of -t in conjunction with -c/-C/-l - it will break fping usage completely for ~60-90% Zabbix users, because I'd say that the ~60-90% users do customize the "timeout" in icmp checks.

Your latest proposal sounds good to me. Let me collect more opinions here, will get back to you soon.

@richlv
Copy link

richlv commented Feb 14, 2017

the proposal in #32 (comment) seems reasonable to me

@schweikert
Copy link
Owner

schweikert commented Feb 14, 2017

You are right: it's 500ms by default, not 1000ms. I have corrected it in my comment.

Also, I agree that "forbidding" the usage of -t together with -c/-C/-l would be bad and indeed break many installations, so that's certainly a no go. My main concern is not to break existing installations (and I argue that people might not expect pings > 500ms to be discarded when in loop mode).

@dimir
Copy link

dimir commented Feb 15, 2017

The decision to drop packet if rtt is over timeout specified with -t is right IMO. However, I don't see an idea to connect -t and -p options in any way as good one.

@zalexua
Copy link
Author

zalexua commented Feb 15, 2017

@dimir, thanks for your feedback. Note that -t and -p supposed to be "connected" only if -t is not customized. It sounds not so bad.
And it still will be much better than current behavior.
We still need to find out some compromise to have a chance for requested change.

@schweikert, I think I've collected all opinions I wanted to get. Hope to see some further movements :)

@dimir
Copy link

dimir commented Feb 15, 2017

Sure, I understand that this is only when -t is not specified. And I agree, the proposed solution is not so bad, but not perfect either, IMO. Also I agree that it will be better than current behavior.

@schweikert
Copy link
Owner

If you want to test, I have released fping-4.0-rc2 with these changes:
http://fping.org/dist/beta/fping-4.0-rc2.tar.gz

Testing would be much appreciated!

@zalexua
Copy link
Author

zalexua commented Feb 16, 2017

I'll do it, definitely.

@zalexua
Copy link
Author

zalexua commented Feb 16, 2017

Similarly to "man", maybe in --help it worth to add ", up to 2000 ms" after "where it is the -p period" ?
To be like this: "where it is the -p period, up to 2000 ms"

Duplicated word "for for" in ChangeLog.

schweikert added a commit that referenced this issue Feb 17, 2017
@zalexua
Copy link
Author

zalexua commented Feb 17, 2017

"for for" is still in the Changelog

schweikert added a commit that referenced this issue Feb 17, 2017
@schweikert
Copy link
Owner

@zalexua: indeed, sorry. fixed now

@zalexua
Copy link
Author

zalexua commented Feb 17, 2017

I extensively tested the changes yesterday and today. Yesterday, on beginning (before I ran "ping zabbix.jp"), I had a weird result of loss replies, which I could not explain. After a few unexpected ping loss, I ran wireshak and I saw replies (I guess, unfortunately I didn't pay attention on packets itself, but only for timing) , i.e. with -c2 I had 4 packets appeared in wireshark and their timing was fine!
Raw results are here, just in case on http://pastebin.com/sji0vBNK
Interval between ran fping/ping commands was 10-30 seconds.

After "ping zabbix.jp" I also ran tcpdump on shell, but all further attempts were successful and/or correct.
Today's tests - also everything was correct.
Not sure what it was yesterday on beginning, whom I have to blame - my eyes or my mind ... so feel free to ignore my concerns.
TESTED !

@schweikert
Copy link
Owner

Thank you!

@tohojo
Copy link
Contributor

tohojo commented Apr 9, 2018

So just chiming in with the inevitable "you broke my use case" comment: I am using fping in Flent to measure bufferbloat, among other things. In this use case, the RTT can vary wildly within a single test, and we want to capture this. So discarding late pings is absolutely the wrong thing to do.

If I'm reading the code right, I can get the old behaviour by just passing a really large value to -t, right? Will that affect anything on old versions (when running in -c mode)?

@tohojo
Copy link
Contributor

tohojo commented Apr 9, 2018

Actually, playing with v4.0 a bit more, I'd argue that the current handling is wrong. If you're going to honour -t in counter mode (which is a good idea!), it should only apply as a timeout after the last packet has been sent. I.e., if the test runs long enough for earlier packets to be received, don't ignore them but report the result. That would make the new behaviour useful in my case as well :)

@schweikert
Copy link
Owner

The new behavior was implemented to make the results more consistent and predictable. In particular, it was considered problematic that if you sent 3 pings in count mode, you would possibly get a loss only on the third ping, because the timeout would only affect the last one.

The recommended way to measure is to set a long-enough period, so that you don't run into this problem.

@tohojo
Copy link
Contributor

tohojo commented Apr 9, 2018

Yeah, I did read the discussion; I'm just disagreeing that this is more "consistent and predictable" :)

If you are ignoring late replies you are basically reporting a host as down even though it isn't. And for my measurements I specifically want a high sampling interval, but still be able to tell when latency increases above that interval (which it quite frequently does).

If you're only running a few pings, I can see the point of the new behaviour, but if you're running a long test (I usually run with -p 200 -c 300 for a full minute test), it doesn't make sense...

On the other hand, having the timeout parameter relate only to the end of the test, would make it possible to enforce a maximum duration for the whole test (which I currently have to do by means of a watchdog timer). Alternative, a time-based parameter instead of -c (i.e., "keep sending pings at this interval for x seconds") would be useful...

@heistp
Copy link

heistp commented Apr 9, 2018

I second @tohojo 's comment. The current behavior of using -t for a per-packet timeout isn't desirable in most cases. RTT can jump suddenly for many reasons, and that information shouldn't be discarded by default. Perhaps there is a use case for a per-packet timeout, but it could be a separate flag.

Useful timeout values depend on the environment under test. I think a generally effective default for the final timeout is to wait after the last request has been sent for some multiple of the max RTT (say 3x max RTT), falling back to a fixed timeout if there were no replies, and possibly placing a limit on the timeout if the max RTT is something unreasonably high.

@schweikert
Copy link
Owner

@tohojo: you are free to disagree of course... For me having a different measurement for the last ping was a convincing enough argument to change that behavior. Note that you can still set a longer -t value, if you want to. Just note that it might not produce the results that you expect.

@tohojo
Copy link
Contributor

tohojo commented Apr 9, 2018

Right, I can see how having the last few measurements disappear when latency increases can be odd. But the right way to fix that is doing something like what @peteheist is suggesting, and adjusting the wait interval to the actual observed RTT, instead of just arbitrarily discarding measurement points as you are currently doing.

Latency variations are quite common in the internet, actually, and can be an order of magnitude above the base RTT.

But yeah, since this behaviour is in a release version of fping, I am going to have to work around it anyway...

@richlv
Copy link

richlv commented Apr 9, 2018

It seems to me that adjusting timeout based on observed RTT will be extremely confusing. It could be some additional feature (market it as "adaptive AI"...), but the normal flags should be easily understandable and testable by average user.

Sorry if I missed this in the comments, but what is the issue with specifying as large timeout as you want to handle?

@zalexua
Copy link
Author

zalexua commented Apr 9, 2018

I want to support @schweikert 's words that "The new behavior was implemented to make the results more consistent and predictable". That's the key point.
Previous logic was very unclear.

@heistp
Copy link

heistp commented Apr 9, 2018

The argument for using a factor of max RTT is that you minimize both the total test time and the likelihood that you discard the last packet erroneously, which I think was the original problem reported. I think most users would be happy with this behavior without having to understand it.

The issue with specifying a large timeout is that in case the last reply from the server is lost, there could be an unnecessarily long wait time. You might choose 15 seconds as a time that you think you'll "never" see, but why wait 15 seconds at the end of the test when the maximum round-trip time you saw was 250 milliseconds?

But I think the choice of wait strategy at the end of the test is less consequential than the choice to time out individual round-trips. Knowing when and how often there are high round-trip times is often exactly what you're interested in, as it characterizes any bufferbloat, WiFi interference, hardware problems, etc. If those results are clipped, you can't differentiate between high RTT and packet loss.

Lastly, I empathize. Anyone who thinks that writing/maintaining a ping-like utility is "easy" either hasn't done it or hasn't been dragged through various corner cases. So, thanks for fping! :)

@bkuker
Copy link

bkuker commented Aug 20, 2019

I hate to ask for clarification on a closed issue, but does this change mean that if my period is 20ms, my rtt is 40ms, and my count is 100, fping will return 100% packet loss, rather than permitting multiple requests to be in flight simultaneously?
Shouldn't the sequence number solve this problem?

@zalexua
Copy link
Author

zalexua commented Aug 20, 2019

You are right, in your case 100% packets will be considered as without reply - packet lost.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants