process_info/2 returns empty messages list #7413

Closed
door opened this issue Jun 18, 2023 · 13 comments · Fixed by #7595
Assignees
Labels
bug Issue is reported as a bug team:VM Assigned to OTP team VM

Comments

door commented Jun 18, 2023

Describe the bug
process_info(self(), messages) returns an empty list in some cases.

To Reproduce

-module(check_messages).
-export([check_messages/0]).

check_messages() ->

  Self = self(),
  spawn_link(fun() ->
    Self ! foo,
    Self ! bar,
    receive _ -> ok end
  end),

  %% Messages the process sends to itself are visible to process_info:
  %% self() ! foo, self() ! bar,
  %% a receive expression makes messages visible too:
  %% receive xyz -> ok after 1000 -> ok end,

  timer:sleep(1000),

  {messages, Messages} = process_info(self(), messages),
  io:format("Messages: ~p~n", [Messages]),
  [_|_] = Messages, % crash
  ok.

Expected behavior
process_info(self(),messages) always returns all messages contained in its mailbox.

Affected versions
26.0.1
25.3.2 (partially)

Additional context
I checked it on linux, from shell:

rm -f check_messages.beam && erlc check_messages.erl && erl -noinput -run check_messages check_messages -run erlang halt

and from erlang repl:
1> check_messages:check_messages().

Version 26.0.1 fails in both cases, 25.3.2 gives the error in the Erlang repl but not in the shell, and 24.3.4.6 has no error in either case.

@door door added the bug Issue is reported as a bug label Jun 18, 2023
@IngelaAndin IngelaAndin added the team:VM Assigned to OTP team VM label Jun 19, 2023
jhogberg (Contributor) commented Jun 19, 2023

Thanks for your report! process_info/2 does return all the messages in the mailbox; the mailbox just happens to be empty a second later, even though you have sent messages to the process.

As per the documentation, emphasis mine:

The amount of time that passes between the time a signal is sent and the arrival of the signal at the destination is unspecified but positive. If the receiver has terminated, the signal does not arrive, but it can trigger another signal. For example, a link signal sent to a non-existing process triggers an exit signal, which is sent back to where the link signal originated from. When communicating over the distribution, signals can be lost if the distribution channel goes down.

The only signal ordering guarantee given is the following: if an entity sends multiple signals to the same destination entity, the order is preserved; that is, if A sends a signal S1 to B, and later sends signal S2 to B, S1 is guaranteed not to arrive after S2. Note that S1 may, or may not have been lost.

Despite occurring a second later, the process_info/2 signal happens to arrive before the messages sent from the other process you spawned. When you send messages to yourself instead, those messages are ordered before the process_info/2 signal due to the above ordering guarantee, and result in process_info/2 returning those messages.

Note that the latter applies to the other process as well: if it calls process_info/2 after sending the messages, those messages are guaranteed to arrive before the process_info/2 signal.
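This self-send guarantee can be sketched in a few lines (hypothetical module name self_order; it assumes the calling process starts with an empty mailbox):

```erlang
-module(self_order).
-export([demo/0]).

%% Messages a process sends to itself are signals from self() to
%% self(), so they are ordered before the later process_info/2
%% request signal and are guaranteed to show up in the result.
demo() ->
    self() ! foo,
    self() ! bar,
    {messages, Msgs} = erlang:process_info(self(), messages),
    Msgs.
```

With an empty mailbox beforehand, demo/0 always returns [foo, bar]; no such guarantee exists for messages sent by other processes.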

@jhogberg jhogberg added not a bug Issue is determined as not a bug by OTP and removed bug Issue is reported as a bug labels Jun 19, 2023
@jhogberg jhogberg self-assigned this Jun 19, 2023
door (Author) commented Jun 19, 2023

Thanks for your response!

But the receive operator works as usual despite this documented uncertainty; there is no meaningful delay before it gets the messages.

And there is no mention of this difference in behavior between receive and process_info/2 (at least I did not find it in the docs). Maybe it is worth noting in the documentation that process_info/2 acts differently from receive?

jhogberg (Contributor) commented

There is no difference in behavior, you can get the same result with an ordinary receive as well. What you're missing is that process_info/2 sends a signal and waits for one in return. As the signals between Self and Self have no defined order relative to the signals between the spawned process and Self, the former can arrive before the latter regardless of when the signals were sent.

-module(check_messages).
-export([check_messages/0]).

check_messages() ->

  Self = self(),
  spawn_link(fun() ->
    Self ! foo,
    Self ! bar,
    receive _ -> ok end
  end),

  timer:sleep(1000),

  Self ! hello,
  receive Msg -> foo = Msg end, %% Can also crash because 'hello' arrives first

  ok.

door (Author) commented Jun 19, 2023

Thank you for your answer!

I tried calling process_info/2 in a loop and comparing it with receive.
It looks like process_info requires something else to complete before it sees the messages.

-module(check_messages).

-export([check_messages/1]).
-export([main/1]).

check_messages(Delay) ->
  Self = self(),
  spawn_link(fun() ->
    Self ! foo,
    receive _ -> ok end
  end),

  T0 = erlang:system_time(millisecond),
  fun Loop(Iter) ->
    case messages('process_info') of
    %% case messages('receive') of % this works quickly
      [] ->
        timer:sleep(Delay),
        Loop(Iter+1);
      [_|_] ->
        io:format("time: ~bms iter: ~b~n", [erlang:system_time(millisecond) - T0, Iter])
    end
  end(0),

  ok.

%% this clause gives 1 iteration
messages('receive') ->
  receive
    M ->
      [M|messages('receive')]
  after 0 ->
      []
  end;

%% this clause gives 1198 iterations
messages(process_info) ->
  {messages, Messages} = process_info(self(), messages),
  Messages.


main([Delay]) ->
  check_messages(list_to_integer(Delay)).

No matter how long the delay is, it takes about 1198 calls to process_info/2 to see the message,
while receive .. after 0 always returns the message on the second call.

% erl -noinput -run check_messages main 1 -run erlang halt
time: 2399ms iter: 1198

% erl -noinput -run check_messages main 2 -run erlang halt
time: 3602ms iter: 1201

% erl -noinput -run check_messages main 5 -run erlang halt
time: 7190ms iter: 1198

% erl -noinput -run check_messages main 10 -run erlang halt
time: 13177ms iter: 1198

% erl -noinput -run check_messages main 100 -run erlang halt
time: 120999ms iter: 1198

stolen (Contributor) commented Jun 22, 2023

I researched this a bit based on @door's example: http://tryerl.seriyps.ru/#id=dab4.

-module(main).
-export([main/0]).
main() ->
  Self = self(),
  
  %% WORKAROUND 1: off-heap MQD fixes getting own messages
  %process_flag(message_queue_data, off_heap),
  %% UN-WORKAROUND 1: io:format removes the positive effect of the off-heap MQD
  %io:format("xx\n"),
  %% UN-WORKAROUND 2: any non-zero sleep breaks things again as well
  %receive after 1 -> broken end,
  %timer:sleep(1),
  
  %% External printer to avoid receiving messages
  P = spawn(fun Pr() -> receive {Tag, Term} -> io:format("~w ~p~n", [Tag, Term]), Pr() end end),
  %% External process which sends messages to this process
  spawn(fun() -> Self ! foo, Self ! bar, receive _ -> ok end end),
  
  %% External process calling process_info on this one
  spawn(fun() -> timer:sleep(1100), P ! {r, process_info(Self, [messages])} end),
  
  %% After 1 second messages are definitely in the queue
  timer:sleep(1000),
  
  %% WORKAROUND 2: Searching in the message queue fixes later process_info
  %receive {{{nonsense}}} -> ok after 0 -> ok end,
  
  %% WORKAROUND 3: GC fixes later process_info
  %garbage_collect(),
  
  %% process_info on self consistently sees an empty mailbox
  P ! {l1, process_info(Self, [ message_queue_len, messages])},
  erlang:yield(),
  receive after 0 -> ok end,
  timer:sleep(30),
  P ! {l2, process_info(Self, [message_queue_len, messages])},
  
  %% After another process gets the process_info of this one, process_info on self works too
  timer:sleep(200),
  P ! {l3, process_info(Self, [message_queue_len, messages])},
  
  timer:sleep(400),
  ok.

This example shows that process_info called by another process is more reliable than process_info called by the process on itself:

l1 [{message_queue_len,0},{messages,[]}]
l2 [{message_queue_len,0},{messages,[]}]
r [{messages,[foo,bar]}]
l3 [{message_queue_len,2},{messages,[foo,bar]}]

It may be related to the mechanism that moves messages from the signal queue to the message queue.
This behaviour is very surprising to me, and I think it is a bug.

This may affect code which uses process_info for overload protection.

@jhogberg, can you please re-evaluate the severity of this issue?

stolen (Contributor) commented Jun 26, 2023

I found that logger's overload protection also uses process_info(self(), message_queue_len), and crafted an example of the erroneous behaviour.

-module(olptest).

-export([check/0]).

% Callbacks
-export([init/1, handle_info/2, handle_load/2]).

init(#{latency := _, printer := _} = Opts) ->
  {ok, Opts}.

handle_info(Msg, #{latency := Lat, printer := Printer} = State) ->
  Printer ! {info, Msg},
  timer:sleep(Lat),
  {load, State}.

handle_load(Msg, #{printer := Printer} = State) ->
  Printer ! {load, Msg},
  State.



check() ->
  % Print via external process to avoid calling receive on synchronous calls
  Printer = spawn_link(fun P() -> receive Msg -> io:format("handle ~0p~n", [Msg]) end, P() end),
  % When the message queue is larger than 5, OLP must raise an error
  {ok, Pid, _} = logger_olp:start_link(test_olp, ?MODULE, #{latency => 200, printer => Printer}, #{overload_kill_qlen => 5, overload_kill_enable => true}),

  % To avoid races, the first pack is sent while the OLP is suspended.
  % This is not necessary on my computer, but the number of messages accepted at first may vary
  erlang:suspend_process(Pid),
  [Pid ! {msg, N} || N <- lists:seq(1, 5)],
  io:format("Resuming OLP with 5 messages in queue~n"),
  erlang:resume_process(Pid),

  % When OLP is already processing the first message, add 10 more messages
  timer:sleep(100),
  [Pid ! {msg, N} || N <- lists:seq(6, 15)],
  io:format("Added 10 more messages~n"),

  % And when it is processing the second message, add another 10
  timer:sleep(100),
  [Pid ! {msg, N} || N <- lists:seq(16, 25)],
  io:format("Added 10 messages again~n"),

  {ok, Pid, Printer}.

Here, I configure logger_olp to crash if message_queue_len is larger than 5.
By the time logger_olp:check_load/1 is called for the first time, logger already has 14 unprocessed messages.
By the second call, there are 23 messages.
Despite that, the first five load checks (during the first pack of messages) pass.

Erlang/OTP 26 [erts-14.0.1] [source-0a08ad475a] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:1] [jit]

Eshell V14.0.1 (press Ctrl+G to abort, type help(). for help)
1> olptest:check().
Resuming OLP with 5 messages in queue
handle {info,{msg,1}}
Added 10 more messages
handle {load,{msg,1}}
handle {info,{msg,2}}
Added 10 messages again
{ok,<0.88.0>,<0.87.0>}
handle {load,{msg,2}}
handle {info,{msg,3}}
handle {load,{msg,3}}
handle {info,{msg,4}}
handle {load,{msg,4}}
handle {info,{msg,5}}
handle {load,{msg,5}}
handle {info,{msg,6}}
** exception error: {shutdown,{overloaded,19,10960}}

jhogberg (Contributor) commented

This behaviour is very surprising for me and I think it is a bug.

It's not a bug. process_info/2 is documented as following the signaling order guarantees as described in the reference manual, and all of the examples in this thread follow that perfectly.

@rickard-green is working on a signal queue monitoring feature that will be better suited for checking overload than polling with process_info/2. We plan to have it in OTP 27.

rickard-green (Contributor) commented

This is not a bug, as @jhogberg stated. I can understand that it is somewhat surprising, though. However, the examples above are good examples of what not to do; that is, they do not execute receive expressions in a process that is expected to receive messages. A process that is sent lots of messages should typically execute receive expressions frequently. Outstanding messages will then eventually enter the message queue and will be part of the result of process_info(_, message_queue_len).

However, we will have a look at a solution that cuts the very long delivery times for outstanding messages that can appear when no receive expressions are executed. Unfortunately, this will not be free: there will be a performance penalty, but hopefully we can make it small. Note that you should still not expect a sent message to appear in the receiver's message queue instantaneously. Such a guarantee would be a very effective performance killer and won't be introduced. It will always take time for a message to travel from the sender to the receiver, also after this change, and there will still be no upper limit on that time. A solution like this should be possible to introduce in OTP 26.1.
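The advice above can be sketched as follows (hypothetical module name flush_demo): executing any receive expression, even one that matches nothing, fetches outstanding signals into the message queue, after which message_queue_len reflects them.

```erlang
-module(flush_demo).
-export([queue_len/0]).

%% A selective receive with a zero timeout matches no real message,
%% but scanning for a match forces outstanding message signals to be
%% fetched into the message queue. After that, process_info/2 on self
%% reports them.
queue_len() ->
    receive '$no_such_message$' -> ok after 0 -> ok end,
    {message_queue_len, Len} =
        erlang:process_info(self(), message_queue_len),
    Len.
```

This is the same trick as "WORKAROUND 2" in the example earlier in the thread; it leaves the mailbox contents untouched.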

stolen (Contributor) commented Jun 27, 2023

Thank you for the explanation!

I do understand the performance penalty of always giving exact message queue information,
and this multi-buffering seems to me like a very good optimisation.
The problem for me now is the lack of a simple introspection mechanism.

As just a user, I expect a sent message to be observable; now it just "disappears" (if I use the documented API).
As a more advanced user, I now know about the signal queue and the inner/middle/outer message queues, but I could not find any way to observe or inspect them.

As for process_info.

  1. When the user explicitly asks for messages, is it possible to handle queued message signals (e.g. move them into the message queue)? This would obviously have a performance penalty (and should have a warning in the documentation), but it would allow simple introspection. Anyway, getting messages without receive (as I have seen on GitHub) is used only for better error reports, when performance does not matter any more.
  2. When returning message_queue_len, is it possible to return inner_queue_len + middle_queue_len + outer_queue_len instead? This may require locking, but perhaps the overhead would be small, and the user would not have to deal with implementation details.

jhogberg (Contributor) commented

When user explicitly asks for messages, is it possible to handle queued message signals (e.g. move them into the message queue) ? This obviously would have a performance penalty (and should have a warning in the documentation), but would allow a simple introspection. Anyway, getting messages without receive (as I saw on github) is used only for better error reports, when performance does not matter any more.

I don't think this would work well. Simplifying things a bit, we'd either have to break the signal order to do this (a non-starter, since it would affect the target process), or stop at the first non-message signal we encounter (making it more or less useless for processes whose clients tend to use monitors with their requests, e.g. gen_server ...).

When returning message_queue_len, is it possible to return inner_queue_len + middle_queue_len + outer_queue_len instead? This may require locking, but maybe overhead would be small, and the user will not have to deal with implementation details.

The reason we haven't done this is that it would be very expensive, particularly for processes that receive messages from many different processes, which tend to be exactly the ones you want overload protection for. You would have to take 1-64 locks (depending on the parallel signal optimization) to move signals from the outer to the middle queue, then linearly walk through the signals in the middle queue to count them, and then do the same for the message queue (the "inner queue") itself.

Oh, and all this work is done on the target process and not the one making the request. There's lots of software in the wild that calls process_info(Pid, message_queue_len) on arbitrary/all processes and that would now suddenly tank the system.

stolen (Contributor) commented Jun 27, 2023

Thanks!
Things appear to be much more complex than I initially supposed.

process_info(Pid, message_queue_len)
that would now suddenly tank the system.

Yep, that was slow and caused heavy lock contention in our project a few years ago.
We moved to send_after and to measuring the time a message stays in the queue before being handled. Not a silver bullet, but it works for us.
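A minimal sketch of that approach (hypothetical module name lag_probe; our real implementation differs): a timer message carries its intended delivery time, and on receipt the process compares it with the current time to estimate how long messages sit in its queue.

```erlang
-module(lag_probe).
-export([measure/1]).

%% Sends this process a timer message due in Interval milliseconds,
%% tagged with its intended delivery time, then receives it and
%% returns how many milliseconds late it was handled. The result
%% approximates how far behind the process is on its mailbox.
measure(Interval) ->
    Target = erlang:monotonic_time(millisecond) + Interval,
    erlang:send_after(Interval, self(), {probe, Target}),
    receive
        {probe, Target} ->
            erlang:monotonic_time(millisecond) - Target
    end.
```

Unlike polling process_info(Pid, message_queue_len) from outside, this costs the observed process only one message per probe and takes no extra locks.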

@rickard-green rickard-green linked a pull request Aug 27, 2023 that will close this issue
rickard-green (Contributor) commented

@door @stolen PR #7595 addresses the potentially long message signal delivery times. I've reconsidered this a bit and now consider the current behavior a bug, since the message signal delivery time could potentially be infinite. We will release patches for OTP 25 and OTP 26; I cannot say how quickly they will be out, though. The code in #7595 is based on OTP 25.3.2 and should merge cleanly on top of all larger OTP versions if you want to try it out on another version.

@rickard-green rickard-green reopened this Aug 27, 2023
@rickard-green rickard-green added bug Issue is reported as a bug and removed not a bug Issue is determined as not a bug by OTP labels Aug 27, 2023
rickard-green (Contributor) commented

The fixes in PR #7595 have been released in OTP 25.3.2.6 and OTP 26.1.
