Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CENNSO-2033] rework logging #219

Merged
merged 4 commits into from
May 15, 2024
Merged

[CENNSO-2033] rework logging #219

merged 4 commits into from
May 15, 2024

Conversation

RoadRunnr
Copy link
Member

No description provided.

@RoadRunnr RoadRunnr requested a review from a team as a code owner May 13, 2024 09:39
ebengt
ebengt previously approved these changes May 13, 2024
Copy link

@ebengt ebengt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good

{noreply, State};

handle_info(Info, State) ->
?LOG(warning, "handle_info: ~p", [Info]),
{noreply, State}.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It used to be recommended in a gen_server to have a cluse handle_info of unknown content. Perhaps it is no longer a good idea.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd keep it in notice, like "this is not normal but it's not causing an error". It would avoid a crash. But I don't see this proposal as much better anyway 🤷🏻

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Production code should IMHO not get call/cast/info message that are not handled. Sure, it is always possible to send something though a remote shell, but I would that not call a normal situation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@javiermtorres @ebengt I would be ok with a policy to always add catch all clauses for those with logs at level alert (or even emergency). That would make them noisy enough to catch them, but still be somewhat safe.

Very high log levels are IMHO justified because missing a info/cast/call could mean that something important was missed and it is impossible so determine if it was important without understanding the message.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'd be fine with a high log level. The alternative is a crash (if we omit the catchall clause) that is equally visible.

Since infos catch any sent message, it's easier to stumble upon issues of processes sending messages to the wrong pid than calls/casts. I'd agree to not using catchall clauses for calls/casts.

src/ergw_aaa_session.erl Show resolved Hide resolved
@@ -107,7 +107,7 @@ handle_response({API, _}, #{'Result-Code' := Code}, Session, Events, State) ->
handle_response(Procedure, #{'Result-Code' := Code}, Session, Events, State) ->
{{fail, Code}, Session, [{stop, {Procedure, peer_reject}} | Events], State};
handle_response(_Procedure, Response, Session, Events, State) ->
?LOG(error, "Response: ~p", [Response]),
?LOG(debug, "unexpected response: ~p", [Response]),
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not a log level of warning be better here?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd agree to that 🤔

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i do not: https://www.erlang.org/doc/apps/kernel/logger_chapter.html#log-level

the audience of the logs is a user of software incorporating ergw_aaa. some external party floods your instance with garbage, the garbage is properly NOT handled (thus dropped). what is the error? what is the call to action to solve that error?

@RoadRunnr RoadRunnr force-pushed the CENNSO-2033/change-logging branch from 1c9f89c to 4e06e93 Compare May 13, 2024 10:54
javiermtorres
javiermtorres previously approved these changes May 13, 2024
src/ergw_aaa_diameter_srv.erl Outdated Show resolved Hide resolved
src/ergw_aaa_ro.erl Show resolved Hide resolved
src/ergw_aaa_ro.erl Show resolved Hide resolved
src/ergw_aaa_session.erl Outdated Show resolved Hide resolved
@@ -107,7 +107,7 @@ handle_response({API, _}, #{'Result-Code' := Code}, Session, Events, State) ->
handle_response(Procedure, #{'Result-Code' := Code}, Session, Events, State) ->
{{fail, Code}, Session, [{stop, {Procedure, peer_reject}} | Events], State};
handle_response(_Procedure, Response, Session, Events, State) ->
?LOG(error, "Response: ~p", [Response]),
?LOG(debug, "unexpected response: ~p", [Response]),

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd agree to that 🤔

@RoadRunnr RoadRunnr force-pushed the CENNSO-2033/change-logging branch 5 times, most recently from 85348b6 to 583b795 Compare May 13, 2024 14:39
RoadRunnr added 2 commits May 13, 2024 18:00
The setting is used by the rate limit checks. However, those tests are
designed to check for handling the rate limit only. The outstanding
requests limits is not checked and hitting it interferes with the rate
limit check, leading to spurious failure.

Raise the value to a limit that it will no longer affect the rate
limit tests.
The session process can not get an `EXIT` event from the owner since
it is never linked to it. This is a left over from very ancient
version of the session logic. Removed.

Include the session information in the log for a `DOWN` event.
@RoadRunnr RoadRunnr force-pushed the CENNSO-2033/change-logging branch 2 times, most recently from 6052df3 to d11c69f Compare May 14, 2024 08:14
@RoadRunnr RoadRunnr changed the base branch from master to CENNSO-2033/switch-to-OTP-26 May 14, 2024 08:14
RoadRunnr added 2 commits May 14, 2024 10:46
* change logs to lebel debug that where intendet for debugging only,
  but where not using the debug level
* remove obsolete debug logs
* clarify some non-debug logs
A reference count underflow indicates a logic bug in the reference
counting logic. Log it with proper criticallity, but otherwise try to
continue.
@RoadRunnr RoadRunnr force-pushed the CENNSO-2033/change-logging branch from d11c69f to f605150 Compare May 14, 2024 08:50
@RoadRunnr RoadRunnr requested a review from javiermtorres May 14, 2024 08:50
Base automatically changed from CENNSO-2033/switch-to-OTP-26 to master May 14, 2024 08:53
@RoadRunnr RoadRunnr merged commit 7a089e6 into master May 15, 2024
6 checks passed
@RoadRunnr RoadRunnr deleted the CENNSO-2033/change-logging branch May 15, 2024 09:55
@@ -264,19 +264,11 @@ handle_cast(stop, State) ->

handle_info({'DOWN', MRef, _Type, Pid, Info}, #state{peers = Peers0} = State)
when is_map_key(MRef, Peers0) ->
?LOG(alert, "got down for pending request ~0p (~p) with ~p",
[maps:get(MRef, Peers0), Pid, Info]),
?LOG(alert, "Diameter request process terminated unexpected with ~0tp (req: ~0tp, pid: ~0p)",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.erlang.org/doc/apps/kernel/logger_chapter.html#log-level

which "action must be taken immediatly"? is something dead already? is everything about to be shut down?

@@ -285,7 +277,8 @@ handle_info({'DOWN', MRef, _Type, Pid, _Info}, State) ->
{noreply, State};

handle_info(Info, State) ->
?LOG(warning, "handle_info: ~p", [Info]),
?LOG(alert, "unexpected info message, something important might have been missed: ~p",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.erlang.org/doc/apps/kernel/logger_chapter.html#log-level

what shall the user of the software incorporating ergw_aaa do with this message?
the component of the software using this library is not visible here, thus the user of the software is unable to deduce in which area a problem is, especially when the description of the loglevel "alert" is "action must be taken immediately".

_ ->
{{error, underflow}, Peers#{OH => Peer}}
?LOG(emergency, "reference counting underflow for Diameter peer ~0tp", [OH]),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.erlang.org/doc/apps/kernel/logger_chapter.html#log-level

the whole system became just unusable because the some diameter-peers vanished?

_Request, SvcName, {_, #diameter_caps{origin_host = {OH, _}}} = Peer, _CallOpts)
when length(Errors) /= 0 ->
%% the exact content of Errors is a bit unclear, dumping them is best we can do
?LOG(error, "~0tp: decode of Diameter answer ~0tp from ~0tp failed, errors ~0tp",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

https://www.erlang.org/doc/apps/kernel/logger_chapter.html#log-level

the operator of the software using ergw_aaa is bombarded with faulty encoded diameter messages. ergw_aaa deals with this by … just not accepting those messages. what is the "error"? from the perspective of the operator everything is running just fine.

from a developer perspective i can understand "error", but from a user perspective which needs to see those logs within normal operations i do not see it. in this area, an increase of an error-metric plus a log message on debug-level (which the operator can activate on demand to find out the peer involved, the message, the error itself).

when length(Errors) /= 0 ->
?LOG(error, "~p: decode of answer ~p from ~p failed, errors ~p", [SvcName, Msg, Peer, Errors]),
%% the exact content of Errors is a bit unclear, dumping them is best we can do
?LOG(error, "~0tp: decode of Diameter answer ~0tp from ~0tp failed, errors ~0tp",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my remarks before about error-log for diameter decoding errors.

when length(Errors) /= 0 ->
?LOG(error, "~p: decode of answer from ~p failed, errors ~p", [SvcName, Peer, Errors]),
%% the exact content of Errors is a bit unclear, dumping them is best we can do
?LOG(error, "~0tp: decode of Diameter answer ~0tp from ~0tp failed, errors ~0tp",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my previous comments on that

when length(Errors) /= 0 ->
?LOG(error, "~p: decode of answer from ~p failed, errors ~p", [SvcName, Peer, Errors]),
%% the exact content of Errors is a bit unclear, dumping them is best we can do
?LOG(error, "~0tp: decode of Diameter answer ~0tp from ~0tp failed, errors ~0tp",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my previous comment on that

?LOG(error, "~p:handle_request(~p, ~p, ~p)",
[?MODULE, _Packet, _SvcName, _Caps]),
handle_request(Packet, SvcName, {_, #diameter_caps{origin_host = {OH, _}}}) ->
?LOG(error, "~0tp: unsupported Diameter request from peer ~p, raw request ~0tp",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my previous comment on that

handle_event(Type, Event, _State, _Data) ->
?LOG(warning, "unhandled event ~p:~p", [Type, Event]),
handle_event(info, Info, State, _) ->
?LOG(alert, "unexpected info message in state ~p, something important might have been missed: ~p",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my previous comment on that

@@ -107,7 +107,8 @@ handle_response({API, _}, #{'Result-Code' := Code}, Session, Events, State) ->
handle_response(Procedure, #{'Result-Code' := Code}, Session, Events, State) ->
{{fail, Code}, Session, [{stop, {Procedure, peer_reject}} | Events], State};
handle_response(_Procedure, Response, Session, Events, State) ->
?LOG(error, "Response: ~p", [Response]),
?LOG(alert, "unexpected Diameter response, something important might have been missed: ~p",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my previous comment on that

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants