Improve error messages for network connectivity and memory issue #1483

Merged
merged 20 commits into master from kryalama/impnetworkmsg
Feb 26, 2021


kryalama
Contributor

Algorithm:

  1. Keep track of the number of successes since startup (or even just the first success).
  2. On the first failure, log the failure, mention how many successes there have been since startup, and mention that future failures will be aggregated and logged once every 5 minutes.
  3. After the first failure, keep a running count of failed requests and total requests. Have a timer: every 5 minutes, if there were any failures, log the number of failed requests and successful requests in the last 5 minutes.
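
A minimal sketch of the three steps above, assuming a single counter object per operation. The class name `FailureStatsSketch`, its methods, and the in-memory `logged` list (standing in for a real logger) are hypothetical and are not the classes added in this PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the algorithm; not the PR's actual implementation.
class FailureStatsSketch {
    private long successesSinceStartup;
    private boolean firstFailureLogged;
    private long failedInWindow;
    private long totalInWindow;

    // Stands in for a real logger so the sketch stays self-contained.
    final List<String> logged = new ArrayList<>();

    // Step 1: count successes since startup (until the first failure is seen).
    synchronized void recordSuccess() {
        if (firstFailureLogged) {
            totalInWindow++;
        } else {
            successesSinceStartup++;
        }
    }

    // Step 2: log the first failure immediately; step 3: only count later ones.
    synchronized void recordFailure(String reason) {
        if (!firstFailureLogged) {
            firstFailureLogged = true;
            logged.add(reason + " (" + successesSinceStartup
                    + " successes since startup; future failures will be"
                    + " aggregated and logged once every 5 minutes)");
            return;
        }
        failedInWindow++;
        totalInWindow++;
    }

    // Step 3: meant to be invoked by a timer every 5 minutes.
    synchronized void flush() {
        if (failedInWindow > 0) {
            long succeeded = totalInWindow - failedInWindow;
            logged.add("In the last 5 minutes: " + failedInWindow
                    + " failed, " + succeeded + " succeeded");
        }
        failedInWindow = 0;
        totalInWindow = 0;
    }
}
```

One way to drive `flush()` would be `ScheduledExecutorService.scheduleAtFixedRate(stats::flush, 5, 5, TimeUnit.MINUTES)`; the sketch leaves scheduling out so it can be exercised without waiting.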

@kryalama kryalama requested a review from trask February 11, 2021 21:07
@kryalama kryalama requested a review from trask February 23, 2021 18:10
@kryalama kryalama changed the title Improve error messages for network connectivity issue Improve error messages for network connectivity and memory issue Feb 24, 2021
@kryalama
Contributor Author

kryalama commented Feb 24, 2021

[two screenshots of the log output]

@trask
Member

trask commented Feb 24, 2021

@kryalama instead of a screenshot, can you paste the text? I'd like to see the left-hand side (timestamps) and the right-hand side (to see what's different that's causing those lines to have different messages)

@kryalama
Contributor Author

kryalama commented Feb 24, 2021

It's because of the number of bytes in those messages. Also, the first group is from network exceptions and the second group is from disk exceptions.

2021-02-24 07:26:17.541-08 ERROR c.m.a.internal.util.ExceptionStats - Failed to send, socket exception. Telemetry will be stored locally and re-sent later once the connection is stable again (failed 517 times in the last 5 minutes)
2021-02-24 07:26:17.541-08 WARN  c.m.a.internal.util.ExceptionStats - 517/517(Total Failures/Total Requests) reported in the last 5 minutes
2021-02-24 07:26:17.680-08 ERROR c.m.a.internal.util.ExceptionStats - Persistent storage max capacity has been reached; currently at 10488481 bytes. Telemetry will be lost, please consider increasing the value of MaxTransmissionStorageFilesCapacityInMB property in the configuration file. (failed 12330 times in the last 5 minutes)
2021-02-24 07:26:17.680-08 ERROR c.m.a.internal.util.ExceptionStats - Persistent storage max capacity has been reached; currently at 10488407 bytes. Telemetry will be lost, please consider increasing the value of MaxTransmissionStorageFilesCapacityInMB property in the configuration file. (failed 63 times in the last 5 minutes)
2021-02-24 07:26:17.680-08 ERROR c.m.a.internal.util.ExceptionStats - Rename To Temporary Name failed (failed 98 times in the last 5 minutes)
2021-02-24 07:26:17.680-08 ERROR c.m.a.internal.util.ExceptionStats - Persistent storage max capacity has been reached; currently at 10488401 bytes. Telemetry will be lost, please consider increasing the value of MaxTransmissionStorageFilesCapacityInMB property in the configuration file. (failed 10 times in the last 5 minutes)
2021-02-24 07:26:17.681-08 ERROR c.m.a.internal.util.ExceptionStats - Persistent storage max capacity has been reached; currently at 10488385 bytes. Telemetry will be lost, please consider increasing the value of MaxTransmissionStorageFilesCapacityInMB property in the configuration file. (failed 1734 times in the last 5 minutes)
2021-02-24 07:26:17.681-08 ERROR c.m.a.internal.util.ExceptionStats - Persistent storage max capacity has been reached; currently at 10488416 bytes. Telemetry will be lost, please consider increasing the value of MaxTransmissionStorageFilesCapacityInMB property in the configuration file. (failed 56 times in the last 5 minutes)
2021-02-24 07:26:17.681-08 ERROR c.m.a.internal.util.ExceptionStats - Persistent storage max capacity has been reached; currently at 10488388 bytes. Telemetry will be lost, please consider increasing the value of MaxTransmissionStorageFilesCapacityInMB property in the configuration file. (failed 54 times in the last 5 minutes)
2021-02-24 07:26:17.681-08 ERROR c.m.a.internal.util.ExceptionStats - Persistent storage max capacity has been reached; currently at 10488426 bytes. Telemetry will be lost, please consider increasing the value of MaxTransmissionStorageFilesCapacityInMB property in the configuration file. (failed 10 times in the last 5 minutes)
2021-02-24 07:26:17.681-08 WARN  c.m.a.internal.util.ExceptionStats - 14355/15115(Total Failures/Total Requests) reported in the last 5 minutes

@trask
Member

trask commented Feb 24, 2021

let's change the error message so it has low cardinality, e.g.

Persistent storage max capacity has been reached; currently at 10488401 bytes
-->
Persistent storage max capacity of 10485760 bytes has been reached

(where 10485760 is capacityInBytes, not currentSizeInBytes)
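
A small sketch of why this matters for aggregation, assuming failures are grouped by their exact message text. The class and method names here are hypothetical; only `capacityInBytes` and `currentSizeInBytes` come from the comment above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; not code from this PR.
class CapacityMessageSketch {
    // High cardinality: the current size differs on almost every failure.
    static String withCurrentSize(long currentSizeInBytes) {
        return "Persistent storage max capacity has been reached; currently at "
                + currentSizeInBytes + " bytes";
    }

    // Low cardinality: the configured capacity is the same on every failure.
    static String withCapacity(long capacityInBytes) {
        return "Persistent storage max capacity of " + capacityInBytes
                + " bytes has been reached";
    }

    // Groups failures by message text, as an aggregating logger might.
    static Map<String, Integer> countByMessage(List<String> messages) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String message : messages) {
            counts.merge(message, 1, Integer::sum);
        }
        return counts;
    }
}
```

With the first form, three failures at 10488401, 10488407, and 10488481 bytes produce three separate groups; with the second form they collapse into a single group with a count of 3.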

@kryalama
Contributor Author

Sample output after latest changes:

2021-02-25 10:53:11.828-08 WARN  c.m.a.i.c.c.TransmissionFileSystemOutput - Unable to send telemetry to the ingestion service, and unable to store telemetry locally, so telemetry will be discarded. See previous warning(s) for reason unable to send telemetry to the ingestion service. Reason unable to store telemetry locally: local storage capacity (10MB) has been exceeded (future failures will be aggregated and logged once every 5 minutes)
2021-02-25 10:57:46.751-08 WARN  c.m.a.i.c.c.TransmissionNetworkOutput - In the last 5 minutes, the following operation has failed 517 times (out of 517 total):
Unable to send telemetry to the ingestion service, so telemetry will be stored to disk and sent to the ingestion service later if possible. Reason unable to send telemetry to the ingestion service:
 * socket exception: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8080 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused: connect (517 times)
2021-02-25 10:58:11.842-08 WARN  c.m.a.i.c.c.TransmissionFileSystemOutput - In the last 5 minutes, the following operation has failed 12779 times (out of 16029 total):
Unable to send telemetry to the ingestion service, and unable to store telemetry locally, so telemetry will be discarded. See previous warning(s) for reason unable to send telemetry to the ingestion service. Reason unable to store telemetry locally:
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279439319-18129934496622261209.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279439646-120274041152921294.trn' does not exist (1 times)
 * local storage capacity (10MB) has been exceeded (12759 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279438986-5997749172258811018.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279440864-4084345048394395496.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279440534-1962911171427426965.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279440201-5658239660918810669.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279439757-2892336431241058619.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279440422-2796910921071553014.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279439539-729406873432368983.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279440091-13089500213803274533.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279440752-1965355712487808118.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279439868-15067146696742896313.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279439209-14184126695746796473.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279439429-17135129225473260175.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279439098-13636197194005688023.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279441084-8143357387794191794.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279439980-1755867371589061107.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279440311-3828026578525996487.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279440642-3148681853088031666.trn' does not exist (1 times)
 * unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279440975-4653977834937726088.trn' does not exist (1 times)

message.append(" times (out of ");
message.append(total);
message.append(" total):\n");
message.append(introMessage);
Contributor Author


I think adding introMessage again adds too much clutter.

Member


I think it's good context, but I shortened the intro significantly to make the messages less cluttered

@trask
Member

trask commented Feb 25, 2021

this message isn't part of storing telemetry to disk (it's part of reading from disk, and I don't think it's a spam problem), so I removed it from diskExceptionStats

unable to rename file to temporary name: java.io.FileNotFoundException: Source 'C:\Users\kryalama\AppData\Local\Temp\transmissions\Transmission-1614279440201-5658239660918810669.trn' does not exist (1 times)

@kryalama
Contributor Author

Looks good to me

2021-02-25 14:24:00.998-08 WARN  c.m.a.i.c.c.TransmissionFileSystemOutput - Unable to store telemetry to disk (telemetry will be discarded): local storage capacity (10MB) has been exceeded (future failures will be aggregated and logged once every 5 minutes)
2021-02-25 14:28:40.172-08 WARN  c.m.a.i.c.c.TransmissionNetworkOutput - In the last 5 minutes, the following operation has failed 512 times (out of 512):
Unable to send telemetry to the ingestion service (telemetry will be stored to disk):
 * socket exception: org.apache.http.conn.HttpHostConnectException: Connect to localhost:8080 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused: connect (512 times)
2021-02-25 14:29:01.005-08 WARN  c.m.a.i.c.c.TransmissionFileSystemOutput - In the last 5 minutes, the following operation has failed 11973 times (out of 15572):
Unable to store telemetry to disk (telemetry will be discarded):
 * local storage capacity (10MB) has been exceeded (11973 times)
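
For reference, a sketch of how an aggregated message in this final format could be assembled, assuming per-reason failure counts are kept in a map. The class name, the `build` method, and its parameters are hypothetical; the `StringBuilder` usage only mirrors the style of the reviewed snippet.

```java
import java.util.Map;

// Hypothetical illustration of the final log format; not the PR's actual code.
class AggregatedMessageSketch {
    static String build(String introMessage, long total, Map<String, Long> failuresByReason) {
        // Total failures is the sum of the per-reason counts.
        long failed = 0;
        for (long count : failuresByReason.values()) {
            failed += count;
        }
        StringBuilder message = new StringBuilder();
        message.append("In the last 5 minutes, the following operation has failed ");
        message.append(failed);
        message.append(" times (out of ");
        message.append(total);
        message.append("):\n");
        message.append(introMessage);
        message.append(":");
        // One bullet per distinct reason, with its own count.
        for (Map.Entry<String, Long> entry : failuresByReason.entrySet()) {
            message.append("\n * ");
            message.append(entry.getKey());
            message.append(" (");
            message.append(entry.getValue());
            message.append(" times)");
        }
        return message.toString();
    }
}
```

Because each distinct reason gets its own bullet, this layout only stays short when the reasons are low cardinality, which is why the capacity message no longer includes the current size.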

Member

@trask trask left a comment


👍

@trask trask merged commit 0fd3d86 into master Feb 26, 2021
@trask trask deleted the kryalama/impnetworkmsg branch February 26, 2021 05:07