[APM] Collect telemetry about data/queries #50757
Pinging @elastic/apm-ui (Team:apm)
Not sure about this, but should it be apm-server telemetry that reports the number of transactions, errors, and metrics? Or would it be better to query the various document types in ES directly, so as not to put additional pressure on the server?
I'd think that querying should happen in ES. AFAIK, APM Server is stateless and at most knows about the number of documents it is currently processing (and there could be multiple instances of APM Server as well). Telemetry is reported from Kibana as well, so I would imagine this as a Kibana task that queries ES and then sends the telemetry up to our telemetry cluster.
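As a rough illustration of the Kibana-side collection suggested above, a background task could build one ES count request per processor event and timeframe. This is a hedged sketch: the index pattern, request shape, and the `countRequest` helper are illustrative assumptions, not the actual implementation.

```typescript
// Hypothetical sketch: build an ES count request for one processor event
// within a time range. Field names follow the APM schema; the index pattern
// is an assumption (the real APM indices are configurable).
type ProcessorEvent = 'transaction' | 'span' | 'error' | 'metric';

const APM_INDEX_PATTERN = 'apm-*'; // illustrative only

function countRequest(event: ProcessorEvent, gte: string) {
  return {
    index: APM_INDEX_PATTERN,
    body: {
      query: {
        bool: {
          filter: [
            { term: { 'processor.event': event } },
            { range: { '@timestamp': { gte } } },
          ],
        },
      },
    },
  };
}

// A background task would call something like esClient.count(countRequest(event, 'now-1d'))
// for each event/timeframe pair and ship the totals to the telemetry cluster.
```

The key point is that all counting happens on the ES side; Kibana only assembles the request bodies and forwards the totals.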
@dgieselaar OK, that makes sense
Thanks for creating this issue. Super good point about us needing better insights into users' data volumes and pain points in querying this.

Query response times
Regarding the above: if using the APM agent is not feasible in the short term, I'm good with starting out doing it ourselves.

Data volume
Copy and paste this into the console:
What about:

interface TimeframeMap {
'1d': number;
'1mo': number;
'6mo': number;
all: number;
}
interface APMDataTelemetry {
has_any_services: boolean;
services_per_agent: {
go: number;
java: number;
'js-base': number;
'rum-js': number;
nodejs: number;
python: number;
dotnet: number;
ruby: number;
};
data_characteristics: {
transactions: TimeframeMap;
spans: TimeframeMap;
errors: TimeframeMap;
metrics: TimeframeMap;
transaction_groups: Pick<TimeframeMap, 'all'>;
error_groups: Pick<TimeframeMap, 'all'>;
traces: Pick<TimeframeMap, 'all'>;
services: Pick<TimeframeMap, 'all'>;
agent_configurations: Pick<TimeframeMap, 'all'>;
};
integrations: {
alerting: boolean;
ml: boolean;
};
}
interface APMPerformanceMeasurement {
path: string;
response_time_ms: number;
status_code: number;
}
interface APMPerformanceTelemetry {
total_api_requests: number;
requests: APMPerformanceMeasurement[];
}
type APMTelemetry = APMDataTelemetry & APMPerformanceTelemetry;
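For illustration, entries matching `APMPerformanceMeasurement` above could be produced by timing each route handler. This wrapper is a hypothetical sketch, not Kibana's actual instrumentation; the handler shape is an assumption.

```typescript
interface APMPerformanceMeasurement {
  path: string;
  response_time_ms: number;
  status_code: number;
}

// Hypothetical sketch: wrap a route handler, record how long it took and
// what status it returned, and emit one measurement entry.
async function measured(
  path: string,
  handler: () => Promise<{ statusCode: number }>
): Promise<APMPerformanceMeasurement> {
  const start = Date.now();
  const { statusCode } = await handler();
  return { path, response_time_ms: Date.now() - start, status_code: statusCode };
}
```

Accumulating these entries into `requests` and summing them for `total_api_requests` would yield the `APMPerformanceTelemetry` shape.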
Not sure if this is feasible with telemetry or not, but for agent developers, it would be super helpful to know overall counts for
All of these come in pairs, so it would be most meaningful to count distinct combinations. Having this data would give us information on the adoption rate of new agent versions, as well as some data on which frameworks and language versions are being used, which could inform decisions on deprecating support for old versions (Python 2.7 support comes to mind). cc @elastic/apm-agent-devs
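One way to count distinct combinations like this is a composite aggregation over the two fields, where the number of buckets returned is the number of distinct pairs. The request body below is a hedged sketch; the field names follow the APM schema, but the helper and its parameters are assumptions.

```typescript
// Hypothetical sketch: aggregation body that pages through distinct
// (fieldA, fieldB) pairs, e.g. 'agent.name' and 'agent.version'.
// Distinct combinations fall out as the number of composite buckets.
function distinctPairsAgg(fieldA: string, fieldB: string, size = 1000) {
  return {
    size: 0, // no hits needed, aggregations only
    aggs: {
      pairs: {
        composite: {
          size,
          sources: [
            { a: { terms: { field: fieldA } } },
            { b: { terms: { field: fieldB } } },
          ],
        },
      },
    },
  };
}
```

The same shape would apply to the other pairs mentioned above, such as framework name/version or runtime name/version.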
This is what we have today:
{
"requests": [
{
"path": "GET /api/apm/index_pattern/dynamic",
"response_time": {
"ms": 139
},
"status_code": 200
},
{
"path": "GET /api/apm/services/{serviceName}/transaction_types",
"response_time": {
"ms": 552
},
"status_code": 200
},
{
"path": "GET /api/apm/ui_filters/environments",
"response_time": {
"ms": 569
},
"status_code": 200
},
{
"path": "GET /api/apm/services/{serviceName}/agent_name",
"response_time": {
"ms": 72
},
"status_code": 200
},
{
"path": "POST /api/apm/index_pattern/static",
"response_time": {
"ms": 65
},
"status_code": 200
},
{
"path": "GET /api/apm/ui_filters/local_filters/transactionGroups",
"response_time": {
"ms": 788
},
"status_code": 200
},
{
"path": "GET /api/apm/services/{serviceName}/transaction_groups/breakdown",
"response_time": {
"ms": 939
},
"status_code": 200
},
{
"path": "GET /api/apm/services/{serviceName}/transaction_groups/charts",
"response_time": {
"ms": 970
},
"status_code": 200
},
{
"path": "GET /api/apm/services/{serviceName}/transaction_groups",
"response_time": {
"ms": 940
},
"status_code": 200
}
],
"total_requests": 9,
"total_response_time": {
"ms": 5034
},
"counts": {
"span": {
"1d": 718028,
"1M": 11504559,
"6M": 16720490,
"all": 16720490
},
"transaction": {
"1d": 119024,
"1M": 2014449,
"6M": 2867622,
"all": 2867622
},
"metric": {
"1d": 51093,
"1M": 969633,
"6M": 1307368,
"all": 1307368
},
"error": {
"1d": 14721,
"1M": 227677,
"6M": 312524,
"all": 312527
},
"onboarding": {
"1d": 1,
"1M": 25,
"6M": 36,
"all": 36
},
"agent_configuration": {
"all": 3
},
"transaction_group": {
"all": 24
},
"error_group": {
"all": 866
},
"trace": {
"all": 981677
},
"service": {
"all": 8
}
},
"has_any_services": true,
"services_per_agent": {
"js-base": 0,
"rum-js": 0,
"dotnet": 0,
"go": 1,
"java": 2,
"nodejs": 1,
"python": 1,
"ruby": 1
},
"versions": {
"apm_server": {
"major": 8,
"minor": 0,
"patch": 0
}
},
"integrations": {
"ml": true
}
}
Btw since the requests have
Discussed on Slack to drop performance measurements for now. It's going to be cumbersome to actually use this data because we cannot use nested objects, as the xpack-phone-home indices use index sorting. I've added some agent and index metrics as well. I've also uploaded a bunch of dummy data. Some highlights of what I'm collecting so far:
Here's a sample document:
{
"counts": {
"error": {
"1d": 233186,
"1M": 3453848,
"6M": 3453854,
"all": 3453854
},
"metric": {
"1d": 613004,
"1M": 9461527,
"6M": 9461527,
"all": 9461563
},
"span": {
"1d": 7949662,
"1M": 120103499,
"6M": 120104792,
"all": 120105948
},
"transaction": {
"1d": 1599822,
"1M": 31030690,
"6M": 31030798,
"all": 31030845
},
"onboarding": {
"1d": 0,
"1M": 15,
"6M": 15,
"all": 15
},
"sourcemap": {
"1d": 0,
"1M": 0,
"6M": 0,
"all": 0
},
"agent_configuration": {
"all": 43
},
"max_error_groups_per_service": {
"all": 293394
},
"max_transaction_groups_per_service": {
"all": 25
},
"traces": {
"1d": 1259637,
"all": 25892950
},
"services": {
"all": 9
}
},
"tasks": {
"processor_events": {
"took": {
"ms": 41142
}
},
"agent_configuration": {
"took": {
"ms": 15
}
},
"services": {
"took": {
"ms": 50393
}
},
"versions": {
"took": {
"ms": 17
}
},
"groupings": {
"took": {
"ms": 60768
}
},
"integrations": {
"took": {
"ms": 19
}
},
"agents": {
"took": {
"ms": 2996
}
},
"indices_stats": {
"took": {
"ms": 58
}
}
},
"has_any_services": true,
"services_per_agent": {
"java": 1,
"js-base": 2,
"rum-js": 0,
"dotnet": 1,
"go": 2,
"nodejs": 1,
"python": 1,
"ruby": 1
},
"versions": {
"apm_server": {
"major": 8,
"minor": 0,
"patch": 0
}
},
"integrations": {
"ml": {
"has_anomalies_indices": true
}
},
"agents": {
"java": {
"agent": {
"version": [
"1.11.1-SNAPSHOT"
]
},
"service": {
"framework": {
"name": [],
"version": []
},
"language": {
"name": [
"Java"
],
"version": [
"10.0.2"
]
},
"runtime": {
"name": [
"Java"
],
"version": [
"10.0.2"
]
}
}
},
"js-base": {
"agent": {
"version": [
"4.6.0",
"4.5.1"
]
},
"service": {
"framework": {
"name": [],
"version": []
},
"language": {
"name": [
"javascript"
],
"version": []
},
"runtime": {
"name": [],
"version": []
}
}
},
"rum-js": {
"agent": {
"version": []
},
"service": {
"framework": {
"name": [],
"version": []
},
"language": {
"name": [],
"version": []
},
"runtime": {
"name": [],
"version": []
}
}
},
"dotnet": {
"agent": {
"version": [
"1.1.2"
]
},
"service": {
"framework": {
"name": [
"ASP.NET Core"
],
"version": [
"2.2.0.0"
]
},
"language": {
"name": [
"C#"
],
"version": []
},
"runtime": {
"name": [
".NET Core"
],
"version": [
"2.2.7"
]
}
}
},
"go": {
"agent": {
"version": [
"1.6.0"
]
},
"service": {
"framework": {
"name": [
"gin"
],
"version": [
"v1.4.0"
]
},
"language": {
"name": [
"go"
],
"version": [
"go1.13.4",
"go1.12.12"
]
},
"runtime": {
"name": [
"gc"
],
"version": [
"go1.13.4",
"go1.12.12"
]
}
}
},
"nodejs": {
"agent": {
"version": [
"3.2.0"
]
},
"service": {
"framework": {
"name": [
"express"
],
"version": [
"4.17.1"
]
},
"language": {
"name": [
"javascript"
],
"version": []
},
"runtime": {
"name": [
"node"
],
"version": [
"12.13.0"
]
}
}
},
"python": {
"agent": {
"version": [
"5.3.1"
]
},
"service": {
"framework": {
"name": [
"django"
],
"version": [
"2.1.13"
]
},
"language": {
"name": [
"python"
],
"version": [
"3.6.9"
]
},
"runtime": {
"name": [
"CPython"
],
"version": [
"3.6.9"
]
}
}
},
"ruby": {
"agent": {
"version": [
"3.1.0"
]
},
"service": {
"framework": {
"name": [
"Ruby on Rails"
],
"version": [
"5.2.3"
]
},
"language": {
"name": [
"ruby"
],
"version": [
"2.6.5"
]
},
"runtime": {
"name": [
"ruby"
],
"version": [
"2.6.5"
]
}
}
}
},
"indices": {
"shards": {
"total": 10
},
"all": {
"total": {
"docs": {
"count": 164334688
},
"store": {
"size_in_bytes": 49911744350
}
}
}
}
}

@elastic/apm Any thoughts/suggestions here? Here's how you can help:
Would be great to have this feedback sometime next week so we can finish this up.
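Since nested objects are out (index sorting), the per-agent metadata in the sample document above is flattened into plain arrays of unique keywords. A minimal sketch of that flattening step, with a hypothetical input shape:

```typescript
// Hypothetical sketch: collapse raw per-service values (some possibly
// missing) into a deduplicated, sorted keyword array, which maps cleanly
// onto non-nested keyword fields in the telemetry index.
function uniqueKeywords(values: Array<string | undefined>): string[] {
  return Array.from(
    new Set(values.filter((v): v is string => v !== undefined))
  ).sort();
}
```

Applied per agent and per field, this produces arrays like the `"version": ["go1.13.4", "go1.12.12"]` entries in the sample document, without requiring nested mappings.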
This looks great. A few thoughts:
@elastic/kibana-stack-services do you all have any advice here? The APM team would like to collect telemetry regarding their "apm data" and usage. The APM "data indices" are configurable. I believe that Elasticsearch itself writes its own telemetry data that Kibana then consumes. Is Logstash or any other application doing something similar?
cc: @elastic/pulse
Hey @ogupte, is my understanding correct that this is something you'd like to implement for 7.7? Is it safe to assume that #51612 is the PR which would implement this functionality? The @elastic/pulse team today was discussing how we should handle sending telemetry data for products besides Kibana, and this seems to fall into that category. I don't necessarily want to stall this effort while we continue to have this discussion, but I did want to make sure we weren't working ourselves into a corner.
@kobelb We're aiming for 7.7, but the status is not entirely clear atm - I'll send you a message.
@dgieselaar and @kobelb The information above is super helpful in fleshing out additional Pulse service requirements for products besides Kibana. I've made notes from the whole discussion and will take them into account when planning specifications for these.
Sorry to be commenting late in the day, but elastic/elasticsearch#52917 (comment) is related. EDIT:
elastic/elasticsearch#52917 (comment) and elastic/elasticsearch#52917 (comment) are related to this. Based on those comments it sounds like a better name would be something like
Allows the kibana user to collect APM telemetry in a background task.
* [APM] Collect telemetry about data/API performance Closes #50757. * Ignore apm scripts package.json * Config flag for enabling/disabling telemetry collection
… (#54106) * Required for elastic/kibana#50757. Allows the kibana user to collect APM telemetry in a background task. * removed unnecessary privileges on `.ml-anomalies-*` for the `kibana_system` reserved role
We currently don't have a lot of insight into the amounts of data that our customers have and how long it takes for our ES queries to be processed. This makes it hard to judge at which scale current and new functionalities need to operate. For instance, in some cases it might be reasonable to process things in memory on the Node server rather than in ES, which might simplify our implementation. Additionally, optimizing right now is hard because we don't know where our users are experiencing slowness.
Ideally we would have telemetry about:
@graphaelli: any idea if the monitoring data that we have provides an answer to any of these questions?