[APM] Collect telemetry about data/queries #50757

Closed
dgieselaar opened this issue Nov 15, 2019 · 22 comments · Fixed by #51612
Labels
Feature:Telemetry Team:APM All issues that need APM UI Team support

Comments

@dgieselaar
Member

We currently don't have a lot of insight into the amounts of data that our customers have and how long it takes for our ES queries to be processed. This makes it hard to judge at which scale current and new functionalities need to operate. For instance, in some cases it might be reasonable to process things in memory on the Node server rather than in ES, which might simplify our implementation. Additionally, optimizing right now is hard because we don't know where our users are experiencing slowness.

Ideally we would have telemetry about:

  • The data volume (how many errors/error groups? how many transactions? how many transactions/spans per trace? how many services? etc). This could be collected with a Kibana task that queries the data indices at a set interval and sends the data back home.
  • Query response times. We could instrument our ES client facade to store telemetry about ES response times. There's also the possibility of using the nodejs agent to instrument Kibana, but the ongoing efforts are explicitly scoped to non-production usage: Instrument Kibana with Elastic APM #43548
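As a rough illustration of the second bullet, the sketch below wraps an arbitrary async search function and records how long each call took. It is not Kibana's actual client facade: `QueryTiming`, `TimingRecorder` and `createTimingRecorder` are hypothetical names, and the clock is injectable only so that the behaviour can be checked deterministically.

```typescript
// Hypothetical sketch: none of these names come from Kibana.
interface QueryTiming {
  name: string; // label for the query, e.g. the API route that issued it
  took_ms: number; // elapsed wall-clock time
  ok: boolean; // false when the wrapped call threw
}

interface TimingRecorder {
  timings: QueryTiming[];
  start(name: string): (ok: boolean) => void;
  wrap<P, R>(
    name: string,
    search: (params: P) => Promise<R>
  ): (params: P) => Promise<R>;
}

function createTimingRecorder(now: () => number = Date.now): TimingRecorder {
  const timings: QueryTiming[] = [];

  // Starts a timer; the returned function stops it and stores the measurement.
  const start = (name: string) => {
    const startedAt = now();
    return (ok: boolean): void => {
      timings.push({ name, took_ms: now() - startedAt, ok });
    };
  };

  // Wraps an async search function so that every call is timed,
  // whether it resolves or rejects.
  const wrap = <P, R>(name: string, search: (params: P) => Promise<R>) =>
    async (params: P): Promise<R> => {
      const stop = start(name);
      try {
        const result = await search(params);
        stop(true);
        return result;
      } catch (err) {
        stop(false);
        throw err;
      }
    };

  return { timings, start, wrap };
}
```

A caller would wrap its search function once and then read `timings` when the telemetry document is assembled. Whether measurements should be kept individually or pre-aggregated is a separate design question.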

@graphaelli: any idea if the monitoring data that we have provides an answer to any of these questions?

@dgieselaar dgieselaar added the Team:APM All issues that need APM UI Team support label Nov 15, 2019
@elasticmachine
Contributor

Pinging @elastic/apm-ui (Team:apm)

@formgeist
Contributor

Not sure about this, but would it be apm-server telemetry that reports the number of transactions, errors and metrics? Or would it be better to query the various document types in ES, so as not to put additional pressure on the server?

@dgieselaar
Member Author

dgieselaar commented Nov 15, 2019

I'd think that query should happen in ES. AFAIK, APM Server is stateless, and at most knows about the number of documents it is currently processing (and there could be multiple instances of APM Server as well). Telemetry is reported from Kibana as well, so I would imagine this as a Kibana task that queries ES and then sends the telemetry to our telemetry cluster.

@formgeist
Contributor

@dgieselaar OK, that makes sense

@sorenlouv
Member

sorenlouv commented Nov 16, 2019

Thanks for creating this issue. Very good point about us needing better insight into users' data volumes and the pain points in querying them.

Query response times
I'd really hate for us to re-build the nodejs APM agent all over again, so I'd much prefer if we could investigate ways to piggyback on the extensive auto-instrumentation it has.
AFAIR we ran into two problems when enabling the agent in prod:

  1. PII: we were worried about sending sensitive data
  2. secretToken needed to be bundled with kibana, which is not great

Regarding #1 I think the node agent is quite configurable and we might be able to turn off most things except for elasticsearch query performance numbers. And we might be able to enable it just for specific plugins.

Regarding #2 we might be able to disable the transport via APM Server, and instead send it as telemetry data, and thus avoid needing secretToken. This would probably require us to parse it first but it would still provide us with a solution that potentially can auto-instrument the entire Kibana and send consistent performance telemetry.
I've emphasised consistent here because it's key that performance data is collected the same way across plugins for us to be able to compare it.

If using the APM agent is not feasible in the short term I'm good with starting out doing it ourselves.

Data volume
APM agent doesn't do anything in this area so I feel much better about solving this on our own.
Again, it would be optimal if we can build something that other plugins can use too, to improve the consistency of the collected data.

@dgieselaar
Member Author

Looks like nested could work for us, nice stuff:

Copy and paste this into the console
PUT apm-telemetry-test

PUT apm-telemetry-test/_mapping
{
  "properties": {
    "plugins": {
      "properties": {
        "apm": {
          "properties": {
            "has_any_services": {
              "type": "boolean"
            },
            "services_per_agent": {
              "properties": {
                "go": {
                  "type": "long"
                },
                "java": {
                  "type": "long"
                },
                "js-base": {
                  "type": "long"
                },
                "rum-js": {
                  "type": "long"
                },
                "nodejs": {
                  "type": "long"
                },
                "python": {
                  "type": "long"
                },
                "dotnet": {
                  "type": "long"
                },
                "ruby": {
                  "type": "long"
                }
              }
            },
            "endpoint_responses": {
              "type": "nested",
              "properties": {
                "path": {
                  "type": "keyword"
                },
                "status_code": {
                  "type": "integer"
                },
                "response_time_ms": {
                  "type": "double"
                }
              }
            }
          }
        }
      }
    }
  }
}

POST apm-telemetry-test/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

POST apm-telemetry-test/_doc
{
  "plugins": {
    "apm": {
      "endpoint_responses": [
        {
          "path": "POST /api/apm/foo",
          "status_code": 200,
          "response_time_ms": 100
        },
        {
          "path": "GET /api/apm/foo",
          "status_code": 200,
          "response_time_ms": 200
        },
        {
          "path": "GET /api/apm/bar",
          "status_code": 500,
          "response_time_ms": 300
        }
      ]
    }
  }
}

GET apm-telemetry-test/_search

GET apm-telemetry-test/_search
{
  "size": 0,
  "aggs": {
    "average_response_time_per_endpoint": {
      "nested": {
        "path": "plugins.apm.endpoint_responses"
      },
      "aggs": {
        "by_path": {
          "terms": {
            "field": "plugins.apm.endpoint_responses.path"
          },
          "aggs": {
            "average_response_time":  {
              "avg": {
                "field": "plugins.apm.endpoint_responses.response_time_ms"
              }
            }
          }
        }
      }
    }
  }
}
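Why `nested` matters here: with the default `object` mapping, Elasticsearch flattens an array of objects into one multi-valued field per property, so the link between a `path` and its `response_time_ms` is lost. The sketch below mimics both behaviours in memory on the three sample measurements from the document above. It does not call Elasticsearch, and `flatten` and `averageByPath` are illustrative names only.

```typescript
interface EndpointResponse {
  path: string;
  status_code: number;
  response_time_ms: number;
}

const responses: EndpointResponse[] = [
  { path: "POST /api/apm/foo", status_code: 200, response_time_ms: 100 },
  { path: "GET /api/apm/foo", status_code: 200, response_time_ms: 200 },
  { path: "GET /api/apm/bar", status_code: 500, response_time_ms: 300 },
];

// "object" mapping: every property becomes an independent multi-valued field,
// so it is no longer known which response time belonged to which path.
function flatten(items: EndpointResponse[]) {
  return {
    path: items.map((i) => i.path),
    response_time_ms: items.map((i) => i.response_time_ms),
  };
}

// "nested" mapping: each array entry stays a unit, so per-path averages work.
function averageByPath(items: EndpointResponse[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const item of items) {
    const entry = sums.get(item.path) ?? { total: 0, count: 0 };
    entry.total += item.response_time_ms;
    entry.count += 1;
    sums.set(item.path, entry);
  }
  const averages = new Map<string, number>();
  sums.forEach(({ total, count }, path) => averages.set(path, total / count));
  return averages;
}

// With flattened fields the only average available is over every value in
// the document, regardless of path.
const flat = flatten(responses);
const flattenedAverage =
  flat.response_time_ms.reduce((a, b) => a + b, 0) / flat.response_time_ms.length;

const nestedAverages = averageByPath(responses);
```

For the sample document, `nestedAverages` holds 100, 200 and 300 for the three paths, whereas `flattenedAverage` is 200 for the document as a whole, which is the value every path bucket would report without `nested`.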

@dgieselaar
Member Author

dgieselaar commented Nov 21, 2019

What about:

interface TimeframeMap {
  '1d': number;
  '1mo': number;
  '6mo': number;
  all: number;
}

interface APMDataTelemetry {
  has_any_services: boolean;
  services_per_agent: {
    go: number;
    java: number;
    'js-base': number;
    'rum-js': number;
    nodejs: number;
    python: number;
    dotnet: number;
    ruby: number;
  };
  data_characteristics: {
    transactions: TimeframeMap;
    spans: TimeframeMap;
    errors: TimeframeMap;
    metrics: TimeframeMap;
    transaction_groups: Pick<TimeframeMap, 'all'>;
    error_groups: Pick<TimeframeMap, 'all'>;
    traces: Pick<TimeframeMap, 'all'>;
    services: Pick<TimeframeMap, 'all'>;
    agent_configurations: Pick<TimeframeMap, 'all'>;
  };
  integrations: {
    alerting: boolean;
    ml: boolean;
  };
}

interface APMPerformanceMeasurement {
  path: string;
  response_time_ms: number;
  status_code: number;
}

interface APMPerformanceTelemetry {
  total_api_requests: number;
  requests: APMPerformanceMeasurement[];
}

type APMTelemetry = APMDataTelemetry & APMPerformanceTelemetry;
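To make the `TimeframeMap` idea more concrete, here is a hedged sketch of how each key could translate into an Elasticsearch query clause using date math. The `@timestamp` field name, the `timeframeFilter` helper, and the mapping of `1mo`/`6mo` to `now-1M`/`now-6M` are assumptions for illustration; the interfaces above do not prescribe them.

```typescript
type Timeframe = "1d" | "1mo" | "6mo" | "all";

// Elasticsearch date math: "d" is days, upper-case "M" is months.
// "all" has no lower bound.
const dateMath: Record<Timeframe, string | undefined> = {
  "1d": "now-1d",
  "1mo": "now-1M",
  "6mo": "now-6M",
  all: undefined,
};

interface RangeFilter {
  range: { "@timestamp": { gte: string } };
}

interface MatchAllFilter {
  match_all: Record<string, never>;
}

// Builds the query clause that restricts a count to one timeframe.
function timeframeFilter(timeframe: Timeframe): RangeFilter | MatchAllFilter {
  const gte = dateMath[timeframe];
  return gte === undefined
    ? { match_all: {} }
    : { range: { "@timestamp": { gte } } };
}
```

Running one count per timeframe with these clauses would fill a `TimeframeMap`; a single request with several filter aggregations would be an alternative worth comparing.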

dgieselaar added a commit to dgieselaar/kibana that referenced this issue Nov 25, 2019
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Nov 25, 2019
@beniwohli

beniwohli commented Nov 27, 2019

Not sure if this is feasible with telemetry or not, but for agent developers, it would be super helpful to know overall counts for

  • service.agent.name / service.agent.version
  • service.framework.name / service.framework.version
  • service.language.name / service.language.version
  • service.runtime.name / service.runtime.version

All of these come in pairs, so it would be most meaningful to count distinct combinations.
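Counting distinct combinations, rather than names and versions separately, could look like the in-memory sketch below. `NameVersion` and `countCombinations` are hypothetical names, and the sample values are made up; how this would be expressed as an Elasticsearch aggregation is left open.

```typescript
interface NameVersion {
  name: string;
  version: string;
}

// Counts how often each (name, version) pair occurs, keyed as "name@version".
function countCombinations(pairs: NameVersion[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const { name, version } of pairs) {
    const key = `${name}@${version}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Made-up sample: three services reporting their language name and version.
const languageCounts = countCombinations([
  { name: "python", version: "2.7" },
  { name: "python", version: "3.6.9" },
  { name: "python", version: "2.7" },
]);
```

Here `languageCounts` has two entries, with `python@2.7` counted twice. Counting names and versions independently would not reveal which version belongs to which language.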

Having this data would give us information on adoption rate of new agent versions, as well as some data of which frameworks and language versions are being used, which could inform decisions on deprecating support for old versions (Python 2.7 support comes to mind).

cc @elastic/apm-agent-devs

@dgieselaar
Member Author

dgieselaar commented Nov 27, 2019

This is what we have today:

{
  "requests": [
    {
      "path": "GET /api/apm/index_pattern/dynamic",
      "response_time": {
        "ms": 139
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/services/{serviceName}/transaction_types",
      "response_time": {
        "ms": 552
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/ui_filters/environments",
      "response_time": {
        "ms": 569
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/services/{serviceName}/agent_name",
      "response_time": {
        "ms": 72
      },
      "status_code": 200
    },
    {
      "path": "POST /api/apm/index_pattern/static",
      "response_time": {
        "ms": 65
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/ui_filters/local_filters/transactionGroups",
      "response_time": {
        "ms": 788
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/services/{serviceName}/transaction_groups/breakdown",
      "response_time": {
        "ms": 939
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/services/{serviceName}/transaction_groups/charts",
      "response_time": {
        "ms": 970
      },
      "status_code": 200
    },
    {
      "path": "GET /api/apm/services/{serviceName}/transaction_groups",
      "response_time": {
        "ms": 940
      },
      "status_code": 200
    }
  ],
  "total_requests": 9,
  "total_response_time": {
    "ms": 5034
  },
  "counts": {
    "span": {
      "1d": 718028,
      "1M": 11504559,
      "6M": 16720490,
      "all": 16720490
    },
    "transaction": {
      "1d": 119024,
      "1M": 2014449,
      "6M": 2867622,
      "all": 2867622
    },
    "metric": {
      "1d": 51093,
      "1M": 969633,
      "6M": 1307368,
      "all": 1307368
    },
    "error": {
      "1d": 14721,
      "1M": 227677,
      "6M": 312524,
      "all": 312527
    },
    "onboarding": {
      "1d": 1,
      "1M": 25,
      "6M": 36,
      "all": 36
    },
    "agent_configuration": {
      "all": 3
    },
    "transaction_group": {
      "all": 24
    },
    "error_group": {
      "all": 866
    },
    "trace": {
      "all": 981677
    },
    "service": {
      "all": 8
    }
  },
  "has_any_services": true,
  "services_per_agent": {
    "js-base": 0,
    "rum-js": 0,
    "dotnet": 0,
    "go": 1,
    "java": 2,
    "nodejs": 1,
    "python": 1,
    "ruby": 1
  },
  "versions": {
    "apm_server": {
      "major": 8,
      "minor": 0,
      "patch": 0
    }
  },
  "integrations": {
    "ml": true
  }
}

@sorenlouv
Member

requests is an aggregation of requests made throughout the time-period, right? Should there be a count on each request (number of times it was called?)

@sorenlouv
Member

Btw since the requests have status_code: can the same request show up twice with different status codes?
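One way to address both questions would be to aggregate measurements per path and status code, keeping a call count alongside the total time. This is only a sketch of that idea, not what the PR does; `RequestSummary` and `summarizeRequests` are hypothetical names.

```typescript
interface RequestMeasurement {
  path: string;
  status_code: number;
  response_time_ms: number;
}

interface RequestSummary {
  path: string;
  status_code: number;
  count: number; // number of times this path returned this status code
  total_response_time_ms: number; // divide by count for the average
}

// Groups raw measurements by (status code, path). The same path appears
// once per distinct status code.
function summarizeRequests(measurements: RequestMeasurement[]): RequestSummary[] {
  const byKey = new Map<string, RequestSummary>();
  for (const m of measurements) {
    const key = `${m.status_code} ${m.path}`;
    const summary = byKey.get(key) ?? {
      path: m.path,
      status_code: m.status_code,
      count: 0,
      total_response_time_ms: 0,
    };
    summary.count += 1;
    summary.total_response_time_ms += m.response_time_ms;
    byKey.set(key, summary);
  }
  return Array.from(byKey.values());
}
```

With this shape, a path that returned both 200 and 500 during the period yields two entries, each with its own count.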

dgieselaar added a commit to dgieselaar/kibana that referenced this issue Dec 3, 2019
@dgieselaar
Member Author

We discussed this on Slack and decided to drop performance measurements for now. It would be cumbersome to actually use this data, because we cannot use nested objects: the xpack-phone-home indices use index sorting.

I've added some agent and index metrics as well. I've also uploaded a bunch of dummy data to https://apm.elstc.co that uses the same mapping as what is used for the telemetry cluster, so we can test out dashboards, queries etc over there.

Some highlights of what I'm collecting so far:

  • counts of processor events of various time ranges (1d, 1M, 6M, forever)
  • counts of common groupings we use in the UI (error groups, transaction groups, traces, services)
  • number of services per agent
  • per agent, the top 3 values for agent.version, service.framework.name, service.framework.version, service.language.name, service.language.version, service.runtime.name, and service.runtime.version
  • whether the cluster has APM-specific ML indices
  • the most recent version of APM server that is being used (kibana and ES versions are tracked separately)
  • stats about indices (document/shard count, disk size)
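The "top 3 values" item can be pictured with the in-memory sketch below. How the collector actually computes it is not shown in this thread, so `topValues` is purely a hypothetical helper, and the tie-break on the value itself is an assumption added to make the result deterministic.

```typescript
// Returns the `size` most frequent values, most frequent first.
// Ties are broken by plain string comparison so the output is stable.
function topValues(values: string[], size: number = 3): string[] {
  const counts = new Map<string, number>();
  for (const value of values) {
    counts.set(value, (counts.get(value) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .sort((a, b) => {
      if (b[1] !== a[1]) return b[1] - a[1]; // higher count first
      return a[0] < b[0] ? -1 : a[0] > b[0] ? 1 : 0;
    })
    .slice(0, size)
    .map(([value]) => value);
}

// Made-up input: language versions reported by documents of one agent.
const topGoVersions = topValues([
  "go1.13.4",
  "go1.12.12",
  "go1.13.4",
  "go1.11",
  "go1.10",
  "go1.13.4",
  "go1.12.12",
]);
```

Capping each list at a fixed size keeps the telemetry document bounded no matter how many distinct versions a cluster has seen.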

Here's a sample document:

Sample document
{
  "counts": {
    "error": {
      "1d": 233186,
      "1M": 3453848,
      "6M": 3453854,
      "all": 3453854
    },
    "metric": {
      "1d": 613004,
      "1M": 9461527,
      "6M": 9461527,
      "all": 9461563
    },
    "span": {
      "1d": 7949662,
      "1M": 120103499,
      "6M": 120104792,
      "all": 120105948
    },
    "transaction": {
      "1d": 1599822,
      "1M": 31030690,
      "6M": 31030798,
      "all": 31030845
    },
    "onboarding": {
      "1d": 0,
      "1M": 15,
      "6M": 15,
      "all": 15
    },
    "sourcemap": {
      "1d": 0,
      "1M": 0,
      "6M": 0,
      "all": 0
    },
    "agent_configuration": {
      "all": 43
    },
    "max_error_groups_per_service": {
      "all": 293394
    },
    "max_transaction_groups_per_service": {
      "all": 25
    },
    "traces": {
      "1d": 1259637,
      "all": 25892950
    },
    "services": {
      "all": 9
    }
  },
  "tasks": {
    "processor_events": {
      "took": {
        "ms": 41142
      }
    },
    "agent_configuration": {
      "took": {
        "ms": 15
      }
    },
    "services": {
      "took": {
        "ms": 50393
      }
    },
    "versions": {
      "took": {
        "ms": 17
      }
    },
    "groupings": {
      "took": {
        "ms": 60768
      }
    },
    "integrations": {
      "took": {
        "ms": 19
      }
    },
    "agents": {
      "took": {
        "ms": 2996
      }
    },
    "indices_stats": {
      "took": {
        "ms": 58
      }
    }
  },
  "has_any_services": true,
  "services_per_agent": {
    "java": 1,
    "js-base": 2,
    "rum-js": 0,
    "dotnet": 1,
    "go": 2,
    "nodejs": 1,
    "python": 1,
    "ruby": 1
  },
  "versions": {
    "apm_server": {
      "major": 8,
      "minor": 0,
      "patch": 0
    }
  },
  "integrations": {
    "ml": {
      "has_anomalies_indices": true
    }
  },
  "agents": {
    "java": {
      "agent": {
        "version": [
          "1.11.1-SNAPSHOT"
        ]
      },
      "service": {
        "framework": {
          "name": [],
          "version": []
        },
        "language": {
          "name": [
            "Java"
          ],
          "version": [
            "10.0.2"
          ]
        },
        "runtime": {
          "name": [
            "Java"
          ],
          "version": [
            "10.0.2"
          ]
        }
      }
    },
    "js-base": {
      "agent": {
        "version": [
          "4.6.0",
          "4.5.1"
        ]
      },
      "service": {
        "framework": {
          "name": [],
          "version": []
        },
        "language": {
          "name": [
            "javascript"
          ],
          "version": []
        },
        "runtime": {
          "name": [],
          "version": []
        }
      }
    },
    "rum-js": {
      "agent": {
        "version": []
      },
      "service": {
        "framework": {
          "name": [],
          "version": []
        },
        "language": {
          "name": [],
          "version": []
        },
        "runtime": {
          "name": [],
          "version": []
        }
      }
    },
    "dotnet": {
      "agent": {
        "version": [
          "1.1.2"
        ]
      },
      "service": {
        "framework": {
          "name": [
            "ASP.NET Core"
          ],
          "version": [
            "2.2.0.0"
          ]
        },
        "language": {
          "name": [
            "C#"
          ],
          "version": []
        },
        "runtime": {
          "name": [
            ".NET Core"
          ],
          "version": [
            "2.2.7"
          ]
        }
      }
    },
    "go": {
      "agent": {
        "version": [
          "1.6.0"
        ]
      },
      "service": {
        "framework": {
          "name": [
            "gin"
          ],
          "version": [
            "v1.4.0"
          ]
        },
        "language": {
          "name": [
            "go"
          ],
          "version": [
            "go1.13.4",
            "go1.12.12"
          ]
        },
        "runtime": {
          "name": [
            "gc"
          ],
          "version": [
            "go1.13.4",
            "go1.12.12"
          ]
        }
      }
    },
    "nodejs": {
      "agent": {
        "version": [
          "3.2.0"
        ]
      },
      "service": {
        "framework": {
          "name": [
            "express"
          ],
          "version": [
            "4.17.1"
          ]
        },
        "language": {
          "name": [
            "javascript"
          ],
          "version": []
        },
        "runtime": {
          "name": [
            "node"
          ],
          "version": [
            "12.13.0"
          ]
        }
      }
    },
    "python": {
      "agent": {
        "version": [
          "5.3.1"
        ]
      },
      "service": {
        "framework": {
          "name": [
            "django"
          ],
          "version": [
            "2.1.13"
          ]
        },
        "language": {
          "name": [
            "python"
          ],
          "version": [
            "3.6.9"
          ]
        },
        "runtime": {
          "name": [
            "CPython"
          ],
          "version": [
            "3.6.9"
          ]
        }
      }
    },
    "ruby": {
      "agent": {
        "version": [
          "3.1.0"
        ]
      },
      "service": {
        "framework": {
          "name": [
            "Ruby on Rails"
          ],
          "version": [
            "5.2.3"
          ]
        },
        "language": {
          "name": [
            "ruby"
          ],
          "version": [
            "2.6.5"
          ]
        },
        "runtime": {
          "name": [
            "ruby"
          ],
          "version": [
            "2.6.5"
          ]
        }
      }
    }
  },
  "indices": {
    "shards": {
      "total": 10
    },
    "all": {
      "total": {
        "docs": {
          "count": 164334688
        },
        "store": {
          "size_in_bytes": 49911744350
        }
      }
    }
  }
}
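The `tasks.*.took.ms` entries in the sample suggest that each collection step is timed separately. A minimal sketch of that pattern follows. It assumes synchronous tasks to stay short (real collection tasks would be asynchronous), and `CollectionTask` and `collect` are hypothetical names rather than the PR's implementation.

```typescript
type Telemetry = Record<string, unknown>;

interface CollectionTask {
  name: string; // becomes the key under "tasks"
  run: () => Telemetry; // returns a partial telemetry document
}

// Runs every task in order, merges the partial documents, and records how
// long each task took under tasks.<name>.took.ms.
function collect(tasks: CollectionTask[], now: () => number = Date.now): Telemetry {
  const timings: Record<string, { took: { ms: number } }> = {};
  let result: Telemetry = {};
  for (const task of tasks) {
    const startedAt = now();
    result = { ...result, ...task.run() };
    timings[task.name] = { took: { ms: now() - startedAt } };
  }
  return { ...result, tasks: timings };
}
```

Recording the duration per task makes it possible to see which collection step is expensive on large clusters. In the sample above, `groupings` and `services` dominate.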

@elastic/apm Any thoughts/suggestions here? Here's how you can help:

It would be great to have this feedback sometime next week so we can finish this up.

@graphaelli
Member

This looks great. A few thoughts:

  • telemetry on the oldest retained data, per event type (transaction, metrics, ...), to help determine if the 1M stats are really over that time or actually much less time coverage
  • how did we pick 6M as a time?
  • stack_stats.kibana.plugins.apm.counts.services.all broken down over time, to indicate whether stats should be stable or whether to expect large changes
  • are stack_stats.kibana.plugins.apm.counts.max_error_groups_per_service.all and stack_stats.kibana.plugins.apm.counts.agent_configuration.all correct? They look too large in the sample data
  • Have you considered telemetry on spans per transaction, or on stack frames per error or span (per language)?
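The first suggestion comes down to a little arithmetic: if the oldest retained document is younger than the window, the window's count covers less time than its label implies. A hedged sketch, where `effectiveCoverageDays` is a hypothetical helper and the caller decides how many days a "month" is:

```typescript
const DAY_MS = 24 * 60 * 60 * 1000;

// Number of days within the window for which data can actually exist,
// given the timestamp of the oldest retained document.
function effectiveCoverageDays(
  oldestTimestampMs: number,
  nowMs: number,
  windowDays: number
): number {
  const ageDays = (nowMs - oldestTimestampMs) / DAY_MS;
  return Math.min(Math.max(ageDays, 0), windowDays);
}

// Example: the oldest document is 10 days old, so a 30-day window
// effectively covers only 10 days.
const coverage = effectiveCoverageDays(
  Date.UTC(2019, 11, 1),
  Date.UTC(2019, 11, 11),
  30
);
```

Reporting the oldest timestamp per event type would let a dashboard normalise counts by the effective coverage instead of the nominal window.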

@kobelb
Contributor

kobelb commented Dec 11, 2019

@elastic/kibana-stack-services do you all have any advice here? The APM team would like to collect telemetry regarding their "apm data" and usage. The APM "data indices" are configurable, and the kibana_system role doesn't, and shouldn't, have access to read from these data-indices as elaborated upon here.

I believe that Elasticsearch itself writes its own telemetry data that Kibana then consumes. Is Logstash or any other application doing something similar?

@alexfrancoeur

cc: @elastic/pulse

@kobelb
Contributor

kobelb commented Mar 10, 2020

Hey @ogupte, is my understanding correct that this is something that you'd like to implement for 7.7? Is it safe to assume that #51612 is the PR which would implement this functionality?

The @elastic/pulse team today was discussing how we should handle sending telemetry data for products besides Kibana, and this seems to fall into that category. I don't want to necessarily stall this effort while we continue to have this discussion, but I did want to make sure we weren't working ourselves into a corner.

@dgieselaar
Member Author

@kobelb We're aiming for 7.7, but status is not entirely clear atm - I'll send you a message.

@TinaHeiligers
Contributor

TinaHeiligers commented Mar 16, 2020

The @elastic/pulse team today was discussing how we should handle sending telemetry data for products besides Kibana, and this seems to fall into that category.

@dgieselaar and @kobelb The information above is super helpful in fleshing out additional Pulse service requirements for products besides Kibana. I've made notes from the whole discussion and will take them into account during planning specifications for these.

@droberts195
Contributor

droberts195 commented Mar 17, 2020

Sample document

...
 "integrations": {
   "ml": {
     "has_anomalies_indices": true
   }
 },
...

Sorry to be commenting late in the day, but instead of has_anomalies_indices would it be better to have something like has_apm_job_use? The reason is that anomalies indices are an internal implementation detail whereas jobs are the public interface. In other words, is the high level requirement to report when an APM job has been used?

elastic/elasticsearch#52917 (comment) is related.

EDIT:

would it be better to have something like has_apm_job_use?

elastic/elasticsearch#52917 (comment) and elastic/elasticsearch#52917 (comment) are related to this. Based on those comments it sounds like a better name would be something like has_apm_job.

dgieselaar added a commit to dgieselaar/kibana that referenced this issue Mar 18, 2020
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Mar 18, 2020
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Mar 18, 2020
ogupte added a commit to ogupte/elasticsearch that referenced this issue Mar 19, 2020
Allows the kibana user to collect APM telemetry in a background task.
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Mar 19, 2020
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Mar 19, 2020
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Mar 20, 2020
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Mar 20, 2020
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Mar 20, 2020
dgieselaar added a commit to dgieselaar/kibana that referenced this issue Mar 23, 2020
dgieselaar added a commit that referenced this issue Mar 23, 2020
* [APM] Collect telemetry about data/API performance

Closes #50757.

* Ignore apm scripts package.json

* Config flag for enabling/disabling telemetry collection
ogupte added a commit to elastic/elasticsearch that referenced this issue Mar 24, 2020
* Required for elastic/kibana#50757.
Allows the kibana user to collect APM telemetry in a background task.

* removed unnecessary priviledges on `.ml-anomalies-*` for the `kibana_system` reserved role
ogupte added a commit to elastic/elasticsearch that referenced this issue Mar 25, 2020
… (#54106)

* Required for elastic/kibana#50757.
Allows the kibana user to collect APM telemetry in a background task.

* removed unnecessary priviledges on `.ml-anomalies-*` for the `kibana_system` reserved role
2lambda123 pushed a commit to 2lambda123/elastic-elasticsearch that referenced this issue May 2, 2024
* Required for elastic/kibana#50757.
Allows the kibana user to collect APM telemetry in a background task.

* removed unnecessary priviledges on `.ml-anomalies-*` for the `kibana_system` reserved role