Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support to CEPH input. #3311

Merged
merged 26 commits into from
Jan 24, 2017
Merged

Support to CEPH input. #3311

merged 26 commits into from
Jan 24, 2017

Conversation

amandahla
Copy link

Using as reference Telegraf (https://github.com/influxdata/telegraf/blob/master/plugins/inputs/ceph/ceph.go). Working in progress.
#197

@elasticmachine
Copy link
Collaborator

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.

1 similar comment
@elasticmachine
Copy link
Collaborator

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.

enabled: true
period: 1s

ceph:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this working? This looks like invalid yaml. All config options related to a metricset should be in the module itself.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually no. I commited a new version with this file corrected now.

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for starting with this PR and adding support for CEPH to metricbeat. What is your current stage? Should we already have a detailed look at the PR or do you prefer first some high level feedback?

@@ -1,105 +0,0 @@
###################### Metricbeat Configuration Example #######################
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file should not be removed. Running make update should generate it again.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, new commit with this corrected.

@ruflin ruflin added in progress Pull request is currently in progress. Metricbeat Metricbeat labels Jan 9, 2017
@amandahla
Copy link
Author

Hi Ruflin, thanks for the review. I prefer a detailed look because everything is working but it can really be better. I had difficult to deal with json output from ceph and only tested with a container (https://hub.docker.com/r/ceph/demo/).
Any help / review is appreciated. :-)

@ruflin
Copy link
Contributor

ruflin commented Jan 10, 2017

I will have a closer look. Please make sure to run make fmt so all code adheres to the coding standards.

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the work on this. To make this testable it is important to add a testing environment and add integration tests.

Have a look at the docker-compose.yml file in metricbeat. There you can add the ceph service and then add tests for it. This will also make it easy to create the example data.json files and will directly show all the fields that are exported by these metricsets.

enabled: true
period: 1s

ceph:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This part of the config must be moved under the module. I assume it doesn't work the way it is at the moment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved.

# Consult the ceph documentation for more detail on keyring generation.
user = "client.admin"
# Ceph configuration to use to locate the cluster
config_path = "/etc/ceph/ceph.conf"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general we recommend metricbeat to connect to each node and get the specific metrics for each node instead from the cluster. What is inside this file? You should use hosts: ["..."] for that if possible.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used telegraf as reference and that only collects performance metrics from the MON and OSD nodes in a Ceph storage cluster. I can change the description, perhaps.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not aware of the inner workings of CEPH. Can you share some details on what MON and OSD nodes are?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A Ceph Storage Cluster requires at least one MON and at least 2 OSD.

A OSD (Object Storage Daemon) handles data (storage, replication, recovery...) and provide some monitoring information to MON by checking others OSD for a heartbeat.

A MON (Monitor) handles the cluster state (maps, history...).

You can have this architecture on one server/node but just for tests. Real productions environments works in a distributed way.

If you want metrics just for the daemons where metricbeat its collecting, you can use Admin Socket Stats. In our case here, it's the "perf" metricset.

If you want metrics for the whole cluster, you can use ceph commands. In our case here, it's the "status", "df" and "osdpoolstats" metricsets.

That's why I wrote those comments on config.yml. Depending on choosen metricset, you need just some parameters.

Copy link
Contributor

@ruflin ruflin Jan 12, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So perf is a local stat that should be collect by all metricbeat instances and the other 3 are cluster stats, means only one of the instances connecting needs to collect them? If there are multiple MON, will they provide different data?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"...means only one of the instances connecting needs to collect them?" Yes.
"If there are multiple MON, will they provide different data?" No, at least it's not expected this behavior.

enabled: true
period: 5s

ceph:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is this line for? Forgot to remove?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed. Useless line, sorry.

period: 5s

ceph:
# location of ceph binary
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest to split up the config into config.full.yml and config.yml and only have the extensive options in the full file. See other metricsets for examples.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working.

- module: ceph
metricsets: ["perf","status","df","osdpoolstats"]
enabled: true
period: 5s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default we use is 10s

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

logp.Err("An error occurred while parsing data for getting ceph df: %v", err)
}

stats_fields, ok := data["stats"].(map[string]interface{})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we use statsFields etc as naming ocnvention.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed.

This is the ceph Module.

We used code/reference from:
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/ceph/ceph.go
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make sure to not just copy / paste code from here. Or was this also created by you?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I copy/paste code and then adapted to collect metrics. It's really similar but not exactly equal. That's why I wrote this on docs.asciidoc. Is it a problem?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah: This referente on telegraf was not created by me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to be careful about the licenses here. Getting inspired by some code is one thing, copy / pasting it an other. If we have the same code, we should use it as a library ;-)

Copy link
Author

@amandahla amandahla Jan 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw that Telegraf uses MIT License. So, I understand that there is no problem to copy parts of the Software. Unfortunately, I don't think that it's possible to use as a library.
To get a better view what kind of copy I did, here it's a example. I changed the code to generate events.
Mine (line 40)
Telegraf (line 462)

description: >
df
fields:
- name: example
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fields docs have to be added. This will also make it easier for us to understand what kind of data is provided.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, just to be sure: is there a way to generate fields automatically? Or just manually?


func New(base mb.BaseMetricSet) (mb.MetricSet, error) {

config := ceph.Config{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the Ceph metricsets should be marked experimental first. We do this for all metricsets for the first release. See other metricsets for examples.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working.

},
"metricset":{
"host":"localhost",
"module":"mysql",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can generate this files by running make generate-json if you have created the integration test files which are responsible to create the data (see other metricsets). This requires a working docker environment.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Working.

@ruflin
Copy link
Contributor

ruflin commented Jan 11, 2017

@amandahla Ping me when it is ready for an other round of reviews.

@amandahla
Copy link
Author

Hi @ruflin
I made some changes here:

  • Removed "perf" metricset because our CEPH Admin told me that isn't so useful. So we have now just the ones from cluster.
  • Wrote fields.yml files.
  • Wrote integration tests.
  • Working on unity tests.
    Please, can you review?

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently this module depends on executing a binary which we normally don't do in Metricbeat. It should be changed to use the CEPH REST API which will also simplify the implementation and configuration. Are there any limitations using the REST API?

@@ -46,13 +47,26 @@ services:
volumes:
- ${PWD}/..:/go/src/github.com/elastic/beats/
- /var/run/docker.sock:/var/run/docker.sock
- datavolume:/etc/ceph
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need these data volumes on the client side?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll study the rest api but I think that we can remove all this and don't use command any more.

command: make
entrypoint: /go/src/github.com/elastic/beats/metricbeat/docker-entrypoint.sh

# Modules
apache:
build: ${PWD}/module/apache/_meta

ceph:
image: ceph/demo
container_name: metricbeatceph
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not give the container a special name. This should not be needed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above :)
"I'll study the rest api but I think that we can remove all this and don't use command any more."

@@ -91,3 +105,8 @@ services:

zookeeper:
image: jplock/zookeeper:3.4.8

volumes:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these volumes are ceph specific, we should have better names. If possible, I would prefer not to have "global" volumes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above :)
I'll study the rest api but I think that we can remove all this and don't use command any more.


This is the ceph Module.

We used code/reference from:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the reference is only used to get inspiration, these should be removed.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since I'll change to use Rest API, this reference will be no longer necessary indeed. I'll remove.

var output string

if strings.Contains(c.BinaryPath, "docker") {
cmdCeph := []string{"/usr/bin/ceph"}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not execute binaries whenever possible. As far as I can see ceph also has a REST API to fetch monitoring data: http://ceph.com/planet/experimenting-with-the-ceph-rest-api/ This should be used to fetch the data. Calling a binary requires CEPH to be installed on the machine. This works in most cases but breaks for example in docker environments, where on container is used to monitor the other containers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW: Why is the execution different if it is in a container or not?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If is a container, I needed to call the ceph command inside him so that's why was different. In container, command will be something like "docker exec -i container_name /usr/bin/ceph ...".
With REST api, I can remove this too.

"mons":[
{
"name":"prcdsrvv1619",
"kb_total":6281216,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please check our naming conventions: https://www.elastic.co/guide/en/beats/libbeat/5.1/event-conventions.html

Whenever possible we should no use nested arrays. I think we need to have a closer look at the data set and potentially resturcture it to fit into metricbeat.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So far, I see that with REST API the output it's more simple. But I'll have a closer look at this.

var defaultConfig = CephConfig{
BinaryPath: "/usr/bin/ceph",
User: "client.admin",
ConfigPath: "/etc/ceph/ceph.conf",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use the REST API using config etc. becomes obsolete.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. :)

@ruflin
Copy link
Contributor

ruflin commented Jan 17, 2017

@amandahla Keep me posted about the REST API

@amandahla
Copy link
Author

@ruflin Thanks for showing me the CEPH Rest API. I was so focused on use the same way that Telegraf did, that didn't even think about searching for another way.
Now I'll take a bit more time to change everything, but I hope to finish soon. I'll keep this PR open but when I finish, I'll open another, ok?

@ruflin
Copy link
Contributor

ruflin commented Jan 17, 2017

You can also just push the updated version to this branch. No worries if you overwrite the current commits. I think the title stays the same anyways ;-)

@ruflin
Copy link
Contributor

ruflin commented Jan 18, 2017

@amandahla Your last commit looks very promising. This heavily simplifies the implementation 👍 Let me know if you should have another detailed look.

@amandahla
Copy link
Author

Hi @ruflin
Thanks!
I think that I'll let this PR only with one MetricSet ("health") like you suggested before. Please, can you review?
It's really much better to use the Ceph Rest API!

@ruflin
Copy link
Contributor

ruflin commented Jan 19, 2017

Quick note on the failing travis build: Run make fmt to clean up the source code and make update to update all generated files. It should go green afterwards.

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change looks good to me. I left some comments on the naming of the fields. We are very close to get this in. Thanks a lot for going forward with this.

This is the ceph Module. Metrics are collected submitting HTTP GET requests to ceph-rest-api.
Reference: http://docs.ceph.com/docs/master/man/8/ceph-rest-api/

Thanks!
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove this line.

type: keyword
description: >
Status of the round
- name: mon.avail_percent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would change this to mon.available.pct. See https://www.elastic.co/guide/en/beats/libbeat/5.1/event-conventions.html

type: keyword
description: >
Health of the MON
- name: mon.kb_avail
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mon.available.kb

type: long
description: >
Available KB of the MON
- name: mon.kb_total
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mon.total.kb

type: long
description: >
Total KB of the MON
- name: mon.kb_used
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mon.used.kb

type: long
description: >
Log bytes of MON
- name: mon.store_stats.bytes_misc
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mon.store_stats.misc.bytes

You can also add format: bytes to all bytes value, so Kibana will automatically know how to convert it to MB etc.

    - name: mon.store_stats.sst.bytes
      type: long
      description: >
        SST bytes of MON
      format: bytes

type: long
description: >
Misc bytes of MON
- name: mon.store_stats.bytes_sst
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same with the bytes for the following two

"encoding/json"
"github.com/elastic/beats/libbeat/common"
"github.com/elastic/beats/libbeat/logp"
"io"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we normally but the external imports on the two following and empty line before the internal once


events := []common.MapStr{}

event := common.MapStr{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could probably use the schema package here to do the event conversion, but we can do this cleanup in a second step.


event := common.MapStr{
"cluster.overall_status": d.Output.OverallStatus,
"cluster.timechecks": common.MapStr{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You must use here "cluster" : common.MapStr{ "timechecks" : common.MapStr{ ... as otherwise the event will not be compatible with elasticsearch 2.x version which do not support dots in fields.

@amandahla
Copy link
Author

Thanks @ruflin !
I made the changes that you asked and created a unity test. I hope that will be ok now (I always forget the make fmt/update, sorry!).

Copy link
Contributor

@ruflin ruflin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The only thing I see missing is a quick system tests that checks if all fields are as expected. See https://github.com/elastic/beats/blob/master/metricbeat/tests/system/test_haproxy.py#L34 But we can also add this as a follow up PR.

func TestFetchEventContents(t *testing.T) {
absPath, err := filepath.Abs("./testdata/")

response, err := ioutil.ReadFile(absPath + "/sample_response.json")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great way to test the fetcher with predefined data. We should apply the same logic to other metricsets.

@ruflin
Copy link
Contributor

ruflin commented Jan 19, 2017

jenkins, test it

@amandahla
Copy link
Author

Hello @ruflin . I'm thinking about start on a new metricset now. Please, is this one ok so I can use as reference?

@ruflin
Copy link
Contributor

ruflin commented Jan 24, 2017

@amandahla will review just now and let you know.

@ruflin ruflin merged commit f9031b9 into elastic:master Jan 24, 2017
@ruflin
Copy link
Contributor

ruflin commented Jan 24, 2017

@amandahla Just merged it. I will do a follow up PR with some cleanup and will ping you on it. So you can take this directly into account in your next metricset. Thanks for pushing this forward.


events = append(events, event)

for _, HealthService := range d.Output.Health.HealthServices {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amandahla I just realised now that we have a problem here. We send two different event types as part of the same metricset. We need to extract this part here into its own metricset. Not should war we should call it. monitor?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ruflin Sorry, I didn't notice that would be a problem because all data is related and is obtained with only one command.
Answering your question: Yes, I think that "monitor" its good.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@amandahla As we send it here in separate events anyways, it doesn't make a difference if it is a different metricset. Can you open a PR with an additional metricset? Sorry that I missed that one during the review.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in progress Pull request is currently in progress. Metricbeat Metricbeat
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants