-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support to CEPH input. #3311
Support to CEPH input. #3311
Conversation
Conflicts: metricbeat/metricbeat.yml
Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run. |
1 similar comment
Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run. |
enabled: true | ||
period: 1s | ||
|
||
ceph: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this working? This looks like invalid yaml. All config options related to a metricset should be in the module itself.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually no. I commited a new version with this file corrected now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for starting with this PR and adding support for CEPH to metricbeat. What is your current stage? Should we already have a detailed look at the PR or do you prefer first some high level feedback?
@@ -1,105 +0,0 @@ | |||
###################### Metricbeat Configuration Example ####################### |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file should not be removed. Running make update
should generate it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, new commit with this corrected.
Hi Ruflin, thanks for the review. I prefer a detailed look because everything is working but it can really be better. I had difficult to deal with json output from ceph and only tested with a container (https://hub.docker.com/r/ceph/demo/). |
I will have a closer look. Please make sure to run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the work on this. To make this testable it is important to add a testing environment and add integration tests.
Have a look at the docker-compose.yml file in metricbeat. There you can add the ceph service and then add tests for it. This will also make it easy to create the example data.json
files and will directly show all the fields that are exported by these metricsets.
enabled: true | ||
period: 1s | ||
|
||
ceph: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of the config must be moved under the module. I assume it doesn't work the way it is at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved.
# Consult the ceph documentation for more detail on keyring generation. | ||
user = "client.admin" | ||
# Ceph configuration to use to locate the cluster | ||
config_path = "/etc/ceph/ceph.conf" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general we recommend metricbeat to connect to each node and get the specific metrics for each node instead from the cluster. What is inside this file? You should use hosts: ["..."]
for that if possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used telegraf as reference and that only collects performance metrics from the MON and OSD nodes in a Ceph storage cluster. I can change the description, perhaps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not aware of the inner workings of CEPH. Can you share some details on what MON and OSD nodes are?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A Ceph Storage Cluster requires at least one MON and at least 2 OSD.
A OSD (Object Storage Daemon) handles data (storage, replication, recovery...) and provide some monitoring information to MON by checking others OSD for a heartbeat.
A MON (Monitor) handles the cluster state (maps, history...).
You can have this architecture on one server/node but just for tests. Real productions environments works in a distributed way.
If you want metrics just for the daemons where metricbeat its collecting, you can use Admin Socket Stats. In our case here, it's the "perf" metricset.
If you want metrics for the whole cluster, you can use ceph commands. In our case here, it's the "status", "df" and "osdpoolstats" metricsets.
That's why I wrote those comments on config.yml. Depending on choosen metricset, you need just some parameters.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So perf
is a local stat that should be collect by all metricbeat instances and the other 3 are cluster stats, means only one of the instances connecting needs to collect them? If there are multiple MON, will they provide different data?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"...means only one of the instances connecting needs to collect them?" Yes.
"If there are multiple MON, will they provide different data?" No, at least it's not expected this behavior.
enabled: true | ||
period: 5s | ||
|
||
ceph: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is this line for? Forgot to remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed. Useless line, sorry.
period: 5s | ||
|
||
ceph: | ||
# location of ceph binary |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to split up the config into config.full.yml and config.yml and only have the extensive options in the full file. See other metricsets for examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working.
- module: ceph | ||
metricsets: ["perf","status","df","osdpoolstats"] | ||
enabled: true | ||
period: 5s |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default we use is 10s
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
logp.Err("An error occurred while parsing data for getting ceph df: %v", err) | ||
} | ||
|
||
stats_fields, ok := data["stats"].(map[string]interface{}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we use statsFields
etc as naming ocnvention.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed.
This is the ceph Module. | ||
|
||
We used code/reference from: | ||
https://github.com/influxdata/telegraf/blob/master/plugins/inputs/ceph/ceph.go |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should make sure to not just copy / paste code from here. Or was this also created by you?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually I copy/paste code and then adapted to collect metrics. It's really similar but not exactly equal. That's why I wrote this on docs.asciidoc. Is it a problem?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah: This referente on telegraf was not created by me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to be careful about the licenses here. Getting inspired by some code is one thing, copy / pasting it an other. If we have the same code, we should use it as a library ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I saw that Telegraf uses MIT License. So, I understand that there is no problem to copy parts of the Software. Unfortunately, I don't think that it's possible to use as a library.
To get a better view what kind of copy I did, here it's a example. I changed the code to generate events.
Mine (line 40)
Telegraf (line 462)
description: > | ||
df | ||
fields: | ||
- name: example |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fields docs have to be added. This will also make it easier for us to understand what kind of data is provided.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please, just to be sure: is there a way to generate fields automatically? Or just manually?
|
||
func New(base mb.BaseMetricSet) (mb.MetricSet, error) { | ||
|
||
config := ceph.Config{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All the Ceph metricsets should be marked experimental first. We do this for all metricsets for the first release. See other metricsets for examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working.
}, | ||
"metricset":{ | ||
"host":"localhost", | ||
"module":"mysql", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can generate this files by running make generate-json
if you have created the integration
test files which are responsible to create the data (see other metricsets). This requires a working docker environment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Working.
Config reading changed.
@amandahla Ping me when it is ready for an other round of reviews. |
Hi @ruflin
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently this module depends on executing a binary which we normally don't do in Metricbeat. It should be changed to use the CEPH REST API which will also simplify the implementation and configuration. Are there any limitations using the REST API?
@@ -46,13 +47,26 @@ services: | |||
volumes: | |||
- ${PWD}/..:/go/src/github.com/elastic/beats/ | |||
- /var/run/docker.sock:/var/run/docker.sock | |||
- datavolume:/etc/ceph |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need these data volumes on the client side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll study the rest api but I think that we can remove all this and don't use command any more.
command: make | ||
entrypoint: /go/src/github.com/elastic/beats/metricbeat/docker-entrypoint.sh | ||
|
||
# Modules | ||
apache: | ||
build: ${PWD}/module/apache/_meta | ||
|
||
ceph: | ||
image: ceph/demo | ||
container_name: metricbeatceph |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not give the container a special name. This should not be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above :)
"I'll study the rest api but I think that we can remove all this and don't use command any more."
@@ -91,3 +105,8 @@ services: | |||
|
|||
zookeeper: | |||
image: jplock/zookeeper:3.4.8 | |||
|
|||
volumes: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these volumes are ceph specific, we should have better names. If possible, I would prefer not to have "global" volumes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above :)
I'll study the rest api but I think that we can remove all this and don't use command any more.
|
||
This is the ceph Module. | ||
|
||
We used code/reference from: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the reference is only used to get inspiration, these should be removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since I'll change to use Rest API, this reference will be no longer necessary indeed. I'll remove.
var output string | ||
|
||
if strings.Contains(c.BinaryPath, "docker") { | ||
cmdCeph := []string{"/usr/bin/ceph"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not execute binaries whenever possible. As far as I can see ceph also has a REST API to fetch monitoring data: http://ceph.com/planet/experimenting-with-the-ceph-rest-api/ This should be used to fetch the data. Calling a binary requires CEPH to be installed on the machine. This works in most cases but breaks for example in docker environments, where on container is used to monitor the other containers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW: Why is the execution different if it is in a container or not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If is a container, I needed to call the ceph command inside him so that's why was different. In container, command will be something like "docker exec -i container_name /usr/bin/ceph ...".
With REST api, I can remove this too.
"mons":[ | ||
{ | ||
"name":"prcdsrvv1619", | ||
"kb_total":6281216, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please check our naming conventions: https://www.elastic.co/guide/en/beats/libbeat/5.1/event-conventions.html
Whenever possible we should no use nested arrays. I think we need to have a closer look at the data set and potentially resturcture it to fit into metricbeat.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So far, I see that with REST API the output it's more simple. But I'll have a closer look at this.
var defaultConfig = CephConfig{ | ||
BinaryPath: "/usr/bin/ceph", | ||
User: "client.admin", | ||
ConfigPath: "/etc/ceph/ceph.conf", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use the REST API using config etc. becomes obsolete.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. :)
@amandahla Keep me posted about the REST API |
@ruflin Thanks for showing me the CEPH Rest API. I was so focused on use the same way that Telegraf did, that didn't even think about searching for another way. |
You can also just push the updated version to this branch. No worries if you overwrite the current commits. I think the title stays the same anyways ;-) |
@amandahla Your last commit looks very promising. This heavily simplifies the implementation 👍 Let me know if you should have another detailed look. |
Hi @ruflin |
Quick note on the failing travis build: Run |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change looks good to me. I left some comments on the naming of the fields. We are very close to get this in. Thanks a lot for going forward with this.
This is the ceph Module. Metrics are collected submitting HTTP GET requests to ceph-rest-api. | ||
Reference: http://docs.ceph.com/docs/master/man/8/ceph-rest-api/ | ||
|
||
Thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this line.
type: keyword | ||
description: > | ||
Status of the round | ||
- name: mon.avail_percent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would change this to mon.available.pct
. See https://www.elastic.co/guide/en/beats/libbeat/5.1/event-conventions.html
type: keyword | ||
description: > | ||
Health of the MON | ||
- name: mon.kb_avail |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mon.available.kb
type: long | ||
description: > | ||
Available KB of the MON | ||
- name: mon.kb_total |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mon.total.kb
type: long | ||
description: > | ||
Total KB of the MON | ||
- name: mon.kb_used |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mon.used.kb
type: long | ||
description: > | ||
Log bytes of MON | ||
- name: mon.store_stats.bytes_misc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mon.store_stats.misc.bytes
You can also add format: bytes
to all bytes value, so Kibana will automatically know how to convert it to MB etc.
- name: mon.store_stats.sst.bytes
type: long
description: >
SST bytes of MON
format: bytes
type: long | ||
description: > | ||
Misc bytes of MON | ||
- name: mon.store_stats.bytes_sst |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same with the bytes for the following two
"encoding/json" | ||
"github.com/elastic/beats/libbeat/common" | ||
"github.com/elastic/beats/libbeat/logp" | ||
"io" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we normally but the external imports on the two following and empty line before the internal once
|
||
events := []common.MapStr{} | ||
|
||
event := common.MapStr{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could probably use the schema package here to do the event conversion, but we can do this cleanup in a second step.
|
||
event := common.MapStr{ | ||
"cluster.overall_status": d.Output.OverallStatus, | ||
"cluster.timechecks": common.MapStr{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You must use here "cluster" : common.MapStr{ "timechecks" : common.MapStr{ ...
as otherwise the event will not be compatible with elasticsearch 2.x version which do not support dots in fields.
Thanks @ruflin ! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. The only thing I see missing is a quick system tests that checks if all fields are as expected. See https://github.com/elastic/beats/blob/master/metricbeat/tests/system/test_haproxy.py#L34 But we can also add this as a follow up PR.
func TestFetchEventContents(t *testing.T) { | ||
absPath, err := filepath.Abs("./testdata/") | ||
|
||
response, err := ioutil.ReadFile(absPath + "/sample_response.json") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great way to test the fetcher with predefined data. We should apply the same logic to other metricsets.
jenkins, test it |
Hello @ruflin . I'm thinking about start on a new metricset now. Please, is this one ok so I can use as reference? |
@amandahla will review just now and let you know. |
@amandahla Just merged it. I will do a follow up PR with some cleanup and will ping you on it. So you can take this directly into account in your next metricset. Thanks for pushing this forward. |
|
||
events = append(events, event) | ||
|
||
for _, HealthService := range d.Output.Health.HealthServices { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amandahla I just realised now that we have a problem here. We send two different event types as part of the same metricset. We need to extract this part here into its own metricset. Not should war we should call it. monitor
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ruflin Sorry, I didn't notice that would be a problem because all data is related and is obtained with only one command.
Answering your question: Yes, I think that "monitor" its good.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@amandahla As we send it here in separate events anyways, it doesn't make a difference if it is a different metricset. Can you open a PR with an additional metricset? Sorry that I missed that one during the review.
Using as reference Telegraf (https://github.com/influxdata/telegraf/blob/master/plugins/inputs/ceph/ceph.go). Working in progress.
#197