Batch never joined #362

Closed
AlexGrs opened this issue Mar 21, 2016 · 14 comments

Comments

@AlexGrs

AlexGrs commented Mar 21, 2016

I'm still trying to get a working version of my alert on test success rates. Because stream alerts are subject to long delays, I was thinking of using a batch alert instead.

I can now get the success and total counts in real time. But as soon as I try to join them to compute a success rate, nothing happens: I never get any alert from the join, even though the success and total counts are printed correctly just before it.

The success and total points seem to have the same timestamp (at the very least, they should be matched by the tolerance option).

[Screenshot taken 2016-03-21 at 18:34:07]

Here is my definition:

var success = batch
    .query('''SELECT count(value) FROM "metrics_data"."default"."c_test_status" WHERE status='success' AND test_id='34' ''')
    .period(6m)
    .every(1m)
    .groupBy('test_id', 'platform_id')
    .fill('previous')

var total = batch
    .query('''SELECT count(value) FROM "metrics_data"."default"."c_test_status" WHERE test_id='34' ''')
    .period(6m)
    .every(1m)
    .groupBy('test_id', 'platform_id')
    .fill('previous')

success.alert()
       .id('kapacitor/{{ .TaskName }}/{{ index .Tags }}/{{.Name}}')
       .message('{{ .Time }} - {{ .ID }} is {{ .Level}} value: {{ index .Fields }}')
       .crit(lambda: TRUE)
       .slack()

total.alert()
       .id('kapacitor/{{ .TaskName }}/{{ index .Tags }}/{{.Name}}')
       .message('{{ .Time }} - {{ .ID }} is {{ .Level}} TOTAL: {{ index .Fields }}')
       .crit(lambda: TRUE)
       .slack()

total.join(success)
    .as('test_total', 'test_success')
    .tolerance(2m)
    .alert()
    .id('kapacitor/{{ .TaskName }}/{{ index .Tags }}/{{.Name}}')
    .message('{{ .Time }} - {{ .ID }} is {{ .Level}} SUCCESS RATE: {{ index .Fields }}')
    .crit(lambda: TRUE)
    .slack()
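
For readers unfamiliar with what `.tolerance(2m)` is meant to achieve, here is a minimal Python sketch of the idea: two points are treated as simultaneous when their timestamps round to the same multiple of the tolerance, and the joined point carries prefixed fields from which a success rate can be computed. This only illustrates the concept. It is not Kapacitor's join implementation, and the function names and sample values are made up for illustration.

```python
def round_to_tolerance(ts, tolerance):
    """Round an epoch timestamp (seconds) to the nearest multiple of tolerance."""
    return ((ts + tolerance // 2) // tolerance) * tolerance


def join_with_tolerance(left, right, tolerance,
                        names=("test_total", "test_success")):
    """Inner-join two lists of (timestamp, count) pairs on rounded timestamps."""
    right_by_time = {round_to_tolerance(ts, tolerance): count
                     for ts, count in right}
    joined = []
    for ts, count in left:
        key = round_to_tolerance(ts, tolerance)
        if key in right_by_time:
            joined.append({
                "time": key,
                f"{names[0]}.count": count,
                f"{names[1]}.count": right_by_time[key],
            })
    return joined


def success_rate(point):
    """Return success/total for a joined point, or None if total is zero."""
    total_count = point["test_total.count"]
    if not total_count:
        return None
    return point["test_success.count"] / total_count


# Hypothetical sample data: the two series are a few seconds apart.
total = [(1458581640, 12), (1458581700, 14)]
success = [(1458581655, 9), (1458581712, 14)]

joined = join_with_tolerance(total, success, tolerance=120)
rates = [success_rate(p) for p in joined]
print(rates)
```

With a 120-second tolerance both pairs of points land in the same bucket, so the join emits two points. In the live task described here, the equivalent join emitted none.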

DOT:
digraph success_rate {
graph [throughput="0.00 batches/s"];

batch2 [avg_exec_time_ns="0" ];
batch2 -> alert4 [processed="2"];
batch2 -> join6 [processed="2"];

alert4 [alerts_triggered="2" avg_exec_time_ns="0" ];

batch1 [avg_exec_time_ns="0" ];
batch1 -> join6 [processed="2"];
batch1 -> alert3 [processed="2"];

join6 [avg_exec_time_ns="4.668µs" ];
join6 -> alert7 [processed="0"];

alert7 [alerts_triggered="0" avg_exec_time_ns="0" ];
alert3 [alerts_triggered="2" avg_exec_time_ns="0" ];
}
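
The DOT output above already localizes the problem: `join6` receives two batches from each parent but forwards none to `alert7`. As a rough sketch, a small Python helper (hypothetical, not part of Kapacitor) can pull the `processed` counters out of such output and flag nodes that receive data but emit nothing:

```python
import re
from collections import defaultdict

# Matches edges such as: batch2 -> join6 [processed="2"];
EDGE_RE = re.compile(r'(\w+)\s*->\s*(\w+)\s*\[processed="(\d+)"\]')


def stalled_nodes(dot_text):
    """Return nodes whose inputs are nonzero but whose outputs are all zero."""
    received = defaultdict(int)
    emitted = defaultdict(int)
    has_output = set()
    for src, dst, count in EDGE_RE.findall(dot_text):
        n = int(count)
        received[dst] += n
        emitted[src] += n
        has_output.add(src)
    return sorted(node for node in has_output
                  if received[node] > 0 and emitted[node] == 0)


# Edge lines copied from the DOT output in this issue.
DOT = '''
batch2 -> alert4 [processed="2"];
batch2 -> join6 [processed="2"];
batch1 -> join6 [processed="2"];
batch1 -> alert3 [processed="2"];
join6 -> alert7 [processed="0"];
'''

result = stalled_nodes(DOT)
print(result)
```
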
@nathanielc
Contributor

@AlexGrs I have looked over this several times now and do not see anything wrong. My guess is that there is a bug in the join code. Could you create a recording of your data so I can try to reproduce this locally?

Run this to record the batch data:
kapacitor record batch -name success_rate -past 20m

Grab the returned ID.

Then, in the Kapacitor data dir (typically /var/lib/kapacitor or ~/.kapacitor), there is a dir called replay, and in it there will be a file named <ID>.brpl. Can you share that file with me? Feel free to send it to me directly at nathaniel@influxdb.com if you like.
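
As a convenience, the lookup described above can be sketched in Python. The helper name `find_recording` is hypothetical, and the default directories are simply the two typical locations mentioned in this comment. The demo runs against a temporary directory rather than a real Kapacitor installation.

```python
import tempfile
from pathlib import Path


def find_recording(recording_id, data_dirs=None):
    """Return the path of <ID>.brpl in the first data dir that has it, else None."""
    if data_dirs is None:
        # Typical locations named in this thread; adjust for your setup.
        data_dirs = [Path("/var/lib/kapacitor"), Path.home() / ".kapacitor"]
    for data_dir in data_dirs:
        candidate = Path(data_dir) / "replay" / f"{recording_id}.brpl"
        if candidate.is_file():
            return candidate
    return None


# Demo against a throwaway directory with a fake recording file.
with tempfile.TemporaryDirectory() as tmp:
    replay_dir = Path(tmp) / "replay"
    replay_dir.mkdir()
    (replay_dir / "example-id.brpl").touch()

    found = find_recording("example-id", [tmp])
    found_name = found.name if found else None
    missing = find_recording("other-id", [tmp])

print(found_name, missing)
```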

@AlexGrs
Author

AlexGrs commented Mar 21, 2016

Recording in progress. As soon as I have the file I'll send it to your email address.

Done. You should have received the recordings.

@AlexGrs
Author

AlexGrs commented Mar 22, 2016

Tell me if you need more info to reproduce the problem 😄 I'm sure we will manage to solve this!

@nathanielc
Contributor

@AlexGrs What version of Kapacitor are you using?

kapacitord version ...

@AlexGrs
Author

AlexGrs commented Mar 22, 2016

kapacitord version
Kapacitor 0.11.0rc1 (git: master 0b43cc3)

@nathanielc
Contributor

hmmm, well it works for me.

I have used both recordings you sent and I am using the same version and it is working for me. I get several alerts from the joined alert. I did add a .log to the last alert so I could see the alerts without using slack. Does it work for you in a replay?

Try this:
kapacitor replay -name success_rate -id 66cbec74-3f20-4322-b489-f2981f3abbbe -fast -rec-time

@AlexGrs
Author

AlexGrs commented Mar 22, 2016

I switched to a log alert.

It works when I replay it: this is the first time I can see those messages (...CRITICAL SUCCESS RATE: map...).

But when I do the same with 'real time' metrics, it doesn't work. I only see CRITICAL and CRITICAL TOTAL.

@AlexGrs
Author

AlexGrs commented Mar 22, 2016

OK, I can confirm. I just recorded the data for 10 minutes. At the same time, I was checking my alert log in real time. I only saw alerts coming from my two debug alerts (success and total), but nothing from the join. Then I used your command to replay the recording, and this time I also saw the join alerts.

@nathanielc
Contributor

@AlexGrs Thanks. I am stumped as to how it fails for an enabled task but works during a replay. The purpose of the replay system is to be identical to the real thing so that tasks can be tested; the two vary only in how data is ingested into the task, which should not affect how the join functions... I am digging in further and I'll keep you posted.

@AlexGrs
Author

AlexGrs commented Mar 22, 2016

I agree. I'm also stuck on this point. I tried again and see the same thing: not working live, but working with replay. There is clearly something I don't get here. Could it be linked to the ingestion speed, which differs from the real case?

@nathanielc
Contributor

@AlexGrs Good news, I can reproduce the issue. I'll update when I have a fix....

@AlexGrs
Author

AlexGrs commented Mar 22, 2016

Good news !! I will be really happy to test the fix 👓

nathanielc pushed a commit that referenced this issue Mar 22, 2016
Fix #362, join+batches+tolerance did not work
@nathanielc
Contributor

@AlexGrs The fix has been merged; it was pretty obvious once I found it. Feel free to test using master or tonight's nightly builds. Thanks.

@AlexGrs
Author

AlexGrs commented Mar 23, 2016

Indeed it works perfectly :)
