Batch never joined #362

AlexGrs · 2016-03-21T16:19:31Z

I'm still trying to get a working version of my alert for my test success rates. Because there are a lot of delays in alerts with stream, I was thinking about using a batch alert.

I can now get the success and total counts in real time. But as soon as I try to join them to compute a success rate, nothing happens. I never get any alert from the join, even though the success and total counts are printed correctly just before it.

The success and total points seem to have the same timestamp (or at least they should be matched by the tolerance option).

Here is my definition:

var success = batch
    .query('''SELECT count(value) FROM "metrics_data"."default"."c_test_status" WHERE status='success' AND test_id='34' ''')
    .period(6m)
    .every(1m)
    .groupBy('test_id', 'platform_id')
    .fill('previous')

var total = batch
    .query('''SELECT count(value) FROM "metrics_data"."default"."c_test_status" WHERE test_id='34' ''')
    .period(6m)
    .every(1m)
    .groupBy('test_id', 'platform_id')
    .fill('previous')

success.alert()
    .id('kapacitor/{{ .TaskName }}/{{ index .Tags }}/{{.Name}}')
    .message('{{ .Time }} - {{ .ID }} is {{ .Level}} value: {{ index .Fields }}')
    .crit(lambda: TRUE)
    .slack()

total.alert()
    .id('kapacitor/{{ .TaskName }}/{{ index .Tags }}/{{.Name}}')
    .message('{{ .Time }} - {{ .ID }} is {{ .Level}} TOTAL: {{ index .Fields }}')
    .crit(lambda: TRUE)
    .slack()

total.join(success)
    .as('test_total', 'test_success')
    .tolerance(2m)
    .alert()
    .id('kapacitor/{{ .TaskName }}/{{ index .Tags }}/{{.Name}}')
    .message('{{ .Time }} - {{ .ID }} is {{ .Level}} SUCCESS RATE: {{ index .Fields }}')
    .crit(lambda: TRUE)
    .slack()

DOT:
digraph success_rate {
graph [throughput="0.00 batches/s"];

batch2 [avg_exec_time_ns="0" ];
batch2 -> alert4 [processed="2"];
batch2 -> join6 [processed="2"];

alert4 [alerts_triggered="2" avg_exec_time_ns="0" ];

batch1 [avg_exec_time_ns="0" ];
batch1 -> join6 [processed="2"];
batch1 -> alert3 [processed="2"];

join6 [avg_exec_time_ns="4.668µs" ];
join6 -> alert7 [processed="0"];

alert7 [alerts_triggered="0" avg_exec_time_ns="0" ];
alert3 [alerts_triggered="2" avg_exec_time_ns="0" ];
}
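
For readers following along: the join in the definition above is expected to pair each total point with the success point that has the same group tags and a timestamp within the 2m tolerance. The Python sketch below illustrates that expectation only; it is not Kapacitor's implementation. It assumes timestamps are rounded to the nearest multiple of the tolerance before matching, and the function names, the platform_id value, and the sample counts are all hypothetical.

```python
def round_to(ts, tolerance):
    """Round an integer timestamp (seconds) to the nearest multiple of tolerance."""
    return ((ts + tolerance // 2) // tolerance) * tolerance


def join_with_tolerance(total, success, tolerance):
    """Pair points whose rounded timestamps and group tags both match.

    Each point is a (timestamp_seconds, tags, count) triple.
    Returns a list of (rounded_ts, tags, total_count, success_count).
    """
    # Index the success points by (rounded time, tags).
    buckets = {}
    for ts, tags, count in success:
        buckets[(round_to(ts, tolerance), tags)] = count

    joined = []
    for ts, tags, count in total:
        key = (round_to(ts, tolerance), tags)
        if key in buckets:  # emit a joined point only when both sides exist
            joined.append((key[0], tags, count, buckets[key]))
    return joined


def success_rates(joined):
    """Compute success / total for every joined point with a non-zero total."""
    return [(ts, tags, s / t) for ts, tags, t, s in joined if t > 0]


TOLERANCE = 120  # 2m, as in .tolerance(2m)
TAGS = (("platform_id", "1"), ("test_id", "34"))  # hypothetical group tags

# Hypothetical sample data: timestamps differ slightly between the two batches.
total = [(600, TAGS, 10), (720, TAGS, 12)]
success = [(610, TAGS, 8), (715, TAGS, 9)]

joined = join_with_tolerance(total, success, TOLERANCE)
rates = success_rates(joined)
```

With this sample data both pairs fall into the same rounded time slot, so a working join would emit two points and a success rate could be computed from each. In the DOT output above, join6 received two batches from each parent but passed zero points to alert7.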


nathanielc · 2016-03-21T19:39:21Z

@AlexGrs I have looked over this several times now and do not see anything wrong. My thinking is that there is a bug in the join code. Could you create a recording of your data so I can try to reproduce this locally?

Run this to record the batch data
kapacitor record batch -name success_rate -past 20m

Grab the returned ID

Then, in the Kapacitor data dir (typically /var/lib/kapacitor or ~/.kapacitor), there is a dir called replay, and in there will be a file named <ID>.brpl. Can you share that file with me? Feel free to send it to me directly at nathaniel@influxdb.com if you like.

AlexGrs · 2016-03-21T21:53:25Z

Recording on-going. As soon as I have the file I'll send it to your email address.

Done. You have received the records.

AlexGrs · 2016-03-22T14:29:09Z

Tell me if you need more info to reproduce the problem 😄 I'm sure we will manage to solve this!

nathanielc · 2016-03-22T15:48:22Z

@AlexGrs What version of Kapacitor are you using?

kapacitord version ...

AlexGrs · 2016-03-22T15:49:51Z

kapacitord version
Kapacitor 0.11.0rc1 (git: master 0b43cc3)

nathanielc · 2016-03-22T15:56:49Z

hmmm, well it works for me.

I have used both recordings you sent and I am using the same version and it is working for me. I get several alerts from the joined alert. I did add a .log to the last alert so I could see the alerts without using slack. Does it work for you in a replay?

Try this:
kapacitor replay -name success_rate -id 66cbec74-3f20-4322-b489-f2981f3abbbe -fast -rec-time

AlexGrs · 2016-03-22T16:06:49Z

I switched to log alert.

It works when I replay it: it is the first time I can see those messages (...CRITICAL SUCCESS RATE: map...)

But when I do that with 'real time' metrics, it doesn't work. I only see CRITICAL and CRITICAL TOTAL.

AlexGrs · 2016-03-22T16:17:17Z

OK, I can confirm. I just recorded the data for 10 minutes. At the same time I was checking my alert log in real time. I only saw alerts coming from my two debug streams but nothing from the join. Then I used your command to replay it, and I also saw the join alerts.

nathanielc · 2016-03-22T16:19:58Z

@AlexGrs Thanks, I am stumped on how it doesn't work for an enabled task but does during a replay. The purpose of the replay system is to be identical to the real thing so that tasks can be tested. The two vary only in how data is ingested into the task, which should not affect how the join functions... I am digging in further and I'll keep you posted.

AlexGrs · 2016-03-22T16:51:32Z

I agree. I'm also stuck on this point. I tried again and see the same thing: not working live, but working with replay. There clearly is something I don't get here. Could it be linked to the ingestion speed, which is different from a real case?

nathanielc · 2016-03-22T17:02:37Z

@AlexGrs Good news, I can reproduce the issue. I'll update when I have a fix....

AlexGrs · 2016-03-22T17:03:13Z

Good news !! I will be really happy to test the fix 👓

Fix #362, join+batches+tolerance did not work

nathanielc · 2016-03-22T20:07:11Z

@AlexGrs The fix has been merged; it was pretty obvious once I found it. Feel free to test using master, or the nightly builds tonight. Thanks

AlexGrs · 2016-03-23T09:49:24Z

Indeed it works perfectly :)

AlexGrs mentioned this issue Mar 21, 2016

Delay in alerts #361

Closed

nathanielc added a commit that referenced this issue Mar 22, 2016

fix #362, join+batches+tolerance did not work

ed01d71

nathanielc mentioned this issue Mar 22, 2016

Fix #362, join+batches+tolerance did not work #373

Merged

nathanielc closed this as completed in #373 Mar 22, 2016

nathanielc pushed a commit that referenced this issue Mar 22, 2016

Merge pull request #373 from influxdata/nc-issue#362

97ac4ed

Fix #362, join+batches+tolerance did not work
