[0.10] Series data loss under high write load #5719

Closed
madushan1000 opened this issue Feb 17, 2016 · 9 comments

@madushan1000

I'm testing influxdb v0.10 for our production use.

I wrote this node script (https://gist.github.com/madushan1000/7d4993dc19a24a01eb84) using node-influx (with ES6 and babel-polyfill), which basically batches 10000 (or 1000) points into a single write and iterates forever. The write rate is 14,000 points/s -- 25,000 points/s.
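
Roughly, the script does the equivalent of this (a simplified sketch that posts line protocol straight to the HTTP /write endpoint instead of going through node-influx; the measurement and field names are just placeholders):

// Simplified sketch of the write loop: build a batch of BATCH_SIZE points
// and POST it to InfluxDB's HTTP /write endpoint, repeating forever.
const http = require('http');

const BATCH_SIZE = 10000;   // or 1000
const DB = 'test';

function writeBatch() {
  // Client-side timestamp with ms precision; the whole batch is built
  // within the same millisecond.
  const now = Date.now();
  const lines = [];
  for (let i = 0; i < BATCH_SIZE; i++) {
    lines.push('load_test value=' + Math.random() + ' ' + now);
  }

  const req = http.request({
    host: 'localhost',
    port: 8086,
    method: 'POST',
    path: '/write?db=' + DB + '&precision=ms',
  }, (res) => {
    res.resume();
    res.on('end', writeBatch);   // iterate forever
  });
  req.on('error', (err) => console.error('write failed:', err));
  req.write(lines.join('\n'));
  req.end();
}

writeBatch();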

The problem I have is that all the data in the measurement vanishes after it reaches a certain number of writes (about 10 writes using 10000-point batches, about 2500 writes using 1000-point batches). I only have the default retention policy, which I created manually by running:

>create retention policy "default" on test duration inf replication 2 default
>show retention policies on test
name    duration        replicaN        default
default 0               2               true

Furthermore, looking at the InfluxDB logs, I discovered that something like this can be seen near the time of the data loss:

[tsm1] 2016/02/17 11:40:37 beginning level 2 compaction of group 0, 2 TSM files
[tsm1] 2016/02/17 11:40:37 compacting level 2 group (0) /var/lib/influxdb/data/test/default/3/000000202-000000002.tsm (#0)
[tsm1] 2016/02/17 11:40:37 compacting level 2 group (0) /var/lib/influxdb/data/test/default/3/000000204-000000002.tsm (#1)
[tsm1] 2016/02/17 11:40:37 compacted level 2 group (0) into /var/lib/influxdb/data/test/default/3/000000204-000000003.tsm.tmp (#0)

or

[tsm1] 2016/02/17 11:40:36 beginning level 1 compaction of group 0, 2 TSM files
[tsm1] 2016/02/17 11:40:36 compacting level 1 group (0) /var/lib/influxdb/data/test/default/3/000000203-000000001.tsm (#0)
[tsm1] 2016/02/17 11:40:36 compacting level 1 group (0) /var/lib/influxdb/data/test/default/3/000000204-000000001.tsm (#1)
[tsm1] 2016/02/17 11:40:36 compacted level 1 group (0) into /var/lib/influxdb/data/test/default/3/000000204-000000002.tsm.tmp (#0)
[tsm1] 2016/02/17 11:40:36 compacted level 1 group 0 of 2 files into 1 files in 35.948657ms

This setup is an InfluxDB cluster with 2 data nodes and 3 meta nodes. All the servers run Ubuntu 14.04, and I installed InfluxDB from the prebuilt packages following the official documentation for v0.10. Both data nodes have 500GB SSD drives formatted with ext4 and mounted at /var/lib/influxdb.

Here is a sample config file. I first tried the output of influxd config. After it gave the same issue, I tried the one that came with the distribution. I tweaked it a little, but the issue was the same for every config I used. Only the hostnames and data.Enabled change for each node.

Is this write rate too high? Even if it is, why does the data disappear? Shouldn't some portion of the data be left?

@rossmcdonald
Contributor

@madushan1000 Does all of the data in the database disappear after the compaction? Or is there still some data retained?

@madushan1000
Author

@rossmcdonald It depends on the insert rate. If I keep the same insert rate for some time, then all the data is lost. But if I change the insert rate sometime before the data loss, the data written at the old insert rate is retained.

@rossmcdonald
Contributor

@madushan1000 That's very strange. Do you know if the same issue occurs with a single-node setup, or have you only tested on a cluster?

Also, how are you distributing writes and reads? Are you writing to and reading from the same instance, a different instance, or in a round-robin fashion?

@jwilder
Contributor

jwilder commented Feb 18, 2016

@madushan1000 If you could see whether you can reproduce this with a single node, that would help narrow down where the problem might be.

@madushan1000
Author

So I tried a single-node test on my MacBook Pro (i5 processor, 8GB RAM, SSD storage) with InfluxDB 0.10 from Homebrew (commit e13011d) and the default config. The problem remains the same, so it looks like clustering has nothing to do with it. Here is an influxd log file:
https://gist.github.com/madushan1000/e8013564a094518dac75

@jwilder
Contributor

jwilder commented Feb 19, 2016

How can I run the script you're running to reproduce this?

@madushan1000
Author

I've pushed the complete code to my GitHub: https://github.com/madushan1000/influx-test. Clone it, do an npm install and then npm start.

jwilder added a commit that referenced this issue Feb 19, 2016
The cache had some incorrect logic for determining when a series needed
to be deduplicated.  The logic was checking for unsorted points and
not considering duplicate points.  This would manifest itself as many
(duplicate) points being returned from the cache, and after a
snapshot compaction run, the points would disappear because snapshot
compaction always deduplicates and sorts the points.

Added a test that reproduces the issue.

Fixes #5719
jwilder added this to the 0.11.0 milestone Feb 19, 2016
@jwilder
Contributor

jwilder commented Feb 19, 2016

@madushan1000 Thanks for pushing your repo up. I was able to reproduce it locally, and I have a fix for it in #5751.

There are two things going on here that I can see:

  1. The batches that are being written use ms precision, but in a very tight loop. This ends up writing big batches of duplicate points to the same series. Even though each batch contains 10000 points, it really only ends up creating 1 or 2, because the point timestamps get truncated to millisecond precision (see the short illustration after this list).
  2. This test triggered a bug in the cache where duplicate points were not detected and deduplicated. After each batch, while the points are still in the cache, you would see all the duplicates returned. After a snapshot compaction runs, the counts would drop because snapshot compaction always deduplicates and sorts points. This is fixed in #5751 (Fix cache not deduplicating points in some cases).
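
As a quick illustration of #1 (not your script, just the effect of building a batch in a tight loop with ms timestamps):

// Building a 10,000-point batch in a tight loop with millisecond timestamps
// yields almost no distinct values, so deduplication collapses the batch
// down to one or two points per series.
const BATCH_SIZE = 10000;
const timestamps = [];
for (let i = 0; i < BATCH_SIZE; i++) {
  timestamps.push(Date.now());   // ms precision; the loop runs in under ~2 ms
}
console.log(new Set(timestamps).size);   // typically prints 1 or 2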

For #1, you'll need to ensure the timestamps are unique for each point in the batch for a given series.
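
One way to do that (just a sketch, assuming you keep generating timestamps client-side; the measurement and field names are placeholders) is to write with a finer precision and offset each point within the batch, e.g. microseconds:

// Sketch: give every point in the batch its own timestamp by writing with
// microsecond precision (precision=u on /write) and offsetting from a base.
function buildBatch(batchSize) {
  const baseUs = Date.now() * 1000;   // current time in microseconds
  const lines = [];
  for (let i = 0; i < batchSize; i++) {
    // baseUs + i keeps timestamps unique within the batch while spreading a
    // 10,000-point batch over only ~10 ms of wall-clock time
    lines.push('load_test value=' + Math.random() + ' ' + (baseUs + i));
  }
  return lines.join('\n');
}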

@madushan1000
Author

Glad I could be of help.

jwilder added a commit that referenced this issue Feb 22, 2016
jonseymour pushed a commit to jonseymour/influxdb that referenced this issue Feb 29, 2016
jonseymour pushed a commit to jonseymour/influxdb that referenced this issue Feb 29, 2016

Rebased-to-0.10.x-by: Jon Seymour <jon@wildducktheories.com>
jonseymour added a commit to jonseymour/influxdb that referenced this issue Feb 29, 2016
This series re-rolls the fixes on influxdata#5719, influxdata#5699, influxdata#5832 without any other
changes from 0.11.0 onto 0.10.1 for the purpose of addressing
issue influxdata#5857.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>