[0.10] Series data loss under high write load #5719

Closed
madushan1000 opened this issue Feb 17, 2016 · 9 comments

@madushan1000

I'm testing influxdb v0.10 for our production use.

I wrote this node script (https://gist.github.com/madushan1000/7d4993dc19a24a01eb84) using node-influx (with ES6 and babel-polyfill), which basically batches 10000 (or 1000) points into a single write and iterates forever. The write rate is 14,000 points/s -- 25,000 points/s.
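
Roughly, the script does the equivalent of this (a simplified sketch that posts line protocol straight to the HTTP /write endpoint instead of going through node-influx; the measurement and field names are just placeholders):

// Simplified sketch of the write loop: build a batch of BATCH_SIZE points
// and POST it to InfluxDB's HTTP /write endpoint, repeating forever.
const http = require('http');

const BATCH_SIZE = 10000;   // or 1000
const DB = 'test';

function writeBatch() {
  // Client-side timestamp with ms precision; the whole batch is built
  // within the same millisecond.
  const now = Date.now();
  const lines = [];
  for (let i = 0; i < BATCH_SIZE; i++) {
    lines.push('load_test value=' + Math.random() + ' ' + now);
  }

  const req = http.request({
    host: 'localhost',
    port: 8086,
    method: 'POST',
    path: '/write?db=' + DB + '&precision=ms',
  }, (res) => {
    res.resume();
    res.on('end', writeBatch);   // iterate forever
  });
  req.on('error', (err) => console.error('write failed:', err));
  req.write(lines.join('\n'));
  req.end();
}

writeBatch();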

The problem I have is that all the data in the measurement vanishes after it reaches a certain number of writes (about 10 writes using 10000-point batches, about 2500 writes using 1000-point batches). I only have the default retention policy, which I created manually by running:

>create retention policy "default" on test duration inf replication 2 default
>show retention policies on test
name    duration        replicaN        default
default 0               2               true

Furthermore, looking at the InfluxDB logs, I discovered that something like this can be seen near the time of the data loss:

[tsm1] 2016/02/17 11:40:37 beginning level 2 compaction of group 0, 2 TSM files
[tsm1] 2016/02/17 11:40:37 compacting level 2 group (0) /var/lib/influxdb/data/test/default/3/000000202-000000002.tsm (#0)
[tsm1] 2016/02/17 11:40:37 compacting level 2 group (0) /var/lib/influxdb/data/test/default/3/000000204-000000002.tsm (#1)
[tsm1] 2016/02/17 11:40:37 compacted level 2 group (0) into /var/lib/influxdb/data/test/default/3/000000204-000000003.tsm.tmp (#0)

or

[tsm1] 2016/02/17 11:40:36 beginning level 1 compaction of group 0, 2 TSM files
[tsm1] 2016/02/17 11:40:36 compacting level 1 group (0) /var/lib/influxdb/data/test/default/3/000000203-000000001.tsm (#0)
[tsm1] 2016/02/17 11:40:36 compacting level 1 group (0) /var/lib/influxdb/data/test/default/3/000000204-000000001.tsm (#1)
[tsm1] 2016/02/17 11:40:36 compacted level 1 group (0) into /var/lib/influxdb/data/test/default/3/000000204-000000002.tsm.tmp (#0)
[tsm1] 2016/02/17 11:40:36 compacted level 1 group 0 of 2 files into 1 files in 35.948657ms

This setup is an InfluxDB cluster with 2 data nodes and 3 meta nodes. All the servers run Ubuntu 14.04, and I installed InfluxDB from the prebuilt packages following the official documentation for v0.10. Both data nodes have 500GB SSD drives formatted with ext4 and mounted at /var/lib/influxdb.

Here is a sample config file. I first tried the output of influxd config. After it gave the same issue, I tried the one that came with the distribution. I tweaked it a little, but the issue was the same for every config I used. Only the hostnames and data.Enabled change for each node.

Is this write rate too high? Even if it is, why does the data disappear? Shouldn't some portion of the data be left?

@rossmcdonald
Contributor

@madushan1000 Does all of the data in the database disappear after the compaction? Or is there still some data retained?

@madushan1000
Author

@rossmcdonald It depends on the insert rate. If I keep the same insert rate for some time, then all the data is lost. But if I change the insert rate sometime before the data loss, the data written at the old insert rate is retained.

@rossmcdonald
Contributor

@madushan1000 That's very strange. Do you know if the same issue occurs with a single-node setup, or have you only tested on a cluster?

Also, how are you distributing writes and reads? Are you writing to and reading from the same instance, a different instance, or in a round-robin fashion?

@jwilder
Contributor

jwilder commented Feb 18, 2016

@madushan1000 If you could see whether you can reproduce this with a single node, that would help narrow down where the problem might be.

@madushan1000
Author

So I tried a single-node test on my MacBook Pro (i5 processor, 8GB RAM, SSD storage) with InfluxDB 0.10 from Homebrew (commit e13011d) and the default config. The problem remains the same, so it looks like clustering has nothing to do with it. Here is an influxd log file:
https://gist.github.com/madushan1000/e8013564a094518dac75

@jwilder
Contributor

jwilder commented Feb 19, 2016

How can I run the script you're running to reproduce this?

@madushan1000
Author

I've pushed the complete code to my GitHub: https://github.com/madushan1000/influx-test. Clone it, do an npm install and then npm start.

jwilder added a commit that referenced this issue Feb 19, 2016
The cache had some incorrect logic for determining when a series needed
to be deduplicated.  The logic was checking for unsorted points and
not considering duplicate points.  This would manifest itself as many
(duplicate) points being returned from the cache, and after a
snapshot compaction run, the points would disappear because snapshot
compaction always deduplicates and sorts the points.

Added a test that reproduces the issue.

Fixes #5719
jwilder added this to the 0.11.0 milestone Feb 19, 2016
@jwilder
Contributor

jwilder commented Feb 19, 2016

@madushan1000 Thanks for pushing your repo up. I was able to reproduce it locally, and I have a fix for it in #5751.

There are two things going on here that I can see:

  1. The batches that are being written use ms precision, but in a very tight loop. This ends up writing big batches of duplicate points to the same series. Even though each batch contains 10000 points, it really only ends up creating 1 or 2, because the point timestamps get truncated to millisecond precision (see the short illustration after this list).
  2. This test triggered a bug in the cache where duplicate points were not detected and deduplicated. After each batch, while the points are still in the cache, you would see all the duplicates returned. After a snapshot compaction runs, the counts would drop because snapshot compaction always deduplicates and sorts points. This is fixed in #5751 (Fix cache not deduplicating points in some cases).
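
As a quick illustration of #1 (not your script, just the effect of building a batch in a tight loop with ms timestamps):

// Building a 10,000-point batch in a tight loop with millisecond timestamps
// yields almost no distinct values, so deduplication collapses the batch
// down to one or two points per series.
const BATCH_SIZE = 10000;
const timestamps = [];
for (let i = 0; i < BATCH_SIZE; i++) {
  timestamps.push(Date.now());   // ms precision; the loop runs in under ~2 ms
}
console.log(new Set(timestamps).size);   // typically prints 1 or 2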

For #1, you'll need to ensure the timestamps are unique for each point in the batch for a given series.
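
One way to do that (just a sketch, assuming you keep generating timestamps client-side; the measurement and field names are placeholders) is to write with a finer precision and offset each point within the batch, e.g. microseconds:

// Sketch: give every point in the batch its own timestamp by writing with
// microsecond precision (precision=u on /write) and offsetting from a base.
function buildBatch(batchSize) {
  const baseUs = Date.now() * 1000;   // current time in microseconds
  const lines = [];
  for (let i = 0; i < batchSize; i++) {
    // baseUs + i keeps timestamps unique within the batch while spreading a
    // 10,000-point batch over only ~10 ms of wall-clock time
    lines.push('load_test value=' + Math.random() + ' ' + (baseUs + i));
  }
  return lines.join('\n');
}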

@madushan1000
Author

Glad I could be of help.

jwilder added a commit that referenced this issue Feb 22, 2016
jonseymour pushed a commit to jonseymour/influxdb that referenced this issue Feb 29, 2016
jonseymour pushed a commit to jonseymour/influxdb that referenced this issue Feb 29, 2016

Rebased-to-0.10.x-by: Jon Seymour <jon@wildducktheories.com>
jonseymour added a commit to jonseymour/influxdb that referenced this issue Feb 29, 2016
This series re-rolls the fixes on influxdata#5719, influxdata#5699, influxdata#5832 without any other
changes from 0.11.0 onto 0.10.1 for the purpose of addressing
issue influxdata#5857.

Signed-off-by: Jon Seymour <jon@wildducktheories.com>