WAL #3036
Conversation
The CI failures are expected since the query engine isn't hooked up. This PR just validates that the WAL can be written, flushed, and flushed again after a restart.
for _, p := range points {
	bp, err := tx.CreateBucketIfNotExists(p.Key())
	// Generate an autoincrementing index for the WAL.
	id, _ := wal.NextSequence()
The id that is created here is used as a key, and data is inserted using this key. Is this data ever deleted from the WAL bucket?
Ah, nope, not yet. I missed that. I think I might actually change the flusher to go off the WAL bucket instead. It'll make consistency easier.
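For illustration, here is a minimal sketch of what "flushing off the WAL bucket" could look like, assuming a top-level bucket named "wal" and a hypothetical writeToIndex callback; these names are not the identifiers in this PR. The flush walks every key, moves it into the index, and then deletes it so nothing is replayed on the next flush.

package walsketch

import "github.com/boltdb/bolt"

// flushWAL moves every WAL entry into the index and deletes it afterwards.
func flushWAL(db *bolt.DB, writeToIndex func(k, v []byte) error) error {
	return db.Update(func(tx *bolt.Tx) error {
		wal := tx.Bucket([]byte("wal"))
		if wal == nil {
			return nil // nothing written yet
		}

		// First pass: copy entries into the index and remember their keys.
		var flushed [][]byte
		if err := wal.ForEach(func(k, v []byte) error {
			if err := writeToIndex(k, v); err != nil {
				return err
			}
			flushed = append(flushed, k)
			return nil
		}); err != nil {
			return err
		}

		// Second pass: remove flushed entries so they aren't replayed.
		for _, k := range flushed {
			if err := wal.Delete(k); err != nil {
				return err
			}
		}
		return nil
	})
}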
Generally makes sense. Would we really need the cache if queries weren't a concern? Couldn't the flusher just walk the WAL keys and build the batches on the fly? Of course, queries are a concern. :-) But I want to fully understand the motivation behind the cache. After all, due to the Bolt mmap, that cached data is now in memory twice (most likely).
The cache is only needed for queries. It'd be nice to query directly off the WAL, but the points are not necessarily going to be in order.
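As a rough illustration of why the cache helps the query path (the type and method names below are assumptions, not the PR's actual code): WAL entries arrive in write order, so a per-series in-memory cache lets a query cursor get values back in timestamp order without re-reading the WAL.

package walsketch

import (
	"sort"
	"sync"
)

type walValue struct {
	Timestamp int64
	Data      []byte
}

// seriesCache holds unflushed values per series key for query cursors.
type seriesCache struct {
	mu     sync.RWMutex
	values map[string][]walValue
}

func newSeriesCache() *seriesCache {
	return &seriesCache{values: make(map[string][]walValue)}
}

// add appends a value in arrival (write) order.
func (c *seriesCache) add(key string, v walValue) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[key] = append(c.values[key], v)
}

// sorted returns a copy of the cached values ordered by timestamp,
// which is what a query cursor needs.
func (c *seriesCache) sorted(key string) []walValue {
	c.mu.RLock()
	defer c.mu.RUnlock()
	out := append([]walValue(nil), c.values[key]...)
	sort.Slice(out, func(i, j int) bool { return out[i].Timestamp < out[j].Timestamp })
	return out
}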
@otoolep I made some changes in 6d0337e:
I'm going to hook up the query side next.
// is responsible for combining the output of many shards into a single query result.
const (
	// DefaultMaxWALSize is the default size of the WAL before it is flushed.
	DefaultMaxWALSize = 10 * 1024 * 1024 // 10MB
Not a blocker for this PR, but we'll need to make this configurable through tsdb/config.go.
I added the WAL size and flush interval to the TSDB config.
I think this makes sense. I'd be interested to see what the perf is like when the WAL gets close to its max size, particularly if you have a single series key that is very hot (many writes per second) so its cache entry would be very large. I assume that most people would want a WAL much larger than 10MB, so it would be good to test with ones in the hundreds of megabytes.

We should probably force a flush of a specific key after it gets up to a certain size. No need to flush the entire WAL, just that key. I assume that you'd be removing the calls to …

Also, flush should be forced after a certain time as well as just size. That way once a shard is cold for writes, its WAL will get flushed. Maybe just force a flush if we haven't received a write after 10m?
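A sketch of what triggering a flush on both size and idle time could look like; the field names and the one-second check interval are assumptions for illustration, not this PR's implementation.

package walsketch

import (
	"log"
	"sync"
	"time"
)

// walFlusher forces a flush when the WAL grows past maxSize or when no
// write has arrived for flushInterval (e.g. 10m for a cold shard).
type walFlusher struct {
	mu            sync.Mutex
	size          int
	lastWrite     time.Time
	maxSize       int
	flushInterval time.Duration
	flush         func() error
}

func (w *walFlusher) run(done <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-done:
			return
		case <-ticker.C:
			w.mu.Lock()
			trigger := w.size >= w.maxSize ||
				(w.size > 0 && time.Since(w.lastWrite) >= w.flushInterval)
			w.mu.Unlock()
			if !trigger {
				continue
			}
			if err := w.flush(); err != nil {
				log.Printf("wal flush error: %v", err)
			}
		}
	}
}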
Force-pushed from 5a5c225 to 1eae26d
@otoolep @pauldix I added query integration, time-based flushing, and WAL options to the config. The flushes have been removed from the …
We could do individual key flushing but that starts to complicate the WAL and has some other overhead. I'd prefer to defer that until it's an issue. I think striping the flushes is a better way to go, personally. I didn't want to include that in this first cut because that's also more complicated.
}

// Otherwise read from the cache.
// Continue skipping ahead through duplicate keys in the cache list.
How could there be duplicate keys? Points for the same series written with the same timestamp?
Yes.
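A small sketch of the skip-ahead mentioned in the comment above (the entry type here is an assumption, not the PR's cursor code): when the timestamp-sorted cache holds several entries with the same timestamp, the cursor skips to the last one so the most recent write wins.

package walsketch

type cacheEntry struct {
	timestamp int64
	value     []byte
}

// nextUnique returns the effective entry at position i in a cache slice
// sorted by timestamp, skipping ahead over duplicates of that timestamp
// so the last write wins, along with the index to resume from.
func nextUnique(cache []cacheEntry, i int) (cacheEntry, int) {
	e := cache[i]
	for i+1 < len(cache) && cache[i+1].timestamp == e.timestamp {
		i++
		e = cache[i] // a later duplicate overrides the earlier one
	}
	return e, i + 1
}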
I'm running benchmarks on a WAL-enabled …
if err := s.db.Update(func(tx *bolt.Tx) error {
	b := tx.Bucket([]byte("series"))
	for _, k := range keys {
		if err := b.Delete([]byte(k)); err != nil {
			return err
		}
-		if err := tx.DeleteBucket([]byte(k)); err != nil {
+		if err := tx.DeleteBucket([]byte(k)); err != nil && err != bolt.ErrBucketNotFound {
This looks like it might be fixing another error that people have reported. Some users have complained of "bucket not found" coming back to them.
It's definitely good to know the return errors from Bolt's operations. They're generally documented in the godoc but you can also find the exact list of named errors if you open up the function:
https://github.com/boltdb/bolt/blob/master/bucket.go#L208-L212
https://github.com/boltdb/bolt/blob/master/bucket.go#L218-L223
Failing on a bolt.ErrBucketNotFound and returning that error causes the whole Update() to roll back, which is probably not what you want.
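A tiny standalone demonstration of the rollback behavior described above (the bucket and key names are made up for illustration): returning any error from the Update closure discards everything written inside it.

package main

import (
	"errors"
	"fmt"
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("rollback-demo.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// The Put succeeds, but the returned error rolls the whole Update back.
	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("demo"))
		if err != nil {
			return err
		}
		if err := b.Put([]byte("k"), []byte("v")); err != nil {
			return err
		}
		return errors.New("simulated failure")
	})
	fmt.Println("update error:", err)

	// The bucket is gone: the transaction was rolled back.
	if err := db.View(func(tx *bolt.Tx) error {
		fmt.Println("bucket exists after rollback:", tx.Bucket([]byte("demo")) != nil) // false
		return nil
	}); err != nil {
		log.Fatal(err)
	}
}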
Looking forward to seeing this in action. +1
@benbjohnson looks good so far. What do you mean by striping? I'm still concerned about series that have a much higher frequency than others. Say you have one series that gets an average of 500 points per second, and then a bunch of other regular series that only get 1 point every 10 seconds. I don't think the same approach is going to work well for both, so you really need a way to flush per series, IMO. But I'm interested in hearing more about the striping approach and whether you think it'll solve this potential issue.
@pauldix By striping I mean that we break the WAL out into, say, 256 buckets, and series are sharded across those buckets. Then we can flush those buckets individually so that we're not flushing the whole WAL at once. It also still lets us get the benefit of grouping series writes together. As for variable series sizes, I think that if you have a series with 500 writes/sec then it's going to dwarf your series with a write every 10s. It seems like a premature optimization at this point.
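A sketch of the striping idea, assuming an FNV hash and a "wal_NNNN" bucket-name convention (both illustrative, not necessarily what this PR does): each series key maps to one of 256 partition buckets, and each bucket can then be flushed on its own.

package walsketch

import (
	"fmt"
	"hash/fnv"
)

// walPartitionN is the number of WAL partition buckets.
const walPartitionN = 256

// walPartition returns the partition a series key is striped into.
func walPartition(key []byte) uint8 {
	h := fnv.New64a()
	h.Write(key)
	return uint8(h.Sum64() % walPartitionN)
}

// walBucketName returns the bolt bucket name for a partition, e.g. "wal_0042".
func walBucketName(p uint8) []byte {
	return []byte(fmt.Sprintf("wal_%04d", p))
}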
@benbjohnson cool, will be curious to see how striping affects performance.
If this change makes it for 0.9.1, doesn't it also need an upgrade case?
Of course the system will handle no WAL bucket existing and create it, but …
Force-pushed from 76c48e6 to 3af1f32
@benbjohnson -- what impact did striping have on the system? Does it meet the design goals?
	return
}

// WALPartitionN is the number of partitions in the write ahead log.
Do we need a warning in the comment that this cannot be reduced without possibly making data in the higher-numbered partitions invisible?
The WAL gets flushed on shard open so it can rebuild the partition buckets automatically.
I don't follow you, perhaps I am missing something. A system is running with partition count of 8. It crashes hard. It stays down. It's upgraded to a version with a partition count of 4, and then restarted. The call to Flush() will only flush partitions 0-3 inclusive.
This is a real edge case, but I want to be sure I understand the code. Not saying we have to fix it.
Perhaps the fix for this would be easy enough? The flush that takes place on open should walk the partition buckets actually on disk, and flush those? Ignore the partition count in the code, just for the first flush? Again, this is an edge case, but perhaps the fix is easy.
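A sketch of that suggestion, assuming the WAL partition buckets share a "wal_" name prefix and a hypothetical flushPartition callback (both illustrative): on open, walk whatever partition buckets actually exist on disk instead of looping over the compiled-in partition count, so data written under a larger partition count is not missed.

package walsketch

import (
	"bytes"

	"github.com/boltdb/bolt"
)

// flushAllOnOpen flushes every WAL partition bucket found on disk,
// regardless of the configured partition count.
func flushAllOnOpen(db *bolt.DB, flushPartition func(name []byte, b *bolt.Bucket) error) error {
	return db.Update(func(tx *bolt.Tx) error {
		return tx.ForEach(func(name []byte, b *bolt.Bucket) error {
			if !bytes.HasPrefix(name, []byte("wal_")) {
				return nil // not a WAL partition bucket
			}
			return flushPartition(name, b)
		})
	})
}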
You're right, it wouldn't catch it in the flush. I thought I fixed that. Good catch. I fixed it in the latest commit (b574e2f).
Blame my work on Kafka at Loggly. Increasing partition count was, like, a 10-second operation. Decreasing it was fraught. :-)
@benbjohnson -- how does partitioning actually help us? The WAL buckets are still in only 1 BoltDB instance, so only 1 writer can be doing its thing at once, and there is still only 1 shard mutex as well. I don't follow how it helps -- can you explain?
@otoolep The main goal of partitioning was to lower the amount of continuous time that the flush blocks. Previously, flushing 1.5M points would take 8s, which is a long time to block for. Now with partitioning it will block 8 times for about 1s each, which is much more reasonable from a client perspective. Partitioning raised the IOPS to the WAL a bit -- from about 50 IOPS to 100-150 IOPS -- but it's still within our design goal. Previously we were seeing up to 20,000 IOPS on the box continuously. Now it's closer to 150 IOPS, and the flushes push it up to 1500-2000 IOPS momentarily.
OK, thanks @benbjohnson -- if it's about more, but shorter, delays, that makes sense.
+1, looks awesome. We'll have to put this through its paces in testing over the next 7 days :)
Force-pushed from 2b40fe2 to 657971b
This commit adds a write ahead log to the shard. Entries are cached in memory and periodically flushed back into the index. The WAL and the cache are both partitioned into buckets so that flushing doesn't stop the world for as long.
Seems to have better performance, but when I tested pulling 10,000 msg/s from Kafka, InfluxDB hung without any log information.
@huhongbo can you give more information? How does it hang? What is your schema? How are you writing data? How many points do you write before it hangs? Basically, we need as much information as you can give. Our testing hasn't shown the issue you're talking about so we'd like to reproduce it if possible.
https://gist.github.com/huhongbo/4d0882e4e0262dfa6991 has some sample data.
Overview
This commit adds a write ahead log to the shard. Entries are cached in memory and periodically flushed back into the index. The goal is to optimize write performance by flushing multiple points into a series at once instead of writing single points, which causes a lot of random-write disk overhead.
TODO