Promtail: Add compressed files support #6708
Conversation
Force-pushed from eeffa05 to d18a0a1
- The new Reader interface is used across the source code instead of `tailer`; `tailer` is now one implementation of Reader
- Decompresser implements the file.Reader interface and infers which protocol to use to unzip a file based on its extension
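For context, a minimal sketch of what extension-based inference can look like; the `format` type, the helper name, and the set of extensions here are illustrative, not the PR's actual code:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// format enumerates the compression protocols this sketch distinguishes.
type format int

const (
	formatNone format = iota
	formatGzip
	formatBzip2
)

// inferFormat picks a decompression protocol purely from the file
// extension, mirroring the approach the commit above describes.
func inferFormat(path string) format {
	switch strings.ToLower(filepath.Ext(path)) {
	case ".gz":
		return formatGzip
	case ".bz2":
		return formatBzip2
	default:
		return formatNone
	}
}

func main() {
	fmt.Println(inferFormat("/var/log/app.log.gz") == formatGzip) // true
}
```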
Force-pushed from d18a0a1 to fd6162d
Force-pushed from fd6162d to f94e7e2
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester            0%
+ distributor         0%
+ querier             0%
+ querier/queryrange  0%
+ iter                0%
+ storage             0%
+ chunkenc            0%
+ logql               0%
+ loki                0%
Overall looks awesome @DylanGuedes!
I'm a bit concerned, though, about the potential for resource overuse here: from what I can see, the whole decompressed file is read into memory.
// It first decompress the file as a whole using a reader and then it will iterate
// over its chunks, separated by '\n'.
// During each iteration, the parsed and decoded log line is then sent to the API with the current timestamp.
This is what I was kinda worried about seeing...
If the file is compressed, it could potentially store a huge amount of data which we need to entirely load into memory to iterate over.
Is there a more resource-efficient way we could do this?
Totally @dannykopping ... compression ratios >99% are common for uniformly structured logs like access or event logs, so reading a whole file into memory would not be an option. That's why I suggested implementing this along the lines of piping data via `zcat` into an existing tool (#5956 (comment)). And isn't that exactly what the mountReader method does? It gets the right type of reader (one able to read a compressed data stream in various algorithms) and then hands that off to be read line by line. There is no slurping into memory, is there?
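For readers following along, here is a minimal standalone sketch of that zcat-style flow. `openStream` is a hypothetical stand-in for the PR's mountReader, and only gzip is handled; the point is that the decompressing reader streams, so no whole-file slurping happens:

```go
package main

import (
	"bufio"
	"compress/gzip"
	"fmt"
	"io"
	"log"
	"os"
	"strings"
)

// openStream returns a reader that yields decompressed bytes as they are
// requested; nothing is loaded into memory up front.
func openStream(path string) (io.Reader, func() error, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, nil, err
	}
	if strings.HasSuffix(path, ".gz") {
		zr, err := gzip.NewReader(f) // gzip.Reader streams from f
		if err != nil {
			f.Close()
			return nil, nil, err
		}
		return zr, func() error { zr.Close(); return f.Close() }, nil
	}
	return f, f.Close, nil
}

func main() {
	r, closeFn, err := openStream("/var/log/app.log.gz") // illustrative path
	if err != nil {
		log.Fatal(err)
	}
	defer closeFn()

	// Scan line by line; each iteration holds only one buffered line.
	sc := bufio.NewScanner(r)
	for sc.Scan() {
		fmt.Println(sc.Text()) // Promtail would hand the line to its client here
	}
	if err := sc.Err(); err != nil {
		log.Fatal(err)
	}
}
```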
Great observation, thanks. I'm relying on whatever the underlying readers are doing; I'm not sure if all of them (like flate/zip) support inline decompression without buffering, but I'm also not explicitly decompressing files as a whole with something like `tar -xzvf myfile.tar.gz`, so it might be supported. I'm planning to run a few tests this week: one will decompress big files, and another will rotate files to see how it behaves.
We tested this together and found that it does allocate a lot of memory, but the GC acts as it should and memory doesn't grow unbounded. Dylan also verified that the files are decompressed progressively, so at no point do we try to load the whole file into memory at once, and overall memory usage won't spike very high (we use small buffers when reading the compressed files). We should add a note in the docs that GC frequency will increase dramatically when decompressing large files, and that this will cause CPU usage to rise.
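To make the "progressive decompression with small buffers" point concrete, a self-contained illustration using an in-memory gzip stream; the 4096-byte buffer size is illustrative, not Promtail's actual constant:

```go
package main

import (
	"bytes"
	"compress/gzip"
	"fmt"
	"io"
	"log"
)

func main() {
	// Build a small gzip stream in memory so the example is self-contained.
	var zbuf bytes.Buffer
	zw := gzip.NewWriter(&zbuf)
	io.WriteString(zw, "line1\nline2\n")
	zw.Close()

	zr, err := gzip.NewReader(&zbuf)
	if err != nil {
		log.Fatal(err)
	}

	// Drain through a fixed-size buffer: peak memory stays bounded, but a
	// large file means many short-lived allocations downstream, which is
	// where the increased GC frequency noted above comes from.
	buf := make([]byte, 4096)
	total := 0
	for {
		n, err := zr.Read(buf)
		total += n // process buf[:n] here in a real reader
		if err == io.EOF {
			break
		}
		if err != nil {
			log.Fatal(err)
		}
	}
	fmt.Println("decompressed bytes:", total)
}
```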
@dannykopping @frittentheke thank you both for the fantastic reviews. I'll be working on these suggestions and any others that come along during the week, and I plan to test a few things with this too. Just one thing: even if this doesn't add decent support for compressed log rotations, I think we should still proceed with at least this simpler implementation and improve it in a follow-up PR (I can start working on that right after this gets merged). My argument is that as of now Loki doesn't support even the most straightforward compression scenarios, so this would at least add some support without much burden.
Yes. The question is, though: in what order will those files be picked up? Does Promtail sort them by modification date? That would at least keep as much order as possible.
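For illustration, one way such ordering could be done; this is a hypothetical helper, not Promtail's actual behavior:

```go
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
	"sort"
)

// filesByModTime returns the files matching glob, oldest first, so log
// lines from rotated files would be ingested in roughly the order they
// were written. Stat inside the comparator is inefficient but fine for
// a sketch.
func filesByModTime(glob string) ([]string, error) {
	paths, err := filepath.Glob(glob)
	if err != nil {
		return nil, err
	}
	sort.Slice(paths, func(i, j int) bool {
		fi, err1 := os.Stat(paths[i])
		fj, err2 := os.Stat(paths[j])
		if err1 != nil || err2 != nil {
			return paths[i] < paths[j] // fall back to lexical order
		}
		return fi.ModTime().Before(fj.ModTime())
	})
	return paths, nil
}

func main() {
	paths, err := filesByModTime("/app/logfiles/*") // illustrative glob
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(paths)
}
```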
- The log message claimed the file was compressed when it had actually just mounted its reader
- Add the file extension to it
- Change message log level from Info to Debug
- Enhance grammar of log message (with -> has)

Co-authored-by: Danny Kopping <dannykopping@gmail.com>
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester            0%
+ distributor         0%
+ querier             0%
+ querier/queryrange  0%
+ iter                0%
+ storage             0%
+ chunkenc            0%
+ logql               0.4%
+ loki                0%
LGTM, some small doc changes and I think we should change that buffer size - but after that, merge at will!
Co-authored-by: Danny Kopping <dannykopping@gmail.com>
Benchmark results with a 3MB buffer size for decompression: *(screenshot not reproduced)*

Benchmark results with a 4096-byte buffer size: *(screenshot not reproduced)*
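The screenshots themselves are not reproduced here. A sketch of how such a buffer-size comparison can be benchmarked; the benchmark names and `testdata.gz` fixture are hypothetical, not files from the PR:

```go
package promtail_test

import (
	"bufio"
	"compress/gzip"
	"io"
	"os"
	"testing"
)

// benchRead decompresses a gzip fixture through a buffered reader of the
// given size, discarding the output, so only buffer-size effects remain.
func benchRead(b *testing.B, bufSize int) {
	for i := 0; i < b.N; i++ {
		f, err := os.Open("testdata.gz")
		if err != nil {
			b.Fatal(err)
		}
		zr, err := gzip.NewReader(f)
		if err != nil {
			b.Fatal(err)
		}
		r := bufio.NewReaderSize(zr, bufSize)
		if _, err := io.Copy(io.Discard, r); err != nil {
			b.Fatal(err)
		}
		f.Close()
	}
}

func BenchmarkDecompress4KiB(b *testing.B) { benchRead(b, 4096) }
func BenchmarkDecompress3MiB(b *testing.B) { benchRead(b, 3<<20) }
```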
@dannykopping Merging, as I've addressed all your great suggestions. I'll work on zip compression in a follow-up PR 😄
**What this PR does / why we need it**:
Adds to Promtail the ability to read compressed files. It works by:
1. Inferring which compression format to use based on the file extension
2. Uncompressing the file with the native `golang/compress` packages
3. Iterating over the uncompressed lines and sending them to Loki

Its usage is the same as our current file tailing. In the example below, all files under the `logfiles` folder are parsed/processed, regardless of whether they are compressed. `file1` and `file2` are first uncompressed with Gunzip; `file3` is parsed as it is.
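The original config example was an image and is not reproduced here; a minimal Promtail scrape config along the lines described might look like the following, where the job name, labels, and path glob are illustrative:

```yaml
scrape_configs:
  - job_name: compressed-logs
    static_configs:
      - targets: [localhost]
        labels:
          job: logfiles
          # Matches compressed and uncompressed files alike.
          __path__: /app/logfiles/*
```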
**Which issue(s) this PR fixes**:
Fixes #5956

**Special notes for your reviewer**:
Another approach is to infer the compression format from the header bytes, but the extension-based approach is much simpler; we can improve it in the future if the community needs it.
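For comparison, a sketch of that header-bytes alternative. The magic prefixes shown are the standard gzip (0x1f 0x8b) and bzip2 ("BZh") ones; the helper name is hypothetical:

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"log"
	"os"
)

// sniffFormat reads the first few bytes of a file and matches them against
// known magic numbers instead of trusting the file extension.
func sniffFormat(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	header := make([]byte, 3)
	if _, err := io.ReadFull(f, header); err != nil {
		return "", err
	}
	switch {
	case bytes.HasPrefix(header, []byte{0x1f, 0x8b}): // gzip
		return "gzip", nil
	case bytes.HasPrefix(header, []byte("BZh")): // bzip2
		return "bzip2", nil
	default:
		return "none", nil
	}
}

func main() {
	format, err := sniffFormat("/var/log/app.log.gz") // illustrative path
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(format)
}
```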
**Checklist**
- `CHANGELOG.md` updated
- Changes that require user attention or interaction to upgrade are documented in `docs/sources/upgrading/_index.md`