A utility to aggregate s3 access logs. #5777

Merged
merged 3 commits into pantsbuild:master on May 20, 2018

benjyw
Contributor

@benjyw benjyw commented May 3, 2018

Helps us track which binaries our S3 bandwidth costs are being spent on.

Currently produces:

936.9GB 3452 bin/clang/linux/x86_64/6.0.0/clang.tar.gz
785.9GB 3401 bin/gcc/linux/x86_64/7.3.0/gcc.tar.gz
537.2GB 6987 bin/go/linux/x86_64/1.7.3/go.tar.gz
292.0GB 6891 bin/protobuf/linux/x86_64/3.4.1/protoc
195.9GB 6981 bin/cmake/linux/x86_64/3.9.5/cmake.tar.gz
187.8GB 5029 bin/thrift/linux/x86_64/0.9.2/thrift
183.1GB 5553 bin/watchman/linux/x86_64/4.9.0-pants1/watchman
123.8GB 3322 bin/binutils/linux/x86_64/2.30/binutils.tar.gz
113.9GB 1359 bin/go/linux/x86_64/1.8.3/go.tar.gz
59.8GB 3454 bin/protoc/linux/x86_64/2.4.1/protoc
42.2GB 551 bin/go/mac/10.10/1.7.3/go.tar.gz
28.8GB 634 bin/thrift/linux/x86_64/0.10.0/thrift
19.8GB 1520 bin/node/linux/x86_64/v6.9.1/node.tar.gz
...
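
A minimal sketch of how such an aggregation could work; this is not the script added in this PR. It assumes the standard S3 server access log layout, in which the quoted request line is followed by the HTTP status, an error code, and the bytes-sent field. The names `aggregate` and `format_report` are hypothetical, as is the choice of 1 GB = 10^9 bytes.

```python
import re
from collections import defaultdict

# Assumed layout of one S3 server access log line (unused fields are skipped):
#   owner bucket [time] ip requester request_id operation key "request" status error bytes_sent ...
LOG_LINE = re.compile(
    r'^\S+ \S+ \[[^\]]*\] \S+ \S+ \S+ (?P<op>\S+) (?P<key>\S+) '
    r'"[^"]*" (?P<status>\S+) \S+ (?P<bytes_sent>\S+)'
)


def aggregate(lines):
    """Return {key: [total_bytes, request_count]} for object GET requests."""
    totals = defaultdict(lambda: [0, 0])
    for line in lines:
        m = LOG_LINE.match(line)
        if m is None or m.group('op') != 'REST.GET.OBJECT':
            continue
        sent = m.group('bytes_sent')
        if sent == '-':  # S3 logs '-' when no bytes were sent.
            continue
        entry = totals[m.group('key')]
        entry[0] += int(sent)
        entry[1] += 1
    return totals


def format_report(totals):
    """One line per key, largest bandwidth first: '<size>GB <count> <key>'."""
    rows = sorted(totals.items(), key=lambda kv: kv[1][0], reverse=True)
    # Assumption: 1 GB is taken as 10**9 bytes.
    return ['{:.1f}GB {} {}'.format(b / 1e9, n, key) for key, (b, n) in rows]
```

Feeding the lines of the downloaded log files to `aggregate` and printing `format_report(...)` yields a listing in the same shape as the one above.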

stuhood pushed a commit that referenced this pull request May 14, 2018
…ibution and LLVM subsystems to use it (#5780)

### Problem

`BinaryTool` is a recent addition that makes using lazily downloaded binaries much more declarative and extensible. However, it can still only download from our S3 hosting or from a mirror of it.

The previous structure requires the URLs passed to the global option `--binaries-baseurls` to point to an exact mirror of the hierarchy in our S3 hosting, and that hierarchy can change at any time. Writing a script to mirror our hosting into an internal network is not especially difficult, but in general there is no reason for the layout of binaries in `~/.cache/pants/bin/` to determine where those binaries are downloaded from.

Our S3 bandwidth costs have recently increased due to the introduction of clang and gcc in #5490. *See #5777 and #5779 for further context on S3 hosting.* Reliable binary downloads already exist for some of these tools, and we would be remiss not to use them if we can do so in a structured way.


### Solution

- Introduce a `urls=` argument to multiple methods of `BinaryUtil` for `BinaryTool`s that don't download from our S3 hosting.
- Add support for extracting (not creating) `.tar.xz` archives by adding the `xz` BinaryTool (see pantsbuild/binaries#66) and integrating it into BinaryTool's `archive_type` selection mechanism.
- Use the above to download the `go` and `llvm` binaries from their official download URLs.
  - Also, rename the `Clang` subsystem to `LLVM`, as the binary download we now use (currently the one for Ubuntu 16.04) also contains many other LLVM tools, e.g. `lld`.
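
The `archive_type` selection in the second bullet can be pictured as a lookup from a declared type to an extraction routine. The sketch below is illustrative only: the type names and the `extract` helper are hypothetical, and it uses Python 3's standard `tarfile` module (which reads xz-compressed tarballs when the interpreter is built with `lzma`) in place of the `xz` BinaryTool described above.

```python
import tarfile

# Hypothetical archive type names mapped to tarfile read modes.
TAR_MODES = {
    'tgz': 'r:gz',
    'tbz2': 'r:bz2',
    'txz': 'r:xz',  # Stand-in for shelling out to an xz binary.
}


def extract(archive_path, archive_type, dest_dir):
    """Extract archive_path into dest_dir according to its declared type."""
    try:
        mode = TAR_MODES[archive_type]
    except KeyError:
        raise ValueError('Unsupported archive type: {}'.format(archive_type))
    with tarfile.open(archive_path, mode) as tar:
        tar.extractall(dest_dir)
```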

### Result

URLs for external binary downloads can now be generated in a structured way, with the `--force-baseurls` option as an escape hatch. Thanks to the introduction of the `xz` binary tool, some binaries now default to the download URLs that the software's maintainers publish for public use. Two of the three largest bandwidth users among our provided binaries (LLVM and Go) have been switched to the download URLs provided by their maintainers. gcc still needs to be switched, which will happen in a separate PR.
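
To make the precedence concrete, here is a minimal sketch of how a tool-supplied URL list and the `--force-baseurls` escape hatch could interact. It is an illustration under assumptions, not the `BinaryUtil` code: `candidate_urls`, its parameters, and the example hosts are hypothetical.

```python
def candidate_urls(binary_path, baseurls, urls=None, force_baseurls=False):
    """Return the download URLs to try, in order.

    binary_path: path of the binary within the hosted hierarchy (used for mirrors).
    baseurls: mirror roots, as would be passed via --binaries-baseurls.
    urls: optional explicit external URLs declared by the tool itself.
    force_baseurls: when True, ignore `urls` and use only the mirrors.
    """
    if urls and not force_baseurls:
        return list(urls)
    return ['{}/{}'.format(base.rstrip('/'), binary_path) for base in baseurls]


# Hypothetical example: a tool with an external URL, then the escape hatch.
external = ['https://downloads.example.org/tool-1.0.tar.xz']
mirrors = ['https://mirror.example.com/pants/']
print(candidate_urls('bin/tool/linux/x86_64/1.0/tool.tar.gz', mirrors, urls=external))
print(candidate_urls('bin/tool/linux/x86_64/1.0/tool.tar.gz', mirrors,
                     urls=external, force_baseurls=True))
```

With this shape, a site that must stay on an internal mirror only needs to set the one flag, and tools that declare no external URLs behave as before.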
@benjyw benjyw merged commit 4839d2a into pantsbuild:master May 20, 2018
@benjyw benjyw deleted the s3_log_aggregator branch May 20, 2018 16:14