docs: Replace Arrow Ballista with DataFusion Ballista #1041

Merged
merged 7 commits into from
Jul 13, 2024
14 changes: 2 additions & 12 deletions docs/README.md
Original file line number Diff line number Diff line change
@@ -47,16 +47,6 @@ inside a Python virtualenv.

## Release

The documentation is served through the [arrow-site](https://github.com/apache/arrow-site/) repository. To release
a new version of the documentation, follow these steps:
The documentation is published from the `asf-site` branch of this repository.

1. Download the release source tarball (we can only publish documentation from official releases)
2. Run `./build.sh` inside `docs` folder to generate the docs website inside the `build/html` folder.
3. Clone the arrow-site repo
4. Checkout to the `asf-site` branch (NOT `master`)
5. Copy build artifacts into `arrow-site` repo's `ballista` folder with a command such as

- `cp -rT ./build/html/ ../../arrow-site/ballista/` (doesn't work on mac)
- `rsync -avzr ./build/html/ ../../arrow-site/ballista/`

6. Commit changes in `arrow-site` and send a PR.
Documentation is published automatically when documentation changes are pushed to the main branch.
48 changes: 4 additions & 44 deletions docs/source/community/communication.md
@@ -26,55 +26,15 @@ All participation in the Apache DataFusion Ballista project is governed by the
Apache Software Foundation's [code of
conduct](https://www.apache.org/foundation/policies/conduct.html).

## Questions?
We use the same communication channels as the main DataFusion project:

### Mailing list

We use datafusion.apache.org's `dev@` mailing list for project management, release
coordination and design discussions
([subscribe](mailto:dev-subscribe@datafusion.apache.org),
[unsubscribe](mailto:dev-unsubscribe@datafusion.apache.org),
[archives](https://lists.apache.org/list.html?dev@datafusion.apache.org)).

When emailing the dev list, please make sure to prefix the subject line with a
`[Ballista]` tag, e.g. `"[Ballista] New API for remote data sources"`, so
that the appropriate people in the Apache DataFusion community notice the message.

### Slack and Discord

We use the official [ASF](https://s.apache.org/slack-invite) Slack workspace
for informal discussions and coordination. This is a great place to meet other
contributors and get guidance on where to contribute. Join us in the
`#arrow-rust` channel.

We also have a backup Arrow Rust Discord
server ([invite link](https://discord.gg/Qw5gKqHxUM)) in case you are not able
to join the Slack workspace. If you need an invite to the Slack workspace, you
can also ask for one in our Discord server.

### Sync up video calls

We have biweekly sync calls every other Thursday at both 04:00 UTC
and 16:00 UTC (starting September 30, 2021) depending on if there are
items on the agenda to discuss and someone being willing to host.

Please see the [agenda](https://docs.google.com/document/d/1atCVnoff5SR4eM4Lwf2M1BBJTY6g3_HUNR6qswYJW_U/edit)
for the video call link, add topics and to see what others plan to discuss.

The goals of these calls are:

1. Help "put a face to the name" of some of the other contributors we are working with
2. Discuss / synchronize on the goals and major initiatives from different stakeholders to identify areas where more alignment is needed

No decisions are made on the call and anything of substance will be discussed on this mailing list or in github issues / google docs.

We will send a summary of all sync ups to the dev@datafusion.apache.org mailing list.
[https://datafusion.apache.org/contributor-guide/communication.html](https://datafusion.apache.org/contributor-guide/communication.html)

## Contributing

Our source code is hosted on
[GitHub](https://github.com/apache/arrow-datafusion). More information on contributing is in
the [Contribution Guide](https://github.com/apache/arrow-datafusion/blob/master/CONTRIBUTING.md)
[GitHub](https://github.com/apache/datafusion-ballista). More information on contributing is in
the [Contribution Guide](https://github.com/apache/datafusion-ballista/blob/main/CONTRIBUTING.md)
, and we have curated a [good-first-issue](https://github.com/apache/datafusion-ballista/contribute)
list to help you get started. You can find datafusion's major designs in docs/source/specification.

2 changes: 1 addition & 1 deletion docs/source/conf.py
@@ -90,7 +90,7 @@

html_context = {
"github_user": "apache",
"github_repo": "arrow-ballista",
"github_repo": "datafusion-ballista",
"github_version": "main",
"doc_path": "docs/source",
}
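
For context, Sphinx themes commonly read these `html_context` keys to build an "edit on GitHub" link, which is why the repository name has to change here too. A minimal sketch of that composition follows. The helper `edit_url` and the URL layout are assumptions for illustration and are not taken from this PR or from a specific theme:

```python
def edit_url(html_context: dict, page: str) -> str:
    """Hypothetical helper: compose a GitHub edit link from html_context."""
    return "https://github.com/{user}/{repo}/edit/{version}/{path}/{page}".format(
        user=html_context["github_user"],
        repo=html_context["github_repo"],
        version=html_context["github_version"],
        path=html_context["doc_path"].strip("/"),
        page=page,
    )


# Values as they appear in docs/source/conf.py after this change.
html_context = {
    "github_user": "apache",
    "github_repo": "datafusion-ballista",
    "github_version": "main",
    "doc_path": "docs/source",
}

print(edit_url(html_context, "index.rst"))
```

With the old `arrow-ballista` value, a link built this way would point at the previous repository name.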
8 changes: 4 additions & 4 deletions docs/source/contributors-guide/architecture.md
@@ -94,9 +94,9 @@ can execute multiple partitions of the same plan in parallel.

There are multiple clients available for submitting jobs to a Ballista cluster:

- The [Ballista CLI](https://github.com/apache/arrow-ballista/tree/main/ballista-cli) provides a SQL command-line
- The [Ballista CLI](https://github.com/apache/datafusion-ballista/tree/main/ballista-cli) provides a SQL command-line
interface.
- The Python bindings ([PyBallista](https://github.com/apache/arrow-ballista/tree/main/python)) provide a session
- The Python bindings ([PyBallista](https://github.com/apache/datafusion-ballista/tree/main/python)) provide a session
context with support for SQL and DataFrame operations.
- The [ballista crate](https://crates.io/crates/ballista) provides a native Rust session context with support for
SQL and DataFrame operations.
@@ -201,5 +201,5 @@ Each executor will re-partition the output of the stage it is running so that it
stage. This mechanism is known as an Exchange or a Shuffle. The logic for this can be found in the [ShuffleWriterExec]
and [ShuffleReaderExec] operators.

[shufflewriterexec]: https://github.com/apache/arrow-ballista/blob/main/ballista/core/src/execution_plans/shuffle_writer.rs
[shufflereaderexec]: https://github.com/apache/arrow-ballista/blob/main/ballista/core/src/execution_plans/shuffle_reader.rs
[shufflewriterexec]: https://github.com/apache/datafusion-ballista/blob/main/ballista/core/src/execution_plans/shuffle_writer.rs
[shufflereaderexec]: https://github.com/apache/datafusion-ballista/blob/main/ballista/core/src/execution_plans/shuffle_reader.rs
36 changes: 18 additions & 18 deletions docs/source/contributors-guide/code-organization.md
@@ -23,33 +23,33 @@ This section provides links to the source code for major areas of functionality.

### ballista-core crate

- [Crate Source](https://github.com/apache/arrow-ballista/blob/main/ballista/core)
- [Protocol Buffer Definition](https://github.com/apache/arrow-ballista/blob/main/ballista/core/proto/ballista.proto)
- [Execution Plans](https://github.com/apache/arrow-ballista/tree/main/ballista/core/src/execution_plans)
- [Ballista Client](https://github.com/apache/arrow-ballista/blob/main/ballista/core/src/client.rs)
- [Crate Source](https://github.com/apache/datafusion-ballista/blob/main/ballista/core)
- [Protocol Buffer Definition](https://github.com/apache/datafusion-ballista/blob/main/ballista/core/proto/ballista.proto)
- [Execution Plans](https://github.com/apache/datafusion-ballista/tree/main/ballista/core/src/execution_plans)
- [Ballista Client](https://github.com/apache/datafusion-ballista/blob/main/ballista/core/src/client.rs)

### ballista-scheduler crate

- [Crate Source](https://github.com/apache/arrow-ballista/tree/main/ballista/scheduler)
- [Distributed Query Planner](https://github.com/apache/arrow-ballista/blob/main/ballista/scheduler/src/planner.rs)
- [gRPC Service](https://github.com/apache/arrow-ballista/blob/main/ballista/scheduler/src/scheduler_server/grpc.rs)
- [Flight SQL Service](https://github.com/apache/arrow-ballista/blob/main/ballista/scheduler/src/flight_sql.rs)
- [REST API](https://github.com/apache/arrow-ballista/tree/main/ballista/scheduler/src/api)
- [Web UI](https://github.com/apache/arrow-ballista/tree/main/ballista/scheduler/ui)
- [Prometheus Integration](https://github.com/apache/arrow-ballista/blob/main/ballista/scheduler/src/metrics/prometheus.rs)
- [Crate Source](https://github.com/apache/datafusion-ballista/tree/main/ballista/scheduler)
- [Distributed Query Planner](https://github.com/apache/datafusion-ballista/blob/main/ballista/scheduler/src/planner.rs)
- [gRPC Service](https://github.com/apache/datafusion-ballista/blob/main/ballista/scheduler/src/scheduler_server/grpc.rs)
- [Flight SQL Service](https://github.com/apache/datafusion-ballista/blob/main/ballista/scheduler/src/flight_sql.rs)
- [REST API](https://github.com/apache/datafusion-ballista/tree/main/ballista/scheduler/src/api)
- [Web UI](https://github.com/apache/datafusion-ballista/tree/main/ballista/scheduler/ui)
- [Prometheus Integration](https://github.com/apache/datafusion-ballista/blob/main/ballista/scheduler/src/metrics/prometheus.rs)

### ballista-executor crate

- [Crate Source](https://github.com/apache/arrow-ballista/tree/main/ballista/executor)
- [Flight Service](https://github.com/apache/arrow-ballista/blob/main/ballista/executor/src/flight_service.rs)
- [Executor Server](https://github.com/apache/arrow-ballista/blob/main/ballista/executor/src/executor_server.rs)
- [Crate Source](https://github.com/apache/datafusion-ballista/tree/main/ballista/executor)
- [Flight Service](https://github.com/apache/datafusion-ballista/blob/main/ballista/executor/src/flight_service.rs)
- [Executor Server](https://github.com/apache/datafusion-ballista/blob/main/ballista/executor/src/executor_server.rs)

### ballista crate

- [Crate Source](https://github.com/apache/arrow-ballista/tree/main/ballista/client)
- [Context](https://github.com/apache/arrow-ballista/blob/main/ballista/client/src/context.rs)
- [Crate Source](https://github.com/apache/datafusion-ballista/tree/main/ballista/client)
- [Context](https://github.com/apache/datafusion-ballista/blob/main/ballista/client/src/context.rs)

### PyBallista

- [Source](https://github.com/apache/arrow-ballista/tree/main/python)
- [Context](https://github.com/apache/arrow-ballista/blob/main/python/src/context.rs)
- [Source](https://github.com/apache/datafusion-ballista/tree/main/python)
- [Context](https://github.com/apache/datafusion-ballista/blob/main/python/src/context.rs)
6 changes: 3 additions & 3 deletions docs/source/index.rst
@@ -65,7 +65,7 @@ Table of content
contributors-guide/architecture
contributors-guide/code-organization
contributors-guide/development
Source code <https://github.com/apache/arrow-ballista/>
Source code <https://github.com/apache/datafusion-ballista/>

.. _toc.community:

@@ -75,5 +75,5 @@ Table of content

community/communication

Issue tracker <https://github.com/apache/arrow-ballista/issues>
Code of conduct <https://github.com/apache/arrow-ballista/blob/main/CODE_OF_CONDUCT.md>
Issue tracker <https://github.com/apache/datafusion-ballista/issues>
Code of conduct <https://github.com/apache/datafusion-ballista/blob/main/CODE_OF_CONDUCT.md>
22 changes: 11 additions & 11 deletions docs/source/user-guide/deployment/docker-compose.md
@@ -23,31 +23,31 @@ Docker Compose is a convenient way to launch a cluster when testing locally.

## Build Docker Images

Run the following commands to download the [official Docker image](https://github.com/apache/arrow-ballista/pkgs/container/arrow-ballista-standalone):
Run the following commands to download the [official Docker image](https://github.com/apache/datafusion-ballista/pkgs/container/datafusion-ballista-standalone):

```bash
docker pull ghcr.io/apache/arrow-ballista-standalone:0.12.0-rc4
docker pull ghcr.io/apache/datafusion-ballista-standalone:0.12.0-rc4
```

Alternatively, run the following commands to clone the source repository and build the Docker images from source:

```bash
git clone git@github.com:apache/arrow-ballista.git -b 0.12.0
cd arrow-ballista
git clone git@github.com:apache/datafusion-ballista.git -b 0.12.0
cd datafusion-ballista
./dev/build-ballista-docker.sh
```

This will create the following images:

- `apache/arrow-ballista-benchmarks:0.12.0`
- `apache/arrow-ballista-cli:0.12.0`
- `apache/arrow-ballista-executor:0.12.0`
- `apache/arrow-ballista-scheduler:0.12.0`
- `apache/arrow-ballista-standalone:0.12.0`
- `apache/datafusion-ballista-benchmarks:0.12.0`
- `apache/datafusion-ballista-cli:0.12.0`
- `apache/datafusion-ballista-executor:0.12.0`
- `apache/datafusion-ballista-scheduler:0.12.0`
- `apache/datafusion-ballista-standalone:0.12.0`

## Start a Cluster

Using the [docker-compose.yml](https://github.com/apache/arrow-ballista/blob/main/docker-compose.yml) from the
Using the [docker-compose.yml](https://github.com/apache/datafusion-ballista/blob/main/docker-compose.yml) from the
source repository, run the following command to start a cluster:

```bash
@@ -77,5 +77,5 @@ The scheduler web UI is available on port 80 in the scheduler.
## Connect from the Ballista CLI

```shell
docker run --network=host -it apache/arrow-ballista-cli:0.12.0 --host localhost --port 50050
docker run --network=host -it apache/datafusion-ballista-cli:0.12.0 --host localhost --port 50050
```
32 changes: 16 additions & 16 deletions docs/source/user-guide/deployment/docker.md
@@ -21,27 +21,27 @@

## Build Docker Images

Run the following commands to download the [official Docker image](https://github.com/apache/arrow-ballista/pkgs/container/arrow-ballista-standalone):
Run the following commands to download the [official Docker image](https://github.com/apache/datafusion-ballista/pkgs/container/datafusion-ballista-standalone):

```bash
docker pull ghcr.io/apache/arrow-ballista-standalone:0.12.0-rc4
docker pull ghcr.io/apache/datafusion-ballista-standalone:0.12.0-rc4
```

Alternatively, run the following commands to clone the source repository and build the Docker images from source:

```bash
git clone git@github.com:apache/arrow-ballista.git -b 0.12.0
cd arrow-ballista
git clone git@github.com:apache/datafusion-ballista.git -b 0.12.0
cd datafusion-ballista
./dev/build-ballista-docker.sh
```

This will create the following images:

- `apache/arrow-ballista-benchmarks:0.12.0`
- `apache/arrow-ballista-cli:0.12.0`
- `apache/arrow-ballista-executor:0.12.0`
- `apache/arrow-ballista-scheduler:0.12.0`
- `apache/arrow-ballista-standalone:0.12.0`
- `apache/datafusion-ballista-benchmarks:0.12.0`
- `apache/datafusion-ballista-cli:0.12.0`
- `apache/datafusion-ballista-executor:0.12.0`
- `apache/datafusion-ballista-scheduler:0.12.0`
- `apache/datafusion-ballista-standalone:0.12.0`

## Start a Cluster

@@ -51,7 +51,7 @@ Start a scheduler using the following syntax:

```bash
docker run --network=host \
-d apache/arrow-ballista-scheduler:0.12.0 \
-d apache/datafusion-ballista-scheduler:0.12.0 \
--bind-port 50050
```

@@ -60,7 +60,7 @@ Run `docker ps` to check that the process is running:
```
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
a756055576f3 apache/arrow-ballista-scheduler:0.12.0 "/root/scheduler-ent…" 8 seconds ago Up 8 seconds xenodochial_carson
a756055576f3 apache/datafusion-ballista-scheduler:0.12.0 "/root/scheduler-ent…" 8 seconds ago Up 8 seconds xenodochial_carson
```

Run `docker logs CONTAINER_ID` to check the output from the process:
@@ -84,7 +84,7 @@ Start one or more executor processes. Each executor process will need to listen

```bash
docker run --network=host \
-d apache/arrow-ballista-executor:0.12.0 \
-d apache/datafusion-ballista-executor:0.12.0 \
--external-host localhost --bind-port 50051
```

@@ -93,8 +93,8 @@ Use `docker ps` to check that both the scheduler and executor(s) are now running
```
$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
fb8b530cee6d apache/arrow-ballista-executor:0.12.0 "/root/executor-entr…" 2 seconds ago Up 1 second gallant_galois
a756055576f3 apache/arrow-ballista-scheduler:0.12.0 "/root/scheduler-ent…" 8 seconds ago Up 8 seconds xenodochial_carson
fb8b530cee6d apache/datafusion-ballista-executor:0.12.0 "/root/executor-entr…" 2 seconds ago Up 1 second gallant_galois
a756055576f3 apache/datafusion-ballista-scheduler:0.12.0 "/root/scheduler-ent…" 8 seconds ago Up 8 seconds xenodochial_carson
```

Use `docker logs CONTAINER_ID` to check the output from the executor(s):
Expand All @@ -117,7 +117,7 @@ to launch the scheduler with this option enabled.

```bash
docker run --network=host \
-d apache/arrow-ballista-scheduler:0.12.0 \
-d apache/datafusion-ballista-scheduler:0.12.0 \
--bind-port 50050 \
--config-backend etcd \
--etcd-urls etcd:2379
@@ -129,5 +129,5 @@ recommended.
## Connect from the CLI

```shell
docker run --network=host -it apache/arrow-ballista-cli:0.12.0 --host localhost --port 50050
docker run --network=host -it apache/datafusion-ballista-cli:0.12.0 --host localhost --port 50050
```
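
Before launching the CLI container, it can help to confirm that something is listening on the scheduler port. The following is a minimal sketch, assuming the scheduler was started with `--bind-port 50050` on the local host as in the commands above. The function name `scheduler_reachable` is hypothetical, and the check only verifies that a TCP connection can be opened, not that the gRPC service is healthy:

```python
import socket


def scheduler_reachable(host: str = "localhost", port: int = 50050,
                        timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    print("scheduler reachable:", scheduler_reachable())
```

If this reports `False`, `docker ps` and `docker logs CONTAINER_ID` on the scheduler container are the next things to check.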
30 changes: 15 additions & 15 deletions docs/source/user-guide/deployment/kubernetes.md
@@ -41,37 +41,37 @@ microk8s enable dns

## Build Docker Images

Run the following commands to download the [official Docker image](https://github.com/apache/arrow-ballista/pkgs/container/arrow-ballista-standalone):
Run the following commands to download the [official Docker image](https://github.com/apache/datafusion-ballista/pkgs/container/datafusion-ballista-standalone):

```bash
docker pull ghcr.io/apache/arrow-ballista-standalone:0.12.0-rc4
docker pull ghcr.io/apache/datafusion-ballista-standalone:0.12.0-rc4
```

Alternatively, run the following commands to clone the source repository and build the Docker images from source:

```bash
git clone git@github.com:apache/arrow-ballista.git -b 0.12.0
cd arrow-ballista
git clone git@github.com:apache/datafusion-ballista.git -b 0.12.0
cd datafusion-ballista
./dev/build-ballista-docker.sh
```

This will create the following images:

- `apache/arrow-ballista-benchmarks:0.12.0`
- `apache/arrow-ballista-cli:0.12.0`
- `apache/arrow-ballista-executor:0.12.0`
- `apache/arrow-ballista-scheduler:0.12.0`
- `apache/arrow-ballista-standalone:0.12.0`
- `apache/datafusion-ballista-benchmarks:0.12.0`
- `apache/datafusion-ballista-cli:0.12.0`
- `apache/datafusion-ballista-executor:0.12.0`
- `apache/datafusion-ballista-scheduler:0.12.0`
- `apache/datafusion-ballista-standalone:0.12.0`

## Publishing Docker Images

Once the images have been built, you can retag them and push them to your favourite Docker registry.

```bash
docker tag apache/arrow-ballista-scheduler:0.12.0 <your-repo>/arrow-ballista-scheduler:0.12.0
docker tag apache/arrow-ballista-executor:0.12.0 <your-repo>/arrow-ballista-executor:0.12.0
docker push <your-repo>/arrow-ballista-scheduler:0.12.0
docker push <your-repo>/arrow-ballista-executor:0.12.0
docker tag apache/datafusion-ballista-scheduler:0.12.0 <your-repo>/datafusion-ballista-scheduler:0.12.0
docker tag apache/datafusion-ballista-executor:0.12.0 <your-repo>/datafusion-ballista-executor:0.12.0
docker push <your-repo>/datafusion-ballista-scheduler:0.12.0
docker push <your-repo>/datafusion-ballista-executor:0.12.0
```
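
When several images need retagging, the same commands can be generated rather than typed by hand. This is a small sketch that only builds the command lines shown above. The function name `retag_commands` and the registry value `registry.example.com/team` are placeholders, not part of the Ballista tooling:

```python
VERSION = "0.12.0"
COMPONENTS = ("scheduler", "executor")


def retag_commands(registry: str, version: str = VERSION) -> list[list[str]]:
    """Build `docker tag` commands for each component, then `docker push` commands."""
    tags = []
    pushes = []
    for component in COMPONENTS:
        source = f"apache/datafusion-ballista-{component}:{version}"
        target = f"{registry}/datafusion-ballista-{component}:{version}"
        tags.append(["docker", "tag", source, target])
        pushes.append(["docker", "push", target])
    # All tags come first so a failed tag is noticed before anything is pushed.
    return tags + pushes


for command in retag_commands("registry.example.com/team"):
    print(" ".join(command))
```

Each command is returned as an argument list, so it could be passed to `subprocess.run(command, check=True)` instead of being printed.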

## Create Persistent Volume and Persistent Volume Claim
@@ -159,7 +159,7 @@ spec:
spec:
containers:
- name: ballista-scheduler
image: <your-repo>/arrow-ballista-scheduler:0.12.0
image: <your-repo>/datafusion-ballista-scheduler:0.12.0
args: ["--bind-port=50050"]
ports:
- containerPort: 50050
@@ -191,7 +191,7 @@ spec:
spec:
containers:
- name: ballista-executor
image: <your-repo>/arrow-ballista-executor:0.12.0
image: <your-repo>/datafusion-ballista-executor:0.12.0
args:
- "--bind-port=50051"
- "--scheduler-host=ballista-scheduler"