[rapids] removed spark tests, updated to a more recent rapids release #1219

Draft
wants to merge 74 commits into base: master

Conversation

cjac
Contributor

@cjac cjac commented Aug 8, 2024

Tested with CUDA=11 and CUDA=12

@cjac
Contributor Author

cjac commented Aug 8, 2024

I prefer this to #1218

@cjac cjac marked this pull request as draft August 8, 2024 16:37
@cjac
Contributor Author

cjac commented Aug 9, 2024

/gcbrun

@cjac cjac changed the title from "[rapids] tested with 24.06" to "[rapids] tested with CUDA11+22.08 and CUDA12+24.06" Aug 9, 2024
@cjac
Contributor Author

cjac commented Aug 9, 2024

/gcbrun

2 similar comments
@cjac
Contributor Author

cjac commented Aug 9, 2024

/gcbrun

@prince-cs
Collaborator

/gcbrun

@prince-cs
Collaborator

Should we increase the machine type from n1-standard-4 to n1-standard-16?

@cjac
Contributor Author

cjac commented Aug 9, 2024

CUDA 11 has been manually tested with all versions.
The Dataproc 2.0 images all pass the automated tests and can be assumed to work with CUDA 12 as well.
Trying CUDA 12 on 2.1 and 2.2 now.

@cjac
Contributor Author

cjac commented Aug 9, 2024

/gcbrun

1 similar comment
@cjac
Contributor Author

cjac commented Aug 9, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Aug 9, 2024

Tests are failing for:

  • 2.1-debian11
  • 2.1-rocky8
  • 2.1-ubuntu20
  • 2.2-debian12
  • 2.2-rocky9
  • 2.2-ubuntu22

@cjac cjac changed the title from "[rapids] tested with CUDA11+22.08 and CUDA12+24.06" to "[rapids] tested with CUDA11+22.06 and CUDA12+24.06" Aug 9, 2024
@cjac
Contributor Author

cjac commented Aug 9, 2024

/gcbrun

1 similar comment
@cjac
Contributor Author

cjac commented Aug 10, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Aug 10, 2024

[edit: this was a misconfiguration in the systemd unit]

It looks like the dask infrastructure is out of date and I'll have to target 2023.12 instead.

root@cluster-1718310842-m:~# /opt/conda/miniconda3/envs/dask/bin/python /tmp/init/dask/verify_dask_standalone.py 
/opt/conda/miniconda3/envs/dask/lib/python3.11/site-packages/distributed/client.py:1394: VersionMismatchWarning: Mismatched versions found

+-------------+----------------+----------------+---------+
| Package     | Client         | Scheduler      | Workers |
+-------------+----------------+----------------+---------+
| dask        | 2024.6.2       | 2023.12.1      | None    |
| distributed | 2024.6.2       | 2023.12.1      | None    |
| python      | 3.11.9.final.0 | 3.11.8.final.0 | None    |
| tornado     | 6.4.1          | 6.3.3          | None    |
+-------------+----------------+----------------+---------+
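
The table above can be read mechanically: a package is mismatched whenever the client and the scheduler report different version strings. A minimal sketch of that comparison follows, assuming the versions have already been collected into plain dictionaries. The `find_mismatches` helper and the dictionary layout are hypothetical and are not part of `distributed`.

```python
def find_mismatches(client, scheduler):
    """Return {package: (client_version, scheduler_version)} where they differ."""
    mismatches = {}
    for package in sorted(set(client) | set(scheduler)):
        client_version = client.get(package)
        scheduler_version = scheduler.get(package)
        if client_version != scheduler_version:
            mismatches[package] = (client_version, scheduler_version)
    return mismatches


# Versions copied from the warning table above.
client_versions = {
    "dask": "2024.6.2",
    "distributed": "2024.6.2",
    "python": "3.11.9.final.0",
    "tornado": "6.4.1",
}
scheduler_versions = {
    "dask": "2023.12.1",
    "distributed": "2023.12.1",
    "python": "3.11.8.final.0",
    "tornado": "6.3.3",
}

for package, (c, s) in find_mismatches(client_versions, scheduler_versions).items():
    print(f"{package}: client={c} scheduler={s}")
```

All four packages differ here, which is consistent with the client and the scheduler running from two different conda environments.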

@cjac
Contributor Author

cjac commented Aug 10, 2024

I also need to lower the Python ABI to 3.10.

@cjac cjac changed the title from "[rapids] tested with CUDA11+22.06 and CUDA12+24.06" to "[rapids] tested with CUDA11+22.06 and CUDA12+23.12" Aug 10, 2024
@cjac cjac changed the title from "[rapids] tested with CUDA11+22.06 and CUDA12+23.12" to "[rapids,dask] tested with CUDA11+22.06 and CUDA12+23.12" Aug 10, 2024
@cjac
Contributor Author

cjac commented Aug 10, 2024

/gcbrun

6 similar comments
@cjac
Contributor Author

cjac commented Aug 10, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Aug 10, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Aug 11, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Aug 11, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Aug 11, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Aug 11, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Oct 11, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Oct 11, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Oct 11, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Oct 11, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Oct 14, 2024

/gcbrun

@jakirkham jakirkham left a comment

Thanks for your continued persistence here CJ! 🙏

At this point, would it be worthwhile to try with RAPIDS 24.10?

rapids/rapids.sh Outdated
function is_cuda12() { [[ "${CUDA_VERSION%%.*}" == "12" ]] ; }
function is_cuda11() { [[ "${CUDA_VERSION%%.*}" == "11" ]] ; }

readonly DEFAULT_DASK_RAPIDS_VERSION="24.08"

Suggested change
- readonly DEFAULT_DASK_RAPIDS_VERSION="24.08"
+ readonly DEFAULT_DASK_RAPIDS_VERSION="24.10"

rapids/rapids.sh Outdated
# SPARK config
readonly DEFAULT_SPARK_RAPIDS_VERSION="24.08.0"

Suggested change
- readonly DEFAULT_SPARK_RAPIDS_VERSION="24.08.0"
+ readonly DEFAULT_SPARK_RAPIDS_VERSION="24.10.0"

@cjac
Contributor Author

cjac commented Oct 14, 2024

/gcbrun

@cjac
Contributor Author

cjac commented Oct 19, 2024

> Thanks for your continued persistence here CJ! 🙏
>
> At this point, would it be worthwhile to try with RAPIDS 24.10?

I'm having a hard time with 24.08 right now due to custreamz not matching:

2024-10-19T19:16:35.910978+00:00 + /opt/conda/miniconda3/bin/conda create -m -n dask-rapids -y --no-channel-priority -c conda-forge -c nvidia -c rapidsai 'cuda-version>=12,<13' rapids=24.08 'dask>=2024.8' dask-bigquery dask-ml dask-sql cudf numba 'python>=3.11'
2024-10-19T19:16:38.254208+00:00 Channels:
2024-10-19T19:16:38.254317+00:00  - conda-forge
2024-10-19T19:16:38.254362+00:00  - nvidia
2024-10-19T19:16:38.254536+00:00  - rapidsai
2024-10-19T19:16:38.254566+00:00 Platform: linux-64
2024-10-19T19:17:22.220346+00:00 Collecting package metadata (repodata.json): ...working... done
2024-10-19T19:18:06.719423+00:00 Solving environment: ...working... failed
2024-10-19T19:18:06.719680+00:00
2024-10-19T19:18:06.719739+00:00 LibMambaUnsatisfiableError: Encountered problems while solving:
2024-10-19T19:18:06.719785+00:00   - package rapids-24.08.00-cuda11_py310_240808_g86654f0_0 requires custreamz 24.08.*, but none of the providers can be installed
2024-10-19T19:18:06.719825+00:00
2024-10-19T19:18:06.719861+00:00 Could not solve for environment specs
2024-10-19T19:18:06.719900+00:00 The following packages are incompatible
2024-10-19T19:18:06.719936+00:00 ├─ dask >=2024.8  is requested and can be installed;
2024-10-19T19:18:06.719971+00:00 └─ rapids 24.08**  is not installable because it requires
2024-10-19T19:18:06.720010+00:00    └─ custreamz 24.08.* , which requires
2024-10-19T19:18:06.720046+00:00       └─ rapids-dask-dependency 24.08.* , which requires
2024-10-19T19:18:06.720081+00:00          └─ dask 2024.7.1 , which conflicts with any installable versions previously reported.
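
The solver output boils down to one unsatisfiable pair: the command requests `dask>=2024.8`, while `rapids-dask-dependency 24.08.*` requires `dask 2024.7.1`. A minimal sketch of that check follows, comparing dotted versions as integer tuples. The helper names are hypothetical, and real conda version matching handles many more cases than this.

```python
def parse_version(text):
    """Turn a dotted numeric version such as '2024.7.1' into (2024, 7, 1)."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(pinned, minimum):
    """True if the pinned version is at least the requested minimum."""
    return parse_version(pinned) >= parse_version(minimum)


requested_minimum = "2024.8"   # from the 'dask>=2024.8' spec in the command
pinned_by_rapids = "2024.7.1"  # from rapids-dask-dependency 24.08.*

if not satisfies_minimum(pinned_by_rapids, requested_minimum):
    print(f"dask {pinned_by_rapids} cannot satisfy dask>={requested_minimum}")
```

Under this reading, either relaxing the `dask>=2024.8` bound or moving to a RAPIDS release whose Dask requirement meets it would let the solve proceed.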

@jakirkham

Is Dask installed before RAPIDS is installed? If so, could they be combined into the same install step? I think that would minimize conflicts.

I should add that RAPIDS 24.10 was recently released and uses a newer version of Dask, which may also minimize conflicts.
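
Combining Dask and RAPIDS into a single install step could be sketched as below. The flags and channels are copied from the `conda create` command logged earlier in this thread. The `build_create_command` helper, the shortened package list, and the choice to leave `dask` without a version bound are assumptions for illustration only.

```python
import shlex


def build_create_command(rapids_version, cuda_major, env_name="dask-rapids"):
    """Build one `conda create` argv that solves RAPIDS and Dask together."""
    return [
        "conda", "create", "-m", "-n", env_name, "-y",
        "--no-channel-priority",
        "-c", "conda-forge", "-c", "nvidia", "-c", "rapidsai",
        f"cuda-version>={cuda_major},<{cuda_major + 1}",
        f"rapids={rapids_version}",
        # Assumption: dask is listed without a version bound so that the
        # solver picks whichever release this RAPIDS version depends on.
        "dask",
    ]


argv = build_create_command("24.10", 12)
print(shlex.join(argv))
```

Because every spec is handed to the solver at once, a Dask release chosen in an earlier step cannot conflict with the one RAPIDS requires.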

@cjac cjac changed the title from "[rapids] removed spark tests, updated to latest rapids release" to "[rapids] removed spark tests, updated to a more recent rapids release" Oct 23, 2024
@cjac
Contributor Author

cjac commented Oct 23, 2024

okay, I have a lot of changes to merge from some work I did while I was adding secure-boot support to the custom-images repo. It's kind of a lot.

@cjac
Contributor Author

cjac commented Oct 23, 2024

rapids/BUILD
* removed dependence on verify_xgboost_spark.scala - this belongs in [spark-rapids]
* removed dependence on dask

rapids/rapids.sh
* added utility functions
* reverted dask_spec="dask>=2024.5"
* using realpath to /opt/conda/miniconda3/bin/mamba instead of default symlink
* remove conda environment [dask] if installed
* asserting existence of directory depended on by the script when run as custom-images script
* created exit_handler and prepare_to_install functions to set up and clean up

rapids/test_rapids.py
* refactored to make use of systemd unit defined in rapids.sh
* added retry to ssh
* removed condition to keep tests from running on 2.0 images
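
The "added retry to ssh" item could look roughly like the following. This is a sketch only: `run_with_retry`, its parameters, and the linear backoff are hypothetical and are not taken from `rapids/test_rapids.py`.

```python
import time


def run_with_retry(action, attempts=3, delay_seconds=1.0, sleep=time.sleep):
    """Call `action` until it succeeds or `attempts` tries are used up."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:  # a real test would catch a narrower error
            last_error = error
            if attempt < attempts:
                sleep(delay_seconds * attempt)  # linear backoff between tries
    raise last_error


# Stand-in for an ssh call that fails twice before the node is reachable.
calls = {"count": 0}


def flaky_ssh():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("ssh not ready yet")
    return "ok"


print(run_with_retry(flaky_ssh, delay_seconds=0.0))
```

Passing `sleep` as a parameter keeps the helper testable without real waiting.
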
@cjac
Contributor Author

cjac commented Oct 24, 2024

/gcbrun

4 participants