chore: bump pyspark version #974

Open · wants to merge 34 commits into dev
Conversation

@project-defiant (Contributor) commented Jan 16, 2025

✨ Context

This PR closes #3189 and #3680, and links to opentargets/orchestration#94.

🛠 What does this PR implement

This PR allows pyspark>=3.5.0, <3.6 along with the newest hail version that supports it.
These changes caused errors when poetry tried to resolve the dependencies, which led to further developments. The full list is described below:

  • replaced poetry with uv as the dependency management tool (~5x faster dependency resolution, without resolution errors)
  • added support for X.Y.Z-devXX releases on push to the dev branch
  • updated the way the dev dataproc cluster is created (Open Targets users only) - see details below
  • added support for multiple python versions in gentropy (py310, py311, py312); py313 support fails due to missing BLAS when installing scipy from a wheel
  • support for pyspark>=3.5.0, <3.6
  • bumped the hail version to stay in sync with the pyspark dependency
  • dropped google as a dependency and added the more granular google-cloud-storage (to be removed in a future release together with other google-dependent packages)
  • updated GitHub Actions to use uv and to test a matrix of python versions (requires changes to the repository rules)
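As a rough sketch, the dependency bounds described above might look like this in pyproject.toml (the pyspark range, supported python versions, and google-cloud-storage pin come from this PR; the exact hail pin is elided and the overall layout is an assumption):

```toml
[project]
name = "gentropy"
# py310, py311, py312; py313 is excluded due to the scipy/BLAS wheel issue
requires-python = ">=3.10,<3.13"
dependencies = [
    "pyspark (>=3.5.0, <3.6)",
    # pinned to the latest hail release that supports pyspark 3.5.x (exact pin elided)
    "hail",
    "google-cloud-storage (>=2.14.0, <2.15.0)",
]
```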

dev cluster setup

Previously, the dev cluster was created from a package pushed to a Google Cloud Storage bucket under a namespace matching the git ref name. This has now changed: the cluster installs gentropy directly from the GitHub branch (or tag) specified in the orchestration - see changes in opentargets/orchestration#94.

Benefits of the new solution

  • only one copy of the static files used to run gentropy steps (install_dependencies_on_cluster.sh and cli.py) is kept in GCS.
  • speeds up the development process, since one no longer needs to wait for GitHub Actions to finish before setting up the cluster from Airflow on a branch.
  • speeds up dependency resolution with uv pip install (pip resolution of gentropy with support for multiple python versions takes ~20 minutes, hitting the initialization actions timeout; uv does the same in about 1.5 minutes).
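A minimal sketch of what the new cluster setup boils down to, assuming gentropy is installed straight from a git ref with uv (the environment variable name and the install flags here are assumptions, not the actual initialization action):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: build the install command for a given git ref instead of
# pulling a pre-built package from GCS. GENTROPY_REF is an assumed variable name.
REF="${GENTROPY_REF:-dev}"
SPEC="gentropy @ git+https://github.com/opentargets/gentropy@${REF}"
# On a real cluster this command would be executed; here it is only constructed.
CMD="uv pip install --system '${SPEC}'"
echo "$CMD"
```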

🙈 Missing

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g. uv run pre-commit run --all-files)?

@ireneisdoomed (Contributor) left a comment:

Thank you!
2 questions:

  1. Should we also support pyspark versions lower than 3.5? AFAIK, there is nothing in the API that is not compatible. I know we have custom code to compare dataframes, something that PySpark 3.5 supports natively.
  2. We need to update the Dataproc version too. Should we do a pipeline run to test that nothing breaks?

@project-defiant (Contributor, Author) replied:
Thank you! 2 questions:

  1. Should we also support pyspark versions lower than 3.5? AFAIK, there is nothing in the API that is not compatible. I know we have custom code to compare dataframes, something that PySpark 3.5 supports natively.
  2. We need to update the Dataproc version too. Should we do a pipeline run to test that nothing breaks?

These are very nice points!

  1. I am afraid that poetry will not allow us to support more pyspark versions; the latest hail requires a version between 3.5 and 3.6, otherwise we could do it.
  2. Yes, I am planning to do that, right after I finish the harmonisation :). Let's wait for this check before we merge.

@github-actions github-actions bot added the Step label Jan 21, 2025
@github-actions github-actions bot added documentation Improvements or additions to documentation size-XL and removed size-M labels Jan 22, 2025
@project-defiant project-defiant linked an issue Jan 22, 2025 that may be closed by this pull request
3 tasks
@ireneisdoomed (Contributor) left a comment:

Really cool!! New PySpark, support for several Python versions (dev version set to 3.11), and migration to uv.

I suggest changing the PR title, this is clearly a lie haha

Setting up the dev environment was smooth, simply by doing make setup-dev (except that uv wasn't in my path, see minor comment below).

I haven't used uv myself, so I can't really evaluate pros and cons. My only question is to confirm that we are not missing anything from CI/CD after the migration?


echo "Fetching version changes..."
git fetch
if output=$(git status --porcelain) && [[ -n "$output" ]]; then
Contributor:
Nice, I didn't know about --porcelain :D
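For context on the snippet above: git status --porcelain emits a stable, script-friendly format (for example, ?? path for an untracked file), so any non-empty output means the working tree has changes. A minimal sketch of that check, with an assumed sample line:

```shell
#!/usr/bin/env bash
# Example porcelain line for an untracked file (assumed sample, not real output)
output="?? new_file.txt"
if [[ -n "$output" ]]; then
  state="dirty"   # non-empty porcelain output: uncommitted or untracked changes
else
  state="clean"
fi
echo "$state"
```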

"google-cloud-storage (>=2.14.0, <2.15.0)"
]
classifiers = [
"Programming Language :: Python :: 3.10",
Contributor:
🥳

Contributor:
The Gentropy image will be automatically updated, right?

if ! command -v uv &>/dev/null; then
    echo "uv was not found, installing uv..."
    curl -LsSf https://astral.sh/uv/install.sh | sh
    source $HOME/.local/bin/env
Contributor:
After installing the environment, uv wasn't on my path. Should we add export PATH="$HOME/.local/bin:$PATH" here?
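A sketch of that suggestion, assuming uv's installer placed the binary in $HOME/.local/bin (its default location):

```shell
#!/usr/bin/env bash
# Prepend uv's default install directory to PATH for the current shell session.
export PATH="$HOME/.local/bin:$PATH"
# Report whether uv is now resolvable, without failing hard if it is absent.
if command -v uv >/dev/null 2>&1; then
  echo "uv found on PATH"
else
  echo "uv still not on PATH; re-run the installer or open a new shell"
fi
```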


test: ## Run tests
	@echo "Running Tests..."
-	@poetry run pytest
+	@uv run pytest
Contributor:
Please test this yourself, but the current command goes idle without running any tests

Suggested change
@uv run pytest
@uv run pytest .

Successfully merging this pull request may close these issues:

  • Allow multiple python versions
  • Update Dataproc and PySpark in Genetics

3 participants