feat(glue-alpha): include extra jars parameter in pyspark jobs #33238

Merged (3 commits) on Feb 19, 2025

gontzalm
Contributor

Issue # (if applicable)

Closes #33225.

Reason for this change

PySpark jobs with extra JAR dependencies cannot be defined with the new L2 constructs introduced in v2.177.0.

Description of changes

Add the extraJars parameter in the PySpark job L2 constructs.

Checklist


By submitting this pull request, I confirm that my contribution is made under the terms of the Apache-2.0 license

@github-actions bot added the labels beginning-contributor ([Pilot] contributed between 0-2 PRs to the CDK), feature-request (A feature should be added or improved.), and p2 on Jan 30, 2025
@aws-cdk-automation requested a review from a team on January 30, 2025 14:59
@gontzalm
Contributor Author

Exemption Request: no changes in README or integration tests needed.

@aws-cdk-automation added the labels pr-linter/exemption-requested (The contributor has requested an exemption to the PR Linter feedback.) and pr/needs-community-review (This PR needs a review from a Trusted Community Member or Core Team Member.) on Jan 30, 2025
Collaborator

@aws-cdk-automation aws-cdk-automation left a comment

(This review is outdated)

@natalie-white-aws
Contributor

One of the authors of the new L2 here. We discussed this during the RFC and implementation phases as a potential anti-pattern. Can you share why you need extra JARs for a Python job?

@gontzalm
Contributor Author

Hi Natalie, we need to use the spark-xml package in order to read XML files in Spark v3 (as you probably know, this package will be included in Spark v4). This package must be provided via the extraJars parameter, because AWS Glue does not accept installing packages via --conf spark.jars.packages=<maven coordinates>.
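
For context, Glue receives extra JARs through the `--extra-jars` job parameter, whose value is a comma-separated list of complete Amazon S3 paths. A minimal Python sketch of that mapping follows. The helper name `extra_jars_argument` and the bucket path are hypothetical, and this is an illustration under those assumptions rather than the construct's actual implementation.

```python
from typing import Dict, List


def extra_jars_argument(jar_urls: List[str]) -> Dict[str, str]:
    """Build the Glue default argument that adds JARs to the job classpath.

    Hypothetical helper for illustration; not part of the CDK or boto3.
    """
    if not jar_urls:
        # No JARs requested: contribute nothing to the default arguments.
        return {}
    for url in jar_urls:
        if not url.startswith("s3://"):
            raise ValueError(f"expected a full S3 path, got: {url!r}")
    # Glue expects multiple JARs as complete paths separated by commas.
    return {"--extra-jars": ",".join(jar_urls)}


# Example with a made-up bucket and JAR name.
print(extra_jars_argument(["s3://my-bucket/jars/spark-xml.jar"]))
```

The resulting dictionary could be merged into a job's default arguments, for example when defining the job through a lower-level API.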

@natalie-white-aws
Contributor

Thanks for the extra clarification. Let me get with the Glue service team; it sounds like this may be more of a Glue feature request than something we should work around in the L2 construct. Stay tuned.

@humanzz
Contributor

humanzz commented Feb 9, 2025

+1 to this. Some libraries that provide additional Spark capabilities require a JAR, even when Spark is used from Python (PySpark).

Here's a ChatGPT-generated list of examples: https://chatgpt.com/share/67a8e12d-ccd8-800e-a641-75e58db91d7b

@natalie-white-aws
Contributor

natalie-white-aws commented Feb 9, 2025

We had some internal discussions and (in addition to the data here) decided this is a valid use case. However, the parameter should be added to all 3 PySpark job types.

@GavinZZ
Contributor

GavinZZ commented Feb 14, 2025

@gontzalm Would you be able to add this change to all 3 PySpark job types?

@humanzz
Contributor

humanzz commented Feb 18, 2025

I'm not from the CDK team, but following the discussion I started in #33356, I would suggest also supporting an extraJarsFirst prop for setting --user-jars-first, since --extra-jars and --user-jars-first tend to go together.
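
A rough sketch of how the two parameters could be set together, assuming a hypothetical `extra_jars_first` flag that mirrors the suggested extraJarsFirst prop. The helper name and S3 path are made up for illustration; only `--extra-jars` and `--user-jars-first` come from the comment above.

```python
from typing import Dict, Optional


def with_user_jars_first(
    default_arguments: Dict[str, str],
    extra_jars_first: Optional[bool] = None,
) -> Dict[str, str]:
    """Return a copy of the job's default arguments, adding --user-jars-first
    when the (hypothetical) extra_jars_first flag is enabled."""
    merged = dict(default_arguments)  # leave the caller's dict untouched
    if extra_jars_first:
        # Job parameters are strings, so the flag is passed as "true".
        merged["--user-jars-first"] = "true"
    return merged


# Example with a made-up bucket and JAR name.
base = {"--extra-jars": "s3://my-bucket/jars/spark-xml.jar"}
print(with_user_jars_first(base, extra_jars_first=True))
```

Leaving the flag unset keeps the arguments unchanged, so existing jobs would not be affected by an optional prop of this kind.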

GavinZZ
GavinZZ previously approved these changes Feb 19, 2025
Contributor

@GavinZZ GavinZZ left a comment

LGTM, thank you!

@GavinZZ
Contributor

GavinZZ commented Feb 19, 2025

@humanzz I think that's a good point, but I don't think it's worth blocking the merge of this PR. If anyone is interested in contributing the extraJarsFirst feature, feel free to tag me for a review!

@GavinZZ added the labels pr-linter/exempt-integ-test (The PR linter will not require integ test changes) and pr-linter/exempt-readme (The PR linter will not require README changes) on Feb 19, 2025
@aws-cdk-automation removed the label pr/needs-community-review (This PR needs a review from a Trusted Community Member or Core Team Member.) on Feb 19, 2025
@aws-cdk-automation dismissed their stale review on February 19, 2025 19:14

✅ Updated pull request passes all PRLinter validations. Dismissing previous PRLinter review.

Contributor

mergify bot commented Feb 19, 2025

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

codecov bot commented Feb 19, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.16%. Comparing base (6f1aa80) to head (2bbebe1).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #33238   +/-   ##
=======================================
  Coverage   82.16%   82.16%           
=======================================
  Files         119      119           
  Lines        6857     6857           
  Branches     1157     1157           
=======================================
  Hits         5634     5634           
  Misses       1120     1120           
  Partials      103      103           
Flag Coverage Δ
suite.unit 82.16% <ø> (ø)

Flags with carried forward coverage won't be shown.

Components Coverage Δ
packages/aws-cdk ∅ <ø> (∅)
packages/aws-cdk-lib/core 82.16% <ø> (ø)

Contributor

mergify bot commented Feb 19, 2025

This pull request has been removed from the queue for the following reason: pull request branch update failed.

The pull request can't be updated

You should look at the reason for the failure and decide if the pull request needs to be fixed or if you want to requeue it.

If you want to requeue this pull request, you need to post a comment with the text: @mergifyio requeue

@aaythapa
Contributor

@mergify update

Contributor

mergify bot commented Feb 19, 2025

update

❌ Mergify doesn't have permission to update

For security reasons, Mergify can't update this pull request. Try updating locally.
GitHub response: refusing to allow a GitHub App to create or update workflow .github/workflows/codecov.yml without workflows permission

@mergify bot dismissed GavinZZ’s stale review on February 19, 2025 21:42

Pull request has been modified.

Contributor

mergify bot commented Feb 19, 2025

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@aws-cdk-automation
Collaborator

AWS CodeBuild CI Report

  • CodeBuild project: AutoBuildv2Project1C6BFA3F-wQm2hXv2jqQv
  • Commit ID: 2bbebe1
  • Result: SUCCEEDED
  • Build Logs (available for 30 days)

Powered by github-codebuild-logs, available on the AWS Serverless Application Repository

Contributor

mergify bot commented Feb 19, 2025

Thank you for contributing! Your pull request will be updated from main and then merged automatically (do not update manually, and be sure to allow changes to be pushed to your fork).

@mergify bot merged commit be3bce3 into aws:main on Feb 19, 2025
20 checks passed

Comments on closed issues and PRs are hard for our team to see.
If you need help, please open a new issue that references this one.

@github-actions bot locked as resolved and limited conversation to collaborators on Feb 19, 2025
Labels
  • beginning-contributor: [Pilot] contributed between 0-2 PRs to the CDK
  • feature-request: A feature should be added or improved.
  • p2
  • pr-linter/exempt-integ-test: The PR linter will not require integ test changes
  • pr-linter/exempt-readme: The PR linter will not require README changes
  • pr-linter/exemption-requested: The contributor has requested an exemption to the PR Linter feedback.
Development

Successfully merging this pull request may close these issues.

aws-glue-alpha: extra_jars parameter in PySparkEtlJob
6 participants