Match deequ support for spark 3.2.1 #93

Closed
epilif1017a opened this issue Mar 9, 2022 · 8 comments

Comments

@epilif1017a

epilif1017a commented Mar 9, 2022

Deequ now supports Spark 3.2.1. However, pydeequ has still not caught up to Spark 3.1.x.

Goal: update pydeequ to support the new deequ version and Spark 3.2.1.

@ghirardinicola

Besides not being up to date with the deequ version in the packages (pydeequ.deequ_maven_coord), are there other problems?

@epilif1017a
Author

epilif1017a commented May 13, 2022

Hi @ghirardinicola, thanks for reaching out :)

Honestly, we can’t tell, as we are still on Spark 3.1.2 in our framework. We are holding off while we decide whether to cut deequ out of a light version of the framework, and we are avoiding rolling out the DQ part of the framework globally because we unfortunately cannot fork the project internally at the moment to keep up with deequ (or contribute to the open source project; maybe one day we will find the capacity to do it).

On 3.1.2, the issue for us was that in certain scenarios the Spark app wouldn’t finish automatically if we had pydeequ actions. We managed to sort that out by manually closing the Spark context gateway (I think you have this on your issue list as well; at least I remember seeing an issue). On 3.2 we have not tested yet, but I believe there are issues in your issue list reporting that some analyzers do not work.

Therefore, it would give the pydeequ user base much more confidence in the project if there were a faster release cycle across Spark versions, deequ, and pydeequ. Don’t get me wrong: we all understand that, as an open source project, the dev team is already kind enough to spend extra time working on it. But because this project and deequ are so cool, I believe their roadmap is very important to potential heavy users like us.

That’s why we are so interested and always asking for a new version :) We fully understand that things take time, but it would be cool to know whether there are still plans to keep updating the project. That would help the community decide whether to fork the project, go the extra mile to find time to go through all the code and start contributing to the open source project, or make some other decision.

We appreciate all your help and your kindness in making this open to everyone!
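
For readers hitting the same hang, the workaround described above (manually closing the Spark context gateway before stopping Spark) might look roughly like the sketch below. The helper name `stop_spark_cleanly` is hypothetical, and `_gateway` is a private PySpark attribute, so this relies on internals that could change between versions.

```python
def stop_spark_cleanly(spark):
    """Close the Py4J callback server, then stop the Spark session.

    `spark` is assumed to be an active pyspark.sql.SparkSession.
    """
    try:
        # Private attribute: the Py4J gateway held by the SparkContext.
        # Its callback server can keep the driver process alive.
        spark.sparkContext._gateway.shutdown_callback_server()
    finally:
        # Stop the session even if closing the callback server fails.
        spark.stop()
```

It would be called once at the very end of the job, after all pydeequ checks and analyzers have run.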

@mycaule

mycaule commented May 30, 2022

The ApproxCountDistinct analyzer doesn't work, and neither do ConstraintSuggestionRunner and ColumnProfilerRunner.

java.lang.NoSuchMethodError: 'org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression. ...'
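
A `NoSuchMethodError` like this usually points to a jar built against a different Spark version than the one running. As a rough sanity check, assuming deequ Maven coordinates end in a `-spark-X.Y` suffix, one could compare that suffix with the running Spark version. The helper names and the `x.y.z` placeholder versions below are illustrative only, not part of pydeequ.

```python
import re


def spark_minor(version):
    """Return 'major.minor' from a version string such as '3.2.1'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised Spark version: {version!r}")
    return f"{match.group(1)}.{match.group(2)}"


def coord_targets_spark(coord, spark_version):
    """True if the coordinate's '-spark-X.Y' suffix matches spark_version."""
    suffix = re.search(r"-spark-(\d+\.\d+)$", coord)
    if suffix is None:
        # No suffix present, so compatibility cannot be confirmed.
        return False
    return suffix.group(1) == spark_minor(spark_version)


# Placeholder coordinates; 'x.y.z' stands in for a real deequ version.
print(coord_targets_spark("com.amazon.deequ:deequ:x.y.z-spark-3.0", "3.2.1"))
print(coord_targets_spark("com.amazon.deequ:deequ:x.y.z-spark-3.2", "3.2.1"))
```

In a live session the two inputs would be `pydeequ.deequ_maven_coord` and `spark.version`. A mismatch does not prove the cause of the error, but it is a cheap first thing to rule out.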

@hashanpasindu

Does pydeequ==1.0.1 support Spark 3.2?

@mycaule

mycaule commented Mar 28, 2023

Yes, it also generally works with Spark 3.3 and 3.1, but some components don't.

@hashanpasindu

hashanpasindu commented Mar 28, 2023

Thanks for the reply. I'm trying to use the column profiler and I'm getting this error:

java.lang.NoSuchMethodError: 'org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression. ...

I use pyspark 3.2.1 on Databricks and pydeequ 1.0.1.

Is there any workaround?

@mycaule

mycaule commented Mar 28, 2023

There isn't; it is the same problem I had above.
You should also have a look at #106; the release is expected very soon.

@chenliu0831
Contributor

1.1.0 is released with Spark 3.2/3.3 support, so I would close this for now.
