Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve zip util code and add tests for both ZipArchiveUtil ane OnnxW… #14056

Merged

Conversation

anqini
Copy link
Contributor

@anqini anqini commented Nov 7, 2023

Hi team, I'm Angelo (Anqi) Ni, currently an SDE in the software industry in the US. I'm very interested in Spark NLP and hope to have the chance to contribute. This is a small Pull request to add some testing and refactor a piece of code to getting familiar with the system and understand the contribution process better. Very excited to learn from your team and get any feedback!

Description

  • Refactored ZipArchiveUtil class due to potential risk in implementation
    • zip util function arguments are not strong typed (it takes a few string instead of File)
    • separate zip single file from zip a directory without blurring them in list dir function
    • surface the exceptions when source directory is not found / destination already exist / destination has a non-exist parent folder
  • added test for ZipArchiveUtil
  • added test for OnnxWrapper

Motivation and Context

My original motivation was to investigate whether Spark-nlp is fully compatible with EMR on the ONNX and tensorflow related functionality. Since I ran into the "the SparkSession should only be created on driver" issue caused by OnnxWrapper, I suspected there was a bug in the code. But when I dived deep into the code, i found there was no test for OnnxWrapper class so it was hard to run the OnnxWrapper test on both drive node and worker node.

Caused by: java.lang.IllegalStateException: SparkSession should only be created and accessed on the driver.
	at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$assertOnDriver(SparkSession.scala:1158)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:927)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$.$anonfun$getActiveSparkSession$1(ResourceHelper.scala:56)
	at scala.Option.getOrElse(Option.scala:189)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$.getActiveSparkSession(ResourceHelper.scala:57)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$.spark$lzycompute(ResourceHelper.scala:104)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$.spark(ResourceHelper.scala:104)
	at com.johnsnowlabs.util.ConfigHelper$.sparkSession$lzycompute(ConfigHelper.scala:23)
	at com.johnsnowlabs.util.ConfigHelper$.sparkSession(ConfigHelper.scala:23)
	at com.johnsnowlabs.util.ConfigHelper$.getConfigValueOrElse(ConfigHelper.scala:79)
	at com.johnsnowlabs.util.ConfigLoader$.getConfigInfo(ConfigLoader.scala:70)
	at com.johnsnowlabs.util.ConfigLoader$.configData$lzycompute(ConfigLoader.scala:36)
	at com.johnsnowlabs.util.ConfigLoader$.configData(ConfigLoader.scala:60)
	at com.johnsnowlabs.util.ConfigLoader$.getConfigIntValue(ConfigLoader.scala:79)
	at com.johnsnowlabs.ml.onnx.OnnxWrapper$.getCPUSessionConfig(OnnxWrapper.scala:195)

Before reaching my original goal, I'm happy to add some test first like I did in this Pull request. I'm also hoping to collaborate with your team to help Spark-nlp work better on EMR in a long term. I think it very promising since EMR is probably still one of the most popular places people run spark on.

How Has This Been Tested?

Unit test at local laptop.

ZipArchiveUtil is used in the following places

  • OnnxWrapper.scala
  • TensorflowWrapper.scala
  • TrainingHelper.scala

Unit test has been added to OnnxWrapper. Since TensorflowWrapper and TrainingHelper shared the same logic, we skip them for now.

Screenshots (if appropriate):

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • Code improvements with no or little impact
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

  • My code follows the code style of this project.
  • My change requires a change to the documentation.
  • I have updated the documentation accordingly.
  • I have read the CONTRIBUTING page.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@maziyarpanahi
Copy link
Member

Hi @anqini

First, let me thank you for your contributions. Highly appreciate it.

Could you please provide some specs (EMR version, spark-nlp version, etc.) and some steps to reproduce this? I recently had a Webinar in which I used both AWS Glue (4.0) and AWS EMR (6.x) to showcase Spark NLP (E5 embeddings) by using ONNX models. It would be very helpful if you can show us how to reproduce this error via some steps.

Thanks again for your contributions

@maziyarpanahi maziyarpanahi self-assigned this Nov 7, 2023
@maziyarpanahi maziyarpanahi added bug-fix DON'T MERGE Do not merge this PR labels Nov 7, 2023
@anqini
Copy link
Contributor Author

anqini commented Nov 7, 2023

Hi Maziyar,

Thanks for your response.

I created an issue with the bug i observed.
#14057

But this PR is NOT yet resolving it. This is to update code in zipUtil to make testing OnnxWrapper easier.
It helps to resolve some minor issues such as exception handling.

for example, the zip function failed to create a zip file but there is no exception thrown.

import com.johnsnowlabs.util.ZipArchiveUtil
import java.io.File

new File("./demofolder").mkdir()
new File("./demofolder/fileA").createNewFile()
ZipArchiveUtil.zip("./demofolder", "./noway/out.zip")
assert(new File("./noway/out.zip").exists())

there are also corner cases like if target file exists it doesn't throw the error.

Let me know if you want to me create an issue for that too.

@anqini
Copy link
Contributor Author

anqini commented Nov 15, 2023

Hi @maziyarpanahi , can we take a step to review and merge this PR and then i can send a new PR to resolve the multi-task node issue? Does that sound good to you? Feel free to let me know if you have other plan. thanks

@maziyarpanahi
Copy link
Member

Hi @maziyarpanahi , can we take a step to review and merge this PR and then i can send a new PR to resolve the multi-task node issue? Does that sound good to you? Feel free to let me know if you have other plan. thanks

Hi @anqini
Absolutely! I'll fast track the review for this PR and merge it into our next RC branch.

Thanks again for your contribution

@anqini
Copy link
Contributor Author

anqini commented Nov 15, 2023

Thanks for your reply. The Java heap space exception is kind of interesting. I did observe that the testing process used a lot of memory on my local desktop.

@danilojsl
Copy link
Contributor

Hi @anqini ,

I wanted to express my sincere gratitude for your recent contribution. Your efforts in enhancing the ZipArchiveUtil component are invaluable to our library, and we truly appreciate your dedication.

Considering the critical role of ZipArchiveUtil in our annotators, I was wondering if you had the chance to conduct any end-to-end tests. Specifically, have you run our example notebooks with the implemented changes?

This component is integral to various annotators, including all of those in Onnx and some other in Tensorflow such as ClassifierDLModel, MultiClassifierDLModel, SentimentDLModel, NerDLModel, SentenceDetectorDLModel, ContextSpellCheckerModel, and ElmoEmbeddings. It would greatly benefit our workflow if you could perform tests on some of these models.

While the modifications LGFM, I plan to run tests on a few annotators to ensure seamless functionality before final approval. Your collaboration in testing these changes would be immensely helpful.

Once again, thank you for your contribution, and I look forward to your continued support.

@anqini
Copy link
Contributor Author

anqini commented Dec 4, 2023

@danilojsl Totally understand regression testing on annotator components is important. Thanks for taking over the those tests for this PR.

I'm not familiar with the Colab environment setup so lack of knowledge on how to run the same code with package replaced. But I had an informal way to do the regression with a new jar built from SBT.

  • on the cluster, run the normal sparknlp.start() command. It comes with the spark.jar.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.0 configuration which leverage ivy to resolve dependencies.
  • after the first start, close the spark session.
  • remove the default spark-nlp jar to avoid conflict, and download the new library.
  • create a new spark session with all other ivy dependency jars in '/root/.ivy2/*.jar' or hadoop ivy folder, plus we need to load the new spark-nlp jar with proposed changes.

I did this in the Scala + EMR and it reflected the latest changes. I'm also looking into pyspark code to understand how it work in pyspark. It will be very helpful to have a simpler way to run regression on notebooks if we don't have it yet.

I'll try to provide some complementary end-to-end tests and attach the resolve here later. Let me know if you have any comment.

@danilojsl
Copy link
Contributor

Hi @anqini

Thanks for the details to run tests in EMR. I tested your changes in Google Colab and all passed ✅.

@maziyarpanahi maziyarpanahi changed the base branch from master to release/520-release-candidate December 7, 2023 18:37
@maziyarpanahi maziyarpanahi merged commit 37b24b9 into JohnSnowLabs:release/520-release-candidate Dec 7, 2023
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug-fix DON'T MERGE Do not merge this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants