improve zip util code and add tests for both ZipArchiveUtil ane OnnxW… #14056

anqini · 2023-11-07T00:06:28Z

Hi team, I'm Angelo (Anqi) Ni, currently an SDE in the software industry in the US. I'm very interested in Spark NLP and hope to have the chance to contribute. This is a small Pull request to add some testing and refactor a piece of code to getting familiar with the system and understand the contribution process better. Very excited to learn from your team and get any feedback!

Description

Refactored ZipArchiveUtil class due to potential risk in implementation
- zip util function arguments are not strong typed (it takes a few string instead of File)
- separate zip single file from zip a directory without blurring them in list dir function
- surface the exceptions when source directory is not found / destination already exist / destination has a non-exist parent folder
added test for ZipArchiveUtil
added test for OnnxWrapper

Motivation and Context

My original motivation was to investigate whether Spark-nlp is fully compatible with EMR on the ONNX and tensorflow related functionality. Since I ran into the "the SparkSession should only be created on driver" issue caused by OnnxWrapper, I suspected there was a bug in the code. But when I dived deep into the code, i found there was no test for OnnxWrapper class so it was hard to run the OnnxWrapper test on both drive node and worker node.

Caused by: java.lang.IllegalStateException: SparkSession should only be created and accessed on the driver.
	at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$assertOnDriver(SparkSession.scala:1158)
	at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:927)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$.$anonfun$getActiveSparkSession$1(ResourceHelper.scala:56)
	at scala.Option.getOrElse(Option.scala:189)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$.getActiveSparkSession(ResourceHelper.scala:57)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$.spark$lzycompute(ResourceHelper.scala:104)
	at com.johnsnowlabs.nlp.util.io.ResourceHelper$.spark(ResourceHelper.scala:104)
	at com.johnsnowlabs.util.ConfigHelper$.sparkSession$lzycompute(ConfigHelper.scala:23)
	at com.johnsnowlabs.util.ConfigHelper$.sparkSession(ConfigHelper.scala:23)
	at com.johnsnowlabs.util.ConfigHelper$.getConfigValueOrElse(ConfigHelper.scala:79)
	at com.johnsnowlabs.util.ConfigLoader$.getConfigInfo(ConfigLoader.scala:70)
	at com.johnsnowlabs.util.ConfigLoader$.configData$lzycompute(ConfigLoader.scala:36)
	at com.johnsnowlabs.util.ConfigLoader$.configData(ConfigLoader.scala:60)
	at com.johnsnowlabs.util.ConfigLoader$.getConfigIntValue(ConfigLoader.scala:79)
	at com.johnsnowlabs.ml.onnx.OnnxWrapper$.getCPUSessionConfig(OnnxWrapper.scala:195)

Before reaching my original goal, I'm happy to add some test first like I did in this Pull request. I'm also hoping to collaborate with your team to help Spark-nlp work better on EMR in a long term. I think it very promising since EMR is probably still one of the most popular places people run spark on.

How Has This Been Tested?

Unit test at local laptop.

ZipArchiveUtil is used in the following places

OnnxWrapper.scala
TensorflowWrapper.scala
TrainingHelper.scala

Unit test has been added to OnnxWrapper. Since TensorflowWrapper and TrainingHelper shared the same logic, we skip them for now.

Screenshots (if appropriate):

Types of changes

Bug fix (non-breaking change which fixes an issue)
Code improvements with no or little impact
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Checklist:

My code follows the code style of this project.
My change requires a change to the documentation.
I have updated the documentation accordingly.
I have read the CONTRIBUTING page.
I have added tests to cover my changes.
All new and existing tests passed.

…rapper

maziyarpanahi · 2023-11-07T09:05:38Z

Hi @anqini

First, let me thank you for your contributions. Highly appreciate it.

Could you please provide some specs (EMR version, spark-nlp version, etc.) and some steps to reproduce this? I recently had a Webinar in which I used both AWS Glue (4.0) and AWS EMR (6.x) to showcase Spark NLP (E5 embeddings) by using ONNX models. It would be very helpful if you can show us how to reproduce this error via some steps.

Thanks again for your contributions

anqini · 2023-11-07T23:18:51Z

Hi Maziyar,

Thanks for your response.

I created an issue with the bug i observed.
#14057

But this PR is NOT yet resolving it. This is to update code in zipUtil to make testing OnnxWrapper easier.
It helps to resolve some minor issues such as exception handling.

for example, the zip function failed to create a zip file but there is no exception thrown.

import com.johnsnowlabs.util.ZipArchiveUtil
import java.io.File

new File("./demofolder").mkdir()
new File("./demofolder/fileA").createNewFile()
ZipArchiveUtil.zip("./demofolder", "./noway/out.zip")
assert(new File("./noway/out.zip").exists())

there are also corner cases like if target file exists it doesn't throw the error.

Let me know if you want to me create an issue for that too.

anqini · 2023-11-15T02:00:50Z

Hi @maziyarpanahi , can we take a step to review and merge this PR and then i can send a new PR to resolve the multi-task node issue? Does that sound good to you? Feel free to let me know if you have other plan. thanks

maziyarpanahi · 2023-11-15T10:23:52Z

Hi @maziyarpanahi , can we take a step to review and merge this PR and then i can send a new PR to resolve the multi-task node issue? Does that sound good to you? Feel free to let me know if you have other plan. thanks

Hi @anqini
Absolutely! I'll fast track the review for this PR and merge it into our next RC branch.

Thanks again for your contribution

anqini · 2023-11-15T17:21:31Z

Thanks for your reply. The Java heap space exception is kind of interesting. I did observe that the testing process used a lot of memory on my local desktop.

danilojsl · 2023-11-30T13:35:39Z

Hi @anqini ,

I wanted to express my sincere gratitude for your recent contribution. Your efforts in enhancing the ZipArchiveUtil component are invaluable to our library, and we truly appreciate your dedication.

Considering the critical role of ZipArchiveUtil in our annotators, I was wondering if you had the chance to conduct any end-to-end tests. Specifically, have you run our example notebooks with the implemented changes?

This component is integral to various annotators, including all of those in Onnx and some other in Tensorflow such as ClassifierDLModel, MultiClassifierDLModel, SentimentDLModel, NerDLModel, SentenceDetectorDLModel, ContextSpellCheckerModel, and ElmoEmbeddings. It would greatly benefit our workflow if you could perform tests on some of these models.

While the modifications LGFM, I plan to run tests on a few annotators to ensure seamless functionality before final approval. Your collaboration in testing these changes would be immensely helpful.

Once again, thank you for your contribution, and I look forward to your continued support.

anqini · 2023-12-04T10:27:14Z

@danilojsl Totally understand regression testing on annotator components is important. Thanks for taking over the those tests for this PR.

I'm not familiar with the Colab environment setup so lack of knowledge on how to run the same code with package replaced. But I had an informal way to do the regression with a new jar built from SBT.

on the cluster, run the normal sparknlp.start() command. It comes with the spark.jar.packages=com.johnsnowlabs.nlp:spark-nlp_2.12:4.3.0 configuration which leverage ivy to resolve dependencies.
after the first start, close the spark session.
remove the default spark-nlp jar to avoid conflict, and download the new library.
create a new spark session with all other ivy dependency jars in '/root/.ivy2/*.jar' or hadoop ivy folder, plus we need to load the new spark-nlp jar with proposed changes.

I did this in the Scala + EMR and it reflected the latest changes. I'm also looking into pyspark code to understand how it work in pyspark. It will be very helpful to have a simpler way to run regression on notebooks if we don't have it yet.

I'll try to provide some complementary end-to-end tests and attach the resolve here later. Let me know if you have any comment.

danilojsl · 2023-12-06T02:12:17Z

Hi @anqini

Thanks for the details to run tests in EMR. I tested your changes in Google Colab and all passed ✅.

Angelo Anqi Ni added 2 commits November 6, 2023 22:46

improve zip util code and add tests for both ZipArchiveUtil ane OnnxW…

b6e4d5d

…rapper

fix onnx test

95a34f9

maziyarpanahi requested review from maziyarpanahi and danilojsl November 7, 2023 09:03

maziyarpanahi self-assigned this Nov 7, 2023

maziyarpanahi added bug-fix DON'T MERGE Do not merge this PR labels Nov 7, 2023

fix zipUtil test element ordering issue

011f92b

danilojsl approved these changes Nov 30, 2023

View reviewed changes

maziyarpanahi approved these changes Dec 7, 2023

View reviewed changes

maziyarpanahi changed the base branch from master to release/520-release-candidate December 7, 2023 18:37

maziyarpanahi merged commit 37b24b9 into JohnSnowLabs:release/520-release-candidate Dec 7, 2023
5 checks passed

maziyarpanahi mentioned this pull request Dec 7, 2023

520-release-candidate #14084

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve zip util code and add tests for both ZipArchiveUtil ane OnnxW… #14056

improve zip util code and add tests for both ZipArchiveUtil ane OnnxW… #14056

anqini commented Nov 7, 2023 •

edited

Loading

maziyarpanahi commented Nov 7, 2023

anqini commented Nov 7, 2023 •

edited

Loading

anqini commented Nov 15, 2023 •

edited

Loading

maziyarpanahi commented Nov 15, 2023

anqini commented Nov 15, 2023

danilojsl commented Nov 30, 2023

anqini commented Dec 4, 2023 •

edited

Loading

danilojsl commented Dec 6, 2023

improve zip util code and add tests for both ZipArchiveUtil ane OnnxW… #14056

improve zip util code and add tests for both ZipArchiveUtil ane OnnxW… #14056

Conversation

anqini commented Nov 7, 2023 • edited Loading

Description

Motivation and Context

How Has This Been Tested?

Screenshots (if appropriate):

Types of changes

Checklist:

maziyarpanahi commented Nov 7, 2023

anqini commented Nov 7, 2023 • edited Loading

anqini commented Nov 15, 2023 • edited Loading

maziyarpanahi commented Nov 15, 2023

anqini commented Nov 15, 2023

danilojsl commented Nov 30, 2023

anqini commented Dec 4, 2023 • edited Loading

danilojsl commented Dec 6, 2023

anqini commented Nov 7, 2023 •

edited

Loading

anqini commented Nov 7, 2023 •

edited

Loading

anqini commented Nov 15, 2023 •

edited

Loading

anqini commented Dec 4, 2023 •

edited

Loading