
Java Heap Out of Memory Error #11

Open
soleyjh opened this issue Dec 2, 2021 · 15 comments
Assignees
Labels
enhancement New feature or request

Comments

@soleyjh

soleyjh commented Dec 2, 2021

Hi! And thanks so much for writing this great package!

When I run the following command:

df = spark.read.format("com.elastacloud.spark.excel").option("cellAddress", "A1").load(file_location)

I get the following error:

java.lang.OutOfMemoryError: Java heap space

The Excel file is roughly 218MB, with about 750K rows and ~50 fields (no long text strings).

I'm running in Azure Databricks: 9.1 LTS (includes Apache Spark 3.1.2, Scala 2.12), on a Standard_F4 with 8GB memory and 4 cores.

I downloaded the following JAR and Installed on the Cluster (Above): spark_excel_3_1_2_0_1_7.jar

Any Help or Advice would be appreciated.

@dazfuller dazfuller self-assigned this Dec 3, 2021
@dazfuller dazfuller added the question Further information is requested label Dec 3, 2021
@dazfuller
Contributor

Hi, and thanks for the nice feedback 🙂

I'm guessing that the file you're opening is an .xlsx file, which means that 218MB is its compressed size. The library has to decompress the entire file so that it can read it (this is what makes formula evaluation work), which can lead to OOM errors. I'm looking at adding a simplified mode which lets the library stream the data out without formula evaluation.

One thing you could try is a memory-optimized SKU, or a standard SKU with more memory available. There's also the option of setting the JVM options to allow a bigger heap size, -Xmx4096m for instance.
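As an illustrative sketch only (the 6g values and the script name are placeholders, not recommendations for this specific file), the heap can also be raised through standard Spark config keys rather than raw JVM flags; on Databricks the same keys go in the cluster's Spark config:

```shell
# Placeholder spark-submit invocation with a larger driver/executor heap.
spark-submit \
  --conf spark.driver.memory=6g \
  --conf spark.executor.memory=6g \
  job.py
```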

@dazfuller
Contributor

Hi @soleyjh did this help out at all?

@josecsotomorales
Contributor

I also faced this issue. As a workaround, could we disable the formula evaluation as part of the configuration and just read static data from Excel?

@dazfuller
Contributor

It's definitely doable. I'll need to handle the cell evaluation slightly differently based on the config, but let me have a look and see what I can do with it.

@dazfuller
Contributor

Okay, the code is in there to handle this now in the 0.1.13 branch, so you can pass in an option like this

.option("evaluateFormulae", "false")

What this will do instead is extract the formula itself (e.g. "A7*2") without attempting to evaluate it. If you let me know which Spark version you're targeting, I can create a jar for you from that branch to test with. Or check out the branch and build it yourself if you fancy :)
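A minimal PySpark sketch of how the new option could be combined with the options from the original report (this assumes a build from the 0.1.13 branch is attached to the cluster, and that `file_location` is defined as in the first comment):

```python
# Reader options: "cellAddress" is from the original report,
# "evaluateFormulae" is the new option added in the 0.1.13 branch.
excel_options = {
    "cellAddress": "A1",
    "evaluateFormulae": "false",  # keep formula text (e.g. "A7*2") unevaluated
}

# With a SparkSession `spark` available:
# df = (spark.read.format("com.elastacloud.spark.excel")
#           .options(**excel_options)
#           .load(file_location))
```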

@josecsotomorales
Contributor

I'm going to test it locally by checking out the branch.

@josecsotomorales
Contributor

Alright, I have some insights: it helped a LOT, but I still hit a Java heap OOM. The file is an .xlsx of approximately 300MB. I processed it with https://github.com/crealytics/spark-excel using the streaming reader config and it works:

.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files (will fail if used with xls format files)

I don't know how we could add this streaming mode as an option to this project.
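For comparison, a sketch of the crealytics streaming read described above (`header` and `maxRowsInMemory` are documented crealytics options; the chained call is standard PySpark, and `file_location` is assumed as before):

```python
# crealytics/spark-excel options: setting maxRowsInMemory switches the
# library to its streaming reader (xlsx only; it fails on xls files).
crealytics_options = {
    "header": "true",
    "maxRowsInMemory": 20,  # rows held in memory at a time
}

# With a SparkSession `spark` available:
# df = (spark.read.format("com.crealytics.spark.excel")
#           .options(**crealytics_options)
#           .load(file_location))
```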

@dazfuller
Contributor

It is something I've looked at before, but you lose a lot of the features when using the streaming reader. My current thought is to have a separate parser which uses the streaming reader, with strict checks on the options provided.

Is local processing your usual environment?

@josecsotomorales
Contributor

That makes sense. I'm currently testing by running Spark on K8s with an 8-CPU/64GB RAM driver and executors of the same size.

@dazfuller
Contributor

What's the JVM heap size set to?

@josecsotomorales
Contributor

It's set to 4GB.

@dazfuller
Contributor

Any success if you increase that size?

@dazfuller dazfuller reopened this Jun 28, 2024
@dazfuller dazfuller added enhancement New feature or request and removed question Further information is requested labels Jun 28, 2024
@dazfuller dazfuller added this to the 0.1.14 Release milestone Jun 29, 2024
@dazfuller
Contributor

I'm going to get the 0.1.13 release out and then start looking at 0.1.14 with a view to creating a companion parser which uses the SXSSF (streaming) reader. It means it won't support formula evaluation, or merged cells over a certain size, but it should reduce the memory required by the POI library.

@dazfuller
Contributor

For info, there's a branch called dazfuller/streaming-reading which has a working version of the streaming reader. It needs quite a lot more testing yet, but the core of it is in place.

@josecsotomorales
Contributor

Excellent work @dazfuller!! I'll review and do some testing on that branch.
