Optimizing Spark and Azure MySQL
Azure MySQL was not always the first choice. MySQL 8.2 was initially running in a local Docker container. Using a star schema made it possible to add data sources incrementally, but adding the housing data source over time made the system very slow; the ETL pipeline on the various housing-related datasets took forever to run. Now that all the data is loaded, the housing tables account for about 40% of the entire data size. With easy setup, operation, and scalability, Azure Database for MySQL was a good choice. Below are the changes made to get better performance during insertions and selections.
- Increase the compute size and storage.
- Enable `slow_query_log`, lower `long_query_time` from 10 seconds to 2, and redirect the logs to blob storage to understand which queries are causing performance issues.
- Change server parameters as mentioned in this post.
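For reference, a minimal sketch (not from this project) of verifying those parameters from Python with `mysql-connector-python`; the host, user, and database values are placeholders, and on Azure Database for MySQL the parameters themselves are normally changed through the portal or CLI server parameters rather than `SET GLOBAL`:

```python
# Hypothetical check of the slow-query-log parameters; host, user, and database
# are placeholders. On Azure Database for MySQL the parameters are normally
# changed via the portal or CLI server parameters, not SET GLOBAL.
import mysql.connector

conn = mysql.connector.connect(
    host="<server-name>.mysql.database.azure.com",
    user="<admin-user>@<server-name>",
    password="<password>",
    database="<database>",
)
cur = conn.cursor()
cur.execute(
    "SHOW VARIABLES WHERE Variable_name IN ('slow_query_log', 'long_query_time')"
)
for name, value in cur.fetchall():
    print(f"{name} = {value}")
cur.close()
conn.close()
```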
- After deploying the Docker containers to Azure Container Instances (ACI), the S&P 500 stocks ETL (around 500 symbols) was very slow and eventually crashed a 4 vCPU / 8 GiB ACI. The workload was moved to an 8 vCPU / 12 GiB Azure VM.
- Azure VM metrics showed memory usage maxed out while CPU usage was only 30 to 40%, and the Azure MySQL slow query logs did not contain any stock-related queries. This pointed to Spark and code optimization rather than the database. Further profiling revealed that loading the data for each stock symbol took about 30 seconds, which was a huge bottleneck. The changes below reduced the time from 30+ seconds per stock to about 12 seconds for the entire dataset of roughly 60,000 rows.
- Fetch all stock data in partitions into a single Spark DataFrame instead of loading one symbol at a time (a Spark DataFrame can handle billions of records); see the sketch below.
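The project's fetch code is not shown on this page, so the following is only a minimal sketch of the idea; the `fetch_symbol_history` helper is a hypothetical placeholder standing in for the real data-source call:

```python
# Hypothetical sketch: distribute the ~500 symbols across partitions and build
# one DataFrame for the whole universe instead of one slow load per symbol.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stocks-etl").getOrCreate()

def fetch_symbol_history(symbol):
    # Placeholder for the real data-source call (API or file read);
    # returns (symbol, trade_date, open, close, volume) tuples.
    return [(symbol, "2021-05-07", 0.0, 0.0, 0)]

def fetch_partition(symbols):
    # Runs on the executors; each partition fetches its share of symbols.
    for symbol in symbols:
        for row in fetch_symbol_history(symbol):
            yield row

symbols = ["AAPL", "MSFT", "GOOG"]  # the full S&P 500 list in practice
rdd = spark.sparkContext.parallelize(symbols, numSlices=20)
stock_df = rdd.mapPartitions(fetch_partition).toDF(
    ["symbol", "trade_date", "open", "close", "volume"]
)
```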
- Use batch insertions:

```python
spark_df.write.format('jdbc').options(
    url=self.config.connection_str_jdbc,
    driver=self.config.driver_name,
    dbtable=table_name,
    user=self.config.mysql_user,
    password=self.config.mysql_password,
    batchsize=batch_size,
    isolationLevel='NONE'
).mode(mode).save()
```
- Add `rewriteBatchedStatements=true` to the MySQL connection string.
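For illustration, a hedged sketch of what such a JDBC URL could look like; the host and database names are placeholders, not the project's actual values:

```python
# Hypothetical JDBC URL; <server-name> and <database> are placeholders.
# rewriteBatchedStatements=true lets MySQL Connector/J rewrite batched
# single-row INSERTs into multi-row INSERTs, which pairs well with the
# batchsize option used in the write above.
connection_str_jdbc = (
    "jdbc:mysql://<server-name>.mysql.database.azure.com:3306/<database>"
    "?rewriteBatchedStatements=true&useSSL=true"
)
```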
- Since the DAGs run daily/monthly, the data per run is not large, and the default value of 200 for `spark.sql.shuffle.partitions` was also contributing to the ETL delay.
- Changing the value to 20 with `self.spark.conf.set("spark.sql.shuffle.partitions", 20)` gave about a 20% gain in performance. This value needed to be higher for the housing price data.
- Tuning `repartition()` and `coalesce()` to control the number of partitions (see the sketch below).
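A rough sketch of that tuning, assuming a DataFrame like the `spark_df` written above; the partition counts are illustrative, not the project's actual values:

```python
# Illustrative counts only; the right numbers depend on data volume and cluster size.
# repartition() performs a full shuffle and can increase or decrease partitions;
# coalesce() avoids a shuffle and can only reduce them.
balanced_df = spark_df.repartition(20, "symbol")  # spread work evenly before wide transforms
compact_df = spark_df.coalesce(4)                 # fewer, larger partitions before the JDBC write
```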
- Changing the column order of the composite keys on the fact tables.
- Moving from Spark 2.4 to 3.0.1.
- Spark 3.0.1 takes care of many configurations under the hood and has improved garbage collection. More from Here.
- Increasing the number of executors and tuning `SPARK_EXECUTOR_MEMORY` and `SPARK_WORKER_CORES` also had a positive effect; a session-level sketch follows below.
- Go Here for one of the Azure Monitoring dashboards used.
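For context, `SPARK_EXECUTOR_MEMORY` and `SPARK_WORKER_CORES` are environment variables read by Spark standalone workers (usually set in `spark-env.sh`). A rough session-level sketch with illustrative values, not the project's actual numbers:

```python
# Illustrative values only. spark.executor.memory mirrors SPARK_EXECUTOR_MEMORY,
# while spark.executor.cores / spark.executor.instances control how the cores a
# worker offers (SPARK_WORKER_CORES) are divided among executors.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("etl")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.executor.instances", "4")
    .getOrCreate()
)
```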