
Optimization: Spark and Azure MySQL


Azure MySQL was not always the first choice. MySQL 8.2 was initially running in a local Docker container, and the star schema made it possible to add data sources incrementally. Adding the housing data source over time made the system very slow: running the ETL pipeline across the various housing-related datasets took forever, and now that all the data is loaded, the housing tables account for about 40% of the entire data size. With its easy setup, operation, and scalability, Azure Database for MySQL was a good choice. Below are the changes made to get better performance during inserts and selects.

  1. Increase the compute size and storage.
  2. Enable slow_query_log, lower long_query_time from 10 seconds to 2, and redirect the logs to blob storage to understand which queries are causing performance issues (see the verification sketch after this list).
  3. Change server parameters as mentioned in this post.
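
A minimal sketch of how the slow-query settings can be verified from Python once they are applied through the Azure portal or CLI; the host, user, password, and database below are placeholders, and `mysql-connector-python` is assumed to be installed:

```python
# Sketch: verify slow-query settings on Azure Database for MySQL.
# Connection details are placeholders, not the project's real values.
import mysql.connector

conn = mysql.connector.connect(
    host="<server-name>.mysql.database.azure.com",
    user="<admin-user>@<server-name>",   # single-server user format; flexible server omits the suffix
    password="<password>",
    database="<database>",
    ssl_disabled=False,                  # Azure MySQL requires SSL by default
)

cur = conn.cursor()
# Confirm the slow query log is enabled and the threshold is 2 seconds.
cur.execute(
    "SHOW VARIABLES WHERE Variable_name IN ('slow_query_log', 'long_query_time')"
)
for name, value in cur.fetchall():
    print(f"{name} = {value}")

cur.close()
conn.close()
```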
  • After deploying the code to Azure Container Instances (ACI) for Docker, the S&P 500 (around 500 symbols) stocks ETL was very slow and eventually crashed a 4 vCPU / 8 GiB ACI instance.

Moved to an 8 vCPU / 12 GiB Azure VM.

  • Azure VM metrics showed memory usage maxed out while CPU usage stayed at only 30 to 40%, and the Azure MySQL slow logs contained no stocks-related queries. This pointed to Spark and application code as the bottleneck rather than the database. Further profiling revealed that loading the data for each stock symbol took about 30 seconds, which was a huge bottleneck. Doing all of the below reduced the time from 30+ seconds per stock to 12 seconds for the entire dataset of about 60,000 rows.
  1. Fetch all stock data by partitions into a single DataFrame instead of one symbol at a time (a Spark DataFrame can handle billions of records); see the read sketch after this list.
  2. Use batch inserts.
  3. Write through the Spark JDBC data source with an explicit batchsize:

     ```python
     # Batched JDBC write into Azure MySQL; batch_size controls rows per round trip.
     spark_df.write.format('jdbc').options(
         url=self.config.connection_str_jdbc,
         driver=self.config.driver_name,
         dbtable=table_name,
         user=self.config.mysql_user,
         password=self.config.mysql_password,
         batchsize=batch_size,
         isolationLevel='NONE'  # no transaction isolation, faster bulk loads
     ).mode(mode).save()
     ```

  4. Add rewriteBatchedStatements=true to the MySQL JDBC connection string so the driver rewrites batched inserts into multi-row INSERT statements.
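
A minimal sketch of the partitioned read from item 1 together with the connection-string flag from item 4, assuming a numeric id column to partition on; the table name, column, bounds, and URL are illustrative placeholders rather than the project's actual values:

```python
# Sketch: read all stock rows in parallel JDBC partitions instead of one symbol at a time.
# Table/column names, bounds, and the URL are hypothetical placeholders.
jdbc_url = (
    "jdbc:mysql://<server-name>.mysql.database.azure.com:3306/<database>"
    "?useSSL=true&rewriteBatchedStatements=true"  # rewrite batched inserts as multi-row INSERTs
)

stocks_df = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("dbtable", "stock_prices")
    .option("user", "<user>")
    .option("password", "<password>")
    # Split the read into 8 parallel partitions over a numeric column.
    .option("partitionColumn", "stock_id")
    .option("lowerBound", "1")
    .option("upperBound", "500")
    .option("numPartitions", "8")
    .load()
)
```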
  • Since the DAGs run daily/monthly, the per-run data volume is not large, so the default value of 200 for spark.sql.shuffle.partitions was also adding to the ETL delay.
  1. Changing the value to 20 with self.spark.conf.set("spark.sql.shuffle.partitions", 20) gave roughly a 20% gain in performance.
  2. This value needs to be higher for the housing price data (see the sketch below).
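
A minimal sketch of setting the shuffle partition count per pipeline; the `is_housing` flag, the helper name, and the housing value of 200 are illustrative assumptions, not the project's actual code:

```python
# Sketch: pick shuffle partitions per dataset size (values are illustrative).
def configure_shuffle(spark, is_housing: bool) -> None:
    # Small daily/monthly stock loads: fewer partitions avoid tiny-task overhead.
    # Larger housing loads: more partitions keep each shuffle task a manageable size.
    partitions = 200 if is_housing else 20
    spark.conf.set("spark.sql.shuffle.partitions", partitions)
```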
  • Tuning repartition() and coalesce() (see the sketch at the end of this page).
  • Changing the column order of the composite keys on the fact tables (MySQL composite indexes are used left to right, so the leading column matters most).
  • Moving from Spark 2.4 to 3.0.1.
  1. Spark 3.0.1 takes care of many configurations under the hood and has improved garbage collection. More from here.
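
A minimal sketch of the repartition/coalesce tuning idea; the DataFrame names, the zip_code column, and the partition counts are illustrative placeholders:

```python
# Sketch: control partition counts around expensive stages (numbers are illustrative).

# repartition() does a full shuffle; useful to spread a skewed or under-partitioned
# DataFrame evenly before a wide transformation such as a join or aggregation.
housing_df = housing_df.repartition(40, "zip_code")

# coalesce() only merges existing partitions (no shuffle); useful to shrink the
# partition count right before a JDBC write so MySQL sees fewer, larger batches.
stocks_df = stocks_df.coalesce(4)
```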