Spark Performance Optimization

At the Spark+AI conference this year, Daniel Tomes from Databricks gave a deep-dive talk on Spark performance optimization. After watching it, I felt it was super useful, so I decided to write down some important notes that address the most common performance issues from his talk. Here is the YouTube video in case you are interested.

Partitions

We often run into situations where partitioning is not optimal at different stages of our workflow, and it slows down the entire job significantly. For example, six months ago I tried to analyze some telemetry data exported from Application Insights, but there were way too many JSON files (> 100,000 files) and each file was small (< 1MB each). This made a groupBy stage take an hour to finish on 8 machines. If this were a one-time workflow, I would be okay not optimizing it. But it's not. ...
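The small-file scenario above is usually tackled by consolidating the data into fewer, right-sized partitions before the expensive shuffle. Here is a minimal sketch of a partition-sizing heuristic in the spirit of the talk; the 200MB target size, the core-multiple rounding, and the function name are my assumptions for illustration, not values from the post:

```python
import math

def num_shuffle_partitions(total_size_mb, target_partition_mb=200, cores=64):
    """Pick a shuffle partition count for a dataset of total_size_mb.

    Heuristic (assumed, not from the post): aim for roughly
    target_partition_mb per partition, then round up to a multiple of
    the available cores so no core sits idle in the last wave.
    """
    raw = math.ceil(total_size_mb / target_partition_mb)
    return max(cores, math.ceil(raw / cores) * cores)

# ~100GB of input on a 64-core cluster
print(num_shuffle_partitions(100_000))  # → 512
```

In PySpark, a count like this would typically feed `df.repartition(n)` (or the `spark.sql.shuffle.partitions` setting) so that 100,000 tiny JSON files collapse into a few hundred well-sized partitions before the groupBy.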

July 15, 2019 · 5 min · 884 words · Jilong Liao

Broadcast Variable in Spark SQL

Over the holiday I spent some time making progress on moving one of my machine learning projects to Spark. An important piece of the project is a data transformation library with pre-defined functions. The original implementation uses a pandas dataframe and runs on a single machine. Since our data has gotten much bigger, sometimes we have to use a giant 512GB-memory Azure VM, and either the entire transformation takes a long time to run or I have to chunk the data and transform it in batches (which is not a good idea for column-based transformations such as feature normalization). Another blocking issue is that the intermediate memory consumption can be really high – 10x the original data size. ...
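For context on the pattern a broadcast variable serves: ship one read-only copy of a lookup table (e.g. per-column normalization factors) to every worker, instead of serializing it into each task's closure. The sketch below only simulates this with a plain dict so it runs without Spark; the `lookup` data and `normalize` function are hypothetical. In PySpark the equivalent would be `bc = sc.broadcast(lookup)` and reading `bc.value` inside the function:

```python
# Hypothetical per-column scale factors; in Spark this dict would be
# wrapped in a broadcast variable so every executor gets one copy.
lookup = {"a": 1.0, "b": 2.0}

def normalize(row):
    # Tasks read the shared lookup (broadcast value) rather than
    # carrying their own serialized copy; unknown columns pass through.
    return {k: v / lookup.get(k, 1.0) for k, v in row.items()}

print(normalize({"a": 2.0, "b": 4.0}))  # → {'a': 2.0, 'b': 2.0}
```

The design point is that the lookup is built once on the driver and treated as read-only on the workers, which is exactly what Spark's broadcast mechanism enforces.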

January 18, 2019 · 4 min · 665 words · Jilong Liao