Spark Performance Optimization
At the Spark+AI conference this year, Daniel Tomes from Databricks gave a deep-dive talk on Spark performance optimizations. After watching it, I found it super useful, so I decided to write down some important notes from his talk that address the most common performance issues. Here is the YouTube video in case you are interested.

Partitions

We often run into situations where the partitioning is not optimal at different stages of our workflow, which slows down the entire job significantly. For example, six months ago I tried to analyze some telemetry data exported from Application Insights, but there were way too many JSON files (> 100,000 files) and each file was small (< 1 MB each). This made a groupBy stage take an hour to finish on 8 machines. If this were a one-time workflow, I would be okay with not optimizing it. But it's not. ...
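To get a feel for how far off the partitioning is in a case like this, here is a minimal sketch in plain Python. It is my own back-of-the-envelope arithmetic, not something taken from the talk. The helper name `target_partition_count` and the 128 MB target partition size are assumptions chosen for illustration, and the input uses the round figures from the example above (100,000 files at about 1 MB each) as a rough estimate.

```python
import math

# Assumed target size per partition (hypothetical choice for illustration).
DEFAULT_TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def target_partition_count(
    total_bytes: int,
    target_partition_bytes: int = DEFAULT_TARGET_PARTITION_BYTES,
) -> int:
    """Return how many partitions hold total_bytes at roughly
    target_partition_bytes each (always at least 1)."""
    if total_bytes <= 0:
        return 1
    return max(1, math.ceil(total_bytes / target_partition_bytes))


# Round figures from the example: 100,000 files at about 1 MB each.
file_count = 100_000
approx_file_bytes = 1 * 1024 * 1024
total_bytes = file_count * approx_file_bytes

partitions = target_partition_count(total_bytes)
print(partitions)  # 782
```

Under these assumptions, the whole dataset fits in a few hundred partitions, far fewer than the number of input files. In Spark, a number like this could then be passed to `df.repartition(n)` after reading the files, so that later stages such as the groupBy work on reasonably sized partitions.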