Although extract, load, and transform (ELT) workloads have been around for ages, they still pose some of the biggest challenges that Apache Spark™ users bring to us. As with all applications that process large data sets, ELT performance problems lead to delayed results, missed deadlines, and even low morale.
Even when customers can clearly identify poor performance in their Spark data pipeline, pinpointing the problem and mitigating it can be difficult. There are many potential root causes. Some common issues we’ve seen recently include:
- Data skew: Some data sets possess an inherent bias toward one or a few unusually large subsets. This can hamper load balancing of parallel execution.
- Incorrect memory configuration: Changing worker thread memory allocation by just a few percentage points can make a huge difference in performance.
- Incorrect thread allocation: Allocating too many or too few workers to a job can cause severe CPU overloading or underutilization, respectively.
- Bottleneck stages: A single Spark stage can cause poor performance for the entire application.
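To make the data-skew item concrete, here is a minimal sketch in plain Python that measures how unevenly records are spread across partitions. The function name `skew_ratio` and the sample counts are illustrative assumptions; they are not part of xRay or Spark. In a real job, the per-partition counts would have to be collected separately, for example from per-task metrics.

```python
from statistics import median


def skew_ratio(partition_sizes):
    """Return the largest partition size divided by the median size.

    A value near 1.0 suggests balanced partitions; a large value means
    at least one partition holds far more data than a typical one.
    """
    if not partition_sizes:
        raise ValueError("partition_sizes must not be empty")
    typical = median(partition_sizes)
    if typical == 0:
        # More than half the partitions are empty; any data at all is skewed.
        return float("inf") if max(partition_sizes) > 0 else 1.0
    return max(partition_sizes) / typical


# Hypothetical record counts per partition (illustrative only).
balanced = [100, 98, 103, 99]
skewed = [100, 98, 103, 900]

print(f"balanced: {skew_ratio(balanced):.2f}")
print(f"skewed:   {skew_ratio(skewed):.2f}")
```

The median is used instead of the mean so that one oversized partition does not inflate the baseline it is compared against.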
Bigstream Hyperacceleration improves performance across many stages in a Spark data pipeline. In developing Hyperacceleration, Bigstream also created the xRay profiling tool to quickly break out the key stages of an application and identify the root issues.
To illustrate how xRay addresses the challenges that Spark customers bring us, we ran test cases that pass benchmark data (TPC-DS) through a typical ELT workflow.
Before we dive into the example, here is a quick summary of ELT (an acronym that evolved from ETL, coined when the transform step had to precede the load). ELT workloads transform large data sets from their original form to a format ready for analysis. They are typically developed by data engineers, and they’re frequently built to populate data warehouses and data lakes. Analysts and data scientists use this processed data, though the ELT process itself can include some analytics steps as well.
Figure 1: Common ELT Application
The ELT Test Case
A common ELT workflow reads text files, such as comma-separated values (CSV) files; transforms them; and then writes them out in Parquet, a format optimized for downstream analysis. Our test case uses data from TPC-DS at scale factor 30 (30sf), which amounts to 30 GB of CSV data.
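The workflow described above can be sketched in PySpark roughly as follows. This is not the code used in the test case, which the article does not show. The function name `csv_to_parquet`, the `header` and `inferSchema` options, and the placeholder transform are all assumptions for illustration.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:  # imported for type hints only; PySpark is needed at run time
    from pyspark.sql import SparkSession


def csv_to_parquet(spark: "SparkSession", src: str, dst: str) -> None:
    """Read CSV files, apply a simple transform, and write Parquet."""
    # Read the raw text files. header and inferSchema are assumptions
    # about the input; adjust them to match the real files.
    df = spark.read.csv(src, header=True, inferSchema=True)

    # Placeholder transform: drop rows in which every column is null.
    cleaned = df.dropna(how="all")

    # Write columnar Parquet output for downstream analysis.
    cleaned.write.mode("overwrite").parquet(dst)
```

With PySpark installed, this could be called with a session from `SparkSession.builder.getOrCreate()` and a pair of input and output paths of your choosing.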
Step 1: Analyze the original Spark ELT job
As with a typical customer engagement, step 1 is to use xRay to examine the existing, unaccelerated Spark ELT job—in this case, on our sample data set. xRay includes a “listener” that captures the granular stages of the job and produces visual reports within seconds from Spark output logs.
Figure 2 shows the web-based interface, summarizing xRay analyses, recommendations, and estimated acceleration opportunities.
Figure 2: xRay Main Dashboard
By clicking on a Spark application log from the list, you can drill into the execution timeline view. Figure 3 shows the stage-by-stage execution times for our sample data. Stages 2 and 12 clearly have the longest execution times.
Figure 3: Unaccelerated ELT Timeline
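For readers without xRay, a rough version of a stage-by-stage timing view can be derived from Spark's own event log, which Spark writes as one JSON object per line when `spark.eventLog.enabled` is set. The sketch below is not how xRay works internally; it only illustrates the idea. The helper names `stage_durations` and `slowest_stages` are hypothetical.

```python
import json


def stage_durations(event_log_lines):
    """Map stage ID -> wall-clock seconds, from Spark event-log lines."""
    durations = {}
    for line in event_log_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Event") != "SparkListenerStageCompleted":
            continue
        info = event["Stage Info"]
        start = info.get("Submission Time")
        end = info.get("Completion Time")
        if start is None or end is None:
            continue  # stages without both timestamps are ignored
        # Timestamps are in milliseconds. A retried stage overwrites
        # the earlier attempt, so only the last attempt is kept.
        durations[info["Stage ID"]] = (end - start) / 1000.0
    return durations


def slowest_stages(durations, n=2):
    """Return the n longest-running stages as (stage_id, seconds) pairs."""
    return sorted(durations.items(), key=lambda kv: kv[1], reverse=True)[:n]
```

Passing an open event-log file to `stage_durations` and the result to `slowest_stages` would surface the kind of outliers that the timeline makes visible at a glance.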
Clicking on an individual stage provides detailed information such as task execution times, operators used in the stage, I/O statistics, and more. In Figure 4, we use some of these details to determine that stage 2 is a CSV scan stage (as is stage 12, not shown).
Step 2: Run the Spark ELT job with Bigstream
The xRay results showed there were significant opportunities for performance acceleration. So for step 2, we added Bigstream software-only acceleration and ran the identical Spark code.
With the new output logs, xRay generated Figure 5, which shows the Bigstream-accelerated run in purple alongside the unaccelerated run from step 1 in green. You can see that the longest-running stages have been accelerated by more than 2.5x. The scan stages identified in step 1 have drastically improved, but so have other stages with functions like Parquet writes. The end-to-end speedup for the application is approximately 2x.
Figure 5: xRay Analysis of Acceleration
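The gap between a 2.5x per-stage speedup and a roughly 2x end-to-end speedup follows from simple arithmetic: stages that speed up less pull the overall figure down. The sketch below compares two runs stage by stage. The function names and the sample durations are hypothetical placeholders, not measurements from this test, and summing stage times only approximates end-to-end time when stages run one after another.

```python
def speedup(baseline_s, accelerated_s):
    """Return how many times faster the accelerated run is."""
    if accelerated_s <= 0:
        raise ValueError("accelerated time must be positive")
    return baseline_s / accelerated_s


def compare_runs(baseline, accelerated):
    """Compare two runs given as dicts of stage ID -> seconds.

    Returns (per-stage speedups, overall speedup) over the stages that
    appear in both runs.
    """
    common = sorted(baseline.keys() & accelerated.keys())
    per_stage = {sid: speedup(baseline[sid], accelerated[sid]) for sid in common}
    overall = speedup(
        sum(baseline[sid] for sid in common),
        sum(accelerated[sid] for sid in common),
    )
    return per_stage, overall


# Made-up durations in seconds (placeholders, not data from the article).
baseline_run = {2: 500.0, 12: 400.0, 20: 100.0}
accelerated_run = {2: 200.0, 12: 160.0, 20: 100.0}

per_stage, overall = compare_runs(baseline_run, accelerated_run)
for stage_id, factor in per_stage.items():
    print(f"stage {stage_id}: {factor:.2f}x")
print(f"overall: {overall:.2f}x")
```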
Easy Setup and Ongoing Savings
We’ve shown how Bigstream’s software-only solution cut a 27-minute ELT job down to 13 minutes. Keep in mind that Bigstream Hyperacceleration also powers hardware acceleration, like FPGA- and SmartSSD-based solutions that provide even greater performance improvements.
Configuring xRay, installing Bigstream, and adding and comparing the results in xRay took only 10 minutes. Because ELT jobs typically run repeatedly, at least daily, the power of hyperacceleration is that after the initial analysis and setup, you reap the performance gains on every subsequent run.
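As a back-of-the-envelope check on the figures above, the following sketch works out the time saved per run and per year. The run times and setup time come from this article; the schedule of exactly one run per day is an assumption, since the text only says "at least daily".

```python
import math

BASELINE_MIN = 27     # unaccelerated run time, from the article
ACCELERATED_MIN = 13  # Bigstream-accelerated run time, from the article
SETUP_MIN = 10        # one-time setup and analysis time, from the article
RUNS_PER_DAY = 1      # assumption: the article says "at least daily"

speedup = BASELINE_MIN / ACCELERATED_MIN
saved_per_run = BASELINE_MIN - ACCELERATED_MIN
runs_to_break_even = math.ceil(SETUP_MIN / saved_per_run)
yearly_saved_hours = saved_per_run * RUNS_PER_DAY * 365 / 60

print(f"Speedup: {speedup:.2f}x")                             # Speedup: 2.08x
print(f"Saved per run: {saved_per_run} min")                  # Saved per run: 14 min
print(f"Runs to recover setup time: {runs_to_break_even}")    # Runs to recover setup time: 1
print(f"Saved per year: {yearly_saved_hours:.1f} h")          # Saved per year: 85.2 h
```

Under these assumptions, the setup time is recovered on the first accelerated run, and more frequent schedules scale the yearly savings proportionally.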