Spark Profiling with Bigstream xRay
Application Performance Management (APM) has become a key aspect of application development in the big data and analytics realm. APM tools provide detailed analysis of big data applications and their performance, with the goal of finding improvements. Data scientists, developers and operations professionals use APM to discover how to improve platform configuration, application structure and application infrastructure.
This has become more crucial as analytics and models leverage larger, more unstructured, and formerly dark data sources. It is particularly relevant for performance engineering teams focused on acceleration technologies and other means of platform optimization.
Bigstream developed the xRay APM tool, which profiles big data applications, specifically Apache Spark™ applications. It works seamlessly with Bigstream Hyperacceleration, analyzing users’ production application performance as well as the opportunities for acceleration technology. xRay is easy to install, provides fully web-based analysis and runs both in the cloud and on-premises. xRay can also be used to assess workload performance, independent of acceleration.
In this paper, we’ll cover:
- What xRay is and what answers it provides
- How Bigstream Hyperacceleration works in tandem with xRay
- How to get started with xRay
- xRay future plans
The overall goal of the tool is to give the user useful, in-depth insight into the performance of their Apache Spark applications. xRay runs as a listener class on the user’s production application in their own environment, that is, not on a benchmark or surrogate. It reports basic application run information such as configurations and the runtimes of the application, its stages and its tasks. It also performs signature analysis on the log data to recommend ways to improve performance. Finally, it assesses the potential of applying Bigstream Hyperacceleration technology to the application. Information about operators, UDFs, stage structure, data formats and more informs this acceleration assessment.
Given the goal of performance, xRay automatically provides answers to the following questions, among others:
- What configurations (e.g. Spark, YARN) am I using? Can they be changed to improve performance?
- Where is most of the time being spent in my application? Compute, I/O, idling etc.?
- What are the bottleneck stages of my application? What operators/data are they using?
- Does my data have skew?
- Do certain stages have straggler tasks?
- Which stages can be accelerated either via software or hardware acceleration? Are these the key stages in my application?
- How much acceleration can I expect?
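As an illustration of the data skew question above, the following is a minimal sketch of one common skew measure: the largest partition size divided by the median partition size. This is not xRay's documented method; the function name `skew_ratio` and the sample record counts are assumptions made for this example.

```python
from statistics import median


def skew_ratio(partition_sizes):
    """Return max partition size / median partition size (1.0 means perfectly even)."""
    if not partition_sizes:
        raise ValueError("need at least one partition size")
    mid = median(partition_sizes)
    largest = max(partition_sizes)
    if mid == 0:
        # Avoid dividing by zero when at least half of the partitions are empty.
        return float("inf") if largest > 0 else 1.0
    return largest / mid


# Hypothetical record counts per partition for a single stage.
even = [100, 102, 98, 101]
skewed = [100, 102, 98, 900]

print(f"even partitions:   {skew_ratio(even):.2f}")
print(f"skewed partitions: {skew_ratio(skewed):.2f}")
```

A ratio close to 1 suggests evenly sized partitions, while a ratio of several times the median suggests that one task is doing far more work than its peers. The threshold at which skew becomes a problem depends on the workload.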
The majority of these questions have to do with basic application performance, and are not specific to acceleration. While there is no requirement for using acceleration to use xRay, the next section describes how xRay can be used in conjunction with Bigstream acceleration technology.
xRay and Bigstream Hyperacceleration
The xRay tool works in concert with Bigstream acceleration technology to provide high performance to the user. As stated earlier, there is no requirement for using acceleration to use xRay and vice versa. However, Bigstream Hyperacceleration and xRay are made to work synergistically to maximize performance for the user with zero code change.
Running and Analyzing with xRay
This section describes how to get started with xRay on the reader’s own Spark application, and gives an example of the analysis it provides. This example covers only a subset of the functionality; full documentation and functionality are available directly from the website, insight.bigstream.co.
Running an application with xRay
After signing up, instrumenting and running is a simple 3-step process, illustrated in Figure 2. This figure is taken directly from the website.
Referring to the list of links on the left side of Figure 2, the user can click on the “Download Package” link to select the xRay package that matches the operating system of the cluster. Currently, packages are available for four Linux variants: RedHat, Amazon EMR, Ubuntu and CentOS. Check the download page for the exact versions that match the reader’s infrastructure. These packages work equally well on cloud and on-premises infrastructure.
Once the appropriate package is unpacked on the master node of the cluster, the Spark application can be run with xRay by adding just two flags to the Spark run command on the command line. Details are in the “How to run Apache Spark” link shown on the left of Figure 2.
After running the Spark application, the last step is to upload the resulting logs to the xRay website via the “Upload New xRay” link. The package also provides an anonymization function, which is documented at the same link.
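The two xRay flags themselves are documented on the xRay site and are not reproduced here. As a generic illustration of how any listener class is usually attached to a Spark job, the sketch below assembles a `spark-submit` command using Spark's standard `--jars` option and `spark.extraListeners` configuration property. The class name, jar paths and the helper `build_submit_command` are placeholders assumed for this example, and the actual xRay flags may differ.

```python
import shlex


def build_submit_command(app_jar, listener_class, listener_jar, extra_args=()):
    """Assemble a spark-submit command that registers an extra Spark listener."""
    return [
        "spark-submit",
        "--jars", listener_jar,                              # ship the listener jar with the job
        "--conf", f"spark.extraListeners={listener_class}",  # register the listener class
        app_jar,
        *extra_args,
    ]


cmd = build_submit_command(
    app_jar="my-app.jar",                            # hypothetical application jar
    listener_class="com.example.ProfilingListener",  # placeholder, NOT the xRay class name
    listener_jar="/opt/profiler/listener.jar",       # placeholder path
)
print(shlex.join(cmd))
```

Because the listener is attached through configuration alone, the application code itself does not need to change.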
After this, in a few minutes, the user will find a web-based, clickable report in their dashboard on the xRay site, ready for analysis. We examine some aspects of this report next.
xRay Report Analysis Example
We present an example analysis that is possible with xRay on a Spark workload. This example is intended to be illustrative; it gives the look and feel of the xRay web interface for a report. There are many more capabilities that xRay has for APM, which we list after this example.
Figure 3 shows the top portion of the user dashboard, with several application log outputs uploaded. In the upper left of the dashboard, summary information is presented, including the aggregate runtime for all of this user’s applications. The list of applications is below this, organized into user-defined “App Groups”, which are a convenient way to organize information about related application runs. Clicking an application group drills down to the individual application run reports. In addition, each application group has colored icons indicating recommendations produced by xRay’s analysis of the application logs, where Blue = “Info”, Orange = “Warning” and Red = “Critical”.
Figure 4 shows the result of clicking on a particular application report, in this case named “tpcdssf1000-hdfs-TPCHSQ3”. This overview gives information about the application in aggregate, including run time, cluster time used, data format and timeline. Again the blue, orange, and red issue buttons are clickable. They give access to analysis performed via log signature detection, along with recommendations for this particular application.
Clicking on the green timeline bar in Figure 4 leads the user to a stage-by-stage timeline of execution shown in Figure 5. The stage runtimes are shown, in order of execution. The dark parts of the individual stage timelines represent periods of time when the stage was runnable, but waiting for other inputs. Green portions show actual execution. In this example, the user's eyes are drawn to stage 11, which appears to have the longest running time.
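The timeline view can be thought of as splitting each stage into waiting time and execution time. The following is a minimal sketch of that bookkeeping under a simplifying assumption: each stage waits once, between becoming runnable and starting to execute. The `StageSpan` type, its field names and the timings are hypothetical and are not taken from xRay's log format.

```python
from dataclasses import dataclass


@dataclass
class StageSpan:
    stage_id: int
    runnable_at: float  # seconds since app start when the stage became runnable
    started_at: float   # seconds since app start when execution began
    finished_at: float  # seconds since app start when the stage completed

    @property
    def waiting(self) -> float:
        """Time spent runnable but not executing (the dark part of the bar)."""
        return self.started_at - self.runnable_at

    @property
    def executing(self) -> float:
        """Time spent actually executing (the green part of the bar)."""
        return self.finished_at - self.started_at


def longest_running(stages):
    """Return the stage with the largest execution time."""
    return max(stages, key=lambda s: s.executing)


# Made-up timings for three stages.
stages = [
    StageSpan(9, 0.0, 0.0, 12.0),
    StageSpan(10, 5.0, 12.0, 20.0),
    StageSpan(11, 12.0, 20.0, 95.0),
]

for s in stages:
    print(f"stage {s.stage_id}: waiting {s.waiting:.1f}s, executing {s.executing:.1f}s")
print("longest-running stage:", longest_running(stages).stage_id)
```

Ranking stages by execution time rather than by total span keeps a stage that merely waited a long time from being mistaken for the bottleneck.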
By clicking on the stage 11 timeline in Figure 5, the user gains access to detailed information about that stage. The view is shown in Figures 6 and 7, which we have split up for clearer explanation. In this case, the user has identified stage 11 as the long-running stage and is interested in the distribution of task runtimes. On the right of Figure 6, we see a histogram of task runtimes for the stage, which reveals a group of outliers, or “straggler” tasks (the rightmost blue stack in the histogram).
Figure 7 is the other half of the stage analysis and provides information about the root cause of the outlier problem in stage 11. On the left is the Spark physical plan of stage 11 in isolation. This tells the user which operators are causing the tasks to slow down. On the right, we see the specific straggler tasks (in red), how many there are and how long they ran. At this point the user has enough information to take action. This could include examining the application keying structure, adjusting memory allocation, employing Bigstream acceleration to shorten task times, or any combination of these approaches.
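To make the notion of a straggler concrete, here is a minimal sketch that flags tasks whose runtime exceeds a multiple of the stage's median task runtime. The median-based rule, the default factor of 2.0 and the sample runtimes are assumptions chosen for illustration; the criterion xRay uses is not described here.

```python
from statistics import median


def find_stragglers(task_runtimes, factor=2.0):
    """Return (task_index, runtime) pairs for tasks slower than factor x the median."""
    if not task_runtimes:
        return []
    cutoff = factor * median(task_runtimes)
    return [(i, t) for i, t in enumerate(task_runtimes) if t > cutoff]


# Hypothetical task runtimes in seconds for one stage.
runtimes = [4.1, 3.9, 4.3, 4.0, 12.5, 4.2, 13.1, 3.8]

for index, seconds in find_stragglers(runtimes):
    print(f"task {index} ran for {seconds}s")
```

Using the median rather than the mean as the baseline keeps the cutoff from being pulled upward by the very outliers it is meant to detect.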
This example showed how, with a few clicks, xRay can identify performance issues and provide knowledge and suggestions aimed at improvement. It covered only a small subset of the analysis capability. The user is encouraged to visit the xRay site at insight.bigstream.co to explore further functionality, including:
- Issue-by-issue recommendations for performance slowdown mitigation.
- Side-by-side, stage-by-stage comparison of multiple runs, useful for comparing like application runs.
- Acceleration potential estimates, based on operator usage and historical analysis.
We have introduced the xRay tool and shown examples of the analyses it provides. The tool can be used in three easy steps and runs in the cloud or on the customer’s premises, even in a production environment.
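The side-by-side comparison of runs listed above can be pictured as a per-stage table of runtimes and speedups. The sketch below is a minimal version of that idea, assuming each run is available as a mapping from stage ID to runtime in seconds. The function `compare_runs` and the sample numbers are hypothetical and do not reflect xRay's data model.

```python
def compare_runs(baseline, candidate):
    """Return (stage_id, baseline_s, candidate_s, speedup) for stages present in both runs."""
    rows = []
    for stage_id in sorted(baseline.keys() & candidate.keys()):
        b, c = baseline[stage_id], candidate[stage_id]
        speedup = b / c if c > 0 else float("inf")
        rows.append((stage_id, b, c, speedup))
    return rows


# Made-up stage runtimes (seconds) for two runs of the same application.
run_a = {9: 12.0, 10: 8.0, 11: 75.0}
run_b = {9: 12.0, 10: 8.0, 11: 25.0}

for stage_id, b, c, speedup in compare_runs(run_a, run_b):
    print(f"stage {stage_id}: {b:.1f}s -> {c:.1f}s ({speedup:.2f}x)")
```

Restricting the comparison to stages present in both runs avoids misleading rows when the two runs produced different stage structures.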
In the future, we intend to expand the tool by adding functionality such as:
- Online analysis while the application is running.
- Additional optimization recommendations.
- Streaming application statistical analytics.
- Automatic “upload-less” report creation directly from the customer premises.
- Operating system level analytics.
In summary, xRay is an easy-to-use tool to analyze the performance of a huge spectrum of Spark applications. It can also point the way to utilizing seamless Bigstream Hyperacceleration to address any user performance needs. We encourage the reader to get started with xRay today by signing up in minutes at insight.bigstream.co.