The adoption of big data analytics in the cloud is well underway. Five to 10 years ago, it was hard to imagine housing and managing large volumes of data in a public cloud. Times have changed, though, and the cloud options have evolved to meet enterprise needs, particularly around performance, security, and reliability.

IDC’s research shows the public cloud share of big data and analytics is already 25 percent and will reach 50 percent by 2025. We are not just talking about early adopters anymore. Most large enterprises have at least some workloads running in the cloud—some in widespread production and others primarily for prototyping. 

A cloud approach for your big data environment provides advantages, but like any major technology shift, it also comes with challenges. Bigstream proudly provides Apache Spark™ acceleration with a variety of computing frameworks both on prem and in the cloud. In this blog, we look closer at five reasons customers are embracing Bigstream Hyperacceleration in the cloud.

1. Performance and SLAs

The first reason customers adopt Bigstream on cloud is, not surprisingly, to improve their Spark performance. Performance and acceleration are in Bigstream’s DNA.

Spark and other parallel computing platforms have always promised the ease and simplicity of scaling. However, most Spark users have learned that simply throwing more nodes (scaling out) or bigger nodes (scaling up) at their clusters yields diminishing returns (see Figure 1). 

Figure 1: Speedup vs. vCPU chart
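To make the diminishing-returns pattern concrete, here is a minimal sketch that uses Amdahl's law as a stand-in model. This is an assumption for illustration only; it is not necessarily the model behind Figure 1, and the 90 percent parallel fraction is a hypothetical value, not a measurement of any Spark job.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Ideal speedup under Amdahl's law for a job in which
    `parallel_fraction` of the work scales across `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical job: 90% of the runtime parallelizes, 10% does not.
    for vcpus in (1, 2, 4, 8, 16, 32, 64):
        print(f"{vcpus:>3} vCPUs -> {amdahl_speedup(0.9, vcpus):.2f}x speedup")
```

Under this model, each doubling of vCPUs buys less than the one before it, and the speedup can never exceed 1 / (serial fraction), which is 10x in this hypothetical case, no matter how many nodes are added.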

Yes, the wide range of available instance types has made it faster to grow your cluster, but the cloud doesn’t change the fundamental limits that cause performance declines. Bigstream Hyperacceleration lets Spark make better use of the available computing infrastructure, whether using CPU-based instances or advanced hardware like Amazon Web Services (AWS) F1 FPGA instances. Customers who struggle to meet aggressive deadlines and service-level agreements (SLAs) can breathe easier when Bigstream cuts runtime by half or more.

2. AWS Cost Management and TCO Simplicity

Managing costs is a key challenge for customers as they run more of their analytics in the cloud. Many come to the cloud motivated by reducing their upfront capital expenditures (CapEx) as well as the associated burdens of managing a data center. The shift can be economical. Cloud elasticity has real benefits for companies with fluctuating demand for compute resources. The “only pay for what you use” cloud model is great compared with underutilized on-premises data center investments. But the shift toward higher operating expenses (OpEx) can mean ongoing costs get out of control. 

Few organizations will jump to the cloud without a thorough comparison of the total cost of ownership (TCO) of their analytics environment. Analyzing on-premises TCO quickly becomes complex once you try to quantify a data engineer’s time, system power, cooling, the value of faster results, and so on.

The TCO analysis comparing approaches on the cloud, though, is far simpler since time is literally money. A solution that cuts a five-hour job to two hours will bring faster business results and reduce data scientist wait time, whether it’s on the cloud or on prem. But in the cloud, that three-hour reduction in runtime translates into a quantifiable reduction in cost. Many customers invest in performance engineering teams and cost optimization initiatives to keep their cloud costs in check. Along with efforts like leveraging spot instances and managing instance sizes, adding Bigstream directly reduces AWS costs because it shortens runtimes by half or more.
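The arithmetic behind the five-hour versus two-hour example can be sketched as follows. The cluster size and hourly rate are hypothetical placeholders, not actual AWS prices, and the sketch assumes simple per-instance-hour billing with no software subscription fees, storage, or data transfer charges.

```python
def cluster_cost(runtime_hours: float, node_count: int, hourly_rate_usd: float) -> float:
    """Compute cost of one job run, assuming every node is billed
    at the same hourly rate for the full runtime."""
    return runtime_hours * node_count * hourly_rate_usd


# Hypothetical inputs: substitute your own cluster size and instance pricing.
NODE_COUNT = 10
HOURLY_RATE_USD = 1.00

baseline_cost = cluster_cost(5, NODE_COUNT, HOURLY_RATE_USD)     # five-hour job
accelerated_cost = cluster_cost(2, NODE_COUNT, HOURLY_RATE_USD)  # same job in two hours
savings = baseline_cost - accelerated_cost
savings_pct = 100 * savings / baseline_cost

print(f"Baseline:    ${baseline_cost:.2f} per run")
print(f"Accelerated: ${accelerated_cost:.2f} per run")
print(f"Savings:     ${savings:.2f} per run ({savings_pct:.0f}%)")
```

Because the cost scales linearly with runtime under these assumptions, the percentage saved on compute equals the percentage of runtime removed, whatever the cluster size or rate.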

Figure 2

3. Fast and Easy Setup and Proof of Concept 

It’s remarkably easy to get started with Bigstream in the cloud, and that helps customers quickly realize the value. Customers can add Bigstream to their existing Spark clusters in minutes and test the impact for themselves. 

For any new technology, customers evaluate the level of effort and risk of adding software or hardware to their environment. What we’ve found with cloud customers, though, is they are ready and able to say “prove it” with actual workloads. By adding Bigstream software acceleration to an existing AWS Spark job, customers can quickly determine the bottom line of how much faster their jobs can complete. 

This is particularly true for accelerating AWS EMR Spark, which is now available from Bigstream on Marketplace. This is not the type of proof of concept (POC) that requires significant human and technical resources and training. Bigstream requires zero code changes for Spark in the cloud or in any environment.

4. Workload Optimization

As they say, when you have a hammer, every problem looks like a nail. With the growing toolkit available in the cloud, customers can create the right analytics stack for different workloads. They can choose the software and hardware instances for machine learning workloads separately from those for ETL (extract, transform, load), for instance.

With F1 instances, Bigstream provides, on average, 5x acceleration across all 100+ TPC-DS queries with row-based data. Adopting Bigstream does not require a customer to use Hyperacceleration on every workload. Customers can initially select Bigstream for the workloads that are the most time sensitive, where Bigstream can deliver the most savings and acceleration first. Customers can then expand over time. 

5. No Risk, No Commitment

The fifth reason customers adopt Bigstream on the cloud is that all of the advantages discussed above are available with no risk or cost for 30 days. This is not a complex POC followed by a migration to production. The fast setup means you are up and running in minutes, and you can enable Bigstream with a single instruction in your Spark settings going forward. 

With the free trial, there is no risk in trying Bigstream on all your workloads. This will give the best visibility into the performance gains and relative AWS cost reduction. If you discover some workloads don’t justify adding Bigstream, you can limit Bigstream to the applications that do. The trial is unlimited and unthrottled. In the first 30 days, you can enjoy full Hyperacceleration, up to 10x for some Spark applications, and thereafter you only pay for workloads that you are accelerating.

We invite you to join the customers who are adding Bigstream to their cloud analytics environments. For your AWS EMR Spark jobs, simply subscribe to Bigstream on Marketplace. Click here for details. For your Spark on EC2 deployments, contact us to get started or click here for details.

