IoT pipelines are compute-intensive, and Apache Spark performance is often put to the test in these large-scale environments. Learn how a Bigstream IoT customer improved its Spark workflow sixfold.
The Internet of Things, or IoT, is one of several areas driving big data adoption. In the world of analytics, AI, and IoT, there are varying degrees of immediate applicability and future vision (sometimes hype). But many organizations have had IoT use cases in production for years, often as core parts of their business. Some examples include:
- Manufacturing: Production facilities can lose millions of dollars for every minute a line is down. Using data from a variety of sensors, companies can anticipate and reduce such downtime.
- Transportation: Autonomous vehicles capture our imagination, but airplanes, tractors, trucks, and cars have used data and have been connected for years. Engine temperature, tire pressure, and countless other telemetry data points help identify parts that need replacing and, ultimately, keep trucks on the road and planes in the air.
- Logistics and supply chain: Millions of parts, packages, and shipping containers traverse the globe each day. Location and movement data is obviously essential, but so is information on temperature and pressure, depending on the cargo.
- Wearables: Our Fitbits and other consumer devices summarize our data and tell us how fast and far we move. Industrial versions of these devices, in the form of helmets or glasses, have important safety applications as well.
A common thread across IoT use cases is the collection of vast amounts of data. Initially, much of the data was captured and analyzed only periodically: when the helicopter landed, when the tractor exited the mine, or when runners synced their watches at the end of the day. But more and more, IoT data is being transmitted and acted upon in real time.
A large ecosystem of hardware and software providers enables this marvel of technology. Data is captured by sensors, compressed, transmitted, processed, analyzed, and acted upon, and that is only a very high-level summary. Open-source data tools like Apache Kafka™ and Apache Spark™ are standard parts of this data pipeline. Kafka supports streaming, and Spark processes data for ingestion into a data environment and also supports machine learning and other analytics processing.
One of the overarching challenges is to minimize the volume of data that needs to be transmitted and, of course, encrypt it. Organizations use various forms of data compression and encryption to achieve this. Although this helps address the transmission bottleneck, it adds to the processing bottleneck on the other side, where data needs to be parsed, decompressed, decrypted, and converted to a usable analytics format.
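The trade-off described above can be illustrated with a minimal sketch. It uses Python's standard `zlib` module on made-up telemetry; the field names, device ID, and values are hypothetical and are not taken from any customer's data, and real pipelines may use different codecs and add encryption on top.

```python
import json
import zlib


def make_readings(count):
    """Build hypothetical telemetry records (illustrative values only)."""
    return [
        {
            "device_id": "dev-001",
            "seq": i,
            "engine_temp_c": 85.0 + (i % 5) * 0.5,
            "tire_kpa": 220,
        }
        for i in range(count)
    ]


# Sender side: serialize, then compress before transmission.
raw = json.dumps(make_readings(1000)).encode("utf-8")
compressed = zlib.compress(raw, 6)

# Receiver side: the payload must be decompressed and parsed again
# before any analytics can run, which is where the CPU cost reappears.
restored = zlib.decompress(compressed)
readings = json.loads(restored.decode("utf-8"))

print(f"raw bytes:        {len(raw)}")
print(f"compressed bytes: {len(compressed)}")
print(f"records restored: {len(readings)}")
```

Fewer bytes cross the network, but every record pays a decompress-and-parse cost on arrival, and at IoT volumes that cost adds up.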
Once the data is decompressed, the next (and usually more daunting) challenge is data processing for use in downstream analytics. This can take a variety of formats, but we frequently see Spark processing at this stage of an IoT workflow.
One of Bigstream’s manufacturing customers was struggling with this particular stage, unable to process the incoming data fast enough for its analytics organization and requiring expensive expansion of its cluster.
With millions of its products in customers’ hands, each potentially producing hundreds of data points per second, this company has set its sights on making IoT core to future innovation and the customer experience. But with this CPU bottleneck, its ability to realize its IoT potential was at risk. The company engaged Bigstream to help with the performance problem while reducing its total cost of ownership.
This customer was using Spark on Amazon EMR for its IoT processing for one country. The Controller Area Network (CAN bus) is the predominant protocol for capturing and compressing data within vehicles. Each of the customer’s connected products, whether active or inactive, sends CAN messages more than 1,000 times per day. Kafka streams them and generates an upload to Amazon S3 storage about 100 times per day. Spark then processes the messages in a daily batch job, which includes parsing them and applying mathematical rules. Even with an expanded Spark cluster, the main job was taking 2-3 hours.
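To make "parsing and applying mathematical rules" more concrete, here is a minimal sketch of the kind of per-message work such a batch job might perform. It is not the customer's code: the payload layout, the `SignalRule` type, and the scale and offset values are all hypothetical, chosen only to show a raw integer being extracted from a CAN payload and converted to a physical value with a linear rule.

```python
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class SignalRule:
    """Hypothetical decoding rule: physical = raw * scale + offset."""
    name: str
    start_byte: int
    scale: float
    offset: float


def decode_signal(payload: bytes, rule: SignalRule) -> float:
    """Extract one signal from a CAN payload and apply its linear rule."""
    # Read an unsigned 16-bit big-endian integer at the rule's byte offset.
    (raw,) = struct.unpack_from(">H", payload, rule.start_byte)
    return raw * rule.scale + rule.offset


def decode_message(payload: bytes, rules) -> dict:
    """Decode every configured signal in a payload into a name -> value row."""
    return {rule.name: decode_signal(payload, rule) for rule in rules}


# Illustrative rules only; real signal definitions are vendor-specific.
ENGINE_TEMP = SignalRule("engine_temp_c", start_byte=0, scale=0.25, offset=-40.0)
TIRE_PRESSURE = SignalRule("tire_kpa", start_byte=2, scale=0.5, offset=0.0)

# A made-up 4-byte payload carrying two 16-bit raw values.
payload = struct.pack(">HH", 500, 440)
row = decode_message(payload, [ENGINE_TEMP, TIRE_PRESSURE])
print(row)
```

In a Spark job, logic of this shape would run across every message in the day's uploads, which is why the stage tends to be CPU-bound.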
The customer searched for acceleration solutions to mitigate its problem and found Bigstream. Bigstream’s acceleration solutions provide a seamless way to speed up Spark without altering the code or scaling the hardware. Bigstream can deliver this with software alone or along with specialized hardware like field-programmable gate arrays (FPGAs) or computational storage devices such as the Samsung SmartSSD.
Given the immediate challenge, the company added Bigstream software to its batch pipeline on the EMR cluster running Spark. As with all Bigstream implementations, the customer’s Spark users had to make exactly zero changes to their code. The batch job outputs a structured Parquet file (columnar data format) with roughly 650 columns, representing different sensor data.
The accelerated Spark pipeline performs the same steps as before but completes them six times faster, in 20-30 minutes instead of 2-3 hours. The project is in its early stages, but this impressive result lays the groundwork for expanding beyond the single-country pilot. As the customer looks to scale its IoT capabilities globally, the solution offers the chance for enormous cost savings, because the company can build Spark clusters with a fraction of the server nodes thanks to Bigstream.
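The sixfold figure follows directly from the reported runtimes, as the short calculation below shows. The node estimate at the end is purely illustrative: the baseline of 12 nodes is a hypothetical number, not one from this project, and the estimate assumes throughput scales linearly with node count, which real clusters only approximate.

```python
import math


def speedup(before_minutes: float, after_minutes: float) -> float:
    """Ratio of the original runtime to the accelerated runtime."""
    return before_minutes / after_minutes


def nodes_needed(baseline_nodes: int, factor: float) -> int:
    """Rough node count for the same runtime, assuming linear scaling."""
    return math.ceil(baseline_nodes / factor)


# Reported range: 2-3 hours before, 20-30 minutes after.
low_end = speedup(2 * 60, 20)
high_end = speedup(3 * 60, 30)
print(f"speedup at the low end:  {low_end:.1f}x")
print(f"speedup at the high end: {high_end:.1f}x")

# Hypothetical baseline cluster size, used only for illustration.
print(f"nodes for the same runtime: {nodes_needed(12, low_end)}")
```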
Bigstream is excited to work with this company to innovate and improve its customer experience. IoT represents a wide range of use cases that can impact operations, logistics, and the customer experience. Although these use cases exist in different environments, Bigstream’s Spark acceleration can fill a consistent performance gap to help organizations succeed.