To improve the performance of compute infrastructure and keep up with the expanding requirements of data analytics and AI, many enterprises are looking to hardware acceleration as an integral solution. In most situations, advanced programmable hardware (mainly GPUs and FPGAs) is the primary source of acceleration. By using this hardware, enterprises are gaining computational advantages; however, there are still reasonable concerns about how difficult it is to program.

FPGA vs. GPU Acceleration: Considering Performance/Power

Figure 1: Analytics/AI Pipeline Components


Figure 1 illustrates the elements of a data analytics pipeline: data collection, verification, feature extraction, and analysis, as well as the ML/AI component.

Hardware manufacturers are applying acceleration methods to computational storage, which is storage specifically designed to incorporate an in-line computational element. This approach has been shown to deliver high performance for analytics and AI applications. These devices offer a key advantage because costly computations are offloaded to the storage device rather than being done on the server CPU. Compared to standard storage/CPU methods, computational storage offers these advantages:

  1. Achieving enhanced performance with application-specific programming of the hardware
  2. Freeing up CPU resources by offloading computation from the server to the storage device
  3. Co-locating data and compute, reducing the need to transfer data

This novel approach is promising; however, you should assess it for your specific use case, considering performance, cost, power consumption, and ease of use. Previously, we provided information on the performance/price ratio; in this piece, we are looking at the performance/power ratio. 


Computational Storage Power Comparison Overview


About the 3 Systems

In this scenario, we're comparing three tools on a CSV data read use case: NVIDIA GPUDirect Storage, NVIDIA RAPIDS, and the Samsung SmartSSD powered by Xilinx. CSV read is a crucial step in compute-intensive pipelines (see Figure 1).

In what follows, we define performance as the CSV processing rate, or the “bandwidth” of the processing.
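As a rough illustration of this definition, the sketch below measures bandwidth as bytes parsed divided by elapsed time. It uses Python's standard `csv` module on the CPU, not any of the accelerated systems discussed here; the function name `measure_csv_bandwidth` and the synthetic sample data are assumptions made for the example.

```python
import csv
import io
import time


def measure_csv_bandwidth(csv_text):
    """Parse an in-memory CSV string and return (row_count, bytes_per_second)."""
    num_bytes = len(csv_text.encode("utf-8"))
    start = time.perf_counter()
    # Iterating the reader forces every row to be parsed.
    row_count = sum(1 for _ in csv.reader(io.StringIO(csv_text)))
    elapsed = time.perf_counter() - start
    # Guard against a zero interval on very small inputs.
    return row_count, num_bytes / max(elapsed, 1e-9)


# Synthetic data (hypothetical): one header row plus 100,000 data rows.
sample = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(100_000))
rows, bandwidth = measure_csv_bandwidth(sample)
print(f"{rows} rows parsed at {bandwidth / 1e6:.1f} MB/s")
```

The absolute number depends entirely on the machine and parser; the point is only that the metric is data volume over processing time.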

Here’s a quick refresher on how the three systems work. They are covered in more detail in a previous post.

NVIDIA GPUDirect Storage

  • Addresses analytics and AI end-to-end
  • Uses the GPU as a computational element placed next to an NVMe-based storage device (GPUDirect)
  • Leverages CUDA for programming (RAPIDS)

More information here.

NVIDIA applied its technology to CSV data reads to measure the performance gain over a standard SSD. The results in Figure 2 show a throughput of 4-22 GB/s for 1-8 accelerators.

Samsung SmartSSD Drive

  • Uses an FPGA as the computational element
  • Resides in-line with the storage logic on the same internal PCIe interconnect
  • Performs computation on the storage platform with programming

Bigstream partnered with Samsung to design an accelerator for Apache Spark, including IP for CSV and Parquet processing. For comparison, the SmartSSD was tested using the CSV parsing engine in stand-alone mode. The results in Figure 2 show a throughput of 4-23 GB/s for 1-12 accelerators, alongside the NVIDIA results (for 1-8 accelerators). Note that all results in this discussion are plotted against the number of accelerator cards on the x-axis.

These throughput results are meaningful on their own, but it is also important to consider power consumption when choosing your solution.


Figure 2: SmartSSD Drive Performance Results for CSV Parsing


The Power-Performance Comparison

Figure 3 shows the results of including power consumption as a consideration for analysis. They are presented in terms of performance achieved per unit of power, with the following assumptions based on the related material cited in the discussion above:

  • Tesla V100 GPU: 250-watt max power
  • SmartSSD Drive FPGA: 30-watt max power
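The bandwidth-per-watt metric can be sketched as throughput divided by the total maximum power of the accelerators in use. The function name `bandwidth_per_watt` is hypothetical. The inputs are the range endpoints quoted in the text (22 GB/s at 8 GPUs, 23 GB/s at 12 SmartSSD drives) together with the max-power assumptions above; these endpoints use different card counts, so they are not the like-for-like, per-point data plotted in Figure 3.

```python
def bandwidth_per_watt(throughput_gb_per_s, num_accelerators, watts_per_accelerator):
    """Return GB/s delivered per watt, assuming every card draws its max power."""
    total_watts = num_accelerators * watts_per_accelerator
    return throughput_gb_per_s / total_watts


# Endpoint figures quoted in the text, with the max-power assumptions listed above.
gpu = bandwidth_per_watt(22, num_accelerators=8, watts_per_accelerator=250)
smartssd = bandwidth_per_watt(23, num_accelerators=12, watts_per_accelerator=30)

print(f"GPUDirect Storage (8 cards): {gpu:.3f} GB/s per watt")        # 0.011
print(f"SmartSSD Drive (12 cards):   {smartssd:.3f} GB/s per watt")   # 0.064
```

Using maximum rated power is a simplification; measured power draw under the actual CSV workload could differ for either device.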


Figure 3: Bandwidth per Watt Comparison for CSV Parsing

In this scenario, calculations show almost a 25x increase in performance/power for the SmartSSD over GPUDirect Storage with eight accelerators each. 


FPGA vs. GPU: Power/Performance Final Thoughts

The advantages of computational storage can enhance the performance of data analytics and AI applications. However, for the approach to be practical and useful for deployment, evaluations must consider power consumption.

We have presented throughput performance curves parameterized by power for two different computational storage approaches to CSV data parsing. The results show that, when comparing a like number of accelerators, the SmartSSD drive outperforms the GPUDirect Storage approach in terms of performance/power.

GPUDirect is a research system from NVIDIA to be made available via the NVIDIA DGX-2 appliance platform. To review more information, check out this blog.

As of this writing, the Samsung SmartSSD Drive is a development system, with components available later this year. The drive will be available as a PCIe-pluggable platform. To learn more about Samsung SmartSSD Drive, check out the eBook here.