Hardware-based acceleration is an increasingly important approach to improving the performance of compute infrastructure, addressing the growing demands of data analytics and AI. Typically, acceleration is provided by advanced programmable hardware, such as a GPU or FPGA, which offers computational advantages over general-purpose CPUs, including the ability to tailor the hardware to a specific application. A principal concern, however, is the difficulty of programming such hardware.
Figure 1: Analytics/AI Pipeline Components
Figure 1 presents the various components of a data analytics pipeline, including data collection, verification, feature extraction, and analysis, all in addition to the ML/AI component. A high-performance solution must address all of these components because, by Amdahl's law, the end-to-end speedup is limited by whatever portion of the pipeline is left unaccelerated.
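To make the Amdahl's law point concrete, here is a minimal sketch; the 40% parsing fraction and 20x component speedup are hypothetical numbers chosen for illustration, not measurements from this study.

```python
def amdahl_speedup(accelerated_fraction: float, component_speedup: float) -> float:
    """Overall pipeline speedup when only one portion of it is accelerated."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / component_speedup)

# Hypothetical example: CSV parsing is 40% of total pipeline time.
# Even a 20x parsing accelerator yields only ~1.6x end to end,
# which is why every pipeline component matters.
print(amdahl_speedup(0.40, 20.0))  # ~1.61
```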
Hardware manufacturers have now applied acceleration to storage itself, producing computational storage devices that contain an in-line computational element. These devices are able to provide high performance for data analytics/AI applications. Conceptually, the key advantage of such a platform is that it can offload costly computations to the storage device, relieving the server CPU of that work. In total, computational storage offers the following advantages over standard storage/CPU approaches:
- Faster computation/performance, derived from application-specific programming of hardware
- Offloading of computation from the server to the storage, freeing CPU resources
- Shifting compute to where the data resides, reducing the need to move it and increasing the effective bandwidth of the storage interconnect
This approach is promising, but it is important to evaluate it for the analytics use case across a number of aspects, including performance, price, power consumption, and ease of use. Here, we present a comparison of three approaches (NVIDIA GPUDirect Storage, NVIDIA RAPIDS, and the Samsung SmartSSD drive, powered by Xilinx) for the use case of CSV data read, a key compute-intensive component of the overall AI pipeline presented in Figure 1. In the following, we define performance to be the processing rate, or "bandwidth," of CSV processing.
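To make this metric concrete, the sketch below shows one way such a processing rate can be measured. It uses pandas as a stand-in parser and a hypothetical local file `data.csv`; it is not the harness behind the vendor benchmarks that follow.

```python
import os
import time
import pandas as pd  # stand-in parser; the benchmarks below use accelerated engines

def csv_bandwidth_gbps(path: str) -> float:
    """Bytes of CSV consumed per second of parse time, in GB/s."""
    size_bytes = os.path.getsize(path)
    start = time.perf_counter()
    pd.read_csv(path)
    elapsed = time.perf_counter() - start
    return size_bytes / elapsed / 1e9

print(f"{csv_bandwidth_gbps('data.csv'):.2f} GB/s")
```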
NVIDIA GPUDirect Storage
NVIDIA has announced GPUDirect Storage, which uses a GPU as a computational element placed next to an NVMe-based storage device. In addition, NVIDIA offers RAPIDS, a suite of GPU-accelerated data analytics libraries programmed using the CUDA environment. Together, these approaches are designed to address analytics and AI end to end. The announcement, with descriptions and performance numbers, can be found here.
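As an illustration of the programming model, a minimal RAPIDS sketch is shown below; it assumes a CUDA-capable GPU and a hypothetical local file `data.csv`.

```python
import cudf  # RAPIDS GPU DataFrame library

# The CSV is parsed on the GPU rather than the host CPU. On a system
# with GPUDirect Storage enabled, the data path from NVMe to GPU
# memory can also bypass host bounce buffers.
df = cudf.read_csv("data.csv")
print(df.head())
```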
NVIDIA has applied this technology to a CSV data read application to measure the performance gain over standard SSDs. Results showed throughput of 4-22 GB/s across a range of 1-8 accelerators.
Samsung SmartSSD Drive
Samsung Semiconductor has developed the SmartSSD drive platform, which uses an FPGA as the computational element. The FPGA sits in line with the storage logic, with both residing on the same internal PCIe interconnect. The FPGA can be programmed to process data as it is read, performing computation on the storage platform itself. Bigstream has developed a method to accelerate Apache Spark on the Samsung SmartSSD drive platform, including IP for CSV and Parquet processing.
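Because Bigstream's approach accelerates Apache Spark itself, the target workload looks like an ordinary Spark CSV scan. The sketch below is illustrative only: it is standard PySpark, runs unaccelerated on a stock installation, and makes no assumptions about Bigstream's internal APIs.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-scan").getOrCreate()

# An ordinary CSV scan. On an accelerated SmartSSD deployment, the
# parsing work behind a read like this is what gets offloaded to the
# drive's FPGA instead of consuming server CPU cycles.
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.show(5)
```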
Bigstream has tested the SmartSSD drive, using its CSV parsing engine in stand-alone mode for this comparison. Results in Figure 2 show a throughput of 4-16 GB/s for the range of 1-12 accelerators, alongside the NVIDIA results (for 1-8 accelerators) discussed above. Note that all results in this discussion are parameterized by the number of accelerator cards employed on the x-axis.
These accelerated performance results are promising, but it is important to take price into account when analyzing the effectiveness of these approaches.
Figure 2: SmartSSD Drive Performance Results for CSV Parsing
Performance/Price Comparison
Figure 3 shows the results when accelerator price is included in the analysis. Results are presented in terms of performance achieved per U.S. dollar, with the following assumptions based on the related material cited in the discussion above (only the cost of the accelerators is considered; the cost of the chassis, the interconnect, and so on is excluded):
- $1,500 retail cost for the SmartSSD drive FPGA. This is a notional estimate based on the current cost of SSDs, for the sake of this discussion, and does not reflect actual pricing, which is to be determined.
- $8,000 retail cost for the Tesla V100 GPU

Figure 3: Performance per $ Comparison for CSV Parsing
In this scenario, calculations show up to an almost 3x advantage in performance/price for the SmartSSD drive over GPUDirect Storage when eight accelerators are employed on each side.
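As a back-of-the-envelope check, the sketch below recomputes performance per dollar from only the throughput endpoints and accelerator prices quoted above. The eight-versus-eight comparison in Figure 3 reads the SmartSSD value from the curve, so its exact ratio differs slightly from this estimate.

```python
def perf_per_dollar(throughput_gbps: float, num_accel: int, unit_price: float) -> float:
    """CSV parsing throughput per U.S. dollar of accelerator cost."""
    return throughput_gbps / (num_accel * unit_price)

# Endpoints quoted in the text; unit prices are the assumptions listed above.
gpu = perf_per_dollar(22.0, 8, 8_000)   # GPUDirect Storage at 8 GPUs
ssd = perf_per_dollar(16.0, 12, 1_500)  # SmartSSD drive at 12 drives
print(f"{ssd / gpu:.1f}x")  # ~2.6x in the SmartSSD drive's favor
```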
FPGA vs. GPU Acceleration: Final Thoughts
The advantages provided by computational storage can enhance the performance of many data analytics and AI applications. However, for the approach to be practical and useful for deployment, evaluations must take performance/price into account.
We have shown throughput performance curves, parameterized by accelerator count and weighed against cost, for two different computational storage approaches to CSV data parsing. Results show that when comparing a like number of accelerators, the SmartSSD drive outperforms the GPUDirect Storage approach in terms of performance/price.
An additional aspect for consideration is the purchase price of the GPUDirect appliance itself; SmartSSD drive unit pricing, by contrast, is close to that of individual, commodity SSDs.
GPUDirect is a research system from NVIDIA to be made available via the NVIDIA DGX-2 appliance platform. Find more information here.
The Samsung SmartSSD drive is currently a development system, with production components available later this year. It will be available as a PCIe-pluggable platform, and you can further explore the solution here.