Big data and Hadoop are all about exploiting new value and opportunities with data. In financial trading, business and some areas of science, it’s all about being fastest or first to take advantage of the data. The bigger the data sets, the smarter the analytics. The next competitive edge with big data comes when you layer in flash acceleration. The challenge is scaling performance in Hadoop clusters.
The most cost-effective option emerging for breaking through disk I/O bottlenecks to scale performance is to use high-performance read/write flash cache acceleration cards for caching. This is essentially a way to get more work for less cost, by bringing data closer to the processing. In testing, the LSI® Nytro™ product has been shown to reduce the time it takes to complete Hadoop software framework jobs by up to 33%.
Flash cache cards increase Hadoop application performance
Combining flash cache acceleration cards with Hadoop software is a big opportunity for end users and suppliers. LSI estimates that less than 10% of Hadoop software installations today incorporate flash acceleration [1]. This will grow rapidly as companies see the increased productivity and ROI of flash to accelerate their systems. And Hadoop software adoption is also growing fast. IDC predicts a CAGR of as much as 60% by 2016 [2]. Drivers include IT security, e-commerce, fraud detection and mobile data user management. Gartner predicts that Hadoop software will be in two-thirds of advanced analytics products by 2015 [3]. Many thousands of Hadoop software clusters are already deployed.
Where flash makes the most immediate sense is with those who have smaller clusters doing lots of in-place batch processing. Hadoop is purpose-built for analyzing a variety of data, whether structured, semi-structured or unstructured, without the need to define a schema or otherwise anticipate results in advance. Hadoop enables scaling that allows an unprecedented volume of data to be analyzed quickly and cost-effectively on clusters of commodity servers. Speed gains are about data proximity. This is why flash cache acceleration typically delivers the highest performance gains when the card is placed directly in the server on the PCI Express® (PCIe) bus.
Combining the best of flash and HDDs to drive higher performance and storage capacity
PCIe flash cache cards are now available with multiple terabytes of NAND flash storage, which substantially increases the hit rate. We offer a solution with both onboard flash modules and Serial-Attached SCSI (SAS) interfaces to enable high-performance direct-attached storage (DAS) configurations consisting of solid state and hard disk drive storage. This couples the low-latency performance benefits of flash with the capacity and cost-per-gigabyte advantages of HDDs.
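To see why a higher hit rate matters, here is a minimal sketch of the usual weighted-average view of a cache in front of slower storage. The function name and both latency figures are assumptions chosen for illustration only; they are not measurements of any LSI product or of any particular drive.

```python
def average_read_latency_ms(hit_rate, flash_ms, hdd_ms):
    """Weighted average of cache hits (served by flash) and misses (served by HDD)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * flash_ms + (1.0 - hit_rate) * hdd_ms


# Assumed, illustrative latencies (hypothetical values, not benchmarks).
FLASH_MS = 0.1
HDD_MS = 10.0

# A small cache that catches half the reads versus a larger one that catches 90%.
low_hit = average_read_latency_ms(0.5, FLASH_MS, HDD_MS)
high_hit = average_read_latency_ms(0.9, FLASH_MS, HDD_MS)
print(f"50% hit rate: {low_hit:.2f} ms, 90% hit rate: {high_hit:.2f} ms")
```

Under these assumed numbers the misses dominate the average, which is why adding flash capacity to raise the hit rate pays off more than making the flash itself slightly faster.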
To keep the processor close to the data, Hadoop uses servers with DAS. And to get the data even closer to the processor, the servers are usually equipped with significant amounts of random access memory (RAM). An additional benefit: Smart implementation of Hadoop and flash components can reduce the overall server footprint and simplify scaling, with some solutions enabling up to 128 devices to share a very high bandwidth interface. By comparison, most commodity servers provide eight or fewer SATA ports for disks, which limits expandability.
Hadoop is great, but flash-accelerated Hadoop is best. It’s an effective way, as you work to extract full value from big data, to secure a competitive edge.
It may sound crazy, but hard disk drives (HDDs) do not have a delete command. Now we all know HDDs have a fixed capacity, so over time the older data must somehow get removed, right? Actually it is not removed, but overwritten. The operating system (OS) uses a reference table to track the locations (addresses) of all data on the HDD. This table tells the OS which spots on the HDD are used and which are free. When the OS or a user deletes a file from the system, the OS simply marks the corresponding spot in the table as free, making it available to store new data.
The HDD is told nothing about this change, and it does not need to know since it would not do anything with that information. When the OS is ready to store new data in that location, it just sends the data to the HDD and tells it to write to that spot, directly overwriting the prior data. It is simple and efficient, and no delete command is required.
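The delete-by-bookkeeping behavior described above can be sketched with a toy model. The class and method names here (`ToyDisk`, `ToyFileSystem`, `save`, `delete`) are hypothetical, and a real file system tracks far more than one address per file; this only illustrates that a delete changes the OS table and nothing on the drive.

```python
class ToyDisk:
    """Stand-in for an HDD: all it can do is write data to an address."""

    def __init__(self, num_blocks):
        self.blocks = [None] * num_blocks

    def write(self, address, data):
        self.blocks[address] = data  # directly overwrites whatever was there


class ToyFileSystem:
    """Stand-in for the OS reference table of used and free addresses."""

    def __init__(self, disk):
        self.disk = disk
        self.free = set(range(len(disk.blocks)))
        self.files = {}  # file name -> address

    def save(self, name, data):
        address = min(self.free)  # pick the lowest free address
        self.free.remove(address)
        self.disk.write(address, data)
        self.files[name] = address
        return address

    def delete(self, name):
        address = self.files.pop(name)
        self.free.add(address)  # only the table changes; the disk is told nothing
        return address


disk = ToyDisk(4)
fs = ToyFileSystem(disk)

a = fs.save("old.txt", "old data")
fs.delete("old.txt")
stale = disk.blocks[a]  # the deleted data is still physically on the disk

b = fs.save("new.txt", "new data")  # reuses the freed address and overwrites it
print(a, stale, b, disk.blocks[b])
```

Note that `ToyDisk` has no delete method at all, mirroring the point that the drive never needs one.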
However, with the advent of NAND flash-based solid state drives (SSDs) a new problem emerged. In my blog, "Gassing up your SSD," I explain how NAND flash memory pages cannot be directly overwritten with new data, but must first be erased at the block level through a process called garbage collection (GC). I further describe how the SSD uses non-user space in the flash memory (over-provisioning or OP) to improve performance and longevity of the SSD. In addition, any user space not consumed by the user becomes what we call dynamic over-provisioning, so named because it changes as the amount of stored data changes.
When less data is stored by the user, the amount of dynamic OP increases, further improving performance and endurance. The problem I alluded to earlier is caused by the lack of a delete command. Without a delete command, every SSD will eventually fill up with data, both valid and invalid, eliminating any dynamic OP. The result would be the lowest possible performance at that factory OP level. So unlike HDDs, SSDs need to know what data is invalid in order to provide optimum performance and endurance.
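A rough sketch of that effect, counting space in pages: the drive below only learns about writes, never about deletes, so its view of "valid" data can only grow. `ToySSD` and its page counts are assumptions made up for illustration, not a model of any real controller.

```python
class ToySSD:
    """Tracks which logical addresses the drive believes hold valid data."""

    def __init__(self, physical_pages, user_pages):
        self.physical_pages = physical_pages
        self.user_pages = user_pages
        self.believed_valid = set()

    def write(self, lba):
        self.believed_valid.add(lba)

    def factory_op(self):
        # Non-user space set aside at manufacture.
        return self.physical_pages - self.user_pages

    def dynamic_op(self):
        # User space the drive believes is not holding valid data.
        return self.user_pages - len(self.believed_valid)


ssd = ToySSD(physical_pages=128, user_pages=100)  # assumed sizes

for lba in range(40):
    ssd.write(lba)
partly_full = ssd.dynamic_op()  # 60 pages of dynamic OP while the drive is young

# Over time the OS writes to every address. It also deletes files along the
# way, but only in its own table, so the drive never hears about it.
for lba in range(40, 100):
    ssd.write(lba)

print(partly_full, ssd.dynamic_op(), ssd.factory_op())
```

Once every address has been written, the dynamic OP in this model is gone for good, even if the OS considers most of the drive free, and only the factory OP remains.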
Keeping your SSD TRIM
A number of years ago, the storage industry got together and developed a solution between the OS and the SSD by creating a new SATA command called TRIM. It is not a command that forces the SSD to immediately erase data, as some people believe. Rather, the TRIM command can be thought of as a message from the OS about which previously used addresses on the SSD are no longer holding valid data. The SSD takes those addresses and updates its own internal map of its flash memory to mark those locations as invalid. With this information, the SSD no longer moves that invalid data during the GC process, eliminating wasted time rewriting invalid data to new flash pages. It also reduces the number of write cycles on the flash, increasing the SSD’s endurance. Another benefit of the TRIM command is that more space is available for dynamic OP.
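As a simplified sketch of that saving, the toy map below counts how many pages GC would have to copy out of one flash block before erasing it, with and without a TRIM message. The names (`ToyFlashMap`, `pages_to_copy`) and the eight-page block are hypothetical; real drives use much larger blocks and more elaborate mapping.

```python
class ToyFlashMap:
    """The SSD's internal record of which logical addresses hold valid data."""

    def __init__(self):
        self.valid = set()

    def write(self, lba):
        self.valid.add(lba)

    def trim(self, lbas):
        # TRIM erases nothing; it only marks these addresses as invalid.
        self.valid.difference_update(lbas)


def pages_to_copy(block_lbas, flash_map):
    """Pages GC must rewrite elsewhere before it can erase this block."""
    return sum(1 for lba in block_lbas if lba in flash_map.valid)


block = list(range(8))        # one flash block holding addresses 0..7
deleted_by_os = [2, 3, 4, 5, 6]  # files the OS has deleted in its own table

without_trim = ToyFlashMap()
with_trim = ToyFlashMap()
for lba in block:
    without_trim.write(lba)
    with_trim.write(lba)

with_trim.trim(deleted_by_os)  # the OS passes the freed addresses to the drive

copies_without = pages_to_copy(block, without_trim)
copies_with = pages_to_copy(block, with_trim)
print(copies_without, copies_with)
```

In this example GC copies all eight pages when the drive has not been told about the deletes, and only three when it has; the five pages it skips are both time saved and write cycles not spent.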
Today, most current operating systems and SSDs support TRIM, and all SandForce Driven™ member SSDs have always supported TRIM. Note that most RAID environments do not support TRIM, although some RAID 0 configurations have claimed to support it. I have presented on this topic in detail previously. You can view the presentation in full here. In my next blog I will explain how there may be an alternate solution using SandForce Driven member SSDs.