Most users have no idea that reading electronic information from a data storage medium like a hard disk drive (HDD) or solid state drive (SSD) is plagued with read errors. For this reason error correction codes (ECC) are used to fix the random bit errors that arise during the reading process before the incorrect data is returned to the user. But the error correction codes can only handle so many errors at one time. If data errors exceed the ECC limits, the data goes uncorrected and is lost forever. More recent ECC algorithms like the LSI SHIELD error correction technology go a lot farther to protect user data than prior solutions.
What happens to the data when the ECC fails?
If the ECC fails, only a backup protection mechanism will recover the data. There are three alternatives. First, users should always back up their critical data since ECC failure and other threats can destroy data or render it inaccessible such as natural disasters (earthquakes, tornadoes, flooding etc.) that cause heavy damage to buildings and their contents, lightning overloading that can burn up a computer without adequate electrical protection, and of course computer theft. Any backup system should be either automated or at least run consistently if it is manual. Industry reports cite that less than 10% of computer users back up their data. That is not very comforting.
The second solution is to employ a RAID (Redundant Array of Independent Disks) array that uses multiple storage devices with one or more of the drives acting as a parity device to provide redundancy. That way if one drive fails, the redundant drive provides enough parity information to restore the original data. This type of system is very common in enterprise environments—a work computer—but hardly used in home systems or laptop PC.
Is the third solution simple, automatic, and operable in a single-drive environment?
Yes. Yes. And Yes. LSI® SandForce® flash and SSD controllers have a feature called RAISE™ data protection that meets all of these needs. Introduced in 2009 with the first SandForce controller, RAISE technology stands for Redundant Array of Independent Silicon Elements. It sounds like RAID, and acts something like RAID, but protects data using a single drive. With RAISE technology, the individual flash die act like the drives in a RAID array. The original RAISE level 1 technology protects against single page and block failures in the flash. These types of failures are beyond the protection of the ECC, but RAISE technology can recover the data.
With the introduction of the SF3700 this month, RAISE technology now offers more flexibility to deliver greater data protection. With the original RAISE level 1, the space of a full flash die had to be allocated solely to protect user data. In small-capacity configurations, like 64GB, RAISE level 1 required too much over provisioning and therefore had to be disabled or, with RAISE left on, its available capacity reduced to 60GB or 55GB. With a new enhancement to the SF3700, no such tradeoff is necessary. The new Fractional RAISE option for this first level of protection uses only a small portion of a die to protect user data in even the smallest configurations and preserve over provisioning (OP). This is important because, as I explained in my blog titled Gassing up your SSD, the more space you allocate for OP, the lower the write amplification, which translates to higher performance during writes and longer endurance of the flash memory.
Stronger data protection with RAISE level 2
A new RAISE level 2 capability offers even stronger data protection, safeguarding against multiple, simultaneous page and block failures, as well as a full die failure. If a die fails, the SandForce controller recovers the user data. RAISE level 2 includes Auto-Reallocation that can be set up to automatically redistribute and protect user data in the event of a subsequent die failure. Because the option to protect against a second die failure would reduce the available OP area, the RAISE level 2 feature can be set up to simply drop back to RAISE level 1 protection without sacrificing any OP space. .
Another new capability is an additional (9th) flash channel that enables the manufacturer to populate an extra flash package with one die that enables full RAISE level 1 protection while maintaining maximum user data capacity such as 64GB, 128GB, 256GB, etc. Without the 9th channel option, the SSD capacity would be forced to sacrifice a few GBs of capacity (reducing available user capacity to 60GB, 120GB, 240GB, respectively) because RAISE requires extra storage space.
Although all these new features cannot protect against the would-be thief or catastrophic drive failures from electrical surges or natural disasters, the probability of those events is much lower than a simple ECC failure. That’s why you would be best suited to have an SSD with RAISE technology to automatically protect against the more common ECC failures and then make a backup copy of your system at least periodically to protect your data against those far more serious events.
For the uninitiated, low-density parity-check (LDPC) code is an error correction code (ECC) that is used to both detect and correct errors on data that is transmitted from one point to another. All ECC types include correction data, so when information is transmitted with errors, the receiver has enough information to fix the errors without having to ask the source for the data again.
This enables transmitted data to maintain a constant speed as is required with digital television signals. What you don’t want is for the image to freeze repeatedly while waiting for correction data to be sent multiple times.
LDPC code was first presented to the world by Robert G. Gallager at MIT in 1960. It was very advanced for its time and, as it turned out, required a fantastic amount of computation to use real-time. The problem was that back in 1960, vacuum tube computers of that period performed about 100 times less work than the microprocessor-powered computers of today. In 1960, you would need a computer the size of a 2,000-square-foot house to process the LDPC correction information in real-time. This was hardly economical, so LDPC was mostly lost for nearly 40 years, as other simpler codes took its place over that period.
What was old is new again
In the mid-1990s, engineers working on satellite transmissions for digital television dusted off the LDPC codes and started using them for real-time operations. By then, computer processing had seen dramatic reductions in size and costs. Fast-forward to the past 5 years: we have seen a major increase in LDPC development and use because it appears to be the best solution for high-speed data transmissions, especially those subject to heavy levels of electrical noise that induce higher error rates. Also, the processing power of target devices like WiFi receivers and HDDs has grown even stronger and faster than some of the main CPUs of a few years ago. This enables LDPC to be deployed for little additional cost with the advantage of real-time data correction superior to correction offered by simpler codes.
If you have seen one LDPC solution, have you seen them all?
Nothing could be further from the truth. For example, an LDPC solution designed for satellite communication cannot be used for HDDs since there is no direct porting of the code, though there are distinct advantages to the two engineering teams sharing their knowledge and experience through their development efforts. Take, for example, the LDPC code that LSI has been shipping in its TrueStore® HDD read channel solutions for 3 years now. When LSI acquired SandForce and started work on SHIELD™ error correction code (based on LDPC) for flash controllers, there was no direct porting of that HDD code to support SSDs. However, the HDD development team’s knowledge and experience from creating the HDD code greatly improved the SSD team’s ability to more quickly bring SHIELD technology to the next-generation SandForce flash controller.
How do LDPC solutions for SSDs differ?
Many LDPC providers claim that their offerings rival the capabilities of competitive solutions, though often they aren’t telling the whole story. All LDPC solutions start with what is called hard-decision LDPC – a digital correction algorithm that operates at line rate on all data passing through the correction engine. The algorithm uses the meta-data generated from the user and system data stored on the flash memory, and helps recreate the user data when the flash memory returns it with errors. Hard-decision LDPC catches most errors from the flash memory, though sometimes it can be overwhelmed by an inordinate number of errors. That is where soft-decision LDPC – a more analog-based correction algorithm – comes into play.
Can a soft-decision be strong enough for my data?
Soft-decision LDPC is an error correction method that looks at other information beyond the actual ECC data. Soft-decision, in a sense, looks at the meta-data of the meta-data. The simplest form of soft-decision LDPC may just re-read the data at a different reference voltage, as if asking a person “can you say that again?” More complex soft-decision might be compared to listening to a man with a heavy French accent speaking English. You know he just said something in English, but you could not clearly grasp what he said. You ask some questions and, from his answers, soon realize what he originally said and are now back on track. While this might seem more like guessing at the answer, soft-decision LDPC uses statistics to help ensure the answers are not false positive results. As a result, soft-decision LDPC uncovers a new set of engineering problems that need to be solved, opening new opportunities for flash controller manufacturers to create powerful intellectual property (IP). For that reason, you’re likely to learn very little about how a given company’s soft-decision LDPC works.
At the 2013 Flash Memory Summit in Santa Clara, California, LSI demonstrated its SHIELD Advanced Error Correction Technology. SHIELD technology includes hard and soft-decision LDPC with digital signal processing (DSP) and a number of other unique features designed to optimize future NAND flash memory operation in compute environments. One feature, called Adaptive Code Rate, works with other LSI features to enable the spare area in flash memory reserved for ECC data to occupy less space than the manufacturer allocation and then dynamically grow to accommodate inevitable increases in flash error rates. The soft-decision LDPC capability offers multiple strengths of correction, with each activating only as necessary to ensure the lowest possible real-time latency.
So it’s clear that all LDPC solutions are far from the same. When evaluating LDPC solutions, be sure to understand how they manage data correction when it exceeds the ability of the hard-decision LDPC. Also, make sure the algorithms are actually in use in a product. Otherwise, the product might turn out to be a science experiment that never works.
Tags: Adaptive Code Rate, ECC, error correction, flash controllers, flash memory, Flash Memory Summit, LDPC, NAND flash, Robert Gallager, SandForce, SHIELD, solid state drive, SSD, TrueStore