Most consumers are skeptical when they see a manufacturer whipping out grandiose performance claims. And for good reason. The manufacturer could be stretching the truth, twisting the results, or just being downright misleading. From this distrust grew demand for 3rd-party writers to review products, test claims and provide an unbiased analysis of the device’s performance and other capabilities – as consumers would experience themselves.
Who can really claim to be an SSD benchmarking expert?
Solid state drive (SSD) technology is still relatively new in the computer industry, and in many ways SSDs are profoundly different from hard disk drives – perhaps most notably, in the way they record data, to a NAND cell rather than on spinning media. Because of differences in their operation, SSDs have to be tested in ways that are not necessarily obvious.
Can anyone who simply runs a benchmark application claim to be an expert? I would say not. Just as anyone sitting behind the wheel of a car is not necessarily an expert driver. The problem is that it is hard to determine the thoroughness and expertise of an SSD reviewer. Does the author really understand the details behind the technology to run adequate tests and analyze the results?
Can “experts” present bad data?
Maybe it is obvious, but of course experts can be wrong, especially when they are self-proclaimed mavens without deep experience in the technology they cover. At a minimum, you can generally count on them to act in good faith – that is, to not be intentionally misleading – but they can easily be misinformed (for instance, by manufacturers) and perpetuate the misinformation. What’s more, some reviewers are pressured to do a cursory analysis of an SSD as they crank through countless product evaluations under unremitting deadlines – a crush that can cause oversights in telling aspects of a drive’s performance. In any case, it is not good to rely on bad data no matter the intention.
What makes for a thorough SSD review?
Some reviewers have gone to great lengths to ensure their SSD analysis is extremely detailed and represents a real-world environment and performance. These reviewers will generally talk about how their analysis simulates a true user or server environment. The trouble can begin if a reviewer doesn’t recognize normal operation of an SSD in its own environment. With SSDs, “normal” is when garbage collection is operating, which greatly impacts overall performance. It’s important for reviewers to recognize that, with a new SSD, garbage collection is inactive until at least one full physical capacity of data has been sequentially written to the device. For example, with a 256GB SSD, 256GB of data must be written to trigger garbage collecting. At that point, garbage collection is ongoing, the drive has reached its steady-state performance, and the device is ready for evaluation. Random writes are another story, requiring up to three passes (full-capacity writes) randomly written to the SSD before the steady-state performance level shown below is reached.
You can see that running only a few minutes of random write tests on this SSD logs performance of over 275 MB/s. However, once garbage collection starts, performance plunges and then takes up to 3 hours before the true performance of 25 MB/s (a 90% drop) is finally evident – a phenomenon that often is not communicated clearly in reviews nor widely understood.
Good benchmarkers will discuss how their review factors in both garbage collection preparation and steady-state performance testing. Test results that purportedly achieve steady state in less time than in the example above are unlikely to reflect real-world performance. This is all part of what is called SSD preconditioning, but keep in mind that different tests require different steps for preconditioning.
For additional information on this topic, you can review my presentation from Flash Memory Summit 2013 on “Don’t let your favorite benchmarks lie to you.”
For the uninitiated, low-density parity-check (LDPC) code is an error correction code (ECC) that is used to both detect and correct errors on data that is transmitted from one point to another. All ECC types include correction data, so when information is transmitted with errors, the receiver has enough information to fix the errors without having to ask the source for the data again.
This enables transmitted data to maintain a constant speed as is required with digital television signals. What you don’t want is for the image to freeze repeatedly while waiting for correction data to be sent multiple times.
LDPC code was first presented to the world by Robert G. Gallager at MIT in 1960. It was very advanced for its time and, as it turned out, required a fantastic amount of computation to use real-time. The problem was that back in 1960, vacuum tube computers of that period performed about 100 times less work than the microprocessor-powered computers of today. In 1960, you would need a computer the size of a 2,000-square-foot house to process the LDPC correction information in real-time. This was hardly economical, so LDPC was mostly lost for nearly 40 years, as other simpler codes took its place over that period.
What was old is new again
In the mid-1990s, engineers working on satellite transmissions for digital television dusted off the LDPC codes and started using them for real-time operations. By then, computer processing had seen dramatic reductions in size and costs. Fast-forward to the past 5 years: we have seen a major increase in LDPC development and use because it appears to be the best solution for high-speed data transmissions, especially those subject to heavy levels of electrical noise that induce higher error rates. Also, the processing power of target devices like WiFi receivers and HDDs has grown even stronger and faster than some of the main CPUs of a few years ago. This enables LDPC to be deployed for little additional cost with the advantage of real-time data correction superior to correction offered by simpler codes.
If you have seen one LDPC solution, have you seen them all?
Nothing could be further from the truth. For example, an LDPC solution designed for satellite communication cannot be used for HDDs since there is no direct porting of the code, though there are distinct advantages to the two engineering teams sharing their knowledge and experience through their development efforts. Take, for example, the LDPC code that LSI has been shipping in its TrueStore® HDD read channel solutions for 3 years now. When LSI acquired SandForce and started work on SHIELD™ error correction code (based on LDPC) for flash controllers, there was no direct porting of that HDD code to support SSDs. However, the HDD development team’s knowledge and experience from creating the HDD code greatly improved the SSD team’s ability to more quickly bring SHIELD technology to the next-generation SandForce flash controller.
How do LDPC solutions for SSDs differ?
Many LDPC providers claim that their offerings rival the capabilities of competitive solutions, though often they aren’t telling the whole story. All LDPC solutions start with what is called hard-decision LDPC – a digital correction algorithm that operates at line rate on all data passing through the correction engine. The algorithm uses the meta-data generated from the user and system data stored on the flash memory, and helps recreate the user data when the flash memory returns it with errors. Hard-decision LDPC catches most errors from the flash memory, though sometimes it can be overwhelmed by an inordinate number of errors. That is where soft-decision LDPC – a more analog-based correction algorithm – comes into play.
Can a soft-decision be strong enough for my data?
Soft-decision LDPC is an error correction method that looks at other information beyond the actual ECC data. Soft-decision, in a sense, looks at the meta-data of the meta-data. The simplest form of soft-decision LDPC may just re-read the data at a different reference voltage, as if asking a person “can you say that again?” More complex soft-decision might be compared to listening to a man with a heavy French accent speaking English. You know he just said something in English, but you could not clearly grasp what he said. You ask some questions and, from his answers, soon realize what he originally said and are now back on track. While this might seem more like guessing at the answer, soft-decision LDPC uses statistics to help ensure the answers are not false positive results. As a result, soft-decision LDPC uncovers a new set of engineering problems that need to be solved, opening new opportunities for flash controller manufacturers to create powerful intellectual property (IP). For that reason, you’re likely to learn very little about how a given company’s soft-decision LDPC works.
At the 2013 Flash Memory Summit in Santa Clara, California, LSI demonstrated its SHIELD Advanced Error Correction Technology. SHIELD technology includes hard and soft-decision LDPC with digital signal processing (DSP) and a number of other unique features designed to optimize future NAND flash memory operation in compute environments. One feature, called Adaptive Code Rate, works with other LSI features to enable the spare area in flash memory reserved for ECC data to occupy less space than the manufacturer allocation and then dynamically grow to accommodate inevitable increases in flash error rates. The soft-decision LDPC capability offers multiple strengths of correction, with each activating only as necessary to ensure the lowest possible real-time latency.
So it’s clear that all LDPC solutions are far from the same. When evaluating LDPC solutions, be sure to understand how they manage data correction when it exceeds the ability of the hard-decision LDPC. Also, make sure the algorithms are actually in use in a product. Otherwise, the product might turn out to be a science experiment that never works.
Tags: Adaptive Code Rate, ECC, error correction, flash controllers, flash memory, Flash Memory Summit, LDPC, NAND flash, Robert Gallager, SandForce, SHIELD, solid state drive, SSD, TrueStore
Part two of this Write Amplification (WA) series covered how WA works in solid-state drives (SSDs) that use data reduction technology. I mentioned that, with one of these SSDs, the WA can be less than one, which can greatly improve flash memory performance and endurance.
Why is it important to know your SSD write amplification?
Well, it’s not really necessary to know the write amplification of your SSD at any particular point in time, but you do want an SSD with the lowest WA available. The reason is that the limited number of program/erase cycles that NAND flash can support keeps dropping with each generation of flash developed. A low WA will ensure the flash memory lasts longer than flash on an SSD with a higher WA.
A direct benefit of a WA below one is that the amount of dynamic over provisioning is higher, which generally provides higher performance. In the case of over provisioning, more is better, since a key attribute of SSD is performance. Keep in mind that, beyond selecting the best controller, you cannot control the WA of an SSD.
Just how smart are the SSD SMART attributes?
The monitoring system SMART (Self-Monitoring, Analysis and Reporting Technology) tracks various indicators of hard disk solid state drive reliability – including the number of errors corrected, bytes written, and power-on hours – to help anticipate failures, enabling users to replace storage before a failure causes data loss or system outages.
Some of these indicators, or attributes, point to the status of the drive health and others provide statistical information. While all manufacturers use many of these attributes in the same or a similar way, there is no standard definition for each attribute, so the meaning of any attribute can vary from one manufacturer to another. What’s more, there’s no requirement for drive manufacturers to list their SMART attributes.
How to measure missing attributes by extrapolation
Most SSDs provide some list of SMART attributes, but WA typically is excluded. However, with the right tests, you can sometimes extrapolate, with some accuracy, the WA value. We know that under normal conditions, an SSD will have a WA very close to 1:1 when writing data sequentially.
For an SSD with data reduction technology, you must write data with 100% entropy to ensure you identify the correct attributes, then rerun the tests with an entropy that matches your typical data workload to get a true WA calculation. SSDs without data reduction technology do not benefit from entropy, so the level of entropy used on them does not matter.
To measure missing attributes by extrapolation, start by performing a secure erase of the SSD, and then use a program to read all the current SMART attribute values. Some programs do not accurately display the true meaning of an attribute simply because the attribute itself contains no description. For you to know what each attribute represents, the program reading the attribute has to be pre-programmed by the manufacturer. The problem is that some programs mislabel some attributes. Therefore, you need to perform tests to confirm the attributes’ true meaning.
Start writing sequential data to the SSD, noting how much data is being written. Some programs will indicate exactly how much data the SSD has written, while others will reveal only the average data per second over a given period. Either way, the number of bytes written to the SSD will be clear. You want to write about 10 or more times the physical capacity of the SSD. This step is often completed with IOMeter, VDbench, or other programs that can send large measurable quantities of data.
At the end of the test period, print out the SMART attributes again and look for all attributes that have a different value than at the start of the test. Record the attribute number and the difference between the two test runs. You are trying to find one that represents a change of about 10, or the number of times you wrote to the entire capacity of the SSD. The attribute you are trying to find may represent the number of complete program/erase cycles, which would match your count almost exactly. You might also find an attribute that is counting the number of gigabytes (GBs) of data written from the host. To match that attribute, take the number of times you wrote to the entire SSD and multiply by the physical capacity of the flash. Technically, you already know how much you wrote from the host, but it is good to have the drive confirm that value.
Doing the math
When you find candidates that might be a match (you might have multiple attributes), secure erase the drive again, this time writing randomly with 4K transfers. Again, write about 10 times the physical capacity of the drive, then record the SMART attributes and calculate the difference from the last recording of the same attributes that changed between the first two recordings. This time, the change you see in the data written from the host should be nearly the same as with the sequential run. However, the attribute that represents the program/erase cycles (if present) will be many times higher than during the sequential run.
To calculate write amplification, use this equation:
( Number of erase cycles x Physical capacity in GB ) / Amount of data written from the host in GB
With sequential transfers, this number should be very close to 1. With random transfers, the number will be much higher depending on the SSD controller. Different SSDs will have different random WA values.
If you have an SSD with the type of data reduction technology used in the LSI® SandForce® controller, you will see lower and lower write amplification as you approach your lowest data entropy when you test with any entropy lower than 100%. With this method, you should be able to measure the write amplification of any SSD as long as it has erase cycles and host data-written attributes or something that closely represents them.
Protect your SSD against degraded performance
The key point to remember is that write amplification is the enemy of flash memory performance and endurance, and therefore the users of SSDs. This three-part series examined all the elements that affect WA, including the implications and advantages of a data reduction technology like the LSI SandForce DuraWrite™ technology. Once you understand how WA works and how to measure it, you will be better armed to defend yourself against this beastly cause of degraded SSD performance.
This three-part series examines some of the details behind write amplification, a key property of all NAND flash-based SSDs that is often misunderstood in the industry.
In part one of this Write Amplification (WA) series, I examined how WA works in basic solid-state drives (SSDs). Part two now takes a deeper look at how SSDs that use some form of data reduction technology can see a very big and positive impact on WA.
Data reduction technology can master data entropy
The performance of all SSDs is influenced by the same factors – such as the amount of over provisioning and levels of random vs. sequential writing – with one major exception: entropy. Only SSDs with data reduction technology can take advantage of entropy – the degree of randomness of data – to provide significant performance, endurance and power-reduction advantages.
Data reduction technology parlays data entropy (not to be confused with how data is written to the storage device – sequential vs. random) into higher performance. How? When data reduction technology sends data to the flash memory, it uses some form of data de-duplication, compression, or data differencing to rearrange the information and use fewer bytes overall. After the data is read from flash memory, data reduction technology, by design, restores 100% of the original content to the host computer This is known as “loss-less” data reduction and can be contrasted with “lossy” techniques like MPEG, MP3, JPEG, and other common formats used for video, audio, and visual data files. These formats lose information that cannot be restored, though the resolution remains adequate for entertainment purposes.
The multi-faceted power of data reduction technology
My previous blog on data reduction discusses how data reduction technology relates to the SATA TRIM command and increases free space on the SSD, which in turn reduces WA and improves subsequent write performance. With a data-reduction SSD, the lower the entropy of the data coming from the host computer, the less the SSD has to write to the flash memory, leaving more space for over provisioning. This additional space enables write operations to complete faster, which translates not only into a higher write speed at the host computer but also into lower power use because flash memory draws power only while reading or writing. Higher write speeds also mean lower power draw for the flash memory.
Because data reduction technology can send less data to the flash than the host originally sent to the SSD, the typical write amplification factor falls below 1.0. It is not uncommon to see a WA of 0.5 on an SSD with this technology. Writing less data to the flash leads directly to:
Each of these in turn produces other benefits, some of which circle back upon themselves in a recursive manner. This logic diagram highlights those benefits.
So this is a rare instance when an amplifier – namely, Write Amplification – makes something smaller. At LSI, this unique amplifier comes in the form of the LSI® DuraWrite™ data reduction technology in all SandForce® Driven™ SSDs.
This three-part series examines some of the details behind write amplification, a key property of all NAND flash-based SSDs that is often misunderstood in the industry.
In today’s solid state drives (SSDs), the NAND flash memory must be erased before it can store new data. In other words, data cannot be overwritten directly as it is in a hard disk drive (HDD). Instead, SSDs use a process called garbage collection (GC) to reclaim the space taken by previously stored data. This means that write demands are heavier on SSDs than HDDs when storing the same information.
This is bad because the flash memory in the SSD supports only a limited number of writes before it can no longer be read. We call this undesirable effect write amplification (WA). In my blog, Gassing up your SSD, I describe why WA exists in a little more detail, but here I will explain what controls it.
It’s all about the free space
I often tell people that SSDs work better with more free space, so anything that increases free space will keep WA lower. The two key ways to expand free space (thereby decreasing WA) are to 1) increase over provisioning and 2) keep more storage space free (if you have TRIM support).
As I said earlier, there is no WA before GC is active. However, this pristine pre-GC condition has a tiny life span – just one full-capacity write cycle during a “fresh-out-of-box” (FOB) state, which accounts for less than 0.04% of the SSD’s life. Although you can manually recreate this condition with a secure erase, the cost is an additional write cycle, which defeats the purpose. Also keep in mind that the GC efficiency and associated wear leveling algorithms can affect WA (more efficient = lower WA).
The other major contributor to WA is the organization of the free space (how data is written to the flash). When data is written randomly, the eventual replacement data will also likely come in randomly, so some pages of a block will be replaced (made invalid) and others will still be good (valid). During GC, valid data in blocks like this needs to be rewritten to new blocks. This produces another write to the flash for each valid page, causing write amplification.
With sequential writes, generally all the data in the pages of the block becomes invalid at the same time. As a result, no data needs relocating during GC since there is no valid data remaining in the block before it is erased. In this case, there is no amplification, but other things like wear leveling on blocks that don’t change will still eventually produce some write amplification no matter how data is written.
Calculating write amplification
Write Amplification is fundamentally the result of data written to the flash memory divided by data written by the host. In 2008, both Intel and SiliconSystems (acquired by Western Digital) were the first to start talking publically about WA. At that time, the WA of all SSDs was something greater than 1.0. It was not until SandForce introduced the first SSD controller with DuraWrite™ technology in 2009 that WA could fall below 1.0. DuraWrite technology increases the free space mentioned above, but in a way that is unique from other SSD controllers. In part two of my write amplification series, I will explain how DuraWrite technology works.
This three-part series examines some of the details behind write amplification, a key property of all NAND flash-based SSDs that is often misunderstood in the industry.
It is always good to hear the opinions of your customers and end users, and in that respect June was a banner month for LSI® SandForce® flash controllers.
In a survey soliciting responses from more than 1 million members of on-line groups and other sources by IT Brand Pulse, an independent product testing and validation lab, LSI SandForce controllers ranked at the top of all six SSD controller chip sub-categories: market, price, performance, reliability, service and support, and innovation. Last August, the LSI SandForce controllers won in three of the six sub-categories, so we’re thrilled to see momentum building.
Winning all six awards is no easy task. Some of the sub-categories could be considered mutually exclusive, requiring customers to make trade-offs among product attributes. For example, often a product with the best price is considered to have skimped on quality compared to pricier solutions. A product with screaming performance, ironically, is seen as something of a market laggard because it usually does not carry the best price. So it is exciting to strike the right balance among all six measures and sweep the product category. You can find more details on the awards here: http://itbrandpulse.com/research/brand-leader-program/225-ssd-controller-chips-2013
For those of us in product marketing, winning product awards voted on by your peers can bring on a feeling similar to that warm afterglow parents bask in when they hear their child has made the honor roll or was named the valedictorian for his or her graduating class.
So please pardon us, for a moment, as we beam with pride.
It seems like our smartphones are getting bigger and bigger with each generation. Sometimes I see people holding what appears to be a tablet computer up next to their head. I doubt they know how ridiculous they look to the rest of us, and I wonder what pants today have pockets that big.
I certainly do like the convenience of the instant-on capabilities my smart phone gives me, but I still need my portable computer with its big screen and keyboard separate from my phone.
A few years ago, SATA-IO, the standards body, added a new feature to the Serial ATA (SATA) specification designed to further reduce battery consumption in portable computer products. This new feature, DevSleep, enables solid state drives (SSDs) to act more like smartphones, allowing you to go days without plugging in to recharge and then instantly turn them on and see all the latest email, social media updates, news and events.
Why not just switch the system off?
When most PC users think about switching off their system, they dread waiting for the operating system to boot back up. That is one of the key advantages of replacing the hard disk drive (HDD) in the system with a faster SSD. However, in our instant gratification society, we hate to wait even seconds for web pages to come up, so waiting minutes for your PC to turn on and boot up can feel like an eternity. Therefore, many people choose to leave the system on to save those precious moments… but at the expense of battery life.
Can I get this today?
To further extend battery life, the new DevSleep feature requires a signal change on the SATA connector. This change is currently supported only in new Intel® Haswell chipset-based platforms announced this June. What’s more, the SSD in these systems must support the DevSleep feature and monitor the signal on the SATA connector. Most systems that support DevSleep will likely be very low-power notebook systems and will likely already ship with an SSD installed using a small mSATA, M.2, or similar edge connector. Therefore, the signal change on the SATA interface will not immediately affect the rest of the SSDs designed for desktop systems shipping through retail and online sources. Note that not all SSDs are created equal and, while many claim support for DevSleep, be sure to look at the fine print to compare the actual power draw when in DevSleep.
At Computex last month, LSI announced support for the DevSleep feature and staged demonstrations showing a 400x reduction in idle power. It should be noted that a 400x reduction in power does not directly translate to a 400x increase in battery life, but any reduction in power will give you more time on the battery, and that will certainly benefit any user who often works without a power cord.
Not likely. But you might think that solving your computer data security problems is very well possible when someone tells you that TCG Opal is the key. According to its website, “The Trusted Computing Group (TCG) is a not-for-profit organization formed to develop, define and promote open, vendor-neutral, global industry standards, supportive of a hardware-based root of trust, for interoperable trusted computing platforms.”
That might take a bit to digest, but think about TCG as a group of companies creating standards to simplify deployment and increase adoption of data security. The consortium has two better known specifications called TCG Enterprise and TCG Opal.
Sorting through the alphabet soup of data security
“Our SED with TCG Opal provides FDE.” While this might look like a spoonful of alphabet soup, it is music to the ears of a corporate IT manager. Let me break it down for those who just hear fingernails on the chalkboard. A self-encrypting drive (SED) is one that embeds a hardware-based encryption engine in the storage device. One chief benefit is that the hardware engine performs the encryption, preserving full performance of the host CPU. An SED can be a hard disk drive (HDD) or a solid state drive (SSD). True, traditional software encryption can secure data going to the storage device, but it consumes precious host CPU bandwidth. The related term, full drive encryption (FDE), is used to describe any drive (HDD or SSD) that stores data in an encrypted form. This can be through either software-based (host CPU) or hardware-based (an SED) encryption.
Most people would assume that if their work laptop were lost or stolen, they would suffer only some lost productivity for a short time and about $1,500 in hardware costs. However, a study by Intel and the Ponemon Institute found that the cost of a lost laptop totaled nearly $50,000 when you account for lost IP, legal costs, customer notifications, lost business, harm to reputation, and damages associated with compromising confidential customer information. When the data stored on the laptop is encrypted, this cost is reduced by nearly $20,000. This difference certainly supports the need for better security for these mobile platforms.
When considering a security solution for this valuable data, you have to decide between a hardware-based SED and a host-based software solution. The primary problem with software solutions is they require the host CPU to do all of the encryption. This detracts from the CPU’s core computing work, leaving users with a slower computer or forcing them to pay for greater CPU performance. Another drawback of many software encryption solutions is that they can be turned off by the computer user, leaving data in the clear and vulnerable. Since hardware-based encryption is native to the HDD or SSD, it cannot be disabled by the end user.
In April 2013, LSI and a few other storage companies worked with the Ponemon Institute to better understand the value of hardware-based encryption. You can read about the details in the study here, but the quick summary is that hardware-based encryption solutions can offer a 75% total cost savings over software-based solutions, on average.
When is this available?
At the Computex Taipei 2013 show earlier this month, LSI announced availability of a firmware update for SandForce® controllers that adds support for TCG Opal. The LSI suite at the show featured TCG Opal demonstrations using self-encrypting SSDs provided by SandForce Driven™ member companies, including Kingston, A-DATA, Avant and Edge. (Contact SSD manufacturers directly for product availability.)
I often think about green, environmental impact, and what we’re doing to the environment. One major reason I became an engineer was to leave the world a little better than when I arrived. I’ve gotten sidetracked a few times, but I’ve tried to help, even if just a little.
The good people in LSI’s EHS (Environment, Health & Safety) asked me a question the other day about carbon footprint, energy impact, and materials use. Which got me thinking … OK – I know most people in LSI don’t really think of ourselves as a “green tech” company. But we are – really. No foolin’. We are having a big impact on the global power consumption and material consumption of the IT industry. And I mean that in a good way.
There are many ways to look at this, both from what we enable datacenters to do, to what we enable integrators to do, all the way to hard-core technology improvements and massive changes in what it’s possible to do.
Back in 2008 I got to speak at the AlwaysOn GoingGreen conference. (I was lucky enough to be just after Elon Musk– he’s a lot more famous now with Tesla doing so well.
http://www.smartplanet.com/video/making-the-case-for-green-it/305467 (at 2:09 in video)
IT consumes massive amounts of energy
The massive deployment of IT equipment, all the ancillary metal, plastic wiring, etc. that goes with them, consumes energy as its being shipped and moved halfway around the world, and, more importantly, then gets scrapped out quickly. This has been a concern for me for quite a while. I mean – think about that. As an industry we are generating about 9 million servers a year, about 3 million go into hyperscale datacenters (or hyperscale if you prefer). Many of those are scrapped on a 2, 3 or 4 year cycle – so in steady state, maybe 1 million to 2 million a year are scrapped. Worse – there is amazing use of energy by that many servers (even as they have advanced the state of the art unbelievably since 2008). And frankly, you and I are responsible for using all that power. Did you know thousands of servers are activated every time you make a Google® query from your phone?
I want to take a look at basic silicon improvements we make, the impact of disk architecture improvement, SSDs, system and improvements, efficiency improvements, and also where we’re going in the near future with eliminating scrap in hard drives and batteries. In reality, it’s the massive pressure on work/$ that has made us optimize everything – being able to do much more work at a lower cost, when a lot of cost is the energy and material that goes into the products that forces our hand. But the result is a real, profound impact on our carbon footprint that we should be proud of.
Sure we have a general silicon roadmap where each node enables reduced power, even as some standards and improvements actually increase individual device power. For example, our transition from 28nm semi process to 14nm FinFET can literally cut the power consumption of a chip in half. But that’s small potatoes.
How about Ethernet? It’s everywhere – right? Did you know servers often have 4 ethernet ports, and that there are a matching 4 ports on a network switch? LSI pioneered something called Energy Efficient Ethernet (EEE). We’re also one of the biggest manufacturers of Ethernet PHYs – the part that drives the cable – and we come standard in everything from personal computers to servers to enterprise switches. The savings are hard to estimate, because they depend very much on how much traffic there is, but you can realistically save Watts per interface link, and there are often 256 links in a rack. 500 Watts per rack is no joke, and in some datacenters it adds up to 1 or 2 MegaWatts.
How about something a little bigger and more specific? Hard disk drives. Did you know a typical hyperscale datacenter has between 1 million and 1.5 million disk drives? Each one of those consumes about 9 Watts, and most have 2 TBytes of capacity. So for easy math, 1 million drives is about 9 MegaWatts (!?) and about 2 Exabytes of capacity (remember – data is often replicated 3 or more times). Data capacities in these facilities are needed to grow about 50% per year. So if we did nothing, we would need to go from 1 million drives to 1.5 million drives: 9 MegaWatts goes to 13.5 MegaWatts. Wow! Instead – our high linearity, low noise PA and read channel designs are allowing drives to go to 4 TBytes per drives. (Sure the chip itself may use slightly more power, but that’s not the point, what it enables is a profound difference.) So to get that 50% increase in capacity we could actually reduce the number of drives deployed, with a net savings of 6.75 MegaWatts. Consider an average US home, with air conditioning, uses 1 kiloWatt. That’s almost 7,000 homes. In reality – they won’t get deployed that way – but it will still be a huge savings. Instead of buying another 0.5 million drives they would buy 0.25 million drives with a net savings of 2.2 MegaWatts. That’s still HUGE! (way to go, guys!) How many datacenters are doing that? Dozens. So that’s easily 20 or 30 MegaWatts globally. Did I say we saved them money too? A lot of money.
SSDs sip power to help improve energy profile
SSDs don’t always get the credit they deserve. Yes, they really are fast, and they are awesome in your laptop, but they also end up being much lower power than hard drives. Our controllers were in about half the flash solutions shipped last year. Think tens of millions. If you just assume they were all laptop SSDs (at least half were not) then that’s another 20 MegaWatts in savings.
Did you know that in a traditional datacenter, about 30% of the power going into the building is used for air conditioning? It doesn’t actually get used on the IT equipment at all, but is used to remove the heat that the IT equipment generates. We design our solutions so they can accommodate 40C ambient inlet air (that’s a little over 100F… hot). What that means is that the 30% of power used for the air conditioners disappears. Gone. That’s not theoretical either. Most of the large social media, search engine, web shopping, and web portal companies are using our solutions this way. That’s a 30% reduction in the power of storage solutions globally. Again, its MegaWatts in savings. And mega money savings too.
But let’s really get to the big hitters: improved work per server. Yep – we do that. In fact adding a Nytro™ MegaRAID® solution will almost always give you 4x the work out of a server. It’s a slam dunk if you’re running a database. You heard me – 1 server doing the work that it previously took 4 servers to do. Not only is that a huge savings in dollars (especially if you pay for software licenses!) but it’s a massive savings in power. You can replace 4 servers with 1, saving at least 900 Watts, and that lone server that’s left is actually dissipating less power too, because it’s actively using fewer HDDs, and using flash for most traffic instead. If you go a step further and use Nytro WarpDrive Flash cards in the servers, you can get much more – 6 to 8 times the work. (Yes, sometimes up to 10x, but let’s not get too excited). If you think that’s just theoretical again, check your Facebook® account, or download something from iTunes®. Those two services are the biggest users of PCIe® flash in the world. Why? It works cost effectively. And in case you haven’t noticed those two companies like to make money, not spend it. So again, we’re talking about MegaWatts of savings. Arguably on the order of 150 MegaWatts. Yea – that’s pretty theoretical, because they couldn’t really do the same work otherwise, but still, if you had to do the work in a traditional way, it would be around that.
It’s hard to be more precise than giving round numbers at these massive scales, but the numbers are definitely in the right zone. I can say with a straight face we save the world 10’s, and maybe even 100’s of MegaWatts per year. But no one sees that, and not many people even think about it. Still – I’d say LSI is a green hero.
Hey – we’re not done by a long shot. Let’s just look at scrap. If you read my earlier post on false disk failure, you’ll see some scary numbers. (http://blog.lsi.com/what-is-false-disk-failure-and-why-is-it-a-problem/ ) A normal hyperscale datacenter can expect 40-60 disks per day to be mistakenly scrapped out. That’s around 20,000 disk drives a year that should not have been scrapped, from just one web company. Think of the material waste, shipping waste, manufacturing waste, and eWaste issues. Wow – all for nothing. We’re working on solutions to that. And batteries. Ugly, eWaste, recycle only, heavy metal batteries. They are necessary for RAID protected storage systems. And much of the world’s data is protected that way – the battery is needed to save meta-data and transient writes in the event of a power failure, or server failure. We ship millions a year. (Sorry, mother earth). But we’re working diligently to make that a thing of the past. And that will also result in big savings for datacenters in both materials and recycling costs.
Can we do more? Sure. I know I am trying to get us the core technologies that will help reduce power consumption, raise capability and performance, and reduce waste. But we’ll never be done with that march of technology. (Which is a good thing if engineering is your career…)
I still often think about green, environmental impact, and what we’re doing to the environment. And I guess in my own small way, I am leaving the world a little better than when I arrived. And I think we at LSI should at least take a moment and pat ourselves on the back for that. You have to celebrate the small victories, you know? Even as the fight goes on.
I want to warn you, there is some thick background information here first. But don’t worry. I’ll get to the meat of the topic and that’s this: Ultimately, I think that PCIe® cards will evolve to more external, rack-level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but other leaders in flash are going down this path too…
I’ve been working on enterprise flash storage since 2007 – mulling over how to make it work. Endurance, capacity, cost, performance have all been concerns that have been grappled with. Of course the flash is changing too as the nodes change: 60nm, 50nm, 35nm, 24nm, 20nm… and single level cell (SLC) to multi level cell (MLC) to triple level cell (TLC) and all the variants of these “trimmed” for specific use cases. The spec “endurance” has gone from 1 million program/erase cycles (PE) to 3,000, and in some cases 500.
It’s worth pointing out that almost all the “magic” that has been developed around flash was already scoped out in 2007. It just takes a while for a whole new industry to mature. Individual die capacity increased, meaning fewer die are needed for a solution – and that means less parallel bandwidth for data transfer… And the “requirement” for state-of-the-art single operation write latency has fallen well below the write latency of the flash itself. (What the ?? Yea – talk about that later in some other blog. But flash is ~1500uS write latency, where state of the art flash cards are ~50uS.) When I describe the state of technology it sounds pretty pessimistic. I’m not. We’ve overcome a lot.
We built our first PCIe card solution at LSI in 2009. It wasn’t perfect, but it was better than anything else out there in many ways. We’ve learned a lot in the years since – both from making them, and from dealing with customer and users – about our own solutions and our competitors. We’re lucky to be an important player in storage, so in general the big OEMs, large enterprises and the hyperscale datacenters all want to talk with us – not just about what we have or can sell, but what we could have and what we could do. They’re generous enough to share what works and what doesn’t. What the values of solutions are and what the pitfalls are too. Honestly? It’s the hyperscale datacenters in the lead both practically and in vision.
If you haven’t nodded off to sleep yet, that’s a long-winded way of saying – things have changed fast, and, boy, we’ve learned a lot in just a few years.
Most important thing we’ve learned…
Most importantly, we’ve learned it’s latency that matters. No one is pushing the IOPs limits of flash, and no one is pushing the bandwidth limits of flash. But they sure are pushing the latency limits.
PCIe cards are great, but…
We’ve gotten lots of feedback, and one of the biggest things we’ve learned is – PCIe flash cards are awesome. They radically change performance profiles of most applications, especially databases allowing servers to run efficiently and actual work done by that server to multiply 4x to 10x (and in a few extreme cases 100x). So the feedback we get from large users is “PCIe cards are fantastic. We’re so thankful they came along. But…” There’s always a “but,” right??
It tends to be a pretty long list of frustrations, and they differ depending on the type of datacenter using them. We’re not the only ones hearing it. To be clear, none of these are stopping people from deploying PCIe flash… the attraction is just too compelling. But the problems are real, and they have real implications, and the market is asking for real solutions.
Of course, everyone wants these fixed without affecting single operation latency, or increasing cost, etc. That’s what we’re here for though – right? Solve the impossible?
A quick summary is in order. It’s not looking good. For a given solution, flash is getting less reliable, there is less bandwidth available at capacity because there are fewer die, we’re driving latency way below the actual write latency of flash, and we’re not satisfied with the best solutions we have for all the reasons above.
If you think these through enough, you start to consider one basic path. It also turns out we’re not the only ones realizing this. Where will PCIe flash solutions evolve over the next 2, 3, 4 years? The basic goals are:
One easy answer would be – that’s a flash SAN or NAS. But that’s not the answer. Not many customers want a flash SAN or NAS – not for their new infrastructure, but more importantly, all the data is at the wrong end of the straw. The poor server is left sucking hard. Remember – this is flash, and people use flash for latency. Today these SAN type of flash devices have 4x-10x worse latency than PCIe cards. Ouch. You have to suck the data through a relatively low bandwidth interconnect, after passing through both the storage and network stacks. And there is interaction between the I/O threads of various servers and applications – you have to wait in line for that resource. It’s true there is a lot of startup energy in this space. It seems to make sense if you’re a startup, because SAN/NAS is what people use today, and there’s lots of money spent in that market today. However, it’s not what the market is asking for.
Another easy answer is NVMe SSDs. Right? Everyone wants them – right? Well, OEMs at least. Front bay PCIe SSDs (HDD form factor or NVMe – lots of names) that crowd out your disk drive bays. But they don’t fix the problems. The extra mechanicals and form factor are more expensive, and just make replacing the cards every 5 years a few minutes faster. Wow. With NVME SSDs, you can fit fewer HDDs – not good. They also provide uniformly bad cooling, and hard limit power to 9W or 25W per device. But to protect the storage in these devices, you need to have enough of them that you can RAID or otherwise protect. Once you have enough of those for protection, they give you awesome capacity, IOPs and bandwidth, too much in fact, but that’s not what applications need – they need low latency for the working set of data.
What do I think the PCIe replacement solutions in the near future will look like? You need to pool the flash across servers (to optimize bandwidth and resource usage, and allocate appropriate capacity). You need to protect against failures/errors and limit the span of failure, commit writes at very low latency (lower than native flash) and maintain low latency, bottleneck-free physical links to each server… To me that implies:
That means the performance looks exactly as if each server had multiple PCIe cards. But the capacity and bandwidth resources are shared, and systems can remain resilient. So ultimately, I think that PCIe cards will evolve to more external, rack level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but as I say – other leaders in flash are going down this path too…
What’s your opinion?
Tags: DAS, datacenter, direct attached storage, enterprise IT, flash, hard disk drive, HDD, hyperscale, latency, NAS, network attached storage, NVMe, PCIe, SAN, solid state drive, SSD, storage area network