I’ve spent a lot of time with hyperscale datacenters around the world trying to understand their problems – and I really don’t care what area those problems are in as long as they’re important to the datacenter. What is the #1 Real Problem for many hyperscale datacenters? It’s something you’ve probably never heard about, and probably have not even thought about. It’s called false disk failure. Some hyperscale datacenters have crafted their own solutions – but most have not.
Why is this important, you ask? Many large datacenters today have 1 million to 4 million hard disk drives (HDDs) in active operation. In anyone’s book that’s a lot. It’s also a very interesting statistical sample size of HDDs. Hyperscale datacenters get great pricing on HDDs. Probably better than OEMs get, and certainly better than the $79 for buying 1 HDD at your local Fry’s store. So you would imagine if a disk fails – no one cares – they’re cheap and easy to replace. But the burden of a failed disk is much more than the raw cost of the disk:
Let’s put some scale to this problem, and you’ll begin to understand the issue. One modest size hyperscale datacenter has been very generous in sharing its real numbers. (When I say modest, they are ~1/4 to 1/2 the size of many other hyperscale datacenters, but they are still huge – more than 200k servers). Other hyperscale datacenters I have checked with say – yep, that’s about right. And one engineer I know at an HDD manufacturer said – “wow – I expected worse than that. That’s pretty good.” To be clear – these are very good HDDs they are using, it’s just that the numbers add up.
The raw data:
RAIDed SAS HDDs
Non-RAIDed (direct map) SATA drives behind HBAs
What’s interesting is the relative failure rate of SAS drives vs. SATA. It’s about an order of magnitude worse in SATA drives than SAS. Frankly some of this is due to protocol differences. SAS allows far more error recovery capabilities, and because they also tend to be more expensive, I believe manufacturers invest in slightly higher quality electronics and components. I know the electronics we ship into SAS drives is certainly more sophisticated than SATA drives.
False fail? What? Yea, that’s an interesting topic. It turns out that about 40% of the time with SAS and about 50% of the time with SATA, the drive didn’t actually fail. It just lost its marbles for a while. When they pull the drive out and put it into a test jig, everything is just fine. And more interesting, when they put the drive back into service, it is no more statistically likely to fail again than any other drive in the datacenter. Why? No one knows. I suspect though.
I used to work on engine controllers. That’s a very paranoid business. If something goes wrong and someone crashes, you have a lawsuit on your hands. If a controller needs a recall, that’s millions of units to replace, with a multi-hundred dollar module, and hundreds of dollars in labor for each one replaced. No one is willing to take that risk. So we designed very carefully to handle soft errors in memory and registers. We incorporated ECC like servers use, background code checksums and scrubbing, and all sorts of proprietary techniques, including watchdogs and super-fast self-resets that could get operational again in less than a full revolution of the engine. Why? – the events were statistically rare. The average controller might see 1 or 2 events in its lifetime, and a turn of the ignition would reset that state. But the events do happen, and so do recalls and lawsuits… HDD controllers don’t have these protections, which is reasonable. It would be an inappropriate cost burden for their price point.
You remember the Toyota Prius accelerator problems? I know that controller was not protected for soft errors. And the source of the problem remained a “mystery.” Maybe it just lost its marbles for a while? A false fail if you will. Just sayin’.
Back to HDDs. False fail is especially frustrating, because half the HDDs actually didn’t need to be replaced. All the operational costs were paid for no reason. The disk just needed a power cycle reset. (OK, that introduces all sorts of complex management by the RAID controller or application to manage that 10 second power reset cycle and application traffic created in that time – but we can handle that.)
Daily, this datacenter has to:
And 1/2 of that is for no reason at all.
First – why not rebuild the disk if it’s RAIDed? Usually hyperscale datacenters use clustered applications. A traditional RAID rebuild drives the server performance to ~50%, and for a 2TByte drive, under heavy application load (definition of a hyperscale datacenter) can truly take up to a week. 50% performance for a week? In a cluster that means the overall cluster is running ~50% performance. Say 200 nodes in a cluster – that means you just lost ~100 nodes of work – or 50% of cluster performance. It’s much simpler to just take the node with the failed drive offline, and get 99.5% cluster performance, and operationally redistribute the workload across multiple nodes (because you have replicated data elsewhere). But after rebuild, the node will have to be re-synced or re-imaged. There are ways to fix all this. We’ll talk about them on another day. Or you can simply run direct mapped storage, and unmount the failed drive.
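The cluster arithmetic above can be sketched in a few lines. This is only an illustration under two assumptions taken from the text: a tightly coupled cluster runs at the pace of its slowest node, and taking a node offline simply removes its share of the work. The function names are made up for this example.

```python
def cluster_perf_during_rebuild(node_perf_during_rebuild: float) -> float:
    """Assumption: the whole cluster is gated by its slowest node."""
    return node_perf_during_rebuild


def cluster_perf_node_offline(nodes: int, offline: int = 1) -> float:
    """Assumption: work redistributes evenly across the remaining nodes."""
    return (nodes - offline) / nodes


nodes = 200                                   # cluster size from the example above
rebuild = cluster_perf_during_rebuild(0.5)    # RAID rebuild drags one node to ~50%
offline = cluster_perf_node_offline(nodes)    # pull the node instead

print(f"rebuild in place:   {rebuild:.1%} of cluster performance")
print(f"node taken offline: {offline:.1%} of cluster performance")
```

Under those assumptions, one slow node costs the equivalent of ~100 nodes of work, while one missing node costs exactly one.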
Next – Why replicate data over the network, and why is that a big deal? For geographic redundancy (say a natural disaster at one facility) and regional locality, hyperscale datacenters need multiple data copies. Often 3 copies so they can do double duty as high-availability copies, or in the case of some erasure coding, 2.2 to 2.5 copies (yea – weird math – how do you have 0.5 copy…). When you lose one copy, you are down to 2, possibly 1. You need to get back to a reliable number again. Fast. Customers are loyal because of your perfect data retention. So you need to replicate that data and re-distribute it across the datacenter on multiple servers. That’s network traffic, and possibly congestion, which affects other aspects of the operations of the datacenter. In this datacenter it’s about 50 hours of 10G Ethernet traffic every day.
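To get a feel for what “50 hours of 10G Ethernet traffic every day” means in bytes, here is a back-of-the-envelope sketch. It assumes the links run at full line rate for those hours (so it is an upper bound, since the text does not say how saturated they are) and it borrows the 2 TByte drive size from the rebuild example purely for scale.

```python
LINK_BITS_PER_SEC = 10e9   # 10G Ethernet line rate
LINK_HOURS_PER_DAY = 50    # figure quoted in the text (spread across many links)
DRIVE_TB = 2               # assumption: 2 TByte drives, as in the rebuild example

bits_per_day = LINK_BITS_PER_SEC * LINK_HOURS_PER_DAY * 3600
terabytes_per_day = bits_per_day / 8 / 1e12        # decimal terabytes
drive_equivalents = terabytes_per_day / DRIVE_TB   # full drives' worth of data

print(f"up to {terabytes_per_day:.0f} TB of replication traffic per day")
print(f"roughly {drive_equivalents:.1f} full {DRIVE_TB} TB drives' worth")
```

Real traffic will be lower than this bound, because links rarely sustain line rate and failed drives are rarely full.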
To be fair, there is a new standard in SAS interfaces that will facilitate resetting a disk in-situ. And there is the start of discussion of the same around SATA – but that’s more problematic. Whatever the case, it will be years before the ecosystem is in place to handle the problems this way.
What’s that mean to you?
Well. You can expect something like 1/100 of your drives to really fail this year. And you can expect another 1/100 of your drives to fail this year, but not actually be failed. You’ll still pay all the operational overhead of not actually having a failed drive – rebuilds, disk replacements, management interventions, scheduled downtime/maintenance time, and the OEM replacement price for that drive – what, $600 or so?… Depending on your size, that’s either a don’t care, or a big deal. There are ways to handle this, and they’re not expensive – much less than the disk carrier you already pay for to allow you to replace that drive – and it can be handled transparently – just a log entry without seeing any performance hiccups. You just need to convince your OEM to carry the solution.
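If you want to plug in your own numbers, a minimal sketch of that estimate looks like this. The 1/100 rates and the roughly $600 replacement price come from the text; the fleet size is hypothetical, and the function name is just for illustration. It counts only the replacement price, not the rebuild, labor, or downtime costs.

```python
def expected_pulls(drives: int, real_rate: float = 0.01, false_rate: float = 0.01):
    """Expected yearly counts of real failures and false fails."""
    return drives * real_rate, drives * false_rate


drives = 10_000          # hypothetical fleet size, pick your own
replacement_cost = 600   # "$600 or so" OEM replacement price from the text

real, false = expected_pulls(drives)
wasted = false * replacement_cost   # spend on drives that had not really failed

print(f"real failures per year: {real:.0f}")
print(f"false fails per year:   {false:.0f}")
print(f"replacement spend on false fails: ${wasted:,.0f}")
```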
Congratulations to Seagate on achieving this monumental milestone of shipping 2 billion hard disk drives (HDDs)! To put this in perspective, if you measured the Earth’s circumference in inches, it is only about 1.57 billion.
Those HDDs have used a lot of parts and intellectual property (IP), and LSI has been a very happy, long-time partner to Seagate providing read channel IP for nearly all of the 2 billion HDDs it has shipped. The read channel is a critical piece of technology inside the HDD that translates the magnetically encoded information on the rotating media to electronic signals that can be understood by the host computer. Advances in the LSI read channel IP over the years have also contributed to the continuing HDD capacity increases we depend upon today.
Seagate reported it took 29 years to ship its first billion drives and only 4 years for the second. This jaw-dropping growth is a stark reminder that the global data deluge continues to swell. Now more than ever, IT architects and managers need smarter ways – down to the silicon – to produce new storage and networking efficiencies and reduce costs.
Anyone who knows me knows I like to ask “why?” Maybe I never outgrew the 2-year-old phase. But I also like to ask “why not?” Every now and then you need to rethink everything you know top to bottom because something might have changed.
I’ve been talking to a lot of enterprise datacenter architects and managers lately. They’re interested in using flash in their servers and storage, but they can’t get over all the “problems.”
The conversation goes something like this: Flash is interesting, but it’s crazy expensive $/bit. The prices have to come way down – after all it’s just a commodity part. And I have these $4k servers – why would I put an $8k PCIe card in them – that makes no sense. And the stuff wears out, which is an operational risk for me – disks last forever. Maybe flash isn’t ready for prime time yet.
These arguments are reasonable if you think about flash as a disk replacement, and don’t think through all the follow-on implications.
In contrast I’ve also been spending a lot of time with the biggest datacenters in the world – you know – the ones we all know by brand name. They have at least 200k servers, and anywhere from 1.5 million to 7 million disks. They notice CapEx and OpEx a lot. You multiply anything by that much and it’s noticeable. (My simple example is add 1 LED to each server with 200k servers and the cost adds up to 26K watts + $10K LED cost.) They are very scientific about cost. More specifically they measure work/$ very carefully. Anything to increase work or reduce $ is very interesting – doing both at once is the holy grail. Already one of those datacenters is completely diskless. Others are part way there, or have the ambition of being there. You might think they’re crazy – how can they spend so much on flash when disks are so much cheaper, and these guys offer their services for free?
When the large datacenters – I call them hyperscale datacenters – measure cost, they’re looking at purchase cost, including metal racks and enclosures, shipping, service cost both parts and human expense, as well as operational disruption overhead and the complexity of managing that, the opportunity cost of new systems vs. old systems that are less efficient, and of course facilities expenses – buildings, power, cooling, people… They try to optimize the mix of these.
Let’s look at the arguments against using flash one by one.
Flash is just a commodity part
This is a very big fallacy. It’s not a commodity part, and flash is not all the same. The parts you see in cheap consumer devices deserve their price. In the chip industry, it’s common to have manufacturing fallout; 3% – 10% is reasonable. What’s more, the devices come at different performance levels – just look at x86 performance versions of the same design. In the flash business 100% of the devices are sold, used, and find their way into products. Those cheap consumer products are usually the 3%-10% that would be scrap in other industries. (I was once told – with a smile – “those are the parts we sweep off the floor”…)
Each generation of flash (about 18 months between them) and each manufacturer (there are 5, depending how you count) have very different characteristics. There are wild differences in erase time, write time, read time, bandwidth, capacity, endurance, and cost. There is no one supplier that is best at all of these, and leadership moves around. More importantly, in a flash system, how you trade these things off has a huge effect on write latency (#1 impactor on work done), latency outliers (consistent operation), endurance or life span, power consumption, and solution cost. All flash products are not equal – not by a long shot. Even hyperscale datacenters have different types of solutions for different needs.
It’s also important to know that temperature of operation and storage, inter-arrival time of writes, and “over provisioning” (the amount hidden for background use and garbage collection) have profound impacts on lifespan and performance.
$8k PCIe card in a $4k server – really?
I am always stunned by this. No one thinks twice about spending more on virtualization licenses than on hardware, or say $50k for a database license to run on a $4k server. It’s all about what work you need to accomplish, and what’s the best way to accomplish it. It’s no joke that in database applications it’s pretty easy to get 4x the work from a server with a flash solution inserted. You probably won’t get worse than 4x, and you may get as good as 10x. On a purely hardware basis, that makes sense – I can have 1 server @ $4k + $8K flash vs. 4 servers @ $4k. I just saved $4k CapEx. More importantly, I saved the service contract, power, cooling and admin of 3 servers. If I include virtualization or database licenses, I saved another $150k + annual service contracts on those licenses. That’s easy math. If I worry about users supported rather than work done, I can support as many as 100x users. The math becomes overwhelming. $8K PCIe card in a $4k server? You bet when I think of work/$.
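Here is that easy math written out as a small sketch, using only the figures quoted above. It takes the conservative 4x speedup and assumes one $50k database license per server, which is what the $150k savings figure implies. Power, cooling, admin, and service contracts are left out, so the real gap is wider.

```python
SERVER = 4_000        # cost of one server
FLASH_CARD = 8_000    # cost of the PCIe flash card
DB_LICENSE = 50_000   # assumption: one database license per server
SPEEDUP = 4           # conservative end of the 4x to 10x range

# Option A: one server with flash. Option B: SPEEDUP plain servers for the same work.
flash_capex = SERVER + FLASH_CARD
disk_capex = SPEEDUP * SERVER

capex_saved = disk_capex - flash_capex
license_saved = (SPEEDUP - 1) * DB_LICENSE

# Work per dollar, counting one plain server as one unit of work.
work_per_dollar_flash = SPEEDUP / (flash_capex + DB_LICENSE)
work_per_dollar_disk = SPEEDUP / (disk_capex + SPEEDUP * DB_LICENSE)

print(f"CapEx saved:    ${capex_saved:,}")
print(f"Licenses saved: ${license_saved:,}")
print(f"work/$ advantage of flash: {work_per_dollar_flash / work_per_dollar_disk:.1f}x")
```

Change SPEEDUP to 10 to see the optimistic end of the range.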
The stuff wears out & disks last forever
It’s true that car tires wear out, and depending on how hard you use them that might be faster or slower. But tires are one of the most important parts in a car’s performance – acceleration, stopping, handling – you couldn’t do any of that without them. The only time you really have catastrophic failure with tires is when you wear them way past any reasonable point – until they are bald and should have been replaced. Flash is like that – you get lots of warning as it’s wearing out, and you get lots of opportunity to operationally plan and replace the flash without disruption. You might need to replace it after 4 or 5 years, but you can plan and do it gracefully. Disks can last “forever,” but they also fail randomly and often.
Reliability statistics across millions of hard drives show somewhere around 2.5% fail annually. And that’s for 1st quality drives. Those are unpredicted, catastrophic failures, and depending on your storage systems that means you need to go into rebuild or replication of TBytes of data, and you have a subsequent degradation in performance (which can completely mess up load balancing of a cluster of 20 to 200 other nodes too), potentially network traffic overhead, and a physical service event that needs to be handled manually and fairly quickly. And really – how often do admins want to take the risk of physically replacing a drive while a system is running? Just one mistake by your tech and it’s all over… Operationally flash is way better, less disruptive, predictable, lower cost, and the follow-on implications are much simpler.
Crazy expensive $/bit
OK – so this argument doesn’t seem so relevant anymore. Even so, in most cases you can’t use much of the disk capacity you have. It will be stranded because you need to have spare space as databases, etc. grow. If you run out of space for db’s the result is catastrophic. If you are driving a system hard, you often don’t have the bandwidth left to actually access that extra capacity. It’s common to only use ½ of the available capacity of drives.
Caching solutions change the equation as well. You can spend money on flash for the performance characteristics, and shift disk drive spend to fewer, higher capacity, slower, more power efficient drives for bulk capacity. Often for the same or similar overall storage spend you can have the same capacity at 4x the system performance. And the space and power consumed and cooling needed for that system is dramatically reduced.
Even so, flash is not going to replace large capacity storage for a long, long time, if ever. Whatever the case, the $/bit is simply not the right metric for evaluating flash. But it’s true, flash is more expensive per bit. It’s simply that in most operational contexts, it more than makes up for that by other savings and work/$ improvements.
So I would argue (and I’m backed up by the biggest hyperscale datacenters in the world) that flash is ready for prime time adoption. Work/$ is the correct metric, but you need to measure from the application down to the storage bits to get that metric. It’s not correct to think about flash as “just a disk replacement” – it changes the entire balance of a solution stack from application performance and responsiveness and cumulative work, to server utilization to power consumption and cooling to maintenance and service to predictable operational stability. It’s not just a small win; it’s a big win. It’s not a fit yet for large pools of archival storage – but even for that a lot of energy is going into trying to make that work. So no – enterprise will not go diskless for quite a while, but it is understandable why hyperscale datacenters want to go diskless. It’s simple math.
Every now and then you need to rethink everything you know top to bottom because something might have changed.