I often think about green, environmental impact, and what we’re doing to the environment. One major reason I became an engineer was to leave the world a little better than when I arrived. I’ve gotten sidetracked a few times, but I’ve tried to help, even if just a little.
The good people in LSI’s EHS (Environment, Health & Safety) asked me a question the other day about carbon footprint, energy impact, and materials use. Which got me thinking … OK – I know most people in LSI don’t really think of ourselves as a “green tech” company. But we are – really. No foolin’. We are having a big impact on the global power consumption and material consumption of the IT industry. And I mean that in a good way.
There are many ways to look at this: from what we enable datacenters to do, to what we enable integrators to do, all the way to hard-core technology improvements and massive changes in what it’s possible to do.
Back in 2008 I got to speak at the AlwaysOn GoingGreen conference. (I was lucky enough to be just after Elon Musk – he’s a lot more famous now with Tesla doing so well.)
http://www.smartplanet.com/video/making-the-case-for-green-it/305467 (at 2:09 in video)
IT consumes massive amounts of energy
The massive deployment of IT equipment, and all the ancillary metal, plastic wiring, etc. that goes with it, consumes energy as it’s being shipped and moved halfway around the world, and, more importantly, then gets scrapped out quickly. This has been a concern for me for quite a while. I mean – think about that. As an industry we are generating about 9 million servers a year, and about 3 million go into hyperscale datacenters. Many of those are scrapped on a 2, 3 or 4 year cycle – so in steady state, maybe 1 million to 2 million a year are scrapped. Worse – there is amazing use of energy by that many servers (even as they have advanced the state of the art unbelievably since 2008). And frankly, you and I are responsible for using all that power. Did you know thousands of servers are activated every time you make a Google® query from your phone?
I want to take a look at the basic silicon improvements we make, the impact of disk architecture improvements, SSDs, system improvements, efficiency improvements, and also where we’re going in the near future with eliminating scrap in hard drives and batteries. In reality, it’s the massive pressure on work/$ that has made us optimize everything – being able to do much more work at a lower cost. A lot of that cost is the energy and material that goes into the products, and that forces our hand. But the result is a real, profound impact on our carbon footprint that we should be proud of.
Sure we have a general silicon roadmap where each node enables reduced power, even as some standards and improvements actually increase individual device power. For example, our transition from 28nm semi process to 14nm FinFET can literally cut the power consumption of a chip in half. But that’s small potatoes.
How about Ethernet? It’s everywhere – right? Did you know servers often have 4 ethernet ports, and that there are a matching 4 ports on a network switch? LSI pioneered something called Energy Efficient Ethernet (EEE). We’re also one of the biggest manufacturers of Ethernet PHYs – the part that drives the cable – and we come standard in everything from personal computers to servers to enterprise switches. The savings are hard to estimate, because they depend very much on how much traffic there is, but you can realistically save Watts per interface link, and there are often 256 links in a rack. 500 Watts per rack is no joke, and in some datacenters it adds up to 1 or 2 MegaWatts.
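As a rough check on the Ethernet numbers above, here is a minimal sketch of the arithmetic. The 2 Watts saved per link is an assumption (the text only says "Watts per interface link"), picked because it roughly reproduces the 500 Watts per rack figure. The function names are purely illustrative.

```python
# Back-of-envelope check of the Energy Efficient Ethernet savings.
LINKS_PER_RACK = 256          # from the text
WATTS_SAVED_PER_LINK = 2.0    # ASSUMPTION: the text only says "Watts per link"


def rack_savings_watts(links=LINKS_PER_RACK, watts_per_link=WATTS_SAVED_PER_LINK):
    """Power saved in one rack, in Watts."""
    return links * watts_per_link


def racks_for_savings(target_megawatts, watts_per_rack=500.0):
    """Number of racks needed to reach a given saving at ~500 W per rack."""
    return target_megawatts * 1_000_000 / watts_per_rack


if __name__ == "__main__":
    print(rack_savings_watts())      # Watts saved per rack
    print(racks_for_savings(1.0))    # racks needed for 1 MegaWatt
    print(racks_for_savings(2.0))    # racks needed for 2 MegaWatts
```

Under these assumptions, the 1 to 2 MegaWatt figure corresponds to a facility with roughly 2,000 to 4,000 racks.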
How about something a little bigger and more specific? Hard disk drives. Did you know a typical hyperscale datacenter has between 1 million and 1.5 million disk drives? Each one of those consumes about 9 Watts, and most have 2 TBytes of capacity. So for easy math, 1 million drives is about 9 MegaWatts (!?) and about 2 Exabytes of capacity (remember – data is often replicated 3 or more times). Data capacities in these facilities need to grow about 50% per year. So if we did nothing, we would need to go from 1 million drives to 1.5 million drives: 9 MegaWatts goes to 13.5 MegaWatts. Wow! Instead – our high linearity, low noise PA and read channel designs are allowing drives to go to 4 TBytes per drive. (Sure the chip itself may use slightly more power, but that’s not the point, what it enables is a profound difference.) So to get that 50% increase in capacity we could actually reduce the number of drives deployed to 0.75 million, with a net savings of 6.75 MegaWatts compared with doing nothing. Consider an average US home, with air conditioning, uses 1 kiloWatt. That’s almost 7,000 homes. In reality – they won’t get deployed that way – but it will still be a huge savings. Instead of buying another 0.5 million drives they would buy 0.25 million drives with a net savings of 2.2 MegaWatts. That’s still HUGE! (way to go, guys!) How many datacenters are doing that? Dozens. So that’s easily 20 or 30 MegaWatts globally. Did I say we saved them money too? A lot of money.
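The drive math above is easy to reproduce. This is a minimal sketch using only the round numbers from the text (9 Watts per drive, 2 or 4 TBytes per drive, 50% capacity growth, 1 kiloWatt per home). The variable and function names are illustrative, and it assumes a 4 TByte drive draws the same 9 Watts as a 2 TByte one.

```python
# Reproduces the hard drive power arithmetic with the text's round numbers.
WATTS_PER_DRIVE = 9       # from the text; ASSUMED equal for 2 TB and 4 TB drives
HOME_KILOWATTS = 1.0      # average US home, from the text


def fleet_power_mw(drives):
    """Total power of a drive fleet in MegaWatts."""
    return drives * WATTS_PER_DRIVE / 1_000_000


def drives_needed(capacity_tb, tb_per_drive):
    """Drives required to hold a raw capacity given per-drive size."""
    return capacity_tb / tb_per_drive


current_tb = 1_000_000 * 2        # 1 million drives at 2 TBytes each
target_tb = current_tb * 1.5      # 50% growth

# Scenario 1: rebuild the whole fleet on one drive size.
all_2tb_mw = fleet_power_mw(drives_needed(target_tb, 2))
all_4tb_mw = fleet_power_mw(drives_needed(target_tb, 4))
full_swap_saving_mw = all_2tb_mw - all_4tb_mw
homes_equivalent = full_swap_saving_mw * 1000 / HOME_KILOWATTS

# Scenario 2: keep the existing drives, add only the growth on 4 TByte drives.
growth_tb = target_tb - current_tb
incremental_saving_mw = fleet_power_mw(
    drives_needed(growth_tb, 2) - drives_needed(growth_tb, 4)
)

if __name__ == "__main__":
    print(all_2tb_mw, all_4tb_mw, full_swap_saving_mw)
    print(homes_equivalent, incremental_saving_mw)
```

The second scenario is the more realistic one: only the newly purchased drives are denser, so the saving is the power of the 0.25 million drives that never get bought.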
SSDs sip power to help improve energy profile
SSDs don’t always get the credit they deserve. Yes, they really are fast, and they are awesome in your laptop, but they also end up being much lower power than hard drives. Our controllers were in about half the flash solutions shipped last year. Think tens of millions. If you just assume they were all laptop SSDs (at least half were not) then that’s another 20 MegaWatts in savings.
Did you know that in a traditional datacenter, about 30% of the power going into the building is used for air conditioning? It doesn’t actually get used on the IT equipment at all, but is used to remove the heat that the IT equipment generates. We design our solutions so they can accommodate 40C ambient inlet air (that’s a little over 100F… hot). What that means is that the 30% of power used for the air conditioners disappears. Gone. That’s not theoretical either. Most of the large social media, search engine, web shopping, and web portal companies are using our solutions this way. That’s a 30% reduction in the power of storage solutions globally. Again, it’s MegaWatts in savings. And mega money savings too.
But let’s really get to the big hitters: improved work per server. Yep – we do that. In fact adding a Nytro™ MegaRAID® solution will almost always give you 4x the work out of a server. It’s a slam dunk if you’re running a database. You heard me – 1 server doing the work that it previously took 4 servers to do. Not only is that a huge savings in dollars (especially if you pay for software licenses!) but it’s a massive savings in power. You can replace 4 servers with 1, saving at least 900 Watts, and that lone server that’s left is actually dissipating less power too, because it’s actively using fewer HDDs, and using flash for most traffic instead. If you go a step further and use Nytro WarpDrive Flash cards in the servers, you can get much more – 6 to 8 times the work. (Yes, sometimes up to 10x, but let’s not get too excited). If you think that’s just theoretical again, check your Facebook® account, or download something from iTunes®. Those two services are the biggest users of PCIe® flash in the world. Why? It works cost effectively. And in case you haven’t noticed those two companies like to make money, not spend it. So again, we’re talking about MegaWatts of savings. Arguably on the order of 150 MegaWatts. Yea – that’s pretty theoretical, because they couldn’t really do the same work otherwise, but still, if you had to do the work in a traditional way, it would be around that.
It’s hard to be more precise than giving round numbers at these massive scales, but the numbers are definitely in the right zone. I can say with a straight face we save the world 10’s, and maybe even 100’s of MegaWatts per year. But no one sees that, and not many people even think about it. Still – I’d say LSI is a green hero.
Hey – we’re not done by a long shot. Let’s just look at scrap. If you read my earlier post on false disk failure, you’ll see some scary numbers. (http://blog.lsi.com/what-is-false-disk-failure-and-why-is-it-a-problem/ ) A normal hyperscale datacenter can expect 40-60 disks per day to be mistakenly scrapped out. That’s around 20,000 disk drives a year that should not have been scrapped, from just one web company. Think of the material waste, shipping waste, manufacturing waste, and eWaste issues. Wow – all for nothing. We’re working on solutions to that. And batteries. Ugly, eWaste, recycle only, heavy metal batteries. They are necessary for RAID protected storage systems. And much of the world’s data is protected that way – the battery is needed to save meta-data and transient writes in the event of a power failure, or server failure. We ship millions a year. (Sorry, mother earth). But we’re working diligently to make that a thing of the past. And that will also result in big savings for datacenters in both materials and recycling costs.
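To put the false-failure scrap rate in annual terms, here is a small sketch of the conversion. It uses only the 40 to 60 drives per day from the text and assumes the rate holds all 365 days of the year. The function name is illustrative.

```python
# Annualizes the daily rate of mistakenly scrapped drives.
def annual_false_scrap(drives_per_day, days_per_year=365):
    """Drives scrapped per year at a constant daily rate (ASSUMED constant)."""
    return drives_per_day * days_per_year


low = annual_false_scrap(40)
high = annual_false_scrap(60)

if __name__ == "__main__":
    print(low, high)
```

The range brackets the "around 20,000 a year" figure, which sits near the upper end.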
Can we do more? Sure. I know I am trying to get us the core technologies that will help reduce power consumption, raise capability and performance, and reduce waste. But we’ll never be done with that march of technology. (Which is a good thing if engineering is your career…)
I still often think about green, environmental impact, and what we’re doing to the environment. And I guess in my own small way, I am leaving the world a little better than when I arrived. And I think we at LSI should at least take a moment and pat ourselves on the back for that. You have to celebrate the small victories, you know? Even as the fight goes on.
I want to warn you, there is some thick background information here first. But don’t worry. I’ll get to the meat of the topic and that’s this: Ultimately, I think that PCIe® cards will evolve to more external, rack-level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but other leaders in flash are going down this path too…
I’ve been working on enterprise flash storage since 2007 – mulling over how to make it work. Endurance, capacity, cost, and performance have all been concerns we’ve grappled with. Of course the flash is changing too as the nodes change: 60nm, 50nm, 35nm, 24nm, 20nm… and single level cell (SLC) to multi level cell (MLC) to triple level cell (TLC) and all the variants of these “trimmed” for specific use cases. The spec “endurance” has gone from 1 million program/erase cycles (PE) to 3,000, and in some cases 500.
It’s worth pointing out that almost all the “magic” that has been developed around flash was already scoped out in 2007. It just takes a while for a whole new industry to mature. Individual die capacity increased, meaning fewer die are needed for a solution – and that means less parallel bandwidth for data transfer… And the “requirement” for state-of-the-art single operation write latency has fallen well below the write latency of the flash itself. (What the ?? Yea – talk about that later in some other blog. But flash is ~1500uS write latency, where state of the art flash cards are ~50uS.) When I describe the state of technology it sounds pretty pessimistic. I’m not. We’ve overcome a lot.
We built our first PCIe card solution at LSI in 2009. It wasn’t perfect, but it was better than anything else out there in many ways. We’ve learned a lot in the years since – both from making them, and from dealing with customers and users – about our own solutions and our competitors’. We’re lucky to be an important player in storage, so in general the big OEMs, large enterprises and the hyperscale datacenters all want to talk with us – not just about what we have or can sell, but what we could have and what we could do. They’re generous enough to share what works and what doesn’t, what the values of solutions are and what the pitfalls are too. Honestly? It’s the hyperscale datacenters in the lead both practically and in vision.
If you haven’t nodded off to sleep yet, that’s a long-winded way of saying – things have changed fast, and, boy, we’ve learned a lot in just a few years.
Most important thing we’ve learned…
Most importantly, we’ve learned it’s latency that matters. No one is pushing the IOPs limits of flash, and no one is pushing the bandwidth limits of flash. But they sure are pushing the latency limits.
PCIe cards are great, but…
We’ve gotten lots of feedback, and one of the biggest things we’ve learned is – PCIe flash cards are awesome. They radically change the performance profiles of most applications, especially databases, allowing servers to run efficiently and the actual work done by each server to multiply 4x to 10x (and in a few extreme cases 100x). So the feedback we get from large users is “PCIe cards are fantastic. We’re so thankful they came along. But…” There’s always a “but,” right??
It tends to be a pretty long list of frustrations, and they differ depending on the type of datacenter using them. We’re not the only ones hearing it. To be clear, none of these are stopping people from deploying PCIe flash… the attraction is just too compelling. But the problems are real, and they have real implications, and the market is asking for real solutions.
Of course, everyone wants these fixed without affecting single operation latency, or increasing cost, etc. That’s what we’re here for though – right? Solve the impossible?
A quick summary is in order. It’s not looking good. For a given solution, flash is getting less reliable, there is less bandwidth available at capacity because there are fewer die, we’re driving latency way below the actual write latency of flash, and we’re not satisfied with the best solutions we have for all the reasons above.
If you think these through enough, you start to consider one basic path. It also turns out we’re not the only ones realizing this. Where will PCIe flash solutions evolve over the next 2, 3, 4 years? The basic goals are:
One easy answer would be – that’s a flash SAN or NAS. But that’s not the answer. Not many customers want a flash SAN or NAS – not for their new infrastructure, but more importantly, all the data is at the wrong end of the straw. The poor server is left sucking hard. Remember – this is flash, and people use flash for latency. Today these SAN-type flash devices have 4x-10x worse latency than PCIe cards. Ouch. You have to suck the data through a relatively low bandwidth interconnect, after passing through both the storage and network stacks. And there is interaction between the I/O threads of various servers and applications – you have to wait in line for that resource. It’s true there is a lot of startup energy in this space. It seems to make sense if you’re a startup, because SAN/NAS is what people use today, and there’s lots of money spent in that market today. However, it’s not what the market is asking for.
Another easy answer is NVMe SSDs. Right? Everyone wants them – right? Well, OEMs at least. Front bay PCIe SSDs (HDD form factor or NVMe – lots of names) that crowd out your disk drive bays. But they don’t fix the problems. The extra mechanicals and form factor are more expensive, and just make replacing the cards every 5 years a few minutes faster. Wow. With NVMe SSDs, you can fit fewer HDDs – not good. They also provide uniformly bad cooling, and hard limit power to 9W or 25W per device. And to protect the storage in these devices, you need to have enough of them that you can RAID or otherwise protect. Once you have enough of those for protection, they give you awesome capacity, IOPs and bandwidth, too much in fact, but that’s not what applications need – they need low latency for the working set of data.
What do I think the PCIe replacement solutions in the near future will look like? You need to pool the flash across servers (to optimize bandwidth and resource usage, and allocate appropriate capacity). You need to protect against failures/errors and limit the span of failure, commit writes at very low latency (lower than native flash) and maintain low latency, bottleneck-free physical links to each server… To me that implies:
That means the performance looks exactly as if each server had multiple PCIe cards. But the capacity and bandwidth resources are shared, and systems can remain resilient. So ultimately, I think that PCIe cards will evolve to more external, rack level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but as I say – other leaders in flash are going down this path too…
What’s your opinion?
I remember in the mid-1990s the question of how many minutes away from a diversion airport a two-engine passenger jet should be allowed to fly in the event of an engine failure. Staying in the air long enough is one of those high-availability functions that really matters. In the case of the Boeing 777, it was the first aircraft to enter service with a 180-minute extended operations certification (ETOPS) [1]. This meant that longer over-water and remote terrain routes were immediately possible.
The question was “can a two-engine passenger aircraft be as safe as a four engine aircraft for long haul flights?” The short answer is yes. Reducing the points of failure from four engines to two, while meeting strict maintenance requirements and maintaining redundant systems, reduces the probability of a failure. The 777 and many other aircraft have proven to be safe for these longer flights. Recently, the 777 has received FAA approval for a 330-minute ETOPS rating [2], which allows airlines to offer routes that are longer, straighter and more economical.
What does this have to do with a datacenter? It turns out that some hyperscale datacenters house hundreds of thousands of servers, each with its own boot drive. Each of these boot drives is a potential point of failure, which can drive up acquisition and operating costs and the odds of a breakdown. Datacenter managers need to control CapEx, so for the sheer volume of server boot drives they commonly use the lowest cost 2.5-inch notebook SATA hard drives. The problem is that these commodity hard drives tend to fail more often. This is not a huge issue with only a few servers. But in a datacenter with 200,000 servers, LSI has found through internal research that, on average, 40 to 200 drives fail per week! (2.5″ hard drive, ~2.5 to 4-year lifespan, which equates to a conservative 5% failure rate/year).
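The weekly failure range above follows from simple rate arithmetic. This is a hedged sketch using the text's 200,000 servers (one boot drive each) and its 5% annual failure rate. The 52-week year and the function names are assumptions made for illustration.

```python
# Converts between annual failure rate and weekly boot drive failures.
def weekly_failures(servers, annual_failure_rate, weeks_per_year=52):
    """Expected drive failures per week, assuming one boot drive per server."""
    return servers * annual_failure_rate / weeks_per_year


def implied_annual_rate(servers, failures_per_week, weeks_per_year=52):
    """Annual failure rate implied by an observed weekly failure count."""
    return failures_per_week * weeks_per_year / servers


upper = weekly_failures(200_000, 0.05)        # 5% per year, from the text
lower_rate = implied_annual_rate(200_000, 40)  # rate implied by 40 per week

if __name__ == "__main__":
    print(round(upper), lower_rate)
```

At 5% per year the model lands near the top of the quoted 40 to 200 range, while the low end of 40 per week corresponds to an annual failure rate of about 1%.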
Traditionally, a hyperscale datacenter has a sea of racks filled with servers. LSI approximates that, in the majority of large datacenters, at least 60% of the servers (Web servers, database servers, etc.) use a boot drive requiring no more than 40GB of storage capacity since it performs only boot-up and journaling or logging. For higher reliability, the key is to consolidate these low-capacity drives, virtually speaking. With our Syncro™ MX-B Rack Boot Appliance, we can consolidate the boot drives for 24 or 48 of these servers into a single mirrored array (using LSI MegaRAID technology), which makes 40GB of virtual disk space available to each server.
Combining all these boot drives with fewer larger drives that are mirrored helps reduce total cost of ownership (TCO) and improves reliability, availability and serviceability. If a rack boot appliance drive fails, an alert is sent to the IT operator. The operator then simply replaces the failed drive, and the appliance automatically copies the disk image from the working drive. The upshot is that operations are simplified, OpEx is reduced, and there is usually no downtime.
Syncro MX-B not only improves reliability by reducing failure points; it also significantly reduces power requirements (up to 40% less in the 24-port version, up to 60% less in the 48-port version) – a good thing for the corporate utility bill and climate change. This, in turn, reduces cooling requirements, and helps make hardware upgrades less costly. With the boot drives disaggregated from the servers, there’s no need to simultaneously upgrade the drives, which typically are still functional during server hardware upgrades.
In the case of both commercial aircraft and servers, less really can be more (or at least better) in some situations. Eliminating excess can make the whole system simpler and more efficient.
To learn more, please visit the LSI® Shared Storage Solutions web page: http://www.lsi.com/solutions/Pages/SharedStorage.aspx
The term global warming can be very polarizing in a conversation and both sides of the argument have mountains of material that support or discredit the overall situation. The most devout believers in global warming point to the average temperature increases in the Earth’s atmosphere over the last 100+ years. They maintain the rise is primarily caused by increased greenhouse gases from humans burning fossil fuels and deforestation.
The opposition generally agrees with the measured increase in temperature over that time, but claims that increase is part of a natural cycle of the planet and not something humans can significantly impact one way or another. The US Energy Information Administration estimates that 90% of world’s marketed energy consumption is from non-renewable energy sources like fossil fuels. Our internet-driven lives run through datacenters that are well-known to consume large quantities of power. No matter which side of the global warming argument you support, most people agree that wasting power is not a good long-term position. Therefore, if the power consumed by datacenters can be reduced, especially as we live in an increasingly digitized world, this would benefit all mankind.
When we look at the most power-hungry components of a datacenter, we find mainly server and storage systems. However, people sometimes forget that those systems require cooling to counteract the heat generated, and the cooling itself consumes even more energy. So anything that can store data more efficiently and quickly will reduce both the initial energy consumption and the energy to cool those systems. As datacenters demand faster data storage, they are shifting to solid state drives (SSDs). SSDs generally provide higher performance per watt of power consumed than hard disk drives, but there is still more that can be done.
Reducing data to help turn down the heat
The good news is that there’s a way to reduce the amount of data that reaches the flash memory of the SSD. The unique DuraWrite™ technology found in all LSI® SandForce® flash controllers reduces the amount of data written to the flash memory, which cuts the time it takes to complete the writes and therefore reduces power consumption below the levels of other SSD technologies. That, in turn, reduces the cooling needed, which further reduces overall power consumption. This data reduction is “loss-less,” meaning 100% of what is saved is returned to the host, unlike MPEG, JPEG, and MP3 files, which tolerate some amount of data loss to reduce file sizes.
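To illustrate what "loss-less" means here, the sketch below uses Python's standard `zlib` module as a stand-in. This is not DuraWrite, whose internals are not described in this post. It only shows the general idea that fewer bytes can be stored while the host still gets back exactly what it wrote. The function names and the sample payload are hypothetical.

```python
import zlib

# Generic lossless data reduction using zlib as a STAND-IN technique.
# This is NOT how DuraWrite works internally; it only demonstrates the
# property that every byte written by the host is returned unchanged.


def reduce_for_write(payload: bytes) -> bytes:
    """Shrink the payload before it would be written to the media."""
    return zlib.compress(payload)


def restore_for_read(stored: bytes) -> bytes:
    """Recover the original payload exactly when the host reads it back."""
    return zlib.decompress(stored)


# Hypothetical, highly repetitive log-style data.
host_data = b"timestamp=0 status=OK\n" * 1000
stored = reduce_for_write(host_data)
returned = restore_for_read(stored)

if __name__ == "__main__":
    print(len(host_data), len(stored), returned == host_data)
```

How much the data shrinks depends entirely on the data itself. Repetitive data like this sample reduces a lot, while already-compressed media files barely reduce at all.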
Today you can find many datacenters already using SandForce Driven SSDs and LSI Nytro™ application acceleration products (which use DuraWrite technology as well). When we start to see datacenters deploying these flash storage products by the millions, you will certainly be able to measure the reduction in power consumed by datacenters. Unfortunately, LSI will not be able to claim it stopped global warming, but at least we, and our customers, can say we did something to help defer the end result.
I’ve been travelling to China quite a bit over the last year or so. I’m sitting in Shenzhen right now (If you know Chinese internet companies, you’ll know who I’m visiting). The growth is staggering. I’ve had a bit of a trains, planes, automobiles experience this trip, and that’s exposed me to parts of China I never would have seen otherwise. Just to accommodate sheer population growth and the modest increase in wealth, there is construction everywhere – a press of people and energy, constant traffic jams, unending urban centers, and most everything is new. Very new. It must be exciting to be part of that explosive growth. What a market. I mean – come on – there are 1.3 billion potential users in China.
The amazing thing for me is the rapid growth of hyperscale datacenters in China, which is truly exponential. Their infrastructure growth has been 200%-300% CAGR for the past few years. It’s also fantastic walking into a building in China, say Baidu, and feeling very much at home – just like you walked into Facebook or Google. It’s the same young vibe, energy, and ambition to change how the world does things. And it’s also the same pleasure – talking to architects who are super-sharp, have few technical prejudices, and have very little vanity – just a will to get to business and solve problems. Polite, but blunt. We’re lucky that they recognize LSI as a leader, and are willing to spend time to listen to our ideas, and to give us theirs.
Even their infrastructure has a similar feel to the US hyperscale datacenters. The same only different. ;-)
A lot of these guys are growing revenue at 50% per year, several getting 50% gross margin. Those are nice numbers in any country. One has $100’s of billions in revenue. And they’re starting to push out of China. So far their pushes into Japan have not gone well, but other countries should be better. They all have unique business models. “We” in the US like to say things like “Alibaba is the Chinese eBay” or “Sina Weibo is the Chinese Twitter”…. But that’s not true – they all have more hybrid business models, unique, and so their datacenter goals, revenue and growth have a slightly different profile. And there are some very cool services that simply are not available elsewhere. (You listening Apple®, Google®, Twitter®, Facebook®?) But they are all expanding their services, products and user base. Interestingly, there is very little public cloud in China. So there are no real equivalents to Amazon’s services or Microsoft’s Azure. I have heard about current development of that kind of model with the government as initial customer. We’ll see how that goes.
100’s of thousands of servers. They’re not the scale of Google, but they sure are the scale of Facebook, Amazon, Microsoft…. It’s a serious market for an outfit like LSI. Really it’s a very similar scale now to the US market. Close to 1 million servers installed among the main 4 players, and exabytes of data (we’ve blown past mere petabytes). Interestingly, they still use many co-location facilities, but that will change. More important – they’re all planning to probably double their infrastructure in the next 1-2 years – they have to – their growth rates are crazy.
Often 5 or 6 distinct platforms, just like the US hyperscale datacenters. Database platforms, storage platforms, analytics platforms, archival platforms, web server platforms…. But they tend to be a little more like a rack of traditional servers that enterprise buys with integrated disk bays, still a lot of 1G Ethernet, and they are still mostly from established OEMs. In fact I just ran into one OEM’s American GM, who I happen to know, in Tencent’s offices today. The typical servers have 12 HDDs in drive bays, though they are starting to look at SSDs as part of the storage platform. They do use PCIe® flash cards in some platforms, but the performance requirements are not as extreme as you might imagine. Reasonably low latency and consistent latency are the premium they are looking for from these flash cards – not maximum IOPs or bandwidth – very similar to their American counterparts. I think hyperscale datacenters are sophisticated in understanding what they need from flash, and not requiring more than that. Enterprise could learn a thing or two.
Some server platforms have RAIDed HDDs, but most are direct map drives using a high availability (HA) layer across the server center – Hadoop® HDFS or self-developed Hadoop-like platforms. Some have also started to deploy microserver archival “bit buckets”: a small ARM® SoC with 4 HDDs totaling 12 TBytes of storage, giving densities like 72 TBytes of file storage in 2U of rack. I can only find about 5,000 of those in China, all first generation experiments, but they are the first of a growing wave of archival solutions based on lower performance ARM servers. The feedback is clear – they’re not perfect yet, but the writing is on the wall. (If you’re wondering about the math, that’s 5,000 x 12 TBytes = 60 Petabytes….)
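The microserver numbers above can be cross-checked in a few lines. This sketch uses only figures from the text (4 HDDs and 12 TBytes per microserver, 72 TBytes per 2U, about 5,000 units). It assumes decimal units, so 1 Petabyte is 1,000 TBytes, and the names are illustrative.

```python
# Cross-checks the ARM microserver "bit bucket" density and capacity figures.
TB_PER_MICROSERVER = 12     # from the text
HDDS_PER_MICROSERVER = 4    # from the text
TB_PER_2U = 72              # from the text
DEPLOYED_UNITS = 5_000      # approximate count from the text

microservers_per_2u = TB_PER_2U // TB_PER_MICROSERVER
tb_per_hdd = TB_PER_MICROSERVER / HDDS_PER_MICROSERVER
total_pb = DEPLOYED_UNITS * TB_PER_MICROSERVER / 1000   # ASSUMES 1 PB = 1,000 TB

if __name__ == "__main__":
    print(microservers_per_2u, tb_per_hdd, total_pb)
```

So the quoted density implies six microservers, each with four 3 TByte drives, in every 2U of rack space.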
Yes, power is important, and maybe more than we’re used to. It’s harder to get licenses for power in China. So it’s really important to stay within the envelope of power your datacenter has. You simply can’t get more. That means they have to deploy solutions that do more in the same power profile, especially as they move out of co-located datacenters into private ones. Annually, that’s 50% more users supported, more storage capacity, more performance, more services, all in the same power. That’s not so easy. I would expect solar power in their future, just as Apple has done.
Here’s where it gets interesting. They are developing a cousin to OpenCompute that’s called Scorpio. It’s Tencent, Alibaba, Baidu, and China Telecom so far driving the standard. The goals are similar to OpenCompute, but more aligned to standardized sub-systems that can be co-mingled from multiple vendors. There is some harmonization and coordination between OpenCompute and Scorpio, and in fact the Scorpio companies are members of OpenCompute. But where OpenCompute is trying to change the complete architecture of scale-out clusters, Scorpio is much more pragmatic – some would say less ambitious. They’ve finished version 1 and rolled out about 200 racks as a “test case” to learn from. Baidu was the guinea pig. That’s around 6,000 servers. They weren’t expecting more from version 1. They’re trying to learn. They’ve made mistakes, learned a lot, and are working on version 2.
Even if it’s not exciting, it will have an impact because of the sheer size of deployments these guys are getting ready to roll out in the next few years. They see the progression as 1) they were using standard equipment, 2) they’re experimenting and learning from trial runs of Scorpio versions 1 and 2, and then they’ll work on 3) new architectures that are efficient and powerful, and different.
Information is pretty sketchy if you are not one of the member companies or one of their direct vendors. We were just invited to join Scorpio by one of the founders, and would be the first group outside of China to do so. If that all works out, I’ll have a much better idea of the details, and hopefully can influence the standards to be better for these hyperscale datacenter applications. Between OpenCompute and Scorpio we’ll be seeing a major shift in the industry – a shift that will undoubtedly be disturbing to a lot of current players. It makes me nervous, even though I’m excited about it. One thing is sure – just as the server market volume is migrating from traditional enterprise to hyperscale datacenter (25-30% of the server market and growing quickly), we’re starting to see a migration to Chinese hyperscale datacenters from US-based ones. They have to grow just to stay still. I mean – come on – there are 1.3 billion potential users in China….
I’ve spent a lot of time with hyperscale datacenters around the world trying to understand their problems – and I really don’t care what area those problems are in, as long as they’re important to the datacenter. What is the #1 Real Problem for many hyperscale datacenters? It’s something you’ve probably never heard about, and probably have not even thought about. It’s called false disk failure. Some hyperscale datacenters have crafted their own solutions – but most have not.
Why is this important, you ask? Many large datacenters today have 1 million to 4 million hard disk drives (HDDs) in active operation. In anyone’s book that’s a lot. It’s also a very interesting statistical sample size of HDDs. Hyperscale datacenters get great pricing on HDDs. Probably better than OEMs get, and certainly better than the $79 for buying 1 HDD at your local Fry’s store. So you would imagine if a disk fails – no one cares – they’re cheap and easy to replace. But the burden of a failed disk is much more than the raw cost of the disk:
Let’s put some scale to this problem, and you’ll begin to understand the issue. One modest size hyperscale datacenter has been very generous in sharing its real numbers. (When I say modest, they are ~1/4 to 1/2 the size of many other hyperscale datacenters, but they are still huge – more than 200k servers). Other hyperscale datacenters I have checked with say – yep, that’s about right. And one engineer I know at an HDD manufacturer said – “wow – I expected worse than that. That’s pretty good.” To be clear – these are very good HDDs they are using, it’s just that the numbers add up.
The raw data:
RAIDed SAS HDDs
Non-RAIDed (direct map) SATA drives behind HBAs
What’s interesting is the relative failure rate of SAS drives vs. SATA. It’s about an order of magnitude worse in SATA drives than SAS. Frankly some of this is due to protocol differences. SAS allows far more error recovery capabilities, and because SAS drives also tend to be more expensive, I believe manufacturers invest in slightly higher quality electronics and components. I know the electronics we ship into SAS drives are certainly more sophisticated than those in SATA drives.
False fail? What? Yeah, that’s an interesting topic. It turns out that about 40% of the time with SAS and about 50% of the time with SATA, the drive didn’t actually fail. It just lost its marbles for a while. When they pull the drive out and put it into a test jig, everything is just fine. And more interesting, when they put the drive back into service, it is statistically no more likely to fail again than any other drive in the datacenter. Why? No one knows. I have my suspicions, though.
I used to work on engine controllers. That’s a very paranoid business. If something goes wrong and someone crashes, you have a lawsuit on your hands. If a controller needs a recall, that’s millions of units to replace, with a multi-hundred dollar module, and hundreds of dollars in labor for each one replaced. No one is willing to take that risk. So we designed very carefully to handle soft errors in memory and registers. We incorporated ECC like servers use, background code checksums and scrubbing, and all sorts of proprietary techniques, including watchdogs and super-fast self-resets that could get operational again in less than a full revolution of the engine. Why? – the events were statistically rare. The average controller might see 1 or 2 events in its lifetime, and a turn of the ignition would reset that state. But the events do happen, and so do recalls and lawsuits… HDD controllers don’t have these protections, which is reasonable. It would be an inappropriate cost burden for their price point.
You remember the Toyota Prius accelerator problems? I know that controller was not protected for soft errors. And the source of the problem remained a “mystery.” Maybe it just lost its marbles for a while? A false fail if you will. Just sayin’.
Back to HDDs. False fail is especially frustrating, because half the HDDs actually didn’t need to be replaced. All the operational costs were paid for no reason. The disk just needed a power cycle reset. (OK, that introduces all sorts of complex management by the RAID controller or application to manage that 10-second power reset cycle and the application traffic created in that time – but we can handle that.)
Daily, this datacenter has to:
And 1/2 of that is for no reason at all.
First – why not rebuild the disk if it’s RAIDed? Usually hyperscale datacenters use clustered applications. A traditional RAID rebuild drives the server performance to ~50%, and for a 2TByte drive, under heavy application load (definition of a hyperscale datacenter) can truly take up to a week. 50% performance for a week? In a cluster that means the overall cluster is running ~50% performance. Say 200 nodes in a cluster – that means you just lost ~100 nodes of work – or 50% of cluster performance. It’s much simpler to just take the node with the failed drive offline, and get 99.5% cluster performance, and operationally redistribute the workload across multiple nodes (because you have replicated data elsewhere). But after rebuild, the node will have to be re-synced or re-imaged. There are ways to fix all this. We’ll talk about them on another day. Or you can simply run direct mapped storage, and unmount the failed drive.
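The rebuild-versus-offline arithmetic above can be sketched in a few lines. This is only a rough model, not anyone's production tooling: the function names are made up for illustration, and it leans on the same simplification the paragraph uses, namely that a cluster runs at the pace of its slowest node.

```python
def perf_rebuilding_in_place(node_perf_during_rebuild: float = 0.5) -> float:
    """Cluster performance while one node rebuilds its RAID set.

    Assumption (from the text): the slowest node gates the cluster,
    so cluster performance equals the rebuilding node's performance.
    """
    return node_perf_during_rebuild


def perf_with_nodes_offline(nodes: int, offline: int = 1) -> float:
    """Cluster performance when failed nodes are removed and the work
    is redistributed evenly across the remaining nodes."""
    if nodes <= 0 or not 0 <= offline <= nodes:
        raise ValueError("need nodes > 0 and 0 <= offline <= nodes")
    return (nodes - offline) / nodes


def nodes_of_work_lost(nodes: int, cluster_perf: float) -> float:
    """Express a performance fraction as equivalent nodes of lost work."""
    return nodes * (1.0 - cluster_perf)


if __name__ == "__main__":
    n = 200  # cluster size used in the text
    rebuild = perf_rebuilding_in_place(0.5)
    offline = perf_with_nodes_offline(n, 1)
    print(f"rebuild in place: {rebuild:.1%} "
          f"(~{nodes_of_work_lost(n, rebuild):.0f} nodes of work lost)")
    print(f"node offline:     {offline:.1%} "
          f"(~{nodes_of_work_lost(n, offline):.0f} node of work lost)")
```

For the 200-node example, that is roughly 100 nodes of work lost while rebuilding in place versus about 1 node lost by pulling the node, which is the whole argument in two numbers.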
Next – why replicate data over the network, and why is that a big deal? For geographic redundancy (say a natural disaster at one facility) and regional locality, hyperscale datacenters need multiple data copies. Often 3 copies so they can do double duty as high-availability copies, or in the case of some erasure coding, 2.2 to 2.5 copies (yeah – weird math – how do you have 0.5 of a copy…). When you lose one copy, you are down to 2, possibly 1. You need to get back to a reliable number again. Fast. Customers are loyal because of your perfect data retention. So you need to replicate that data and re-distribute it across the datacenter on multiple servers. That’s network traffic, and possibly congestion, which affects other aspects of the operations of the datacenter. In this datacenter it’s about 50 hours of 10G Ethernet traffic every day.
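To get a feel for what "hours of 10G Ethernet traffic" means, here is a back-of-envelope sketch. It is my own simplification, not this datacenter's accounting: the helper names are hypothetical, units are decimal (1 TB = 1e12 bytes), and it assumes a fully saturated link with no protocol overhead, so real transfers will take longer.

```python
TB = 1e12  # decimal terabyte, in bytes (assumption)


def link_hours(bytes_to_copy: float, link_gbps: float = 10.0) -> float:
    """Hours of a fully utilized link needed to move bytes_to_copy.

    Ignores protocol overhead and contention, so this is a lower bound.
    """
    if link_gbps <= 0:
        raise ValueError("link_gbps must be positive")
    seconds = (bytes_to_copy * 8) / (link_gbps * 1e9)
    return seconds / 3600


def bytes_moved(hours: float, link_gbps: float = 10.0) -> float:
    """Inverse of link_hours: bytes a saturated link moves in `hours`."""
    return hours * 3600 * link_gbps * 1e9 / 8


if __name__ == "__main__":
    per_drive_minutes = link_hours(2 * TB) * 60
    daily_tb = bytes_moved(50) / TB
    print(f"one 2 TB drive: ~{per_drive_minutes:.0f} minutes of 10G link time")
    print(f"50 link-hours:  ~{daily_tb:.0f} TB re-replicated per day")
```

Under those assumptions a single full 2 TB drive is a bit under half an hour of link time, and 50 link-hours works out to roughly 225 TB moved per day.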
To be fair, there is a new standard in SAS interfaces that will facilitate resetting a disk in-situ. And there is the start of discussion of the same around SATA – but that’s more problematic. Whatever the case, it will be years before the ecosystem is in place to handle the problems this way.
What’s that mean to you?
Well. You can expect something like 1/100 of your drives to really fail this year. And you can expect another 1/100 of your drives to fail this year, but not actually be failed. You’ll still pay all the operational overhead of a failed drive even though the drive never actually failed – rebuilds, disk replacements, management interventions, scheduled downtime/maintenance time, and the OEM replacement price for that drive (what, $600 or so?). Depending on your size, that’s either a don’t-care or a big deal. There are ways to handle this, and they’re not expensive – much less than the disk carrier you already pay for to allow you to replace that drive – and it can be handled transparently – just a log entry without seeing any performance hiccups. You just need to convince your OEM to carry the solution.
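If you want to plug in your own fleet size, the estimate above can be sketched like this. The names are hypothetical, the 1/100 rates and the ~$600 price are the rough figures from the paragraph rather than measured data, and the "wasted" number counts only the replacement price, not the rebuilds, interventions, or downtime.

```python
from dataclasses import dataclass


@dataclass
class AnnualFailureEstimate:
    real_failures: float
    false_failures: float
    wasted_replacement_cost: float  # price of drives pulled needlessly

    @property
    def total_pulls(self) -> float:
        """Every drive that gets replaced, whether it truly failed or not."""
        return self.real_failures + self.false_failures


def estimate_annual_failures(
    drives: int,
    real_fail_rate: float = 0.01,      # ~1/100 really fail (from the text)
    false_fail_rate: float = 0.01,     # ~1/100 more "fail" but are fine
    replacement_price: float = 600.0,  # the text's rough OEM price guess
) -> AnnualFailureEstimate:
    real_count = drives * real_fail_rate
    false_count = drives * false_fail_rate
    return AnnualFailureEstimate(
        real_failures=real_count,
        false_failures=false_count,
        wasted_replacement_cost=false_count * replacement_price,
    )


if __name__ == "__main__":
    est = estimate_annual_failures(1_000)
    print(f"real: {est.real_failures:.0f}, false: {est.false_failures:.0f}, "
          f"pulled: {est.total_pulls:.0f}, "
          f"wasted: ${est.wasted_replacement_cost:,.0f}")
```

For a 1,000-drive shop that is about 10 real failures, about 10 false ones, and roughly $6,000 a year in drives that never needed replacing, before any operational overhead is counted.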
Anyone who knows me knows I like to ask “why?” Maybe I never outgrew the 2-year-old phase. But I also like to ask “why not?” Every now and then you need to rethink everything you know top to bottom because something might have changed.
I’ve been talking to a lot of enterprise datacenter architects and managers lately. They’re interested in using flash in their servers and storage, but they can’t get over all the “problems.”
The conversation goes something like this: Flash is interesting, but it’s crazy expensive $/bit. The prices have to come way down – after all it’s just a commodity part. And I have these $4k servers – why would I put an $8k PCIe card in them – that makes no sense. And the stuff wears out, which is an operational risk for me – disks last forever. Maybe flash isn’t ready for prime time yet.
These arguments are reasonable if you think about flash as a disk replacement, and don’t think through all the follow-on implications.
In contrast I’ve also been spending a lot of time with the biggest datacenters in the world – you know – the ones we all know by brand name. They have at least 200k servers, and anywhere from 1.5 million to 7 million disks. They notice CapEx and OpEx a lot. You multiply anything by that much and it’s noticeable. (My simple example is add 1 LED to each server with 200k servers and the cost adds up to 26K watts + $10K LED cost.) They are very scientific about cost. More specifically they measure work/$ very carefully. Anything to increase work or reduce $ is very interesting – doing both at once is the holy grail. Already one of those datacenters is completely diskless. Others are part way there, or have the ambition of being there. You might think they’re crazy – how can they spend so much on flash when disks are so much cheaper, and these guys offer their services for free?
When the large datacenters – I call them hyperscale datacenters – measure cost, they’re looking at purchase cost, including metal racks and enclosures, shipping, service cost both parts and human expense, as well as operational disruption overhead and the complexity of managing that, the opportunity cost of new systems vs. old systems that are less efficient, and of course facilities expenses – buildings, power, cooling, people… They try to optimize the mix of these.
Let’s look at the arguments against using flash one by one.
Flash is just a commodity part
This is a very big fallacy. It’s not a commodity part, and flash is not all the same. The parts you see in cheap consumer devices deserve their price. In the chip industry, it’s common to have manufacturing fallout; 3% – 10% is reasonable. What’s more the devices come at different performance levels – just look at x86 performance versions of the same design. In the flash business 100% of the devices are sold, used, and find their way into products. Those cheap consumer products are usually the 3%-10% that would be scrap in other industries. (I was once told – with a smile – “those are the parts we sweep off the floor”…)
Each generation of flash (about 18 months between them) and each manufacturer (there are 5, depending how you count) have very different characteristics. There are wild differences in erase time, write time, read time, bandwidth, capacity, endurance, and cost. There is no one supplier that is best at all of these, and leadership moves around. More importantly, in a flash system, how you trade these things off has a huge effect on write latency (#1 impactor on work done), latency outliers (consistent operation), endurance or life span, power consumption, and solution cost. All flash products are not equal – not by a long shot. Even hyperscale datacenters have different types of solutions for different needs.
It’s also important to know that temperature of operation and storage, inter-arrival time of writes, and “over provisioning” (the amount hidden for background use and garbage collection) have profound impacts on lifespan and performance.
$8k PCIe card in a $4k server – really?
I am always stunned by this. No one thinks twice about spending more on virtualization licenses than on hardware, or say $50k for a database license to run on a $4k server. It’s all about what work you need to accomplish, and what’s the best way to accomplish it. It’s no joke that in database applications it’s pretty easy to get 4x the work from a server with a flash solution inserted. You probably won’t do worse than 4x, and you may do as well as 10x. On a purely hardware basis, that makes sense – I can have 1 server @ $4k + $8K flash vs. 4 servers @ $4k. I just saved $4k CapEx. More importantly, I saved the service contract, power, cooling and admin of 3 servers. If I include virtualization or database licenses, I saved another $150k + annual service contracts on those licenses. That’s easy math. If I worry about users supported rather than work done, I can support as many as 100x users. The math becomes overwhelming. $8K PCIe card in a $4k server? You bet when I think of work/$.
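Here is that easy math as a small sketch you can rerun with your own prices. The function and field names are made up for illustration, the defaults are the round numbers from the paragraph, and it assumes one license per server and a flat 4x speedup, which your workload may or may not deliver.

```python
def compare_flash_vs_scale_out(
    server_cost: int = 4_000,
    flash_cost: int = 8_000,
    speedup: int = 4,                  # work of one flash server, in plain servers
    license_per_server: int = 50_000,  # e.g. a database license (assumption: one per server)
) -> dict:
    """Compare one flash-equipped server against `speedup` plain servers
    doing the same amount of work."""
    flash_capex = server_cost + flash_cost
    scale_out_capex = speedup * server_cost
    servers_avoided = speedup - 1
    return {
        "flash_capex": flash_capex,
        "scale_out_capex": scale_out_capex,
        "capex_saved": scale_out_capex - flash_capex,
        "servers_avoided": servers_avoided,
        "license_saved": servers_avoided * license_per_server,
    }


if __name__ == "__main__":
    for key, value in compare_flash_vs_scale_out().items():
        print(f"{key:>16}: {value:,}")
```

With the defaults this reproduces the figures above: $12k vs. $16k of hardware, $4k of CapEx saved, 3 servers you never have to power or manage, and $150k of licenses you never buy. Raising `speedup` toward 10 widens the gap further.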
The stuff wears out & disks last forever
It’s true that car tires wear out, and depending on how hard you use them that might be faster or slower. But tires are one of the most important parts in a car’s performance – acceleration, stopping, handling – you couldn’t do any of that without them. The only time you really have catastrophic failure with tires is when you wear them way past any reasonable point – until they are bald and should have been replaced. Flash is like that – you get lots of warning as it’s wearing out, and you get lots of opportunity to operationally plan and replace the flash without disruption. You might need to replace it after 4 or 5 years, but you can plan and do it gracefully. Disks can last “forever,” but they also fail randomly and often.
Reliability statistics across millions of hard drives show somewhere around 2.5% fail annually. And that’s for 1st quality drives. Those are unpredicted, catastrophic failures, and depending on your storage systems that means you need to go into rebuild or replication of TBytes of data, and you have a subsequent degradation in performance (which can completely mess up load balancing of a cluster of 20 to 200 other nodes too), potentially network traffic overhead, and a physical service event that needs to be handled manually and fairly quickly. And really – how often do admins want to take the risk of physically replacing a drive while a system is running? Just one mistake by your tech and it’s all over… Operationally flash is way better: less disruptive, predictable, lower cost, and the follow-on implications are much simpler.
Crazy expensive $/bit
OK – so this argument doesn’t seem so relevant anymore. Even so, in most cases you can’t use much of the disk capacity you have. It will be stranded because you need to have spare space as databases, etc. grow. If you run out of space for db’s the result is catastrophic. If you are driving a system hard, you often don’t have the bandwidth left to actually access that extra capacity. It’s common to only use ½ of the available capacity of drives.
Caching solutions change the equation as well. You can spend money on flash for the performance characteristics, and shift disk drive spend to fewer, higher capacity, slower, more power efficient drives for bulk capacity. Often for the same or similar overall storage spend you can have the same capacity at 4x the system performance. And the space and power consumed and cooling needed for that system is dramatically reduced.
Even so, flash is not going to replace large capacity storage for a long, long time, if ever. Whatever the case, the $/bit is simply not the right metric for evaluating flash. But it’s true, flash is more expensive per bit. It’s simply that in most operational contexts, it more than makes up for that by other savings and work/$ improvements.
So I would argue (and I’m backed up by the biggest hyperscale datacenters in the world) that flash is ready for prime time adoption. Work/$ is the correct metric, but you need to measure from the application down to the storage bits to get that metric. It’s not correct to think about flash as “just a disk replacement” – it changes the entire balance of a solution stack from application performance and responsiveness and cumulative work, to server utilization to power consumption and cooling to maintenance and service to predictable operational stability. It’s not just a small win; it’s a big win. It’s not a fit yet for large pools of archival storage – but even for that a lot of energy is going into trying to make that work. So no – enterprise will not go diskless for quite a while, but it is understandable why hyperscale datacenters want to go diskless. It’s simple math.
Every now and then you need to rethink everything you know top to bottom because something might have changed.
One of the big challenges I see so many IT managers struggling with is how to deal with the almost exponential growth of data that has to be stored, accessed, and protected, when IT budgets are flat or growing more slowly than storage volumes.
I’ve found that it doesn’t seem to matter if it is a departmental or small business datacenter, or a hyperscale datacenter with many thousands of servers. The data growth continues to outpace the budgets.
At LSI we call this disparity between the IT budget and the growth in storage needs the “data deluge gap.”
Of course, smaller datacenters have different issues than the hyperscale datacenters. However, no matter the datacenter size, concerns generally center on TCO. This, of course, includes both CapEx and OpEx for the storage systems.
It’s a good feeling to know that we are tackling these datacenter growth and operations issues head-on for many different environments – large and small.
LSI has developed and is starting to provide a new shared DAS (sDAS) architecture that supports the sharing of storage across multiple servers. We call it the LSI® Syncro™ architecture and it really is the next step in the evolution of DAS. Our Syncro solutions deliver increased uptime, help to reduce overall costs, increase agility, and are designed for ease of deployment. The fact that the Syncro architecture is built on our proven MegaRAID® technology means that our customers can trust that it will work in all types of environments.
Syncro architecture is a very exciting new capability that addresses storage and data protection needs for numerous datacenter environments. Our first product, Syncro MX-B, is targeted at hyperscale datacenter environments including Web 2.0 and cloud. I will be blogging about that offering in the near future. We will soon be announcing details on our Syncro CS product line, previously known as High Availability DAS, for small and medium businesses and I will blog about what it can mean for our customers and users.
Both of these initial versions of the Syncro architecture are very exciting, and I really like to watch how datacenter managers react when they find out about these game-changing capabilities.
We say that “with the LSI Syncro architecture you take DAS out of the box and make it sharable and scalable. The LSI Syncro architecture helps make your storage Simple. Smart. On.” Our tag line for Syncro is “The Smarter Way to ON.™” It really is.
To learn more, please visit the LSI Shared Storage Solutions web page: http://www.lsi.com/solutions/Pages/SharedStorage.aspx