You might be surprised to find out how big the infrastructure for cloud and Web 2.0 is. It is mind-blowing. Microsoft has acknowledged packing more than 1 million servers into its datacenters, and by some accounts that is fewer than Google's massive server count but a bit more than Amazon's.
Facebook's server count is said to have skyrocketed from 30,000 in 2012 to 180,000 just this past August, serving 900 million-plus users. And the social media giant is even putting its considerable weight behind the Open Compute effort to make servers fit better in a rack and draw less power. The list of mega infrastructures also includes Tencent, Baidu and Alibaba, and the roster goes on and on.
Even more jaw-dropping is that almost 99.9% of these hyperscale infrastructures are built with servers featuring direct-attached storage. That's right – the servers do the computing and store the data. In other words, no special, dedicated storage gear. Yes, your Facebook photos, your SkyDrive personal cloud and all the content you use for entertainment, on-demand video and gaming data are stored inside the server.
Direct-attached storage reigns supreme
Everything in these infrastructures – compute and storage – is built out of x86-based servers with storage inside. What's more, direct-attached storage is growing many times faster than any other storage deployment in IT. Rising deployments of cloud, or cloud-like, architectures are behind much of this expansion.
The prevalence of direct-attached storage is not unique to hyperscale deployments. Large IT organizations are looking to reap the rewards of creating similar on-premises infrastructures. The benefits are impressive: Build one kind of infrastructure (server racks), host anything you want (any of your properties), and scale easily when you need to. TCO is much lower than for infrastructures relying on network storage or SANs.
With direct-attached storage you no longer need dedicated appliances for your database tier, your email tier, your analytics tier, your EDA tier. All of that can be hosted on scalable, shared-nothing infrastructure. And just as with hyperscale, the storage is all in the server. No SAN storage required.
Open Compute, OpenStack and software-defined storage drive DAS growth
Open Compute is part of the picture. A recent Open Compute show I attended was mostly sponsored by hyperscale customers/suppliers, and many big-bank IT folks attended. But Open Compute isn't the only initiative driving growing deployments of direct-attached storage. So are software-defined storage and OpenStack. Big application vendors such as Oracle, Microsoft, VMware and SAP are also on board, providing solutions that support server-based storage/compute platforms that are easy and cost-effective to deploy, maintain and scale, and need no external storage (or SAN, including all-flash arrays).
So if you are a network-storage or SAN manufacturer, you have to be doing some serious thinking (many have already) about how you're going to catch and ride this huge wave of growth.
Tags: Alibaba, Amazon, Baidu, cloud computing, DAS, direct attached storage, enterprise, enterprise IT, Google, hyperscale, Microsoft, Open Compute, OpenStack, Oracle, SAP, Tencent, VMware
I've just been to China. Again. It's only been a few months since I was last there.
I was lucky enough to attend the 5th China Cloud Computing Conference at the China National Convention Center in Beijing. You probably have not heard of it, but it's an impressive conference. It's "the one" for the cloud computing industry. It was a unique view for me – more of an inside-out view of the industry. Everyone who's anyone in China's cloud industry was there. Our CEO, Abhi Talwalkar, had been invited to keynote the conference, so I tagged along.
First, the air was really hazy, but I don't think the locals considered it that bad. The US consulate iPhone app said the particulates were in the very unhealthy range. Imagine looking across the street. Sure, you can see the building there, but the next one? Not so much. Look up. Can you see past the 10th floor? No, not really. The building disappears into the smog. That's what it was like at the China National Convention Center, which is part of the same Olympics complex as the famous Bird's Nest stadium: http://www.cnccchina.com/en/Venues/Traffic.aspx
I had a fantastic chance to catch up with a university friend who has been living in Beijing since the '90s and is now a venture capitalist. It's amazing how almost 30 years can disappear and you pick up where you left off. He sure knows how to live. I was picked up in his private limo and whisked off to a very well-known restaurant across the city, where we had a private room and private waitress. We even had some exotic, special dishes that needed to be ordered at least a day in advance. Wow. But we broke Chinese tradition and had imported beer in honor of our Canadian education.
Sizing up China’s cloud infrastructure
The most unusual meeting I attended was an invitation-only session – the Sino-American roundtable on cloud computing. There were just about 40 people in a room – half from the US, half from China. Mostly what I learned is that the cloud infrastructure in China is fragmented, and probably sub-scale. And it's like that for a reason. It was difficult to understand at first, but I think I've made sense of it.
I started asking friends and consultants why, and got some interesting answers. Essentially, different regional governments are trying to capture the cloud "industry" in their locality, so they promote activity, and they promote creation of new tools and infrastructure for it. Why reuse something that's open source and works if you don't have to and you can create high-tech jobs? (That's sarcasm, by the way.) Many technologists I spoke with felt this will hold them back, and that they are probably 3-5 years behind the US. As well, each government-run industry specifies the datacenter and infrastructure needed to be a supplier or ecosystem partner with it, and each is different. The national train system has a different cloud infrastructure from the agriculture department, and from the shipping authority, etc. And if you do business with them – that is, if you are part of their ecosystem of vendors – then you use their infrastructure. It all spells fragmentation and sub-scale. In contrast, the Web 2.0 / social media companies seem to be doing just fine.
Baidu was also showing off its open rack. It's an embodiment of the Scorpio V1 standard, which was jointly developed with Tencent, Alibaba and China Telecom. Baidu views this as a first experiment, and is looking forward to V2, which will be a much more mature system.
I was also lucky to have personal meetings with general managers, chief architects and effective CTOs of the biggest cloud companies in China. What did I learn? They are all at an inflection point. Many of the key technologists have experience at American Web 2.0 companies, so they're able to evolve quickly, leveraging their industry knowledge. They're all working to build or grow their own datacenters, their own infrastructure. And they're aggressively expanding products, not just users, so they're getting a compound growth rate.
Here's a little of what I learned. In general, there is a trend to try to simplify infrastructure, harmonize divergent platforms, and deploy more infrastructure by spending less on each unit. (In general, they don't make as much per user as American companies, but they have more users.) As a result they are more cost-focused than US companies. And they are starting to put more emphasis on operational simplicity in general. As one GM described it to me: "Yes, techs are inexpensive in China for maintenance, but more often than not they make mistakes that impact operations." So we (LSI) will be focusing more on simplifying management and maintenance for them.
Baidu's biggest Hadoop cluster is 20k nodes. I believe that's as big as Yahoo's – and Yahoo is the originator of Hadoop. Baidu has a unique use profile for flash – it's not like the hyperscale datacenters in the US. But Baidu is starting to consume a lot. Like most other hyperscale datacenters, it is working on storage erasure coding across servers, racks and datacenters, and it is trying to make a unified namespace across everything. One of its main interests is architecture at the datacenter level, harmonizing the various platforms and looking for the optimum at the datacenter level. In general, Baidu is very proud of the advances it has made, it has real confidence in its vision and route forward, and from what I heard, its architectural ambitions are big.
JD.com (which used to be 360buy.com) is the largest direct ecommerce company in China and had (only) about $10 billion (US) in revenue last year, with 100% CAGR. As the GM there said, its growth has to slow sometime, or in 5 years it'll be the biggest company in the world. I think it is the closest equivalent to Amazon there is out there, and it has similar ambitions. It is in the process of transforming to a self-built, self-managed datacenter infrastructure. It is a company I am going to keep my eyes on.
Tencent is expanding into some interesting new businesses. Sure, people know about the Tencent cloud services that the Chinese government will be using, but Tencent also has some interesting and unique cloud services coming. Let's just say even I am interested in using them. And of course, while Tencent is already the largest Web 2.0 company in China, its new services promise to push it to new scale and new markets.
Extra! Extra! Read all about it …
And then there was press. I had a very enjoyable conversation with Yuan Shaolong, editor at WatchStor, that I think ran way over. Amazingly – we discovered we have the same favorite band, even half a world away from each other. The results are here, though I'm not sure if Google Translate messed a few things up, or if there was some miscommunication, but in general, I think most of the basics are right: http://translate.google.com/translate?hl=en&sl=zh-CN&u=http://tech.watchstor.com/storage-module-144394.htm&prev=/search%3Fq%3Drobert%2Bober%2BLSI%26client%3Dfirefox-a%26rls%3Dorg.mozilla:en-US:official%26biw%3D1346%26bih%3D619
I just keep learning new things every time I go to China. I suspect it has as much to do with how quickly things are changing as with new stuff to learn. So I expect it won't be too long until I go to China, again…
Tags: Abhi Talwalkar, Alibaba, Amazon, Baidu, China, China Cloud Computing Conference, China National Convention Center, China Telecom, datacenter, Hadoop, hyperscale, JD.com, WatchStor, web 2.0, Yahoo
I was lucky enough to get together for dinner and beer with old friends a few weeks ago. Between the 4 of us, we've been involved in or responsible for a lot of stuff you use every day, or at least know about.
Supercomputers, minicomputers, PCs, Macs, Newton, smart phones, game consoles, automotive engine controllers and safety systems, secure passport chips, DRAM interfaces, netbooks, and a bunch of processor architectures: Alpha, PowerPC, SPARC, MIPS, StrongARM/XScale, x86 64-bit, and a bunch of other ones you haven't heard of (um – most of those are mine, like TriCore). Basically, if you drive a European car, travel internationally, use the Internet, play video games, or use a smart phone, well… you're welcome.
Why do I tell you this? Well – first, I'm name dropping – I'm always stunned I can call these guys friends and be their peers. But more importantly, we've all been in this industry as architects for about 30 years. Of course our talk turned to what's going on today. And we all agree that we've never seen more changes – inflections – than the raft unfolding right now. Maybe it's pressure from the recession, or maybe unnaturally pent-up need for change in the ecosystem, but change there is.
Changes in who drives innovation; what's needed; the companies on top and on bottom at every point in the food chain; who competes with whom; how workloads have shifted from compute to dataflow; how software has moved to open source; how abstracted code now is from processor architecture; how individual and enterprise customers have been revolting against the "old" ways, old vendors and old business models; what the architectures look like; how processors communicate; how systems are purchased; and what fundamental system architectures look like. But not much besides that…
OK – so if you're an architect, that's as exciting as it gets (you can hear it in my voice – right?), and it makes for a lot of opportunities to innovate and create new or changed businesses. Because innovation is so often at the intersection of changing ways of doing things. We're at a point where the changes are definitely not done yet. We're just at the start. (OK – now try to imagine a really animated 4-way conversation over beers at the Britannia Arms in Cupertino… Yea – exciting.)
I'm going to focus on just one sliver of the market – but it's important to me – and that's enterprise IT. I think the changes are as much about business models as technology.
I'll start in a strange place. Hyperscale datacenters (think social media, search, etc.) and the scale of deployment change the optimization point. Most of us are starting to get comfortable with the rack as the new purchase quantum. And some of us are comfortable with the pod or container as the new purchase quantum. But the hyperscale datacenters work more with the datacenter as the quantum. By looking at it that way, they can trade off the cost of power, real estate, bent sheet metal, network bandwidth, disk drives, flash, processor type and quantity, memory amount, where work gets done, and what applications are optimized for. In other words, they shifted from looking for local optima to looking for the global optimum. I don't know about you, but when I took operations research in university, I learned there was an unbelievable difference between the two – and the global optimum was the one you wanted…
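The operations-research point can be made concrete with a toy model. Every number below is invented purely for illustration – the point is only that the component cheapest to buy is rarely the one that wins on total work per dollar:

```python
# Toy illustration (all numbers invented): optimizing each purchase for
# lowest acquisition cost ("local optima") vs. optimizing total work per
# dollar spent over the equipment's life ("global optimum").

configs = [
    # name,                capex_$,  power_$/yr,  work_units/yr
    ("cheap parts",         2000,     600,         100),
    ("extra flash",         2600,     550,         220),
    ("extra flash + DRAM",  3100,     540,         260),
]

def work_per_dollar(capex, power, work, years=3):
    # Global metric: all work done, divided by all money spent.
    total_spend = capex + power * years
    return work * years / total_spend

# Local view: just pick the cheapest box to buy.
local_pick = min(configs, key=lambda c: c[1])

# Global view: pick the best work per total dollar spent.
global_pick = max(configs, key=lambda c: work_per_dollar(*c[1:]))

print(local_pick[0])   # cheapest to acquire
print(global_pick[0])  # most work for the money
```

With these made-up numbers the cheapest server loses on the global metric even though it wins every line-item comparison – which is the whole argument for optimizing at the datacenter quantum.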
Hyperscale datacenters buy enough (top 6 are probably more than 10% of the market today) that 1) they need to determine what they deploy very carefully on their own, and 2) vendors work hard to give them what they need.
That means innovation used to be driven by OEMs, but now it's driven by hyperscale datacenters – and it's driven hard. That global optimum? It's work/$ spent. That's global work, and global spend. It's OK to spend more, even way more, on one thing if overall you get more done for the dollars you spend.
Thatâ€™s why the 3 biggest consumers of flash in servers are Facebook, Google, and Apple, with some of the others not far behind. You want stuff, they want to provide it, and flash makes it happen efficiently. So efficiently they can often give that service away for free.
Hyperscale datacenters have started to publish their cost metrics, open up their architectures (like Open Compute), and open up their software (like Hadoop and derivatives). More to the point, services like Amazon have put a very clear dollar value on services. And it's shockingly low.
Enterprises have looked at those numbers. Hard. That's catalyzed a customer revolt against the old way of doing things – the old way of buying and billing. OEMs and ISVs are creating lots of value for enterprise, but not that much. They've been innovating around "stickiness" and "lock-in" (yea – those really are industry terms) for too long, while hyperscale datacenters have been focused on getting stuff done efficiently. The money the datacenters save per unit just means they can deploy more units and provide better services.
That revolt is manifesting itself in 2 ways. The first is seen in the quarterly reports of OEMs and ISVs. Rumors of IBM selling its X-series to Lenovo, Dell going private, Oracle trying to shift business, HP talking of the "new style of IT"… The second is that enterprises are looking to emulate hyperscale datacenters as much as possible, and deploy private cloud infrastructure. And as often as not, those will be running some of the same open source applications and file systems as the big hyperscale datacenters use.
Where are the hyperscale datacenters leading them? It's a big list of changes, and they're all over the place.
But they're also looking at a few different things. For example, global-namespace NAS file systems. Personally? I think this one's a mistake. I like the idea of file systems/object stores, but the network interconnect seems like a bottleneck. Storage traffic is shared with network traffic, creating network spine bottlenecks and consistency performance bottlenecks between the NAS heads, and – let's face it – people usually skimp on the number of 10GE ports on the server and in the top-of-rack switch. A typical SAS storage card now has 8 x 12G ports – that's 96G of bandwidth. Will servers have 10 x 10G ports? Yea. I didn't think so either.
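For the skeptical, the bandwidth math in that paragraph works out like this (the 8 ports and 12G per port come straight from the text; the rest is simple arithmetic):

```python
import math

# Back-of-envelope check: a typical SAS storage card with 8 x 12Gb/s
# ports vs. the 10GbE ports a server would need to match it over the network.

sas_ports, sas_gbps = 8, 12
sas_bandwidth_gbps = sas_ports * sas_gbps            # local storage bandwidth

eth_gbps = 10
eth_ports_to_match = math.ceil(sas_bandwidth_gbps / eth_gbps)

print(sas_bandwidth_gbps)   # 96
print(eth_ports_to_match)   # 10 -> the "10 x 10G ports" nobody actually deploys
```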
Anyway – all this is not academic. One Wall Street bank shared with me that – hold your breath – it could save 70% of its spend going this route. The bank was shocked. So was I, because at first blush this seems absurd – not possible. That's how I reacted. I laughed. But… The systems are simpler and less costly to make. There is simply less there to make or ship than OEMs force into the machines for uniqueness and "value." They are purchased from much lower-margin manufacturers. They have massively reduced maintenance costs (there's less to service, and, well, no OEM service contracts). And also important – some of the incredibly expensive software licenses are flipped to open source equivalents. Net savings of 70%. Easy. Stop laughing.
Disaggregation: Or in other words, Pooled Resources
But probably the most important trend coming from all of this is what server manufacturers are calling "disaggregation" (hey – you're ripping apart my server!) but architects are more descriptively calling pooled resources.
First – the intent of disaggregation is not to rip the parts of a server to pieces to get the lowest pricing on the components. No. If you're buying by the rack anyway – why not package so you can put like with like? Each part has its own life cycle, after all. CPUs are 18 months. DRAM is several years. Flash might be 3 years. Disks can be 5 to 7 years. Networks are 5 to 10 years. Power supplies are… forever? Why not replace each on its own natural failure/upgrade cycle? Why not make enclosures appropriate to the technology they hold? Disk drives need solid, vibration-free mechanical enclosures of heavy metal. Processors need strong cooling. Flash wants to run hot. DRAM cool.
Second – pooling allows really efficient use of resources. Systems need slush resources. What happens to a system that uses 100% of physical memory? It slows down a lot. If a database runs out of storage? It blue-screens. If you don't have enough network bandwidth? The result is, every server is over-provisioned for its task. Extra DRAM, extra network bandwidth, extra flash, extra disk drive spindles. If you have 1,000 nodes you can easily strand TBytes of DRAM, TBytes of flash, and a TByte/s of network bandwidth of wasted capacity, all of it always burning power. Worse, if you plan wrong and deploy servers with too little disk or flash or DRAM, there's not much you can do about it. Now think 10,000 or 100,000 nodes… Ouch.
If you pool those things across 30 to 100 servers, you can allocate them as needed to individual servers. Just as importantly, you can configure systems logically, not physically. That means you don't have to be perfect in planning ahead what configurations, and how many of each, you'll need. You have sub-assemblies you slap into a rack and hook up by configuration scripts, and you get efficient resource allocation that can change over time. You need a lot of storage? A little? Higher-performance flash? Extra network bandwidth? Just configure them.
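Here's a quick sketch of why pooling beats fixed provisioning. The per-server demand numbers are invented, but the stranding effect they show is the real point:

```python
# Sketch (invented numbers): fixed per-server flash provisioning strands
# capacity, because every server must be sized for its worst case.
# A shared rack-level pool only has to cover the aggregate demand.

demands_tb = [1, 3, 2, 8, 1, 2, 5, 2]   # flash each server actually needs

# Fixed provisioning: every server gets the worst-case amount.
per_server = max(demands_tb)
fixed_total = per_server * len(demands_tb)
stranded = fixed_total - sum(demands_tb)    # installed but never used

# Pooled: one shared pool sized to aggregate demand plus 20% headroom.
pooled_total = int(sum(demands_tb) * 1.2)

print(fixed_total)    # 64 TB installed the fixed way
print(stranded)       # 40 TB of that is stranded
print(pooled_total)   # 28 TB pooled does the same job
```

The asymmetry only grows with scale: the bigger and more varied the fleet, the more the worst-case sizing costs you.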
Thatâ€™s a big deal.
And of course, this sets the stage for immense pooled main memory – once the next generation of non-volatile memories is ready – probably starting around 2015.
You can't overstate the operational problems associated with different platforms at scale. Many hyperscale datacenters today have around 6 platforms. And since they roll out new versions before the old ones are retired, they often have 3 generations of each. That's 18 distinct platforms, with multiple software revisions of each. That starts to get crazy when you may have 200,000 to 400,000 servers to manage and maintain in a lights-out environment. Pooling resources and allocating them in the field goes a huge way toward simplifying operations.
Alternate Processor Architecture
It wasn't always Intel x86. There was a time when Intel was an upstart in the server business. It was Power, MIPS, Alpha, SPARC… (and before that, IBM mainframes and minis, etc.). Each of the changes was brought on by a changing cost structure. Mainframes got displaced by multi-processor RISC, which gave way to x86.
Today, we have Oracle saying it's getting out of x86 commodity servers and doubling down on SPARC. IBM is selling off its x86 business and doubling down on Power (hey – don't confuse that with PowerPC, which started as an architectural cut-down of Power – I was there…). And of course there is a rash of 64-bit ARM server SOCs coming – with HP and Dell already dabbling in them. What's important to realize is that all of these offerings are focusing on the platform architecture, and how applications really perform in total, not just the processor.
Let me wrap up with an email thread cut/paste from a smart friend – Wayne Nation. I think he summed up some of what's going on well, in a sobering way most people don't even consider.
"Does this remind you of a time, long ago, when the market was exploding with companies that started to make servers out of those cheap little desktop x86 CPUs? What is different this time? Cost reduction and disaggregation? No, cost and disagg are important still, but not new.
A new CPU architecture? No, x86 was “new” before. ARM promises to reduce cost, as did Intel.
Disaggregation enables hyperscale datacenters to leverage vanity-free, but consistent delivery will determine the winning supplier. There is the potential for another Intel to rise from these other companies."
I often think about green, environmental impact, and what we're doing to the environment. One major reason I became an engineer was to leave the world a little better than when I arrived. I've gotten sidetracked a few times, but I've tried to help, even if just a little.
The good people in LSI's EHS (Environment, Health & Safety) team asked me a question the other day about carbon footprint, energy impact, and materials use. Which got me thinking… OK – I know most people at LSI don't really think of us as a "green tech" company. But we are – really. No foolin'. We are having a big impact on the global power consumption and material consumption of the IT industry. And I mean that in a good way.
There are many ways to look at this, from what we enable datacenters to do, to what we enable integrators to do, all the way to hard-core technology improvements and massive changes in what it's possible to do.
Back in 2008 I got to speak at the AlwaysOn GoingGreen conference. (I was lucky enough to be just after Elon Musk – he's a lot more famous now with Tesla doing so well.)
http://www.smartplanet.com/video/making-the-case-for-green-it/305467 (at 2:09 in video)
The massive consumption of IT equipment – all the ancillary metal, plastic, wiring, etc. that goes with it – consumes energy as it's shipped and moved halfway around the world, and, more importantly, then gets scrapped out quickly. This has been a concern for me for quite a while. I mean – think about that. As an industry we are generating about 9 million servers a year, and about 3 million of those go into hyperscale datacenters. Many are scrapped on a 2-, 3- or 4-year cycle – so in steady state, maybe 1 million to 2 million a year are scrapped. Worse – there is amazing use of energy by that many servers (even as they have advanced the state of the art unbelievably since 2008). And frankly, you and I are responsible for using all that power. Did you know thousands of servers are activated every time you make a Google® query from your phone?
I want to take a look at the basic silicon improvements we make, the impact of disk architecture improvements, SSDs, system efficiency improvements, and also where we're going in the near future with eliminating scrap in hard drives and batteries. In reality, it's the massive pressure on work/$ that has made us optimize everything – being able to do much more work at a lower cost, when a lot of the cost is the energy and material that goes into the products, is what forces our hand. But the result is a real, profound impact on our carbon footprint that we should be proud of.
Sure, we have a general silicon roadmap where each node enables reduced power, even as some standards and improvements actually increase individual device power. For example, our transition from a 28nm semiconductor process to 14nm FinFET can literally cut the power consumption of a chip in half. But that's small potatoes.
How about Ethernet? It's everywhere – right? Did you know servers often have 4 Ethernet ports, and that there are a matching 4 ports on a network switch? LSI pioneered something called Energy Efficient Ethernet (EEE). We're also one of the biggest manufacturers of Ethernet PHYs – the part that drives the cable – and we come standard in everything from personal computers to servers to enterprise switches. The savings are hard to estimate, because they depend very much on how much traffic there is, but you can realistically save Watts per interface link, and there are often 256 links in a rack. 500 Watts per rack is no joke, and in some datacenters it adds up to 1 or 2 MegaWatts.
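If you want to sanity-check the rack number, here's the arithmetic. The 2 Watts saved per link and the 1,000-rack datacenter are my round-number assumptions – as I said, real savings depend heavily on traffic:

```python
# Rough arithmetic behind the Energy Efficient Ethernet savings above.
# watts_saved_per_link is an assumption: actual EEE savings vary a lot
# with link utilization (idle links save the most).

watts_saved_per_link = 2.0
links_per_rack = 256          # server ports plus matching switch ports
racks = 1000                  # an assumed mid-size datacenter

per_rack_watts = watts_saved_per_link * links_per_rack
total_megawatts = per_rack_watts * racks / 1e6

print(per_rack_watts)    # 512 W/rack, in line with "500 Watts per rack"
print(total_megawatts)   # ~0.5 MW; busier assumptions push toward 1-2 MW
```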
How about something a little bigger and more specific? Hard disk drives. Did you know a typical hyperscale datacenter has between 1 million and 1.5 million disk drives? Each one of those consumes about 9 Watts, and most have 2 TBytes of capacity. So for easy math, 1 million drives is about 9 MegaWatts (!?) and about 2 Exabytes of capacity (remember – data is often replicated 3 or more times). Data capacities in these facilities need to grow about 50% per year. So if we did nothing, we would need to go from 1 million drives to 1.5 million drives: 9 MegaWatts goes to 13.5 MegaWatts. Wow! Instead – our high-linearity, low-noise PA and read channel designs are allowing drives to go to 4 TBytes per drive. (Sure, the chip itself may use slightly more power, but that's not the point – what it enables is a profound difference.) So to get that 50% increase in capacity we could actually reduce the number of drives deployed, with a net savings of 6.75 MegaWatts. Consider that an average US home, with air conditioning, uses 1 kiloWatt. That's almost 7,000 homes. In reality – they won't get deployed that way – but it will still be a huge savings. Instead of buying another 0.5 million drives they would buy 0.25 million drives, with a net savings of 2.2 MegaWatts. That's still HUGE! (Way to go, guys!) How many datacenters are doing that? Dozens. So that's easily 20 or 30 MegaWatts globally. Did I say we saved them money too? A lot of money.
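Here's that drive math spelled out step by step, using the same round numbers as above:

```python
# The disk-drive power math from the paragraph above, made explicit.

drives = 1_000_000
watts_per_drive = 9
tb_per_drive = 2

power_mw = drives * watts_per_drive / 1e6                    # 9 MW today
growth = 1.5                                                 # 50%/yr capacity growth

# Option 1: stay on 2 TB drives -> 50% more drives.
drives_same_tech = drives * growth
power_same_tech = drives_same_tech * watts_per_drive / 1e6   # 13.5 MW

# Option 2: move to 4 TB drives -> half the drives for the same capacity.
drives_new_tech = drives * growth * tb_per_drive / 4
power_new_tech = drives_new_tech * watts_per_drive / 1e6     # 6.75 MW

print(power_mw)                           # 9.0
print(power_same_tech)                    # 13.5
print(power_same_tech - power_new_tech)   # 6.75 MW saved vs. standing still
```

That 6.75 MW figure assumes a full fleet refresh onto 4 TB drives; the more realistic incremental-purchase case is the 2.2 MW number in the paragraph above.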
SSDs don't always get the credit they deserve. Yes, they really are fast, and they are awesome in your laptop, but they also end up being much lower power than hard drives. Our controllers were in about half the flash solutions shipped last year. Think tens of millions. If you just assume they were all laptop SSDs (at least half were not), then that's another 20 MegaWatts in savings.
Did you know that in a traditional datacenter, about 30% of the power going into the building is used for air conditioning? It doesn't actually get used on the IT equipment at all, but is used to remove the heat that the IT equipment generates. We design our solutions so they can accommodate 40C ambient inlet air (that's a little over 100F… hot). What that means is that the 30% of power used for the air conditioners disappears. Gone. That's not theoretical, either. Most of the large social media, search engine, web shopping, and web portal companies are using our solutions this way. That's a 30% reduction in the power of storage solutions globally. Again, it's MegaWatts in savings. And mega money savings too.
But let's really get to the big hitters: improved work per server. Yep – we do that. In fact, adding a Nytro™ MegaRAID® solution will almost always give you 4x the work out of a server. It's a slam dunk if you're running a database. You heard me – 1 server doing the work that it previously took 4 servers to do. Not only is that a huge savings in dollars (especially if you pay for software licenses!) but it's a massive savings in power. You can replace 4 servers with 1, saving at least 900 Watts, and that lone server that's left is actually dissipating less power too, because it's actively using fewer HDDs, and using flash for most traffic instead. If you go a step further and use Nytro WarpDrive flash cards in the servers, you can get much more – 6 to 8 times the work. (Yes, sometimes up to 10x, but let's not get too excited.) If you think that's just theoretical again, check your Facebook® account, or download something from iTunes®. Those two services are the biggest users of PCIe® flash in the world. Why? It works cost-effectively. And in case you haven't noticed, those two companies like to make money, not spend it. So again, we're talking about MegaWatts of savings. Arguably on the order of 150 MegaWatts. Yea – that's pretty theoretical, because they couldn't really do the same work otherwise, but still, if you had to do the work in a traditional way, it would be around that.
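The consolidation math is simple. The ~300 Watts per server is my round-number assumption, chosen so the total lines up with the 900 Watts cited above:

```python
# Server consolidation arithmetic: one flash-accelerated server doing
# the work of four. watts_per_server is an assumed round number.

servers_before = 4
watts_per_server = 300
speedup = 4                      # 4x work per server with flash acceleration

servers_after = servers_before // speedup
watts_saved = (servers_before - servers_after) * watts_per_server

print(watts_saved)               # 900 W per consolidated group
```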
Itâ€™s hard to be more precise than giving round numbers at these massive scales, but the numbers are definitely in the right zone. I can say with a straight face we save the world 10â€™s, and maybe even 100â€™s of MegaWatts per year. But no one sees that, and not many people even think about it. Still â€“ Iâ€™d say LSI is a green hero.
Hey â€“ weâ€™re not done by a long shot. Letâ€™s just look at scrap. If you read my earlier post on false disk failure, youâ€™ll see some scary numbers. (http://blog.lsi.com/what-is-false-disk-failure-and-why-is-it-a-problem/ ) A normalÂ hyperscale datacenter can expect 40-60 disks per day to be mistakenly scrapped out. Thatâ€™s around 20,000 disk drives a year that should not have been scrapped, from just one web company. Think of the material waste, shipping waste, manufacturing waste, and eWaste issues. Wow â€“ all for nothing. Weâ€™re working on solutions to that. And batteries.Â Ugly, eWaste, recycle only, heavy metal batteries. They are necessary for RAID protected storage systems. And much of the worldâ€™s data is protected that way â€“ the battery is needed to save meta-data and transient writes in the event of a power failure, or server failure. We ship millions a year. (Sorry, mother earth). But weâ€™re working diligently to make that a thing of the past. And that will also result in big savings for datacenters in both materials and recycling costs.
Can we do more? Sure. I know I am trying to get us the core technologies that will help reduce power consumption, raise capability and performance, and reduce waste. But weâ€™ll never be done with that march of technology. (Which is a good thing if engineering is your careerâ€¦)
I still often think about green, environmental impact, and what we're doing to the environment. And I guess in my own small way, I am leaving the world a little better than when I arrived. I think we at LSI should at least take a moment and pat ourselves on the back for that. You have to celebrate the small victories, you know? Even as the fight goes on.
I want to warn you, there is some thick background information here first. But don't worry. I'll get to the meat of the topic, and it's this: Ultimately, I think that PCIe® cards will evolve into more external, rack-level, pooled flash solutions, without sacrificing all their great attributes of today. This is just my opinion, but other leaders in flash are going down this path too…
I've been working on enterprise flash storage since 2007 – mulling over how to make it work. Endurance, capacity, cost and performance have all been concerns we've grappled with. Of course the flash itself changes as the process nodes change: 60nm, 50nm, 35nm, 24nm, 20nm… and single-level cell (SLC) to multi-level cell (MLC) to triple-level cell (TLC), and all the variants of these "trimmed" for specific use cases. Specified "endurance" has gone from 1 million program/erase (P/E) cycles to 3,000, and in some cases 500.
It's worth pointing out that almost all the "magic" that has been developed around flash was already scoped out in 2007. It just takes a while for a whole new industry to mature. Individual die capacity has increased, meaning fewer die are needed for a solution – and that means less parallel bandwidth for data transfer… And the "requirement" for state-of-the-art single-operation write latency has fallen well below the write latency of the flash itself. (What the…? Yes – we'll talk about that in some other blog. But flash is ~1500µs write latency, while state-of-the-art flash cards are ~50µs.) When I describe the state of the technology, it sounds pretty pessimistic. I'm not. We've overcome a lot.
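How can a card acknowledge writes 30x faster than the flash can program them? The usual trick is committing the write to a power-protected buffer and programming the NAND afterwards. The two latency figures are from the post; the buffer model here is my own simplified assumption, not a description of any specific product.

```python
# Why card-level write latency can be far below raw NAND program latency:
# the host's write is acknowledged once it lands in a power-safe buffer.
NATIVE_PROGRAM_US = 1500   # raw NAND page-program latency (from the post)
BUFFER_COMMIT_US = 50      # card-level acknowledged latency (from the post)

def acked_write_latency(buffered: bool) -> int:
    """Latency the host observes for a single write, in microseconds."""
    return BUFFER_COMMIT_US if buffered else NATIVE_PROGRAM_US

speedup = acked_write_latency(False) / acked_write_latency(True)
print(speedup)   # 30x lower acknowledged latency than the raw flash program
```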
We built our first PCIe card solution at LSI in 2009. It wasn't perfect, but it was better than anything else out there in many ways. We've learned a lot in the years since – both from making them, and from dealing with customers and users – of our own solutions and our competitors'. We're lucky to be an important player in storage, so in general the big OEMs, large enterprises and the hyperscale datacenters all want to talk with us – not just about what we have or can sell, but about what we could have and what we could do. They're generous enough to share what works and what doesn't, what the values of solutions are, and what the pitfalls are too. Honestly? It's the hyperscale datacenters in the lead, both practically and in vision.
If you haven't nodded off to sleep yet, that's a long-winded way of saying: things have changed fast, and, boy, we've learned a lot in just a few years.
Most important thing we've learned…
Most importantly, we've learned that it's latency that matters. No one is pushing the IOPS limits of flash, and no one is pushing the bandwidth limits of flash. But they sure are pushing the latency limits.
PCIe cards are great, but…
We've gotten lots of feedback, and one of the biggest things we've learned is that PCIe flash cards are awesome. They radically change the performance profiles of most applications, especially databases, letting servers run efficiently and multiplying the actual work done by a server 4x to 10x (and in a few extreme cases 100x). So the feedback we get from large users is: "PCIe cards are fantastic. We're so thankful they came along. But…" There's always a "but," right?
It tends to be a pretty long list of frustrations, and they differ depending on the type of datacenter using them. We're not the only ones hearing it. To be clear, none of these is stopping people from deploying PCIe flash… the attraction is just too compelling. But the problems are real, they have real implications, and the market is asking for real solutions.
Of course, everyone wants these fixed without affecting single-operation latency, increasing cost, etc. That's what we're here for though – right? Solve the impossible?
A quick summary is in order. It's not looking good. For a given solution, flash is getting less reliable, there is less bandwidth available at capacity because there are fewer die, we're driving latency way below the actual write latency of the flash, and we're not satisfied with the best solutions we have, for all the reasons above.
If you think these through enough, you start to converge on one basic path. It also turns out we're not the only ones realizing this. Where will PCIe flash solutions evolve over the next 2, 3, 4 years? The basic goals are:
One easy answer would be: that's a flash SAN or NAS. But that's not the answer. Not many customers want a flash SAN or NAS – not for their new infrastructure – but more importantly, all the data is at the wrong end of the straw. The poor server is left sucking hard. Remember – this is flash, and people use flash for latency. Today these SAN-type flash devices have 4x-10x worse latency than PCIe cards. Ouch. You have to suck the data through a relatively low-bandwidth interconnect, after passing through both the storage and network stacks. And there is interaction between the I/O threads of various servers and applications – you have to wait in line for that resource. It's true there is a lot of startup energy in this space. It seems to make sense if you're a startup, because SAN/NAS is what people use today, and there's lots of money spent in that market today. However, it's not what the market is asking for.
Another easy answer is NVMe SSDs. Right? Everyone wants them – right? Well, OEMs at least. Front-bay PCIe SSDs (HDD form factor or NVMe – lots of names) that crowd out your disk drive bays. But they don't fix the problems. The extra mechanicals and form factor are more expensive, and just make replacing the cards every 5 years a few minutes faster. Wow. With NVMe SSDs you can fit fewer HDDs – not good. They also provide uniformly bad cooling, and hard-limit power to 9W or 25W per device. And to protect the storage in these devices, you need enough of them to RAID or otherwise protect. Once you have enough of those for protection, they give you awesome capacity, IOPS and bandwidth – too much, in fact – but that's not what applications need. They need low latency for the working set of data.
What do I think the PCIe replacement solutions of the near future will look like? You need to pool the flash across servers (to optimize bandwidth and resource usage, and allocate appropriate capacity), protect against failures/errors and limit the span of failure, commit writes at very low latency (lower than native flash), and maintain low-latency, bottleneck-free physical links to each server… To me that implies:
That means the performance looks exactly as if each server had multiple PCIe cards, but the capacity and bandwidth resources are shared, and systems can remain resilient. So ultimately, I think that PCIe cards will evolve into more external, rack-level, pooled flash solutions, without sacrificing all their great attributes of today. This is just my opinion, but as I say – other leaders in flash are going down this path too…
What's your opinion?
Tags: DAS, datacenter, direct attached storage, enterprise IT, flash, hard disk drive, HDD, hyperscale, latency, NAS, network attached storage, NVMe, PCIe, SAN, solid state drive, SSD, storage area network
I remember in the mid-1990s the question of how many minutes away from a diversion airport a two-engine passenger jet should be allowed to fly in the event of an engine failure. Staying in the air long enough is one of those high-availability functions that really matters. The Boeing 777 was the first aircraft to enter service with a 180-minute extended operations (ETOPS) certification.1 This meant that longer over-water and remote-terrain routes were immediately possible.
The question was: "Can a two-engine passenger aircraft be as safe as a four-engine aircraft for long-haul flights?" The short answer is yes. Reducing the points of failure from four engines to two, while meeting strict maintenance requirements and maintaining redundant systems, reduces the probability of a failure. The 777 and many other aircraft have proven to be safe for these longer flights. Recently, the 777 received FAA approval for a 330-minute ETOPS rating,2 which allows airlines to offer routes that are longer, straighter and more economical.
What does this have to do with a datacenter? It turns out that some hyperscale datacenters house hundreds of thousands of servers, each with its own boot drive. Each of these boot drives is a potential point of failure, which can drive up acquisition and operating costs and the odds of a breakdown. Datacenter managers need to control CapEx, so for the sheer volume of server boot drives they commonly use the lowest-cost 2.5-inch notebook SATA hard drives. The problem is that these commodity hard drives tend to fail more often. This is not a huge issue with only a few servers. But in a datacenter with 200,000 servers, LSI has found through internal research that, on average, 40 to 200 drives fail per week! (2.5″ hard drive, ~2.5- to 4-year lifespan, which equates to a conservative 5% failure rate per year.)
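The weekly failure figure above follows directly from the fleet size and annual failure rate the post gives; the weekly arithmetic is mine.

```python
# Sanity-checking the boot-drive failure numbers.
servers = 200_000               # one boot drive per server
annual_failure_rate = 0.05      # "a conservative 5% failure rate per year"

failures_per_year = servers * annual_failure_rate
failures_per_week = failures_per_year / 52

print(failures_per_week)        # ~192, the top of the "40 to 200" range
```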
Traditionally, a hyperscale datacenter has a sea of racks filled with servers. LSI estimates that, in the majority of large datacenters, at least 60% of the servers (Web servers, database servers, etc.) use a boot drive requiring no more than 40GB of storage capacity, since it performs only boot-up and journaling or logging. For higher reliability, the key is to consolidate these low-capacity drives, virtually speaking. With our Syncro™ MX-B Rack Boot Appliance, we can consolidate the boot drives for 24 or 48 of these servers into a single mirrored array (using LSI MegaRAID® technology), which makes 40GB of virtual disk space available to each server.
Combining all these boot drives into fewer, larger, mirrored drives helps reduce total cost of ownership (TCO) and improves reliability, availability and serviceability. If a rack boot appliance drive fails, an alert is sent to the IT operator. The operator then simply replaces the failed drive, and the appliance automatically copies the disk image from the working drive. The upshot is that operations are simplified, OpEx is reduced, and there is usually no downtime.
Syncro MX-B not only improves reliability by reducing failure points; it also significantly reduces power requirements (up to 40% less in the 24-port version, up to 60% less in the 48-port version) – a good thing for the corporate utility bill and climate change. This, in turn, reduces cooling requirements and helps make hardware upgrades less costly. With the boot drives disaggregated from the servers, there's no need to simultaneously upgrade the drives, which typically are still functional during server hardware upgrades.
In the case of both commercial aircraft and servers, less really can be more (or at least better) in some situations. Eliminating excess can make the whole system simpler and more efficient.
To learn more, please visit the LSI® Shared Storage Solutions web page: http://www.lsi.com/solutions/Pages/SharedStorage.aspx
I've been travelling to China quite a bit over the last year or so. I'm sitting in Shenzhen right now (if you know Chinese internet companies, you'll know who I'm visiting). The growth is staggering. I've had a bit of a planes-trains-and-automobiles experience this trip, and that's exposed me to parts of China I never would have seen otherwise. Just to accommodate sheer population growth and the modest increase in wealth, there is construction everywhere – a press of people and energy, constant traffic jams, unending urban centers, and most everything is new. Very new. It must be exciting to be part of that explosive growth. What a market. I mean – come on – there are 1.3 billion potential users in China.
The amazing thing for me is the rapid growth of hyperscale datacenters in China, which is truly exponential. Their infrastructure growth has been 200%-300% CAGR for the past few years. It's also fantastic walking into a building in China, say Baidu, and feeling very much at home – just like you walked into Facebook or Google. It's the same young vibe, energy, and ambition to change how the world does things. And it's the same pleasure – talking to architects who are super-sharp, have few technical prejudices, and have very little vanity – just a will to get down to business and solve problems. Polite, but blunt. We're lucky that they recognize LSI as a leader, and are willing to spend time listening to our ideas, and to give us theirs.
Even their infrastructure has a similar feel to the US hyperscale datacenters. The same, only different. ;-)
A lot of these companies are growing revenue at 50% per year, several with 50% gross margin. Those are nice numbers in any country. One has hundreds of billions in revenue. And they're starting to push out of China. So far their pushes into Japan have not gone well, but other countries should be better. They all have unique business models. "We" in the US like to say things like "Alibaba is the Chinese eBay" or "Sina Weibo is the Chinese Twitter"… But that's not true – they all have more hybrid, unique business models, so their datacenter goals, revenue and growth have slightly different profiles. And there are some very cool services that simply are not available elsewhere. (You listening, Apple®, Google®, Twitter®, Facebook®?) But they are all expanding their services, products and user base. Interestingly, there is very little public cloud in China, so there are no real equivalents to Amazon's services or Microsoft's Azure. I have heard about current development of that kind of model with the government as the initial customer. We'll see how that goes.
Hundreds of thousands of servers. They're not the scale of Google, but they sure are the scale of Facebook, Amazon, Microsoft… It's a serious market for an outfit like LSI. Really, it's a very similar scale now to the US market: close to 1 million servers installed among the main 4 players, and exabytes of data (we've blown past mere petabytes). Interestingly, they still use many co-location facilities, but that will change. More important – they're all planning to roughly double their infrastructure in the next 1-2 years. They have to – their growth rates are crazy.
Often 5 or 6 distinct platforms, just like the US hyperscale datacenters: database platforms, storage platforms, analytics platforms, archival platforms, web server platforms… But they tend to be a little more like the racks of traditional servers that enterprises buy, with integrated disk bays, still a lot of 1G Ethernet, and still mostly from established OEMs. In fact, I just ran into one OEM's American GM, whom I happen to know, in Tencent's offices today. The typical servers have 12 HDDs in drive bays, though they are starting to look at SSDs as part of the storage platform. They do use PCIe® flash cards in some platforms, but the performance requirements are not as extreme as you might imagine. Reasonably low latency and consistent latency are the premium they are looking for from these flash cards – not maximum IOPS or bandwidth – very similar to their American counterparts. I think hyperscale datacenters are sophisticated in understanding what they need from flash, and in not requiring more than that. Enterprise could learn a thing or two.
Some server platforms have RAIDed HDDs, but most are direct-map drives using a high-availability (HA) layer across the server center – Hadoop® HDFS or self-developed Hadoop-like platforms. Some have also started to deploy microserver archival "bit buckets": a small ARM® SoC with 4 HDDs totaling 12 TBytes of storage, giving densities like 72 TBytes of file storage in 2U of rack. While I can only find about 5,000 of those in China – the first-generation experiments – it's the start of a growing wave of archival solutions based on lower-performance ARM servers. The feedback is clear: they're not perfect yet, but the writing is on the wall. (If you're wondering about the math, that's 5,000 x 12 TBytes = 60 Petabytes…)
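The density math above, spelled out; node counts and capacities are the post's figures, and the per-drive size is implied by them.

```python
# Microserver "bit bucket" density arithmetic.
hdds_per_node = 4
tb_per_hdd = 3                      # 4 HDDs totaling 12 TB implies 3 TB each
tb_per_node = hdds_per_node * tb_per_hdd

tb_per_2u = 72                      # quoted rack density
nodes_per_2u = tb_per_2u // tb_per_node

deployed_nodes = 5_000
total_pb = deployed_nodes * tb_per_node / 1_000

print(nodes_per_2u)                 # 6 microserver nodes per 2U
print(total_pb)                     # 60 PB, matching the post's math
```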
Yes, it's important – maybe more so than we're used to. It's harder to get licenses for power in China, so it's really important to stay within the power envelope your datacenter has. You simply can't get more. That means they have to deploy solutions that do more in the same power profile, especially as they move out of co-located datacenters into private ones. Annually: 50% more users supported, more storage capacity, more performance, more services, all in the same power. That's not so easy. I would expect solar power in their future, just as Apple has done.
Here's where it gets interesting. They are developing a cousin to OpenCompute called Scorpio. It's Tencent, Alibaba, Baidu, and China Telecom driving the standard so far. The goals are similar to OpenCompute's, but more aligned to standardized sub-systems that can be co-mingled from multiple vendors. There is some harmonization and coordination between OpenCompute and Scorpio, and in fact the Scorpio companies are members of OpenCompute. But where OpenCompute is trying to change the complete architecture of scale-out clusters, Scorpio is much more pragmatic – some would say less ambitious. They've finished version 1 and rolled out about 200 racks as a "test case" to learn from. Baidu was the guinea pig. That's around 6,000 servers. They weren't expecting more from version 1; they're trying to learn. They've made mistakes, learned a lot, and are working on version 2.
Even if it's not exciting, it will have an impact because of the sheer size of the deployments these companies are getting ready to roll out in the next few years. They see the progression as: 1) they were using standard equipment; 2) they're experimenting and learning from trial runs of Scorpio versions 1 and 2; and then 3) they'll work on new architectures that are efficient, powerful, and different.
Information is pretty sketchy if you are not one of the member companies or one of their direct vendors. We were just invited to join Scorpio by one of the founders, and would be the first group outside of China to do so. If that all works out, I'll have a much better idea of the details, and hopefully can influence the standards to be better for these hyperscale datacenter applications. Between OpenCompute and Scorpio we'll be seeing a major shift in the industry – a shift that will undoubtedly be disturbing to a lot of current players. It makes me nervous, even though I'm excited about it. One thing is sure: just as server market volume is migrating from traditional enterprise to hyperscale datacenters (25-30% of the server market and growing quickly), we're starting to see a migration to Chinese hyperscale datacenters from US-based ones. They have to grow just to stay still. I mean – come on – there are 1.3 billion potential users in China…
Tags: Alibaba, Amazon, Apple, ARM, Baidu, China, China Telecom, datacenter, Facebook, Google, Hadoop, hard disk drive, HDD, hyperscale, Microsoft, OpenCompute, Scorpio, Shenzhen, Sina Weibo, solid state drive, SSD, Tencent, Twitter
I've spent a lot of time with hyperscale datacenters around the world trying to understand their problems – and I really don't care what area those problems are in, as long as they're important to the datacenter. What is the #1 real problem for many hyperscale datacenters? It's something you've probably never heard about, and probably have not even thought about. It's called false disk failure. Some hyperscale datacenters have crafted their own solutions – but most have not.
Why is this important, you ask? Many large datacenters today have 1 million to 4 million hard disk drives (HDDs) in active operation. In anyone's book that's a lot. It's also a very interesting statistical sample size of HDDs. Hyperscale datacenters get great pricing on HDDs – probably better than OEMs get, and certainly better than the $79 for buying 1 HDD at your local Fry's store. So you would imagine that if a disk fails, no one cares – they're cheap and easy to replace. But the burden of a failed disk is much more than the raw cost of the disk:
Let's put some scale to this problem, and you'll begin to understand the issue. One modest-size hyperscale datacenter has been very generous in sharing its real numbers. (When I say modest, they are ~1/4 to 1/2 the size of many other hyperscale datacenters, but they are still huge – more than 200k servers.) Other hyperscale datacenters I have checked with say: yep, that's about right. And one engineer I know at an HDD manufacturer said, "Wow – I expected worse than that. That's pretty good." To be clear – these are very good HDDs they are using; it's just that the numbers add up.
The raw data:
RAIDed SAS HDDs
Non-RAIDed (direct map) SATA drives behind HBAs
What's interesting is the relative failure rate of SAS drives vs. SATA. It's about an order of magnitude worse in SATA drives than SAS. Frankly, some of this is due to protocol differences. SAS allows far more error-recovery capabilities, and because SAS drives also tend to be more expensive, I believe manufacturers invest in slightly higher-quality electronics and components. I know the electronics we ship into SAS drives are certainly more sophisticated than for SATA drives.
False fail? What? Yes, that's an interesting topic. It turns out that about 40% of the time with SAS and about 50% of the time with SATA, the drive didn't actually fail. It just lost its marbles for a while. When they pull the drive out and put it into a test jig, everything is just fine. And more interesting, when they put the drive back into service, it is no more statistically likely to fail again than any other drive in the datacenter. Why? No one knows. I have my suspicions, though.
I used to work on engine controllers. That's a very paranoid business. If something goes wrong and someone crashes, you have a lawsuit on your hands. If a controller needs a recall, that's millions of units to replace, with a multi-hundred-dollar module and hundreds of dollars in labor for each one replaced. No one is willing to take that risk. So we designed very carefully to handle soft errors in memory and registers. We incorporated ECC like servers use, background code checksums and scrubbing, and all sorts of proprietary techniques, including watchdogs and super-fast self-resets that could get operational again in less than a full revolution of the engine. Why? The events were statistically rare. The average controller might see 1 or 2 events in its lifetime, and a turn of the ignition would reset that state. But the events do happen, and so do recalls and lawsuits… HDD controllers don't have these protections, which is reasonable. It would be an inappropriate cost burden at their price point.
You remember the Toyota Prius accelerator problems? I know that controller was not protected against soft errors. And the source of the problem remained a "mystery." Maybe it just lost its marbles for a while? A false fail, if you will. Just sayin'.
Back to HDDs. False fail is especially frustrating, because half the HDDs actually didn't need to be replaced. All the operational costs were paid for no reason. The disk just needed a power-cycle reset. (OK, that introduces all sorts of complex management by the RAID controller or application to handle that 10-second power-reset cycle and the application traffic created in that time – but we can handle that.)
Daily, this datacenter has to:
And half of that is for no reason at all.
First – why not rebuild the disk if it's RAIDed? Usually hyperscale datacenters use clustered applications. A traditional RAID rebuild drives the server performance to ~50%, and for a 2TByte drive under heavy application load (the definition of a hyperscale datacenter) it can truly take up to a week. 50% performance for a week? In a cluster that means the overall cluster is running at ~50% performance. Say 200 nodes in a cluster – that means you just lost ~100 nodes' worth of work, or 50% of cluster performance. It's much simpler to just take the node with the failed drive offline, get 99.5% cluster performance, and operationally redistribute the workload across multiple nodes (because you have replicated data elsewhere). But after rebuild, the node will have to be re-synced or re-imaged. There are ways to fix all this; we'll talk about them another day. Or you can simply run direct-mapped storage and unmount the failed drive.
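The cluster arithmetic above can be sketched as follows. The key assumption, taken from the post's argument, is that a tightly coupled cluster tends to run at the pace of its slowest node; the node count and the 50% rebuild penalty are the post's figures.

```python
# Rebuild in place vs. take the node offline: cluster-level throughput.
nodes = 200
rebuild_node_perf = 0.5    # a node doing a traditional RAID rebuild

# Option 1: rebuild in place - the whole cluster is throttled to the
# pace of the one slow node.
cluster_perf_rebuilding = rebuild_node_perf            # ~50% of cluster

# Option 2: take the failed node offline and redistribute its work
# to the replicated copies elsewhere.
cluster_perf_offline = (nodes - 1) / nodes             # 99.5% of cluster

print(cluster_perf_rebuilding, cluster_perf_offline)
```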
Next – why replicate data over the network, and why is that a big deal? For geographic redundancy (say, a natural disaster at one facility) and regional locality, hyperscale datacenters need multiple data copies. Often 3 copies, so they can do double duty as high-availability copies – or in the case of some erasure coding, 2.2 to 2.5 copies (yes, weird math – how do you have 0.5 of a copy?…). When you lose one copy, you are down to 2, possibly 1. You need to get back to a reliable number again. Fast. Customers are loyal because of your perfect data retention. So you need to replicate that data and redistribute it across the datacenter on multiple servers. That's network traffic, and possibly congestion, which affects other aspects of datacenter operations. In this datacenter it's about 50 hours of 10G Ethernet traffic every day.
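A rough sketch of where a daily figure like "about 50 hours of 10G Ethernet traffic" could come from. The drive size and daily failure count are the post's figures; the line-rate arithmetic is mine, and real re-replication involves multiple replica targets and protocol overhead, so treat this as illustration only.

```python
# Order-of-magnitude estimate of daily re-replication traffic.
failed_drives_per_day = 50          # midpoint of the 40-60/day figure
tb_per_drive = 2                    # "for a 2TByte drive"
link_gbps = 10

bytes_per_drive = tb_per_drive * 1e12
link_bytes_per_s = link_gbps * 1e9 / 8          # 1.25 GB/s at line rate

hours_per_drive = bytes_per_drive / link_bytes_per_s / 3600
hours_per_day = failed_drives_per_day * hours_per_drive

print(round(hours_per_drive, 2))    # ~0.44 h to re-replicate one drive once
print(round(hours_per_day, 1))      # ~22 h for single copies; copying to
                                    # multiple replicas lands near 50 h/day
```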
To be fair, there is a new standard in SAS interfaces that will facilitate resetting a disk in situ. And there is the start of discussion of the same around SATA – but that's more problematic. Whatever the case, it will be years before the ecosystem is in place to handle the problems this way.
What's that mean to you?
Well, you can expect something like 1/100 of your drives to really fail this year. And you can expect another 1/100 of your drives to "fail" this year without actually being failed. You'll still pay all the operational overhead of a failed drive for nothing – rebuilds, disk replacements, management interventions, scheduled downtime/maintenance time, and the OEM replacement price for that drive – what, $600 or so?… Depending on your size, that's either a don't-care or a big deal. There are ways to handle this, and they're not expensive – much less than the disk carrier you already pay for so you can replace that drive – and it can be handled transparently: just a log entry, without any performance hiccups. You just need to convince your OEM to carry the solution.
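Putting the expectation above in numbers: real failures and false failures each run about 1 in 100 drives per year. The fleet size is an illustrative assumption; the replacement price is the post's estimate.

```python
# Annual cost of false failures in a hypothetical fleet.
drives = 100_000
real_fail_rate = 0.01          # ~1/100 truly fail per year
false_fail_rate = 0.01         # ~1/100 "fail" but test fine afterwards

replacement_cost = 600         # the post's OEM replacement price estimate

wasted_dollars = drives * false_fail_rate * replacement_cost
print(wasted_dollars)          # $600,000/year spent on drives that were fine
```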
Anyone who knows me knows I like to ask "why?" Maybe I never outgrew the 2-year-old phase. But I also like to ask "why not?" Every now and then you need to rethink everything you know, top to bottom, because something might have changed.
I've been talking to a lot of enterprise datacenter architects and managers lately. They're interested in using flash in their servers and storage, but they can't get over all the "problems."
The conversation goes something like this: Flash is interesting, but it's crazy expensive per bit. The prices have to come way down – after all, it's just a commodity part. And I have these $4k servers – why would I put an $8k PCIe card in them? That makes no sense. And the stuff wears out, which is an operational risk for me – disks last forever. Maybe flash isn't ready for prime time yet.
These arguments are reasonable if you think about flash as a disk replacement and don't think through all the follow-on implications.
In contrast, I've also been spending a lot of time with the biggest datacenters in the world – you know, the ones we all know by brand name. They have at least 200k servers, and anywhere from 1.5 million to 7 million disks. They notice CapEx and OpEx a lot. Multiply anything by that much and it's noticeable. (My simple example: add 1 LED to each server, and with 200k servers that adds up to 26K watts plus $10K in LED cost.) They are very scientific about cost. More specifically, they measure work/$ very carefully. Anything that increases work or reduces $ is very interesting – doing both at once is the holy grail. Already one of those datacenters is completely diskless. Others are part way there, or have the ambition of getting there. You might think they're crazy – how can they spend so much on flash when disks are so much cheaper, and these guys offer their services for free?
When the large datacenters – I call them hyperscale datacenters – measure cost, they're looking at purchase cost, including metal racks and enclosures, shipping, service cost (both parts and human expense), operational disruption overhead and the complexity of managing it, the opportunity cost of new systems vs. old systems that are less efficient, and of course facilities expenses – buildings, power, cooling, people… They try to optimize the mix of these.
Let's look at the arguments against using flash one by one.
Flash is just a commodity part
This is a very big fallacy. It's not a commodity part, and flash is not all the same. The parts you see in cheap consumer devices deserve their price. In the chip industry, it's common to have manufacturing fallout; 3%-10% is reasonable. What's more, the devices come at different performance levels – just look at x86 performance versions of the same design. In the flash business, 100% of the devices are sold, used, and find their way into products. Those cheap consumer products usually get the 3%-10% that would be scrap in other industries. (I was once told – with a smile – "those are the parts we sweep off the floor"…)
Each generation of flash (about 18 months between them) and each manufacturer (there are 5, depending on how you count) have very different characteristics. There are wild differences in erase time, write time, read time, bandwidth, capacity, endurance, and cost. There is no one supplier that is best at all of these, and leadership moves around. More importantly, in a flash system, how you trade these things off has a huge effect on write latency (the #1 factor in work done), latency outliers (consistent operation), endurance or lifespan, power consumption, and solution cost. All flash products are not equal – not by a long shot. Even hyperscale datacenters have different types of solutions for different needs.
It's also important to know that temperature of operation and storage, inter-arrival time of writes, and "over-provisioning" (the amount of capacity hidden for background use and garbage collection) have profound impacts on lifespan and performance.
$8k PCIe card in a $4k server – really?
I am always stunned by this. No one thinks twice about spending more on virtualization licenses than on hardware, or say $50k for a database license to run on a $4k server. It's all about what work you need to accomplish, and what's the best way to accomplish it. It's no joke that in database applications it's pretty easy to get 4x the work from a server with a flash solution inserted. You probably won't get worse than 4x, and may get as much as 10x. On a purely hardware basis, that makes sense: I can have 1 server @ $4k + $8k flash vs. 4 servers @ $4k. I just saved $4k CapEx. More importantly, I saved the service contracts, power, cooling and admin of 3 servers. If I include virtualization or database licenses, I saved another $150k plus the annual service contracts on those licenses. That's easy math. If I worry about users supported rather than work done, I can support as many as 100x users. The math becomes overwhelming. $8k PCIe card in a $4k server? You bet, when I think of work/$.
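The work/$ arithmetic above, written out. Server, card, and license prices are the post's figures, and the 4x multiplier is the post's conservative case.

```python
# CapEx and license math for the 4-to-1 flash consolidation.
server_cost = 4_000
flash_card_cost = 8_000
work_multiplier = 4            # one flash-equipped server does 4 servers' work

before = work_multiplier * server_cost                 # 4 plain servers
after = server_cost + flash_card_cost                  # 1 server + PCIe flash

capex_saved = before - after
print(capex_saved)             # $4,000, as the post says

# Per-server software licenses (virtualization/database) amplify this:
license_per_server = 50_000    # the post's database-license example
license_saved = (work_multiplier - 1) * license_per_server
print(license_saved)           # $150,000 in licenses alone
```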
The stuff wears out & disks last forever
It's true that car tires wear out, and depending on how hard you use them that might happen faster or slower. But tires are one of the most important contributors to a car's performance – acceleration, stopping, handling – you couldn't do any of that without them. The only time you really have catastrophic failure with tires is when you wear them way past any reasonable point – until they are bald and should long since have been replaced. Flash is like that – you get lots of warning as it wears out, and lots of opportunity to plan operationally and replace the flash without disruption. You might need to replace it after 4 or 5 years, but you can do so gracefully, on your schedule. Disks can last "forever," but they also fail randomly and often.
Reliability statistics across millions of hard drives show that somewhere around 2.5% fail annually – and that's for first-quality drives. Those are unpredicted, catastrophic failures. Depending on your storage system, each one means rebuilding or re-replicating terabytes of data, a corresponding degradation in performance (which can completely upset the load balancing of a cluster of 20 to 200 other nodes), potential network traffic overhead, and a physical service event that has to be handled manually and fairly quickly. And really – how often does an admin want to take the risk of physically replacing a drive while a system is running? Just one mistake by your tech and it's all over… Operationally, flash is far better: less disruptive, predictable, lower cost, and the follow-on implications are much simpler.
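To see what that ~2.5% annual failure rate means operationally, here is a minimal sketch. The cluster size and drives-per-node are hypothetical round numbers, not from the text:

```python
# Expected catastrophic drive failures per year, using the ~2.5% annual
# failure rate cited above. Cluster dimensions are hypothetical examples.

AFR = 0.025  # annual failure rate per drive

def expected_failures(drives: int, years: float = 1.0) -> float:
    """Expected number of unpredicted drive failures over a period."""
    return drives * AFR * years

# A modest 200-node cluster with 12 drives per node:
drives = 200 * 12  # 2,400 drives
print(f"~{expected_failures(drives):.0f} expected drive failures per year")
```

That works out to more than one failure a week – each one a rebuild, a performance dip, and a manual service event.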
Crazy expensive $/bit
OK – this argument doesn't seem so relevant anymore. Even so, in most cases you can't use much of the disk capacity you have. It will be stranded, because you need to keep spare space free as databases and the like grow; if a database runs out of space, the result is catastrophic. And if you are driving a system hard, you often don't have the bandwidth left to actually access that extra capacity. It's common to use only half the available capacity of drives.
Caching solutions change the equation as well. You can spend money on flash for its performance characteristics, and shift disk-drive spend to fewer, higher-capacity, slower, more power-efficient drives for bulk capacity. Often, for the same or similar overall storage spend, you can have the same capacity at 4x the system performance – and the space, power and cooling that system consumes drop dramatically.
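The spend shift described above can be illustrated with a rough sketch. All prices and drive capacities here are hypothetical round numbers chosen only to show the shape of the trade-off:

```python
# Illustrative spend shift: all fast disks vs flash cache + big slow disks.
# Every price and capacity below is a hypothetical example, not a quote.

FAST_DISK = {"cost": 300, "tb": 0.6}   # small, fast, expensive spindle
BIG_DISK  = {"cost": 250, "tb": 4.0}   # big, slow, power-efficient spindle
FLASH_CACHE_COST = 8_000               # flash cache absorbing the hot I/O

budget = 20_000

# Option A: spend the whole budget on fast disks.
fast_n = budget // FAST_DISK["cost"]
fast_tb = fast_n * FAST_DISK["tb"]

# Option B: buy the flash cache, spend the rest on big disks for capacity.
big_n = (budget - FLASH_CACHE_COST) // BIG_DISK["cost"]
big_tb = big_n * BIG_DISK["tb"]

print(f"A: {fast_n} fast drives, {fast_tb:.0f} TB raw")
print(f"B: {big_n} big drives + flash cache, {big_tb:.0f} TB raw")
```

Under these example numbers, the cached configuration delivers far more raw capacity from fewer spindles for the same budget, with the flash tier serving the hot working set – which is where the performance multiple comes from.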
Even so, flash is not going to replace large-capacity storage for a long, long time, if ever. Whatever the case, $/bit is simply not the right metric for evaluating flash. Yes, flash is more expensive per bit – but in most operational contexts it more than makes up for that through other savings and work/$ improvements.
So I would argue (and I'm backed up by the biggest hyperscale datacenters in the world) that flash is ready for prime-time adoption. Work/$ is the correct metric, but you need to measure from the application down to the storage bits to get it. It's not correct to think of flash as "just a disk replacement" – it changes the entire balance of a solution stack: application performance, responsiveness and cumulative work; server utilization; power consumption and cooling; maintenance and service; and predictable operational stability. It's not just a small win; it's a big win. It's not a fit yet for large pools of archival storage – though a lot of energy is going into trying to make even that work. So no – enterprise will not go diskless for quite a while, but it is understandable why hyperscale datacenters want to go diskless. It's simple math.
Every now and then you need to rethink everything you know top to bottom because something might have changed.
One of the big challenges I see so many IT managers struggling with is how to deal with near-exponential growth of data that has to be stored, accessed, and protected – on IT budgets that are flat or growing far more slowly than storage volumes.
I've found that it doesn't seem to matter whether it is a departmental or small-business datacenter, or a hyperscale datacenter with many thousands of servers. The data growth continues to outpace the budgets.
At LSI we call this disparity between IT budgets and data growth the "data deluge gap."
Of course, smaller datacenters have different issues than the hyperscale datacenters. However, no matter the datacenter size, concerns generally center on TCO. This, of course, includes both CapEx and OpEx for the storage systems.
It's a good feeling to know that we are tackling these datacenter growth and operations issues head-on for many different environments – large and small.
LSI has developed and is starting to provide a new shared DAS (sDAS) architecture that supports sharing of storage across multiple servers. We call it the LSI® Syncro™ architecture, and it really is the next step in the evolution of DAS. Our Syncro solutions deliver increased uptime, help reduce overall costs, increase agility, and are designed for ease of deployment. Because the Syncro architecture is built on our proven MegaRAID® technology, customers can trust that it will work in all types of environments.
Syncro architecture is a very exciting new capability that addresses storage and data protection needs for numerous datacenter environments. Our first product, Syncro MX-B, is targeted at hyperscale datacenter environments, including Web 2.0 and cloud. I will be blogging about that offering in the near future. We will soon be announcing details on our Syncro CS product line, previously known as High Availability DAS, for small and medium businesses, and I will blog about what it can mean for our customers and users.
Both of these initial versions of the Syncro architecture are very exciting, and I really like to watch how datacenter managers react when they find out about these game-changing capabilities.
We say that "with the LSI Syncro architecture you take DAS out of the box and make it sharable and scalable. The LSI Syncro architecture helps make your storage Simple. Smart. On." Our tagline for Syncro is "The Smarter Way to ON.™" It really is.
To learn more, please visit the LSI Shared Storage Solutions web page: http://www.lsi.com/solutions/Pages/SharedStorage.aspx