Scaling compute power and storage in space-constrained datacenters is one of the top IT challenges of our time. With datacenters worldwide pressed to maximize both within the same floor space, the central challenge is increasing density.

At IBM we continue to design products that help businesses meet their most pressing IT requirements, whether it’s optimizing data analytics, data management, the fastest growing workloads such as social media and cloud delivery or, of course, increasing compute and storage density. Our technology partners are a crucial part of our work, and this week at AIS we are teaming with LSI to showcase our new high-density NeXtScale computing platform and x3650 M4 HD server. Both leverage LSI® SAS RAID controllers for data protection, and the x3650 M4 HD server features an integrated leading-edge LSI 12Gb/s SAS RAID controller.

IBM NeXtScale System

NeXtScale System – ideal for HPC, cloud service providers and Web 2.0
The NeXtScale System®, an economical addition to the IBM System® family, maximizes usable compute density by packing up to 84 x86-based systems and 2,016 processing cores into a standard 19-inch rack to enable seamless integration into existing infrastructures. The family also enables organizations of all sizes and budgets to start small and scale rapidly for growth. The NeXtScale System is an ideal high-density solution for high-performance computing (HPC), cloud service providers and Web 2.0.

IBM System x3650 M4 HD

The System x3650 M4 HD, IBM’s newest high-density storage server, is designed for data-intensive analytics or business-critical workloads. The 2U rack server supports up to 62% more drive bays than the System x3650 M4 platform, providing connections for up to 26 2.5-inch HDDs or SSDs. The server is powered by the Intel Xeon processor E5-2600 family and features up to 6 PCIe 3.0 slots and an onboard LSI 12Gb/s SAS RAID controller. This combination gives a big boost to data applications and cloud deployments by increasing the processing power, performance and data protection that are the lifeblood of these environments.

IBM dense storage solutions to help drive data management, cloud computing and big data strategies
Cloud computing and big data will continue to have a tremendous impact on the IT infrastructure and create data management challenges for businesses. At IBM, we think holistically about the needs of our customers and believe that our new line of dense storage solutions will help them design, develop and execute on their data management, cloud computing and big data strategies.


Tags: , , , , , , , , , , , , ,
Views: (4927)

The world according to DAS


You might be surprised to find out how big the infrastructure for cloud and Web 2.0 is. It is mind-blowing. Microsoft has acknowledged packing more than 1 million servers into its datacenters, and by some accounts that is fewer than Google’s massive server count but a bit more than Amazon.  

Facebook’s server count is said to have skyrocketed from 30,000 in 2012 to 180,000 just this past August, serving 900 million plus users. And the social media giant is even putting its considerable weight behind the Open Compute effort to make servers fit better in a rack and draw less power. The list of mega infrastructures also includes Tencent, Baidu and Alibaba and the roster goes on and on.

Even more jaw-dropping is that almost 99.9% of these hyperscale infrastructures are built with servers featuring direct-attached storage. That’s right – they do the computing and store the data. In other words, no special, dedicated storage gear. Yes, your Facebook photos, your Skydrive personal cloud and all the content you use for entertainment, on-demand video and gaming data are stored inside the server.

Direct-attached storage reigns supreme
Everything in these infrastructures – compute and storage – is built out of x-86 based servers with storage inside. What’s more, growth of direct-attached storage is many folds bigger than any other storage deployments in IT. Rising deployments of cloud, or cloud-like, architectures are behind much of this expansion.

The prevalence of direct-attached storage is not unique to hyperscale deployments. Large IT organizations are looking to reap the rewards of creating similar on-premise infrastructures. The benefits are impressive: Build one kind of infrastructure (server racks), host anything you want (any of your properties), and scale if you need to very easily. TCO is much less than infrastructures relying on network storage or SANs.

With direct-attached you no longer need dedicated appliances for your database tier, your email tier, your analytics tier, your EDA tier. All of that can be hosted on scalable, share-nothing infrastructure. And just as with hyperscale, the storage is all in the server. No SAN storage required.

Open Compute, OpenStack and software-defined storage drive DAS growth
Open Compute is part of the picture. A recent Open Compute show I attended was mostly sponsored by hyperscale customers/suppliers. Many big-bank IT folks attended. Open Compute isn’t the only initiative driving growing deployments of direct-attached storage. So is software-defined storage and OpenStack. Big application vendors such as Oracle, Microsoft, VMware and SAP are also on board, providing solutions that support server-based storage/compute platforms that are easy and cost-effective to deploy, maintain and scale and need no external storage (or SAN including all-flash arrays).

So if you are a network-storage or SAN manufacturer, you have to be doing some serious thinking (many have already) about how you’re going to catch and ride this huge wave of growth.




Tags: , , , , , , , , , , , , , , , ,
Views: (1777)

Optimizing the work per dollar spent is a high priority in datacenters around the world. But there aren’t many ways to accomplish that. I’d argue that integrating flash into the storage system drives the best – sometimes most profound – improvement in the cost of getting work done.

Yea, I know work/$ is a US-centric metric, but replace the $ with your favorite currency. The principle remains the same.

I had the chance to talk with one of the execs who’s responsible for Google’s infrastructure last week. He talked about how his fundamental job was improving performance/$. I asked about that, and he explained “performance” as how much work an application could get done. I asked if work/$ at the application was the same, and he agreed – yes – pretty much.

You remember as a kid that you brought along a big brother as authoritative backup? OK – so my big brother Google and I agree – you should be trying to optimize your work/$. Why? Well – it could be to spend less, or to do more with the same spend, or do things you could never do before, or simply to cope with the non-linear expansion in IT demands even as budgets are shrinking. Hey – that’s the definition of improving work/$… (And as a bonus, if you do it right, you’ll have a positive green impact that is bound to be worth brownie points.)

Here’s the point. Processors are no longer scaling the same – sure, there are more threads, but not all applications can use all those threads. Systems are becoming harder to balance for efficiency. And often storage is the bottleneck. Especially for any application built on a database. So sure – you can get 5% or 10% gain, or even in the extreme 100% gain in application work done by a server if you’re willing to pay enough and upgrade all aspects of the server: processors, memory, network… But it’s almost impossible to increase the work of a server or application by 200%, 300% or 400% – for any money.

I’m going to explain how and why you can do that, and what you get back in work/$. So much back that you’ll probably be spending less and getting more done. And I’m going to explain how even for the risk-averse, you can avoid risk and get the improvements.

More work/$ from general-purpose DAS servers and large databases
Let me start with a customer. It’s a bank, and it likes databases. A lot. And it likes large databases even more. So much so that it needs disks to hold the entire database. Using an early version of an LSI Nytro™ MegaRAID® card, it got 6x the work from the same individual node and database license. You can read that as 600% if you want. It’s big. To be fair – that early version had much more flash than our current products, and was much more expensive. Our current products give much closer to 3x-4x improvement. Again, you can think of that as 300%-400%. Again, slap a Nytro MegaRAID into your server and it’s going to do the work of 3 to 4 servers. I just did a web search and, depending on configuration, Nytro MegaRAIDs are $1,800 to $2,800 online. I don’t know about you, but I would have a hard time buying 2 to 3 configured servers + software licenses for that little, but that’s the net effect of this solution. It’s not about faster (although you get that). It’s about getting more work/$.

But you also want to feel safe – that you’re absolutely minimizing risk. OK. Nytro MegaRAID is a MegaRAID card. That’s overwhelmingly the most common RAID controller in the world, and it’s used by 9 of the top 10 OEMs, and protects 10’s to 100‘s of millions of disks every day. The Nytro version adds private flash caching in the card and stores hot reads and writes there. Writes to the cache use a RAID 1 pair. So if a flash module dies, you’re protected. If the flash blocks or chip die wear out, the bad blocks are removed from the cache pool, and the cache shrinks by that much, but everything keeps operating – it’s not like a normal LUN that can’t change size. What’s more, flash blocks usually finally wear out during the erase cycle – so no data is lost.  And as a bonus, you can eliminate the traditional battery most RAID cards use – the embedded flash covers that – so no more annual battery service needed. This is a solution that will continue to improve work/$ for years and years, all the while getting 3x-4x the work from that server.

More work/$ from SAN-attached servers (without actually touching the SAN)
That example was great – but you don’t use DAS systems. Instead, you use a big iron SAN. (OK, not all SANs are big iron, but I like the sound of that expression.) There are a few ways to improve the work from servers attached to SANs. The easiest of course is to upgrade the SAN head, usually with a flash-based cache in the SAN controller. This works, and sometimes is “good enough” to cover needs for a year or two. However, the server still needs to reach across the SAN to access data, and it’s still forced to interact with other servers’ IO streams in deeper queues. That puts a hard limit on the possible gains. 

Nytro XD caches hot data in the server. It works with virtual machines. It intercepts storage traffic at the block layer – the same place LSI’s drivers have always been. If the data isn’t hot, and isn’t cached, it simply passes the traffic through to the SAN. I say this so you understand – it doesn’t actually touch the SAN. No risk there. More importantly, the hot storage traffic never has to be squeezed through the SAN fabric, and it doesn’t get queued in the SAN head. In other words, it makes the storage really, really fast.

We’ve typically found work from a server can increase 5x to 10x, and that’s been verified by independent reviewers. What’s more, the Nytro XD solution only costs around 4x the price of a high-end SAN NIC. It’s not cheap, but it’s way cheaper than upgrading your SAN arrays, it’s way cheaper than buying more servers, and it’s proven to enable you to get far more work from your existing infrastructure. When you need to get more work – way more work – from your SAN, this is a really cost-effective approach. Seriously – how else would you get 5x-10x more work from your existing servers and software licenses?

More work/$ from databases
A lot of hyperscale datacenters are built around databases of a finite size. That may be 1, 2 or even 4 TBytes. If you use Apple’s online services for iTunes or iCloud, or if you use Facebook, you’re using this kind of infrastructure.

If your datacenter has a database that can fit within a few TBytes (or less), you can use the same approach. Move the entire LUN into a Nytro WarpDrive® card, and you will get 10x the work from your server and database software. It makes such a difference that some architects argue Facebook and Apple cloud services would never have been possible without this type of solution. I don’t know, but they’re probably right. You can buy a Nytro WarpDrive for as little as a low-end server. I mean low end. But it will give you the work of 10. If you have a fixed-size database, you owe it to yourself to look into this one.

More work/$ from virtualized and VDI (Virtual Desktop) systems
Virtual machines are installed on a lot of servers, for very good reason. They help improve the work/$ in the datacenter by reducing the number of servers needed and thereby reducing management, maintenance and power costs. But what if they could be made even more efficient?

Wall Street banks have benchmarked virtual desktops. They found that Nytro products drive these results: support of 2x the virtual desktops, 33% improvement in boot time during boot storms, and 33% lower cost per virtual desktop. In a more general application mix, Nytro increases work per server 2x-4x.  And it also gives 2x performance for virtual storage appliances.

While that’s not as great as 10x the work, it’s still a real work/$ value that’s hard to ignore. And it’s the same reliable MegaRAID infrastructure that’s the backbone of enterprise DAS storage.

A real example from our own datacenter
Finally – a great example of getting far more work/$ was an experiment our CIO Bruce Decock did. We use a lot of servers to fuel our chip-design business. We tape out a lot of very big leading-edge process chips every year. Hundreds.  And that takes an unbelievable amount of processing to get what we call “design closure” – that is, a workable chip that will meet performance requirements and yield. We use a tool called PrimeTime that figures out timing for every signal on the chip across different silicon process points and operating conditions. There are 10’s to 100’s of millions of signals. And we run every active design – 10’s to 100’s of chips – each night so we can see how close we’re getting, and we make multiple runs per chip. That’s a lot of computation… The thing is, electronic CAD has been designed to try not to use storage or it will never finish – just /tmp space, but CAD does use huge amounts of memory for the data structures, and that means swap space on the order of TBytes. These CAD tools usually don’t need to run faster. They run overnight and results are ready when the engineers come in the next day. These are impressive machines: 384G or 768G of DRAM and 32 threads.  How do you improve work/$ in that situation? What did Bruce do?

He put LSI Nytro WarpDrives in the servers and pointed /tmp at the WarpDrives. Yep. Pretty complex. I don’t think he even had to install new drivers. The drivers are already in the latest OS distributions. Anyway – like I said – complex.

The result? WarpDrive allowed the machines to fully use the CPU and memory with no I/O contention. With WarpDrive, the PrimeTime jobs for static timing closure of a typical design could be done on 15 vs. 40 machines. That’s each Nytro node doing 260% of the work vs. a normal node and license. Remember – those are expensive machines (have you priced 768G of DRAM and do you know how much specialized electronic design CAD licenses are?) So the point wasn’t to execute faster. That’s not necessary. The point is to use fewer servers to do the work. In this case we could do 11 runs per server per night instead of just 4. A single chip design needs more than 150 runs in one night.

To be clear, the Nytro WarpDrives are a lot less expensive than the servers they displace. And the savings go beyond that – less power and cooling. Lower maintenance. Less admin time and overhead. Fewer Licenses.  That’s definitely improved work/$ for years to come. Those Nytro cards are part of our standard flow, and they should probably be part of every chip company’s design flow.

So you can improve work/$ no matter the application, no matter your storage model, and no matter how risk-averse you are.

Optimizing the work per dollar spent is a high – maybe the highest – priority in datacenters around the world. And just to be clear – Google agrees with me. There aren’t many ways to accomplish that improvement, and almost no ways to dramatically improve it. I’d argue that integrating flash into the storage system is the best – sometimes most profound – improvement in the cost of getting work done. Not so much the performance, but the actual work done for the money spent. And it ripples through the datacenter, from original CapEx, to licenses, maintenance, admin overhead, power and cooling, and floor space for years. That’s a pretty good deal. You should look into it.

For those of you who are interested, I already wrote about flash in these posts:
What are the driving forces behind going diskless?
LSI is green – no foolin’


Tags: , , , , , , , , , , , , , , , , , ,
Views: (1262)

I am sitting in the terminal waiting for my flight home from – yes, you guessed it – China. I am definitely racking up frequent flier miles this year.

This trip ended up centering on resource pooling in the datacenter. Sure, you might hear a lot about disaggregation, but the consensus seems to be: that’s the wrong name (unless you happen to make standalone servers). For anyone else, it’s about a much more flexible infrastructure, simplified platforms, better lifecycle management, and higher efficiency. I call it “resource pooling,” which is descriptive, but others simply call it rack scale architecture.

It’s been a long week, but very interesting. I was asked to keynote at the SACC conference (Systems Architect Conference China) in Beijing. It was also a great chance to meet 1-on-1 with the CTOs and chief architects from the big datacenters, and visit for a few hours with other acquaintances. I even had the chance to have dinner with the CEO /CIO China Magazine editor in chief, and CIO’s from around Beijing. As always in life, if you’re willing to listen, you can learn a lot. And I did.
Thinking on disaggregation aligns
With CTOs, there was a lot of discussion about disaggregation in the datacenter. There is a lot of aligned thinking on the topic, and it’s one of those occasions where you had to laugh because I think anyone of the CTOs keynoting could have given anyone else’s presentation. So what’s the big deal? Resource pooling and rack scale architecture.

I’ll use this trip as an excuse to dig a little deeper into my view on what this means.

First – you need to understand where these large datacenters are in their evolution. They usually have 4 to 6 platforms and2 or 3 generations of each in the datacenter. That can be 18 different platforms to manage, maintain, and tune. Worse – they have to plan 6 to 9 months in advance to deploy equipment. If you guess wrong, you’ve got a bunch of useless equipment, and you spent a bunch of money – the size of mistake that will get you fired… And even if you get it right, you’re left with the problem – Do I upgrade servers when the CPU is new? Or at, say, 18 months? Or do I wait until the biggest cost item – the drives – need to be replaced in 4 or 5 years? That’s difficult math. So resource pooling is about lifecycle management of different types of components and sub-systems. You can optimally replace each resource on its own schedule.

Increasing resource utilization and efficiency
But it’s also about resource utilization and efficiency. Datacenters have multiple platforms because each platform needs a different configuration of resources. I use the term configuration on purpose. If you have storage in your server, it’s in some standard configuration – say, 6 3 TByte drives or 18 raw TBytes. Do you use all that capacity? Or do you leave some space so databases can grow? Of course you leave empty space. You might not even have any use for that much storage in that particular server – maybe you just use half the capacity. After all, it’s a standard configuration. What about disk bandwidth? Can your Hadoop node saturate 6 drives? Probably. It could probably use 12 or maybe even 24. But sorry – it’s a standard configuration. What about latency-sensitive databases? Sure, I can plug a PCIe card in, but I only have 1.6 TByte PCIe cards as my standard configuration. My database is 1.8 TBytes and growing. Sorry – you have to refactor and put on 2 servers. Or my database is only 1 TByte. I’m wasting 600 GBytes of really expensive resource.

Meeting with Shiding Lin, Chief Architect at Baidu.

For network resources – the standard configuration gets maybe exactly 1 10GE port. You need more? Can’t have it. You don’t need that much? Sorry – wasted bandwidth capacity. What about standard memory? You either waste DRAM you don’t use, or you starve for more DRAM you can’t get.

But if I have pools of rack scale resources that I can allocate to a standard compute platform – well – that’s a different story. I can configure exactly the amount of network bandwidth, memory, flash high- performance storage, and disk bulk storage. I can even add more configured storage if a database grows, instead of being forced to refactor a database into shards across multiple standard configurations.

Pooling resources = simplified operations
So the desire to pool resources is really as much about simplified operations as anything else. I can have standardized modules that are all “the same” to manage, but can be resource configured into a well-tailored platform that can even change over time.

But pooling is also about accommodating how the application architectures have changed, and how much more important dataflow is than compute for so much of the datacenter. As a result there is a lot of uncertainty about how parts of these rack scale architectures and interconnect will evolve, even as there is a lot of certainty that they will evolve, and they will include pooled resource “modules.” Whatever the overall case, we’re pretty sure we understand how the storage will evolve. And at a high level, that’s what I presented in my keynote. (Hey – I’m not going to publicly share all our magic!)

One storage architecture of pooled resources at the rack scale level. One storage architecture that combines boot management, flash storage for performance, and disk storage for efficient bandwidth and capacity. And those resources can be allocated however and whenever the datacenter manager needs them. And the existing software model doesn’t need to change. Existing apps, OS’s, file systems, and drivers are all supported, meaning a change to pooled resource rack scale deployments is de-risked dramatically. Overall, this one architecture simplifies the number of platforms, simplifies the management of platforms, utilizes the resources very efficiently, and simplifies image and boot management.  I’m pretty sure it even reduces datacenter-level CapEx. I know it dramatically reduces OpEx.

Keynote slide: Single Pooled Storage Architecture Summary. Capture, Hold, Analyze refer to a framework for the 3 stages of Big Data usage as a theme throughout the presentation.

Yea – I know what you’re thinking – it’s awesome ! (That’s what you thought – right?)

Oh – what about those CIO meetings? Well, there is tremendous pressure to not buy American IT equipment in China because of all the news from the Snowden NSA leaks. As most of the CIO’s pointed out, though, in today’s global sourcing market, it’s pretty hard to not buy US IT equipment. So they’re feeling a bit trapped. In a no-risk profession, I suspect that means they just won’t buy anything for a year or so and hope it blows over.

But in general, yep, I think this trip was centered on resource pooling in the datacenter. Sure, you might hear about disaggregation, but there’s a lot of agreement that’s the wrong name. It’s much more about resource pooling for flexible infrastructure, simplified platforms, better lifecycle management, and higher efficiency. And we aim to be right in the middle. Literally.


Tags: , , , , , , , ,
Views: (1493)

I was lucky enough to get together for dinner and beer with old friends a few weeks ago. Between the 4 of us, we’ve been involved in or responsible for a lot of stuff you use every day, or at least know about.

Supercomputers, minicomputers, PCs, Macs, Newton, smart phones, game consoles, automotive engine controllers and safety systems, secure passport chips, DRAM interfaces, netbooks, and a bunch of processor architectures: Alpha, PowerPC, Sparc, MIPS, StrongARM/XScale, x86 64-bit, and a bunch of other ones you haven’t heard of (um – most of those are mine, like TriCore). Basically if you drive a European car, travel internationally, use the Internet , if you play video games, or use a smart phone, well…  you’re welcome.

Why do I tell you this? Well – first I’m name dropping – I’m always stunned I can call these guys friends and be their peers. But more importantly, we’ve all been in this industry as architects for about 30 years. Of course our talk went to what’s going on today. And we all agree that we’ve never seen more changes – inflexions – than the raft unfolding right now. Maybe its pressure from the recession, or maybe un-naturally pent up need for change in the ecosystem, but change there is.

Changes in who drives innovation, what’s needed, the companies on top and on bottom at every point in the food chain, who competes with whom, how workloads have changed from compute to dataflow, software has moved to opensource, how abstracted code is now from processor architecture, how individual and enterprise customers have been revolting against the “old” ways, old vendors, old business models, and what the architectures look like, how processors communicate, and how systems are purchased, and what fundamental system architectures look like. But not much besides that…

Ok – so if you’re an architect, that’s as exciting as it gets (you hear it in my voice – right ?), and it makes for a lot of opportunities to innovate and create new or changed businesses. Because innovation is so often at the intersection of changing ways of doing things. We’re at a point where the changes are definitely not done yet. We’re just at the start. (OK – now try to imagine a really animated 4-way conversation over beers at the Britannia Arms in Cupertino… Yea – exciting.)

I’m going to focus on just one sliver of the market – but it’s important to me – and that’s enterprise IT.  I think the changes are as much about business models as technology.

Hyperscale datacenters drive innovation
I’ll start in a strange place. Hyperscale datacenters (think social media, search, etc.) and the scale of deployment changes the optimization point. Most of us starting to get comfortable with rack as the new purchase quantum. And some of us are comfortable with the pod or container as the new purchase quantum. But the hyperscale dataenters work more at the datacenter as the quantum. By looking at it that way, they can trade off the cost of power, real estate, bent sheet metal, network bandwidth, disk drives, flash, processor type and quantity, memory amount, where work gets done, and what applications are optimized for. In other words, we shifted from looking at local optima to looking for global optima. I don’t know about you, but when I took operations research in university, I learned there was an unbelievable difference between the two – and global optima was the one you wanted…

Hyperscale datacenters buy enough (top 6 are probably more than 10% of the market today) that 1) they need to determine what they deploy very carefully on their own, and 2) vendors work hard to give them what they need.

That means innovation used to be driven by OEMs, but now it’s driven by hyperscale datacenters and it’s driven hard. That global optimum? It’s work/$ spent. That’s global work, and global spend. It’s OK to spend more, even way more on one thing if over-all you get more done for the $’s you spend.

That’s why the 3 biggest consumers of flash in servers are Facebook, Google, and Apple, with some of the others not far behind. You want stuff, they want to provide it, and flash makes it happen efficiently. So efficiently they can often give that service away for free.

Hyperscale datacenters have started to publish their cost metrics, and open up their architectures (like OpenCompute), and open up their software (like Hadoop and derivatives). More to the point, services like Amazon have put a very clear $ value on services. And it’s shockingly low.

Enterprises are paying attention
Enterprises have looked at those numbers. Hard. That’s catalyzed a customer revolt against the old way of doing things – the old way of buy and billing. OEMs and ISVs are creating lots of value for enterprise, but not that much. They’ve been innovating around “stickiness” and “lock-in” (yea – those really are industry terms) for too long, while hyperscale datacenters have been focused on getting stuff done efficiently. The money they save per unit just means they can deploy more units and provide better services.

That revolt is manifesting itself in 2 ways. The first is seen in the quarterly reports of OEMs and ISVs. Rumors of IBM selling its X-series to Lenovo, Dell going private, Oracle trying to shift business, HP talking of the “new style of IT”… The second is enterprises are looking to emulate hyperscale datacenters as much as possible, and deploy private cloud infrastructure. And often as not, those will be running some of the same open source applications and file systems as the big hyperscale datacenters use.

Where are the hyperscale datacenters leading them? It’s a big list of changes, and they’re all over the place.

  • Simple, “vanity free” servers. Everything you need, nothing you don’t
  • Efficient racks/pods, minimized metal, shipping weight, airflow impediment
  • Simplified management , homogeneous across vendors
  • DAS systems with distributed file systems like HDFS, etc.
  • Flash acceleration for databases sensitive to latency
  • New hardware/software functions like memcached,  key-value stores…
  • Autonomous, self-managed, self-deployed clusters at scale
  • Disagregated servers – also called pooled resources
  • Alternate processor architectures (besides x86)
  • The promise of “far” main memory in massive chucks of next generation non-volatile memory like PCM, STT, ReRam, and possibly flash

But they’re also looking at a few different things. For example, global name space NAS file systems. Personally? I think this one’s a mistake. I like the idea of file systems/object stores, but the network interconnect seems like a bottleneck. Storage traffic is shared with network traffic, creates some network spine bottlenecks, creates consistency performance bottlenecks between the NAS heads, and – let’s face it – people usually skimp on the number of 10GE ports on the server and in the top of rack switch. A typical SAS storage card now has 8 x 12G ports – that’s 96G of bandwidth. Will servers have 10 x 10G ports? Yea. I didn’t think so either.

Anyway – all this is not academic. One Wall Street bank shared with me that – hold your breath – it could save 70% of its spend going this route. It was shocked. I wasn’t shocked, because at first blush this seems absurd – not possible. That’s how I reacted. I laughed. But… The systems are simpler and less costly to make. There is simply less there to make or ship than OEMs force into the machines for uniqueness and “value.” They are purchased from much lower margin manufacturers. They have massively reduced maintenance costs (there’s less to service, and, well, no OEM service contracts). And also important – some of the incredibly expensive software licenses are flipped to open source equivalents. Net savings of 70%. Easy. Stop laughing.

Disaggregation: Or in other words, Pooled Resources
But probably the most important trend from all of this is what server manufacturers are calling “disaggregation” (hey – you’re ripping apart my server!) but architects are more descriptively calling pooled resources.

First – the intent of disaggregation is not to rip the parts of a server to pieces to get lowest pricing on the components. No. If you’re buying by the rack anyway – why not package so you can put like with like. Each part has its own life cycle after all. CPUs are 18 months. DRAM is several years. Flash might be 3 years. Disks can be 5 to 7 years. Networks are 5 to 10 years. Power supplies are… forever? Why not replace each on its own natural failure/upgrade cycle? Why not make enclosures appropriate to the technology they hold? Disk drives need solid vibration-free mechanical enclosures of heavy metal. Processors need strong cooling. Flash wants to run hot. DRAM cool.

Second – pooling allows really efficient use of resources. Systems need slush resources. What happens to a systems that uses 100% of physical memory? It slows down a lot. If a database runs out of storage? It blue screens. If you don’t have enough network bandwidth? The result is, every server is over provisioned for its task. Extra DRAM, extra network bandwidth, extra flash, extra disk drive spindles.. If you have 1,000 nodes you can easily strand TBytes of DRAM, TBytes of flash, a TByte/s of network bandwidth of wasted capacity, and all that always burning power. Worse, if you plan wrong and deploy servers with too little disk or flash or DRAM, there’s not much you can do about it. Now think 10,000 or 100,000 nodes… Ouch.

If you pool those things across 30 to 100 servers, you can allocate as needed to individual servers. Just as importantly, you can configure systems logically, not physically. That means you don’t have to be perfect in planning ahead what configurations and how many of each you’ll need. You have sub-assemblies you slap into a rack, and hook up by configuration scripts, and get efficient resource allocation that can change over time. You need a lot of storage? A little? Higher performance flash? Extra network bandwidth? Just configure them.

That’s a big deal.

And of course, this sets the stage for immense pooled main memory – once the next generation non-volatile memories are ready – probably starting around 2015.

You can’t underestimate the operational problems associated with different platforms at scale. Many hyperscale datacenters today have around 6 platforms. If you think they are rolling out new versions of those before old ones are retired they often have 3 generations of each. That’s 18 distinct platforms, with multiple software revisions of each. That starts to get crazy when you may have 200,000 to 400,000 servers to manage and maintain in a lights out environment. Pooling resources and allocating them in the field goes a huge way to simplifying operations.

Alternate Processor Architecture
It didn’t always used to be Intel x86. There was a time when Intel was an upstart in the server business. It was Power, MIPs, Alpha, SPARC… (and before that IBM mainframes and minis, etc). Each of the changes was brought on by changing the cost structure. Mainframes got displaced by multi-processor RISC, which gave way to x86.

Today, we have Oracle saying they’re getting out of x86 commodity servers and doubling down on SPARC. IBM is selling off its x86 business and doubling down on Power (hey – don’t confuse that with PowerPC – which started as an architectural cut-down of Power – I was there…). And of course there is a rash of 64-bit ARM server SOCs coming – with HP and Dell already dabbling in it. What’s important to realize is that all of these offerings are focusing on the platform architecture, and how applications really perform in total, not just the processor.

Another view
Let me warp up with an email thread cut/paste from a smart friend – Wayne Nation. I think he summed up some of what’s going on well, in a sobering way most people don’t even consider.

“Does this remind you of a time, long ago, when the market was exploding with companies that started to make servers out of those cheap little desktop x86 CPUs? What is different this time? Cost reduction and disaggregation? No, cost and disagg are important still, but not new.

  • So what IS new?  I think it’s the massive hyperscale datacenter purchasing and intelligence of the customers at the hyperscale datacenter.
  • Before, it was the intelligence of a few server suppliers that drove innovation. Now, it is the intelligence and purchasing of a few massive hyperscale datacenter *buyers* that drive innovation.

A new CPU architecture? No, x86 was “new” before. ARM promises to reduce cost, as did Intel.

  • So what IS new? The promise of competitive multi-sourcing of the silicon – What’s so great about that?
  • ARM and licensees need to be as consistent and regular as Intel was in promising and delivering *silicon*.
  • When some of the half-dozen ARM SOC vendors fail to deliver, the others get stronger. Isn’t that how we arrived with Intel?
  • The ISA does not matter as much, and Intel (if smart) can still use that as a strength as much as ARM can do it.
  • This is Intel’s silicon to lose (and they may still lose). But Intel will need to take some pain, just like AS/400, S/360, DEC went through.

Disaggregation enables hyperscale datacenters to leverage vanity-free, but consistent delivery will determine the winning supplier. There is the potential for another Intel to rise from these other companies. “


Tags: , , , , , , , , , , , , , ,
Views: (7453)

I’m reminded that when I do what I do best and don’t try to be all things to all people, I get much more accomplished.  Interestingly, I’ve found that the same approach applies to server storage system controllers – and to the home PC I use for photo editing.

The question many of us face is whether it’s best to use an integrated or discrete solution. Think digital television. Do you want a TV with an integrated DVD player, or do you prefer a feature-rich, dedicated player that you can upgrade and replace independent of the TV? I’ve pondered a similar question many times when considering my PC: Do I use a motherboard with an integrated graphics controller or go with a discrete graphics adapter card.

If I look only at initial costs and am satisfied with the performance of my display for day-to-day computing activities, I could go with the integrated controller, something that many consumers do. But my needs aren’t that simple. I need multiple displays, higher screen resolution, higher display system performance, and the ability to upgrade and tune the graphics to my applications. To do these things, I go with a separate discrete graphics controller card.

Hardware RAID delivers enterprise-class data protection and features
In the datacenter, IT architects often face the choice between hardware RAID, a discrete solution, and software RAID, hardware RAID’s integrated counterpart. Hardware RAID offers enterprise-class robustness and features, such as higher performance without operating systems (OS) and application interference, particularly in compute-intensive RAID 5 and RAID 6 application environments.  Also, hardware-based RAID can help optimize the performance and scalability of the SAS protocol. Sure, the build of materials (BOM) costs with hardware RAID are higher when a RAID on Chip or IOC component enters the mix, but these purpose-built solutions are designed to deliver performance and flexibility unmatched by most software RAID solutions.

Enterprise-hardened RAID solutions that protect data, manage and deliver high availability can scale up and down because they are based on RAID-on-chip (ROC) solutions, and they are designed to provide a consistent experience and boot across OS’s and BIOS.

One of the biggest differences between hardware and software RAID is in data protection. For example, if the OS shuts down in the middle of a write, once it is back up the OS can’t recognize whether the write was compromised or failed because the RAID cache was from host memory.  A hardware RAID solution holds the write data in separate, non-volatile cache and completes the write when the system comes back online.  Even more subtly, the CPU and storage cache are offloaded from the host memory, freeing up resources for application performance.

Software RAID cost rises as features added
For software RAID to deliver write cache and advanced features, a non-volatile write cache via battery or flash backup schemes needs to be added, and suddenly the BOM costs are similar or higher than the more flexible hardware RAID solution.

In the end, LSI enterprise hardware RAID solutions bring many features and capabilities that simply cannot exist in a software RAID on-load environment.  To be sure, an enterprise server is no PC or TV, but the choice between a discrete and integrated solution, whether in consumer electronics or storage server technology, is of a kindred sort. I always feel gratified when we can help one of our customers make the best choice.

For more information about our enterprise RAID solutions please visit us at

Tags: , , , ,
Views: (7005)

When I am out on the road in Europe, visiting customers and partners, one common theme that comes up on a daily basis is that high-availability systems are essential to nearly all businesses regardless of size or industry. Sadly, all too often we see what can happen when systems running business-critical applications such as transaction processing, Web servers or electronic commerce are not accessible – potentially lost revenue and lost productivity, leading to dramatically downward-spiralling customer satisfaction.

To reduce this risk, the industry focus has been on achieving the best level of high availability, and for the enterprise market segment this has often meant installing and running storage area network (SAN) solutions. SANs can offer users a complete package – scalability, performance, centralised management and the all-important uptime or high availability.

Drawbacks of SAN
But for all its positives, the SAN also has its downsides. To ensure continuous application availability, server clustering and shared-node connections that build redundancy into a cluster and eliminate single points of failure are crucial. The solution is not only extremely complex, it can have a hefty price tag, amounting to tens of thousands of dollars, and can be hard for many smaller to medium-sized businesses to afford.

When considering budgets and  storage needs, many businesses have shied away from investing in a SAN and opted for a far simpler direct attached storage (DAS) solution – mainly because it can be  far easier to implement and considerably cheaper. Historically, however, the biggest problem with this was that DAS could not offer high availability, and recovery from a server or storage failure could take several hours or even days.

Combining the simplicity of DAS with the high availability of SAN storage
As businesses work to reduce storage costs, simplify deployment, and increase agility and uptime in the face of massive data growth, storage architects are often looking for a way to combine the best of both worlds: the simplicity of DAS storage and the high availability of SAN storage. The goal for many is to create a system that is not only cheaper than a regular SAN but also offers full redundancy, less management complexity and guarantees uptime for the business in case a server goes down.

LSI has pioneered an HA-DAS solution, Syncro™ CS, that costs approximately 30% less than traditional HA entry-level SAN solutions, depending on the solution/configuration. It reduces complexity by providing fully redundant, shared-node storage and application failover, without requiring storage networking hardware. Syncro CS solutions are also designed to reduce latency compared to SAN-based solutions, helping to accelerate storage I/O performance and speed applications.

The good news for businesses that rely on DAS is that they have an option, Syncro CS, to now more easily upgrade their DAS infrastructure to help achieve high availability, with easier management and lower cost. The result is a much simpler failover solution that  provides more affordable business continuity and reduces downtime.


Tags: , , , , , , ,
Views: (11464)

I want to warn you, there is some thick background information here first. But don’t worry. I’ll get to the meat of the topic and that’s this: Ultimately, I think that PCIe® cards will evolve to more external, rack-level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but other leaders in flash are going down this path too…

I’ve been working on enterprise flash storage since 2007 – mulling over how to make it work. Endurance, capacity, cost, performance have all been concerns that have been grappled with. Of course the flash is changing too as the nodes change: 60nm, 50nm, 35nm, 24nm, 20nm… and single level cell (SLC) to multi level cell (MLC) to triple level cell (TLC) and all the variants of these “trimmed” for specific use cases. The spec “endurance” has gone from 1 million program/erase cycles (PE) to 3,000, and in some cases 500.

It’s worth pointing out that almost all the “magic” that has been developed around flash was already scoped out in 2007. It just takes a while for a whole new industry to mature. Individual die capacity increased, meaning fewer die are needed for a solution – and that means less parallel bandwidth for data transfer… And the “requirement” for state-of-the-art single operation write latency has fallen well below the write latency of the flash itself. (What the ?? Yea – talk about that later in some other blog. But flash is ~1500uS write latency, where state of the art flash cards are ~50uS.) When I describe the state of technology it sounds pretty pessimistic.  I’m not. We’ve overcome a lot.

We built our first PCIe card solution at LSI in 2009. It wasn’t perfect, but it was better than anything else out there in many ways. We’ve learned a lot in the years since – both from making them, and from dealing with customer and users – about our own solutions and our competitors.  We’re lucky to be an important player in storage, so in general the big OEMs, large enterprises and the hyperscale datacenters all want to talk with us – not just about what we have or can sell, but what we could have and what we could do. They’re generous enough to share what works and what doesn’t. What the values of solutions are and what the pitfalls are too. Honestly? It’s the hyperscale datacenters in the lead both practically and in vision.

If you haven’t  nodded off to sleep yet, that’s a long-winded way of saying – things have changed fast, and, boy, we’ve learned a lot in just a few years.

Most important thing we’ve learned…
Most importantly, we’ve learned it’s latency that matters. No one is pushing the IOPs limits of flash, and no one is pushing the bandwidth limits of flash. But they sure are pushing the latency limits.

PCIe cards are great, but…
We’ve gotten lots of feedback, and one of the biggest things we’ve learned is – PCIe flash cards are awesome. They radically change performance profiles of most applications, especially databases allowing servers to run efficiently and actual work done by that server to multiply 4x to 10x (and in a few extreme cases 100x). So the feedback we get from large users is “PCIe cards are fantastic. We’re so thankful they came along. But…” There’s always a “but,” right??

It tends to be a pretty long list of frustrations, and they differ depending on the type of datacenter using them. We’re not the only ones hearing it. To be clear, none of these are stopping people from deploying PCIe flash… the attraction is just too compelling. But the problems are real, and they have real implications, and the market is asking for real solutions.

  • Stranded capacity & IOPs
    • Some “leftover” space is always needed in a PCIe card. Databases don’t do well when they run out of storage! But you still pay for that unused capacity.
    • All the IOPs and bandwidth are rarely used – sure latency is met, but there is capability left on the table.
    • Not enough capacity on a card – It’s hard to figure out how much flash a server/application will need. But there is no flexibility. If my working set goes one byte over the card capacity, well, that’s a problem.
  • Stranded data on server fail
    • If a server fails – all that valuable hot data is unavailable. Worse – it all needs to be re-constructed when the server does come online because it will be stale. It takes quite a while to rebuild 2TBytes of interesting data. Hours to days.
  • PCIe flash storage is a separate storage domain vs. disks and boot.
    • Have to explicitly manage LUNs, move data to make it hot.
    • Often have to manage via different API’s and management portals.
    • Applications may even have to be re-written to use different APIs, depending on the vendor.
  • Depending on the vendor, performance doesn’t scale.
    • One card gives awesome performance improvement. Two cards don’t  give quite the same improvement.
    • Three or four cards don’t give any improvement at all. Performance maxed out somewhere below 2 cards.  It turns out drivers and server onloaded code create resource bottlenecks, but this is more a competitor’s problem than ours.
  • Depending on the vendor, performance sags over time.
    • More and more computation (latency)  is needed in the server as flash wears and needs more error correction.
    • This is more a competitor’s problem than ours.
  • It’s hard to get cards in servers.
    • A PCIe card is a card – right? Not really. Getting a high capacity card in a half height, half length PCIe form factor is tough, but doable. However, running that card has problems.
    • It may need more than 25W of power  to run at full performance – the slot may or may not provide it. Flash burns power proportionately to activity, and writes/erases are especially intense on power. It’s really hard to remove more than 25W air cooling in a slot.
    • The air is preheated, or the slot doesn’t get good airflow. It ends up being a server by server/slot by slot qualification process. (yes, slot by slot…) As trivial as this sounds, it’s actually one of the biggest problems

Of course, everyone wants these fixed without affecting single operation latency, or increasing cost, etc. That’s what we’re here for though – right? Solve the impossible?

A quick summary is in order. It’s not looking good. For a given solution, flash is getting less reliable, there is less bandwidth available at capacity because there are fewer die, we’re driving latency way below the actual write latency of flash, and we’re not satisfied with the best solutions we have for all the reasons above.

The implications
If you think these through enough, you start to consider one basic path. It also turns out we’re not the only ones realizing this. Where will PCIe flash solutions evolve over the next 2, 3, 4 years? The basic goals are:

  • Unified storage infrastructure for boot, flash, and HDDs
  • Pooling of storage so that resources can be allocated/shared
  • Low latency, high performance as if those resources were DAS attached, or PCIe card flash
  • Bonus points for file store with a global name space

One easy answer would be – that’s a flash SAN or NAS. But that’s not the answer. Not many customers want a flash SAN or NAS – not for their new infrastructure, but more importantly, all the data is at the wrong end of the straw. The poor server is left sucking hard. Remember – this is flash, and people use flash for latency. Today these SAN type of flash devices have 4x-10x worse latency than PCIe cards. Ouch. You have to suck the data through a relatively low bandwidth interconnect, after passing through both the storage and network stacks. And there is interaction between the I/O threads of various servers and applications – you have to wait in line for that resource. It’s true there is a lot of startup energy in this space.  It seems to make sense if you’re a startup, because SAN/NAS is what people use today, and there’s lots of money spent in that market today. However, it’s not what the market is asking for.

Another easy answer is NVMe SSDs. Right? Everyone wants them – right? Well, OEMs at least. Front bay PCIe SSDs (HDD form factor or NVMe – lots of names) that crowd out your disk drive bays. But they don’t fix the problems. The extra mechanicals and form factor are more expensive, and just make replacing the cards every 5 years a few minutes faster. Wow. With NVME SSDs, you can fit fewer HDDs – not good. They also provide uniformly bad cooling, and hard limit power to 9W or 25W per device. But to protect the storage in these devices, you need to have enough of them that you can RAID or otherwise protect. Once you have enough of those for protection, they give you awesome capacity, IOPs and bandwidth, too much in fact, but that’s not what applications need – they need low latency for the working set of data.

What do I think the PCIe replacement solutions in the near future will look like? You need to pool the flash across servers (to optimize bandwidth and resource usage, and allocate appropriate capacity). You need to protect against failures/errors and limit the span of failure,  commit writes at very low latency (lower than native flash) and maintain low latency, bottleneck-free physical links to each server… To me that implies:

    • Small enclosure per rack handling ~32 or more servers
    • Enclosure manages temperature and cooling optimally for performance/endurance
    • Remote configuration/management of the resources allocated to each server
    • Ability to re-assign resources from one server to another in the event of server/VM blue-screen
    • Low-latency/high-bandwidth physical cable or backplane from each server to the enclosure
    • Replaceable inexpensive flash modules in case of failure
    • Protection across all modules (erasure coding) to allow continuous operation at very high bandwidth
    • NV memory to commit writes with extremely low latency
    • Ultimately – integrated with the whole storage architecture at the rack, the same APIs, drivers, etc.

That means the performance looks exactly as if each server had multiple PCIe cards. But the capacity and bandwidth resources are shared, and systems can remain resilient. So ultimately, I think that PCIe cards will evolve to more external, rack level, pooled flash solutions, without sacrificing all their great attributes today. This is just my opinion, but as I say – other leaders in flash are going down this path too…

What’s your opinion?

Tags: , , , , , , , , , , , , , , , ,
Views: (14653)

I remember in the mid-1990s the question of how many minutes away from a diversion airport a two-engine passenger jet should be allowed to fly in the event of an engine failure. Staying in the air long enough is one of those high-availability functions that really matters. In the case of the Boeing 777, it was the first aircraft to enter service with a 180-minute extended operations certification (ETOPS)1. This meant that longer over-water and remote terrain routes were immediately possible.

The question was “can a two-engine passenger aircraft be as safe as a four engine aircraft for long haul flights?” The short answer is yes. Reducing the points of failure from four engines to two, while meeting strict maintenance requirements and maintaining redundant systems, reduces the probability of a failure. The 777 and many other aircraft have proven to be safe for these longer flights. Recently, the 777 has received FAA approval for a 330-minute ETOPS rating2, which allows airlines to offer routes that are longer, straighter and more economical.

What does this have to do with a datacenter? It turns out that some hyperscale datacenters house hundreds of thousands of servers, each with its own boot drive. Each of these boot drives is a potential point of failure, which can drive up acquisition and operating costs and the odds of a breakdown. Datacenter managers need to control CapEx, so for the sheer volume of server boot drives they commonly use the lowest cost 2.5-inch notebook SATA hard drives. The problem is that these commodity hard drives tend to fail more often. This is not a huge issue with only a few servers. But in a datacenter with 200,000 servers, LSI has found through internal research that, on average, 40 to 200 drives fail per week! (2.5″ hard drive, ~2.5 to 4-year lifespan, which equates to a conservative 5% failure rate/year).

Traditionally, a hyperscale datacenter has a sea of racks filled with servers. LSI approximates that, in the majority of large datacenters, at least 60% of the servers (Web servers, database servers, etc.) use a boot drive requiring no more than 40GB of storage capacity since it performs only boot-up and journaling or logging. For higher reliability, the key is to consolidate these low-capacity drives, virtually speaking. With our Syncro™ MX-B Rack Boot Appliance, we can consolidate the boot drives for 24 or 48 of these servers into a single mirrored array (using LSI MegaRAID technology), which makes 40GB of virtual disk space available to each server.

Combining all these boot drives with fewer larger drives that are mirrored helps reduce total cost of ownership (TCO) and improves reliability, availability and serviceability. If a rack boot appliance drive fails, an alert is sent to the IT operator. The operator then simply replaces the failed drive, and the appliance automatically copies the disk image from the working drive. The upshot is that operations are simplified, OpEx is reduced, and there is usually no downtime.

Syncro MX-B not only improves reliability by reducing failure points; it also significantly reduces power requirements (up to 40% less in the 24-port version, up to 60% less in the 48-port version) – a good thing for the corporate utility bill and climate change. This, in turn, reduces cooling requirements, and helps make hardware upgrades less costly. With the boot drives disaggregated from the servers, there’s no need to simultaneously upgrade the drives, which typically are still functional during server hardware upgrades.

Syncro MX-B Rack Boot Appliance

In the case of both commercial aircraft and servers, less really can be more (or at least better) in some situations. Eliminating excess can make the whole system simpler and more efficient.

To learn more, please visit the LSI® Shared Storage Solutions web page:


Syncro MX-B Boot Rack Appliance Overview

  1. Federal Aviation Administration Advisory:
  2. Boeing press release:

Tags: , , , , , , ,
Views: (13524)

I’ve spent a lot of time with hyperscale datacenters around the world trying to understand their problems – and I really don’t care what area those problems are as long as they’re important to the datacenter. What is the #1 Real Problem for many hyperscale datacenters? It’s something you’ve probably never heard about, and probably have not even thought about. It’s called false disk failure. Some hyperscale datacenters have crafted their own solutions – but most have not.

Why is this important, you ask? Many large datacenters today have 1 million to 4 million hard disk drives (HDDs) in active operation. In anyone’s book that’s a lot. It’s also a very interesting statistical sample size of HDDs. Hyperscale datacenters get great pricing on HDDs. Probably better than OEMs get, and certainly better than the $79 for buying 1 HDD at your local Fry’s store. So you would imagine if a disk fails – no one cares – they’re cheap and easy to replace. But the burden of a failed disk is much more than the raw cost of the disk:

  • Disk rebuild and/or data replicate of 2TB or 3TB drive
    • Performance overhead of a RAID rebuild makes it difficult to justify, and can take days
    • Disk capacity must be added somewhere to compensate: ~$40-$50
    • Redistribute replicated data across many servers
    • Infrastructure overhead to rebalance workloads to other distributed servers
    • Person to service disk: remove and replace
      • And then ensure the HDD data cannot be accessed – wipe it or shred it

Let’s put some scale to this problem, and you’ll begin to understand the issue.  One modest size hyperscale datacenter has been very generous in sharing its real numbers. (When I say modest, they are ~1/4 to 1/2 the size of many other hyperscale datacenters, but they are still huge – more than 200k servers). Other hyperscale datacenters I have checked with say – yep, that’s about right. And one engineer I know at an HDD manufacturer said – “wow – I expected worse than that. That’s pretty good.” To be clear – these are very good HDDs they are using, it’s just that the numbers add up.

The raw data:


  • 300k SAS HDDs
  • 15-30 SAS failed per day
    • SAS false fail rate is about 30%~45% (10-15 per day)
    • About 1/1000 HDD annual false failure rate

Non-RAIDed (direct map) SATA drives behind HBAs

  • 1.2M SATA HDDs
  • 60-80 SATA failed disks per day
    • SATA false fail rate is about 40~55% (24-40 per day)
    • About 1/100 HDD annual false failure rate

What’s interesting is the relative failure rate of SAS drives vs. SATA. It’s about an order of magnitude worse in SATA drives than SAS. Frankly some of this is due to protocol differences. SAS allows far more error recovery capabilities, and because they also tend to be more expensive, I believe manufacturers invest in slightly higher quality electronics and components. I know the electronics we ship into SAS drives is certainly more sophisticated than SATA drives.

False fail? What? Yea, that’s an interesting topic. It turns out that about 40% of the time with SAS and about 50% of the time with SATA, the drive didn’t actually fail. It just lost its marbles for a while. When they pull the drive out and put it into a test jig, everything is just fine. And more interesting, when they put the drive back into service, it is no more statistically likely to fail again than any other drive in the datacenter. Why? No one knows. I suspect though.

I used to work on engine controllers. That’s a very paranoid business. If something goes wrong and someone crashes, you have a lawsuit on your hands. If a controller needs a recall, that’s millions of units to replace, with a multi-hundred dollar module, and hundreds of dollars in labor for each one replaced. No one is willing to take that risk. So we designed very carefully to handle soft errors in memory and registers. We incorporated ECC like servers use, background code checksums and scrubbing, and all sorts of proprietary techniques, including watchdogs and super-fast self-resets that could get operational again in less than a full revolution of the engine.  Why? – the events were statistically rare. The average controller might see 1 or 2 events in its lifetime, and a turn of the ignition would reset that state.  But the events do happen, and so do recalls and lawsuits… HDD controllers don’t have these protections, which is reasonable. It would be an inappropriate cost burden for their price point.

You remember the Toyota Prius accelerator problems? I know that controller was not protected for soft errors. And the source of the problem remained a “mystery.”  Maybe it just lost its marbles for a while? A false fail if you will. Just sayin’.

Back to HDDs. False fail is especially frustrating, because half the HDDs actually didn’t need to be replaced. All the operational costs were paid for no reason. The disk just needed a power cycle reset. (OK, that introduces all sorts of complex management by the RAID controller or application to manage that 10 second power reset cycle and application traffic created in that time – be we can handle that.)

Daily, this datacenter has to:

  • Physically replace 100 disk drives
    • Individually destroy or recycle the 100 failed drives
    • Replicate or rebuild 200-300 TBytes of data – just think about that
    • Rebalance the application load on at least 100 servers – more likely 100 clusters of servers – maybe 20,000 servers?
    • Handle the network traffic  load of ~200 TBytes of replicated data
      • That’s on the order of 50 hours of 10GBit Ethernet traffic…

And 1/2 of that is for no reason at all.

First – why not rebuild the disk if it’s RAIDed? Usually hyperscale datacenters use clustered applications. A traditional RAID rebuild drives the server performance to ~50%, and for a 2TByte drive, under heavy application load (definition of a hyperscale datacenter) can truly take up to a week.  50% performance for a week? In a cluster that means the overall cluster is running ~50% performance.  Say 200 nodes in a cluster – that means you just lost ~100 nodes of work – or 50% of cluster performance. It’s much simpler to just take the node offline with the failed drive, and get 99.5% cluster performance, and operationally redistribute the workload across multiple nodes (because you have replicated data elsewhere). But after rebuild, the node will have to be re-synced or re-imaged. There are ways to fix all this. We’ll talk about them on another day. Or you can simply run direct mapped storage, and unmounts the failed drive.

Next – Why replicate data over the network, and why is that a big deal? For geographic redundancy (say a natural disaster at one facility) and regional locality, hyperscale datacenters need multiple data copies. Often 3 copies so they can do double duty as high-availability copies, or in the case of some erasure coding, 2.2 to 2.5 copies (yea – weird math – how do you have 0.5 copy…). When you lose one copy, you are down to 2, possibly 1. You need to get back to a reliable number again. Fast. Customers are loyal because of your perfect data retention. So you need to replicate that data and re-distribute it across the datacenter on multiple servers. That’s network traffic, and possibly congestion, which affects other aspects of the operations of the datacenter. In this datacenter it’s about 50 hours of 10G Ethernet traffic every day.

To be fair, there is a new standard in SAS interfaces that will facilitate resetting a disk in-situ. And there is the start of discussion of the same around SATA – but that’s more problematic. Whatever the case, it will be a years before the ecosystem is in place to handle the problems this way.

What’s that mean to you?

Well. You can expect something like 1/100 of your drives to really fail this year. And you can expect another 1/100 of your drives to fail this year, but not actually be failed. You’ll still pay all the operational overhead of not actually having a failed drive – rebuilds, disk replacements, management interventions, scheduled downtime/maintenance time, and the OEM replacement price for that drive – what $600 or so ?… Depending on your size, that’s either a don’t care, or a big deal. There are ways to handle this, and they’re not expensive – much less than the disk carrier you already pay for to allow you to replace that drive – and it can be handled transparently – just a log entry without seeing any performance hiccups.  You just need to convince your OEM to carry the solution.

Tags: , , , , , , , , , , ,
Views: (21616)